Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NODE_COMPILE_CACHE doesn't cache all loaded files #4484

Open
2 tasks done
znewsham opened this issue Sep 16, 2024 · 8 comments
Open
2 tasks done

NODE_COMPILE_CACHE doesn't cache all loaded files #4484

znewsham opened this issue Sep 16, 2024 · 8 comments

Comments

@znewsham
Copy link

znewsham commented Sep 16, 2024

Node.js Version

v22.8.0

NPM Version

10.8.2

Operating System

Linux xxx 6.5.0-41-generic #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

fs, v8, vm

Description

I'm working on a large application that loads around 15k files at startup and takes 20 seconds - a lot of this time I suspect is just compiling the modules. As such, I've been playing around with the new NODE_COMPILE_CACHE option, and I'm a bit confused - this is primarily a request to understand the caching, or some hints on how to debug it myself.

  • I registered a load plugin to get an exact count of the files loaded - it came out to 15312. The same exact count across multiple runs - exactly as I'd expect
  • A single run of the application generates between 2700 and 3800 cache files
    • Why isn't this constant each time?
    • Why isn't this closer to the 15k loaded files?
    • an admittedly naive approach to figure out which files have been included in the cache shows that a very small % of application files seem to match any of the hashes - I recursively iterate over every single file in the application to find all files, then compute the hash using CRC32 (with the esm seed, etc)
    • on the most recent first run, only 6 files matched - sibling files in the exact same folder did not match
  • A second run of the exact same application code, without clearing the cache directory, generates 5155 total files (e.g., an additional 1300-2400 files)
    • why are there new files at all?
    • If the cache relies on some transient field for the hash key, why aren't there double?
    • on the most recent second run, 54 files matched the cache
      • on a third run, this increased to 61

Each run of the application should have been exactly the same - e.g., there are no dynamic imports, I literally ran npm start multiple times.

Is there a good way of debugging this? Any hooks or logs that can be triggered either during cache insert, hit or miss? There is no obvious pattern to the files that were matched - e.g., there was a smattering of node_modules and individual/pairs of files from folders within the application with very similar siblings, loaded in the exact same way.

Have I materially misunderstood the implementation of this feature? E.g., is the hash key perhaps dependant on load order?

Minimal Reproduction

It would be extremely difficult to provide a minimal repro here - I expect the complexity of the application is part of the problem. I was not able to repro on a trivial application (e.g., a single huge source file) - in this situation the cache contains a single entry and I'm able to match the hash from the absolute path to the cache file.

Output

No response

Before You Submit

  • I have looked for issues that already exist before submitting this
  • My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask
@znewsham
Copy link
Author

cc @joyeecheung as I believe you implemented the cache

@joyeecheung
Copy link
Member

I registered a load plugin to get an exact count of the files loaded - it came out to 15312. The same exact count across multiple runs - exactly as I'd expect

Have you checked how many unique files are loaded? The load hooks are triggered multiple times whenever a module identified by the same URL is loaded (which I believe is a drawback of the current module.register() API, I found this behavior weird personally, some in the loaders team said that this would make hot-module replacement easier but I am skeptical about that).

Why isn't this constant each time?

The cache would be re-generated (invalidated) when the code changes, or the flags used to run Node.js is changed etc. It's hard to tell why without looking at the code. Have you tried running it without using the load hooks? To debug it without using the load hooks, you can try the NODE_DEBUG_NATIVE=COMPILE_CACHE environment variable instead.

@znewsham
Copy link
Author

znewsham commented Sep 16, 2024

Thanks for the fast response!

Have you checked how many unique files are loaded?

I used a set - so as long as the uniqueness isn't based on something other than the module URL, it should be good

Have you tried running it without using the load hooks?

I have not - about a year ago we migrated the code from something heinous into a more standard ESM approach, but to do this we used a load hook to add the import attributes where we import JSON files. I can probably fix these (maybe 100 places in total) - would simply eliminating the load hooks be sufficient? Or would I also need to remove the resolve hooks too?

you can try the NODE_DEBUG_NATIVE=COMPILE_CACHE environment variable

as in, remove COMPILE_CACHE=path and use the literal NODE_DEBUG_NATIVE=COMPILE_CACHE instead?

@joyeecheung
Copy link
Member

joyeecheung commented Sep 16, 2024

I suspect the load hooks might play a role in the strange behaviors you've seen here - the loader hooks are a bit all over the place and loading : compilation is not really 1:1. The compile cache is only hit when the module gets to the compilation phase. If the application somehow maps many different source code to the same URL, it may fare poorly as the compilation cache is basically mapped to source files 1:1, keyed by the file name (converted from URL if it's ESM).

as in, remove COMPILE_CACHE=path and use the literal NODE_DEBUG_NATIVE=COMPILE_CACHE instead?

You put in both - NODE_COMPILE_CACHE=path to enable the cache, NODE_DEBUG_NATIVE=COMPILE_CACHE to enable the logging.

@znewsham
Copy link
Author

That gave me exactly the output I wanted - thanks! The problem was me killing the process made it exit halfway through writing the cache files. Now the startup takes only 11 seconds (roughly what I'd expected), there are 25766 cache files (still no idea why it's not closer to the 15k loaded)

Is there a programatic way to persist the files? Or a specific exit code that will ensure it will wait?

Not super important, but how is the hash calculated? I'm still only matching 359 files (out of either 15k or 25k, depending on how you look at it).

const CRC32 = require("crc-32");
const zero = CRC32.buf(Buffer.from([]))	
const esm = CRC32.buf(Buffer.from([1]),zero)
function getHash(file, seed) {
	const hash = CRC32.buf(Buffer.from(file), seed).toString(16)
	if (hash.length === 7) {
		return `0${hash}`
	}
	return hash;
}
getHash('file:///home/zacknewsham/dev/qualia/services/qualia/app/imports/ai_cd_import/tests/eval.test.js', esm)
> '-4f0dfcf2'

No obvious variations (e.g., absolute) exist in the cache directory - can you point me to how to calculate the value? I'm pretty close because it works...some of the time :D

For reference, The loader is super trivial:

  • it noops .jsx files (the reason for this is historic, but I wouldn't expect to impact the cache meaningfully)
  • it adds import sttributes to the JSON imports

@joyeecheung
Copy link
Member

joyeecheung commented Sep 17, 2024

Is there a programatic way to persist the files? Or a specific exit code that will ensure it will wait?

There is a WIP in nodejs/node#54971

Not super important, but how is the hash calculated?

It's roughly:

const kCommonJS = 0, kESM = 1;
function getCacheName(filename, type) {
  let crc = 0;
  crc = zlib.crc32(Buffer.from([type]), crc);
  crc = zlib.crc32(Buffer.from(filename, 'utf-8'), crc);
  return (crc >>> 0).toString(16) + '.cache';
}

getCacheName('/Users/joyee/projects/node/test/fixtures/compile-cache-flush.js', kCommonJS);
// '6a7f7d45.cache' - matches what I see locally with the logs

Note that the full file name, not the file URL, is part of the hash, and it should be utf-8 encoded.

@znewsham
Copy link
Author

return (crc >>> 0).toString(16) + '.cache';

The >>> 0 got me to 920 matches 🤔 I think I'll just dump the logs to a file - and get the list of files from there!

@joyeecheung
Copy link
Member

joyeecheung commented Sep 17, 2024

Actually I am not sure if the >>> 0 part is needed for zlib.crc32() as the built-in version should always return a uint32 already. I think it's only necessary when you use the crc-32 npm pacakge (I am also not sure how much that differs from zlib.crc32())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants