Node: Clear the internal require `statCache` on an unsuccessful module load?

Created on 14 Feb 2020  Â·  24Comments  Â·  Source: nodejs/node

Currently, it is impossible to write code like:

let Bluebird;
try { 
  Bluebird = require('bluebird');
} catch (e) { 
  execSync('npm install bluebird');
  Bluebird = require('bluebird');
} 

Because the cjs loader caches the results of stat calls internally so it would not look for bluebird the second time we require it.

It would be pretty easy to deal with this use case from a technical point of view by setting statCache to a new Map() on a MODULE_NOT_FOUND.

I can see several use cases like:

  • Writing repls in userland
  • Writing a "lazy loading" require hook that recovers from 'module not found' errors by npm install`

Would people be in favor of this change and enabling this?

cc @nodejs/modules

cc @BridgeAR because of https://github.com/nodejs/node/pull/26928 that enables this specifically for the REPL.

module

Most helpful comment

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API

-1 That's way too magical

All 24 comments

That approach sounds fine to me. It would hurt the loading percormance of all following calls though. I'll check if it is possible to isolate the failed modules stats to only clear those.

How would this be similar to ESM behavior for a similar pattern?

Perhaps there should be an explicit API to do this instead of always doing it, so that you opt-in to any potential performance issues.

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API (as those are liable to trigger an fs write (or just mark for some kind of lazy re-validation on next attempt?))? (In addition to providing a sort of manual clear opt-in?)

How would this be similar to ESM behavior for a similar pattern?

So for an esm equivalent:

async function main() {
  let Bluebird;
  try { 
    Bluebird = await import('bluebird');
  } catch (e) { 
    execSync('npm install bluebird');
    Bluebird = await import('bluebird');
  } 
}
main();

since each dynamic import invocation is a unique sub-module-graph, the second invocation of import is completely unrelated to the first and uses a separate cache. This isn't really tenable for cjs, and only works for esm because of the upfront graph traversal (which allows everything in the graph to share a cache). This does mean that if the community starts to rely heavily on dynamic import a lot, performance might eventually be bad, since things aren't cached between imports, but... eh?

since each dynamic import invocation is a unique sub-module-graph, the second invocation of import is completely unrelated to the first and uses a separate cache.

Just to be explicit, import() is required by spec to cache if it succeeds (on a per-module basis, and node actually has a stricter behavior because it caches across all modules). If it fails, it is not required to cache the failure, and node doesn't cache the failure.

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API

-1 That's way too magical

I’m -1 of clearing the cache. A lot of modules use a “conditional loading pattern” and this change is going to slow their startup time significantly.

Apart from technical considerations, the example introduces a kind of non-determinism (the install can fail - or not) that shouldn't be encouraged. The time for installing packages is at install time, not run time.

Does require.resolve have the same restriction? If not, it can be used to check if the module is there before trying to require it

@targos

Does require.resolve have the same restriction?

It does, it goes through Module._findPath and uses the stat cache.

@bnoordhuis

The time for installing packages is at install time, not run time.

Package installation is just one use case - the only reason Node works this way at the moment is because we have caching for performance reasons and that is observable. Moreover we have already worked around it in the REPL.

@mcollina

Hmm, if it only invalidated the entries it created for the problematically required module rather than the whole cache on an error would it help?

@mscdex

Perhaps there should be an explicit API to do this instead of always doing it, so that you opt-in to any potential performance issues.

I tend to agree though I believe the current behavior is only because our performance optimization leaks a behavioral change.

Your OP only has a use case for runtime install, or a repl; are there others you have in mind?

If you know that the specifier and package name match, as does in OP, you could use an alternative method (not resolve) of checking if the package exists.

// index.js

const {accessSync} = require('fs');
const {execSync} = require('child_process');
const node_modules = './node_modules/';
const specifier = 'bluebird';

let Bluebird;
try {
  // does package exist?
  accessSync(node_modules + specifier);
  // yes, so require it
  Bluebird = require(specifier);
} catch (e) {
  // no, so install it...
  execSync('npm install ' + specifier);
  // ...and require it
  Bluebird = require(specifier);
}

console.log(Bluebird);

Hmm, if it only invalidated the entries it created for the problematically required module rather than the whole cache on an error would it help?

It would not cause significant performance issues in that case.

This problem also affects the case where the entire node_modules directory is created within node. In this case, there doesn't seem to be any workaround.

If you do something like what OP is doing:

let Bluebird;
try { 
  Bluebird = require('bluebird');
} catch (e) { 
  execSync('npm install bluebird');
  Bluebird = require('bluebird');
} 

but you do it when there is no node_modules directory in your cwd, running execSync('npm install bluebird'); will create the directory and install the module, but the statCache will already have cached that path as non-existent. The second call to require('bluebird'); will check the statCache, see that that path is already in the cache, and won't bother checking the node_modules directory.

What’s the use case for “npm install hasn’t been ran first”? Can you elaborate?

I think in a really general sense, these caches shouldn't be updated for failure cases (obviously easier to say than implement, but I think that's what we should aim for)

That pessimizes a pattern that is fairly common in larger applications:

try {
  var debug = require('debug');  // just an example, of course
} catch (e) {
  var debug = () => {};
}

The bigger the program, the more occurrences of that pattern pop up, and the more it is punished if the results aren't in the stat cache.

Is that pattern often done for the same package a great many times in the graph, and also not at module level/require time?

Yes and yes (I think - I'm not completely sure what you mean by "also not at module level/require time".)

Like, is that done inside function bodies, as opposed to at the top level of modules (where it'd happen once and be cached after that, and thus the perf hit wouldn't be significant)

Ah, like that. I think it only matters at the top level because the stat cache is cleared when control returns to the event loop.

I'm currently looking at a midsize application that has two dozen instances of the above pattern (that I can find) on a total of 1,200 require() calls, so about 2% of the total.

An inexact science benchmark suggests not caching the result makes such require() calls 2.8x slower. That's not as bad as I feared.

There's a caveat though: require() searches upwards, therefore deeper directory structures are penalized more heavily.

In that case, someone needing to do a conditional require in this case can already do it asynchronously, just not synchronously?

What if we only retry failed attempts without the statCache? That way we would only pay the penalty of not going through the cache failed attempts.

I am going to go ahead and close this since it staled, if anyone feels strongly about supporting this - please feel free to reopen!

Was this page helpful?
0 / 5 - 0 ratings