Node: Clear the internal require `statCache` on an unsuccessful module load?

Created on 14 Feb 2020 · 24Comments · Source: nodejs/node

Currently, it is impossible to write code like:

let Bluebird;
try { 
  Bluebird = require('bluebird');
} catch (e) { 
  execSync('npm install bluebird');
  Bluebird = require('bluebird');
}

Because the cjs loader caches the results of stat calls internally so it would not look for bluebird the second time we require it.

It would be pretty easy to deal with this use case from a technical point of view by setting statCache to a new Map() on a MODULE_NOT_FOUND.

I can see several use cases like:

Writing repls in userland
Writing a "lazy loading" require hook that recovers from 'module not found' errors by npm install`

Would people be in favor of this change and enabling this?

cc @nodejs/modules

cc @BridgeAR because of https://github.com/nodejs/node/pull/26928 that enables this specifically for the REPL.

module

Source

benjamingr

👍1

Most helpful comment

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API

-1 That's way too magical

mscdex on 15 Feb 2020

👍5

All 24 comments

That approach sounds fine to me. It would hurt the loading percormance of all following calls though. I'll check if it is possible to isolate the failed modules stats to only clear those.

BridgeAR on 14 Feb 2020

How would this be similar to ESM behavior for a similar pattern?

ljharb on 14 Feb 2020

Perhaps there should be an explicit API to do this instead of always doing it, so that you opt-in to any potential performance issues.

mscdex on 15 Feb 2020

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API (as those are liable to trigger an fs write (or just mark for some kind of lazy re-validation on next attempt?))? (In addition to providing a sort of manual clear opt-in?)

How would this be similar to ESM behavior for a similar pattern?

So for an esm equivalent:

async function main() {
  let Bluebird;
  try { 
    Bluebird = await import('bluebird');
  } catch (e) { 
    execSync('npm install bluebird');
    Bluebird = await import('bluebird');
  } 
}
main();

since each dynamic import invocation is a unique sub-module-graph, the second invocation of import is completely unrelated to the first and uses a separate cache. This isn't really tenable for cjs, and only works for esm because of the upfront graph traversal (which allows everything in the graph to share a cache). This does mean that if the community starts to rely heavily on dynamic import a lot, performance might eventually be bad, since things aren't cached between imports, but... eh?

weswigham on 15 Feb 2020

since each dynamic import invocation is a unique sub-module-graph, the second invocation of import is completely unrelated to the first and uses a separate cache.

Just to be explicit, import() is required by spec to cache if it succeeds (on a per-module basis, and node actually has a stricter behavior because it caches across all modules). If it fails, it is not required to cache the failure, and node doesn't cache the failure.

devsnek on 15 Feb 2020

Rather than clearing the cache on failure (which means no longer caching failed lookups which makes repeated failed lookups very expensive), why not clear the cache on invocation of a process, fs, or child_process API

-1 That's way too magical

mscdex on 15 Feb 2020

👍5

I’m -1 of clearing the cache. A lot of modules use a “conditional loading pattern” and this change is going to slow their startup time significantly.

mcollina on 15 Feb 2020

Apart from technical considerations, the example introduces a kind of non-determinism (the install can fail - or not) that shouldn't be encouraged. The time for installing packages is at install time, not run time.

bnoordhuis on 15 Feb 2020

Does require.resolve have the same restriction? If not, it can be used to check if the module is there before trying to require it

targos on 15 Feb 2020

👍1

@targos

Does require.resolve have the same restriction?

It does, it goes through Module._findPath and uses the stat cache.

@bnoordhuis

The time for installing packages is at install time, not run time.

Package installation is just one use case - the only reason Node works this way at the moment is because we have caching for performance reasons and that is observable. Moreover we have already worked around it in the REPL.

@mcollina

Hmm, if it only invalidated the entries it created for the problematically required module rather than the whole cache on an error would it help?

@mscdex

Perhaps there should be an explicit API to do this instead of always doing it, so that you opt-in to any potential performance issues.

I tend to agree though I believe the current behavior is only because our performance optimization leaks a behavioral change.

benjamingr on 15 Feb 2020

Your OP only has a use case for runtime install, or a repl; are there others you have in mind?

ljharb on 15 Feb 2020

👍1

If you know that the specifier and package name match, as does in OP, you could use an alternative method (not resolve) of checking if the package exists.

// index.js

const {accessSync} = require('fs');
const {execSync} = require('child_process');
const node_modules = './node_modules/';
const specifier = 'bluebird';

let Bluebird;
try {
  // does package exist?
  accessSync(node_modules + specifier);
  // yes, so require it
  Bluebird = require(specifier);
} catch (e) {
  // no, so install it...
  execSync('npm install ' + specifier);
  // ...and require it
  Bluebird = require(specifier);
}

console.log(Bluebird);

DerekNonGeneric on 15 Feb 2020

Hmm, if it only invalidated the entries it created for the problematically required module rather than the whole cache on an error would it help?

It would not cause significant performance issues in that case.

mcollina on 16 Feb 2020

This problem also affects the case where the entire node_modules directory is created within node. In this case, there doesn't seem to be any workaround.

If you do something like what OP is doing:

let Bluebird;
try { 
  Bluebird = require('bluebird');
} catch (e) { 
  execSync('npm install bluebird');
  Bluebird = require('bluebird');
}

but you do it when there is no node_modules directory in your cwd, running execSync('npm install bluebird'); will create the directory and install the module, but the statCache will already have cached that path as non-existent. The second call to require('bluebird'); will check the statCache, see that that path is already in the cache, and won't bother checking the node_modules directory.

LMojica on 20 Feb 2020

What’s the use case for “npm install hasn’t been ran first”? Can you elaborate?

ljharb on 20 Feb 2020

I think in a really general sense, these caches shouldn't be updated for failure cases (obviously easier to say than implement, but I think that's what we should aim for)

devsnek on 20 Feb 2020

That pessimizes a pattern that is fairly common in larger applications:

try {
  var debug = require('debug');  // just an example, of course
} catch (e) {
  var debug = () => {};
}

The bigger the program, the more occurrences of that pattern pop up, and the more it is punished if the results aren't in the stat cache.

bnoordhuis on 20 Feb 2020

Is that pattern often done for the same package a great many times in the graph, and also not at module level/require time?

ljharb on 20 Feb 2020

Yes and yes (I think - I'm not completely sure what you mean by "also not at module level/require time".)

bnoordhuis on 20 Feb 2020

Like, is that done inside function bodies, as opposed to at the top level of modules (where it'd happen once and be cached after that, and thus the perf hit wouldn't be significant)

ljharb on 20 Feb 2020

Ah, like that. I think it only matters at the top level because the stat cache is cleared when control returns to the event loop.

I'm currently looking at a midsize application that has two dozen instances of the above pattern (that I can find) on a total of 1,200 require() calls, so about 2% of the total.

An inexact science benchmark suggests not caching the result makes such require() calls 2.8x slower. That's not as bad as I feared.

There's a caveat though: require() searches upwards, therefore deeper directory structures are penalized more heavily.

bnoordhuis on 20 Feb 2020

👍1

In that case, someone needing to do a conditional require in this case can already do it asynchronously, just not synchronously?

ljharb on 21 Feb 2020

What if we only retry failed attempts without the statCache? That way we would only pay the penalty of not going through the cache failed attempts.

benjamingr on 22 Feb 2020

I am going to go ahead and close this since it staled, if anyone feels strongly about supporting this - please feel free to reopen!

benjamingr on 30 May 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Readline printing output history while printing with readline.write

Brekmister · 3Comments

querystring should maybe support Buffers

addaleax · 3Comments

this[searchParams].push is not a function

stevenvachon · 3Comments

crypto sha1 returns different value in nodejs 6.0.0

Icemic · 3Comments

Unable to build with VS2017 after building with VS2015

seishun · 3Comments