Modules: Loader Hooks

Created on 17 Jul 2019 · 105 comments · Source: nodejs/modules

Hooking into the dependency loading steps in Node.js should be easy, efficient, and reliable across both CJS and ESM. Loader hooks would allow developers to make systematic changes to dependency loading without breaking other systems.

It looks like discussion on this topic has died down, but I'm really interested in loader hooks and would be excited to work on an implementation! There's a lot of prior discussion to parse through, and with this issue I'm hoping to reignite discussion and to create a place for feedback.

Some of that prior discussion:


edit (mylesborins)

here is a link to the design doc

https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq

All 105 comments

Some use cases I've encountered:

I'm working on a custom dependency bundler and loader designed to improve cold-start times by transparently loading from a bundle to avoid file-system overhead. Currently, I have to monkey-patch module and reimplement CJS resolution with @soldair's node-module-resolution. I have to deeply understand, and often reimplement, CJS and ESM internals to work on this.

I also want to load modules from V8 code-cache similar to v8-compile-cache. Again I have to re-implement Module._compile and manually handle fallback for other extensions.

Some other use cases that would benefit:

  • Transpilers/compilers (babel, ts-node)
  • Bundlers (webpack, browserify, ncc)
  • Other custom loaders (pnp, tink, loading from zip, etc)
  • Code instrumentation (coverage, logging, testing)

I think the current exposed hooks are the right hooks to expose, but we definitely need to work on polishing the API:

  • dynamic modules are a bit rough atm
  • how are hooks registered
  • how do multiple registered hooks interact

Very excited to see interest in this! I believe that @bmeck has a POC that has memory leaks that need to be fixed. @guybedford may know about this too

@devsnek I think there are more things that list is missing. E.g. providing resource content, not just format. Or the question of whether we can extend aspects of this feature to CommonJS (e.g. for the tink/entropic/yarn case that currently requires monkey-patching the CommonJS loader or even the fs module itself). The current hooks were a good starting point, but I would disagree that they are the right hooks.

@jkrems i think cjs loader hooks are outside the realm of our design. cjs can only deal with local files and it uses filenames, not urls.

Providing resource content is an interesting idea though. I wonder if we could just figure out a way to pass vm modules to the loader.

@devsnek we discussed and even implemented a PoC of intercepting CJS in the middle of last year and had a talk on the how/why in both

These would only allow for files for require since that is what CJS works with but it should be tenable. Interaction with require.cache is a bit precarious but solvable if enough agreement can be reached.

@bmeck i don't doubt it can be done, i'm just less convinced it makes sense to include with the esm loader hooks given the large differences in the systems.

@A-lxe thanks for opening this discussion. It was interesting to hear you say that multiple loaders were one of the features you find important here. The PR at https://github.com/nodejs/node/pull/18914 could certainly be revived. Is this something you are thinking of working on? I'd be glad to collaborate on this work if you would like to discuss it further at all.

@guybedford Yeah! At least to me it seems the singular --loader api is currently insufficient for loader use cases. For example, in my projects I test with ts-node, istanbul, mocha, and source-map-support -- each of which hooks into loading in one way or another IIRC. Optimally these could each independently interface with a loader hook api and smoothly compound on each other.

I think a node loader hook api needs to provide mechanisms for compounding on and falling back to already registered hooks (or the default cjs/esm behavior). I'm not really sure yet where to focus work, but I definitely want to collaborate :)

@A-lxe agreed we need a way to chain loaders. Would the approach in nodejs/node#18914 work for you, or if not, how would you want to go about it differently? One way to start might be to get that rebased and working again and then to iterate on it from there.

@guybedford I like the way nodejs/node#18914 chains the loaders and provides parent to allow fallback/augmentation of both the resolve + dynamic instantiation steps. I have some ideals for what a loader hook api should look like (particularly wrt supporting cjs) but I don't think those should get in the way of providing multiple --loader for esm. To be honest working on reviving that PR would be really useful for me in getting up to speed with things, so I would be happy to get started on that.

Some gripes which are more relevant to the initial --loader implementation rather than the multiple --loader feature:

  • Why is there no runtime api for registering loaders? The current mechanisms using --require to register loaders via preloaded modules feel nice to me, and allow for opting to manually register at a particular point and with dynamic parameters.
  • A loader has to implement both hooks (and add fallback overhead) even if it only affects one.
  • Similarly, I feel like the hooks could be more granular. @bmeck's Resource APIs for Node splits things into locate, retrieve, and translate hooks. With an additional initialize hook for actually creating the module, these match the nodejs/node#18914 functionality, with an added bonus that only initialize needs to have coupling with cjs/esm. I'm curious what you think on this.
  • Doesn't hook into cjs require :'(

Also, your last comment on nodejs/node#18914 hints at another loaders implementation by @bmeck. Does this exist in an actionable state?

@BridgeAR this work also exists as part of the new loaders work which @bmeck started, so that effectively takes over from this PR already. Closing sounds sensible to me.

Why is there no runtime api for registering loaders?

Loaders are a higher-level feature of the environment, kind of like a boot system feature. They sit at the root of the security model for the application, so there are some security concerns here. In addition to that, hooking loaders during runtime can lead to unpredictable results, since any already-loaded modules will not get loaders applied. I'm sure @bmeck can clarify on these points, but those are the two I remember on this discussion offhand.

A loader has to implement both hooks (and add fallback overhead) even if it only affects one.

There is nothing to say we won't have CJS loader hooks or a generalized hook system; it's just that our priority to date has been getting the ESM loader worked out. In addition, the ESM hooks allow async functions, while CJS hooks would need some manipulation to support async calls. There's also the problem of the loaders running in different resolution spaces (URLs vs paths) as discussed. Once we have our base ESM loader API finalized I'm sure we could extend it to CJS with some extra resolution metadata and handling of the resolution spaces, but I very much feel that loader unification is a "nice to have" that is additive over the base-level ESM API, which should be the priority for us to consolidate and work towards first. That loader stability and architecture should take preference in the development process. That said, if you want to work on CJS unification first, feel free, but there are no guarantees the loader API will be stable or even unflagged unless we work hard towards that singular goal right now.

So what I'm saying is: chained loaders, whether the loader is off-thread, whether the API will be abstracted to deal with multi-realm and non-registry based APIs, and the translate hook all take preference on the path to a stable API, to me, over unifying ESM and CJS hooks. And that path is already very tenuous and unlikely, so we should focus our combined efforts on API stability first and foremost.

Similarly, I feel like the hooks could be more granular.

Implementing a translate or fetch hook for --loader could certainly be done and was a deliberate omission in the loader API. It is purely a problem of writing the code, making a PR, and the real hard part - getting consensus!

Doesn't hook into cjs require :'(

As mentioned above, this work can be done, but I would prefer to get the ground work done first.

That all makes a lot of sense and I appreciate you describing it for me :slightly_smiling_face:

I can start with pulling nodejs/node#18914 and getting that in a working state.

Just to spark some discussion, here’s a wholly theoretical potential API that I could imagine being useful to me as a developer:

import { registerHook } from 'module';
import { promises as fs, constants as fsConstants } from 'fs';

registerHook('beforeRead', async function automaticExtensionResolution (module) {
  const extensions = ['', '.mjs', '.js', '.cjs'];
  for (let i = 0; i < extensions.length; i++) {
    const resolvedPathWithExtension = `${module.resolvedPath}${extensions[i]}`;
    try {
      await fs.access(resolvedPathWithExtension, fsConstants.R_OK);
      module.originalResolvedPath = module.resolvedPath;
      module.resolvedPath = resolvedPathWithExtension;
      break;
    } catch {}
  }
  return module;
}, 10);

The new registerHook method takes three arguments:

  • The hook name, which is a point in Node’s code where these callbacks will be run. beforeRead, afterRead, etc.
  • The function to call at that point, which takes as input an object with all the properties related to import or require resolution and module loading that developers might want to override. Properties set on this object persist to other callbacks registered to later hooks in the process (e.g. module.foo set during beforeRead would be accessible in a different callback registered to afterRead).
  • (Optional): The priority to call registered callbacks. If multiple callbacks have the same priority level, they are evaluated in the order that they were registered.
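To make the registration and ordering semantics above concrete, here is a tiny self-contained sketch. `registerHook`/`runHooks` are hypothetical stand-ins for Node's side of the pipeline, and the convention that lower priority numbers run first is an assumption (the proposal leaves it unspecified):

```javascript
// Self-contained sketch of the proposed dispatch semantics.
// Assumption: lower priority numbers run first; ties run in
// registration order.
const hooks = new Map(); // hook name -> [{ fn, priority, seq }]
let seq = 0;

function registerHook(name, fn, priority = 0) {
  if (!hooks.has(name)) hooks.set(name, []);
  hooks.get(name).push({ fn, priority, seq: seq++ });
}

async function runHooks(name, module) {
  const list = (hooks.get(name) ?? [])
    .slice()
    .sort((a, b) => a.priority - b.priority || a.seq - b.seq);
  for (const { fn } of list) {
    // Properties each callback sets on `module` persist to later hooks.
    module = await fn(module);
  }
  return module;
}
```

Node would call something like `runHooks('beforeRead', module)` at the corresponding point in its loading pipeline, threading the in-progress module object through every registered callback.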

In the first example, my automaticExtensionResolution callback is registered to beforeRead because it’s important to rewrite the path that Node tries to load _before_ Node tries to load any files from disk (because './file' wouldn’t exist but './file.js' might, and we don’t want an exception thrown before our callback can tell Node to load './file.js' instead). I’m imagining the module object here has an unused specifier property with whatever the original string was, e.g. pkg/file, and what Node would resolve that to in resolvedPath, e.g. ./node_modules/pkg/file.

Another example:

import { registerHook } from 'module';
import CoffeeScript from 'coffeescript';

registerHook('afterRead', async function transpileCoffeeScript (module) {
  if (/\.coffee$|\.litcoffee$|\.coffee\.md$/.test(module.resolvedPath)) {
    module.source = CoffeeScript.compile(module.source);
  }
  return module;
}, 10);

This hook is registered _after_ Node has loaded the file contents from disk (module.source) but before the contents are added to Node’s cache or evaluated. This gives my callback a chance to modify those contents before Node does anything with them.

And so on. I have no idea how close or far any of the above is from the actual implementation of the module machinery; hopefully it’s not so distant as to be useless. Most of the loader use cases in our README could be satisfied by an API like this:

  • Code coverage/instrumentation: In an afterRead hook, a callback could count lines of code or the like.
  • Runtime loaders, transpilation at import time: In an afterRead hook, example above.
  • Arbitrary sources for module source text: In a beforeRead hook, our callback could assign content into module.source (and then Node would know to not read from disk for this module).
  • Mock modules (injection): Basically the same as previous, a beforeRead hook could return a different path to load instead, or prefill the source code to use: if (module.specifier === 'request') module.source = mockRequest etc. Ideally Node would handle if source were an actual module rather than just a string to be evaluated.
  • Specifier resolution customization: This is like the extension resolution example above, though if we also want to support specifiers that Node _can’t_ resolve, like import 'https://something', we would need another hook like beforeResolve.
  • Package encapsulation: We’re already implementing this as "exports", but it could just as easily be implemented as a loader, at least for ESM. It would be in beforeRead like the extension resolution example.
  • Conditional imports: In beforeRead or afterRead, based on some condition the module.source could be set to an empty string.

Anyway this is just to start a discussion of what kind of public-facing API we would want, and the kind of use cases it would support. I’m not at all married to any of the above, I’m just hoping that we come up with something that has roughly the same versatility as this.

Thanks for this write-up @GeoffreyBooth! Some thoughts to add to the discussion:

To me this looks like a transformer architecture, which exposes the entire in-progress module object to each hook, as opposed to the current --loader implementation, which has functional resolve and instantiate hooks. I would worry about exposing too large a surface area, e.g. developers doing something like reading a new file to override the old in the afterRead hook. Besides the transformer architecture, the differences largely come down to which hooks are exposed.

This api also doesn't allow for a loader to prevent other loaders from acting, which the wip multiple loader implementation at nodejs/node#18914 does. I don't think that's a bad thing, and I would be interested in hearing what people think on that front.

I'm not sure about the optional priority parameter. I don't think loaders should know much about what other loaders are registered or be making decisions about which order they're executed in. The user controls the order by choosing the order in which they register the loaders.

These are all good points. I would err on the side of exposing a lot of surface area, though, as that’s what users are used to from CommonJS. A lot of the power of things like stubs are because the surface area is huge.

In particular, I think we _do_ want to allow reading a new file to override the old, or at least modifying the loaded string (which of course could be modified by loading the contents of a new file); otherwise we can’t have transpilers, for example, or stubs that are conditional based on the contents of a file rather than just the name in the specifier.

The priority option is just a convenience, so that the user doesn’t need to be careful about the order that they register hooks.

One thing that I thought of after posting was to add the concept of package scope to this. A lot of loaders will only be useful in the app’s package scope, not the combined scope of the app plus all its dependencies. We probably want some easy way to limit the callbacks to just the package scope around process.cwd().

On the afterRead point, you're totally right -- there needs to be a way of knowing+overriding the original loaded source (which can currently be done by loading the source and modifying it in the instantiate hook). I think I gave a bad example: by providing the entire in-progress module object, the user can modify aspects of it even in hooks where it shouldn't be modified (a better example might be module.resolvedPath = 'something' in afterRead).

by providing the entire in-progress module object, the user can modify aspects of it even in hooks where it shouldn’t be modified (a better example might be module.resolvedPath = 'something' in afterRead).

Node could simply ignore any changes to ā€œshouldn’t be modifiedā€ properties. That’s probably better than trying to lock them down or removing them from the module object, since late hooks might very well want to know things like what the resolvedPath was at an earlier step. This also feels like something we can work out in the implementation stage of building this.

So there is a lot to talk about on loaders. We have had multiple meetings discussing some design constraints to keep in mind. I think setting up another meeting just to review things from the past would be helpful.

At the moment, from PnP's perspective:

  • We'd need a hook that takes the bare request + the path of the file that makes the require call, and is expected to return the unqualified resolution (so in a node_modules context it means such a loader would return /n_m/lodash rather than /n_m/lodash/index.js).

    • This is required for the CJS use case, as we have no reason to reimplement the extension / main resolution part.
  • We'd need a hook to instruct Node how to load the source for virtual files. For example, given /cache/lodash.zip/index.js, we would replace the default hook (that would use readFile) by a custom one that would open the zip, get the file source, and return it.

    • Note that in this case the source file might not exist at all. So rather than discussing in terms of paths, I think it would be beneficial to treat the resolution as an opaque identifier that doesn't necessarily match anything on the filesystem. We would then have a particular function (let's say require.toDiskPath) that would turn such opaque identifiers into paths usable on the disk (or throw an exception if impossible).
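The opaque-identifier idea above could be sketched roughly as follows. The name `toDiskPath` comes from the comment; the zip detection here is a hypothetical stand-in for PnP's real resolution logic:

```javascript
// Hypothetical sketch: a resolution like /cache/lodash.zip/index.js is
// an opaque identifier that does not exist on disk. toDiskPath either
// passes a real path through or throws when no on-disk path exists.
function toDiskPath(resolution) {
  if (/\.zip(\/|$)/.test(resolution)) {
    // The file lives inside an archive: no usable on-disk path exists.
    throw new Error(`No disk path for ${resolution}`);
  }
  return resolution;
}
```

Treating resolutions as opaque until explicitly converted keeps the resolver free to point at archives, network sources, or generated content without every consumer assuming `fs` can read the result.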

the first one is already possible with our current design (not including cjs). the second one is interesting and should probably exist, but it is unlikely that a cjs version can be added without breaking some modules that are loaded by it.

the first one is already possible with our current design (not including cjs). the second one is interesting and should probably exist, but it is unlikely that a cjs version can be added without breaking some modules that are loaded by it.

We're currently monkey-patching the fs module to add transparent support for zip and a few other extra files. It works for compatibility purposes, and it's likely we'll have to keep it as long as cjs environments are the norm, but I'm aware of the shortcomings of the approach and I'd prefer to have a long-term plan to sunset it someday, at least for ESM šŸ™‚

I think for cjs it would be doable if the require.resolve API was deprecated and split between two functions: require.req(string): Resolution and require.path(Resolution): URL, but it might be out of scope for this group and as I mentioned the fs layer is working decently well enough at the moment that it's not an emergency to find something else.

We're currently monkey-patching the fs module to add transparent support for zip and a few other extra files. It works for compatibility purposes, and it's likely we'll have to keep it as long as cjs environments are the norm, but I'm aware of the shortcomings of the approach and I'd prefer to have a long-term plan to sunset it someday, at least for ESM šŸ™‚

Part of my worry and reason why I feel we need to expand loader hooks as best we can for CJS is exactly that we don't guarantee this to work currently or in the future. Even if we cannot support C++ modules (the main problem with this FS patching approach that has been known since at least 2014 when I spoke on it) we can cover most situations and WASM at least can begin to replace some C++ module needs. I see this as a strong indication that we need to solve this somehow or provide some level of discretion for what is supported.

We have a mostly stable design document, please feel free to comment or request edit access as needed, at https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq .

The main contention is towards the bottom, around potential implementations, but the sections before that summarize a lot of different ideas from historical threads and research over the past few years.

Great to see work moving here! I really like the overall model; we maybe just have a few disagreements about the exact APIs. I've already stated my feedback in the doc, but will summarize it again here:

  1. I think the resolve hook and the body hook should be separated to allow for proper composability. Having the same call that resolves the module also load it makes it harder to add simple instrumentation hooks for source, or simple resolver hooks. For example, say I wanted a loader that resolves .coffee extensions before .js extensions. Calling the parent resolve function will give me { resolved: /path/to/resolved.js, body: streamingBodyForResolvedJs } for that resolved file. That is, the loader might have already opened a file descriptor for the .js resolution when it is in fact the .coffee resolution that we want to load. This conflation seems like it might cause issues.

  2. Using web APIs like Response and Blob seems like bringing unnecessary web baggage to the API. For example Response can be replaced by an async iterator and a string identifier for the format. Blob can be replaced by simply an interface containing the source string and a format string. I’m not sure what benefit is brought by using these web APIs that were designed to serve specific web use cases we don’t necessarily share (at least without seeing justification for what these use cases are and why we might share them). With the use of MIMEs for formats, do we now expect every transform to define its own MIME type?

  3. How would loader errors be handled over the loader serialization protocol? Can an Error instance be properly cloned over this boundary with the stack information etc? Or would we just print the stack trace to the console directly from the loader context, while providing an opaque error to the user. We need to ensure we maintain a good debugging experience for loader errors, so we need to make sure we can get error stacks. Or is the stack information itself an information leak?

Most of the above is relatively superficial though - the core of the model seems good to me. (1) means having two-phase messaging with loaders, so is slightly architectural though.

@guybedford

Per "separation": I agree there needs to be a "fetch"/"retrieve" hook of some kind, but not that `resolve` should be unable to return a body. The problem you explain above is about passing data to parent loaders, such as a list of extensions, but as far as I can tell it is not fixed by separating the hooks.


Per APIs, we can argue about which APIs to use but we should start making lists of what features are desirable rather than bike shedding without purpose. To that end I'd like to posit the following:

  1. Most APIs working on sources do not support streaming such as JSON.parse, JS parsers such as esprima, and WebAssembly.compile/instantiate. Even naive RegExp searches on the body will want to buffer them to a full body before searching. I think we should not focus on streaming for the first iteration in light of this.
  2. Data may be wanted in either a binary format or a textual format. This largely depends on the format. Consumption methods for both should be available as some naive steps can lead to corruption like split UTF code points. I like Blob because it does support this via .text() and .arrayBuffer().
  3. Streaming sources need care about how they are consumed. For example, reading the start of a stream to see if it begins with a magic number. If they cannot be replayed/cloned safely this is a problem. I like Response or a subset of that API because it has already solved these problems while preserving meta-data.
  4. When possible, opaque data structures allow for streaming either eagerly or lazily and can be largely swapped without consequence as we determine the best approach. When doing things eagerly, they can buffer and even complete reading before being requested. When doing things lazily, they can avoid costly I/O waste if they are not consumed. To this end, I believe we should have an opaque wrapper that does provide meta-data and if a resource is available prior to the stream of the resource's body.

If that sounds fine, we can add constraints and a data structure to the design document.

Overall, I do not think streaming is necessarily the best first pass given how little I expect it to be useful currently.

I found Blob to be a well suited fit for the above points if we wrap it in a container type so that we can iterate on streaming. It has plenty of existing documentation on how to use it as well as compatibility and familiarity. It may not be the most ergonomic API for all use cases, but I think it fits well and don't see advantages in making our own.


Error stacks are able to be serialized properly, but it depends on what you are seeking from a debugging experience. They are a leak technically, but I do not consider them a fatal leak since a loader can throw their own object instead of a JS Error if they wish to censor things. Not all things thrown necessarily have a stack associated with them, so if the question is mostly about how Errors are serialized it would just be ensuring they serialize properly (whatever we decide) when being sent across the messaging system. There is a question of async stack traces if we nest messages across threads but I am unsure if we want to even support cross thread stack traces as the ids of modules could conflict unless we add more data to represent the context.

I would be wary about user actionability on these messages as Loaders are likely to be more difficult to write properly than other APIs. However, debuggers and the like should also work if they want to debug things that way.

Per "separation". I agree there needs to be a "fetch"/"retrieve" hook of some kind, but not that resolve` should not be able to return a body. The problem you explain above is about passing data to parent loaders such as list of extensions, but does not seem to be fixed by separating loaders that I can tell.

As another example, consider a loader which applies a security policy that only certain modules on the file system can be loaded. This loader is added last in the chain, and basically provides a filter on the resolver, throwing for resolutions that are not permitted. The issue then with the model is that by the time the permission loader throws, the file might have already been opened by the underlying parent loader. This is the sort of separation of concerns that concerns me.

Per APIs, we can argue about which APIs to use but we should start making lists of what features are desirable rather than bike shedding without purpose.

The basic requirement is being able to determine what buffer to execute, and how to execute it in the module system. The simplest interface that captures this requirement is -

interface Output {
  source: string | Buffer;
  format: 'wasm' | 'module' | 'addon' | 'commonjs' | 'json';
}

The above could be extended to support streams by supporting source as an async iterator as well, but I'm certainly not pushing streams support yet either.

Error stacks are able to be serialized properly, but it depends on what you are seeking from a debugging experience.

Thanks for the clarifications re error stacks, we should just make sure we are aware of the debugging experience implications and properly support these workflows. Just getting the sync stack copied across as a string should be fine I guess.

As another example, consider a loader which applies a security policy that only certain modules on the file system can be loaded. This loader is added last in the chain, and basically provides a filter on the resolver, throwing for resolutions that are not permitted. The issue then with the model is that by the time the permission loader throws, the file might have already been opened by the underlying parent loader. This is the sort of separation of concerns that concerns me.

Is the concern reading the file, or evaluating the file? I would be surprised if the loader actually evaluated the file. I'm also unclear how this would prevent a loader from fetching that resource even if we split the hooks, if we expose the ability to read off disk etc. to loaders.

interface Output {
  source: string | Buffer;
  format: 'wasm' | 'module' | 'addon' | 'commonjs' | 'json';
}

I want to agree that this is terser, but I do not think it is simpler. A few design decisions here have impacts that I find to have underlying complexity.

  • Using a union type for source

    • This necessitates detecting which type you got using code like typeof source === 'string' at every usage of source.

    • With custom formats, this would be compounded as it may be unclear which is the preferred format to serialize for other loaders.

    • Because a union type is exclusive (the value is one type or the other), enforcing it means coordinating how to normalize data so the expected type is known. For example, when reading files into memory a general loader would return a binary representation rather than conditionally serializing to strings.

    • By enforcing a single type for the source, it makes adding types more difficult. If you expose just string and Buffer types code could be written using if (typeof source === 'string') {} else {/*only assumes a buffer*/}. The solution would be adding another field most likely to prevent breaking patterns like above and you would end up with body.stream which would be exclusive with body.source somehow?

  • Eagerly exposing the source without a method means allocating/normalizing serialized data even if it is never used by the current loader.
  • By not using an async method to expose the body/source, a head-of-line blocking problem is introduced: a body must be completely read before being handed to another loader.
  • Using an enum for the format

    • This enum would need to be a coordinated list with MIMEs for any loader supporting http/https/data/blob URLs etc. This is compounded by not having a clear conversion step for custom formats so that things like CoffeeScript could be converted from/to these schemes properly which would mean loaders also participating in MIME/enum normalization (either through the runtime, or via some ecosystem module). MIMEs both would not require this normalization, and would have an existing coordination mechanism through IANA even for formats not seeking to live under a standards organization by using the vnd and prs prefixes.

    • Using an enum prevents metadata attachments which are important when dealing with variants of formats. Consider parameters for dialects and encodings such as JSX; a mime can safely encode text/javascript;jsx=true. It will still be picked up as text/javascript even if the parameter is unknown. Unknown parameters are not entirely under the scope of IANA but MIME handling is supposed to ignore unknown parameters per RFC2045.
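The parameter-tolerance point above can be shown with a toy parser (this is illustrative only, not a full RFC 2045 implementation): a consumer that only understands the essence `text/javascript` still matches, and simply ignores the unknown `jsx` parameter.

```javascript
// Toy sketch of why MIME parameters compose: unknown parameters can be
// ignored while the type essence still matches.
function parseMime(mime) {
  const [essence, ...rawParams] = mime.split(';').map((s) => s.trim());
  const parameters = {};
  for (const param of rawParams) {
    const [key, value] = param.split('=');
    parameters[key] = value;
  }
  return { essence, parameters };
}

// parseMime('text/javascript;jsx=true')
//   → { essence: 'text/javascript', parameters: { jsx: 'true' } }
```

An enum of format strings has no equivalent escape hatch: a new variant either collides with an existing value or is rejected outright.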

Why would we ever not want order to matter? If a loader wants to affect other loaders, it should just have to run before them - like any other JavaScript code anywhere.

Is the concern reading the file, or evaluating the file? I would be surprised if the loader actually evaluated the file.

The concern is reading the file - doing unnecessary work in the OS. This is an indication that the abstraction model is missing the separation of concerns that is needed. File systems and URLs use paths as an abstraction, and separate resolution from retrieval. Yes, you can get resolution through retrieval with symlinks and redirects, but that is probably closer to alias modules.

It's pretty important to having a good composable loader API to ensure we maintain this distinction between resolution and retrieval.

This necessitates detecting which type you got using code like typeof source === 'string' at every usage of source.

We could go with just Buffer or TypedArray too by default, this resolves the next three points you mention as well I believe.

When the time comes to introduce streams via async iteration, just having the [Symbol.asyncIterator] check as part of the API would make sense to me.

Alternatively if we definitely want just one type, then we can always just enforce an async iterator of buffers from the start.
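A sketch of what "an async iterator of buffers" could look like as the single body type (the drain helper here is hypothetical, just to show consumption):

```javascript
// Sketch: a module body exposed as an async iterator of Buffers.
async function* body() {
  yield Buffer.from('export var ');
  yield Buffer.from('p = 5;');
}

// Hypothetical helper: drains the iterator into one Buffer for
// consumers that just want the whole source at once.
async function drain(iterable) {
  const chunks = [];
  for await (const chunk of iterable) chunks.push(chunk);
  return Buffer.concat(chunks);
}
```

A loader that only needs the final source calls drain(body()); a streaming-aware loader can iterate chunk by chunk instead.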

Eagerly exposing the source without a method means allocating/normalizing serialized data even if it is never used by the current loader. By not using an async method to expose the body/source, a head-of-line blocking problem has been introduced. A body must be completely read before being handed to another loader.

By going with an async iterator from the start that seems like it would resolve this concern too.

Note that the function returning such a body output can itself be an async function such that there is already an async method for providing the body.

This enum would need to be a coordinated list with MIMEs for any loader supporting http/https/data/blob URLs etc. This is compounded by not having a clear conversion step for custom formats so that things like CoffeeScript could be converted from/to these schemes properly, which would mean loaders also participating in MIME/enum normalization (either through the runtime, or via some ecosystem module). MIMEs, by contrast, would not require this normalization, and would have an existing coordination mechanism through IANA even for formats not seeking to live under a standards organization, by using the vnd and prs prefixes.

The list of enums is already how we do it in the current API and that seems to be working fine to me. What problems have you found with this?

Consider for example a CoffeeScript loader:

export async function body() {
  return {
    output: createCoffeeTransformStream(createReadStream(resolved)),
    format: 'module'
  };
}

There is no need to define the format: 'coffee' because the retrieval and transform are the same step, therefore the format only needs to correspond with the engine-level formats, which we already manage internally.

Using an enum prevents metadata attachments, which are important when dealing with variants of formats. Consider parameters for dialects and encodings such as JSX; a MIME can safely encode text/javascript;jsx=true. It will still be picked up as text/javascript even if the parameter is unknown. Unknown parameters are not entirely under the scope of IANA, but MIME handling is supposed to ignore unknown parameters per RFC 2045.

Most systems use a configuration file on the file system for managing transform options. tsconfig.json, babel.config.js etc. This provides the high degree of customization that these tools require.

I don't think most build tools would want to register a MIME and use this as a custom serialization scheme for their options.

We could go with just Buffer or TypedArray too by default, this resolves the next three points you mention as well I believe.

It doesn't solve head-of-line blocking; and it brings up the same issue of boilerplate: instead of type checking, loaders will be manually converting to a string properly for common textual usage. Most of the parsers (all?) take strings and not binary data. However, ArrayBuffer -> string conversion is lossy, so we shouldn't make everything strings. I'd be fine only shipping .arrayBuffer() but it would seem prudent to ease the common case here.

Alternatively if we definitely want just one type, then we can always just enforce an async iterator of buffers from the start.

I would not want an async iterator in the first iteration, as I still don't understand the streaming APIs we are seeking to support, nor the complexity of stream propagation. In particular, there remains a peek() problem with AsyncGenerators/Iterators since they cannot replay/tee safely. Also, how streaming data is provided to the result needs discussion.

Note that the function returning such a body output can itself be an async function such that there is already an async method for providing the body.

This would be fine and is the case in the design document via async blob().

export async function body() {
  return {
    output: createCoffeeTransformStream(createReadStream(resolved)),
    format: 'module'
  };
}

This needs a few things added, such as detecting that resolved is CoffeeScript; a CoffeeScript loader would not want to transform WASM. Also, for fetching operations on various schemes such as http, https, data, blob, etc., it needs to maintain a MIME -> format enum converter so that it can detect that those are CoffeeScript. It is unclear how these MIME-based schemes should declare their format for custom MIMEs. This is true for file as well: determining the format would use something like mime-db, which is what lots of things use (including GitHub) and which outputs MIMEs. IANA would not register a colliding type, so this is an example of getting a MIME association without registering one.
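The kind of MIME-to-format table such a loader would have to maintain itself might look like the following sketch (the entries and the 'coffee' format name are illustrative, not anything Node.js defines):

```javascript
// Sketch of the MIME -> engine-format table a loader supporting
// http/data/blob schemes would have to maintain on its own; the
// mappings here are examples, not a Node.js-defined list.
const mimeToFormat = new Map([
  ['text/javascript', 'module'],
  ['application/javascript', 'module'],
  ['application/wasm', 'wasm'],
  ['application/vnd.coffeescript', 'coffee'], // hypothetical vnd. MIME
]);

function formatFor(contentType) {
  // strip parameters such as ";charset=utf-8" before lookup
  const essence = contentType.split(';')[0].trim().toLowerCase();
  return mimeToFormat.get(essence); // undefined for unknown MIMEs
}
```

Every scheme-aware loader would need to agree on a table like this, which is the coordination burden the comment above is pointing at.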

Most systems use a configuration file on the file system for managing transform options. tsconfig.json, babel.config.js etc. This provides the high degree of customization that these tools require.
I don't think most build tools would want to register a MIME and use this as a custom serialization scheme for their options.

I would agree! This would be meta-data about the format as passed through to other loaders, it would not be useful for specific individual transforms contained within a single loader.

It doesn't solve head-of-line blocking; and it brings up the same issue of boilerplate: instead of type checking, loaders will be manually converting to a string properly for common textual usage.

Just making it a TypedArray or Buffer instance sounds sensible then to me. String conversion on those is straightforward through either TextDecoder or Buffer.prototype.toString() respectively.

It is worth noting though that we are thinking about this interface from two perspectives:

  1. As an output of a loader retrieval
  2. When requesting the output of another loader's retrieval

If we have a validation step that runs in between those two steps, then we can imagine the primary interface as:

interface Output {
  format: String;
  body: TypedArray
}

while the return type of the "retrieve hook" could allow strings that get converted into buffers through the validation phase for ease of use (since in many use cases that is what the user would be doing anyway, so it is a convenience API):

export async function retrieve (moduleReference: ModuleReferenceInterface) {
  return {
    format: 'module',
    body: 'export var p = 5'
  };
}

where the validator just does - output.body = toTypedArray(output.body) with a guard check.

Now you're welcome to disagree with such an API convenience, in which case that is fine too, since this is just sugar as opposed to a primary architecture argument. I'm just noting the nuance around this.

I would not want async iterator in the first iteration as I still don't understand the streaming APIs we are seeking to support, and the complexity of stream propagation.

The only reason I suggested considering this in the first iteration was because we were discussing a stable RetrieveOutput interface.

I do think supporting a RetrieveOutput as an object with a [Symbol.asyncIterator] would simplify that problem. Peeking as a primary argument seems a bit weak since it can always be achieved through straightforward stream interception.

For example, consider a loader which wants to scan for a source map:

export async function retrieve (moduleReference: ModuleReferenceInterface) {
  const { format, body } = await this.parent.retrieve(moduleReference);
  return {
    format,
    body: async function* () {
      // by treating body as an asyncIterator from the start, we need no guards
      for await (const chunk of body()) {
        const str = chunk.toString() // (assuming a Buffer)
        if (str.match(sourceMapRegEx)) {
          doSomethingWithSourceMap();
        }
        // we just made a passthrough stream!
        yield chunk;
      }
    }
  };
}

so my preference would still be to treat body as an asyncIterator from the start.

BUT - I can totally get behind it just being a Buffer / TypedArray initially too, and to be completely honest I'm not sure streaming is vitally important for sources - in fact I don't know of a single transform system that isn't synchronous anyway, or that at least has a synchronous serialization step.

This needs a few things added such as detecting that resolved is CoffeeScript; a CoffeeScript loader would not want to transform WASM.

CoffeeScript is detected by file extension. TypeScript is detected by file extension. WASM is detected by file extension. So all of these cases are available in retrieve.

Babel is really the edge case here in being selective on which files it operates on, but the babel.config.js file is there to provide this filtering, and Babel would still filter to only .js extensions in the first place.

I would very much prefer format to only indicate the engine execution format, being one of the Node.js predefined 'module', 'wasm', 'addon', 'builtin'.

There is no reason why a CoffeeScript loader would want to return CoffeeScript. Babel transformation passes are not their own individual loaders, they are passes within the Babel loader. _Every loader should output a valid v8 language._

In terms of handling out-of-band metadata from the resolver, there are a few things we could do here:

  1. Have a custom meta object on the ModuleReferenceInterface:
    _Benefits_: Easy to pass information between the two hooks.
    _Disadvantages_: Different loaders may collide on the meaning of the data as it is unstructured.

  2. Allow loaders to keep a side table:
    _Benefits_: Just an internal memoization, easy to reason about, and people will be doing it anyway for eg fs caching.
    _Disadvantages_: Difficult for a loader to share its internal knowledge with other loaders.

SystemJS did (1) for many years, and I'd say I wouldn't suggest it, and would instead suggest going with (2). If the loader is a class, storing state on the class instance is a natural model for users to apply.
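A sketch of option (2), with the side table stored on the loader instance (the class shape and method names here are hypothetical):

```javascript
// Sketch of option (2): the loader keeps its own side table, here keyed
// by resolved URL, instead of attaching metadata to shared objects.
class CachingLoader {
  constructor() {
    this.sideTable = new Map(); // url -> private metadata
  }
  resolve(url) {
    // record internal knowledge during resolution
    this.sideTable.set(url, { seenAt: Date.now() });
    return url;
  }
  retrieve(url) {
    // consult the side table later without any shared meta object
    const meta = this.sideTable.get(url);
    return { url, cached: meta !== undefined };
  }
}
```

The state never leaves the instance, so colliding interpretations of a shared meta object are impossible, at the cost of opacity to other loaders.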

Also, for fetching operations on various schemes such as http, https, data, blob, etc. it needs to maintain a MIME -> format enum converter so that it can detect that those are CoffeeScript.

_Firstly, I think the assumption that users would be loading CoffeeScript over HTTP to transpile in Node.js is simply not a good idea, especially given the lack of a persistent HTTP cache in Node.js. We already have a JS environment that is free of the file system and that is called the browser._

The loaders we are designing here are Node.js loaders, not abstract browser loaders.

But, on the other hand, who am I to suggest what people should be doing. And if they want to write fetch scheme loaders then so be it.

So the problem is - when we extend this system to URLs, how would a user maintain the URL-based Content-Type response metadata?

Well, the URL fetch operation would happen within the retrieve hook, and as such the Content-Type information is returned within the retrieve hook fine. There is not even a need for a side table to manage this process.

In addition, the default loader would throw for non-file retrieval, so the user writing this loader would know that and specially write a retrieval function that wouldn't call the parent for fetch scheme URLs. Because the hooks are separated we can still call the parent resolve hook just fine, potentially even virtualizing files to URLs for node_modules if desired to avoid duplicates over such a scheme.

I have to say I’m loving the unexpected prominence of CoffeeScript in all these examples. I’ll happily take the dollars that used to go to @jdalton on mentions of lodash.

I think the assumption that users would be loading CoffeeScript over HTTP to transpile in Node.js is simply not a good idea

It might very well not be a good idea, but loading TypeScript over HTTP to transpile on the fly is already supported by Deno, so clearly there are users who would want to achieve this use case.

There is no reason why a CoffeeScript loader would want to return CoffeeScript.

It’s common to string together loaders that are meant to operate in sequence; I have one project that uses Browserify, and I have my CoffeeScript files processed in order by Coffeeify (transpile to JavaScript), Envify (replace process.env.* references with values from the environment during building), Babelify (transpile down to IE11-compatible JS) and browserify-shim (replace require calls to certain specifiers, like jquery, with references to global objects for libraries I’m loading via separate <script> tags). Pretty much anything you can do as part of a build pipeline people might theoretically want to do via loaders instead, and some users will likely want to do so to avoid needing the complexity of a separate build tool and watched folders and so on; lots of people use require('coffeescript/register') and require('babel/register') today during development for that reason.

My example admittedly doesn’t have CoffeeScript be output for further processing, but it’s easy to imagine use cases for such a thing. CoffeeScript already supports JSX interspersed within its code, but imagine for a second that it didn’t; someone could write a coffeescript-jsx transpiler that takes CoffeeScript-with-JSX-inside and returns straight CoffeeScript. (Something similar to this actually exists: https://github.com/jsdf/coffee-react.) If in my example above I wanted to use such CoffeeScript with JSX, I would have this ā€œcjsxifyā€ transpiler as the first in my series of transforms. I can imagine lots more examples involving TypeScript, like people extending TypeScript to allow non-standard syntaxes or macros. The package Illiterate extends the ā€œunindented lines are commentsā€ part of Literate CoffeeScript to any language, and would be another example of a transform that would output CoffeeScript or TypeScript or anything else. For a while I’ve been batting around the idea of the CoffeeScript compiler outputting TypeScript, for CoffeeScript code that somehow contained type annotations. Anyway, long story short, yes transforms need to be chained and they need to be able to output non-JavaScript.

One other part in this is the source type. Not all transforms will know whether the original source is Script or Module, and ideally they shouldn’t be required to determine that or pass it along. Perhaps Node could make that determination in its usual way (extension and package.json type field) and that can be the default value for the source type if a loader doesn’t override it. That way a .coffee or .ts file inside a "type": "module" package scope would be known to be treated as ESM, for example. Or is this irrelevant because these are loaders inside the ESM resolver, and therefore everything is already known to be ESM? Is processing CommonJS files something that can be in scope for a loader, for example if someone writes a transform to convert require calls to import calls?

@guybedford I also agree about (async) streaming not being vitally important, but it may be worth discussing an async preload hook for a source.

One related scenario (independent from source map):

  • Using a transpiling loader that takes an entry point for a graph of source files:

    • If transpiler will concatenate, an async preload hook helps defer such overhead so that it only occurs if/when it needs to.

    • If transpiler will remap sources to transpiled modules, an async preload hook can allow for necessary updates as/when needed.

Certainly, the arguments are very appealing for Symbol.iterator over Symbol.asyncIterator for the actual body, and here a preload hook would be more like (based on your previous example):

export async function retrieve (moduleReference: ModuleReferenceInterface) {
  const { format, body } = await this.parent.retrieve(moduleReference);
  return {
    format,
    body: async function () {
      const chunks = [];
      for await (const chunk of body()) {
        const str = chunk.toString() // (assuming a Buffer)
        if (str.match(sourceMapRegEx)) {
          doSomethingWithSourceMap();
        }
        chunks.push(chunk);
      }
      return chunks;
    }
  };
}

I am not sure how I feel about this myself; my first impression is that this actually creates a lot more overhead (tricky to tell), and at least for multi-loaders and large sources I would say it certainly does.

Yet, a single promise in almost all other cases seems to be a reasonable enough offer without compromising too much on performance.

Can we maybe consider giving the option to return either an asyncIterator or a promise?

@guybedford on your points:

CoffeeScript is detected by file extension. TypeScript is detected by file extension. WASM is detected by file extension. So all of these cases are available in retrieve.

If content is loaded from another place than a file (in-memory, from a bundle, etc) then the extension on the URL would really just be a less than direct way of specifying format.

There is no reason why a CoffeeScript loader would want to return CoffeeScript. Babel transformation passes are not their own individual loaders, they are passes within the Babel loader. Every loader should output a valid v8 language.

A CoffeeScript loader would not output CoffeeScript, but its parent _would_, otherwise there's no point in having a CoffeeScript loader. So custom loaders do need to be able to retrieve non-v8 content. Even the default retriever should be able to if it resolves to a file.

Based on that there needs to be a more robust format specifier than just the enum of supported node formats. The current loader hook system supports a dynamic option on top of that enum, but that makes it difficult for transform loaders to infer whether they should act. MIME types make sense to me, though I'm certainly not an expert on those.

FWIW, it is completely valid and safe to check if a file is wasm based on the first few bytes, and a wasm file may not have an extension (generally on Linux and macOS, where the system can register a wasm binary just like an ELF binary).
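The check is just the four-byte magic number \0asm at the start of the file:

```javascript
// The wasm binary format begins with the four bytes "\0asm"
// (0x00 0x61 0x73 0x6d), so sniffing it is extension-independent.
function isWasm(bytes) {
  return bytes.length >= 4 &&
    bytes[0] === 0x00 && bytes[1] === 0x61 &&
    bytes[2] === 0x73 && bytes[3] === 0x6d;
}
```

A format-detecting hook could run this on the first chunk of any body, no file extension required.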

If we want _custom transport_ loaders to be composable with _custom transform_ loaders, then we have the same problem in the single hook model in that they are being treated as the same thing.

Personally I don't think having custom transport being composable with custom transform loaders should be seen as such an important use case as to define the model.

There seems to be a desire to "virtualize Node.js" here, to free it from the file system. But we already know that the only way to virtualize Node.js _is to virtualize the file system_.

If we really want to support transport loaders being composable with transform loaders, then we would need to separate into three hooks - locate/resolve , fetch/retrieve, transform.

If we were to separate into three hooks with a separated transport hook, then I agree that a Response-style object and MIME model makes sense in the API. But without such a separation it makes no sense to me.

Another concern I have with the web APIs even then, though, is that Node.js has no existing Response support / web streams support / etc. We would likely be putting these new globals into the loader, not quite at full spec parity, just for this use case; that would be a large amount of code to maintain. And introducing large amounts of maintenance overhead into Node.js core should always be taken very seriously.

In addition if we do anything that is not quite spec compatible, then changing to be spec compatible in future must not be a breaking change or we break user loaders.

Sorry, I am getting lost in some of the details here…

If we really want to support transport loaders being composable with transform loaders, then we would need to separate into three hooks - locate/resolve , fetch/retrieve, transform.

I wonder if what is being considered is really "transport loaders" here. In my mind, a platform-independent architecture would separate "access", "locate" (ie scan) and "resolve" (ie join/normalize/map), in the respective reverse order, and theoretically the cleanest custom loader interface would be restricted to resolve only.

The locate (and even access) interface is certainly favourable with direct file-system access, but not for the web and possibly even other Node.js-based runtimes — if we are thinking cross-platform then the parallel here should consider NW.js and Electron at least conceptually, as well as built executable binaries that are currently require-centric.

The scenario worth considering with this breakdown is a project using the same chain of resolve | … | transform (ie content) hooks/loaders, and including somehow a separate layer for adaptive locate and access (ie resource) hooks that are not required to resolve the idempotent URL of an imported module (ie the absolute URL for static/dynamic import and the import.meta.url for the context/realm).

This divide between content and resource hooks is important imho to promote good user-facing interfaces. So, content hooks/loaders would be easy to reason about in platform-agnostic terms, separate from the more complicated aspects of how to locate and access a resolved idempotent URL. Having a custom loader that needs to do both would be the less recommended (or more advanced, if you want to call it so) path, which can be easily abstracted as two separate custom loaders somehow sharing state (likely what is actually needed for such a use case).

In short, separate interfaces for content versus resource hooks.

Upon reflection… I think what we are dealing with here — especially if we consider cases for rewriting absolute URLs in source text — is that URLs have both content and resource manifestations. A fully-resolved URL being the single content-facing URL (ie import and import.meta.url) which may usually also be the identical URL of the actual resource (ie for fetch or fs.…) or somehow one that has an idempotent (per context/realm) mapping to it (say for none-pathname URL aspects).

I think @SMotaal has a point and I do think there is some distinction of URL that we haven't quite been able to describe or grasp. For me, the concrete example is when a resolve wants to virtualize a builtin like node:fs.

Doing so means informing child loaders that it is acting as node:fs but has a different body.

I'm unclear how a fetch would work on the "primordial" resource at the URL node:fs. If we are virtualizing, it is likely that our attenuated module will delegate some tasks to the primordial, but at the same time it is unclear what fetch('node:fs') would return for the primordial body; null seems like it would be bad but is roughly what the Loader Design doc currently does. It seems like we need to have this distinguishing characteristic of non-intercepted vs interceptable locations while preserving the ability to virtualize things.

One thing I was playing around with is the concept of protocol handlers. E.g. node: could be a protocol that simply doesn't allow registering a handler, so no fetch for it would ever hit custom code. The downside is that it would make generic retrieve hooks more awkward potentially (since there's no longer a single path for all kinds of URLs).

@jkrems how would a resolve redirect/virtualize node: in a nested manner? If A and B both want to modify node:fs they need to be able to communicate that they are acting as if they are returning node:fs

I'm squarely in the camp of "resolve should only operate on URLs". In that scenario, passing around node:fs isn't an issue because nobody needs to associate it with a resource.

@jkrems I don't understand still / that isn't actually related, given:

  1. A needs to instrument node:fs
  2. B needs to instrument node:fs
  3. A is a parent of B

A returns a reference to node:fs redirected to their attenuated form (e.g. file:///alt-fs). B needs to treat the attenuated form as node:fs, but the redirection has prevented the comparison from working because it sees A resolved to file:///alt-fs.

The way to instrument fs is via:

import fs from 'node:fs';
fs.fn = instrument(fs.fn);

That will apply to both CJS and ESM, and it will update the live bindings.

From a loader, you would do the above by providing the mutator at a custom scheme perhaps:

// it would be nice to provide loaders with an "init" function
// that they can use to "attenuate" / prepare the environment
export async function init () {
  // init returns a module to eval in the target environment
  // here we are loading the fs mutator
  return `
    import 'apm-mutators:fs';
  `;
}

export async function retrieve (specifier) {
  // builtins are never "retrieved" as they are internal to Node.js
  // we could avoid this by exposing the internal loader under an internal:// scheme but that risks exposing Node.js internals to public loaders
  // so it seems advisable to maintain this separation to me
  assert(specifier !== 'node:fs');

  // code to apply the fs mutator
  if (specifier === 'apm-mutators:fs') {
    return {
      body: async () => `import fs from 'node:fs'; fs.fn = instrument(fs.fn);`,
      format: 'esm'
    };
  }
}

Note that builtins do not get the retrieve hook called on them, as they are internally provided from Node.js core and not hookable by loaders.

Virtualizing cannot be achieved by changing the resolution scheme. Instead virtualizing and attenuation must be achieved within the same original scheme. New schemes are useful for new types of loading, but that should be seen as complementary to the existing types as opposed to a virtualization of them.

I think there's two separate problems here:

  1. If built-ins are exposed to resource retrieval hooks, what would that response type look like? Should they be exposed to resource retrieval hooks?
  2. How can multiple instrumentations of the same target module, be it built-in or not, coordinate?

I'm not sure those two problems are the same discussion. I was only responding the first one.

In the case of (2) I would expect that A and B communicate by imports in their respective instrumentation code. E.g. the one that runs first would have to import the target module which the other could then intercept. The exact semantics of this are tricky and there are definitely unsolved problems around how such a hook can be written safely.

So to quickly recap on the rather rough definitions I mentioned in today's meeting:

  1. A container is the single records interface for a given rootRealm or compartment (ie nested in a realm where separate module mapping could take place); it has a loader which has nested scopeRecords — where potentially we map things like:

    {
      '~': Scope({

        // just the one basic idea
        id: '~',

        // Realm/compartment container interface for modules records
        //   and where resource idempotency is enforced.
        container,

        base: normalizeBaseSomehow(
          // ie this is always a directory URL and assuming it supports
          //   paths we can normalize the base with:
          //
          //     new URL('./', container.base)
          //
          //   but even if it does not support paths, separating resolve
          //   from locate works because we always resolve relative
          //   specifiers that are scoped with:
          //
          //     new URL(specifier, `file:///${referrerId}`).pathname
          //
          //   with URL-based resolutions being simple, universal and
          //   reliable (more so for http://fake/ than file:///)
          //
          container.base,
        ),

        // ie sub scopes like ~/node_modules/‹moduleIds›/
        //   where each can map '~' to subScope.base… etc.
        get scopes() { return getSubScopesSomehow(this.base) },

        // … some structure to retain scoped modules and exports
      }),
    }
    
  2. A loader.resolve(moduleSpecifier, referrerId) allows more traditional resolutions where we omit the Scope.base and use Scope.id — this imho makes it easier for hook authors to reason about resolving, easier to avoid inconsistent remapping of scoped specifiers, and the added win is that loaders operating on this hook are not privy to more information than what is necessary.

  3. A loader.locate(moduleId, scopeRecord) returns the actual container-referred location — where it is possible to think of platform specific behaviours like path searching... etc.

  4. A loader.retrieve(location, containerRecord) performs the actual fetch, disk, or cache op to return the source readable.
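The URL-based resolution mentioned in the sketch above (resolving a scoped relative specifier against a synthetic file:/// base) is straightforward to demonstrate:

```javascript
// Sketch: resolve a relative specifier against a referrer id by
// borrowing WHATWG URL path resolution with a synthetic file:/// base,
// then reading the joined pathname back out. No real file system or
// path-bearing scheme is required.
function resolveRelative(specifier, referrerId) {
  return new URL(specifier, `file:///${referrerId}`).pathname;
}
```

For example, resolveRelative('./b.js', 'pkg/a.js') gives '/pkg/b.js', and '../' segments normalize the same way they would for real URLs.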

The most critical thing imho is that when the source text of a module evaluates, it needs to have the import.meta.url safely suited for fetch(…) or fs.(…) ops. This may in fact require some transport layer rewiring on a container basis to affect this kind of virtualization in a clean, transparent and consistent manner (theoretically a URL instance having a container record can safely be handled outside of the container itself to access a specific resource in any other container(s) not having that privilege, where there they would only see it as an opaque but trusted locator object).

Additional considerations worth mentioning for realms, and specifically for augmented or proxy modules — when such synthetic modules are evaluated/initialized, they will have private links (imported bindings) to the original modules which will likely need special records for the realm (ie container) so that they do not collide with the mapped augmentations and still allow the original modules to (if applicable) rely safely on import.meta.url derived ops. @bmeck — I think this relates to your previous point about the distinctions needed, where I think augmentation happens somehow elevated from the realm in which they are mapped to the augmented instances but after the parent realm records are finalized — ie at least for the relevant module subgraph(s) in the parent realm/container.

I will update with a link once I return (end of the month).

CORRECTIONS: marked with <ins>

So the tangent that is missing in all this is really about shared state across virtual (realms, contexts) or logical (threads, processes) boundaries…

@jkrems touched upon this:

I would expect that A and B communicate by imports in their respective instrumentation code.

Referring to @bmeck's example:

A returns a reference to node:fs redirected to their attenuated form (e.g. file:///alt-fs). B needs to treat the attenuated form as node:fs.


Note: Apologies, I realized after-the-fact I mixed up talking about A and B in the sense that they are modules, where originally they were referring to custom loaders. That said, I think that scopes and containers described in my previous comment may improve the dynamics on the loader side of things as originally raised by @bmeck.


So some floating questions… If B attenuates A:

1 If they are in the same context and realm, then B importing A only creates a single instance of A… likely this happens with "safe compartments" (already taking shape elsewhere), so…

  • Are there other unexplored notions of same-realm mapping we are not thinking of?
  • Are there times where multiple instances of A are expected?
  • Does it make sense or help in any way to divide graph operations so that all modules not attenuating others and/or dependencies of attenuated modules are deferred? (ie circular references via static bindings being unlikely here, but something to ponder).

2 If A is in rootRealm and B is in nestedRealm:

  • Does it help to finish parent-realm graph operations before nested ones?
  • Are there times where multiple instances of A are expected?
  • If A is a single instance, are we concerned about discontinuities (ie getObject().constructor !== globalThis.Object)?

3 If A and B are in realms where A is not in some ancestor of the realm in which B will be instantiated…

or,

4 If A and B are not in the same context:

  • Just for completeness, does that happen anywhere, or more accurately, how/why would it? I don't want to just assume though, because if it did, then it is important to give that some thought.

5 I'm trying to understand and separate containment versus scoping challenges before considering the overlap (derived from @guybedford's example above):

    > ```js
    > import 'tool'; // which exposes globalThis.instrument
    > import fs from 'fs';
    > export const {fn} = globalThis.instrument(fs);
    > ```

    Here there is overlap between two distinct complexities:

    1. Sharing state between module instances of `tool`.
    2. Augmentation/mapping (ie like attenuation) happening somehow without collisions.

    And while `tool` can freely decide on how it would want to address communication between its own instances, it would make sense to first verify that this is a pattern that `tool` would want to use. And if so, would catering to this pattern involve additional APIs being offered (ie to dissuade monkey-patching) or at least additional tests for related aspects?

Meta as a non-native speaker: is there a simpler term for "attenuated"? I definitely had to Google define attenuate and even after doing it I'm only mostly sure I understand it. Maybe "instrumentation code" or "wrapper module" or something..? I'm starting to feel like this discussion gets drowned a bit in "big words".

To try to clarify my proposed flow:

  1. A sees fs, returns known-a:fs.
  2. B sees known-a:fs, returns it unchanged because it's not an instrumentation target.
  3. A generates code for known-a:fs that imports node:fs.
  4. That code gets processed like any other module, resolution starts.
  5. A sees node:fs in the context of its own instrumentation code, returns it unchanged.
  6. B sees node:fs, resolves to known-b:fs.
  7. B generates code for known-b:fs that imports node:fs.
  8. Problem: needs exit condition so that we don't start looping.
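The flow above can be modeled as a toy chain (the known-a:/known-b: schemes match the example; the chain runner itself is an illustrative stand-in for Node's actual hook machinery, not a real API):

```javascript
// Toy model of the resolution chain in the 8-step flow above.
// Loader A instruments "fs"; loader B instruments "node:fs".
const loaderA = (specifier, next) =>
  specifier === 'fs' ? 'known-a:fs' : next(specifier);

const loaderB = (specifier, next) =>
  specifier === 'node:fs' ? 'known-b:fs' : next(specifier);

// A runs first, then B, then the default (identity) resolution.
const resolveChain = (specifier) =>
  loaderA(specifier, (s) => loaderB(s, (s2) => s2));

const step1 = resolveChain('fs');      // step 1: A claims it
const step6 = resolveChain('node:fs'); // steps 6-7: B claims the node:fs
                                       // import inside A's generated code,
                                       // whose own code imports node:fs
                                       // again -- the loop risk
console.log(step1, step6); // known-a:fs known-b:fs
```

Because B's generated code for known-b:fs again imports node:fs, running the full chain on it would hand the specifier back to B, which is exactly the missing exit condition of step 8.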

One possible solution here would be that "loader owned code" is a first-class concept and the loader chain is adjusted to only run loaders "below" (above?) the loader that owns the code.

I don't really have a better word than "attenuated" but "wrapper" would work for most situations we are talking about. Attenuation ~= a customized view of something (either by mutation, scoping, or wrapping). If we are ok with not being involved in mutation or scope based access "wrapper" should be fine.

One possible solution here would be that "loader owned code" is a first-class concept and the loader chain is adjusted to only run loaders "below" (above?) the loader that owns the code.

I don't understand this, could you expand?

is there a simpler term for "attenuated"? … discussion gets drowned a bit in "big words".

Specifically, attenuated here carries a distinct concept which I myself was only recently introduced to… folks from object capabilities (ie the SES/Realms weekly meeting) describe this as the process of taking something with a greater degree of authority than necessary for the consumers to function correctly and limiting it to exactly what is needed, for security (ie against rogue code).

So as @bmeck describes, it functionally is creating some customized view of the module ("altered" may be a good alternative term imho), but an important aspect of attenuation is that it presumes the realm/compartment would not still somehow refer to the original module's namespace.

Note: I think this is important, in that it differs from how other altered modules might generally still coexist in the respective container(s), while attenuated ones should not.

I don't understand this, could you expand?

Right, let me try. We're aiming for the following module graph I assume:

+--------------------+  "fs"   +--------+  $x   +--------+  $y   +-----------+
| file:///client.mjs | ------> | $wrapB | ----> | $wrapA | ----> | nodejs:fs |
+--------------------+         +--------+       +--------+       +-----------+

The idea would be that $wrapB would be marked as "owned by loader B". Which means that only the loader stack above B would run (since it wouldn't run on itself). That way - in theory - we would have an exit condition where the loader stack gets smaller the closer we get to the "real" fs.

Specifically attenuated here has a distinct concept

Okay, so that sounds like it's a very specific kind of instrumentation code / narrow set of goals. I'm not sure if it's necessary to use it as our only example here since the issue doesn't seem specific to this kind of instrumentation code.

Let's be careful not to be too specific unless it actually adds important constraints and if we do, we should be specific about the constraints and how they apply to the problem at hand. :)

The idea would be that $wrapB would be marked as "owned by loader B". Which means that only the loader stack above B would run (since it wouldn't run on itself). That way - in theory - we would have an exit condition where the loader stack gets smaller the closer we get to the "real" fs.

In the design document we have a success criterion of having 2 different loaders instrument fs.readFileSync. If I understand the comment above, it would prevent 2 loaders from interacting on an "owned" location?

If I understand the comment above it would prevent 2 loaders from interacting on an "owned" location?

The "owned" location is the module generated by/injected by the loader, not fs itself. So in that diagram, the import chain should ensure that $wrapA gets to patch readFileSync, $wrapB sees that patched version, can patch it again, and client.mjs gets the double-patched version. Maybe I'm missing something about the requirements?

@jkrems I was not understanding, I think I'm getting confused on what $wrapA does differently from $wrapB. $wrapA sees the true builtin (no source), but $wrapB modifies the source result of $wrapA? Or does $wrapB also see an opaque source?

This is fundamental to all the "safe modules" work happening… so imho, this is important here @jkrems — there is no way to secure the module system if it only approximately provides attenuation guarantees but not quite.

To elaborate the concept, consider the more abstract container, which could be an SES realm or "safe compartment", in that you can overload module identifiers by remapping them opaquely, so that every single module instantiated in that container importing the specifier only ever receives the remapped module and not the original.

With that, attenuated:node:fs remapped to node:fs effectively never also exposes the original node:fs, which means:

  1. technically the module map of the container only ever has one node:fs record, but,

  2. attenuation happens in an elevated way where the mappings do not affect attenuated:node:fs itself which still gets to access the original node:fs.

    Note: the assumption is that import … 'attenuated:node:fs' would not be something that would take place in the actual container — but if it did, I am inclined to say that would be a completely different instance, a somewhat pointless one regardless of how you slice it due to redundantly attenuating the attenuated node:fs namespace.

Does that help?

@bmeck Ah, let me clarify: Nothing in my example is modifying any source code. $wrapA is a full module that has an import statement with the specifier $y and then re-exports a patched version.

Does that help?

I'm not sure. It feels like something that already is the case with all loader designs discussed here since loaders control all specifier resolution(s). So I'm not sure what you're trying to say by bringing it up..? I'm also not sure what realms bring to this..? Maybe it would help if you expressed if there's a specific gap that is connected to realms or containers or if it's just an OT aside to elaborate on SES concepts..?

Maybe it would help if you expressed if there's a specific gap that is connected to realms or containers or if it's just an OT aside to elaborate on SES concepts..?

Fairly certain that while a "specific gap" is the more concrete thing to aim for, "something possibly being glossed over in details not explored" can be an equally fatal mistake in the design process. That said, fresh eyes might help articulate things better.

(below I use _alterer_ to be a Loader that attenuates or instruments or in some other way modifies a parent loaded module by returning new module content that imports from the parent)

@jkrems if I'm understanding your "owned module" idea correctly, it stops both Loader B and Loader A from handling imports in $wrapA. This is fine when both A and B alter the same builtin, but has the side-effect of preventing B from wrapping a different builtin used in $wrapA. I'm not entirely sure if there are real use cases that would be impacted by this, but it does break the expectation that a builtin alterer should alter _all_ uses of the builtin except its own.

An example:

  • Loader A alters node:os to write all calls to a log file (using node:fs)
  • Loader B alters node:fs also to write all calls to a log file (using node:fs)
  • An import of node:os returns $wrapA
  • Evaluation of $wrapA skips Loaders B and A because it's owned by A
  • The evaluated $wrapA's node:fs is the true builtin
  • A logfile of node:os calls is written and correct
  • A logfile of node:fs calls is written, but is missing the calls to write a logfile made in $wrapA

@A-lxe @jkrems @bmeck do you guys think it is possible to discuss this?

I myself appreciate how tricky it is and how much effort everyone of you is doing, and while I am technically taking time off, this specific aspect of hooks for remapping altered modules keeps dragging out.

Before this, the only model for altered modules that I came across — aside from the good'ol require goodness — has been the idea of making a compartment…

That said, if the aim is to alter modules systematically by chaining and ownership, and to do so absolutely by not changing the module map for the particular module id (ie node:fs) then I am certainly feeling a little behind visualizing this myself even with the very clear efforts you are putting in communicating here (my apologies).

I can set us up on zoom, but can someone fire up a doodle with more practical times?

I'm thinking 8am to 8pm CEST which currently works for me being likely not practical for everyone else — and I would not want to hold back having this discussion happening.

I've made a doodle poll with some times that hopefully should work for each of us:

https://doodle.com/poll/g2chvakibfky7drt

Tell me if you think I should add times or anything like that!

It looks like Monday 9-10am PDT (6-7pm CEST) is a good option. I've made a calendar event here.

@SMotaal could you send me a link to the zoom when you set it up? I can add it to the event (or add you as an editor with your email).

I just ended up creating a recurring ad-hoc meeting for discussions and posted the link to the team's discussions.

Please let me know if you have trouble viewing it.

I get a 404 for that link. Does #372 need to be merged?

@A-lxe I sent you the link on hangouts

Based on discussion in #386, to prevent blocking unflagging esm we should create a flag that enables the current (and future) experimental --loader feature.

I propose adding an --experimental-loaders flag that enables the use of the --loader option.

I also figure that --experimental-loaders shouldn't be required when --experimental-modules is set, thereby deferring the need for a user to set --experimental-loaders until esm is unflagged. I'd appreciate thoughts on that.

I'll have a PR for this by Friday, given no objections.

@A-lxe SGTM. There is a way to infer one flag from another in the flag parser

What about renaming --loader to --experimental-loader so that it is still only one flag, but clearly experimental?

Alternatively, what if we simply retain the --experimental-modules flag for features that are still experimental like loaders?

Alternatively, what if we simply retain the --experimental-modules flag for features that are still experimental like loaders?

i think the naming would confuse people. i'd suggest instead of this a new flag like --experimental-module-features with char-delimited positional values if you want to selectively enable features (e.g. --experimental-module-features='loaders,feature1')

Should we separate --loader-v0/--deprecated-loader from --experimental-loader? So that we can distinguish between people trying to use the current flag from people trying out the next iteration? Or is the idea to change the name as we land the beginnings of the new loader hooks?

For bash scripts wrapping Node.js initialization with a loader, it seems like an error fallback will be needed to properly handle loader support.

The more flag options there are, the slower the startup will be on older Node, and the ~300ms it takes per attempt is not insignificant.

Ideally I would prefer if we could try to make a goal of enabling at least two basic inits to get Node.js running with a loader. At the moment we're looking at three though:

  1. First try node --loader (when loaders are stable)
  2. If that fails, try node --experimental-loader (modules stable, loaders unstable)
  3. If that fails, try node --experimental-modules --loader (current shipping implementation)

(by the third attempt, startup is now 1 second in total before JS bootstrapping)

(getting the above to two is my justification for suggesting loaders under --experimental-modules, but I can appreciate if this goal is not prioritized as well)

we shouldn't have multiple versions of experimental things at the same time. we explicitly created experimental flags to allow us to develop a system without worrying about releases/versions/etc.

we explicitly created experimental flags to allow us to develop a system without worrying about releases/versions/etc

The aspect of an --experimental flag that allows not worrying about versions is that any break is permitted at any time under the flag. So changing the loader API or any other modules API completely while still continuing to use an --experimental-modules flag exactly could be ok.

i'm just referring to stuff like --loader-v0, and needing scripts that try a bunch of node options. a script shouldn't ever do that because if --loader isn't directly available anything else it might try would be an experimental feature with a potentially different api.

i'm just referring to stuff like --loader-v0

To clarify the intent: I wasn't suggesting that both flags co-exist. I was suggesting that the current implementation would move behind that flag asap to give a strong signal that things are about to break if people keep using the flag and aren't changing their code. In that scenario, the flag would be replaced with a new flag the moment the revised loader hook API lands. My hope was to reduce confusion around the time when we revise the loader API.

thanks for clarifying. in that case, we can just use a --experimental-loader flag, and replace it with --loader when it goes stable.

So for the past few weeks I’ve been trying to implement a CoffeeScript loader, and I’ve pretty much given up in frustration trying to make it work with the existing API’s resolve hook and data URLs. This experience led me to implement https://github.com/nodejs/node/pull/30986, to add two new hooks I called getSource and transformSource. Unlike resolve, these are very narrowly scoped: getSource allows the user to provide a function to override Node’s reading a file from disk, to potentially instead get file contents from a memory cache or an HTTP call or somewhere else; and transformSource allows the user to do something with that source just after it’s been loaded but before Node does anything else with it, for example to transpile it.
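For reference, the shape of those two hooks is roughly the following (written here as plain functions; in a real loader file they would be exported, and compileCoffee is a stand-in for an actual transpiler, not a real API):

```javascript
// getSource: override where module source comes from (memory, HTTP, …).
// transformSource: rewrite the source after loading, before Node parses
// it. Both fall through to the defaults they are handed.
const compileCoffee = (src) => src; // stand-in: real transpilation here

async function getSource(url, context, defaultGetSource) {
  if (url.startsWith('memory:')) {
    // Serve module source from somewhere other than disk.
    return { source: 'export default "from memory";' };
  }
  return defaultGetSource(url, context, defaultGetSource);
}

async function transformSource(source, context, defaultTransformSource) {
  const { url, format } = context;
  if (format === 'module' && url.endsWith('.coffee')) {
    return { source: compileCoffee(source.toString()) };
  }
  return defaultTransformSource(source, context, defaultTransformSource);
}
```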

Neither of these hooks are described at all in the design doc linked above. To be honest I couldn’t quite figure out how I was supposed to convert that document into code; what would help me is a document that shows the theoretical API as it would be written as documentation in https://nodejs.org/api/esm.html#esm_experimental_loader_hooks.

My feedback on the current API is that resolve is too broad. In the current API, building any type of loader requires providing a function for a new resolve hook, but any resolve hook requires a lot of boilerplate because resolve is doing so many things: converting a specifier into the URL to load, deciding its format, deciding what file on disk to load for that URL, etc. As a user I would prefer that these steps be broken up into several smaller hooks rather than this one big one, so that for example I can implement automatic extension resolution without _also_ needing to determine format.

So in the near term, if I were to do another follow-up PR that would be my next improvement. I know others (@guybedford @bmeck) have grander plans for more dramatic redesigns of the API; I’d love to hear what those would be, in any level of detail. To the point of ā€œwhy should we add new hooks on a bad API that we’re planning to redesign,ā€ well, a) I don’t know when the redesign will come if it ever does, and b) I think there’s a lot of value in seeing how the use cases served by the current API can still be served in the refactored API, and what the DX looks like in one versus the other. We would get that diff by incrementally improving what we have until someone opens a PR with the redesign.

Is there a library that exposes the functionality of node's built-in resolver as a class, or something akin to a class, allowing parts of the implementation to be tweaked?

I am implementing modules support for ts-node. We need to implement a resolver that is almost identical to node's built-in resolver, except we need to match a few additional file extensions (.ts and .tsx), and we need to detect when a .js import specifier is meant to load a .ts file on the local filesystem.

Ideally, the built-in resolver exposes an API that allows customizing its behavior. Exactly what can be customized and what cannot is up for debate.

We would also benefit from programmatic access to the --experimental-specifier-resolution flag without parsing it out of process.execArgv.

Are there any projects that attempt to extract node's built-in resolver, allowing it to be extended?

If it were exposed in some way, I could probably override the hypothetical extensions() and resolveExtensionsWithTryExactName() methods to achieve what we need.

Currently, I've copy-pasted node's resolver into our codebase, wrapped it in a factory function, and tweaked as necessary.
https://github.com/TypeStrong/ts-node/blob/ab/esm-support/dist-raw/node-esm-resolve-implementation.js

@cspotcode https://npmjs.com/resolve, for CJS, but i haven't yet added ESM resolution to it (although i absolutely will be, soonish).

@cspotcode you're the first to extract the JS implementation here that we've heard from - given that it would be a huge service to the community to publish that work or collaborate with @ljharb on getting it in resolve.

The extended resolver APIs aren't clear for Node.js itself - in terms of what options to make available. And opening up every hook in the resolver is contrary to the needs of Node.js core itself. If there was a solid complete proposal for a hookable resolver API that we can use internally and also expose that might be something to consider, but would require a sound proposal, and the bar for that would have to be quite high.

@cspotcode

I am implementing modules support for ts-node. We need to implement a resolver that is almost identical to node's built-in resolver, except we need to match a few additional file extensions

Why would you need such a resolver? If you get a ts file to transpile, your imports would be either bare specifiers, in which case you need to leave them as is, or relative paths that are fully specified to other js or ts files, in which case, again, you need to leave them as is.

To make things clearer. If you have:

import foo from './relative/path/to.ts'
import _ from 'bare-spec'

Why would you need to change those imports? Just let Node.js do its thing, and the next time you will get to.ts to transpile, no? And in the case of the bare-spec, it will already be a JS node module, so there's no need to even transpile it.

In the source, it wouldn’t have the extension, based in my understanding of existing TS usage.

In the source, it wouldn’t have the extension, based in my understanding of existing TS usage.

I think this is just based on common patterns, though, right? Users _could_ write './file.ts' instead of './file' if they wanted to, I think (@cspotcode?). And arguably it wouldn't be a bad habit to get into, as it removes a bit of magic and makes it clearer what's going on (and more closely resembles the JavaScript that will eventually run, in either Node or browsers).

If TypeScript wants to implement automatic extensions as part of their compilation (i.e. transpiling './file' into './file.mjs' or './file.js' as configured), they're welcome to; that would be better than relying on Node for that, as TypeScript's version could then apply everywhere including browsers.

I think this is just based on common patterns, though, right? Users _could_ write './file.ts' instead of './file' if they wanted to

Results in the following error: TS2691: An import path cannot end with a '.ts' extension. "Accepted" workaround is to use import './file.js', which looks weird when what you're actually importing is .ts.

https://github.com/microsoft/TypeScript/issues/18442 has a whole bunch of discussion

@giltayar just as node has 2 resolver modes, --experimental-specifier-resolution=<explicit|node>, we also plan to implement 2 resolver modes:

explicit

We must resolve foo.js to the following files on disk: foo.ts, foo.tsx, or foo.jsx. TypeScript supports this so that developers have explicit control over the emitted module specifiers.

node

In addition to the above, we should allow developers to omit file extensions. This is motivated partly because the default behavior for TypeScript's import quick-fix is to omit the file extension.
The language service and code editors do allow users to configure this, so that quick-fix import statements include a file extension. However, if users want to stick to the default and opt in to a mode akin to --experimental-specifier-resolution=node, we can support it.

Results in the following error: TS2691: An import path cannot end with a '.ts' extension. "Accepted" workaround is to use import './file.js', which looks weird when what you're actually importing is .ts.

I remember discussing this before, probably on that thread you cite. I think what I suggested then is what still seems to make the most sense to me now: TypeScript should start supporting './file.ts', and output this as './file.mjs' or './file.js' depending on configuration. Especially if it currently errors, that seems like an easy change to make.

@GeoffreyBooth afaik the last stated position by the typescript team was that "...we do not believe the compiler should rewrite your imports. module names are resource identifiers, and the compiler should not mess with them"

That comment was from 2016, and a lot has changed since then. TypeScript also has a lot of configurable options, so I don't see why this couldn't be one of them.

Regardless, you could achieve the same either through a latter build step (like a Babel plugin) or via a custom Node loader.

@cspotcode I understand you need it for the "node" resolution module.

Not sure I understood what you said in "explicit" mode:

We must resolve foo.js to the following files in disk: foo.ts, foo.tsx, or foo.jsx. TypeScript supports this so that developers have explicit control over the emitted module specifiers.

Shouldn't foo.js in explicit mode resolve to... foo.js? Isn't that what explicit mode means?

foo.js resolves to foo.js, which is the compiled output of foo.ts and does not exist on disk prior to compilation. ts-node runs prior to compilation.

TypeScript does not modify valid ES syntax unless explicitly asked to. JSX is transformed because it is not valid ES. Type annotations are removed because they are not valid ES. import from './foo.js'; is emitted verbatim, because it is valid ES syntax. (it will be transformed if we ask TS to convert to CommonJS or downlevel to ES3) At design time, the language service understands the semantics, because it knows ./foo.js is the emitted code corresponding to ./foo.ts, ./foo.tsx, or ./foo.jsx.

This gives developers explicit control over their import module specifiers at runtime, and those specifiers can remain stable for certain changes to the codebase. For example, renaming a .js file to .jsx does not require changing the import statement. Refactoring .js into .ts, or vice versa, is the same.

ts-node's goal is to be a drop-in replacement for the compile step. If you would normally tsc && node ./index, we allow you to ts-node ./index. We also need to maintain compatibility with the larger ecosystem, in particular the language service, because that is the value of TS.

Will there eventually be the ability to install loader hooks at runtime? Currently, CLI tools that need to bootstrap an execution environment which includes hooks have a tough time doing so in a cross-platform manner that doesn't impose extra performance overhead.

It's possible for one node process to spawn another, but there are lots of caveats with that, and it's slower.

My use-case is ts-node. Our normal interface is ts-node script.ts. This is compatible with Linux shebangs.

Today we need to tell users to node --loader ts-node/esm script.ts. Ideally, users can ts-node script.ts, which launches one and only one node process, and we install hooks at runtime.

@cspotcode I'd suggest using ts-node to spawn a new Node.js process with the loader hooks set. There likely will be APIs to spawn subloaders in future in the same Node.js process, but that work has not yet begun. Contributions to loader work is also welcome.

I have another question / bit of feedback. I hope this is the right place to post. Technically it pertains to something require.extensions hooks must do to properly integrate with ESM.

node's built-in require.extensions['.js'] implementation checks if the file should be treated as ESM. If so, it throws an error.

ts-node attaches a custom require.extensions['.ts'] (and .tsx , .jsx, and overwrites the built-in .js hook as needed). We read the file from disk, compile .ts->.js, then pass this string to module._compile.

Our hook needs to mimic the error-throwing behavior of node's built-in hook. For example, mocha relies on this behavior to correctly support both CommonJS hooks and ESM hooks simultaneously: https://github.com/mochajs/mocha/blob/master/lib/esm-utils.js#L10-L23


For reference, the output of require.extensions['.js'].toString():


> require.extensions['.js'].toString()
'function(module, filename) {\n' +
  "  if (filename.endsWith('.js')) {\n" +
  '    const pkg = readPackageScope(filename);\n' +
  "    // Function require shouldn't be used in ES modules.\n" +
  "    if (pkg && pkg.data && pkg.data.type === 'module') {\n" +
  '      const parentPath = module.parent && module.parent.filename;\n' +
  "      const packageJsonPath = path.resolve(pkg.path, 'package.json');\n" +
  '      throw new ERR_REQUIRE_ESM(filename, parentPath, packageJsonPath);\n' +
  '    }\n' +
  '  }\n' +
  "  const content = fs.readFileSync(filename, 'utf8');\n" +
  '  module._compile(content, filename);\n' +
  '}'

My plan right now is a hack where I invoke the built-in .js hook, passing it a filename I know does not exist, but that is in the right directory. This costs a failed fs call:

try {
  require.extensions['.js'](
    // make `module` object
    {_compile(){}},
    filename + 'DOESNOTEXIST.js' // I can make this more robust, appending a random UUID
  );
} catch(e) {
  // Inspect the thrown error to see if it's an `ERR_REQUIRE_ESM` error
}

EDIT: actually, won't be using this hack. In the case another require hook has been installed before us, we can't be sure require.extensions['.js'] is going to be node's built-in hook. Instead I'm going to extract the relevant code from node's source into our codebase.

Ideally node exposes its synchronous CJS/ESM classifier to support our use-case. I realize that require.extensions has been deprecated for years, but that doesn't strictly prohibit node from exposing this API.

This has been something I've been thinking about as well. I think for a loader to handle both ESM and CommonJS files, it also needs to hook into require.extensions (at least for now). I'd like to proxy/override or wrap require.extensions['.js'] for extensions that should be handled similarly, but without necessarily having the ā€œthrow if this is ESMā€ check. I wonder if there's a not-terrible way to prevent that, aside from copying the source of require.extensions['.js'] into my code.

Another thing we might want to consider is making the ESM loader hooks simply _the loader hooks,_ to apply to both CommonJS and ESM, finally replacing the deprecated require.extensions. I haven't given thought into how that would work in practice, but there are lots of use cases such as instrumentation where the loader should affect all files, not just ESM ones, and it's more work for loader authors to hook into both of Node's loaders in very different ways.

@GeoffreyBooth the problem with unifying the hooks is that require() must be synchronous.

Right now, when a user installs ts-node's ESM hooks, we know they can't have installed any other hooks. Depending how you think about, we are responsible for implementing some basic features that would ideally be implemented by third-party ESM hooks.

If a third-party library installs require.extensions['.coffee'], for example, do we need special logic in our hooks to resolve() .coffee files and getFormat them the same as .js files? We cannot pass them to defaultGetFormat because it doesn't understand .coffee files.

the problem with unifying the hooks is that require() must be synchronous.

Hmmmm. That is a problem. @jkrems or @weswigham is there any hope here, or would hooks that work with CommonJS run into the same issues that Wes’ ā€œrequire of ESMā€ PR did?

I suppose one could write synchronous hooks? CoffeeScript transpilation is synchronous, for example; I dunno if TypeScript’s is? Obviously there couldn’t be a sync HTTP loader but there are plenty of useful loader cases that don’t need async.

Anyway applying these hooks to CommonJS as well is a long-term maybe goal. AFAIK CommonJS wasn’t designed to be hookable/customizable, and the current ways people do it are monkey-patched hacks more or less. Perhaps it’s best left as is.

@GeoffreyBooth one of the motivations for moving loading off-thread/isolated was that we have a proof of concept of using a SharedArrayBuffer and Atomics.wait to sleep the main thread while the loader does async tasks, so that the load looks as if it were blocking. It does not have the same effect as require of ESM.

one of the motivations of moving loading off thread/isolated

What's the benefit of moving to a threaded loader in terms of user story? Do we get improved performance? Improved security? New functionality that wouldn't have been possible in a single-threaded loader?

Do we get improved performance?

Decreased resource/memory usage is the most expected outcome, especially with multiple contexts or threads that all need hooks. If the APIs for hooks aren't designed to run in isolation, this would be hard or impossible to achieve in userland (since none of the hook implementations would be compatible with those assumptions by default). It's much more realistic to run _one_ TSC or babel instance than 100 per process.

Another aspect is increased stability: The loader can't accidentally be broken by or break the application, at least not as easily as when they run in the same global scope.

@GeoffreyBooth

So we have 2 users effectively:

  1. application code
  2. loader code

I think for application code not much would be visibly affected, and/or it varies too much per application to state much about it.

I think for loader code there are a lot of pro/con to consider but I believe the pros outweigh the cons significantly. In theory, a portion of the pros could be left to users to do things like spin up workers inside of their own loader, but a variety of things are not feasible or have higher advantages if done by the runtime itself.

The performance story is a bit complicated, but overall I'd say putting loaders on a thread is worse for simple workflows and better for complex workflows and applications.

  • šŸ˜„ - loaders only need to be instantiated once instead of per thread/context
  • 😢 - spinning up a loader is much more costly (expect an extra 10ms)
  • šŸ˜„ - you can spin up multiple loaders at once
  • 😢 - loaders have to communicate using serialized messages, no object sharing
  • šŸ˜„ - doing CPU heavy work in a loader doesn't block the application/other loaders (if multiple loaders are ever allowed)
  • 😢 - you have to do duplicate work sometimes in threads

Security depends heavily on what your threat model is, but here are some simple statements that aren't really controversial:

  • šŸ˜„ - prototype pollution won't be a way to attack loader hook callbacks
  • šŸ˜„ - when auditing, don't have to worry about application code mutating references inside of loader code
  • šŸ˜„ - trust boundary is well defined by using a different context, so things like the permissions PR can apply different levels of trust to the loader vs application code
  • 😐 - generated code from a loader is still susceptible to attacks, same as status quo
  • 😐 - a loader DoS (e.g. ReDoS) won't be a DoS for the application itself. Still a concern, though.

As for features gained if loaders are not on the same thread:

  • šŸ˜„ - blocking the main thread to do asynchronous operations
  • šŸ˜„ - you can share data to be reused across all threads for loading purposes (code/resolution cache)
  • 😢 - you cannot directly share object references between a loader and application code. This affects patterns like the one testdouble uses

I do strongly think we need to solve the object reference problem, but we haven't really spent time looking into it to my knowledge. Even if we don't move off thread, we likely need to solve it in order to allow saving variables in getGlobalPreloadCode such that they can be referenced by loader-generated code without being mutable globals.
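For context, here is a hedged sketch of why getGlobalPreloadCode has this problem (hook shape per the proposal of this era; in a real loader the function would be `export`ed, and the state name here is hypothetical):

```javascript
// The string returned by getGlobalPreloadCode is evaluated in the
// application's global scope, so any state it sets up becomes a
// plain mutable global -- application code can read or clobber it,
// which is the reference problem described above.
function getGlobalPreloadCode() {
  return `globalThis.__loaderState = { mocks: new Map() };`;
}
```

Loader-generated code can then reference `__loaderState`, but only because it is an ordinary global, with none of the isolation the comment above is asking for.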

Another use case for custom loaders: stubbing CSS imports in UI libraries.

Currently there are similar solutions for CommonJS.

I tried to implement the same functionality using the new Node.js loader API: https://gist.github.com/just-boris/b07d66e306c94cf42db41b010231fbbf

Works well for such cases.
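For readers who don't follow the gist link, a hedged sketch of what such a loader might look like (not necessarily how the gist implements it; hook shape follows the experimental getSource API of the time, and the function would be `export`ed in a real loader file):

```javascript
// Replace the source of any imported .css file with an empty
// default export, so UI components can be tested without a bundler.
async function getSource(url, context, defaultGetSource) {
  if (url.endsWith('.css')) {
    return { source: 'export default {};' };
  }
  return defaultGetSource(url, context, defaultGetSource);
}
```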
