Having both ecmascript-modules and @jkrems hackable loader has opened up tremendous scope for experimentation.
Note: This thread does not make claims for or against existing tooling, some of which have stood the test of time, evolved, and are fixtures of the ecosystem. The intent is simply to consider different perspectives being explored in experimental efforts.
Broadly speaking, the tooling that applies to loaders iterates over the productions in each source, irrespective of the specifics of implementation or operation.
Most tools are designed for much more complex applications than merely loading. To that end, they often avoid new language features that would prevent them from working on older platforms, and they may avoid new features that were prematurely associated with inefficiencies in their early stages. Some are also built with infrastructures or features that are not ideal or not optimized specifically for loading, like using workers, verbose error checking (i.e. as a language service), etc.
I would like to dedicate this thread to brainstorming experimental or just different ideas to implement related patterns for loader-first designs.
Brainstorming: A safe place to discuss ideas and provide constructive feedback
How to contribute
Please avoid emoting that can be confusing (especially anything that could be construed as passive-aggressive)
😄 | Indication
:-:|:-:
👍 | To indicate a "Yes" response
👎 | To indicate a "No" response
🎉 | To indicate an "Aha" moment
Read the Digest
The following is a set of ideas or conclusions curated from the discussions:
Syntax Detection (CJS vs ESM)
Safely using RegExp — @SMotaal
Fallback for ESM without import and export — @targos
import(…) to resolve ambiguity — @bmeck
import.meta — @bmeck
Dual parsing a module was deemed inefficient — @MylesBorins
Syntax Identification (CJS vs ESM)
Mime type meta data via something like webpackage — @jkrems
package.json — @GeoffreyBooth
Magic bytes — @jkrems
Wrapping CJS in an ESM module system
ECMAScript modules syntax can arguably be detected using a RegExp which bails on first match.
Does anyone have ideas for cjs vs esm syntax detection?
@smotaal you can't use regexp to parse js grammar (you can always make a pattern of string literals or whatever to confuse the regexp) and the differences between valid cjs and valid esm are ambiguous and can't be reliably detected by just looking at the code.
you can always make a pattern of string literals or whatever to confuse the regexp
So, can we constructively say that you can only safely use a RegExp so long as you guard against string hijacking (maybe there is a better term for this)?
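To make that concrete, here is a hedged sketch of one way to "guard against string hijacking": blank out string literals and comments first, then bail on the first `import`/`export` match. The function name and heuristics here are mine for illustration only; as pointed out above, this is not a parser, and exotic sources (regex literals, nested template expressions) can still defeat it.

```javascript
// Bail-on-first-match ESM syntax check (illustrative sketch, not a parser).
// Strings and comments are blanked out first so that a source such as
// `const s = "export default"` cannot trigger a false positive.
function looksLikeESM(source) {
  // Blank out string literals, template literals, and comments.
  // Known limitation: regex literals and `${}` nesting are not handled.
  const blanked = source.replace(
    /'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*"|`(?:\\.|[^`\\])*`|\/\/[^\n]*|\/\*[\s\S]*?\*\//g,
    ' '
  );
  // Bail on the first top-level-looking import/export keyword.
  return /(^|[\s;{}()])(import|export)(\s|[{("'`*])/.test(blanked);
}

console.log(looksLikeESM('export default 42;'));    // true
console.log(looksLikeESM('const s = "export x";')); // false
console.log(looksLikeESM('module.exports = {};'));  // false
```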
@smotaal I would just use acorn
differences between valid cjs and valid esm are ambiguous and can't be reliably detected by just looking at the code
Can you elaborate on this please?
@devsnek humor me in this effort, consider this both an idea-gathering as well as a team-building exercise. Acorn is obviously a great solution, but I am trying to create opportunities for people to talk about the aspects that make this and other such great tools. The notion here is that people might just have some evolving ideas that they might want to bounce around. How we connect the dots, like you pointing out the hijacking limitation, can potentially inspire untapped solutions to existing problems.
Sounds fair?
@SMotaal You could say that a file with import or export syntax is probably an ES Module (the syntax is invalid in Script mode). However, the problem is that files without import and export could be either Script or Module, and depending on how they are written, could have different behaviour in Script vs Module mode.
For example:
```js
test = 42;
```
In Script mode, this creates the property test on the global object.
In Module mode, this throws a ReferenceError.
@targos does the issue get any better if we say that such a loader always imports CJS in strict mode regardless of an explicit "use strict"?
are we trying to come up with use cases for loaders or something else?
if you're using a resolve loader hook you'll always be able to read the contents of whatever you're resolving, at which point you can regex or acorn or whatever it as you see fit.
I'm having trouble seeing the relation between 'cjs vs esm syntax detection' and the OP. Maybe I don't really understand what this thread is about, sorry.
@targos Actually, I think you are hitting the nail with pointing out that:
without import and export [a file] could be either Script or Module
So would it be possible to say that, when dealing with ambiguous code, syntax-based detection is possible for ECMAScript Modules (i.e. having those explicit syntaxes import and export) as long as there is a mechanism to fall back on when those features are not present?
Sounds right?
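The "detect, then fall back" flow could be sketched roughly as follows. The function and parameter names (`classifyParseGoal`, `detectESMSyntax`) are hypothetical, and defaulting to CommonJS is just one possible fallback policy among many (package.json metadata, a CLI flag, per-scope configuration):

```javascript
// Hypothetical sketch of "detect ESM syntax, else fall back".
// detectESMSyntax is any detector (regex- or acorn-based); the fallback
// here defaults to 'commonjs', which is one option, not an agreed design.
function classifyParseGoal(source, detectESMSyntax, fallback = 'commonjs') {
  if (detectESMSyntax(source)) return 'module';
  // No explicit import/export found: the source is ambiguous, so defer
  // to the configured fallback rather than guessing from the code.
  return fallback;
}

// Example with a deliberately naive detector:
const naive = (src) => /(^|\s)(import|export)\s/.test(src);
console.log(classifyParseGoal('export const x = 1;', naive)); // 'module'
console.log(classifyParseGoal('exports.x = 1;', naive));      // 'commonjs'
```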
@SMotaal you could always fall back to your own opinions of what the file should be but it's impossible to know the author's intent.
i agree with targos that i have no idea what this thread is for.
@targos I'm trying to build on some insights gained at the summit on different ways we can try to engage in meta discussions that can be equally productive on the long term and in the short term provide opportunities for everyone to enjoy good conversations outside the pressures of deliverables and deadlines.
@devsnek The ideas you are all expressing here are extremely valuable; they allow others to actually learn or at least consider a different perspective. It also makes it easier for people to better appreciate and understand intent in future discussions. I think the biggest problem is not that people disagree, which is actually not bad, but that sometimes we end up arguing in two separate directions due to miscommunication and misunderstanding.
@benjamingr I might be mistaken, but I believe that it is possible to evaluate non-strict code. While I am not certain how --experimental-modules handles it, I believe that if the wrapper function expression is evaluated in a non-strict context, it will only be strict if "use strict" is in the body of the wrapped module. I played around with this a bit when experimenting with realms, which is stage 2 and still actively being updated.
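As a quick illustration of that wrapper behavior (a hedged sketch, not how --experimental-modules actually wraps sources): a function body created via `Function` is sloppy by default, regardless of the strictness of the surrounding code, unless the body itself opts in with "use strict".

```javascript
// A Function body is non-strict by default, even when created from strict
// code, so an undeclared assignment creates a global instead of throwing.
const sloppy = new Function('answer = 42; return typeof answer;');
console.log(sloppy()); // 'number'

// Opting in with "use strict" inside the body restores the strict error.
const strict = new Function('"use strict"; answer2 = 42;');
let threw = false;
try { strict(); } catch (e) { threw = e instanceof ReferenceError; }
console.log(threw); // true
```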
I'm having trouble seeing the relation between 'cjs vs esm syntax detection' and the OP. Maybe I don't really understand what this thread is about, sorry.
@targos Until we actually figure out how the Modules WG will handle source ambiguity, it can be helpful to explore (maybe even POC) the various ways to achieve it. Thinking of any of those ideas as either core vs extensions is premature, but that should not discourage efforts of reasoning about it and trying to find ways to refine them irrespective of where those aspects end up.
Try to keep in mind performance when exploring options for handling ambiguity, as well as the fact that this space has been explored extensively during the prior EPS process.
Dual parsing a module was deemed inefficient
@SMotaal so we're discussing how the default loader should handle source ambiguity?
@MylesBorins I think it may be important to know more about the dual parsing approach. I state this not to suggest that dual parsing is or is not a solution, but rather to see if a different parsing approach may be something worth exploring.
From my own research (which I know is limited relative to other folks in this space), I often find the common pattern of tokenizing into ASTs, which in many cases seems to be an eagerly contiguous process; that makes sense for many things, especially transforms. In contrast, loader-first tokenization (AST or not) may be more efficient if it bails out on the first conclusively deterministic feature, and more so if it is possible to have a non-binary intent that would allow a single scan to be used.
Can you shed some light on the methodology? (maybe a link to follow-up)
@SMotaal so we're discussing how the default loader should handle source ambiguity?
@devsnek I think of this as a parallel discussion altogether, not intended to directly affect other discussions that deal with the specifics of the default loader... etc. That said, there is no harm if we end up drawing some conclusions that positively influence our process in general.
@MylesBorins the inefficiency is tolerable, as @jdalton shows with a top-level parse, which would be much faster if v8 directly supported such mechanisms. However, as the language grows, there are a few concerns:
Some heuristics may fail/be unreliable as features get added to different modes:
- import(), which is available in both goals. What should we do with this?
- import.meta is only in Module, but certainly could be proposed to come to Script. If it gets added to Script, would that mean that a Source Text would change from a Module to a Script because the language added a feature to Script?

We could probably think of more as we desire, but the idea of what to do in ambiguous cases seems a bit beyond the scope of tooling itself; these would need to be definitive answers that we can provide directly as they come up.
@SMotaal Per the question about how the current loader loads non-strict code: it uses multiple Source Texts; it does not create a single string that has both Module and CJS code. You cannot inline a sloppy source text into ESM without using Function, which would not have direct access to local variables, so they would need to be passed in. Ideally we would avoid Function: it means double parsing the same string, and it violates the wishes of people seeking to prevent JS-based codegen for security reasons (see things like CSP or v8's SetAllowCodeGenerationFromStringsCallback), but someone may think of a reason why it would be useful to keep.
Things that are possible:
Agreed with above - there's things that are really hard to figure out and "run CJS with implicit strict" might work for app code but not for dependencies. We tried. It breaks with things like:
```js
if (cond) {
  /* [...] */
  function myHelper(el) { /* [...] */ }
  someArr.forEach(myHelper);
}
```
The above will throw in strict mode IIRC and this pattern does appear in real (popular) npm modules.
It uses multiple Source Texts, it does not create a single string that has both Module and CJS code. You cannot inline a sloppy source text into ESM without using Function, which would not have direct access to local variables and they would need to be passed in.
@bmeck Absolutely… I was inspired by this approach in the early days of --experimental-modules and found a lot of uses for it beyond CJS in a more general sense.
@benjamingr does that align with the concerns you raised?
@SMotaal if you search this repo for “unambiguous grammar” or “unambiguous syntax” you'll find lots of discussion on this topic.
The webpackage idea from @jkrems does give me one idea though: what if an import statement of a file always imports it as ESM, and importing CommonJS is only possible when importing packages? The package.json is a metadata file about the package, capable of holding properties like module parse goal. It's much more capable as a metadata repository than a file extension is. And if someone wants to import a loose CommonJS file into an ESM module, well, we built createRequireFromUrl for that.
The package.json is a metadata file about the package, capable of holding properties like module parse goal.
I think this is roughly what previous mode/mimeMapping/mimeDB proposals for package.json have been. It could even have globs instead of extensions:
```js
{
  // [...]
  "mimeTypes": {
    "bin/my-cli": "application/javascript",
    "lib/*.cjs": "application/vnd.node.js",
    "lib/*.js": "application/javascript"
  }
}
```
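As a sketch of how a loader might consume such a map (the glob-to-RegExp translation, the first-match-wins ordering, and the helper name `mimeTypeFor` are my assumptions for illustration, not an agreed design):

```javascript
// Hypothetical helper applying a glob-style "mimeTypes" map, like the one
// above, to a package-relative path.
function mimeTypeFor(relativePath, mimeTypes) {
  for (const [pattern, mime] of Object.entries(mimeTypes)) {
    // Translate the simple glob into a RegExp: '*' matches anything
    // except a path separator; other regex metacharacters are escaped.
    const re = new RegExp(
      '^' +
        pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '[^/]*') +
        '$'
    );
    if (re.test(relativePath)) return mime;
  }
  return undefined;
}

const mimeTypes = {
  'bin/my-cli': 'application/javascript',
  'lib/*.cjs': 'application/vnd.node.js',
  'lib/*.js': 'application/javascript',
};
console.log(mimeTypeFor('lib/util.cjs', mimeTypes)); // 'application/vnd.node.js'
console.log(mimeTypeFor('lib/index.js', mimeTypes)); // 'application/javascript'
```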
@GeoffreyBooth I recall our last discussion on this, so I think I was inclined to consider this idea in a more abstract form (but never got to do that) where I wanted to break this into two distinct aspects:
- package.json as one potential way to declaratively define such scope and mappings

@jkrems what a scope is or is not depends largely on what a specifier is and is not.
If we think of this away from the concept of PackageNameMaps, scopes are much like service worker scopes, just like specifiers are much like URLs (please cheer me for my effort to try to not say they are exactly).
I think this is roughly what previous mode/mimeMapping/mimeDB proposals for package.json have been. It could even have globs instead of extensions:
Just for the sake of brainstorming, I think it is also fair to mention that mime (irrespective of its popularity) is just one possible way to indicate parsing goal.
Would we even need file extension mappings if import always imported ESM for files and used package.json to determine the parse goal of a package? We would need to use file extensions to tell the difference between WASM and JavaScript, of course, but that’s telling the difference between file _types,_ which is what file extensions are for, and _parse goals,_ which is only relevant for JavaScript files. The package.json metadata could be as simple as main for CommonJS entry point and module (to a file path) for ESM entry point, and the same import rules would apply within the package and on down.
[...] but that’s telling the difference between file types, which is what file extensions are for, and parse goals, which is only relevant for JavaScript files.
I'm not sure I get the difference here. CJS, script, and ESM files are all text files, sure. But they do contain different kinds of text. They all happen to derive from the same spec (with some important gotchas) but they are different file types. The "parse goal" term is important when trying to actually interpret a file after determining its type. But each of the above has its own mime type, just like JSON has its own mime type. Because they are different file types.
The file type distinction also applies to native modules, HTML in some discussion, and binary AST. We can come up with our own terminology but I'm not sure if that's too helpful because it will mean that if we want to add something like webpackage or other features coming from the wider JavaScript ecosystem, we're opting into a future of constantly mapping the official thing to our own conventions.
So when I explored mappings in a universal sense, I was inclined to consider thinking of them in a way that makes them fitting in both a package.json and a .webmanifest (not the app cache one).
Standards aside, there is nothing to prevent such mappings from being added to a manifest (going rogue) or prevents those mappings from being used by a loader extension (say service worker) independent from the browser, where the mappings can be used to dynamically generate/cache ESM entries that produce any loader effect towards parity.
Wouldn't the web.manifest still assume valid Content-Type headers? It feels weird (read: dangerous) to encourage service workers overwriting the intended content type of a resource.
@jkrems Absolutely, so mime or not is mentioned here independently for completeness, but considering to use it or not should not be in isolation (case in point).
One idea of course is to create a paired system, one for manifest and the other for package.json. This could help, but we still have ambiguities like this:

```
esm == text/javascript
cjs == text/javascript
```
😄
There are plenty of use cases for node (one-off scripts, for example) that do not have a package.json at all - without any manifest file, how would they be able to distinguish what parse goal the author wrote the file in (not the person who chose to use import vs require)?
how would they be able to distinguish what parse goal _the author_ wrote the file in
@ljharb this supports the need for deterministic syntax and a one-size-fits-all syntax detection strategy, exclusive from the additional layers of mappings, which can be used to further optimize detection (avoid expansive scanning) or even coerce goals if detection yields undesired goals (which is not the same as saying that it detects the wrong goal and needs to be fixed).
@SMotaal cjs is not text/javascript.
@jkrems
It feels weird (read: dangerous) to encourage service workers overwriting the intended content type of a resource.
The changing of content-type is explicitly what is desirable for compile-to-JS languages like coffeescript. It can intercept an application/vnd.coffeescript (bikeshed) and transform it to text/javascript for example. We absolutely should not prevent changing the format of something loaded without good reason.
@SMotaal
we still have ambiguities like this:
```
esm == text/javascript
cjs == text/javascript
```
Note that the correct MIME for CJS is not text/javascript so this is not an ambiguity. CJS uses a different grammar goal (FunctionBody) than Script or Module. I don't see any real advantage to trying to make there be 2 strings that mean the same thing except "esm" is more readable than text/javascript. I've also pointed out that text/javascript is currently unambiguously for Module since all forms of loading the Script goal even in HTML do not check MIME.
@SMotaal cjs is not text/javascript.
@jkrems Is that true for browsers too? Aside from the obvious fact that they are not handled by the browser, is the common practice to have them served with 'application/vnd.node.js' today?
@SMotaal
is the common practice to have them served with 'application/vnd.node.js' today?
you cannot directly load CJS into browsers. I'm not sure I understand the question. If someone hosts something with the wrong MIME that is out of our control.
I’m very confused about this thread.

1. The only reliable means of distinguishing two ambiguous parsing goals in a textual format is an explicit out-of-band indicator (like a manifest file) or a file extension (the typical disambiguator).
2. Unambiguous syntax is something that’s largely impossible, for some reasons mentioned in this thread, and for some mentioned elsewhere.
3. (this part may be more contentious) The author is the authority on the parse goal of the file, so the use of import/require shouldn’t dictate the parse goal.

What exactly is being brainstormed?
@SMotaal The thing served to browser using text/javascript that isn't a module is a script. I don't think node wants to support scripts. The implication would be that var x = 10 in the script body would create a global etc.. So that's not really a problem we face. :)
@bmeck In my world it would be the job of resource fetching to do that, true. But it requires actual compilation etc., so it wouldn't be in some declarative mapping object - that's the part I was reacting to.
EDIT: Actually, it might be a mime init handler, not resource fetching.
@bmeck I'm just thinking of a scenario where I am using a service worker to intercept a "virtual" fetch of a CommonJS file. In that sense, the service worker will be handling a response with a cjs body, and the correct mime-type should be 'application/vnd.node.js'.
Given that in such a case, 'application/vnd.node.js' must have been associated with the body of the response at some point.
I am just curious (due to my lack of knowledge at least) if this mime association is effectively used today. I am inclined to believe that at some point, when require.js was loading a module, mime was not a factor, which may not be the case today which I am not aware of personally.
I am just curious (due to my lack of knowledge at least) [if] this mime association is effectively used today. I am inclined to believe that at some point, when require.js was loading a module, mime was not a factor, which may not be the case today which I am not aware of personally.
My understanding: require.js was just using ajax to load the file and it comes from an era where people played more fast-and-loose with content types. These days allowing to accidentally run a CSS file as JS is considered an unacceptable security issue, so new additions have safeguards to prevent it.
@jkrems I've always mentally though of the MIME DBs as what the default loader responds with when resolving. I have not considered the MIME to be a guarantee throughout the entire loading process. Maybe we could document a choice of if it should be stable or if it should not be throughout the loading process.
@SMotaal
I am just curious (due to my lack of knowledge at least) [if] this mime association is effectively used today. I am inclined to believe that at some point, when require.js was loading a module, mime was not a factor, which may not be the case today which I am not aware of personally.
It is not currently used today, in part because Node doesn't really expose any convention or configuration to declare something CJS, it just assumes it to be CJS. If we expose a configuration mechanism (preferably static), then it could see some use. As it exists today, there is no real encouragement to differentiate files and you just see errors when loading CJS into the browser due to things like missing variables or syntax problems like early return.
@jkrems I was hesitant to say the good-old ajax days 😄, but in all fairness, the next evolution of bundles (or not) must look that far back.
@bmeck I'm experimenting with a staged loading process, the 2nd stage resulting in a Resource that has a mime type: https://github.com/jkrems/loader#module-loading
If we expose a configuration mechanism (preferably static), then it could see some use. As it exists today, there is no real encouragement to differentiate files and you just see errors when loading CJS into the browser due to things like missing variables or syntax problems like early return.
@bmeck since I personally (for less common reasons) cannot rely on bundlers, I try to look at solving problems independent from bundlers first. «CJS modules in bundles being a popular thing in browsers» from that sense simply means that «CJS modules are popular in browsers and bundles is just one way they go there»… Is this a fair way to look at it?
Let's keep in mind that this group was formed to implement modules for Node, not the browsers.
@SMotaal
@bmeck since I personally (for less common reasons) cannot rely on bundlers, I try to look at solving problems independent from bundlers first. CJS modules in bundles being a popular thing in browser from that sense simply means that CJS modules are popular in browsers and bundles is just one way they go there… Is this a fair way to look at it?
Not really relevant here since the bundle itself is a Script or Module. If I compile CoffeeScript to a Script it doesn't mean that the browser is loading CoffeeScript to me.
@jkrems
@bmeck I'm experimenting with a staged loading process, the 2nd stage resulting in a Resource that has a mime type: https://github.com/jkrems/loader#module-loading
I have concerns separating the resolution and the fetching:

- It differs from what the spec provides.
- It means that fetching cannot affect resolution, or that in order for it to affect resolution a double fetch occurs in both phases.
- These separate phases can be represented by a single resolve() that is able to return a body.
- It seems a bit strange for some things like data: where the fetch is the resolve...
- Sometimes for things like cached compilation you end up being able to fetch/resolve in parallel rather than needing to keep them as a sequence.

I feel like starting with the more minimal hook would be ideal and we can play with adding that second phase on top of it.
I just realized I am running late… Hard to not be distracted by these awesome discussions.
FYI: I am trying to keep a bullet-form digest in the body of this issue based on the ideas outlined in our discussions. I try my best to capture everyone's ideas and welcome everyone's help to refine this idea and to make sure the digest correctly reflects broader ideas pointed out in the discussions which relate directly to the topic.
It differs from what the spec provides.
I disagree to some extent. The spec does have URL resolution of an import as a separate concern from actually getting the module.
It means that fetching cannot affect resolution, or that in order for it to affect resolution a double fetch occurs in both phases.
That is true - but that is, to some extent, by design. It means that figuring out what to load can happen asap without actually finishing the fetching first. Also, depending on how things are implemented, a double fetch may or may not be free.
These separate phases can be represented by a single resolve() that is able to return a body.
Yes, but only if we accept that linking needs to wait for all resource fetches first. It also means that the resolution logic needs to be aware of all possible ways to fetch resources and vice versa.
It seems a bit strange for some things like data: where the fetch is the resolve...
data: is a fully resolved URL. The resolve step would just return it since it's absolute already. Fetching a data URL is converting it into a Resource.
Sometimes for things like cached compilation you end up being able to fetch/resolve in parallel rather than needing to keep them as a sequence.
These phases are per module, not global. E.g. you would still be able to fetch resources in parallel.
I feel like starting with the more minimal hook would be ideal and we can play with adding that second phase ontop it.
I'm assuming that the more minimal hooks would be the ones only affecting very specific aspects. Want to load from an HTTP archive? Provide a fetch implementation. Want to provide support for text/html? Provide an init implementation. Etc..
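A rough sketch of what those per-phase hooks might look like, loosely modeled on the staged design linked above (github.com/jkrems/loader). The hook names, the Resource shape, and the synchronous signatures are my assumptions for illustration; real hooks would presumably be async:

```javascript
// Phased pipeline sketch: resolve (specifier -> URL), fetch (URL ->
// Resource), init (Resource -> something linkable). Each hook can be
// overridden independently, which is the point under discussion.
const phases = {
  resolve(specifier, parentURL) {
    // data: URLs are already absolute; new URL() leaves them untouched.
    return new URL(specifier, parentURL).href;
  },
  fetch(url) {
    // Turn a URL into a Resource: a body plus a declared media type.
    const match = /^data:([^,]*),(.*)$/.exec(url);
    if (match) {
      return { url, contentType: match[1] || 'text/plain', body: match[2] };
    }
    throw new Error(`no fetch handler for ${url}`);
  },
  init(resource) {
    // A real loader would dispatch on resource.contentType here
    // (text/javascript, application/wasm, ...).
    return { url: resource.url, source: resource.body };
  },
};

const url = phases.resolve('data:text/javascript,42', 'file:///app/main.mjs');
const mod = phases.init(phases.fetch(url));
console.log(mod.source); // '42'
```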
Created dedicated issue for loader phases here: https://github.com/nodejs/modules/issues/205
I think that is a thread that can continue independent from this.
I'm assuming that the more minimal hooks would be the ones only affecting very specific aspects. Want to load from an HTTP archive? Provide a fetch implementation. Want to provide support for text/html? Provide an init implementation. Etc..
I don't think this is true, that is 3 hooks where the spec only provides one. Loaders intercepting that may not even do URL resolution (some things are not urls like @nodejs/fs). However, this single hook the spec provides is capable of doing all 3 behaviors mentioned.
I don't think this is true, that is 3 hooks where the spec only provides one.
That is true - but in practice we don't actually implement that behavior because it would require everything to wrap up synchronously. What is more realistic is what HTML standard describes. And there (even though it's not fully spelled out), it's much closer to what the current core implementation does - HostResolveImportedModule is called but it's not remotely close to actually doing the work:
@jkrems that enforces some constraints that the spec hook does not. In particular I don't really see why it needs to be split up or what advantage you are seeking to gain here. It mandates a sequential ordering rather than allowing potential intermingling of phases. Even if we use that pattern ourselves I'd rather leave it more flexible and closer to the spec itself. Things can always be instrumented to expose these behaviors.
I want to implement "load from webpackage" without having to know how to instantiate a WASM module. And vice versa. Or implement "resolve from package-lock" without having to know that http URLs should be fetched using a set of pre-loaded webpackage archives. That's the big opportunity I see in splitting these up.
@bmeck Can you give an example of where the constraint prevents certain actions? Maybe there's a misunderstanding here. :)
@jkrems
I want to implement "load from webpackage" without having to know how to instantiate a WASM module. And vice versa. Or implement "resolve from package-lock" without having to know that http URLs should be fetched using a set of pre-loaded webpackage archives. That's the big opportunity I see in splitting these up.
Given 2 loaders: `node --loader wasm --loader webpackage main.mjs`

```js
// main
import '.webpackage#foo.wasm';

// in webpackage loader
resolve('file:///main.mjs', '.webpackage#foo.wasm') -> {
  id: 'file:///.webpackage#foo.wasm',
  type: '', // could be any "unknown" value, browsers use ''
  bytes: ...
}

// in wasm loader
// calls out to webpackage "parent" loader, gets the result
{
  id: 'file:///.webpackage#foo.wasm',
  type: '', // could be any "unknown" value, browsers use ''
  bytes: ...
} -> {
  id: 'file:///.webpackage#foo.wasm',
  type: 'text/javascript', // ESM
  bytes: ... /* facade it using WebAssembly.compile or whatever you want */...
}
```
You still get to implement things as separate loaders that handle the discrete steps.
@jkrems
Can you give an example of where the constraint prevents certain actions? Maybe there's a misunderstanding here. :)
Easiest one is that I might not want my loader to resolve to a URL, it might have some special enum identifier it uses for fetching. So you would need to let resolve() just return a string. If resolve just returns a string, that means that fetch() needs to operate on strings not URLs.
But so the assumption is that bytes will always be a simple ESM. There's no way to actually handle wasm in that way. You are skipping over the awkward initDynamicModule hook I assume..?
@jkrems
But so the assumption is that bytes will always be a simple ESM. There's no way to actually handle wasm in that way. You are skipping over the awkward initDynamicModule hook I assume..?
I don't understand this. Any format supported by the host would be detected via type in the example above, if Node supported application/wasm it would just be a regular WASM module, not ESM facade. This approach is different from dynamic in the current loader which doesn't let you completely replace the source text being loaded.
I’m brainstorming about the “ESM in .js” case. Basically, what I want to do is see if there’s a way to solve our CommonJS/ESM interoperability issues without relying on file extensions for determining parse goal. File extension can still tell the difference between JavaScript and C or WASM, but between Script and CommonJS and ESM (and the case of ESM in .js) there’s a lot of metadata to load onto a file extension to disambiguate.
So for packages, this is easy: we put the metadata in package.json. You could have main point to commonjs-entry-point.js and module point to esm-entry-point.js like is already supported in various tools today like Webpack (I think?). Then import 'that-package' would import as ESM if it finds module, or CommonJS if it finds main or neither. Packages are the easy case.
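For concreteness, a package.json along those lines might look like this (the field semantics are as assumed in this comment, mirroring what tools like Webpack already read; this is a sketch, not an agreed Node design):

```json
{
  "name": "that-package",
  "main": "./commonjs-entry-point.js",
  "module": "./esm-entry-point.js"
}
```

With this, import 'that-package' would load ./esm-entry-point.js as ESM, while require('that-package') would keep loading ./commonjs-entry-point.js.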
Bare files are hard, since there’s nowhere except the filename/extension (or maybe pragma) to store metadata. Hence .mjs. _But,_ when are bare .js/mjs files ever distributed on their own? Even if your NPM module is just a single JavaScript file, you need a package.json file to upload it to the NPM registry. There really shouldn’t be too many bare files on your system not written by you. Even if there are, some of them could very well be ESM in .js and then you’re still lacking author-driven disambiguity.
So I guess what my thought was, if we just drop the requirement that bare files be able to support author-driven disambiguity, our implementation gets much easier:
- import can import ESM or CommonJS packages, and only ESM files
- createRequireFunction
- --mode
- package.json

There are millions of .js files out there expecting an ESM parse goal, so the ship has sailed for trying to migrate the world to being explicit on the file level. But there are lots and lots of NPM modules with a module key or similar that make their parse goal explicit at the package level.
Yes, but the question was about adding support for WASM. Or HTML. Or binary AST. Or anything else that isn't a simple compilation into an equivalent ESM source text. The resource fetch can get the data but that isn't the actually interesting bit for those. The interesting bit is taking the data and turning it into something that can be linked into the module graph (in the above: an init hook).
Also, the webpackage example should start with http://some-url-in-the-package. In your example - what would the webpackage loader receive?
Side note: I also dislike that a single --loader array means there's now an order dependence for when exactly which loader needs to be passed. In a world where there's phases, they could be passed in any order.
@jkrems
Side note: I also dislike that a single --loader array means there's now an order dependence for when exactly which loader needs to be passed. In a world where there's phases, they could be passed in any order.
The load order is still required even with phases; if one phase loader always guesses the type to be text/javascript and doesn't properly delegate to another that would guess it to be application/wasm, flipping the order of those loaders would still mean a change in behavior. Phases do not fix load ordering; we must rely on users to properly configure things.
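The ordering hazard above can be illustrated with a small sketch (all names here are hypothetical; this is not the real loader hook API, only the shape of the problem):

```javascript
// Two hypothetical "type guessing" hooks, composed so the first loader to
// return a type wins; a later loader is only consulted if earlier ones delegate.
function composeTypeGuessers(loaders) {
  return (url, bytes) => {
    for (const loader of loaders) {
      const type = loader(url, bytes);
      if (type) return type; // a loader that always answers shadows later ones
    }
    return 'text/javascript'; // host default
  };
}

// A loader that always guesses JS and never delegates:
const jsLoader = () => 'text/javascript';
// A loader that recognizes the WASM magic bytes (\0asm) and otherwise delegates:
const wasmLoader = (url, bytes) => (bytes[0] === 0x00 ? 'application/wasm' : undefined);

const wasmBytes = Uint8Array.of(0x00, 0x61, 0x73, 0x6d);
composeTypeGuessers([wasmLoader, jsLoader])('file:///a.wasm', wasmBytes); // 'application/wasm'
composeTypeGuessers([jsLoader, wasmLoader])('file:///a.wasm', wasmBytes); // 'text/javascript'
```

Flipping the array still changes behavior, regardless of which phase the hooks run in.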
Yes, but the question was about adding support for WASM. Or HTML. Or binary AST. Or anything else that isn't a simple compilation into an equivalent ESM source text. The resource fetch can get the data but that isn't the actually interesting bit for those. The interesting bit is taking the data and turning it into something that can be linked into the module graph (in the above: an init hook).
I don't understand how this relates; like I said, any supported format that Node can link into a graph works. This is unrelated and doesn't need a separate phase. Even with an init phase like you propose, if ESM linking cannot directly integrate with a WASM module because the host doesn't provide a way, you still must create a facade in your proposal.
Also, the webpackage example should start with http://some-url-in-the-package. In your example - what would the webpackage loader receive?
I don't understand this. Webpackage could support file: last I saw; I guess I can update the strings in the example to use file:.
I don't understand this. Webpackage could support file: last I saw; I guess I can update the strings in the example to use file:.
But why would it be limited to file: URLs? Especially since those would risk conflicting with real on-disk URLs. A portable webpackage provided by a registry should either use HTTPS URLs (that could even resolve, potentially) or a custom scheme. Reusing file: would mean that you'd end up hitting the disk for every file first and, in the worst case, even load something.
Phases do not fix load ordering; we must rely on users to properly configure things.
They do fix ordering for unrelated concerns, like fetching a resource and actually interpreting it.
any supported format that Node can link into a graph works.
So the disagreement is if init should be exposed or not, not if it is a separate phase. Because if init is hard-coded to a well-known list of supported module types, then it's still there, just not configurable.
So the disagreement is if init should be exposed or not, not if it is a separate phase. Because if init is hard-coded to a well-known list of supported module types, then it's still there, just not configurable.
I'm saying init doesn't make sense as you are explaining it; you can't make the VM accept new unknown Module types into the graph. In the same way, you can't just make new Module types work in Node. There is always a minimal set. CoffeeScript modules could compile to WASM or JS, it doesn't matter, but we can't suddenly make V8 accept something like JVM bytecode and have it act as a Module without turning it into a supported Module type.
But why would it be limited to file: URLs? Especially since those would risk conflicting with real on-disk URLs. A portable webpackage provided by a registry should either use HTTPS URLs (that could even resolve, potentially) or a custom scheme. Reusing file: would mean that you'd end up hitting the disk for every file first and, in the worst case, even load something.
It isn't? It accepts the full specifier and id of the module loading some dependency; I would expect there to be no constraints except that the id should be unique and the specifier is a string.
It isn't?
Ah, I misread your example code. My bad.
but we can't suddenly make V8 accept something like JVM bytecode and have it act as a Module without turning it into a supported Module type.
Yes, but turning it into a supported Module type doesn't necessarily mean turning it into the source code of a supported module type. E.g. for WASM (or for the JVM bytecode example, actually), you would realistically analyze/compile the resource content first to determine the interface, then generate a facade, and then expose the compilation result inside of the module. Trying to inline the original bytes in the source text and then recompiling on execution would be fairly inefficient and in some cases not practical. The only alternative I can think of is globals and unique ids, but that's not really a proper solution.
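To make the facade idea concrete, here is a loose sketch. Every name in it is hypothetical (the registry symbol, facadeSourceFor); it only illustrates generating an ESM facade over an already-analyzed module instance, looked up by unique id, instead of inlining the original bytes as source text:

```javascript
// Hypothetical sketch: given a unique id and the export names discovered by
// analyzing a WASM (or other non-JS) module, produce the source text of an ESM
// facade that re-exports each name from a shared registry of instances.
function facadeSourceFor(id, exportNames) {
  const lookup =
    `const instance = globalThis[Symbol.for('module-registry')].get(${JSON.stringify(id)});`;
  const reexports = exportNames
    .map((name) => `export const ${name} = instance.exports[${JSON.stringify(name)}];`)
    .join('\n');
  return `${lookup}\n${reexports}\n`;
}

facadeSourceFor('wasm:0001', ['add', 'memory']);
// produces one `export const …` line per export, reading off the registered instance
```

The registry-plus-unique-id mechanism is exactly the "globals and unique ids" workaround mentioned above, shown only to make the trade-off visible.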
For me, CoffeeScript isn't the target I'd want to optimize for. If what you're loading can easily be converted into self-contained JS code on the fly, it might just as well have been compiled ahead of time. The same isn't true for things that do not compile to JS and have different execution semantics. One example would be importing a DLL.
Yes, but turning it into a supported Module type doesn't necessarily mean turning it into the source code of a supported module type. E.g. for WASM (or for the JVM bytecode example, actually), you would realistically analyze/compile the resource content first to determine the interface, then generate a facade, and then expose the compilation result inside of the module. Trying to inline the original bytes in the source text and then recompiling on execution would be fairly inefficient and in some cases not practical.
We already have an example that doesn't do inline-based transformation: currently our CJS translator creates a separate module record and just loads in the CJS without inlining it. It doesn't recompile on execution at all currently.
The only alternative I can think of is globals and unique ids but that's not really a proper solution.
Modules will need unique ids anyway in order to ensure the (module, specifier) pair is unique. I'm not sure how any other solution could be "proper" since without unique ids that makes the pair unable to correctly have a 1-1 relationship with an import.
For me, CoffeeScript isn't the target I'd want to optimize for. If what you're loading can easily be converted into self-contained JS code on the fly, it might just as well have been compiled ahead of time. The same isn't true for things that do not compile to JS and have different execution semantics. One example would be importing a DLL.
It isn't just CoffeeScript that does JS compilation; historically code coverage has done this (no longer!!!), eslint could certainly be useful to enforce at boot time, development runs without having two commands for build vs. run, etc.
It certainly isn't the only thing we should optimize for, but it is part of it. If the concern is mostly around avoiding duplicate parse/eval phases that is something we can design around, but I don't see how init solves this in any new way.
init allows us to officially support initializing a module using "real" APIs. E.g. module.setLazyDynamicExports(exportLists, getExports) or whatever the final API could look like. With loader hooks that can just return bytes, this will always be somewhat awkward and indirect. Afaik our CJS translator isn't implemented as a loader hook that just spits out bytes..?
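The setLazyDynamicExports name above is explicitly a strawman, not a real Node API. A minimal stand-in can still show the shape of the idea: export names are declared up front (for linking), while values are computed lazily (at evaluation):

```javascript
// Minimal stand-in for the strawman API described above; ModuleFacade and both
// method names are illustrative only, not anything Node actually exposes.
class ModuleFacade {
  setLazyDynamicExports(exportList, getExports) {
    this.exportList = exportList; // names known at link time
    this.getExports = getExports; // values produced lazily, at evaluation
  }
  evaluate() {
    const values = this.getExports();
    return Object.fromEntries(this.exportList.map((name) => [name, values[name]]));
  }
}

const mod = new ModuleFacade();
mod.setLazyDynamicExports(['add'], () => ({ add: (a, b) => a + b }));
mod.evaluate().add(2, 3); // 5
```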
@jkrems
Afaik our CJS translator isn't implemented as a loader hook that just spits out bytes..?
Correct. It currently doesn't, but if we wanted to we could rewrite it to do so. I'm not sure if that information is for or against anything given that.
init allows us to officially support initializing a module using "real" APIs. E.g. module.setLazyDynamicExports(exportLists, getExports) or whatever the final API could look like.
If you follow the Realms proposal there are fewer JS APIs being considered and most interactions for things are being moved to be purely string based. I'm not sure what APIs are being talked about here.
As I catch up on this thread, I am appreciating how everyone tries to follow a brainstorming approach that allows people to pose ideas and see how they materialize (or not) later on.
I think this type of discussion helps people with very diverse backgrounds, experiences, and extents of familiarity with the intricacies of ESM and CJS to mutually share and gain insights that are sometimes missed during goal-oriented debates.
On the idea of top-level parsing to disambiguate JS sources. I took some time to put together an experiment to roughly demonstrate the relative costs associated with different parsing strategies.
The gist of it is that a parser would bail out at the first occurrence of a particular syntax, and otherwise parse through the entire file length, using as few grammars as possible for a safe parse. The current experiment does not bail out; it simply identifies escapable entities that can be used for hijacking, contextualizing symbols, and the set of keywords that would satisfy the condition.
I added new parsing modes to the experimental parser: "esm", "cjs" and "esx". In "esm", the parser will operate strictly at the top level and only look for the keywords import, export, from, as (for completeness). In "cjs", it will parse deep and look for the keywords module and exports (though they are really not keywords; still working on that). In "esx", it will parse deep and also look for the combined set of keywords, with the intent to consider a single differential parse versus multiple binary parses. The "es" mode is an incomplete mode intended for full source analysis.
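A drastically simplified sketch of the mode idea might look like the following. The real experiment must also guard against hijacking by skipping strings, comments, templates, and regexp literals, which this sketch deliberately ignores; guessGoal and the keyword sets are illustrative only:

```javascript
// Greatly simplified "esx"-style single pass: find the earliest ESM or CJS hint.
const ESM_HINT = /\b(?:import|export)\b/; // "from"/"as" omitted for brevity
const CJS_HINT = /\b(?:module|exports)\b/;

function guessGoal(source) {
  const esm = source.search(ESM_HINT);
  const cjs = source.search(CJS_HINT);
  if (esm === -1 && cjs === -1) return 'ambiguous';
  // Whichever hint appears first wins; a real parser could bail out right there.
  if (cjs === -1 || (esm !== -1 && esm < cjs)) return 'esm';
  return 'cjs';
}

guessGoal('export default 1;');   // 'esm'
guessGoal('module.exports = 1;'); // 'cjs'
guessGoal('const answer = 42;');  // 'ambiguous'
```

Note that word boundaries keep `export` from matching inside `exports` and vice versa, which is why a naive substring search would not be enough even at this toy scale.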
The demo page is served from http://smotaal.github.io/experimental/markup/markup using the ordered parametric notation #[url]![mode][replicates]*[iterations]. If mode is omitted, it is inferred from the content-type. If iterations is specified and ≥ 1, a separate loop will run the tokenizer on the same code without rendering it (the average time of the loops will be shown, for sampling purposes). If replicates, which is not needed for this demo, is ≥ 2, the source text is repeated, so the parser will parse and render that many copies of the original text as a single source text. If you are working with really large sources (like babel), try *0**[iterations] to eliminate rendering overhead, which can crash some browsers.
Demo: acorn.mjs
Demo: acorn.js
Note: This experimental code works in the latest Chrome, Safari, and Firefox Nightly with varying performance. If you try this on a slower device, use a smaller source and change the iterations value as needed. All parsing happens in the main thread.
Obviously this does not address disambiguation of ambiguous source texts. If relative performance gains can be further improved or optimized, then disambiguation (loader or not) by source text will be something that some will likely favour down the road.
@smotaal That’s a great start for something that I can see as a loader. For your CommonJS detection I would add a check to look for globally-referenced require.
Perhaps it would be good to start compiling a list somewhere of things that people might want to see as loaders. Besides this case, off the top of my head there’s transpilers, automatic completion of file extensions/folder root files, configuration of module loading behavior based on file extension, and general backward compatibility to bridge the gap between what will be possible in ESM in Node and what is/was possible/allowed in Babel and other transpiled/built versions of ESM.
@GeoffreyBooth Obviously my efforts are geared towards complementing any potential implementations for loaders once the design process matures, but for this particular experiment, I decided to isolate it from any such efforts and instead tried to focus on some proxy problems. In my effort, thinking of syntax highlighting was a great way to visually solve parsing challenges, parsing in the main thread without dependencies was a great way to address performance issues, and adhering to generators was a great way to force a stream-like approach... the list goes on.
Regardless, it was just perfectly timed to use it to demo relative performance gains compared to the more common "AST all the things, then do one small thing" approach, which would be rather expensive for esm vs cjs (or my proposed esx) detection in my best estimate (but still needs real-world benchmarks).
ESX parsing currently scans the full length of the source text, but the intent is to actually keep references to enclosing ranges and not analyze them unless there are no signals of ESM syntax at the top level, then finally scan the enclosures to find the first cjs hint (or not). This makes it possible to report ESM, CJS, or still ambiguous, in which case the default based on out-of-band settings applies... etc.
If we have out of band settings, and that info conflicts with a parsed result, I’d expect it to throw - the two shouldn’t be in disagreement.
That’s actually a very important aspect, because in my rushed vision of eliminating parsing errors, which are normally handled by the runtime, I had not given thought to certain errors that belong specifically to the intent at hand.
More insights of this kind here can go a long way down the road when making decisions. Awesome 🙂
I am holding off on pushing the changes made to the demo, since the pages somehow had trouble building when I uploaded them. Until I figure out why (and, more so, why it worked fine a day later), I am hesitant to upload minor or partial changes.
That said, I am interested to see if I could get a good model in play where we can visualize ideas to facilitate discussions.
When considering the case of parsing, I was having trouble mentally placing the metadata communicated between two loaders for instance.
In this case, it is in-band (imo) but it is not "directly" from source, it is inferred and attributed to the source text, and is triggered (or bypassed) and responds to out-of-band (one-to-many) and out-of-source (one-to-one) aspects or settings.
Can I propose the following complementary pairs: (examples in brackets)
- "out-of-band" — settings that trickle down to one or more resolved specifiers (flag, ext, mime…)
- "in-band" — settings determined from resolved source features (pragma, this parse…)
- "from-source" — settings declared in the source text (pragma, shebang…)
- "out-of-source" — settings inferred or attributed to a source text (some in- and out-of-band)
Can anyone find a more practical breakdown of such information regarding a source text's journey?
These are all crude thoughts; they need magic from the group. I feel that a distinction between what maps to sources and what is specific to a source but not baked right into the body is essential.
I finally updated the README and pushed the revisions made last week. Timing is more accurate now. I also converted the rendering pipeline to async APIs. Tokenization APIs remain sync but use generators so they yield and return as needed. I improved the modes for esm, cjs, esx, and added the missing alias es for the regular javascript syntax mode.
I am really interested to hear some feedback on the three modes (esm, cjs, esx) with various sources, especially if you find a source that breaks or chokes in one of those modes.
@SMotaal it's cool i guess? i don't really understand why we have an issue open for it though.
@devsnek This thread is about ideas in general separate from implementation. As we move closer to loaders and defaults, those discussions and demos can be helpful, at the very least, they can serve as a reference for those who need to find more about them.
@jdalton Can you pitch in on the idea of syntax detection relating to top-level parse? I tried to find a way to model this to the benefit of everyone in the group and was able to show a (theoretical) 200% increase in performance relative to doing the same with full ES grammar parsing, as AST-based tools would.
This was done avoiding the conventional all-or-nothing AST approach, using halfway-optimized RegExps that address the usual concerns like hijacking.
Ideas like dual-parsing (@MylesBorins) and your top-level parse (@bmeck) made me think of a single parse limited to the minimal subset of both grammars; it was roughly capped at 175% depending on nested complexity, but on average better than 150%.
Since we're trying to find the first clue that determines the syntax, the expectation is that such clues will often materialize early in a text, making it reasonable to bail out or delay the rest of the parsing (if it is needed at all).
Can we hash out pseudo code for syntax determination based on your initial thoughts on top-level parse?
About this thread…
I'm trying to brainstorm ideas parallel to our implementation efforts that make it possible for our broadly diverse members to appreciate the various technical challenges associated with decisions we are making.
Based on an early digest of this discussion, which I took the liberty of summarizing at the top, I tried to pick ideas which seemed to create rifts in discussions elsewhere, mainly on the topics of syntax detection and interoperability.
@SMotaal This is impressive . . . just to understand what you’ve done here, is your goal to determine parse goal by analyzing the syntax? A.k.a. a real implementation of the “unambiguous syntax”/grammar that we’ve been discussing?
If so, and assuming that you find an algorithm that works, have you thought about how to address the related concerns listed in https://github.com/nodejs/modules/pull/150#issuecomment-406515253?
to be clear, it's just a lightweight way of parsing js. this doesn't make the ambiguity go away.
Confirmed; there does not exist any approach based on parsing that is unambiguous in all cases, absent a language spec change.
Yeah, while I would love to be the one who can solve the ambiguity of source text and other sources, this is really nothing more than a very modest effort to model different parsing methods apart from the usual tools.
My gut feeling tells me that while implementing solutions is best served by employing tried and tested tools, coming up with optimal solutions may not always share in those benefits. In other words, ASTs have a way about them that forces looking at problems in certain ways, so modeling the problem without them is a way to avoid restricting ourselves to their givens.
So this is far from a solution, just an attempt to provide a way to explore solutions, and the bottom line holds, ambiguity is ultimately a source problem, and if it is, then the only way to resolve it is out of band.
@devsnek the underlying motivation behind my markup experiment in general is not restricted to JS; in fact, I was interested in finding different ways of doing efficient and responsive multi-syntax parsing without the pitfalls of conventional methods. And on that, I think I am ready to dare make the claim that it can be done with virtually no switching overhead, using less popular features like generators and regexps: html (and script tags)