This is a discussion issue for adding ability to run eslint in parallel for multiple files.
The idea is that ESLint is mostly CPU bound, not IO bound, so creating multiple threads (for machines with multiple cores) might (and probably will) increase performance in a meaningful way. The downside is that currently ESLint's codebase is synchronous. So this would require rewriting everything up to and including eslint.js to be asynchronous, which would be a major effort.
I played with this a little while ago and found a few libraries for Node that handle thread pools, including detection of the number of cores available on the machine.
If anyone has any experience writing multithreaded applications for node.js and would like to suggest alternatives or comment on the above list, please feel free.
P.S. https://www.airpair.com/javascript/posts/which-async-javascript-libraries-should-i-use
Nice. I'm very interested in trying this myself too.
Another question that I had in my mind, if we need to rewrite everything to be async, should we use callback pattern? Promises? If so which library? Q or Bluebird? I personally would prefer promises to callback hell.
I vote for promises. Bluebird is fast, but makes me nervous because it adds methods to the native Promise, which is likely not a great idea. I think Q might be the best bet. There is also Asequence, but I have no personal experience with it.
Why not use built in promises? Just a question as I have no experience with promises yet.
They're not supported in node 0.10, to my knowledge.
Besides that, the libraries give some nice "sugar" methods when working with Promises.
I've had plenty of success using native promises (or a polyfill when native promises aren't supported). That seems like a good starting point to me; if we need more than they provide we could probably swap out something that's API-compatible.
I think we're putting the cart before the horse here. Let's hold off on promises vs. callbacks until we're at least ready to prototype. Get something working with callbacks and let's see how bad it is (or not).
The idea is that ESLint is mostly CPU bound, not IO bound
ESLint also does a lot of IO (directory traversal, reading source files). So I think we would also profit here if we rewrite eslint to do non-blocking IO.
@lo1tuma I haven't profiled it yet, but in my mind, the amount of IO we do is negligible compared to the number of CPU cycles we eat. I will try to profile it and post results here if I get anything meaningful.
Using something like NodeClusters - or most other per-process implementations - would avoid the issue of needing to [vastly] rewrite ESLint. (Such implementations are strictly _not_ threading, but allow managed parallel process execution.)
It would mostly just need to IPC-in/out the current ESLint; ESLint parallelism would then be able to work freely over different files in a per-process manner, but it could not (without more rework) run concurrently over a single file.
Thus if the goal is to run ESLint over _different files_ in parallel I would urge such a 'simple' per-process concurrency approach. If the goal is to make ESLint parallel across the same source/AST then .. that is a more complicated can of worms as it changes the 'divergence' point.
If there is a strict v0.10 node target for ESLint, maybe have this as a feature _when_ running a compatible node version.
My idea is:
Source files vary widely in size, so I think pull-style queue control is important.
As for a library... I think child_process.fork is enough.
Sorry for this question: Based on all the comments, do we think the reward of this functionality is worth the effort/changes/complications? (like I said just a question)
Or is it too early (like Google Glass) to implement this functionality without any actual estimates or demand?
@gyandeeps My projects are not big enough, or computer slow enough, for me to really care either way.
In cases where there are sizable projects of many files, on computers with several cores and non-bound I/O, I imagine that it could lead to significantly reduced wall-clock time, approaching Amdahl's law. I would be less optimistic about this gain with fewer, larger files, even with 'actual threading' or post-AST handling - but that is what performance profiles are for.
Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens.
@gyandeeps To answer your question - we do not have definitive information on this. Right now my assumption is that we are CPU-bound, not IO-bound. In that case, utilizing more cores should have a significant impact on larger projects (as @pnstickne mentioned, the impact will be small, or there might even be a negative impact, on a few large files).
I think the first step of the process would be to prove or disprove my assumption on CPU vs IO. Granted, if it turns out I'm wrong and we are IO-bound, then changing ESLint code to async would improve performance anyways.
@pnstickne Thanks for the insights. I'm really not familiar with NodeClusters; I just know that they exist and do not require everything to be async for them to act. We are definitely not going to try to make ESLint run AST analysis in parallel - that would be a huge change, and a lot of our rules expect nodes to be reported in the right order, so we would not only need to rewrite pretty much the whole core, we would also need to rewrite a lot of rules.
I would be interested in learning more about how we could incorporate code that would only execute on node 0.12 and not on earlier versions. I'm not sure that's the approach I would like to take, but it's an interesting option nevertheless.
So I did some very unscientific performance analysis on both Windows and OSX.
I specifically chose to lint a very large number of files (TodoMVC, 1597 files, 854 folders).
On Windows, results were very inconclusive. Basically my CPU usage was never over 15% (on an 8 core machine, none of the cores were overtaxed at any point). But then my I/O never hit anything over 10MB/s on an SSD that's theoretically capable of 550 MB/s. So I have no idea why it didn't run any faster.
On OSX it never hit more than 15% CPU (I have no idea how many cores my macbook has, probably 8 too), but I/O was pretty taxed. It looked like there were a few spikes that were reaching 100% disk throughput. So maybe my assumption was wrong and we are not CPU bound?
@ilyavolodin Try this:
Start running 8 different eslint processes (over the same set of files should be fine, although it would be 'more fair' to break up the files into 8 different equally-sized sets).
Compare the wall-clock times it takes for 1 process to complete and 8 processes to complete.
The 8 processes will have done 8x the work (if using the same set of files for each process as for the single process) or the same amount of work (if having split the source files among them), but in how much x the time?
This very crudely should show an approximate gain - if any - for using multi-process concurrency.
Late to the conversation, but... Is anyone opposed to someone starting up a pull request to implement the callback hell approach (i.e., make ESLint async across the board, including for all I/O operations)? Seems like it would make the eventual parallelization easier anyway.
Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens.
-@pnstickne
This is essentially what ESLinter from @royriojas does.
@IanVS That's pretty cool .. now if only it was built-in to my eslint grunt task :}
(Okay, I could get it shimmed in pretty easy - but it'd still be nice to see a 'done package'.)
@pnstickne #2998
@platinumazure I think I would like to first prove that there's something to gain by going async/multithreaded, just so nobody would waste their time creating a very large PR that will then be closed because there is no gain.
@ilyavolodin That's fair enough. I'm wondering, though, would it be worth creating a separate issue with the goal of making ESLint's I/O operations asynchronous? Synchronous I/O is a bit of an anti-pattern in Node and it might help us figure out just how much of a breaking change parallelization would likely be.
@platinumazure While sync code is a Node anti-pattern, in our case we can't do anything with the code (as in parse it into an AST) until we read the whole file. So if improving performance is not on the plate, then changing the code to async will increase its complexity without gaining us anything. It's worth testing out, and I still think there's a performance gain we can get by doing parallel execution, but I would like to get some proof of that first.
@ilyavolodin reading files asynchronously doesn’t mean you read them chunk-by-chunk. While you are waiting for one file to be read, you can lint a different file which has been read already.
Wall clock time is the only metric that matters for this evaluation. We have had people report they were using ESLint on projects with thousands of files, so if there's a benefit, it will be seen.
The thing to keep in mind is that directory traversal must be done before linting can begin. Thus, we get zero benefit from switching to an async directory traversal. We need to build up the list of files to lint and then fork off to start linting, at which point each process is doing a file read, and there's no benefit to async there either.
Overall, I'd just like to squash the "rewrite to make things async" approach. We should be able to make a change in CLIEngine to do this without a major rewrite.
Ok, I think I finally figured out a good way to check if ESLint is CPU bound vs. IO bound. I ran eslint on the same set of files (544 files in 234 directories), first with just one rule on (`yoda: [2, "never"]`) and then with the .eslintrc file from the ESLint repository (so a lot of rules on). As far as I can tell, the amount of IO should be exactly the same in both cases, but the number of CPU cycles will be significantly different. This doesn't mean that IO is not a problem, but at least it's something. The numbers were as follows:
Test #1: 9.630s
Test #2: 1:27.880s
Based on the dramatic difference in performance, I want to say that we are not IO bound. The next step would be to create a threaded prototype of some sort and see what the difference in performance is going to be.
I'm still for 'per process' concurrency.
Ok, I created a very simple prototype of "process-pool" that will spawn a bunch of processes based on the number of cores available and send files to those processes for linting. I had to turn off reporting (formatters are not invoked) but here are the numbers:
Single process linting on 544 files in 234 directories took 9.585 seconds. Same files took 4.168 seconds over 8 processes (I have 8 cores). I still need to do some major cleanup on "process-pool" and everything up to and including CLIEngine needs to be turned into async code, but it looks like it might be worth it.
@ilyavolodin, is your prototype public? Would love to see it. Currently the cache module uses statSync for simplicity's sake, but it would be nice to see if that can be done async as well.
@royriojas Not yet. I only wrote a pooling library for child_process, and even that needs some major cleanup before I would be comfortable posting it anywhere :-) I didn't convert anything to async code yet; I just added process.exit(0) once all of the linting is done (so no reporting of any kind for now). I also commented out all of the cache code for this test, because it was hard to find a spot to plug my code in with caching on.
Just wanted to chime in. We have a medium size project, linting 1370 (larger) files with a good number of rules in a powerful VM, and it takes about 30 seconds. It's manageable, but painful.
Here's an update on where this is at:
The work so far lives on the multithreading branch. The main entry point is executeOnFiles, and it should be refactored to allow for creation of another entry point called executeOnFilesAsync. In order to do that, this function should be streamlined to have a single entry point and a single exit point. All of the helper functions should be refactored to be just that - helpers: send something in, get a return value, and then branch the code based on the return. The current call stack is:
cli.js - execute
cli-engine.js - executeOnFiles
cli-engine.js - executeOnFile
cli-engine.js - executeOnText
cli-engine.js - processText
eslint.js - verify
Here's what it should be like to make it easy to do async functions:
cli.js - execute
cli-engine.js - executeOnFiles
eslint.js - verify
Everything that was in between should be made into helper functions. Since a spawned process can only execute a module, that module should be as simple as possible, preferably just a pure `return eslint.verify(text, config, filename)`
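The forked module could then be little more than the following sketch (the message shape is an assumption, and a stub stands in for the real eslint module to keep it self-contained):

```javascript
// Hypothetical "dumb" worker module: it only relays jobs to eslint.verify
// and sends the result back to the parent process.
'use strict';

// In the real tree this would be require('./eslint'); a stub keeps the
// sketch self-contained.
const eslint = { verify: (text, config, filename) => [] };

function handleJob(job) {
  return {
    filename: job.filename,
    messages: eslint.verify(job.text, job.config, job.filename)
  };
}

// process.send only exists when this module was started via fork().
if (process.send) {
  process.on('message', (job) => process.send(handleJob(job)));
}
```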
Aren't executeOnFile, executeOnText, and processText already helpers?
They are not, they are part of the main stack trace. executeOnFiles calls into executeOnFile, which calls into executeOnText, etc., until the call into verify. The point is, to make an async version, the depth of the stack trace has to be minimal; otherwise, every step of the stack trace has to be turned async as well.
Sorry, I'm still confused. Let's start by defining terms. What do you mean by "helper"?
A helper function (in my own made-up definition) is a function that takes a set of arguments and returns back a value, it can call other functions, but it cannot be on the main execution path of the code. For example, in the current version of cli-engine I consider calculateStatsPerFile, calculateStatsPerRun, createIgnoreResult, etc. to be helper functions, they are called by a function on the main execution path, but they are not part of the main execution path.
Helper functions can be kept sync even when writing async code. But functions on the main execution path have to be async.
Basically, if you were to run eslint in a debugger and put a break point in eslint.verify method, anything that will be displayed in the stack trace once you hit that break point has to have async counterpart. I'm trying to limit the number of async functions that need to be created, and at the same time, not have a lot of logic in those functions, so I wouldn't have to duplicate that logic between sync/async counterparts.
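A toy version of that split: one sync helper shared by a sync main path and its async counterpart. The function names mirror the discussion but are illustrative only, not ESLint's actual code.

```javascript
'use strict';

// Helper: off the main execution path, so it stays synchronous in both versions.
function calculateStats(messages) {
  return {
    errorCount: messages.filter((m) => m.severity === 2).length,
    warningCount: messages.filter((m) => m.severity === 1).length
  };
}

// Sync main path.
function executeOnText(text, lint) {
  return calculateStats(lint(text));
}

// Async counterpart: the helper is reused untouched; only the main-path
// step gained a callback.
function executeOnTextAsync(text, lintAsync, done) {
  lintAsync(text, (messages) => done(calculateStats(messages)));
}
```

Because the logic lives in the shared helper, nothing needs to be duplicated between the sync and async entry points.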
Got it, thanks for explaining
hi @ilyavolodin,
So here is my take. I do agree with you that we should probably move the code for the cache to a better place than where it is now (executeOnFiles). In a perfect world the process should be something like this:
In the main process:
In the forked process
In the main process
I don't know if this will be feasible, but I guess it will make the process simpler in order to also use the cache.
Let me know what you think about it.
@royriojas That's exactly the way I was thinking about it. It's a bit more complicated though, since not just the cache stuff needs to be refactored. Currently there are too many things happening on the main path of execution (for example, the list of files contains just filenames; for each one a config needs to be constructed, the content of the file has to be read, etc.). If all of those things are turned into helper methods, doing parallel linting will actually be pretty straightforward.
@ilyavolodin I see, so instead of a list of files it would be a list of objects with the filepath and the config object already inferred, right?
We are already taking the config into consideration for the cache, so once the methods are refactored, adding the cache back shouldn't be an issue. I would say if you have a working branch on this, just remove the cache code; we can add it back in the right place.
My only concern is issue #4607, since it is requesting to move the cache inside the executeOnText method and you're suggesting it should be outside of it. Maybe we need a wrapper for it that can also include the cache so the original remains untouched?
Ideally, what I think should be done is that the cache should be turned into a utility and moved into its own module. Each method that's publicly exposed should do one operation only (for example, given a config object and file name, determine if the file is already cached and return true/false). This will allow caching to be used in async and sync executeOnFiles as well as in executeOnText without any duplication of code.
@ilyavolodin I see. The cache is currently a very small amount of code, so moving it to a helper shouldn't be complicated. If #4607 is accepted then I can do that as part of the work on that issue.
This looks like it would actually be great to use for parallel linting: https://github.com/SyntheticSemantics/ems However, I don't think we could do that, since it's a native extension and currently only works on Linux/OSX:-(
Is there any progress going on in here? I would love being able to split up linting to share the load among cores.
@elodszopos No, sorry, no progress. This feature was moved out of v2 timeline and will come in at a later date.
For those looking for a temporary solution I wrote this which works for my purposes:
Master process:
https://github.com/AGhost-7/dyna-table/blob/master/linter-run.js
Worker processes:
https://github.com/AGhost-7/dyna-table/blob/master/linter-worker.js
Makes a very big difference with my quad-core processor.
@AGhost-7 Thanks for sharing. Very nice! The only problem is you are bypassing our glob resolution and all of the cli switches, but I'm sure for your use-case it would be perfectly fine. In our case, we need to continue to support all CLI switches as well as glob resolution, so it's a bit more complicated.
P.S. Edge case, but since you are loading fork processes with 10 files at a time, you might get into a situation where the first 10 files are all huge and the second 10 files are all small, so the second process is going to finish up much faster than the first one. You can take a look at https://github.com/eslint/eslint/blob/multithreading/lib/process-pool.js - it should balance execution equally between all of the processes (more or less).
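The difference is easy to see with a toy scheduling model (costs stand in for file sizes; a pull queue behaves like greedy assignment to the least-loaded worker):

```javascript
'use strict';

// Fixed batches: files are dealt out batchSize at a time, round-robin.
function batchMakespan(costs, workers, batchSize) {
  const loads = new Array(workers).fill(0);
  costs.forEach((cost, idx) => {
    loads[Math.floor(idx / batchSize) % workers] += cost;
  });
  return Math.max(...loads);
}

// Pull queue: whichever worker frees up first takes the next file, which is
// equivalent to always assigning to the currently least-loaded worker.
function pullMakespan(costs, workers) {
  const loads = new Array(workers).fill(0);
  costs.forEach((cost) => {
    loads[loads.indexOf(Math.min(...loads))] += cost;
  });
  return Math.max(...loads);
}

// Four huge files followed by four tiny ones, on two workers: the fixed
// batch hands every huge file to one worker.
const costs = [9, 9, 9, 9, 1, 1, 1, 1];
console.log(batchMakespan(costs, 2, 4)); // 36: one worker got all the big files
console.log(pullMakespan(costs, 2));     // 20: load stays balanced
```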
Yeah, it's just a quick throw-together to spread the workload for now. Far from perfect.
P.S. I think the biggest file I have is 300 lines or so anyway.
@nzakas I'm looking into some perf stuff involving an automation tool that uses eslint. Earlier in this thread you mentioned:
The thing to keep in mind is that directory traversal must be done before linting can begin. Thus, we get zero benefit from switching to an async directory traversal.
Is this due to how reporters work, or is this strict ordering (full directory traversal before linting) due to something else?
(By the way, thanks so much for building such a great tool!)
@drifkin we have to calculate configuration for each file before it can be linted. So we have to go through the trouble of looking for all the configuration files, starting in the directory of the file to be linted and working our way back up the directory structure. At that point, you've calculated the configuration for everything in that directory as well.
Then we have to apply the .eslintignore filter over each potential file to lint.
My main goal here was to discourage a complete rewrite of our CLI for questionable gains (it's complicated enough as it is). Maybe there's some intricate way to interweave directory traversal and linting to take advantage of async, but I don't see it. In general, I don't think async is useful for CLI utilities.
You don't need to have the workers calculate the configs. Workers shouldn't do the management IMO; in fact I believe the better way is to keep workers "dumb". What I'm thinking is sending for each file to lint as a message containing the processed configuration and the file's absolute path.
Hi @nzakas, I just want to understand: can the per-file configuration check and the .eslintignore calculation be skipped? I guess the common case for a vast majority of us is to use a single configuration per project. And we might even be able to provide a list of files instead of a glob, so the files we don't want linted are already skipped ahead of time and there's no need to check .eslintignore.
This will have a very nice side effect because we can even calculate which files did change before creating the glob to pass to eslint which means that eslint will only have to operate over a very small number of files after first run. And now those files can be run on parallel.
If this is not possible, then, maybe a wrapper around eslint cli can be made to lint files in parallel. It might be a nice proof of concept to see if the performance is increased.
That said, I do believe eslint right now is fast enough for most cases. I guess the pain is more noticeable when you start linting a project that wasn't linted before.
Even if we lint in parallel, I'm not sure how much will be the gain obtained from that.
@royriojas @AGhost-7 already created a wrapper like that (see a few posts above), but it skips glob resolution, config cascading, ignore files, etc. - basically most of the utility features that make ESLint a well-rounded tool.
I've been working on this issue on and off for a little while now. I have converted all of the code in CLI and CLI-Engine to handle async code execution (it's not a full rewrite, just adding callbacks to necessary places and it works fine). I ran into some issues with spawning child processes for some reason, and haven't had time to figure those out yet, though.
Hi @ilyavolodin, thanks for your answer,
I see, but my point was that maybe such functionality should not be in eslint itself but in a third-party wrapper, mostly because it doesn't actually use those extra features. :) (Note that I'm not saying they are not valuable, they for sure are!)
In any case it was just my two cents
I solved a tangentially related problem (a CPU- and IO-bound pipeline) with a fully concurrent build system I created as an experiment for my personal website here. It involved lots of Jade and Stylus compilation (i.e. lots of CPU), and I managed to shave about a third off my build time. I will admit there were significant amounts of IO, but incidentally, the directory traversal is done in a single-threaded fashion. It was the file operations done during the traversal that were concurrent.
Looks like Node is running eslint in parallel also, at 40 threads. Code here: https://github.com/nodejs/node/blob/master/tools/jslint.js
@ilyavolodin Is that ever done locally? 40 threads is a lot more than how many active processes most PCs can support (even the Macbook). Maybe, ESLint might work best with about 1.5 threads per core, since ESLint still has to read files and various configs (although in a multithreaded implementation, that's less of a problem), but I'd be shocked if that wasn't at least configurable or CI-specific.
It's not actually 40 different workers, it's a maximum of 40 files per worker at a time. The default number of workers is equal to the number of cores reported by Node's os module.
Oh. I didn't really read the file's 284 lines that closely. Nice to know!
What is the status of this issue? Will there be a migration to async/parallelisim?
@amilajack the issue is open, so we are planning on addressing it at some point. No one has the time to dig into it currently.
Sorry about that. I'd love to help out with this but I don't know where I should start. Any suggestions on where I should make changes? Also, can you identify which parts of the ESLint codebase are responsible for running the rules of ESLint?
@amilajack that tweet wasn't actually in reference to you, specifically, so I apologize if it seemed that way. There's just been a lot of these comments on various issues in the past week.
I'd suggest reviewing our architecture docs and reading over this thread, as there are a lot of details contained in the comments that will point you in the right direction. FWIW, we don't expect that this is something someone outside of the team will be able to successfully design and implement, as it's very complicated (that's why the issue has been open for so long). It's probably also not a great first PR for a new contributor given the complexity. That said, you're definitely welcome to take a stab at it.
The next step for this issue is to come up with a design describing how this would work (you can see a few such attempts in this thread already).
@amilajack If you decide to take a stab at it, a good starting point would be this branch: https://github.com/eslint/eslint/tree/multithreading
It's very old and outdated, but it gives you an idea of what needs to be done. Basically everything between cli.js and lib/eslint.js has to become async in order to do multi-threading. I've tried doing that about 3 times, but every time I ran into issues along the way that blocked me from completing it. Those are not show-stopping issues, but they required more time than I had to solve them.
I see. Do you remember what the issues were?
@amilajack It was just refactoring that had to be done in order to enable the code to work as both async and sync. The code is pretty complicated and the stack trace depth is quite substantial, so you have to refactor a large number of methods and it's pretty easy to get lost along the way (and that's exactly what happened).
Would a primitive event loop or similar async helper like actors, hand-rolled futures, or whatever that's purely internal (i.e. not connected to the browser/Node/etc.) help make the stack traces easier to digest? In the sync version, you could just run it with a final end callback until it finishes, and then return the result pushed into the callback, but in the async version, you could hook into the native event loop and maintain true asynchrony.
@ilyavolodin
It was just refactoring that had to be done in order to enable code to work as both async
Why did it need to work both sync and async?
@amilajack Fully async and parallel linting would make it close to impossible to debug the code. Also, since Node only supports parallelism through child processes, there is overhead for parallel applications that only run a very low number of threads (like linting a single file, for example).
@isiahmeadows Maybe, depends on the implementation details. The issue is that we need to resolve all of the cli options/configs/ignores first on a single thread, then start branching the code for linting into separate processes. After that we have to combine results back into a single thread to create the error report.
How about doing something like Ava does?
Ava only uses promises to parallelize. We're also planning to use threads
@ilyavolodin
I'm aware of that, and was mainly speaking of the core code, if that requires any adjustments directly (it likely won't much).
I've already created my own parallelism utility similar in functionality to what's being requested here, where I created a job queue to manage processes with a specified limit for a promise-based build tool experiment (trust me in that it works). I'm sure it would be relatively straightforward to adapt for this use case (it's BSD-licensed, so go right ahead if you wish).
@isiahmeadows I already created something similar on the branch I linked: https://github.com/eslint/eslint/blob/multithreading/lib/process-pool.js
Problem is, you need to spawn a new child process and run code in it separately (usually by executing cli command). We currently can't really do that, so the code needs to be refactored to support something like that, which is what I was getting stuck on. If somebody wants to prototype refactoring needed to enable it, go ahead. It's definitely possible, just requires time and effort.
I'm considering it. The parallelism will be tricky, but I'm considering making a dedicated npm module out of it (I could use it myself for other reasons). If and when I finish it, I'll link it here.
@ilyavolodin What do you think about using a sync and non-sync method using a library similar to this:
@amilajack Hmm.. haven't seen that before. Sounds pretty good, but unfortunately, it wouldn't work for us, as native dependencies are a pain in cross-platform applications.
@ilyavolodin what if we use it only in development for testing purposes to make debugging easier?
That would be OK (although the only other native dependency that we have is Phantom, and it constantly gives us trouble). We have discussed this with the team, and while everyone thinks linting files in parallel is a very nice idea, and there is a lot of interest in this feature, we decided that we don't want to switch to it completely by default. We still have to support linting single files (for example, inside an editor), which would actually suffer from a complete switch to parallel infrastructure. Also, a lot of build systems are virtualized single-core/single-processor machines (so they don't really support threading all that well either), and they would suffer from a complete switch to async/parallel linting as well. So we want to keep single-process linting as the default option and provide a switch that enables parallel linting. In the future we might want to switch to parallel linting by default, but we would want to try it out first.
@ilyavolodin If you use deasync in an async context, you're bound to have problems, by the looks of it, and Node's relevant committee appears very reluctant to fix the issue, because the API itself is highly unsafe, since it completely sidesteps Node's control of the libuv event loop.
I've created a library that supports all the CLI parameters and executes things in parallel. This is the repo:
https://github.com/alansouzati/eslint-parallel
This is the strategy I took:
1 - Take all the glob patterns and flatten them into a list of all files to lint, excluding the files that match the patterns described in .eslintignore and --ignore-pattern.
2 - Use child_process.fork to distribute work across processes. I use the number of CPU cores available so that you don't create too many processes on slow machines (like the free Travis instances).
3 - I do not fork if the list is smaller than 50 files. This is a magic number that, on my machine, is where forking starts to pay off. I can change it if you think it should be smaller/bigger.
4 - I do not fork if the machine does not have more than 2 cores.
https://github.com/alansouzati/eslint-parallel/blob/master/src/linter.js#L111
Feedback is gladly appreciated. Maybe eslint does not need to change the core library as we can build small wrappers around it to parallelize things.
@alansouzati That looks pretty good. A couple of things to note: you are running executeOnFiles inside the forks. That works pretty well, but it will not allow you to share the config file cache, since each fork runs in a separate process. What that means is that ESLint will spend a significant amount of extra time recalculating configs (especially in cases where you have deeply nested folders with a lot of .eslintrc files, or when you use multiple extends). Unfortunately, that's not fixable outside of the ESLint core.
I haven't spent too much time reading the code, but there are a lot of gotchas in the way we implement ignore patterns. I'm almost 100% sure that what you are doing isn't going to match native ESLint ignore-pattern resolution completely.
I think you can improve formatter support by collecting the outputs from all forks into a single array and then running CLIEngine.getFormatter()(allMessages). This way you don't need to build your own formatter, and you can use all of the built-in ESLint formatters. But that also means you need to wait until all of the forks have finished before outputting anything to the user.
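The merging half of that suggestion is simple enough to sketch. In this sketch, mergeForkResults and report are hypothetical names, and the formatter function is injected so the logic stands alone; in a real integration it would come from CLIEngine.getFormatter():

```javascript
// Each fork sends back the `results` array from its executeOnFiles() call;
// flatten them so a single formatter call sees every file exactly once.
function mergeForkResults(perForkResults) {
  return [].concat(...perForkResults);
}

// `format` stands in for the function returned by CLIEngine.getFormatter().
function report(perForkResults, format) {
  return format(mergeForkResults(perForkResults));
}
```

As noted above, this implies buffering: nothing can be printed until the slowest fork finishes.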
Could you provide some performance numbers for a native ESLint run vs. your implementation? It would be interesting to see how big the gap is.
Thanks for the feedback.
I'm not sure I understand the cache limitation. Even though we have different processes, we have the same location for the cache file, correct? I'm using the default one, .eslintcache. When I enable caching in my project, I see the file get created, and it does lint pretty fast the second time. The actual problem is that creating the fork itself is expensive. Also, I'm not sure if .eslintcache will contain only the content of the last process or if ESLint will augment it with new files as the processes complete.
For the ignore files, do you know if that API is exposed by ESLint? If so, I could give it all the files in a flat list, and the return value of that function would be the final list of files excluding the ignored ones.
My use case is pretty simple. I've applied this to Grommet (https://github.com/grommet/grommet). We have around 200 files to lint. Native eslint executes in around 6 seconds; the parallel version executes in 3.9 seconds.
It would be great if other folks here in the community could evaluate it and provide performance numbers as well.
Finally, great idea on the formatter enhancement. I believe it's OK to wait until all the processes complete to provide the list. I like that I can use the native formatters. I will apply that to the project soon.
@alansouzati Sorry, I wasn't talking about the lint cache, I was talking about the config cache (or rather hashmap). When ESLint goes through the files to lint, it builds a config for each folder. Building includes traversing up the directory tree to find all parent config files, merging them together, resolving extends chains from files and eslint-config- packages, and loading the necessary plugins. Once the config is built for a given directory, it's added to the hashmap, and linting any other file in that directory is a lot faster, since the config is ready and can be retrieved from the hashmap. In your case, since you call executeOnFiles inside each fork, each fork will create its own hashmap but will not share it between processes. So you will potentially end up recalculating the same config multiple times (which can be very expensive). Unfortunately, you can't really improve this outside of the core. Supposedly you could use CLIEngine.getConfigForFile() and cache the results on your own, and then use CLIEngine.executeOnText instead of CLIEngine.executeOnFiles, but that would be a lot of code duplicated from the ESLint core, and you would lose things like lint caching and so on.
Thanks for the great explanation. It makes perfect sense now.
Since it improved performance in my case, I believe recalculating the config cache was worth it. I see your point, though; it may not apply to all cases.
I will keep my eyes open if/when you guys add parallel support. I will deprecate eslint-parallel as soon as it happens.
Another potential concern with running in parallel manifests at the rule level. Any rule that keeps tabs on what's been seen across multiple files (e.g., global variables, modules, etc.) won't have a complete picture if the scans are split across multiple processes.
I used to run ESLint in parallel, but was getting inconsistent results when detecting "unused" values.
Most of ESLint's own rules don't care about anything beyond the scope of a single file, so are fine to run in parallel. There can exist rules that do, and those rules will need a means of sharing their state across processes.
@chrisdarroch ESLint rules do not know about each other by design. None of the core rules share information. In some cases, however, the same rule can share information between executions (as in, a single rule can have listeners for multiple AST nodes; each listener is executed separately, but the context is shared), but only in the context of the same file. That's the reason why we don't want to parallelize rule execution, but instead want to split work per file. For each file, ESLint creates a new instance of the linter (more or less), so none of that information can possibly be shared.
In theory, it would be possible for a rule to be thread-unsafe. Rules like this probably exist somewhere:
'use strict';

const stack = [];

module.exports = {
    create(context) {
        return {
            Identifier: node => stack.push(node),
            "Identifier:exit": () => stack.pop()
        };
    }
};
The stack variable is global to the rule rather than being scoped in the create function. If we linted multiple files at once, the stack would end up containing nodes from multiple different ASTs.
@not-an-aardvark That might be true (due to CommonJS caching), but we never promised that something like that would work or be supported by ESLint. So if somebody creates a custom rule that uses this loophole, I think it's OK if we break them.
I'm not entirely convinced -- it seems like it would be reasonable for users to assume that a rule is only executed on a single AST at a time, and a user might reasonably do this if they didn't grasp the importance of the create function. I think it's rare enough that it wouldn't cause vast ecosystem breakage, but we should still only change it in a major version.
Only if the actual linting is interleaved through multiple file calls. I'll note that JS is still single-threaded, and the only way you can have shared memory that two threads within a process both manipulate is via shared array buffers and web workers.
@not-an-aardvark Fair enough. But as noted, if we introduce parallel linting at some point, it will be behind a CLI option and will not be the default behavior. So I think we should be fine.
Related: Pinterest announced this today, which seems very useful. As far as I can tell, it generates a list of files and splits them among workers, then each worker runs ESLint independently on its files, which results in a significant performance boost.
One downside of this approach is that I don't think the config cache can be shared among workers. As a result, some config files will be read and resolved multiple times by different workers. A core implementation might be able to run even faster by sharing the cache.
I'd love to see this happen, but in the meantime jest-runner-eslint has helped speed things up for us: https://github.com/rogeliog/jest-runner-eslint
If someone could up that bounty some, I'd be willing to tackle it.
parser is not a module reference). This would actually be the easy part, since I've done it before, but I'm not using that linked utility of mine because it's not even alpha yet (and I just haven't had the time nor money to work on it).
- The CLI is way too synchronous, even two years later, and I'd have to duplicate half of it.
I agree that the current state is not ideal. Ideally, I think it would be good to try to refactor it to use more pure functions, to avoid needing to duplicate half of it.
- I could probably make the linting a bit less memory-intensive by using async globbing, async printing, and interleaving I/O with linting. It'd make it easier to parallelize with the I/O this does entail (file reading, globbing, config reading, caching).
I'm not opposed to this, but I would note that linting is largely CPU-bound. I think the most effective first step would be to parallelize everything that happens after file I/O, since that's where most of the time is currently spent.
Interleaving I/O with linting sounds nice, but I'm unsure if it would work well in practice. Globbing/reading the cache file need to happen before linting any files, and printing output/writing the cache file need to happen after linting all files, so I don't think there would be much of a speedup from making those asynchronous. Config reading could happen at the same time as linting (by linting one file while reading the config for another), but I don't think this would be very effective in practice. (In most cases, the same config file is used for a large number of JavaScript files, and config files are cached. This means that any two arbitrary files are likely to share the same config anyway, so the config for the second file would already be available without doing any I/O.)
I agree that we could get a notable speedup by reading one file at the same time as we lint another.
- If breaking changes are allowed, I could make the above bullet the default for single-process linting, too.
Unfortunately I don't think we can do this. There are a lot of integrations using the current synchronous API, and it wouldn't be a good idea to break all of them. (If we could, I probably would have made pretty much all I/O synchronous by now 😄.)
- I'll be rolling my own process pool implementation, since I need to load scripts in the workers as well as the parent (I'll be offloading parsing as well as linting to workers if parser is not a module reference).
I'm not very familiar with the options for process pools, but I think worker-farm would be able to load scripts from the workers, and seems to be maintained. That said, I don't object to rolling our own, provided that it's not too complicated and there's an advantage to doing so over using a library.
I'm not sure what you mean by "if parser is not a module reference"; a parser needs to be loaded from node_modules in any case, regardless of whether the default parser is used. I would recommend running everything in Linter in parallel (this includes parsing, running rules, etc.) Linter does not perform any I/O, so I don't think it should be necessary to make any changes to Linter. (Each worker would probably have a separate Linter instance and would use it to perform CPU-bound tasks.)
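The "separate Linter instance per worker" idea could be sketched like this. Linter#verify is the real API; createWorkerHandler and the message shape are illustrative, and the linter is injected here so the worker logic stands on its own:

```javascript
// The parent would send { filePath, text, config } to each worker;
// the worker replies with the lint messages for that file.
function createWorkerHandler(linter) {
  return ({ text, config, filePath }) => ({
    filePath,
    // Linter#verify is pure CPU work: parse + run rules, no I/O.
    messages: linter.verify(text, config, { filename: filePath })
  });
}

// In a real worker process, something like:
//   const { Linter } = require("eslint");
//   const handle = createWorkerHandler(new Linter());
//   process.on("message", job => process.send(handle(job)));
```

File reading, globbing, and config resolution would stay in the parent (or be distributed separately); the worker only ever touches in-memory text.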
@not-an-aardvark
I agree that the current state is not ideal. Ideally, I think it would be good to try to refactor it to use more pure functions, to avoid needing to duplicate half of it.
Definitely agreed.
Interleaving I/O with linting sounds nice, but I'm unsure if it would work well in practice. Globbing/reading the cache file need to happen before linting any files, and printing output/writing the cache file need to happen after linting all files, so I don't think there would be much of a speedup from making those asynchronous.
You could use a task loop, and you can also start linting directories as soon as you know which files you can ignore. The cache file could largely be kept in memory until you completely finish.
Config reading could happen at the same time as linting (by linting one file while reading the config for another), but I don't think this would be very effective in practice. (In most cases, the same config file is used for a large number of JavaScript files, and config files are cached. This means that any two arbitrary files are likely to share the same config anyway, so the config for the second file would already be available without doing any I/O.)
I'm aware, and I'm inclined to agree. I included it in the list of I/O-bound tasks, but I was debating leaving it out for that very reason.
Unfortunately I don't think we can do this. There are a lot of integrations using the current synchronous API, and it wouldn't be a good idea to break all of them. (If we could, I probably would have made pretty much all I/O synchronous by now 😄.)
Which is why I said if, with a semver-major implication. Pipe dreams are always nice to think about, though, and we could always create a separate async API that just happens to be a bit faster and more consistent. In fact, that was my initial plan.
I'm not very familiar with the options for process pools, but I think worker-farm would be able to load scripts from the workers, and seems to be maintained. That said, I don't object to rolling our own, provided that it's not too complicated and there's an advantage to doing so over using a library.
I didn't know about that library, but it seems interesting. There is one big drawback, though: it uses a single export, when we need multiple (5+) entry points in our case. There are a few other benefits to rolling our own:
Also, trust me that it won't be too complicated, and most of the complicated logic would be isolated to a few modules.
I'm not sure what you mean by "if parser is not a module reference"; a parser needs to be loaded from node_modules in any case, regardless of whether the default parser is used.
I thought you could do parser: require("my-custom-parser"), but apparently, I misremembered. (I looked at the source to confirm this.)
I would recommend running everything in Linter in parallel (this includes parsing, running rules, etc.) Linter does not perform any I/O, so I don't think it should be necessary to make any changes to Linter. (Each worker would probably have a separate Linter instance and would use it to perform CPU-bound tasks.)
That was the plan. There's a few glitches in that which would necessitate dynamic module loading (as in, significant I/O) in the workers:
These are all primarily at startup, though.
It looks like there hasn't been much movement on this issue in a while. Given that esprint exists, I'm curious if we can close this and just refer people to that project instead?
Ideally ESLint would come performant out of the box without the need for 3rd party libraries. Thoughts on this?
That is a true statement, and ESLint is much more performant than it was when this issue was first opened. Additionally, there are multiple ways to make ESLint performant without implementing parallel processing in the core (which, as this thread has shown, is quite difficult).
--
Nicholas C. Zakas
@slicknet
Author, Principles of Object-Oriented JavaScript http://amzn.to/29Pmfrm
Author, Understanding ECMAScript 6 http://amzn.to/29K1mIy
A couple team members had this on their wishlist for v6.0 (https://github.com/eslint/eslint/issues/10644). My impression is that this is still considered a relatively high-priority issue, but the team hasn't had time recently to do a lot of work on large issues like this. So my preference would be to keep this one open for now.
My goal is to work on this in the beginning of next year, if nobody gets to it before. I think this is one of the biggest impact changes that we can introduce, and there's been a lot of interest in this (however - it's not easy to implement).
Can either of you comment on the value of having this in the core instead of referring people to esprint? I'm just trying to understand better why we should take this on given that there's an existing solution.
esprint requires an additional tool, config, and server to work. jest-runner-eslint requires using Jest and setting up a separate config. The Jest runner (which I've helped maintain, since we use it at Airbnb) is constrained by the limitations of Jest itself; for example, we can't even use it for auto-fixing at this time. In other words, our existing solutions aren't sufficient.
It's great that these tools exist, but it's a very large obstacle to have to use a different tool run by a different team that has different maintenance behavior, may run on different Node versions, has different documentation and different priorities, and gains support for core eslint features on a different and delayed timeline.
Also, as mentioned before, external tools can't match the speed improvement that could be achieved in the core, due to the fact that all of the external tools have to resolve all of the configs multiple times, whereas the core can do it once and share the result.
Fair enough. If there’s a plan to do it then I’m in.
The ESLint team has just created an RFC process for complicated changes that require designs. This issue was marked as "needs design" and so falls into this new RFC process. You can read more about the RFC process here:
https://github.com/eslint/rfcs/
Thanks!
I'm totally an outsider, but I found this comment strange: https://github.com/eslint/eslint/issues/3565#issuecomment-423709300. I don't see any reason why reading and processing the config files only once should add a measurable speed improvement, especially taking into account that the config files would be in the OS cache, so reading them would be a no-op from an I/O point of view. Has someone verified this claim? I feel like anything more than a script that 1) splits the files to process into workloads of the same number of files and 2) runs one instance of eslint on each workload would be uselessly overengineering the problem.
@xcombelle Not on the ESLint team, but I feel qualified to explain despite that. (I've dealt with config loading in a few cases.)
That would mean you're loading all the configs and their plugins, once for every file. Instead of 1-2 syscalls per file, you're now making about a dozen, and that's after you resolve all of the internal require calls. So in reality, you'd easily have hundreds per file instead of a few. This might seem insignificant, but you're blocked by syscalls, hardware I/O, and initialization, all of which is far slower than the actual linting process (which is itself slow due to multiple traversals of a large tree). You might get small gains on a quad-core system, but nothing truly significant without a dedicated module that avoids much of the file system overhead and duplicates much of ESLint's built-in functionality. You also have to write a custom reporter to avoid output getting mangled and crossed up.
Also, I'm not even 100% sure if ESLint's CLI program even supports parallel execution in separate Node instances. It might conflict with other CLI instances when writing stuff back to the local .eslintcache.
Hey @nzakas, I'd like to pick up this one.
I made a quick and dirty PoC here https://github.com/eslint/eslint/pull/12191
The main idea is to split the files into X chunks that can be linted separately on X workers. I used workerpool for the workers, but that will most likely change. The main issue I have now is passing the ConfigArray object, as it breaks when serializing/deserializing. I think it'd make sense to pass it around as a plain array and construct the class right before linting a file, but it might take me some time to refactor that, as I'm new to the project.
Please have a look at the PR and let me know what you think of the general idea
UPDATE: I found a workaround by passing the ConfigArray instance, then rehydrating the non-serializable members in the workers: https://github.com/eslint/eslint/pull/12191/files#diff-660ea0590a55a93f96e9f6979144e554R445
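The general dehydrate/rehydrate pattern behind that workaround looks like this. ConfigList, dehydrate, and rehydrate are illustrative stand-ins, not ESLint's actual ConfigArray code; the point is only that class instances don't survive worker message serialization, so you send plain data and rebuild the instance on the other side:

```javascript
// Stand-in for a class like ConfigArray that workers can't receive directly.
class ConfigList extends Array {}

function dehydrate(configList) {
  // Plain objects serialize cleanly; function-valued members (parsers,
  // plugins) would have to be dropped here and re-required in the worker.
  return configList.map(element => ({ ...element }));
}

function rehydrate(plainElements) {
  const list = new ConfigList();
  for (const element of plainElements) {
    list.push(element); // re-attach any non-serializable members here
  }
  return list;
}
```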
@isiahmeadows, That would mean you're loading all the configs and their plugins, once for every file. Instead of 1-2 syscalls per file, you're now making about a dozen, and that's after you resolve all of the internal require calls. So in reality, you'd easily have hundreds per file instead of a few. This might seem insignificant, but you're blocked by syscalls, hardware I/O, and initialization, all of which is far slower than the actual linting process
Totally. It seems that the complexity is O(n). We have a huge config, and execution through the CLI is insanely slow, even for a single file. When we just switch to eslint-config-airbnb (which is our base), everything is faster.
So there are two problems: first, it happens even when using executeOnText, and second, heavy plugins/presets. But that probably proves that there is also a problem in the linting itself.
I'm currently playing with both executeOnText and executeOnFiles called from a worker. It doesn't seem to be faster; the problem is just in the ESLint core.
As a status update for anyone following this issue, ESLint v7 shipped with a new public Node.js API. The previous CLIEngine API was entirely synchronous, which prevented parallel linting. The ESLint public API class has asynchronous methods that will allow us to implement parallel linting once we've settled on the right design. If you're curious about the design of the new API, you can read more in the RFC.
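For anyone unfamiliar with the v7 API, the asynchronous surface looks like this (ESLint, lintFiles, and loadFormatter are the real v7 names; the wrapper function and the glob pattern are illustrative):

```javascript
// Sketch of the async ESLint v7+ API; a future parallel core could fan
// lintFiles work out to workers behind this same interface.
async function lintAndReport(patterns) {
  const { ESLint } = require("eslint"); // v7+ public API

  const eslint = new ESLint();
  const results = await eslint.lintFiles(patterns); // async, unlike CLIEngine
  const formatter = await eslint.loadFormatter("stylish");
  return formatter.format(results);
}

// Usage (illustrative):
//   lintAndReport(["src/**/*.js"]).then(console.log);
```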
This is the current RFC in development for this feature:
https://github.com/eslint/rfcs/pull/42
Assuming eslint's output doesn't change whether you feed it 1 or 100 files, could something like this help?
find . -regex ".*\.\(js\|jsx\|ts\|tsx\)" | xargs -P0 -n 50 npx eslint
@nzakas @hamidzr there is https://github.com/pinterest/esprint
@JustFly1984 https://github.com/eslint/eslint/issues/3565#issuecomment-423573943