Node: let v8 use libuv thread pool

Created on 15 Mar 2017 · 36 Comments · Source: nodejs/node

Is it possible to let V8 use the libuv thread pool? This could reduce the number of extra threads.

V8 Engine libuv question

All 36 comments

If that were possible, doing so would further reduce the performance of fs, dns.lookup() (default method used internally by node), etc.

@mscdex We can increase the number of libuv threads.
Doing so would also balance the thread pool load better.

Sure, you can increase it, but I would guess most people use the default (currently 4).

Well, we can also increase the default number of libuv threads...

I'm not 100% sure but I think 4 may have been chosen as that is/was the typical number of cores/cpus on most machines?

While increasing the default will hopefully be something we can do at some point, 4 is still the most common average if I'm remembering correctly.

In any case, having V8 use threads outside of the thread pool seems wrong. If 4 is the right default, you would in fact use more than 4 if V8 spins up its own threads. If you expect 4 + additional V8 threads to be a good number, then 4 is chosen too conservatively.

@hashseed I'm not sure what the performance difference is (waiting for a libuv thread vs. OS scheduler), but if V8 were to use the libuv thread pool, some node requests would/could get blocked (even more so than they may be currently), whereas they may not have before.

FWIW V8 hands over thread management to Chrome when embedded in Chrome.

@hashseed Could you please explain why Chrome does that?

Node is quite a different thing than Chrome though ;-)

@jeisinger probably knows more.

IIUC V8 simply prefers to let the embedder take control. In the case of Chrome, page start-up can be very busy with respect to threading, and Chrome's scheduler likely has a better grasp than the OS scheduler.

In the past, each V8 isolate would create its own worker pool. To fix that, I moved the worker pool to a single v8::Platform instance.

Furthermore, Chrome uses a central scheduler to, e.g., throttle tasks in background tabs or give rendering-related tasks a higher priority when the frame deadline approaches. To achieve this, it uses a custom v8::Platform implementation.

FWIW, the reason V8 has its own threadpool is that it was the least amount of work. The plan was always to come up with something better.

What constitutes 'better' is something of an open question, though. Blindly shoveling it into the libuv threadpool will probably cause a performance hit.

> I'm not 100% sure but I think 4 may have been chosen as that is/was the typical number of cores/cpus on most machines?

I think the conservative wisdom is to set threads = number of cores. This holds for when those threads are being used for CPU intensive tasks. The reason being that setting 4 threads for 4 cores reduces the cost of context switching, and then you would typically pin them to a core etc.

But the threads = number of cores idea only holds for when those threads are being used for CPU intensive tasks. In reality, as far as I understand, Node really uses the threadpool to simulate non-blocking disk and network operations where the underlying system does not provide a decent non-blocking api. So Node's threads are not "run hot".

Therefore, for Node, it would actually make much more sense to increase the default number of threads available for disk and network operations. This would reduce head-of-line blocking caused by slow DNS requests or degraded disks, by increasing concurrency. The time-scale of context switches in this async use-case is dwarfed by the time-scale of disk or network IO. Also, the memory overhead of libuv threads is low: 128 threads should cost 1 MB in total according to the libuv docs (http://docs.libuv.org/en/v1.x/threadpool.html)?
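The head-of-line blocking argument above can be illustrated with a small simulation. This is only a sketch in Python, with `concurrent.futures.ThreadPoolExecutor` standing in for the libuv threadpool and `time.sleep` standing in for a slow DNS lookup or a degraded disk; the function names and the pool sizes are assumptions made for illustration, not anything taken from libuv or Node.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io(seconds: float) -> None:
    """Stand-in for a blocking call such as a slow DNS lookup or disk read."""
    time.sleep(seconds)


def fast_request_latency(pool_size: int, slow_tasks: int = 4,
                         slow_seconds: float = 0.5) -> float:
    """Return how long a near-instant request takes when queued behind slow ones."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Queue the slow requests first so they occupy the worker threads.
        for _ in range(slow_tasks):
            pool.submit(slow_io, slow_seconds)
        # Then submit one fast request and time it end to end.
        start = time.perf_counter()
        pool.submit(slow_io, 0.0).result()
        return time.perf_counter() - start


if __name__ == "__main__":
    for size in (4, 16):
        latency = fast_request_latency(size)
        print(f"pool size {size}: fast request took {latency:.3f}s")
```

With four workers and four slow requests, the fast request has to wait for a worker to free up; with sixteen workers it is picked up almost immediately. The sketch says nothing about the kernel-side cost of many threads, which is a separate concern.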

@jorangreef What you say is true but there is a caveat: system calls can consume significant kernel-mode CPU time. I've experimented with the approach you suggest but performance for some operations falls off a cliff when you have many more threads than cores.

Example: stat(2) on deep (or deeply symlinked) file paths. The inode metadata is probably in the dirent cache but traversing it can still be costly. If you have many threads hammering the cache, performance degrades in worse than linear fashion.

Most programs would not exhibit such pathological behavior but a specialized tool like a backup program might (and people do write such things in node.js.)

@bnoordhuis Would it not then make sense to allow writers of such problematic tools to reduce the thread count to equal the core count, rather than force most people, working on simpler things, to understand they can increase the thread count to some unstudied number?

@bnoordhuis would it be possible to dynamically scale the thread count without incurring excessive overhead? I'm guessing you could be quite conservative and only change the number when you start to run into heavy performance problems.

I remember @sam-github saying that the node event loop time was a pretty good indicator of how your app was performing. What would the equivalent be for this?

I guess my question is: is it just a question of doing the work, or is it likely to result in a net performance decrease in most situations?

Libuv could use the work queue size or the average flight time of a work request as an input signal but that doesn't distinguish between threads waiting for I/O and threads doing computation.

For I/O-bound work loads num_threads > num_cpus is usually acceptable, for CPU-bound loads it's not. Computing bitcoin hashes in 1,000-fold parallel on a machine that only has 8 cores is not efficient.
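To make the two candidate input signals mentioned above concrete (work queue size and average flight time of a request), here is a minimal Python sketch. `InstrumentedPool` and `busy` are hypothetical names invented for this illustration, and `ThreadPoolExecutor` is only a stand-in for the libuv threadpool. "Flight time" is assumed here to mean the time from submission to completion, including time spent queued.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


class InstrumentedPool:
    """Thread pool that tracks in-flight requests and average flight time."""

    def __init__(self, max_workers: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._in_flight = 0        # submitted but not yet completed
        self._completed = 0
        self._total_flight = 0.0   # seconds, summed over completed requests

    def submit(self, fn: Callable[..., Any], *args: Any) -> Future:
        submitted_at = time.perf_counter()
        with self._lock:
            self._in_flight += 1

        def on_done(_: Future) -> None:
            elapsed = time.perf_counter() - submitted_at
            with self._lock:
                self._in_flight -= 1
                self._completed += 1
                self._total_flight += elapsed

        future = self._pool.submit(fn, *args)
        future.add_done_callback(on_done)
        return future

    def stats(self) -> dict:
        with self._lock:
            avg = self._total_flight / self._completed if self._completed else 0.0
            return {"in_flight": self._in_flight, "avg_flight_s": avg}

    def shutdown(self) -> None:
        # Joining the workers guarantees that all done-callbacks have run.
        self._pool.shutdown(wait=True)


def busy(seconds: float) -> None:
    """Stand-in for CPU-bound work: spin until the deadline passes."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


if __name__ == "__main__":
    for label, task in (("sleeping (I/O-like)", time.sleep), ("spinning (CPU-like)", busy)):
        pool = InstrumentedPool(max_workers=2)
        for _ in range(4):
            pool.submit(task, 0.05)
        pool.shutdown()
        print(label, pool.stats())
```

Both runs report similar flight times even though one keeps the threads parked and the other keeps them spinning, which is the limitation described above: the signal alone does not say whether adding threads would help.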

Libuv knows what category its own work items belong to (fs, dns - all I/O) but it doesn't know that for external work items that come in through uv_queue_work(), which is what crypto.randomBytes() and the zlib functions use, for example.

What's more, libuv only knows about the current process but a common use case with node is to spawn multiple processes. You don't want to get into a perfect storm situation where multiple processes think "hey, throughput times are going up, let's create some more threads."

It's not intractable but fixing it in either node [1] or libuv will involve a large amount of engineering for an uncertain payoff.

Perhaps determining the ideal thread pool size is more of an ops thing. We should expose hooks but leave tuning it to the programmer or the system administrator.

[1] Node could side-step libuv, use its own thread pool and orchestrate with other node processes over IPC.

Note that https://github.com/nodejs/node/pull/14001 switches V8 to libuv's threadpool.

/cc @matthewloring

@bnoordhuis Is it currently possible to provide libuv with a hint to limit CPU-heavy threads, while allowing more I/O-heavy threads? V8 has an ExpectedRuntime enum we could maybe hook onto.

@TimothyGu Not at the moment.

That enum is not used by V8. We only added it because Chrome used to use the Windows worker pool, and the enum reflects its WT_EXECUTELONGFUNCTION flag.

The problem is that a single threadpool is used for IO and CPU threads.

If there were two threadpools, one could be used for IO (and have many threads) and the other could be used for CPU (and have threads === cores).

This would make tuning possible. Currently, there's no way to tune the threadpool for both use cases.

If libuv knows that all its threads are IO only, then the above can be rolled out as follows:

  1. All AsyncWorker, Node and other userland threads continue to use the current threadpool. This becomes known as the CPU threadpool and keeps the existing 4 thread default.

  2. Add a new "IO" threadpool with a saner, conservative yet tuneable default (perhaps 16).

  3. Libuv moves only its own known IO ops into this new threadpool and exposes a way for userland to do the same if it wants to.
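The three steps above can be sketched roughly as follows. This is an illustration in Python, not libuv code: `SplitPools`, `WorkKind` and `current_thread_name` are hypothetical names, and the sizes simply reuse the numbers floated in the proposal (4 for the existing pool, 16 for the new I/O pool). Python's global interpreter lock means the "CPU" pool would not actually run Python bytecode in parallel; the sketch only shows the routing.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from enum import Enum
from typing import Any, Callable


class WorkKind(Enum):
    IO = "io"
    CPU = "cpu"


class SplitPools:
    """Route each work item to an I/O pool or a CPU pool based on a caller-supplied tag."""

    def __init__(self, io_threads: int = 16, cpu_threads: int = 4) -> None:
        self._pools = {
            WorkKind.IO: ThreadPoolExecutor(io_threads, thread_name_prefix="io"),
            WorkKind.CPU: ThreadPoolExecutor(cpu_threads, thread_name_prefix="cpu"),
        }

    def submit(self, fn: Callable[..., Any], *args: Any,
               kind: WorkKind = WorkKind.CPU) -> Future:
        # Untagged work stays in the existing (CPU) pool, as in step 1;
        # only work explicitly tagged as I/O moves to the new pool (step 3).
        return self._pools[kind].submit(fn, *args)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)


def current_thread_name() -> str:
    """Report which pool thread ran the task."""
    return threading.current_thread().name


if __name__ == "__main__":
    pools = SplitPools()
    print(pools.submit(current_thread_name, kind=WorkKind.IO).result())
    print(pools.submit(current_thread_name).result())
    pools.shutdown()
```

Defaulting untagged work to the CPU pool keeps existing callers unchanged, which matches the intent of step 1. Whether the split performs better in practice is exactly what the thread goes on to debate.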

@jorangreef We've been discussing such schemes in libuv since the thread pool was first added. You can find at least two attempts in my fork, maybe more.

Both attempts stranded on not performing significantly better most of the time and significantly worse some of the time. :-/

Thanks @bnoordhuis

Were you benchmarking latency or throughput? "Significantly worse" latency (say 1.1x) may not be that bad if it means significantly more throughput.

Was the workload mostly IO bound or mostly CPU bound?

The idea with two separate threadpools is to delegate these choices to the user. Everyone's benchmark requirements are different.

And if there is no advantage gained by separating CPU and IO intensive operations into their own threadpools, then the arguments against increasing the default 4 thread limit should no longer hold.

Are you 100% confident that the current hardcoded 4 thread default is the optimal solution?

> Are you 100% confident that the current hardcoded 4 thread default is the optimal solution?

Hah, I don't think I ever claimed it was. Good enough most of the time, but optimal? No, sir!

> Were you benchmarking latency or throughput? [...] Was the workload mostly IO bound or mostly CPU bound?

A bit of both. Node and libuv's benchmarks test both ends of the spectrum.

> A bit of both. Node and libuv's benchmarks test both ends of the spectrum.

Just to double-check, when you benchmarked using two separate threadpools (one for IO tasks, one for CPU tasks), did you let both threadpools contend for the same set of cores or did you pin them to separate sets of cores?

In a sufficiently warmed-up webapp (assuming the most common node use case), do the v8 threads incur any significant CPU work? I thread-profiled a client server app with ~20% CPU consumption by the server process, and could not find v8 threads as contributing. Any suggested (v8) tunables to get more conclusive info?

@gireeshpunathil the v8 threads mostly handle i/o, I've only seen an uptick in thread cpu with heavy garbage collection activity. Nodejs is single-threaded, so the main thread will normally account for almost all the cpu usage. One can introduce some concurrency with the cluster module.

@andrasq - thanks. But sorry, your explanation seems orthogonal to my understanding:

> v8 threads handle mostly i/o

Which I/O? Are you talking about the primordial (main) thread? I doubt that is classified as a V8 thread.

> Nodejs is single-threaded, ...

In this discussion and in #14001 we are definitely talking about a number of background threads from the V8 module, which are different from the primordial thread and the libuv worker threads. This discussion focuses on the tradeoff of having the V8 threads share a pool with the libuv worker threads.

So, my question stands as:
(i) What are those V8 threads which will potentially contend for time slices with more critical work from libuv? Execution tracers? CPU profilers? GC helpers? JIT helpers?
(ii) How do we bring those threads to the forefront to be seen as eating up cycles (any tunables on the command line or in the code)? This will help us observe how the time-slice distribution varies between the threads and make inferences about throughput and latency as a function of some of these tunables.
(iii) How significant would these threads be in terms of CPU consumption in a fully warmed-up production server (no tracing and no new scripts)?

> What are those v8 threads which will potentially contend for time slice with more critical work from libuv?

Compiler and GC threads.

> How do we bring those threads to the forefront to be seen as eating up cycles?

perf(1)?

> How significant these threads would be in terms of CPU consumption in a fully warmed up production server

Depends on the application.

Thanks @bnoordhuis - that explains it.
In terms of measurements, perf(1) did not help to get a thread-wise breakdown (there is a -t flag, but its behavior is weird), so I was using AIX tprof, and the result, as I mentioned earlier, suggests that the V8 threads contribute very little (they are not displayed among the top consumers).

I agree this depends on the workload characteristics of the application. Given:
(i) every new page brings in new script (chrome) vs. relatively boot-time-only script (node)
(ii) highly transactional based web workload means most objects collected in the scavenge phase

the V8 threads may not consume much of a time slice for either GC or JIT, which is in line with my tprof observation.

I wish I could modify the test to manifest the contrasting characteristics, and I was/am looking for pointers along those lines, so that we know the extent to which these threads are insignificant and what the tipping point is beyond which they show up.

This discussion seems to have run its course. I'm going to close, but if you think that's wrong and there's something concrete, feel free to re-open (if GitHub allows) or comment (requesting it be re-opened if you wish) or open another issue as appropriate.

FWIW the work for creating a Node.js-specific v8::Platform is being done at https://github.com/nodejs/node/pull/14001. In that PR, it was decided not to merge the libuv and V8 thread pools into one, but instead manage them separately for performance.

And FWIW, in #22631 I am working on uniting the V8 and libuv threadpools in Node-land using my pluggable threadpool PR in libuv.
