Node: [discuss] event loop idle metrics

Created on 23 Apr 2020 · 25 comments · Source: nodejs/node

@trevnorris is working on landing a change to libuv that will track and report the amount of time the event loop spends idle. It's an extremely useful metric that gives us a measure of "event loop utilization". In a world of worker threads, monitoring CPU usage is no longer an effective way of monitoring performance, and event loop delay is not enough on its own, so having a built-in mechanism for measuring event loop utilization would be fantastic. While there is still some work to be done to get the PR landed in libuv and to get that new libuv version into core, I did want to briefly discuss how the new metric should be exposed in core.

In this comment @trevnorris suggests a simple performance.idleTime() that returns the direct value of this metric, which records the cumulative time spent idle since the loop was configured to track it. To calculate event loop utilization, however, we also need to know how long the event loop has been running (to be specific, how long it's been since the loop was configured to collect the data, which can be turned on but not turned off). Assuming we started the loop and started collecting the metric at the same time, we do already record the start time of the event loop (using the performance milestones), so someone could calculate the utilization on their own by accessing those values. However, I think it might make more sense for us to just do the calculation for users and provide an API like performance.idleTime() that returns an object with two values, { idle: n, utilization: y }, where idle is the raw idle time and utilization is the calculated utilization value. The API should be very low cost to sample, using an AliasedArray or AliasedStruct as a backing.
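
For illustration, a rough sketch of the shape being proposed here; performance.idleTime() and its return value are hypothetical at this point:

const { performance } = require('perf_hooks');

// Hypothetical API shape under discussion, not an existing function:
const { idle, utilization } = performance.idleTime();
// idle:        raw cumulative idle time since the loop began collecting the metric
// utilization: idle divided by how long the loop has been running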

/cc @nodejs/diagnostics @addaleax @mcollina

Labels: feature request, perf_hooks, performance

Most helpful comment

Now that the patch has landed in libuv v1.39.0 I'll create a PR with this feature ~~once it has~~ now since it has landed in node.

All 25 comments

I think it's important that what the Node.js API returns is mappable to Prometheus in as high-value a way as possible. I'll take a look at this in the next day or so.

Is utilization here the wall clock time since measurement starts or the actual CPU time consumed by the thread executing the loop tasks?

@Flarna It's a bit in depth, but you should read my post on https://github.com/libuv/libuv/pull/2725#issuecomment-594752268 about the differences.

TL;DR "idle time" is measured using uv_hrtime(), which I believe defaults to monotonic, and represents the amount of time the event loop was idle, not the amount of time the event loop spent in the event provider (e.g. poll). "event loop utilization" is "idle time" divided by the amount of time the event loop has been running (also measured using uv_hrtime()).
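
A worked example with made-up numbers, following that ratio:

// Made-up numbers for illustration only.
const idleNs = 7.5e9;                    // 7.5 s reported as idle (nanoseconds, via uv_hrtime())
const runningNs = 10e9;                  // the loop has been running for 10 s
const utilization = idleNs / runningNs;  // 0.75: the loop spent 75% of its time idle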

@trevnorris Good post!

I think a common use case would be to calculate utilization for regular intervals, not always from loop startup. Therefore I'm not sure adding this overall utilization from loop start is that useful.
But as long as a call to performance.idleTime() isn't stateful, this is possible for everyone to do.
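
A minimal sketch of that interval-based approach, assuming a stateless performance.idleTime() that returns the cumulative idle time in nanoseconds (still a hypothetical API at this point):

const { performance } = require('perf_hooks');

let prevIdle = performance.idleTime();       // hypothetical: cumulative idle time in ns
let prevTime = process.hrtime.bigint();

setInterval(() => {
  const idle = performance.idleTime();
  const now = process.hrtime.bigint();
  const idleDelta = idle - prevIdle;         // idle time accumulated in this interval
  const elapsed = Number(now - prevTime);    // wall-clock ns elapsed in this interval
  console.log('idle fraction over the last second:', idleDelta / elapsed);
  prevIdle = idle;
  prevTime = now;
}, 1000);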

Looks very useful to me.

Currently this mechanism uses uv_loop_configure() to enable the metrics tracking; the option can be enabled, but there are currently no semantics for disabling it. Once enabled, the idle stats in libuv are calculated continuously over the lifetime of the event loop and are not reset when read. The idle time accumulates indefinitely until the uint64_t overflows.

In theory, we could record the moment performance.idleTime() is called internally in Node.js, track that in the response, and use it as part of the calculation of the utilization, but that starts to make the API more complicated... for instance:

// Get the idle time since the start of the event loop
let idle = performance.idleTime()
// idle = { idle: n, mark: x, utilization: y }
// where mark == the uv_hrtime() when performance.idleTime() was processed.

// then later...
idle = performance.idleTime(idle)
// idle = { idle: n, mark: x, utilization: y }
// where idle is now the difference between the current idle time and the provided idle.idle.
// mark is the new uv_hrtime()
// and utilization is the new idle value divided by the time since the provided idle.mark

Definitely possible but obviously starts to increase the complexity of the API so we need to be careful.

@Flarna Instead of reporting "utilization", if idleTime() returned { idle: n, active: m }, where active is the cumulative time spent in uv_run(), then it would be easy enough for users to implement their own.

A simple example:

// Note: performance.idleTime() here is the proposed API returning cumulative { idle, active }.
const { performance } = require('perf_hooks');

function loopUtilization(last, current = performance.idleTime()) {
  return (current.idle - last.idle) / (current.active - last.active);
}

const now = performance.idleTime();
const t = Date.now();
while (Date.now() - t < 3000);  // busy-wait ~3 s so the loop has no idle time to report
console.log(loopUtilization(now));

Though if active is included then maybe it shouldn't be called idleTime() anymore. :P

Yep, so the key question here is: do we provide the calculation of utilization for users, or do we just provide the idle time and leave it to them to calculate?

If it's similar to my example above, to prevent yet another one-liner module we might just as well add it. So maybe it would operate something like so:

// This simply logs out the loop utilization for a three second duration
setTimeout((last) => {
  console.log(performance.loopUtilization(last));
}, 3000, performance.idleTime());

Internally we just store active_time = uv_hrtime() right before calling uv_run(). So, as some pseudo-code to show the calls between JS and C++, the return value of idleTime() would be something like:

{
  idle: uv_metrics_idle_time(event_loop()),
  active: active_time,
}

I'm in favor of an easy, stateless API. If we add state we have to think about multi-user support.
As shown above, utilization is easily calculated if idle time is available. I assume active above is actually the wall clock time in the same unit as idle time.

I assume active above is actually the wall clock time in the same unit as idle time.

Yes. It is the same unit as idle time.

Given the likely complexity of any other option, I'm +1 on the really simple Just Give Me The Idle Time API. We can calculate active from the existing perf milestones; we don't actually need the new API to give us that.
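
For completeness, a sketch of that do-it-yourself calculation; idleTime() is the hypothetical raw API, assumed here to report milliseconds so the units line up with the milestones:

const { performance } = require('perf_hooks');

const idle = performance.idleTime();                                   // hypothetical raw idle time (ms assumed)
const elapsed = performance.now() - performance.nodeTiming.loopStart;  // ms the loop has been running
console.log('idle fraction since loop start:', idle / elapsed);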

@jasnell Just as a note here, we should probably make idle time include Atomics.wait() once we implement an API for this in Node.js. That’s not hard to do, just something that seems like it’s easy to overlook :)

I'm not super familiar with how we'd be able to observe waits, but +1 to including those. @trevnorris, have you looked into that at all?

@jasnell Luckily, I’ve done quite a bit of work around that in the past: https://github.com/v8/v8/blob/1d00b7856f9071c2c62ad66dbdbfdebff6dc370c/include/v8.h#L8867-L8944

Adding support for measuring that timing should be fairly straightforward, we basically register a callback that’s called once when Atomics.wait() starts and once when it ends.

(I’ve also been meaning to add a logging API for those events to Node.js, for better debugging Atomics.wait() deadlocks.)

To my understanding, the time spent in Atomics.wait() is not idle time in the sense that the thread could process something else during it. Therefore I don't think we should add that time here.
But monitoring the time spent in Atomics.wait() is for sure a good enhancement.

@Flarna Well … that's a good point, but I think it's a bit more complex than that. Maybe we need to make this customizable.

I’m currently working on code that uses Atomics.wait() when nothing else is happening on the event loop in a Worker, as a way of reducing latency and overall overhead. That means that the Atomics.wait() call effectively fulfills the function of the event loop wait, and should count as idle time.
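
A minimal sketch of that pattern (illustrative names, not an existing module): the worker blocks in Atomics.wait() on a shared Int32Array until the main thread stores a non-zero value and calls Atomics.notify(), so the wait effectively replaces the loop's own poll as the place the thread sits idle.

// worker.js (illustrative): workerData.shared is a SharedArrayBuffer passed by the main thread
const { workerData } = require('worker_threads');
const state = new Int32Array(workerData.shared);

for (;;) {
  // Block while state[0] === 0; this wait stands in for the event loop's idle poll.
  Atomics.wait(state, 0, 0);
  Atomics.store(state, 0, 0);   // reset the flag
  // ... handle whatever work the main thread queued up ...
}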

@addaleax

That means that the Atomics.wait() call effectively fulfills the function of the event loop wait, and should count as idle time.

The metric should only measure idle time in the event provider, specifically the amount of time spent idle while it waits for an event to be received (the reason for this is explained in https://github.com/libuv/libuv/pull/2725#issuecomment-594752268). Looking at AtomicsWaitEvent, it seems like Atomics.wait() could be considered a type of event provider. Though if it's working in conjunction with uv_run() it'd probably be better to not include it.

Well, I guess that actually depends on whether the Worker's event loop has any handles from which it expects to receive an event. If not, then Atomics.wait() would be considered the event provider. And since signal handlers can't be added from a Worker thread, there wouldn't be another way to interrupt the event provider. And if you're running with UV_RUN_NOWAIT then there technically wouldn't ever be idle time in uv_run()'s event provider. So I can see what you're getting at.

That being said, measuring the amount of time spent in Atomics.wait() would be a great step towards calculating total CPU idle time per thread. Since there isn't a platform-agnostic way to do getrusage(RUSAGE_THREAD, ...), being able to include all other types of idle time is immensely helpful for analysis.

So for usability maybe just provide both times as separate values? IMO that would be the most helpful.

TL;DR idleTime() should only return the amount of idle time within the _event provider_, not just idle time in uv_run(). So if Atomics.wait() is the acting event provider, and uv_run() is running with UV_RUN_NOWAIT, then idleTime() should return the amount of time spent in Atomics.wait().

With respect to the question of whether to just return the raw number since the idle time metric was enabled, or to try to return a calculated value over an interval, my preference is to leave it as the raw number.

Otherwise it's a slippery slope of people wanting to be able to specify the interval etc. so probably better to just let them calculate from the raw number in the first place and keep it as simple as possible.

Just had a great catch up with @trevnorris... it's been too long since we've chatted! On this specific API, I'm going to move forward with a draft PR that pulls in his libuv changes and adds a field to the existing Performance API nodeTiming object to get the current idle time. Using that, and the already existing loopStart timestamp, it ought to be trivial to calculate the utilization value.

// utilization = cumulative idle time / time the loop has been running (both in ms)
nodeTiming.idleTime / (performance.now() - nodeTiming.loopStart)

We can provide the cumulative Atomics.wait time as a separate property off nodeTiming.
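
A slightly fuller sketch of that calculation, treating nodeTiming.idleTime as the proposed (not yet landed) property and assuming it reports milliseconds:

const { performance } = require('perf_hooks');
const { nodeTiming } = performance;

// Time the loop has been running, in ms relative to timeOrigin.
const elapsed = performance.now() - nodeTiming.loopStart;
// Proposed property: cumulative idle time, assumed to be in the same unit.
const utilization = nodeTiming.idleTime / elapsed;
console.log('event loop utilization (idle / elapsed):', utilization);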

Now that the patch has landed in libuv v1.39.0 I'll create a PR with this feature ~~once it has~~ now since it has landed in node.

It has already landed in Node in https://github.com/nodejs/node/pull/34915.

EDIT: Ah, I see the original comment was already updated. Ignore this comment 😄

PR #34938 has landed. @trevnorris, I assume this can be closed now?

Probably. :)
