Presto: 5-10% idle CPU consumption

Created on 8 Jun 2020 · 13 Comments · Source: prestosql/presto

I noticed that an idle (local) install of Presto (any recent version I tried) consumes 5-10% CPU.

That might seem like a trifle, but on modern machines it prevents the CPU and other subsystems from entering deeper sleep states and hence wastes power.

In the profiler I tracked it down to the AuthenticationFilter and there mostly to

io.prestosql.server.InternalAuthenticationManager.parseJwt (String)
and
io.prestosql.server.ServletSecurityUtils.withAuthenticatedIdentity (...)

The threads are all called http-worker-xxx.

It's not immediately clear how to fix it, but I thought I'd at least record this here.
If anyone has some ideas I'm happy to work on it.

lhofhansl

All 13 comments

That is the authentication of internal communication between servers, which uses JWT with SHA-256 (shared secret). When servers are "idle" there are still a few communications that happen. There is the discovery announcement, which maintains a directory of the servers running Presto so the coordinator can find them. Then there is the failure detector, which pings these servers to determine whether they are alive.
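For readers unfamiliar with shared-secret JWTs, here is a minimal sketch of HMAC-SHA-256 (HS256) signing and verification using only the JDK. It is not Presto's code (Presto uses the JJWT library, as mentioned later in the thread); the class name, method names, and payload are hypothetical, and HS256 is assumed from the "SHA-256 (shared secret)" description above.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration of a shared-secret (HS256) JWT; not Presto's implementation.
final class SharedSecretJwtSketch
{
    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

    private SharedSecretJwtSketch() {}

    // Builds header.payload.signature, where the signature is an HMAC over "header.payload".
    static String sign(String payloadJson, byte[] sharedSecret)
    {
        String header = B64.encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = B64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + B64.encodeToString(hmacSha256(signingInput, sharedSecret));
    }

    // Recomputes the HMAC with the shared secret and compares it to the token's signature.
    static boolean verify(String token, byte[] sharedSecret)
    {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) {
            return false;
        }
        byte[] expected = hmacSha256(token.substring(0, lastDot), sharedSecret);
        try {
            byte[] actual = Base64.getUrlDecoder().decode(token.substring(lastDot + 1));
            return MessageDigest.isEqual(expected, actual);
        }
        catch (IllegalArgumentException e) {
            // Signature part is not valid base64url.
            return false;
        }
    }

    private static byte[] hmacSha256(String input, byte[] secret)
    {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(input.getBytes(StandardCharsets.UTF_8));
        }
        catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The HMAC itself is cheap. A real parser also has to decode and validate the JSON claims, which this sketch leaves out.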

When you say 5-10% are you talking about global CPU or just one core? Also, are you talking about the coordinator, workers, or both?

dain on 8 Jun 2020

Thanks @dain

It's just 5-10% of one core. But it prevents an entire package (with multiple cores) from going into deeper power states.

This is a one-node install (so that I can point a profiler/debugger at it easily), so it's hard to tell whether it's the worker or the coordinator.

It's not a big deal. Mostly I was surprised. And it would be nice to save power.
The JWT parsing looks to be somewhat expensive. There are easily tens of thousands of calls to com.fasterxml.jackson.databind.util.ClassUtil._addSuperTypes, com.fasterxml.jackson.databind.AnnotationIntrospector._findAnnotation, and various other Jackson classes.

lhofhansl on 8 Jun 2020

The version of JJWT we are using creates an ObjectMapper every time in DefaultJwtParser, which could have some overhead. This has changed in newer versions of the library (the overall library design is a bit different). Would you like to send a PR to upgrade the version? It's unclear if this would make a material performance difference, but it would be good to do anyway.

electrum on 8 Jun 2020

Regarding idle states for processors, what's the minimum level of usage that would allow this? If you are doing a very simple operation every 100ms, would that prevent sleeping?

electrum on 8 Jun 2020

I tried to upgrade to jjwt 0.10.8, but I see the same idle CPU usage.

Also checked the number of wakeups... The presto process wakes up about 600-700x / sec, which seems excessive. Edit: That includes all timers, though.

Just checked on my machine: the threshold latency for the lowest power state is 890us, and the threshold residency to go to the next higher state is 5000us. But that will vary from machine to machine.

So a simple op every 100ms should be OK, but waking up every 1 or 2ms is "not OK".
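As a rough sketch of that reasoning, assume a deeper idle state is only worth entering when the expected gap between wakeups is at least the state's target residency. That is a simplification; real idle governors use more elaborate heuristics. The class and method names are hypothetical, and the 5000us figure is the one measured above.

```java
// Simplified model (an assumption): a deeper idle state is only reachable when
// the idle gap between wakeups is at least the state's target residency.
final class IdleResidencySketch
{
    private IdleResidencySketch() {}

    static boolean allowsDeeperState(long wakeupIntervalMicros, long targetResidencyMicros)
    {
        return wakeupIntervalMicros >= targetResidencyMicros;
    }

    public static void main(String[] args)
    {
        long residencyMicros = 5_000; // value reported for this particular machine
        for (long intervalMicros : new long[] {1_000, 2_000, 10_000, 100_000}) {
            System.out.println(intervalMicros + "us -> " + allowsDeeperState(intervalMicros, residencyMicros));
        }
    }
}
```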

lhofhansl on 9 Jun 2020

Actually, I tracked the wakeups to InternalResourceGroupManager

Specifically:

        if (started.compareAndSet(false, true)) {
            refreshExecutor.scheduleWithFixedDelay(this::refreshAndStartQueries, 1, 1, TimeUnit.MILLISECONDS);
        }

If I change that to 100ms (or even just 10ms), CPU usage goes down to 2-5% and the package is allowed to enter deeper c-states (at least 40-50% of the time).

Since we're calculating CPU quotas in there, I guess we want to run frequently, but every single ms?
(Especially since it looks like we only act when crossing second boundaries anyway.)
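To see the effect of the interval in isolation, here is a minimal sketch that schedules a trivial task with a configurable fixed delay and counts how often it runs. The class is a hypothetical stand-in, not InternalResourceGroupManager, and the counter replaces the real refresh work.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the refresh loop; the counter replaces the real work.
final class RefreshLoopSketch
{
    private final ScheduledExecutorService refreshExecutor = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean started = new AtomicBoolean();
    private final AtomicLong runs = new AtomicLong();
    private final long intervalMillis;

    RefreshLoopSketch(long intervalMillis)
    {
        this.intervalMillis = intervalMillis;
    }

    void start()
    {
        // Same guard as in the snippet above: schedule at most once.
        if (started.compareAndSet(false, true)) {
            refreshExecutor.scheduleWithFixedDelay(runs::incrementAndGet, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        }
    }

    long stopAndCountRuns()
    {
        refreshExecutor.shutdownNow();
        return runs.get();
    }

    // Runs the loop for windowMillis and returns how many times the task executed.
    static long countRuns(long intervalMillis, long windowMillis)
    {
        RefreshLoopSketch loop = new RefreshLoopSketch(intervalMillis);
        loop.start();
        try {
            Thread.sleep(windowMillis);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loop.stopAndCountRuns();
    }
}
```

Comparing `countRuns(1, 1000)` with `countRuns(100, 1000)` gives a feel for how many more wakeups the 1ms setting causes, even when each run does almost nothing.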

@dain @electrum

lhofhansl on 9 Jun 2020

👍3

@lhofhansl thanks for the thorough investigation.
I think the potential reason for running this so frequently is to be able to unblock a queued query as soon as it is eligible
(if we run this every x ms, we add up to x ms of delay).
However, I am not convinced that this warrants running the check so often.

Note also that https://github.com/prestosql/presto/pull/1128 changed (significantly?) the amount of work done per run of refreshAndStartQueries.

cc @phd3

findepi on 9 Jun 2020

👍1

Interesting. Since I saw the process wake up about 600-700x/s, each run for sure takes less than 0.4-0.6ms when idle (probably quite a bit less, since there's context switching, etc.). Also, going by the low CPU usage of 5-10%, the work done in refreshAndStartQueries is probably very little when idle, i.e. the scheduling overhead is high compared to the work done.
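One way to arrive at that bound, assuming every wakeup belonged to this task: with scheduleWithFixedDelay, one cycle is the delay plus the run time, so the run time is the cycle length minus the 1ms delay. Other timers also contribute wakeups, so the result is an upper bound. The class and method names below are hypothetical.

```java
// Back-of-the-envelope estimate; assumes all observed wakeups come from one
// fixed-delay task, so the result is an upper bound on the per-run time.
final class FixedDelayMath
{
    private FixedDelayMath() {}

    // cycle = delay + run time, and cycle = 1000ms / wakeups per second.
    static double impliedRunMillis(double wakeupsPerSecond, double delayMillis)
    {
        return 1000.0 / wakeupsPerSecond - delayMillis;
    }

    public static void main(String[] args)
    {
        System.out.printf("700/s: %.2f ms%n", impliedRunMillis(700, 1));
        System.out.printf("600/s: %.2f ms%n", impliedRunMillis(600, 1));
    }
}
```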

It seems unblocking a query after 10 or 20ms should be good enough (and larger quanta are probably better for throughput anyway), but I have nothing to base that specific number on. :)

I don't offhand see any place where we interrupt queries when CPU quanta expire. Is it only checked at the beginning of a query?

lhofhansl on 10 Jun 2020

One more data point:
With defaults I see this in the jstack:
"ResourceGroupManager" ... cpu=6236.68ms elapsed=304.04s

When I set the scheduling to 10ms I see:
"ResourceGroupManager" ... cpu=1556.73ms elapsed=302.88s ...

So the default uses about 4x the CPU time.

For comparison, the maximum among all the other threads is in the ballpark of 100ms of CPU time.
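The per-thread figures can be turned into a share of one core by dividing cpu by elapsed. Here is a small sketch that parses the two fields from a jstack thread line. The class name and regex are assumptions, and the sample lines keep the elided "..." from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts cpu= and elapsed= from a jstack thread line.
final class JstackCpuShare
{
    private static final Pattern CPU_ELAPSED = Pattern.compile("cpu=([0-9.]+)ms\\s+elapsed=([0-9.]+)s");

    private JstackCpuShare() {}

    // Returns the thread's CPU time as a percentage of one core over its lifetime.
    static double cpuSharePercent(String jstackLine)
    {
        Matcher matcher = CPU_ELAPSED.matcher(jstackLine);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no cpu=/elapsed= fields in: " + jstackLine);
        }
        double cpuMillis = Double.parseDouble(matcher.group(1));
        double elapsedMillis = Double.parseDouble(matcher.group(2)) * 1000.0;
        return 100.0 * cpuMillis / elapsedMillis;
    }

    public static void main(String[] args)
    {
        double at1ms = cpuSharePercent("\"ResourceGroupManager\" ... cpu=6236.68ms elapsed=304.04s");
        double at10ms = cpuSharePercent("\"ResourceGroupManager\" ... cpu=1556.73ms elapsed=302.88s ...");
        System.out.printf("1ms: %.2f%%, 10ms: %.2f%%, ratio: %.1fx%n", at1ms, at10ms, at1ms / at10ms);
    }
}
```

With the numbers above, this works out to roughly 2% of one core at 1ms versus roughly 0.5% at 10ms.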

lhofhansl on 10 Jun 2020

I personally have no objections to changing this number to 100ms.

findepi on 10 Jun 2020

I'll file a PR, and then we can continue there.

lhofhansl on 10 Jun 2020

@lhofhansl Thanks for the detailed investigation. Changing the frequency to 100ms sounds reasonable.

@findepi #1128 adds some work in internalGenerateCpuQuota and updateGroupsAndProcessQueuedQueries. internalGenerateCpuQuota only executes once per second. In an idle cluster, that extra work should not happen, since it is only triggered if the CPU usage limit is crossed. The effective change in updateGroupsAndProcessQueuedQueries is that it updates both CPU and memory instead of just memory. So I wouldn't expect it to impact the CPU consumption significantly here. As @lhofhansl mentioned, I suspect the cause is the scheduling overhead rather than the amount of work done.