Distributed: excessive CPU time spent on gc (even after manually adjusting gc thresholds)

Created on 25 Jun 2019 · 13 Comments · Source: dask/distributed

During the process I'm running I very quickly get this warning (for pretty much each worker)

distributed.utils_perf - WARNING - full garbage collections took 36% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)

However, the percentage of memory used by each worker is low.

I have plenty of memory to work with so my question is - is it possible dask is aggressively doing gc? If so is it possible to change the gc threshold/collect less aggressively?

I've tried to manually adjust the gc thresholds with gc.set_threshold, but this seems to have no effect:

import gc
g0, g1, g2 = gc.get_threshold()
gc.set_threshold(g0*10, g1*10, g2*10)

(using distributed 1.28.1 and dask 1.2.2)
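One possible reason the adjustment appears to have no effect: gc thresholds are per-process, so a call made in the client session does not change the collector inside separate worker processes. The sketch below assumes that is what happened here (the post does not say where the call was run). The helper name scale_gc_thresholds is hypothetical; Client.run is the distributed call that executes a function on every worker.

```python
import gc


def scale_gc_thresholds(factor=10):
    """Multiply this process's gc thresholds by `factor`; return the new values."""
    g0, g1, g2 = gc.get_threshold()
    gc.set_threshold(g0 * factor, g1 * factor, g2 * factor)
    return gc.get_threshold()


# Calling it directly only changes the current (client) process:
# scale_gc_thresholds(10)
#
# To apply it inside each worker process, it could be sent with Client.run
# (assumes a running cluster; the result is a dict keyed by worker address):
# from dask.distributed import Client
# client = Client()
# client.run(scale_gc_thresholds, 10)
```

Workers that are restarted afterwards start again with the default thresholds, so the call would have to be repeated.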

kindjacket

Most helpful comment

Hi,
just to add, I'm getting the same warnings while repartitioning a parquet file. I suspect that it has more to do with the computation being very quick, so that garbage collection becomes the more burdensome task. Maybe the solution is to only warn if GC takes a large fraction of a long task? That seems to be the idea behind the first comment in _gc_callback in distributed.utils_perf, which emits this warning:

    def _gc_callback(self, phase, info):
        # Young generations are small and collected very often,
        # don't waste time measuring them
        if info["generation"] != 2:
            return
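For readers unfamiliar with the mechanism, here is a minimal sketch of how a callback with this signature can time full collections using the standard gc.callbacks hook. It is a simplified illustration, not the code in distributed.utils_perf: the class name FullGCTimer is hypothetical, and it records wall-clock time only.

```python
import gc
import time


class FullGCTimer:
    """Accumulate wall-clock time spent in generation-2 (full) collections."""

    def __init__(self):
        self.total = 0.0
        self.count = 0
        self._start = None

    def __call__(self, phase, info):
        # Same filter as the snippet above: ignore young generations.
        if info["generation"] != 2:
            return
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.total += time.perf_counter() - self._start
            self.count += 1
            self._start = None


timer = FullGCTimer()
gc.callbacks.append(timer)
try:
    gc.collect()  # with no argument this requests a full collection
finally:
    gc.callbacks.remove(timer)

print(f"{timer.count} full collection(s), {timer.total:.6f}s in total")
```

Dividing the accumulated time by the elapsed time over a window gives a fraction comparable to the percentage in the warning.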

For reference, my code is:

from dask.distributed import Client
client = Client()
# outputs: <Client: 'tcp://127.0.0.1:45451' processes=7 threads=28, memory=48.32 GB>
import dask.dataframe as dataframe
df = dataframe.read_parquet('data_simulated_partitioned.parquet')
df.npartitions
# 3941
df = df.repartition(partition_size='100MB')
# This is where hundreds of warnings arise
df.npartitions
# 137
df = df.persist()
# Again, hundreds of warnings

benjaminvatterj on 12 Jan 2021

👍2 👀1

All 13 comments

I'm curious, what kind of workload are you running? That's a large amount of time spent in garbage collection.

> If so is it possible to change the gc threshold/collect less aggressively?

I personally don't know much about how we handle GC, but you might want to look through the following file, which I think includes the code that's active here.

https://github.com/dask/distributed/blob/master/distributed/utils_perf.py

mrocklin on 25 Jun 2019

@mrocklin - sorry for the slow response. I'm running lots of parallel simulations. Each task loops through a large array of objects, adjusting the objects and storing the simulation state. At the end of the simulation, an overall metric is stored and the large dataset used is discarded (and probably garbage collected).

kindjacket on 15 Aug 2019

In "distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)", is the threshold the level at which the warning is emitted, or the level at which GC occurs?

kindjacket on 15 Aug 2019

It's the warning threshold. Python's GC runs fairly frequently.
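Since the 10% figure only controls when the message is logged, the message can be hidden without changing GC behaviour. A minimal sketch using the standard logging module follows. The logger name is taken from the log lines in this thread; the helper name quiet_gc_warnings is hypothetical. This hides the symptom only and does not reduce the time spent in GC.

```python
import logging


def quiet_gc_warnings(level=logging.ERROR):
    """Raise the level of the logger that emits the GC warning in this process."""
    logger = logging.getLogger("distributed.utils_perf")
    logger.setLevel(level)
    return logger


quiet_gc_warnings()

# Logging levels are per-process, so worker processes would need the same call,
# for example through Client.run on a running cluster:
# client.run(quiet_gc_warnings)
```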

mrocklin on 15 Aug 2019

I am experiencing the same on my workstation:

distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://127.0.0.1:58788 remote=tcp://127.0.0.1:33619>
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)

These warnings have been printing for at least 20 min, and no computation has been done as far as I can tell.

Edit 1: After about 30 min it started doing stuff:

[Screenshot: Screen Shot 2019-09-27 at 18 56 09]

Edit 2: I forgot to mention that I am creating a list of delayed objects that are then passed to dask.persist. See here.

muammar on 28 Sep 2019

You have lots of tasks relative to how many workers you have. You might want to look through https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
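One common way to reduce the task count is to group many small pieces of work into fewer, larger tasks. The sketch below shows the idea in plain Python. The names chunked, run_batch and simulate are hypothetical, and the batch size of 500 is an arbitrary assumption rather than a recommendation from the linked page.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batch(func, batch):
    """Run `func` on every item of one batch inside a single task."""
    return [func(item) for item in batch]


def simulate(seed):
    # Hypothetical stand-in for one small simulation.
    return seed * 2


inputs = list(range(10_000))
batches = chunked(inputs, 500)  # 20 batches instead of 10,000 tasks
print(len(batches))

# With dask installed, each batch could then become one delayed task:
# import dask
# tasks = [dask.delayed(run_batch)(simulate, batch) for batch in batches]
# results = dask.persist(*tasks)
```

Fewer, larger tasks mean a smaller graph for the scheduler to hold and fewer short-lived Python objects per unit of work.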

mrocklin on 28 Sep 2019

It's also worth noting that this error message

distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)

is most often (but not exclusively) the fault of the code that you're running, and not anything to do with Dask. Dask is just in a nice position to let you know if things like that are going on.
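A rough way to check whether a piece of code triggers full collections by itself, independently of Dask, is to compare gc.get_stats() before and after running it. The helper full_collections_during and the workload make_cycles are hypothetical names used only for this sketch.

```python
import gc


def full_collections_during(func, *args, **kwargs):
    """Run func and return (result, number of generation-2 collections it triggered)."""
    before = gc.get_stats()[2]["collections"]
    result = func(*args, **kwargs)
    after = gc.get_stats()[2]["collections"]
    return result, after - before


def make_cycles(n):
    # Hypothetical workload: create many small reference cycles,
    # which only the cyclic garbage collector can reclaim.
    for _ in range(n):
        a = []
        b = [a]
        a.append(b)
    return n


result, n_full = full_collections_during(make_cycles, 100_000)
print(f"full collections during the call: {n_full}")
```

A count that grows quickly with the workload size suggests the code allocates many container objects or cycles, which is the kind of behaviour the warning tends to surface.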

mrocklin on 28 Sep 2019

👍2

Hi,

I am getting started with Dask on my laptop, going through the dask-tutorial on GitHub, and I have a similar issue: "distributed.utils_perf - WARNING - full garbage collections took xxx% CPU time recently (threshold: 10%)" appears whenever I start a local cluster with the distributed scheduler.

It occurs every time I start a local cluster, even when I submit simple calculations.
Different settings of the parameters used when starting the cluster, such as n_workers, memory_limit, threads_per_worker and processes, do not seem to have any effect.
I use a conda environment with Python 3.7 on a laptop with 4 cores, 8 logical processors and 16 GB RAM.
I tried it on Windows and on another machine with Linux; no difference.
The % CPU time in the warning slowly climbs, and I get this warning almost every second.
client.restart() has no effect; the warnings show up again immediately.
It occurs only with the distributed scheduler.
Once the warnings have started, shutting the cluster down with client.shutdown() and starting it again without restarting the IPython kernel makes the warnings show up again without any calculation being submitted.

Is there any progress on this, or any way to solve it?

jklen on 28 Dec 2020

👍1

@jklen Which version of Dask are you using? Is it 2020.12.0? If so, could you please downgrade to 2.30.0?

I was encountering this issue after upgrading to the latest version (2020.12.0), but after downgrading, these warning messages no longer appeared. Of course, I am not sure that this was the cause.

arnabbiswas1 on 11 Jan 2021

Hi,
I have a similar problem. I use a cluster running daskdev/dask:2.30.0 with 1 scheduler and 3 workers (deployed via the Helm chart). The cluster is currently idle, with just some test tasks from time to time. After a few days the scheduler uses a lot of resources (in an idle cluster...) and starts spamming warnings:

distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.38s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.11s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)

I also use web checks for the dashboard; they too start to fail a few days after the scheduler starts.

scheduler:
  image:
    repository: "daskdev/dask"
    tag: 2.30.0
  resources:
    limits:
      cpu: 1.8
      memory: 6G
    requests:
      cpu: 1.8
      memory: 6G

[Image: Scheduler utilization in idle state]

[Image: Scheduler CPU usage graph]

[Image: Scheduler memory usage graph]

andrii29 on 12 Jan 2021

I get the same messages, and I don't think it could be the fault of my code, as it happens when just using .unique() on a dask.dataframe.core.Series object. All of a sudden I'm continually bombarded with this message, and over time the CPU time percentage slowly trickles up.

my_df["my_column"].unique().compute()

gives this output:

full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)

The task doesn't seem difficult, as it completes in a matter of seconds.

mjspeck on 30 Apr 2021

I have the same problem while processing large sets of data. In my case, the warning did not affect my output.

I somehow got around it by allocating more workers to the job (it used to be only one worker).

I am not sure about the implications and consequences (I also haven't tested the speed before and after).

Hope that helps.

RayAtUofT on 20 May 2021
