During the process I'm running I very quickly get this warning (for pretty much each worker)
distributed.utils_perf - WARNING - full garbage collections took 36% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
However, the percent of memory used is by each worker is low -

I have plenty of memory to work with so my question is - is it possible dask is aggressively doing gc? If so is it possible to change the gc threshold/collect less aggressively?
I've tried to manually adjust the gc.threshold but this seems to have no effect
g0, g1, g2 = gc.get_threshold()
gc.set_threshold(g0*10, g1*10, g2*10)
(using distributed 1.28.1 and dask 1.2.2)
I'm curious, what kind of workload are you running? That's a large amount of time spent in garbage collection.
If so is it possible to change the gc threshold/collect less aggressively?
I personally don't know much about how we handle GC, but you might want to look through the following file, which I think includes the code that's active here.
https://github.com/dask/distributed/blob/master/distributed/utils_perf.py
@mrocklin - sorry for the slow response. I'm running lots of parallel simulations. The task is looping through a large array of objects. Adjusting the objects and storing the simulation state. At the end of the simulation, an overall metric is stored, and the large dataset used is disregarded (probably garbage collected).
in distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%) is the threshold the warning threshold or the threshold for which GC occurs?
warning. Python's GC runs fairly frequently.
I am experiencing the same on my workstation:
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:58788 remote=tcp://127.0.0.1:33619>
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
cdistributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
It's been at least 20 min printing these warnings and no computation has been done as far as I can say.
Edit 1: After about 30 min it started doing stuff:

Edit 2: I forgot to mention that I am creating a list of delayed objects that are then called by dask.persist. See here.
You have lots of tasks relative to how many workers you have. You might want to look through https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
It's also worth noting that this error message
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
is most often (but not exclusively) the fault of the code that you're running, and not anything to do with Dask. Dask is just in a nice position to let you know if things like that are going on.
Hi,
I am starting with dask on my laptop, going through the dask-tutorial which is on github, but I have similar issue with "distributed.utils_perf - WARNING - full garbage collections took xxx% CPU time recently (threshold: 10%)" always when I start local cluster with the distributed scheduler
Is there any progress on this pls,, or any way to solve this?
@jklen Which version of the dask are you using? Is it 2020.12.0? If yes, could you please downgrade to 2.30.0?
I was encountering this issue after upgrading to the latest version ("2020.12.0"), but after down grading these warning messages didn't appear. But, of course, I am not sure.
hi,
have similar problem - use cluster daskdev/dask:2.30.0 with 1 scheduler and 3 workers (deployed via helm chart). Cluster currently is in idle state - just some test tasks form time to time. In few days scheduler use a lot of resources (in idle cluster...) and start spamming warnings
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.38s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.11s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
Also i use web checks for dashboard - they start to fail in few days after scheduler start too.
scheduler:
image:
repository: "daskdev/dask"
tag: 2.30.0
resources:
limits:
cpu: 1.8
memory: 6G
requests:
cpu: 1.8
memory: 6G
Scheduler utilization in idle state

Scheduler cpu usage graph

Scheduler memory usage graph

Hi,
just to add, I'm getting the same errors while repartitioning a parquet file. I suspect that it has more to do with the computation being very quick and the garbage collection just being the more burdensome task. Maybe the solution is to add to only warn if GC takes a long fraction of a long task? It seems to be the idea behind the first comment in _gc_callback in distributed.utils_perf which emits this warning:
def _gc_callback(self, phase, info):
# Young generations are small and collected very often,
# don't waste time measuring them
if info["generation"] != 2:
return
For reference, my code is:
from dask.distributed import Client
client = Client()
# outputs: <Client: 'tcp://127.0.0.1:45451' processes=7 threads=28, memory=48.32 GB>
import dask.dataframe as dataframe
df = dataframe.read_parquet('data_simulated_partitioned.parquet')
df.npartitions
# 3941
df = df.repartition(partition_size='100MB')
# This is were hundreds of warnings arise
df.npartitions
# 137
df = df.persist()
# Again, hundreds of warnings
I get the same messages, and I don't think it could be the fault of my code, as it happens when just using .unique() on a dask.dataframe.core.Series object. All of a sudden I'm continually bombarded with this message, and over time the CPU time percentage slowly trickles up.
my_df["my_column"].unique().compute()
gives this output:
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
The task doesn't seem difficult, as it completes in a matter of seconds.
I have get the same problem while processing large sets of data. For my case, the warning did not affect my output.
I somehow get away by allocating more workers to the job (used to be only one worker).
I am not sure about the implications and consequences (I also haven't test the speed before and after).
Hope that helps.
Most helpful comment
Hi,
just to add, I'm getting the same errors while repartitioning a parquet file. I suspect that it has more to do with the computation being very quick and the garbage collection just being the more burdensome task. Maybe the solution is to add to only warn if GC takes a long fraction of a long task? It seems to be the idea behind the first comment in _gc_callback in distributed.utils_perf which emits this warning:
For reference, my code is: