What happened:
The scheduler runs on a Linux server with 16 cores. CPU usage often reaches 100% on just one core while the other 15 sit idle.
What you expected to happen:
The scheduler to utilize all available cores to improve throughput
Anything else we need to know?:
I am running long simulation tasks with Dask. Each task is quite long (several seconds up to several minutes).
Unfortunately the worker nodes are spread out in different places, so the network between them is slow - maybe 10MBps between the worst endpoints.
The data needed for the processing is fairly large (many MB), so I start by scattering the data to make it available where needed.
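For readers unfamiliar with scattering, here is a minimal sketch of the pattern described above, not the actual simulation code. The `simulate` function, the `shared` dictionary, and the in-process cluster settings are hypothetical stand-ins; a real deployment would connect to the running scheduler by address (the hostname in the comment is made up).

```python
from dask.distributed import Client


def simulate(shared, seed):
    # Hypothetical stand-in for a long-running simulation that reads shared data.
    return sum(shared["values"]) + seed


# In-process cluster for illustration only. A real setup would connect to the
# existing scheduler instead, e.g. Client("tcp://scheduler-host:8786").
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None)

shared = {"values": list(range(1000))}  # hypothetical input data

# Wrap the object in a list so it is scattered as a single item;
# broadcast=True copies it to every worker up front.
[shared_future] = client.scatter([shared], broadcast=True)

# Each task receives the future, so only a small reference (not the data
# itself) travels with every task submission.
futures = [client.submit(simulate, shared_future, seed) for seed in range(4)]
results = client.gather(futures)
print(results)

client.close()
```

Passing the future rather than the raw object means the data crosses the slow network once per worker instead of once per task.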
I have installed the scheduler on its own 16-core server so that the scheduler does not become a bottleneck. The scheduler runs Linux, since I had problems with too many open file descriptors on Windows. The worker nodes run Windows, mainly because some of the libraries we use are not available on Linux.
I have noticed that the scheduler process sits at 100% CPU load, i.e. it uses just one CPU core while the other 15 are idle. I have the feeling that the scheduler is a bottleneck in the simulation.
Is the scheduler single threaded?
Is there anything I should consider in my setup to reduce the load on the scheduler?
Environment:
I am a newbie with Dask, but if I understood your problem correctly, a good first step is to tell us which type of scheduler (https://docs.dask.org/en/latest/scheduling.html) you are using. Also ask the scheduler to report the number of cores it uses by default, in order to understand why tasks are not running in parallel.
The scheduler runs on Linux and is launched with the default command
dask-scheduler
Then I start worker nodes on other machines, i.e. I'm using the distributed scheduler.
The tasks submitted to the scheduler are run in parallel, so that is not the problem. The problem I see is that, after a while, the scheduler process is always using 100% of one single core. I get the feeling that this process is a bottleneck.
Are you launching any workers? The scheduler merely sends tasks to workers. It's the workers that parallelize the execution.
Would also recommend playing around with how many processes vs. threads are used on workers. There may be some optimum for your use case.
Instead of launching things from CLI (at least initially) would recommend doing something fairly high level like creating a client. This will spin up the scheduler, workers and connect them for you. Something like this should be fine:
from distributed import Client
client = Client()
That way when calling .compute() or .persist() on Dask objects it will schedule the work through the client for you behind the scenes.
Thanks @jakirkham for your comments, however I think you misunderstand our setup.
The Dask cluster is made up of four servers. One server is dedicated to the scheduler, which is started from a bash console with the command dask-scheduler
The other three servers have 16 worker processes, with two threads each. This has been carefully tested to be a good fit for our application.
The worker nodes behave as expected.
My question is about the scheduler process. The server where it runs has 16 cores, but the scheduler process (when inspected with e.g. "top") only uses 1 core, and after a while it is constantly at 100% CPU usage. I get the feeling that this might limit the overall cluster performance.
My question was related to the scheduler itself, not about the workers.
Thanks for the background
Yeah I would not expect the Scheduler to make use of more than one core as it exists today
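To illustrate why, here is a minimal standard-library sketch, not Dask's actual code. As far as I understand, the distributed scheduler handles its messages in a single event loop, and an event loop of this kind runs every handler on one thread, so it can saturate at most one core. The `handle_message` coroutine and the message count are hypothetical.

```python
import asyncio
import threading

handled_on = set()  # ids of the threads that processed a message


async def handle_message(msg):
    # Every handler runs on the event loop's single thread.
    handled_on.add(threading.get_ident())
    await asyncio.sleep(0)  # yield to other handlers, as real I/O would
    return msg * 2


async def main():
    # 1000 concurrent "messages" from hypothetical workers
    return await asyncio.gather(*(handle_message(i) for i in range(1000)))


replies = asyncio.run(main())
print(len(replies), "messages handled on", len(handled_on), "thread(s)")
```

However many messages arrive concurrently, they are interleaved on the same thread rather than spread across cores, which matches the single busy core seen in "top".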
Thanks for the clarification!
I'll see if I can find out what is causing the load on the scheduler. We have removed some functions that checked the status of each client in order to report back to the user. We have also replaced the Windows nodes with Linux after porting our custom libraries.
This seems to have improved the situation. I'll close this issue.
We have also been doing work to improve the performance of the Scheduler, but we haven't been looking at using multiple cores for the Scheduler at this time.