I wish there were a user group/mailing list, so I didn't have to abuse the issue tracker by asking user-related questions here.
I've read the documents about running jobs on small and large clusters, and have a few questions:
How large is a 'large' cluster? In my environment, we have a cluster of 504 nodes with 16 cores each. Is that a large cluster? It seems I can use 'ray start' even on this cluster, even though it is large.
However, our cluster is managed by Slurm. I just started using it and found that I may not have direct ssh access to the nodes. Jobs and scheduling have to go through Slurm commands such as 'srun'. How can I start Redis and run jobs with Ray in such an environment?
I'm not sure these are valid questions, but thanks again for the help!
So, to answer your first question: the distinction we were making in the documentation between a small cluster and a large cluster is that a small cluster is small enough that you can conveniently ssh to all of the machines and start/stop processes by hand. A large cluster is too large to do that conveniently (though it would of course still work if you did).
As for your second question, I think you would need one Slurm job that runs ray start --head --redis-port=6379 to start the "head node" and then launch a bunch of Slurm jobs that run ray start --redis-address=... to start the non-head nodes. Though to do that you'll need the IP address of the head node. Does Slurm give that to you?
Then you could run another job that actually runs your application.
It should be possible to get it working (using something like the above). Let me know if you run into problems.
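In rough sketch form (the port, node counts, and script name below are placeholders, and getting the head node's IP to the workers is the part that depends on Slurm), the jobs could look something like:
# 1) Head job: one task on one node
srun --nodes=1 --ntasks=1 ray start --head --redis-port=6379
# 2) Worker jobs: one per remaining node, pointed at the head node's IP
srun --nodes=4 --ntasks=4 ray start --redis-address=<head-ip>:6379
# 3) Application job: a script that calls ray.init(redis_address="<head-ip>:6379")
srun --nodes=1 --ntasks=1 python your_script.py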
Also, GitHub issues are the best place to post questions for now.
Would it be possible to add the kind of functionality found in dask, where the scheduler info is written to a file that the program can read to determine the IP?
So in dask, using Slurm, you would do the following:
DASKIP="/path/to/scheduler.json"
srun --export=ALL --ntasks=1 dask-scheduler --scheduler-file ${DASKIP} &
srun --export=ALL --ntasks=7 dask-worker --nthreads=1 --scheduler-file ${DASKIP}
Then when running the program, you would use:
client = Client(scheduler_file="/path/to/scheduler.json")
I think this functionality would be helpful for use with a scheduler like Slurm.
I see. Does this assume some underlying distributed file system?
The dask-scheduler command writes the file and the dask-worker commands read the file?
Yes, using this method assumes you are on a shared file system so the JSON file can be read by all workers. And yes, the scheduler writes and the workers read.
Something like that would be useful. If I understand correctly, it wouldn't need to be part of the ray start command, and it would be sufficient to write the following two scripts:
start_head.sh - starts the Ray head node and writes the redis address to a file on a shared file system
start_worker.sh - loops until the file is present, reads the contents, and starts a worker node
Would you be interested in trying out something like that first and seeing if that works?
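For concreteness, a rough sketch of those two scripts (the shared path, port, and polling interval are just placeholders, and this assumes a shared file system):
# start_head.sh (hypothetical sketch)
#!/bin/bash
ray start --head --redis-port=6379
echo "$(hostname --ip-address):6379" > /shared/redis_address.txt

# start_worker.sh (hypothetical sketch)
#!/bin/bash
while [ ! -f /shared/redis_address.txt ]; do sleep 1; done
ray start --redis-address=$(cat /shared/redis_address.txt)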
@weiliu620 FYI, we just made a mailing list, which is a good place to ask questions: https://groups.google.com/forum/#!forum/ray-dev.
Just FYI, I think something like https://github.com/dask/dask-drmaa for Ray would be very useful for deploying it on, e.g., a Slurm-managed cluster. As far as I can tell, using DRMAA in an adaptive way could obviate the need for managing Ray head/worker nodes manually and enable auto-scaling on a DRMAA-compatible HPC cluster.
Thanks for the tip, that looks useful!
Is this something you'd be interested in contributing? We could always use more help from people with some HPC expertise.
We're just starting to set up cluster infrastructure at our lab and I'm buried in other work too, so not in the near future, unfortunately. But I wanted to list it here anyway (maybe someone will pick it up).
Hi, I'm interested in this implementation as well. I have an initial setup working but need help / access to a cluster to test. The implementation is not Slurm-specific, so it can be used with any MPI-enabled scheduler.
I'm using the start/stop procedures used by ray start and ray stop. Rank 0 is used as the head, so it takes the --head route, discovers all the rest of the reserved nodes/tasks, and runs ray start --redis-address=IP:PORT on each node (once) using the head node's IP.
After the processes are started, ray.init is called with the correct redis address. The user is responsible for submitting any tasks to Ray after this, preferably from a single node.
When everything is finished, the shutdown is performed, followed by a barrier, and once all nodes have synced, the processes are stopped.
@robertnishihara some changes might be necessary to the stop script, since currently all processes running on the host are shut down. If the host is shared and the same user submits multiple jobs that happen to get reserved on the same host, terminating one job will kill the other. Processes from other users should still be protected by the permissions enforced by the OS. Happy to discuss further.
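For reference, a very rough bash rendering of that rank-based start flow (this is not the actual MPI implementation; it uses SLURM_PROCID and scontrol instead of MPI discovery, and the port and wait are placeholders):
#!/bin/bash
# Launch with something like: srun --ntasks-per-node=1 ./ray_ranked_start.sh
head_node=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
head_ip=$(getent hosts $head_node | awk '{print $1}')
if [ "$SLURM_PROCID" -eq 0 ]; then
    ray start --head --redis-port=6379        # rank 0 takes the --head route
else
    sleep 10                                  # crude wait for the head to come up
    ray start --redis-address=$head_ip:6379   # other ranks join using the head's IP
fi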
Robert,
I am willing to help you set this up, as it will be tremendously useful for me. Let me know.
M
So...
How do you use Ray on a Slurm cluster?
In my case, it seems hard to access each individual node and install and start a Redis server on it.
After using srun to launch a Python script that starts with ray.init(), the first error reported is:
2019-04-08 21:05:15,898 INFO services.py:368 -- Failed to connect to the redis server, retrying.
It seems that Ray fails at the very beginning of initialization.
Can you copy the entire command that you are using? In addition, can you try to write a bash script and run it using sbatch?
Thanks,
M
I pasted my implementation below.
The "sleep infinity" and "ending sleep" lines are there because slurm would automatically kill any processes started by a script when that script completed. So these lines keep the "ray start" scripts from closing until trainer.py finishes.
This implementation seems to work. I can run trainer.py without error on more CPUs than any one node contains and the main script successfully stops all processes on completion.
Puzzlingly, trainer.py runs slower on 2 nodes than on 1, and is roughly the same speed on 2-5, despite increasing --num_cpus by 20 for each additional node. I don't know yet whether this is because of my implementation here or some other issue.
script.sh
#!/bin/bash
#SBATCH --job-name=a3c_baseline
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1GB
#SBATCH --nodes=5
#SBATCH --tasks-per-node 1
worker_num=4
module load Langs/Python/3.6.4 # Setting up my environment
cd ~/project/ssd/
source venv/bin/activate
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # making redis-address
suffix=':6379'
ip_head=$ip_prefix$suffix
echo "IP Head: $ip_head"
export ip_head
echo "STARTING HEAD at $node1"
srun --nodes=1 --ntasks=1 -w $node1 ~/scripts/start_head.sh &
sleep 5
for (( i=1; i<=$worker_num; i++ ))
do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w $node_i ~/scripts/start_worker.sh $ip_head $i &
  sleep 5
done
echo "RUNNING SCRIPT"
cd Github/ray_project/run_scripts/
python trainer.py --num_cpus 100
echo "ENDING SLEEP"
pkill -P $(<~/pid_storage/head.pid) sleep
for (( i=1; i<=$worker_num; i++ ))
do
  pkill -P $(<~/pid_storage/worker${i}.pid) sleep
done
start_head.sh
#!/bin/bash
ray start --head --redis-port=6379
echo "Head PID: $$"
echo "$$" | tee ~/pid_storage/head.pid
sleep infinity
echo "Head stopped"
start_worker.sh
#!/bin/bash
ray start --redis-address=$1
echo "Worker ${2} PID: $$"
echo "$$" | tee ~/pid_storage/worker${2}.pid
sleep infinity
echo "Worker ${2} stopped"
@gregSchwartz18 this is really neat! I think we should move this into the documentation. Would you be interested in contributing this as a PR?
Thanks! I just made the PR
Addressed by #5467.
Small addition - for anyone else getting the following error: slurmstepd-linux14: error: execve(): start_worker.sh: Permission denied, you need to run chmod +x start_worker.sh to make the script executable (and likewise for start_head.sh).
Also, in start_worker.sh, ray start --redis-address=$1 needs to be changed to ray start --address=$1 for newer Ray versions.