I wish there were a user group/mailing list, so I didn't have to abuse the issue tracker by asking user-related questions here.
I've read the documents about running jobs on small and large clusters, and have a few questions:
How large is a 'large' cluster? In my environment, we have a cluster of 504 nodes with 16 cores each. Is that a large cluster? It seems I can use 'ray start' even on this cluster, even though it is large.
However, our cluster is managed by Slurm. I just started using it and found that I may not have direct ssh access to the nodes. Jobs and scheduling have to go through Slurm commands such as 'srun'. How can I start Redis and run jobs with Ray in such an environment?
I'm not sure these are valid questions, but thanks again for the help!
So, to answer your first question: the distinction we were making in the documentation between a small cluster and a large cluster is that a small cluster is small enough that you can conveniently ssh to all of the machines and start/stop processes by hand. A large cluster is too large to do that conveniently (though it would of course still work if you did).
As for your second question, I think you would need one Slurm job that runs ray start --head --redis-port=6379 to start the "head node" and then launch a bunch of Slurm jobs that run ray start --redis-address=... to start the non-head nodes. Though to do that you'll need the IP address of the head node. Does Slurm give that to you?
Then you could run another job that actually runs your application.
It should be possible to get it working (using something like the above). Let me know if you run into problems.
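In rough sketch form (the port, node counts, and script name below are placeholders, and getting the head node's IP to the workers is the part that depends on Slurm), the jobs could look something like:
# 1) Head job: one task on one node
srun --nodes=1 --ntasks=1 ray start --head --redis-port=6379
# 2) Worker jobs: one per remaining node, pointed at the head node's IP
srun --nodes=4 --ntasks=4 ray start --redis-address=<head-ip>:6379
# 3) Application job: a script that calls ray.init(redis_address="<head-ip>:6379")
srun --nodes=1 --ntasks=1 python your_script.py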
Also, GitHub issues are the best place to post questions for now.
Would it be possible to add the kind of functionality found in dask, where the scheduler info is written to a file that the program can read to determine the IP?
So in dask, using Slurm, you would do the following:
DASKIP="/path/to/scheduler.json"
srun --export=ALL --ntasks=1 dask-scheduler --scheduler-file ${DASKIP} &
srun --export=ALL --ntasks=7 dask-worker --nthreads=1 --scheduler-file ${DASKIP}
Then when running the program, you would use:
client = Client(scheduler_file="/path/to/scheduler.json")
I think this functionality would be helpful for use with a scheduler like Slurm.
I see. Does this assume some underlying distributed file system?
The dask-scheduler command writes the file and the dask-worker commands read the file?
Yes, using this method assumes you are on a shared file system so the JSON file can be read by all workers. And yes, the scheduler writes and the workers read.
Something like that would be useful. If I understand correctly, it wouldn't need to be part of the ray start command, and it would be sufficient to write the following two scripts:
start_head.sh - starts the Ray head node and writes the redis address to a file on a shared file system
start_worker.sh - loops until the file is present, reads the contents, and starts a worker node
Would you be interested in trying out something like that first and seeing if that works?
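For concreteness, a rough sketch of those two scripts (the shared path, port, and polling interval are just placeholders, and this assumes a shared file system):
# start_head.sh (hypothetical sketch)
#!/bin/bash
ray start --head --redis-port=6379
echo "$(hostname --ip-address):6379" > /shared/redis_address.txt

# start_worker.sh (hypothetical sketch)
#!/bin/bash
while [ ! -f /shared/redis_address.txt ]; do sleep 1; done
ray start --redis-address=$(cat /shared/redis_address.txt)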
@weiliu620 FYI, we just made a mailing list, which is a good place to ask questions: https://groups.google.com/forum/#!forum/ray-dev.
Just FYI, I think something like https://github.com/dask/dask-drmaa for Ray would be very useful for deploying it on, e.g., a Slurm-managed cluster. As far as I can tell, using DRMAA in an adaptive way could obviate the need for managing Ray head/worker nodes manually and enable auto-scaling on a DRMAA-compatible HPC cluster.
Thanks for the tip, that looks useful!
Is this something you'd be interested in contributing? We could always use more help from people with some HPC expertise.
We're just starting to set up cluster infrastructure at our lab and I'm buried in other work too, so not in the near future, unfortunately. But I wanted to list it here anyway (maybe someone will pick it up).
Hi, I'm interested in this implementation as well. I have an initial setup working but need help / access to a cluster to test. The implementation is not Slurm-specific, so it can be used with any MPI-enabled scheduler.
I'm using the start/stop procedures used by ray start and ray stop. Rank 0 is used as the head, so it takes the --head route, discovers all the rest of the reserved nodes/tasks, and runs ray start --redis-address=IP:PORT on each node (once) using the head node's IP.
After the processes are started, ray.init is called with the correct redis address. The user is responsible for submitting any tasks to Ray after this, preferably from a single node.
When everything is finished, the shutdown is performed, followed by a barrier, and once all nodes have synced, the processes are stopped.
@robertnishihara some changes might be necessary to the stop script, since currently all processes running on the host are shut down. If the host is shared and the same user submits multiple jobs that happen to get reserved on the same host, terminating one job will kill the other. Processes from other users should still be protected by the permissions enforced by the OS. Happy to discuss further.
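For reference, a very rough bash rendering of that rank-based start flow (this is not the actual MPI implementation; it uses SLURM_PROCID and scontrol instead of MPI discovery, and the port and wait are placeholders):
#!/bin/bash
# Launch with something like: srun --ntasks-per-node=1 ./ray_ranked_start.sh
head_node=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
head_ip=$(getent hosts $head_node | awk '{print $1}')
if [ "$SLURM_PROCID" -eq 0 ]; then
    ray start --head --redis-port=6379        # rank 0 takes the --head route
else
    sleep 10                                  # crude wait for the head to come up
    ray start --redis-address=$head_ip:6379   # other ranks join using the head's IP
fi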
Robert,
I am willing to help you set this up, as it will be tremendously useful for me. Let me know.
M
So...
How do you use Ray on a Slurm cluster?
In my case, it seems hard to access each individual node and install and start a Redis server on it.
After using srun to launch a Python script that starts with ray.init(), the first error reported is:
2019-04-08 21:05:15,898 INFO services.py:368 -- Failed to connect to the redis server, retrying.
It seems that Ray fails at the very beginning of initialization.
Can you copy the entire command that you are using? In addition, can you try to write a bash script and run it using sbatch?
Thanks,
M
I pasted my implementation below.
The "sleep infinity" and "ending sleep" lines are there because slurm would automatically kill any processes started by a script when that script completed. So these lines keep the "ray start" scripts from closing until trainer.py finishes.
This implementation seems to work. I can run trainer.py without error on more CPUs than any one node contains and the main script successfully stops all processes on completion.
Puzzlingly, trainer.py runs slower on 2 nodes than on 1, and is roughly the same speed on 2-5, despite increasing --num_cpus by 20 for each additional node. I don't know yet whether this is because of my implementation here or some other issue.
script.sh
#!/bin/bash
#SBATCH --job-name=a3c_baseline
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1GB
#SBATCH --nodes=5
#SBATCH --tasks-per-node 1
worker_num=4
module load Langs/Python/3.6.4 # Setting up my environment
cd ~/project/ssd/
source venv/bin/activate
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # making redis-address
suffix=':6379'
ip_head=$ip_prefix$suffix
echo "IP Head: $ip_head"
export ip_head
echo "STARTING HEAD at $node1"
srun --nodes=1 --ntasks=1 -w $node1 ~/scripts/start_head.sh &
sleep 5
for (( i=1; i<=$worker_num; i++ ))
do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w $node_i ~/scripts/start_worker.sh $ip_head $i &
  sleep 5
done
echo "RUNNING SCRIPT"
cd Github/ray_project/run_scripts/
python trainer.py --num_cpus 100
echo "ENDING SLEEP"
pkill -P $(<~/pid_storage/head.pid) sleep
for (( i=1; i<=$worker_num; i++ ))
do
  pkill -P $(<~/pid_storage/worker${i}.pid) sleep
done
start_head.sh
#!/bin/bash
ray start --head --redis-port=6379
echo "Head PID: $$"
echo "$$" | tee ~/pid_storage/head.pid
sleep infinity
echo "Head stopped"
start_worker.sh
#!/bin/bash
ray start --redis-address=$1
echo "Worker ${2} PID: $$"
echo "$$" | tee ~/pid_storage/worker${2}.pid
sleep infinity
echo "Worker ${2} stopped"
@gregSchwartz18 this is really neat! I think we should move this into the documentation. Would you be interested in contributing this as a PR?
Thanks! I just made the PR
Addressed by #5467.
Small addition - for anyone else getting the following error: slurmstepd-linux14: error: execve(): start_worker.sh: Permission denied, you need to run chmod +x start_worker.sh to make the script executable (and likewise for start_head.sh).
Also, in start_worker.sh, ray start --redis-address=$1 needs to be changed to ray start --address=$1 for newer Ray versions.