
We have 2 machine node1 (10.201.2.21 - left screen) and node2 (.22 - right screen).
/data/shared/ is shared storage between the two.
Training command can run multi gpu on 1 node with
mpirun -n 2 ./Train train --flagsfile tD.cfg --iter 1000000 --enable_distributed=true --logtostderr=1 --minloglevel=0 --rndv_filepath='' --rundir=''
, but not sure how to run on multi node.
cc: @lunixbochs @jacobkahn
For multi-node training you need to have for example SLURM, or any other systems which can schedule jobs in multi-node. For it to communicate between processes one can use https://github.com/facebookresearch/wav2letter/blob/v0.2/src/common/Defines.cpp#L329 and also world_size and world_rank https://github.com/facebookresearch/wav2letter/blob/v0.2/src/common/Defines.cpp#L318.
See also discussion here https://github.com/facebookresearch/wav2letter/issues/605
hostfile. The machine you're on needs to be able to SSH into each node to start the jobs. 8 here is the number of gpus per node. You'll also need to pass that to mpirun later as $mingpus10.0.10.61 slots=8 max_slots=8
10.0.10.54 slots=8 max_slots=8
Use the same disk layout on all of your nodes. I used a read-only /data/* mount for my training data, and copied model configs into /tmp/w2l/ on every node before starting a training run.
Run mpirun, something like this. $njobs is the number of total GPUs in play.
mpirun -d -n "$njobs" --hostfile "hostfile" \
--wdir /tmp/w2l \
--bind-to none \
-- /path/to/Train \
--flagsfile flagsfile \
--max_devices_per_node "$mingpus" \
--enable_distributed true \
--runname model \
--rundir /tmp/w2l \
--rndv_filepath=''
Ultimately distributed training wasn't fast enough for me to actually use. I ran into a bottleneck somewhere and never fully figured it out. I ended up training mostly on individual 8GPU nodes. Maybe you'll have more luck with two GPUs per node than I had with 8.
1. Build an mpirun hostfile like this, called e.g. `hostfile`. The machine you're on needs to be able to SSH into each node to start the jobs. 8 here is the number of gpus per node. You'll also need to pass that to mpirun later as `$mingpus`10.0.10.61 slots=8 max_slots=8 10.0.10.54 slots=8 max_slots=81. Use the same disk layout on all of your nodes. I used a read-only /data/* mount for my training data, and copied model configs into /tmp/w2l/ on every node before starting a training run. 2. Run mpirun, something like this. `$njobs` is the number of total GPUs in play.mpirun -d -n "$njobs" --hostfile "hostfile" \ --wdir /tmp/w2l \ --bind-to none \ -- /path/to/Train \ --flagsfile flagsfile \ --max_devices_per_node "$mingpus" \ --enable_distributed true \ --runname model \ --rundir /tmp/w2l \ --rndv_filepath=''Ultimately distributed training wasn't fast enough for me to actually use. I ran into a bottleneck somewhere and never fully figured it out. I ended up training mostly on individual 8GPU nodes. Maybe you'll have more luck with two GPUs per node than I had with 8.
When I followed the instruction, mpi caught error because it can't find proc session directory and job session directory. Do you have any suggestion how to fix this error @lunixbochs?
I found the same problem as @light42 . could you please enlighten us @lunixbochs
Sorry, you'll need to debug any MPI issues yourself.
Most helpful comment
hostfile. The machine you're on needs to be able to SSH into each node to start the jobs. 8 here is the number of gpus per node. You'll also need to pass that to mpirun later as$mingpusUse the same disk layout on all of your nodes. I used a read-only /data/* mount for my training data, and copied model configs into /tmp/w2l/ on every node before starting a training run.
Run mpirun, something like this.
$njobsis the number of total GPUs in play.Ultimately distributed training wasn't fast enough for me to actually use. I ran into a bottleneck somewhere and never fully figured it out. I ended up training mostly on individual 8GPU nodes. Maybe you'll have more luck with two GPUs per node than I had with 8.