Wav2letter: Train multinode multigpu

Created on 26 Aug 2020  ·  5 Comments  ·  Source: flashlight/wav2letter

train.cfg.txt

Screen Shot 2020-08-26 at 13 36 25

We have two machines: node1 (10.201.2.21, left screen) and node2 (.22, right screen).
/data/shared/ is shared storage between the two.
The training command runs on multiple GPUs on a single node with

    mpirun -n 2 ./Train train --flagsfile tD.cfg --iter 1000000 --enable_distributed=true --logtostderr=1 --minloglevel=0 --rndv_filepath='' --rundir=''

but we are not sure how to run it across multiple nodes.

  1. What is the correct command to run on each node for distributed multi-node training?
  2. What should appear in the log to confirm that multi-node training is actually running?

#293 #550 #555

cc: @lunixbochs @jacobkahn

question

Most helpful comment

  1. Build an mpirun hostfile like this, called e.g. hostfile. The machine you're on needs to be able to SSH into each node to start the jobs. 8 here is the number of GPUs per node. You'll also need to pass that number on the mpirun command line later as $mingpus (it goes to --max_devices_per_node).

    10.0.10.61 slots=8 max_slots=8
    10.0.10.54 slots=8 max_slots=8

  2. Use the same disk layout on all of your nodes. I used a read-only /data/* mount for my training data, and copied model configs into /tmp/w2l/ on every node before starting a training run.

  3. Run mpirun, something like this. $njobs is the total number of GPUs in play.

    mpirun -d -n "$njobs" --hostfile "hostfile" \
        --wdir /tmp/w2l \
        --bind-to none \
        -- /path/to/Train \
            --flagsfile flagsfile \
            --max_devices_per_node "$mingpus" \
            --enable_distributed true \
            --runname model \
            --rundir /tmp/w2l \
            --rndv_filepath=''
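
The two variables used above can be derived from the hostfile rather than typed by hand. The following is only a sketch: the function names `hostfile_slots` and `hostfile_hosts` are made up for illustration, and treating a line without `slots=` as one slot is an assumption of this sketch, not something stated in the thread.

```shell
#!/usr/bin/env bash
# Sketch: derive $mingpus and $njobs from an mpirun hostfile whose lines
# look like "10.0.10.61 slots=8 max_slots=8".
# Function names are hypothetical helpers, not part of wav2letter or MPI.
set -euo pipefail

# Print "<smallest slots value> <sum of slots values>" for the hostfile in $1.
hostfile_slots() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }     # skip blank lines and comments
        {
            slots = 1                     # assumption: no slots= means 1 slot
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) { split($i, kv, "="); slots = kv[2] + 0 }
            if (n == 0 || slots < min) min = slots
            n++
            total += slots
        }
        END { print min + 0, total + 0 }
    ' "$1"
}

# Print one host per line, e.g. to loop over nodes when copying configs.
hostfile_hosts() {
    awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1"
}
```

With the two-node hostfile shown earlier, `read -r mingpus njobs < <(hostfile_slots hostfile)` would set `mingpus` to 8 and `njobs` to 16. The host list from `hostfile_hosts hostfile` could drive a loop that copies the model configs into /tmp/w2l/ on each node before the run.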

Ultimately, distributed training wasn't fast enough for me to actually use. I ran into a bottleneck somewhere and never fully figured it out. I ended up training mostly on individual 8-GPU nodes. Maybe you'll have more luck with two GPUs per node than I had with 8.

All 5 comments

For multi-node training you need something like SLURM, or any other system that can schedule jobs across multiple nodes. For the processes to communicate with each other, you can use the flag defined at https://github.com/facebookresearch/wav2letter/blob/v0.2/src/common/Defines.cpp#L329 together with world_size and world_rank (https://github.com/facebookresearch/wav2letter/blob/v0.2/src/common/Defines.cpp#L318).

See also discussion here https://github.com/facebookresearch/wav2letter/issues/605
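
As a rough illustration of the scheduler route, each process launched by SLURM could build its own distributed flags from the environment variables that srun sets. This is a sketch under assumptions: it assumes the flag behind the first link above is `--rndv_filepath` (the flag the mpirun commands in this thread pass as an empty string) and that it accepts a file path on storage visible to every node. The helper name `distributed_flags` and the rendezvous file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print per-process distributed flags for Train under SLURM.
# Assumes --rndv_filepath takes a path on storage shared by all nodes;
# the function name is a hypothetical helper, not part of wav2letter.
set -euo pipefail

# $1: rendezvous file path on shared storage (hypothetical name).
distributed_flags() {
    local rndv_path="$1"
    # SLURM_PROCID is the global rank of this task and SLURM_NTASKS is the
    # total number of tasks; srun sets both. Fail loudly if they are missing.
    printf '%s\n' \
        "--enable_distributed=true" \
        "--world_rank=${SLURM_PROCID:?run this under srun}" \
        "--world_size=${SLURM_NTASKS:?run this under srun}" \
        "--rndv_filepath=${rndv_path}"
}
```

Inside a script started by srun, a call along the lines of `./Train train --flagsfile tD.cfg $(distributed_flags /data/shared/rndv_file)` would then give every process a distinct rank and the same world size.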

When I followed these instructions, MPI raised an error because it could not find the proc session directory and the job session directory. Do you have any suggestions for fixing this error, @lunixbochs?

I ran into the same problem as @light42. Could you please enlighten us, @lunixbochs?

Sorry, you'll need to debug any MPI issues yourself.
