Horovod: How to specify working directories and anaconda environments for different workers?

Created on 23 Oct 2018 · 1 Comment · Source: horovod/horovod

Thanks for your great work! However, I run into problems when the working directories and Anaconda environments differ between workers. When I run the following command on the host (worker0):

mpirun  -np 8 \
        -H worker0:4,worker1:4\
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python train.py 

it simply tells me that train.py cannot be found on worker1. In addition, when I try the command which python, only worker0 returns the interpreter I expect, /home/tan/anaconda2/envs/pointnet2/bin/python. worker1 returns /home/tan/anaconda2/bin/python, which is the default Python environment and not the one I want.
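One way to see what each worker actually resolves is a small probe run through mpirun. The sketch below is not from the original post: probe_cmd is a hypothetical helper that only prints a command reporting the hostname, working directory and python path for each rank. It keeps -x PATH to mirror the command above, and the host list in the example is an assumption.

```shell
#!/bin/sh
# Sketch only: print (do not run) an mpirun command that makes every rank
# report its hostname, working directory and the python it resolves.
# probe_cmd is a hypothetical helper name, not part of Horovod or Open MPI.
probe_cmd() {
    hosts=$1   # host list in the same form as passed to -H, e.g. "worker0:1,worker1:1"
    np=$2      # total number of processes to launch
    printf '%s\n' "mpirun -np ${np} -H ${hosts} -x PATH sh -c 'echo \"\$(hostname): \$(pwd) \$(command -v python)\"'"
}

# Example: one probe process on each of the two workers.
probe_cmd "worker0:1,worker1:1" 2
```

Piping the printed line to sh runs the probe. If the reported directory or interpreter on worker1 differs from worker0, that matches the symptoms described above.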

Everything works fine when tested on worker0 alone with the following command (after switching to the correct working directory and activating the correct conda env with source activate):

mpirun  -np 4 \
        -H worker0:4\
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python train.py 

I know that putting the files in the same directory and giving the conda env the same name on every worker, then specifying them on the command line, may solve the problem, but is there a better way? Thank you very much!

Most helpful comment

UPDATE:

Inspired by this, I changed the command to:

conda_env_worker0="/home/tan/anaconda2/envs/<ENV_NAME@worker0>/bin/python"
program_dir_worker0="<FILE_PATH@worker0>/train.py"
conda_env_worker1="/home/tan/anaconda2/envs/<ENV_NAME@worker1>/bin/python"
program_dir_worker1="<FILE_PATH@worker1>/train.py"

# NOTE: The anaconda home directory is the same on my worker0 and 1.

mpirun  -np 4 \
        -H worker0:4\
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO \
        -mca pml ob1 -mca btl ^openib \
        $conda_env_worker0 $program_dir_worker0 : \
        -H worker1:4\
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO \
        -mca pml ob1 -mca btl ^openib \
        $conda_env_worker1 $program_dir_worker1

And now it finally works.
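
The same colon-separated form can be assembled from per-worker settings. The sketch below is not from the original thread. It builds one app context per host and, as an assumption, also passes Open MPI's -wdir option so that each worker starts in its own directory. The helper name build_mpmd_cmd, the "host:slots:python:workdir" spec format and the example paths are hypothetical. It only prints the command so it can be inspected before being run.

```shell
#!/bin/sh
# Sketch only: build an mpirun command with one app context per host.
# Each argument is a hypothetical spec "host:slots:python_path:workdir"
# (paths must not contain ':').
build_mpmd_cmd() {
    cmd="mpirun"
    sep=""
    for spec in "$@"; do
        # Split the spec on ':' using POSIX parameter expansion.
        host=${spec%%:*}; rest=${spec#*:}
        slots=${rest%%:*}; rest=${rest#*:}
        py=${rest%%:*};    dir=${rest#*:}
        cmd="${cmd}${sep} -np ${slots} -H ${host}:${slots}"
        cmd="${cmd} -bind-to none -map-by slot"
        cmd="${cmd} -x NCCL_DEBUG=INFO"
        cmd="${cmd} -mca pml ob1 -mca btl ^openib"
        # -wdir: assumed here to set the working directory for this app context.
        cmd="${cmd} -wdir ${dir} ${py} train.py"
        sep=" :"
    done
    printf '%s\n' "$cmd"
}

# Example with placeholder paths; replace them with the real ones per worker.
build_mpmd_cmd \
    "worker0:4:/opt/envs/a/bin/python:/data/proj_a" \
    "worker1:4:/opt/envs/b/bin/python:/data/proj_b"
```

Piping the printed line to sh launches the job. Whether a relative train.py resolves against the -wdir directory as intended should be checked on your own setup; using absolute script paths, as in the command above, avoids the question.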
