Horovod: multi-node training fails in Docker

Created on 7 May 2019 · 1 Comment · Source: horovod/horovod

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.12.0
  3. Horovod version: 0.16.1
  4. MPI version: 4.0.0
  5. CUDA version: 9.0
  6. NCCL version: 2.3.7
  7. Python version: 3.6.7
  8. OS and version: Ubuntu 16.04

Checklist:

  1. Did you search issues to find if somebody asked this question before?
    Yes.
  2. If your question is about hang, did you read this doc?
    Not applicable; this is not a hang.
  3. If your question is about docker, did you read this doc?
    Yes.

Your question:
I want to train on 2 nodes, with everything running inside Docker containers. The Docker image is built from https://github.com/horovod/horovod/blob/master/Dockerfile

I run this command on 192.168.0.135:
mpirun --allow-run-as-root -np 4 -H 192.168.0.136:2,192.168.0.135:2 -mca plm_rsh_args "-p 12345 -v" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo python keras_mnist_advanced.py

Passwordless SSH works from each node: ssh -p 12345 192.168.0.136, and on the other node ssh -p 12345 192.168.0.136

This is the output:

OpenSSH_7.2p2 Ubuntu-4ubuntu2.8, OpenSSL 1.0.2g  1 Mar 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to 192.168.0.135 [192.168.0.135] port 12345.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.8
OpenSSH_7.2p2 Ubuntu-4ubuntu2.8, OpenSSL 1.0.2g  1 Mar 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to 192.168.0.136 [192.168.0.136] port 12345.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.8
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.2p2 Ubuntu-4ubuntu2.8
debug1: match: OpenSSH_7.2p2 Ubuntu-4ubuntu2.8 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.0.135:12345 as 'root'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.2p2 Ubuntu-4ubuntu2.8
debug1: match: OpenSSH_7.2p2 Ubuntu-4ubuntu2.8 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.0.136:12345 as 'root'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:THIjaDpeQu/5DZ/PUqdEGX6Eqm90fZv1B4Qps35XqLw
debug1: Host '[192.168.0.135]:12345' is known and matches the ECDSA host key.
debug1: Found key in /root/.ssh/known_hosts:5
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:THIjaDpeQu/5DZ/PUqdEGX6Eqm90fZv1B4Qps35XqLw
debug1: Host '[192.168.0.136]:12345' is known and matches the ECDSA host key.
debug1: Found key in /root/.ssh/known_hosts:3
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /root/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 279
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentication succeeded (publickey).
Authenticated to 192.168.0.135 ([192.168.0.135]:12345).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /root/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 279
debug1: Authentication succeeded (publickey).
Authenticated to 192.168.0.136 ([192.168.0.136]:12345).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending command:     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "276692992" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "cebc[2:76]c37d3f,[3:192].168.0.136,[3:192].168.0.135@0(3)" -mca orte_hnp_uri "276692992.0;tcp://172.17.0.2:43801" -mca pml "ob1" -mca btl "^openib" -mca btl_tcp_if_exclude "lo" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "276692992.0;tcp://172.17.0.2:43801" -mca plm_rsh_args "-p 12345 -v" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
debug1: Sending environment.
debug1: Sending command:     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "276692992" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "cebc[2:76]c37d3f,[3:192].168.0.136,[3:192].168.0.135@0(3)" -mca orte_hnp_uri "276692992.0;tcp://172.17.0.2:43801" -mca pml "ob1" -mca btl "^openib" -mca btl_tcp_if_exclude "lo" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "276692992.0;tcp://172.17.0.2:43801" -mca plm_rsh_args "-p 12345 -v" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    56bcfc3b79d6
  Remote host:   cebc76c37d3f
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
debug1: fd 0 clearing O_NONBLOCK
Transferred: sent 3408, received 3204 bytes, in 0.1 seconds
Bytes per second: sent 32870.7, received 30903.1
debug1: Exit status 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
debug1: fd 0 clearing O_NONBLOCK
Transferred: sent 3408, received 2784 bytes, in 0.1 seconds
Bytes per second: sent 29942.8, received 24460.3
debug1: Exit status 0
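
One detail in the log above is worth checking: the orted daemons are told to connect back to mpirun at tcp://172.17.0.2:43801. An address in 172.17.0.0/16 is typically on Docker's default bridge network and is usually not routable from another host, which would match the "unable to complete a TCP connection" message. The sketch below is an illustration, not part of the original report. It pulls that address out of an orted command line and checks the range; the function names hnp_ip and is_default_bridge_ip are hypothetical, and the range check assumes Docker's default bridge subnet has not been reconfigured.

```shell
#!/bin/sh
# Read orted command lines on stdin and print the IP address that mpirun
# advertises in: -mca orte_hnp_uri "<jobid>;tcp://<ip>:<port>"
hnp_ip() {
  sed -n 's|.*orte_hnp_uri "[^;]*;tcp://\([0-9.]*\):[0-9]*".*|\1|p' | head -n 1
}

# Succeed if the address is in 172.17.0.0/16, Docker's default bridge subnet
# (assumption: the default bridge configuration is in use).
is_default_bridge_ip() {
  case "$1" in
    172.17.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fragment copied from the "Sending command" line in the log above.
sample='-mca orte_hnp_uri "276692992.0;tcp://172.17.0.2:43801" -mca pml "ob1"'
ip=$(printf '%s\n' "$sample" | hnp_ip)

if is_default_bridge_ip "$ip"; then
  echo "$ip looks like a Docker bridge address; other hosts probably cannot reach it"
else
  echo "$ip is not in Docker's default bridge range"
fi
```

If the advertised address is a bridge address rather than one of the 192.168.0.x host addresses, the remote daemon has no route back to mpirun even though SSH itself succeeds.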

Thank you for your help!

question

Most helpful comment

Solved: run the Docker containers with --network=host. I had missed this in the docs. The problem was network communication between the Docker containers.
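
For readers hitting the same error, a minimal sketch of what that fix can look like follows. It only prints the docker run commands instead of executing them. The image tag horovod:latest, the shared key directory /mnt/share/ssh, and the helper name container_cmd are assumptions for illustration, not details from the original post; GPU-related options (for example the NVIDIA runtime) are left out.

```shell
#!/bin/sh
# Dry run: print docker commands that start containers on the host's network
# stack, so MPI sees the host addresses instead of Docker bridge addresses.
# IMAGE and SSH_DIR are placeholders (assumptions), not values from the post.
IMAGE="horovod:latest"
SSH_DIR="/mnt/share/ssh"

container_cmd() {
  # $1 is an optional command to run inside the container.
  echo "docker run -it --network=host -v ${SSH_DIR}:/root/.ssh ${IMAGE}${1:+ $1}"
}

# Node where mpirun is started (192.168.0.135 in the post): interactive shell.
container_cmd

# Other node (192.168.0.136): start sshd on the port passed to mpirun via
# plm_rsh_args (12345 in the post) and keep the container alive.
container_cmd 'bash -c "/usr/sbin/sshd -p 12345; sleep infinity"'
```

With host networking the container shares the host's interfaces, so the interface name given in NCCL_SOCKET_IFNAME may need to match the host's interface name rather than a container-only eth0.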
