System configuration
Before beginning, I want to state that this problem is resolved for me. I am writing this issue for the record.
For learning rate decay, I re-define the optimizer at every epoch:
def train(epoch):
    net.train()
    optimizer = optim.SGD(net.parameters(), lr=cf.learning_rate(args.lr, epoch)*hvd.size(), momentum=0.9, weight_decay=5e-4)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=net.named_parameters())
    ...
The mpirun command was simply:
mpirun -np 4 -H localhost:4 python <PYTHON_CODE_PATH> --datadir <DATA_DIR>
At the very beginning of the 2nd epoch (during the first loss calculation and backprop), the code fails with the stack trace below:
[fe535b5ccf9b:00019] *** Process received signal ***
[fe535b5ccf9b:00019] Signal: Segmentation fault (11)
[fe535b5ccf9b:00019] Signal code: Address not mapped (1)
[fe535b5ccf9b:00019] Failing at address: 0x28
[fe535b5ccf9b:00019] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbf2779c390]
[fe535b5ccf9b:00019] [ 1] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x58d71)[0x7fbeaabf4d71]
[fe535b5ccf9b:00019] [ 2] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x5de24)[0x7fbeaabf9e24]
[fe535b5ccf9b:00019] [ 3] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x692d8)[0x7fbeaac052d8]
[fe535b5ccf9b:00019] [ 4] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x6a65b)[0x7fbeaac0665b]
[fe535b5ccf9b:00019] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fbf1e97bc80]
[fe535b5ccf9b:00019] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fbf277926ba]
[fe535b5ccf9b:00019] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fbf274c841d]
[fe535b5ccf9b:00019] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
fe535b5ccf9b:18:122 [0] INFO comm 0x7fc7381be150 rank 0 nranks 4
fe535b5ccf9b:20:126 [2] INFO comm 0x7fbe601ba6e0 rank 2 nranks 4
--------------------------------------------------------------------------
mpirun.real noticed that process rank 1 with PID 0 on node fe535b5ccf9b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
This means I am running the code on a single node, with 4 processes for 4 GPUs.
The question is: is this an expected outcome?
It is redundant to define the optimizer every epoch, but I kept the original baseline code as it was.
This issue has been resolved for me; I just wanted to record the case in case other people run into a similar problem. This is not a problem in native PyTorch code, but it is with Horovod.
Thanks for reporting this issue. It's a bug we should fix. Specifically, each additional hvd.DistributedOptimizer() you instantiate registers an additional set of hooks, which causes the same gradients to be all-reduced multiple times. Doing so is still a usage error, but we should show a clear error message there instead of a segfault.
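For reference, a minimal sketch of one way to avoid the crash, assuming the names from the original snippet (net, cf.learning_rate, args): wrap the optimizer with hvd.DistributedOptimizer() once, outside train(), and apply the learning rate decay by mutating optimizer.param_groups inside the loop.

# Sketch only, not the exact code from this thread: create the optimizer and the
# Horovod wrapper a single time, so only one set of allreduce hooks is registered.
# `net`, `cf.learning_rate`, and `args` are assumed from the snippet above.
import torch.optim as optim
import horovod.torch as hvd

hvd.init()

optimizer = optim.SGD(net.parameters(),
                      lr=cf.learning_rate(args.lr, 0) * hvd.size(),
                      momentum=0.9, weight_decay=5e-4)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=net.named_parameters())

def train(epoch):
    net.train()
    # Decay the learning rate by updating param_groups instead of
    # re-creating (and re-wrapping) the optimizer every epoch.
    for group in optimizer.param_groups:
        group['lr'] = cf.learning_rate(args.lr, epoch) * hvd.size()
    ...

With this pattern, a single set of allreduce hooks stays attached to the model's gradients for the lifetime of training.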
Hi, I have the same segmentation fault issue. I used mpirun to train a PyTorch model with Horovod on 4 GPUs on a single node. Sometimes I get the seg fault, while most of the time it is fine. The error message is very brief, and it is very hard for me to debug. Is this issue resolved on top of tree? Thanks!
@HaiguangWen, please open a new issue. Please collect a core dump and attach a backtrace, and describe the versions of the components you are using.