System configuration
Before beginning, I want to state that this problem is resolved for me. I am writing this issue for the record.
For learning rate decay, I re-define the optimizer at every epoch:
def train(epoch):
    net.train()
    optimizer = optim.SGD(net.parameters(), lr=cf.learning_rate(args.lr, epoch)*hvd.size(), momentum=0.9, weight_decay=5e-4)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=net.named_parameters())
    ...
The mpirun command was simply:
mpirun -np 4 -H localhost:4 python <PYTHON_CODE_PATH> --datadir <DATA_DIR>
At the very beginning of the 2nd epoch (during the first loss calculation and backprop), the code fails with the stack trace below:
[fe535b5ccf9b:00019] *** Process received signal ***
[fe535b5ccf9b:00019] Signal: Segmentation fault (11)
[fe535b5ccf9b:00019] Signal code: Address not mapped (1)
[fe535b5ccf9b:00019] Failing at address: 0x28
[fe535b5ccf9b:00019] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbf2779c390]
[fe535b5ccf9b:00019] [ 1] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x58d71)[0x7fbeaabf4d71]
[fe535b5ccf9b:00019] [ 2] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x5de24)[0x7fbeaabf9e24]
[fe535b5ccf9b:00019] [ 3] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x692d8)[0x7fbeaac052d8]
[fe535b5ccf9b:00019] [ 4] /usr/local/lib/python2.7/dist-packages/horovod/torch/mpi_lib_impl/_mpi_lib_impl.so(+0x6a65b)[0x7fbeaac0665b]
[fe535b5ccf9b:00019] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fbf1e97bc80]
[fe535b5ccf9b:00019] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fbf277926ba]
[fe535b5ccf9b:00019] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fbf274c841d]
[fe535b5ccf9b:00019] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
fe535b5ccf9b:18:122 [0] INFO comm 0x7fc7381be150 rank 0 nranks 4
fe535b5ccf9b:20:126 [2] INFO comm 0x7fbe601ba6e0 rank 2 nranks 4
--------------------------------------------------------------------------
mpirun.real noticed that process rank 1 with PID 0 on node fe535b5ccf9b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
This means I am running the code on a single node, with 4 processes for 4 GPUs.
The question is: is this an expected outcome?
It is redundant to define the optimizer every epoch, but I kept the original baseline code as it was.
This issue has been resolved for me; I just wanted to record the case in case other people run into a similar problem. This is not a problem in native PyTorch code, but it is with Horovod.
Thanks for reporting this issue. It's a bug we should fix. Specifically, each additional hvd.DistributedOptimizer() you instantiate registers an additional set of hooks, which causes the same gradients to be all-reduced multiple times. Doing so is still a usage error, but we should show a clear error message there instead of a segfault.
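For reference, a minimal sketch of one way to avoid the crash, assuming the names from the original snippet (net, cf.learning_rate, args): wrap the optimizer with hvd.DistributedOptimizer() once, outside train(), and apply the learning rate decay by mutating optimizer.param_groups inside the loop.

# Sketch only, not the exact code from this thread: create the optimizer and the
# Horovod wrapper a single time, so only one set of allreduce hooks is registered.
# `net`, `cf.learning_rate`, and `args` are assumed from the snippet above.
import torch.optim as optim
import horovod.torch as hvd

hvd.init()

optimizer = optim.SGD(net.parameters(),
                      lr=cf.learning_rate(args.lr, 0) * hvd.size(),
                      momentum=0.9, weight_decay=5e-4)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=net.named_parameters())

def train(epoch):
    net.train()
    # Decay the learning rate by updating param_groups instead of
    # re-creating (and re-wrapping) the optimizer every epoch.
    for group in optimizer.param_groups:
        group['lr'] = cf.learning_rate(args.lr, epoch) * hvd.size()
    ...

With this pattern, a single set of allreduce hooks stays attached to the model's gradients for the lifetime of training.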
Hi, I have the same segmentation fault issue. I used mpirun to train a PyTorch model with Horovod on 4 GPUs on a single node. Sometimes I get the seg fault, while most of the time it is fine. The error message is very brief, and it is very hard for me to debug. Is this issue resolved on top of tree? Thanks!
@HaiguangWen, please open a new issue. Please collect a core dump and attach a backtrace, and describe the versions of the components you are using.