Hi,
I tried to run the benchmark on a machine with four P100 GPUs several times.
Most of the time (six out of eight tries) the GPUs got stuck: utilization at 100%, memory occupied, apparently working, but nothing was actually done (not a single iteration log was printed, and no checkpoint was saved).
In two of my tries the network trained normally (a regular successful run, with logs, checkpoints, etc.).
I noticed that on the successful tries the "Start training" line was printed just before training began. For example:
creating index...
index created!
Done (t=4.16s)
creating index...
index created!
Done (t=4.23s)
creating index...
index created!
2018-11-24 07:51:25,647 maskrcnn_benchmark.trainer INFO: Start training
2018-11-24 07:51:33,212 maskrcnn_benchmark.trainer INFO: eta: 1 day, 13:48:52 iter: 20 loss: 1.1634 (1.5222) loss_classifier: 0.5153 (0.8516) loss_box_reg: 0.0399 (0.0403) loss_objectness: 0.4983 (0.4964) loss_rpn_box_reg: 0.1370 (0.1338) time: 0.2600 (0.3782) data: 0.0060 (0.1302) lr: 0.001793 max mem: 1840
On the unsuccessful runs, the "Start training" line appeared in a different place:
loading annotations into memory...
loading annotations into memory...
Done (t=3.61s)
creating index...
Done (t=3.59s)
creating index...
index created!
index created!
2018-11-24 07:42:22,458 maskrcnn_benchmark.trainer INFO: Start training
Done (t=4.03s)
creating index...
Done (t=4.06s)
creating index...
index created!
index created!
end of log
Is this a known problem?
Can I work around it somehow?
I'm working with:
Nvidia driver version: 396.26
CUDA used to build PyTorch: 9.0.176
OS: CentOS Linux 7 (Core)
Thanks
hi @barakhi
The training procedure may hang; you can have a look at #58.
Hi, thanks.
Although...
I read #58 before I asked my question; my NVIDIA driver is 396.26. I can upgrade or downgrade, but I didn't find a clear answer about which versions actually work, so I think it's still an open question.
Also, I'm describing a random effect and suggesting it might be related to synchronization between the workers/dataloader and the start-of-training message. If it were just a driver issue, could it have this effect (sometimes working and sometimes not)?
I will try to upgrade my NCCL version (it came up in #58 as a possible solution).
Any help would be welcome.
Update:
I added time.sleep(5) before the do_train function.
Two successful trains out of two tries.
It's a silly workaround, but it may help others as well.
(Can this issue, as well as #58, be passed to the PyTorch developers?)
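A fixed sleep papers over what looks like a startup race: some workers are still building their dataset index when training kicks off. The same synchronization can be expressed with a barrier so that nobody proceeds until everyone is ready (in real distributed code, torch.distributed.barrier() plays this role). A minimal, self-contained sketch of the pattern using threads, not the repo's actual code:

```python
import queue
import random
import threading
import time

def run(n=4):
    """Start n workers with uneven startup delays; none proceeds past the
    barrier until all have finished their 'setup' phase."""
    barrier = threading.Barrier(n)
    out = queue.Queue()

    def worker(rank):
        time.sleep(random.uniform(0.0, 0.1))  # simulate uneven index building
        barrier.wait()                        # all workers rendezvous here
        out.put(rank)                         # only now does "training" begin

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(out.get() for _ in range(n))

if __name__ == "__main__":
    print(run())  # prints [0, 1, 2, 3]
```

Unlike a sleep, the barrier waits exactly as long as the slowest worker needs, no more and no less.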
@barakhi the issue is not related to PyTorch, but with bad CUDA interactions.
Can you update your CUDA to version 9.2?
Also, I'm going to be closing this issue as a duplicate of #58 , feel free to continue there.
hi @barakhi @fmassa
When training, it's fine to use 1, 2, or 3 GPUs, but with 4 or 8 GPUs it easily hangs. So I tried printing inside the do_train function and found that the procedure gets stuck in those lines; it looks like something is wrong in the dataloader.
If I set _C.DATALOADER.NUM_WORKERS to 0 (which slows down training), it runs without any problem. (Set to 1 or 2, it can still hit this problem, though it sometimes runs successfully.)
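For anyone trying this workaround, it doesn't require editing defaults.py: maskrcnn-benchmark's train script accepts YACS key/value overrides on the command line. A sketch (the config file name and GPU count are examples):

```shell
# Launch 4-GPU training with dataloader worker processes disabled as a workaround
export NGPUS=4
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py \
    --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" \
    DATALOADER.NUM_WORKERS 0
```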
@zimenglan-sysu-512 what's your system configuration?
Please copy and paste the output from the
environment collection script from PyTorch
(or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
hi @fmassa
env info as below:
PyTorch version: 1.0.0.dev20181202
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
GPU 2: GeForce GTX TITAN X
GPU 3: GeForce GTX TITAN X
Nvidia driver version: 396.54
cuDNN version: Probably one of the following:
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn.so
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn.so.7
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn.so.7.4.1
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn_static.a
/usr/local/lib/libcudnn.so
/usr/local/lib/libcudnn.so.7
/usr/local/lib/libcudnn.so.7.4.1
/usr/local/lib/libcudnn_static.a
Pip version
pip 18.1 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
CUDA Version 9.2.148
By the way, I don't use conda.
Can you launch your jobs with NCCL_DEBUG=1 python ...?
A few points:
1 - maybe set mp.set_start_method('spawn') at the beginning of the script?
2 - whenever your process hangs, can you try printing its stack trace from gdb?
Do the following:
gdb attach <process_id>
> thread apply all bt
and paste the result here?
output as below:
Attaching to process 15172
[New LWP 15360]
[New LWP 15466]
[New LWP 15471]
[New LWP 15629]
[New LWP 15630]
[New LWP 15631]
[New LWP 15632]
[New LWP 15678]
[New LWP 15679]
[New LWP 15680]
[New LWP 15681]
[New LWP 15682]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffd0264ab6d in clock_gettime ()
(gdb) thread apply all bt
Thread 13 (Thread 0x7fbb33fff700 (LWP 15682)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbc9081c91c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007fbc778123fb in torch::autograd::ReadyQueue::pop() ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#3 0x00007fbc77814f93 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*)
() from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#4 0x00007fbc77811917 in torch::autograd::Engine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#5 0x00007fbc88b4859a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#6 0x00007fbc90821c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fbc960316ba in start_thread (arg=0x7fbb33fff700) at pthread_create.c:333
#8 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 12 (Thread 0x7fbb3bfff700 (LWP 15681)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbc9081c91c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007fbc778123fb in torch::autograd::ReadyQueue::pop() ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#3 0x00007fbc77814f93 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*)
() from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#4 0x00007fbc77811917 in torch::autograd::Engine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#5 0x00007fbc88b4859a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#6 0x00007fbc90821c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fbc960316ba in start_thread (arg=0x7fbb3bfff700) at pthread_create.c:333
#8 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 11 (Thread 0x7fbb40825700 (LWP 15680)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbc9081c91c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007fbc778123fb in torch::autograd::ReadyQueue::pop() ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#3 0x00007fbc77814f93 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*)
() from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#4 0x00007fbc77811917 in torch::autograd::Engine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#5 0x00007fbc88b4859a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#6 0x00007fbc90821c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fbc960316ba in start_thread (arg=0x7fbb40825700) at pthread_create.c:333
#8 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 10 (Thread 0x7fbb41026700 (LWP 15679)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbc9081c91c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007fbc778123fb in torch::autograd::ReadyQueue::pop() ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#3 0x00007fbc77814f93 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*)
() from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#4 0x00007fbc77811917 in torch::autograd::Engine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#5 0x00007fbc88b4859a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#6 0x00007fbc90821c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fbc960316ba in start_thread (arg=0x7fbb41026700) at pthread_create.c:333
#8 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 9 (Thread 0x7fbb41827700 (LWP 15678)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbc9081c91c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007fbc778123fb in torch::autograd::ReadyQueue::pop() ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#3 0x00007fbc77814f93 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*)
() from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#4 0x00007fbc77811917 in torch::autograd::Engine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#5 0x00007fbc88b4859a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#6 0x00007fbc90821c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fbc960316ba in start_thread (arg=0x7fbb41827700) at pthread_create.c:333
#8 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 8 (Thread 0x7fbba4856700 (LWP 15632)):
#0 0x00007fbc96039827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0,
expected=0, futex_word=0x7fbb3c0013e0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7fbb3c0013e0, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007fbc960398d4 in __new_sem_wait_slow (sem=0x7fbb3c0013e0, abstime=0x0)
at sem_waitcommon.c:181
#3 0x00007fbc9603997a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4 0x00000000004db764 in PyThread_acquire_lock_timed ()
#5 0x00000000005a50e3 in ?? ()
#6 0x000000000052032b in _PyCFunction_FastCallKeywords ()
#7 0x000000000057a6d9 in ?? ()
#8 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#9 0x0000000000572147 in ?? ()
#10 0x000000000057b8bb in ?? ()
#11 0x000000000057a7bc in ?? ()
#12 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#13 0x000000000057ac35 in PyEval_EvalCodeEx ()
#14 0x00000000004f81f1 in ?? ()
#15 0x00000000004e45fa in PyObject_Call ()
#16 0x00000000005746f4 in _PyEval_EvalFrameDefault ()
#17 0x000000000057b7fd in ?? ()
#18 0x000000000057a7bc in ?? ()
#19 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#20 0x000000000057b7fd in ?? ()
#21 0x000000000057a7bc in ?? ()
#22 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#23 0x000000000057c1e3 in _PyFunction_FastCallDict ()
#24 0x00000000004e4f7c in _PyObject_Call_Prepend ()
#25 0x00000000004e45fa in PyObject_Call ()
#26 0x000000000061daf2 in ?? ()
#27 0x00007fbc960316ba in start_thread (arg=0x7fbba4856700) at pthread_create.c:333
#28 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 7 (Thread 0x7fbba5057700 (LWP 15631)):
#0 0x00007fbc96039827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0,
expected=0, futex_word=0x7fbb440013e0)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7fbb440013e0, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007fbc960398d4 in __new_sem_wait_slow (sem=0x7fbb440013e0, abstime=0x0)
at sem_waitcommon.c:181
#3 0x00007fbc9603997a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4 0x00000000004db764 in PyThread_acquire_lock_timed ()
#5 0x00000000005a50e3 in ?? ()
#6 0x000000000052032b in _PyCFunction_FastCallKeywords ()
#7 0x000000000057a6d9 in ?? ()
#8 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#9 0x0000000000572147 in ?? ()
#10 0x000000000057b8bb in ?? ()
#11 0x000000000057a7bc in ?? ()
#12 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#13 0x000000000057ac35 in PyEval_EvalCodeEx ()
#14 0x00000000004f81f1 in ?? ()
#15 0x00000000004e45fa in PyObject_Call ()
#16 0x00000000005746f4 in _PyEval_EvalFrameDefault ()
#17 0x000000000057b7fd in ?? ()
#18 0x000000000057a7bc in ?? ()
#19 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#20 0x000000000057b7fd in ?? ()
#21 0x000000000057a7bc in ?? ()
#22 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#23 0x000000000057c1e3 in _PyFunction_FastCallDict ()
#24 0x00000000004e4f7c in _PyObject_Call_Prepend ()
#25 0x00000000004e45fa in PyObject_Call ()
#26 0x000000000061daf2 in ?? ()
#27 0x00007fbc960316ba in start_thread (arg=0x7fbba5057700) at pthread_create.c:333
#28 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Can you paste the rest of the stack trace?
Thread 6 (Thread 0x7fbba5858700 (LWP 15630)):
#0 0x00007fbc96039827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0,
expected=0, futex_word=0x7fbbd8001620)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7fbbd8001620, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007fbc960398d4 in __new_sem_wait_slow (sem=0x7fbbd8001620, abstime=0x0)
at sem_waitcommon.c:181
#3 0x00007fbc9603997a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4 0x00000000004db764 in PyThread_acquire_lock_timed ()
#5 0x00000000005a50e3 in ?? ()
#6 0x000000000052032b in _PyCFunction_FastCallKeywords ()
#7 0x000000000057a6d9 in ?? ()
#8 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#9 0x0000000000572147 in ?? ()
#10 0x000000000057b8bb in ?? ()
#11 0x000000000057a7bc in ?? ()
#12 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#13 0x000000000057ac35 in PyEval_EvalCodeEx ()
#14 0x00000000004f81f1 in ?? ()
#15 0x00000000004e45fa in PyObject_Call ()
#16 0x00000000005746f4 in _PyEval_EvalFrameDefault ()
#17 0x000000000057b7fd in ?? ()
#18 0x000000000057a7bc in ?? ()
#19 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#20 0x000000000057b7fd in ?? ()
#21 0x000000000057a7bc in ?? ()
#22 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#23 0x000000000057c1e3 in _PyFunction_FastCallDict ()
#24 0x00000000004e4f7c in _PyObject_Call_Prepend ()
#25 0x00000000004e45fa in PyObject_Call ()
#26 0x000000000061daf2 in ?? ()
#27 0x00007fbc960316ba in start_thread (arg=0x7fbba5858700) at pthread_create.c:333
#28 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 5 (Thread 0x7fbc63c55700 (LWP 15629)):
#0 0x00007fbc96039827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0,
expected=0, futex_word=0x7fbbdc029770)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7fbbdc029770, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007fbc960398d4 in __new_sem_wait_slow (sem=0x7fbbdc029770, abstime=0x0)
at sem_waitcommon.c:181
#3 0x00007fbc9603997a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4 0x00000000004db764 in PyThread_acquire_lock_timed ()
#5 0x00000000005a50e3 in ?? ()
#6 0x000000000052032b in _PyCFunction_FastCallKeywords ()
#7 0x000000000057a6d9 in ?? ()
#8 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#9 0x0000000000572147 in ?? ()
#10 0x000000000057b8bb in ?? ()
#11 0x000000000057a7bc in ?? ()
#12 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#13 0x000000000057ac35 in PyEval_EvalCodeEx ()
#14 0x00000000004f81f1 in ?? ()
#15 0x00000000004e45fa in PyObject_Call ()
#16 0x00000000005746f4 in _PyEval_EvalFrameDefault ()
#17 0x000000000057b7fd in ?? ()
#18 0x000000000057a7bc in ?? ()
#19 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#20 0x000000000057b7fd in ?? ()
#21 0x000000000057a7bc in ?? ()
#22 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#23 0x000000000057c1e3 in _PyFunction_FastCallDict ()
#24 0x00000000004e4f7c in _PyObject_Call_Prepend ()
#25 0x00000000004e45fa in PyObject_Call ()
#26 0x000000000061daf2 in ?? ()
#27 0x00007fbc960316ba in start_thread (arg=0x7fbc63c55700) at pthread_create.c:333
#28 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 4 (Thread 0x7fbc51c45700 (LWP 15471)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x00007fbc90ced427 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fbc90c9cff7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fbc90cec6a8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fbc960316ba in start_thread (arg=0x7fbc51c45700) at pthread_create.c:333
#5 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 3 (Thread 0x7fbc51444700 (LWP 15466)):
#0 0x00007fbc9520874d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fbc90cea633 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fbc90d5384d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fbc90cec6a8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fbc960316ba in start_thread (arg=0x7fbc51444700) at pthread_create.c:333
#5 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 2 (Thread 0x7fbc50c43700 (LWP 15360)):
#0 0x00007fbc952158c8 in accept4 (fd=17, addr=..., addr_len=0x7fbc50c42e58, flags=524288)
at ../sysdeps/unix/sysv/linux/accept4.c:40
#1 0x00007fbc90ceb57a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fbc90cddabd in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fbc90cec6a8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fbc960316ba in start_thread (arg=0x7fbc50c43700) at pthread_create.c:333
#5 0x00007fbc9521441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (Thread 0x7fbc96445700 (LWP 15172)):
#0 0x00007ffd0264ab6d in clock_gettime ()
#1 0x00007fbc95222876 in __GI___clock_gettime (clock_id=4, tp=0x7ffd02630540)
at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007fbc90ce9b1e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fbc90d7eb93 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fbc90d9d22f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007fbc90cc7380 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007fbc90be048e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007fbc90be2296 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007fbc90d2fc22 in cuMemcpyHtoDAsync_v2 ()
from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007fbbffc4854c in cudart::driverHelper::memcpyAsyncDispatch(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*, bool) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#10 0x00007fbbffc1b473 in cudart::cudaApiMemcpyAsync(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#11 0x00007fbbffc5e1a8 in cudaMemcpyAsync ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#12 0x00007fbbff616b32 in (anonymous namespace)::copy_from_cpu(at::Tensor&, at::Tensor const&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007fbbff69ba25 in void (anonymous namespace)::_copy__cuda<float>(at::Tensor&, at::Tensor const&, bool) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007fbbff6172a5 in at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool)
() from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007fbbfe491f32 in at::CUDAFloatType::s_copy_(at::Tensor&, at::Tensor const&, bool) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2_gpu.so
#16 0x00007fbc77a8d342 in torch::autograd::VariableType::s_copy_(at::Tensor&, at::Tensor const&, bool) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#17 0x00007fbbf46489a8 in at::TypeDefault::copy_(at::Tensor&, at::Tensor const&, bool) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2.so
#18 0x00007fbbf4648432 in at::TypeDefault::copy(at::Tensor const&, bool, c10::optional<c10::Device>) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2.so
#19 0x00007fbbf4475f1a in at::native::to_impl(at::Tensor const&, at::TensorOptions const&, bool) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2.so
#20 0x00007fbbf4476812 in at::native::to(at::Tensor const&, at::TensorOptions const&, bool, bool) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2.so
#21 0x00007fbbf460e1d7 in at::TypeDefault::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libcaffe2.so
#22 0x00007fbc779c4e0a in torch::autograd::VariableType::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so.1
#23 0x00007fbc88abb957 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#24 0x00007fbc88b2f50c in torch::autograd::THPVariable_to(_object*, _object*, _object*) ()
from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#25 0x000000000052087e in PyCFunction_Call ()
#26 0x0000000000578861 in _PyEval_EvalFrameDefault ()
#27 0x0000000000572147 in ?? ()
#28 0x000000000057b8bb in ?? ()
#29 0x000000000057a7bc in ?? ()
#30 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#31 0x0000000000572506 in ?? ()
#32 0x000000000057b8bb in ?? ()
#33 0x000000000057a7bc in ?? ()
#34 0x0000000000573e3d in _PyEval_EvalFrameDefault ()
#35 0x0000000000572147 in ?? ()
#36 0x000000000057b8bb in ?? ()
#37 0x000000000057a7bc in ?? ()
#38 0x0000000000573e3d in _PyEval_EvalFrameDefault ()
#39 0x000000000057b7fd in ?? ()
#40 0x000000000057a7bc in ?? ()
#41 0x00000000005730bf in _PyEval_EvalFrameDefault ()
#42 0x0000000000572147 in ?? ()
#43 0x0000000000571ec3 in PyEval_EvalCode ()
#44 0x00000000005e63a2 in ?? ()
#45 0x00000000005e680a in PyRun_FileExFlags ()
#46 0x00000000005e65c7 in PyRun_SimpleFileExFlags ()
#47 0x00000000005ebae3 in Py_Main ()
#48 0x00000000004d1ae9 in main ()
I don't have an idea yet of what it could be.
And I assume that setting the multiprocessing method to spawn didn't change anything?
I also hit the same problem on 4 x P40 and 4 x V100 servers. It seems related to data loading: on a server where the disk is slow, it is more likely to hang. But I'm not totally sure it's due to data loading. I have tried sleeping before main and using spawn, but neither solves the problem.
This is very weird. I've not experienced deadlocks for a while.
Which version of PyTorch are you on @caiqi ?
The PyTorch version is:
PyTorch version: 1.0.0.dev20181130
Can you try changing the distributed backend to use the new c10d backend?
Basically, replace all instances of torch.distributed.deprecated with torch.distributed.
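The change is mechanical: anywhere the code imports the deprecated module, point it at the new c10d-backed package instead. An illustrative diff (the exact import style may differ per file):

```diff
-import torch.distributed.deprecated as dist
+import torch.distributed as dist
```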
hi @fmassa
When I try to set the multiprocessing method to spawn, it raises an error saying the multiprocessing start method has already been set.
I will try what you suggested, replacing all instances of torch.distributed.deprecated with torch.distributed, and will report the results here later.
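The "already been set" error comes from calling multiprocessing.set_start_method a second time; the standard library allows forcing the change (or using a private context via get_context). A small sketch, independent of the repo's code:

```python
import multiprocessing as mp

def ensure_spawn():
    """Switch the global start method to 'spawn', even if one was already chosen."""
    try:
        mp.set_start_method("spawn")
    except RuntimeError:
        # A start method was already set (e.g. by an imported library);
        # force=True overrides it instead of raising.
        mp.set_start_method("spawn", force=True)
    return mp.get_start_method()

if __name__ == "__main__":
    print(ensure_spawn())  # prints spawn
```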
@fmassa After replacing torch.distributed.deprecated with torch.distributed, I ran the previous experiments 3 times and did not encounter the deadlocks. I will run more experiments to see whether c10d solves the problem.
Awesome, thanks for the information! If it indeed solves all the problems for you, would you mind sending a PR replacing the deprecated backend with the new one?
hi @fmassa
Do I need to remove deprecated in this line?
After replacing all instances of torch.distributed.deprecated with torch.distributed, I no longer encounter the deadlocks.
Great, I'll be merging the PR that replaces torch.distributed.deprecated with torch.distributed. Thanks for trying it out and sending the PR!
I encountered a similar issue, but torch.distributed.deprecated had already been replaced with torch.distributed. I thought it was a PyTorch issue and posted it here:
https://discuss.pytorch.org/t/distributed-training-hangs/46263
Any more suggestions on the fix?