Hi, I am running on a single 8-GPU machine inside an NVIDIA Docker container with PyTorch 1.0 and CUDA 10.
I followed the script here to run the program.
However, the distributed processes do not terminate after I press Ctrl+C. Some of the processes keep running in the background, and killing them does not terminate them either. Please help.
How do I properly terminate an ongoing distributed training run?
Thank you,
Did you use python -m torch.distributed.launch or python train.py? I've noticed the latter approach is a bit more reliable at killing the child processes with Ctrl+C, since it uses multiprocessing and will propagate the KeyboardInterrupt to the children.
But in general, when you spot zombie processes, you can usually kill them with kill <process_id> or kill -9 <process_id>.
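For reference, a minimal sketch of how you might locate leftover workers before killing them; the train.py pattern is just an example, substitute your own entry point. The [t]rain.py bracket trick simply stops grep from matching its own command line:

# List leftover training processes and their PIDs.
ps aux | grep '[t]rain.py'

# Try a clean shutdown first, then force-kill if a process ignores SIGTERM.
kill <process_id>
kill -9 <process_id>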
Thanks for the response.
I tried python train.py, but running the program with python -m torch.distributed.launch --nproc_per_node 2 train.py ${DATA_DIR} --ddp-backend=no_c10d is faster.
So I use this distributed.launch command, and I cannot completely kill all the child processes. Ctrl+C does seem to kill the master process but not the children, and using kill -9 <id> does not work either.
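One pattern that can help when kill -9 on individual PIDs is not enough (a sketch; it assumes the launcher and its workers were started in the same process group, which is the usual case for torch.distributed.launch):

# Find the process group ID (PGID) shared by the python workers.
ps -o pid,pgid,cmd -C python

# SIGKILL the whole group at once; note the leading dash before the PGID.
kill -9 -- -<pgid>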
Thanks
Btw, I think the note about torch.distributed.launch being faster is no longer true, I'll remove it. However, it remains true that in some cases --ddp-backend=no_c10d is faster (this is likely the case in your setting).
I use this script to kill zombie processes.
kill $(ps aux | grep "train.py" | grep -v grep | awk '{print $2}')
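If the pattern is unambiguous, pkill can do the same in one step (equivalent to the ps/grep/awk pipeline above, assuming "train.py" only matches your training processes):

# Kill every process whose full command line contains train.py.
pkill -f train.py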
kill -9 does work for me!