Hi, I am running on a single 8-GPU machine inside an NVIDIA Docker container with PyTorch 1.0 and CUDA 10.
I followed the script here to run the program.
However, the distributed processes do not terminate after I press Ctrl+C. Some of the processes keep running in the background, and killing them does not terminate them either. Please help.
How do I properly terminate an ongoing distributed training run?
Thank you,
Did you use python -m torch.distributed.launch or python train.py? I've noticed the latter approach is a bit more reliable at killing the child processes with Ctrl+C, since it uses multiprocessing and will propagate the KeyboardInterrupt to the children.
But in general, when you spot zombie processes, you can usually kill them with kill <process_id> or kill -9 <process_id>.
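For reference, a minimal sketch of how you might locate leftover workers before killing them; the train.py pattern is just an example, substitute your own entry point. The [t]rain.py bracket trick simply stops grep from matching its own command line:

# List leftover training processes and their PIDs.
ps aux | grep '[t]rain.py'

# Try a clean shutdown first, then force-kill if a process ignores SIGTERM.
kill <process_id>
kill -9 <process_id>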
Thanks for the response.
I tried python train.py, but running the program with python -m torch.distributed.launch --nproc_per_node 2 train.py ${DATA_DIR} --ddp-backend=no_c10d is faster.
So I use this distributed.launch command, and I cannot completely kill all the child processes. Ctrl+C does seem to kill the master process but not the children, and using kill -9 <id> does not work either.
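One pattern that can help when kill -9 on individual PIDs is not enough (a sketch; it assumes the launcher and its workers were started in the same process group, which is the usual case for torch.distributed.launch):

# Find the process group ID (PGID) shared by the python workers.
ps -o pid,pgid,cmd -C python

# SIGKILL the whole group at once; note the leading dash before the PGID.
kill -9 -- -<pgid>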
Thanks
Btw, I think the note about torch.distributed.launch being faster is no longer true, I'll remove it. However, it remains true that in some cases --ddp-backend=no_c10d is faster (this is likely the case in your setting).
I use this script to kill zombie processes.
kill $(ps aux | grep "train.py" | grep -v grep | awk '{print $2}')
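If the pattern is unambiguous, pkill can do the same in one step (equivalent to the ps/grep/awk pipeline above, assuming "train.py" only matches your training processes):

# Kill every process whose full command line contains train.py.
pkill -f train.py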
kill -9 does work for me!