Maskrcnn-benchmark: Program turns into zombie process when killed using `ctrl-c`

Created on 10 Apr 2019 · 12 comments · Source: facebookresearch/maskrcnn-benchmark

🐛 Bug

0% utilization on the second GPU during 2-GPU training

[Screenshot from 2019-04-10 17-25-07]

Is the second GPU only used to store tensors? Is multi-GPU training in this codebase implemented in some special way that differs from standard multi-GPU training in PyTorch?

To Reproduce

Run the training code with 2 GPUs.

Expected behavior

Comparable utilization on both GPUs.

Environment

PyTorch version: 1.0.0.dev20190409
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN X (Pascal)

Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Pillow (6.0.0)

UPDATE: This is actually an incorrect description of the problem, but it is kept here for continuity. The correct description of the problem is in the post below.

Most helpful comment

I have the same problem

All 12 comments

How did you launch your 2-GPU job? This behavior is not expected.

Also, I just noticed that you have two different GPUs. What might be happening is that the faster GPU is waiting for the slower one to finish its iteration.

It seems that the 2080 Ti does not have peer-to-peer (P2P) enabled, which can make multi-GPU training much slower because memory transfers between GPUs must pass through the CPU:

https://www.pugetsystems.com/labs/hpc/P2P-peer-to-peer-on-NVIDIA-RTX-2080Ti-vs-GTX-1080Ti-GPUs-1331/

I reinstalled the NVIDIA driver, installed the latest pytorch-nightly, and the problem disappeared.

@fmassa My previous assessment of the problem was wrong. The actual problem is that the program often turns into zombie processes when I press ctrl-c to kill it: it is no longer running, but it still holds GPU memory and shows up in top and nvidia-smi. The 100% utilization displayed in nvidia-smi is misleading because the program has already stopped. I always have to kill each spawned process manually by PID using the kill command. Sometimes even killing doesn't work; in those cases, I can only reboot my computer.

I launch my GPU job with the following command:

NGPU=2
python -m torch.distributed.launch --nproc_per_node=$NGPU tools/train_net.py --config-file configs/<...>

One of the config files I used:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/FAIR/20171220/X-101-32x8d"
  BACKBONE:
    CONV_BODY: "R-101-FPN"
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    STRIDE_IN_1X1: False
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
DATASETS:
  TRAIN: ("crowdhuman_train", )
  TEST: ("crowdhuman_val",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  IMS_PER_BATCH: 2
TEST:
  IMS_PER_BATCH: 2
INPUT:
  MIN_SIZE_TRAIN: (800,)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
OUTPUT_DIR: "results/exp2"

I have tried testing with other configs as well and the problem remains.

I am quite sure there is a bug in the code, because this has happened on 2 different computers (I also tried running it on AWS with 2x P100s).

Environment on AWS

PyTorch version: 1.1.0a0+be364ac
Is debug build: No
CUDA used to build PyTorch: 10.1.105

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB

Nvidia driver version: 410.104
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.16.2
[pip] torch==1.1.0a0+be364ac
[pip] torchtext==0.4.0
[pip] torchvision==0.2.1
[conda] blas 1.0 mkl anaconda
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl_fft 1.0.10 py36ha843d7b_0 anaconda
[conda] mkl_random 1.0.2 py36hd81dba3_0 anaconda
[conda] torch 1.1.0a0+be364ac pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.2.1 pypi_0 pypi
Pillow (5.3.0.post0)

I thought I had solved it, but apparently not.

This is a problem with the cleanup in PyTorch's distributed launch utility: when one of the processes dies, the others might not be killed.

cc'ing @pietern in case he has ideas on how to avoid this situation.
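For context, the fix a launcher needs is roughly this: put each worker in its own process group and forward termination signals to every group, so forked dataloader children cannot outlive a dead or interrupted worker. The sketch below is a hypothetical illustration, not the actual torch.distributed.launch code; the `launch` helper and its behavior are assumptions for the example.

```python
# Hypothetical launcher sketch (NOT the real torch.distributed.launch):
# each worker starts as the leader of a new process group, and ctrl-c or
# a worker failure tears down every group, dataloader children included.
import os
import signal
import subprocess
import sys

def launch(commands):
    # start_new_session=True makes each child the leader of a fresh process
    # group, so signaling the group also reaches its forked subprocesses.
    procs = [subprocess.Popen(cmd, start_new_session=True) for cmd in commands]

    def terminate_all(signum=None, frame=None):
        for p in procs:
            if p.poll() is None:  # still running
                os.killpg(os.getpgid(p.pid), signal.SIGTERM)
        sys.exit(1)

    # Forward ctrl-c (SIGINT) and SIGTERM to all worker groups.
    signal.signal(signal.SIGINT, terminate_all)
    signal.signal(signal.SIGTERM, terminate_all)

    # Reap workers; if any worker fails, tear the remaining ones down too.
    for p in procs:
        if p.wait() != 0:
            terminate_all()

if __name__ == "__main__":
    # Harmless stand-ins for two training workers.
    launch([["sleep", "0.2"], ["sleep", "0.2"]])
```

The key design choice is signaling the process *group* (`os.killpg`) rather than the worker PID alone; killing only the worker is exactly what leaves dataloader processes orphaned.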

If you use ctrl-c to stop the program, make sure every leftover process gets killed. In your case (2 GPUs), there are around 2 + 8 (data-loading) processes. I usually run ps aux | grep python and kill everything related to the training program.
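As a concrete illustration of that cleanup (using a harmless `sleep` as a stand-in for a stuck worker; in practice you would target the real PIDs found via `ps aux | grep python`), the steps look like:

```shell
# Stand-in for a stuck training worker; substitute the real PIDs in practice.
sleep 300 &
BG=$!

# Confirm it is alive, ask it to terminate, and escalate if SIGTERM is ignored.
kill -0 "$BG" && echo "still running: $BG"
kill -TERM "$BG" 2>/dev/null || kill -KILL "$BG" 2>/dev/null
wait "$BG" 2>/dev/null || true

# Verify nothing is left behind.
kill -0 "$BG" 2>/dev/null || echo "cleaned up"
```

With many leftover processes, `pkill -f <script name>` (after reviewing the matches from `pgrep -af <script name>`) kills everything whose command line matches in one step.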

@chengyangfu My expectation is that the ctrl-c signal should propagate to every process; I shouldn't have to kill all of them manually. A good multiprocessing implementation should not have such a problem, so I would consider this a bug. I haven't had time to read through the code yet, but does this library just use the tools provided by PyTorch, so that the problem actually lies in PyTorch? It is still strange, though, because I have always used PyTorch's DataParallel in my other multi-GPU training code and have never hit this problem.

I was browsing through the issues, and it seems that https://github.com/facebookresearch/maskrcnn-benchmark/issues/58 is related to the problem discussed here. The root cause is probably the same: the coordination and communication among the many launched processes are problematic.

I have the same problem

Same here

I met a similar problem. I trained the model with 4 GPUs. After training for thousands of mini-batches, one process died (I cannot tell when or how it died); the utilization of the other three GPUs stayed at 100%, but training had stopped.

| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 20%   27C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 20%   32C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 20%   28C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 20%   30C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:85:00.0 Off |                  N/A |
| 24%   58C    P2    77W / 250W |   3764MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
| 20%   54C    P2    78W / 250W |   4110MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:89:00.0 Off |                  N/A |
| 20%   29C    P8    15W / 250W |     41MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 24%   58C    P2    74W / 250W |   3906MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    4     65245      C   ...ongsq/environments/anaconda3/bin/python  3731MiB |
|    5     65246      C   ...ongsq/environments/anaconda3/bin/python  4077MiB |
|    7     65248      C   ...ongsq/environments/anaconda3/bin/python  3873MiB |
+-----------------------------------------------------------------------------+

As shown above, the process whose PID should be 65247 has been killed for some reason. How should I fix this problem? I cannot reinstall the NVIDIA driver because I don't have root rights.

@Marcovaldong This is not related to the zombie process problem tracked in this issue.

What you're seeing is that a single process crashing causes the remaining processes to launch NCCL kernels that will never complete. This is a known problem with NCCL and has been addressed in the most recent minor release (2.4). Work is in progress to add error detection to the NCCL bindings in PyTorch in pytorch/pytorch#22907. Once that is done and merged, the remaining processes will raise an error once one of their peers is no longer reachable or has crashed.

@pietern Thanks for your reply. I have fixed my problem: there was a dirty sample in my 700k-sample training dataset, and I have found and removed it.

