Detectron2: Connection closed by peer error after ~4000 iterations

Created on 6 Feb 2020 · 10 Comments · Source: facebookresearch/detectron2

Instructions To Reproduce the Issue:

  1. what changes you made (git diff) or what code you wrote

In train_net.py I use register_coco_instances() to register my train and validation datasets.
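
For readers following along, a minimal sketch of what that registration might look like. The dataset names match the command in this report; the directory layout and the helper names `dataset_specs` and `register_my_datasets` are assumptions made for illustration, not the code actually used here.

```python
import os


def dataset_specs(root):
    """Return (name, annotation_json, image_dir) per split.

    The directory layout below is hypothetical; adjust it to your data.
    """
    return [
        ("mydataset_train",
         os.path.join(root, "annotations", "train.json"),
         os.path.join(root, "train")),
        ("mydataset_val",
         os.path.join(root, "annotations", "val.json"),
         os.path.join(root, "val")),
    ]


def register_my_datasets(root):
    # Imported lazily so this module can be loaded without detectron2 installed.
    from detectron2.data.datasets import register_coco_instances

    for name, json_file, image_root in dataset_specs(root):
        # Arguments: name, extra metadata dict, COCO-format JSON, image directory.
        register_coco_instances(name, {}, json_file, image_root)
```

Calling `register_my_datasets(...)` before the trainer is built should make the names usable in `DATASETS.TRAIN` and `DATASETS.TEST`.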

  2. what exact command you run:
    I run
tools/train_net.py --num-gpus 4 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml DATASETS.TRAIN ('mydataset_train',) DATASETS.TEST ('mydataset_val',) SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.01 OUTPUT_DIR /home/$USER/shared

within a Docker container.

  3. what you observed (including the full logs):
    After ~3750 iterations, I keep getting the following error. If I train for fewer than ~3750 iterations, everything finishes successfully.
[02/05 23:05:59 d2.utils.events]: eta: 9:07:04  iter: 3659  total_loss: 1.117  loss_cls: 0.345  loss_box_reg: 0.285  loss_mask: 0.343  loss_rpn_cls: 0.077  loss_rpn_loc: 0.105  time: 0.3771  data_time: 0.0084  lr: 0.010000  max_mem: 3402M
[02/05 23:06:07 d2.utils.events]: eta: 9:07:43  iter: 3679  total_loss: 1.123  loss_cls: 0.355  loss_box_reg: 0.285  loss_mask: 0.353  loss_rpn_cls: 0.065  loss_rpn_loc: 0.092  time: 0.3772  data_time: 0.0082  lr: 0.010000  max_mem: 3402M
[02/05 23:06:14 d2.utils.events]: eta: 9:08:00  iter: 3699  total_loss: 1.137  loss_cls: 0.329  loss_box_reg: 0.300  loss_mask: 0.348  loss_rpn_cls: 0.068  loss_rpn_loc: 0.081  time: 0.3773  data_time: 0.0083  lr: 0.010000  max_mem: 3402M
[02/05 23:06:22 d2.utils.events]: eta: 9:07:50  iter: 3719  total_loss: 1.175  loss_cls: 0.351  loss_box_reg: 0.297  loss_mask: 0.360  loss_rpn_cls: 0.066  loss_rpn_loc: 0.096  time: 0.3773  data_time: 0.0089  lr: 0.010000  max_mem: 3402M
[02/05 23:06:30 d2.utils.events]: eta: 9:07:50  iter: 3739  total_loss: 1.186  loss_cls: 0.334  loss_box_reg: 0.296  loss_mask: 0.353  loss_rpn_cls: 0.079  loss_rpn_loc: 0.111  time: 0.3774  data_time: 0.0079  lr: 0.010000  max_mem: 3402M
[02/05 23:06:38 d2.utils.events]: eta: 9:08:46  iter: 3759  total_loss: 1.160  loss_cls: 0.359  loss_box_reg: 0.283  loss_mask: 0.355  loss_rpn_cls: 0.070  loss_rpn_loc: 0.081  time: 0.3774  data_time: 0.0084  lr: 0.010000  max_mem: 3402M
[02/05 23:06:45 d2.utils.events]: eta: 9:08:46  iter: 3779  total_loss: 1.285  loss_cls: 0.364  loss_box_reg: 0.303  loss_mask: 0.376  loss_rpn_cls: 0.070  loss_rpn_loc: 0.092  time: 0.3775  data_time: 0.0083  lr: 0.010000  max_mem: 3402M
[02/05 23:06:53 d2.utils.events]: eta: 9:08:20  iter: 3799  total_loss: 1.227  loss_cls: 0.340  loss_box_reg: 0.300  loss_mask: 0.347  loss_rpn_cls: 0.084  loss_rpn_loc: 0.115  time: 0.3775  data_time: 0.0085  lr: 0.010000  max_mem: 3402M
ERROR [02/05 23:07:15 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 220, in run_step
    self._write_metrics(metrics_dict)
  File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 255, in _write_metrics
    all_metrics_dict = comm.gather(metrics_dict)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 200, in gather
    size_list, tensor = _pad_to_largest_tensor(tensor, group)
  File "/podc/src/detectron2/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
    dist.all_gather(size_list, local_size, group=group)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1158, in all_gather
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.19.0.2]:13729
[02/05 23:07:15 d2.engine.hooks]: Overall training speed: 3803 iterations in 0:24:16 (0.3829 s / it)
[02/05 23:07:15 d2.engine.hooks]: Total training time: 0:24:19 (0:00:02 on hooks)

Expected behavior

Environment:

------------------------  ---------------------------------------------------------
sys.platform              linux
Python                    3.6.9 (default, Nov  7 2019, 10:44:02) [GCC 8.3.0]
numpy                     1.18.1
detectron2                0.1 @/podc/src/detectron2/detectron2
detectron2 compiler       GCC 7.4
detectron2 CUDA compiler  10.1
detectron2 arch flags     sm_60
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.4.0 @/usr/local/lib/python3.6/dist-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1,2,3               Tesla P100-SXM2-16GB
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.1, V10.1.243
Pillow                    6.2.2
torchvision               0.5.0 @/usr/local/lib/python3.6/dist-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
------------------------  ---------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
enhancement

All 10 comments

How reproducible is this? Is it reproducible even with a stock example (i.e. the COCO dataset)?
Since we've never seen such an error, it is probably very specific to your dataset or environment. If neither is accessible to us, it's unlikely we can provide reasonable help.

So, I've tested on three separate boxes, with 4 GPUs and with 1 GPU. The issue hasn't appeared when training on 1 GPU, at least not yet (it is still training); however, I get it every time I train with more than 1 GPU, though so far I've only tried 4 GPUs. I'm testing with the stock COCO 2014 dataset, registered under a different name to make sure I can use my own datasets. I found this issue, https://github.com/pytorch/pytorch/issues/30439, which may be related, but probably not.

@ppwwyyxx @CMobley7 I ran into this problem too.
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 125654 bytes but got 14072. Skipping tag 37500
" Skipping tag %s" % (size, len(data), tag)
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:603: UserWarning: Metadata Warning, tag 282 had too many entries: 2, expected 1
% (tag, len(values))
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:603: UserWarning: Metadata Warning, tag 283 had too many entries: 2, expected 1
% (tag, len(values))
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:603: UserWarning: Metadata Warning, tag 34853 had too many entries: 9, expected 1
% (tag, len(values))
ERROR [02/07 10:46:26 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 220, in run_step
self._write_metrics(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 255, in _write_metrics
all_metrics_dict = comm.gather(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 200, in gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1160, in all_gather
work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1570910687230/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.1]:53475
The command I used is "python ./projects/PointRend/train_net.py --config-file ./prs/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco.yaml --num-gpus 4".
Have you solved this problem? How can it be fixed?
Thanks very much.

It happens when my evaluation dataset is fairly large (around 17,000 images at 1300x800 resolution).

@roger1993 have you seen this kind of warning?
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 44 bytes but only got 40. Skipping tag 37510
" Skipping tag %s" % (size, len(data), tag)
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 8 bytes but only got 0. Skipping tag 41730
" Skipping tag %s" % (size, len(data), tag)

@roger1993 have you seen this kind of warning?
[02/09 08:25:25 d2.utils.events]: eta: 17 days, 5:56:44 iter: 99 total_loss: 0.334 loss_cls: 0.024 loss_box_reg: 0.030 loss_mask: 0.109 loss_mask_point: 0.152 loss_rpn_cls: 0.002 loss_rpn_loc: 0.018 time: 5.8734 data_time: 3.7765 lr: 0.001998 max_mem: 11965M
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 44 bytes but only got 40. Skipping tag 37510
" Skipping tag %s" % (size, len(data), tag)
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 8 bytes but only got 0. Skipping tag 41730
" Skipping tag %s" % (size, len(data), tag)

Because my dataset comes from industry and is very dirty, using read_image in detectron2 caused training to fail, so I changed the read_image function from Pillow to OpenCV at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/detection_utils.py#L36. You can roughly refer to this issue: https://github.com/facebookresearch/detectron2/issues/788. I don't have the same problem as yours.
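
A rough sketch of what an OpenCV-based reader could look like, for anyone trying the same workaround. This is not the exact change described above: the function name `read_image_cv2` is made up for illustration, and it only mimics the idea of a `read_image(file_name, format)` helper that returns an HxWx3 uint8 array.

```python
def read_image_cv2(file_name, format="BGR"):
    """Load an image with OpenCV instead of Pillow (illustrative sketch).

    Raises ValueError when OpenCV cannot decode the file, so a bad image
    surfaces as an explicit error naming the file.
    """
    # Imported lazily so the module loads even where OpenCV is not installed.
    import cv2

    # cv2.imread returns None (rather than raising) when the file is
    # missing or cannot be decoded; the result is BGR, uint8, HxWx3.
    image = cv2.imread(file_name, cv2.IMREAD_COLOR)
    if image is None:
        raise ValueError("Could not read image: {}".format(file_name))
    if format == "RGB":
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return image
```

Raising an error that names the file makes it easier to find the offending image than a bare distributed failure on another rank.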

@roger1993 Thanks. I changed the read_image function in detectron2 and the warnings disappeared. But the problem still exists, with no other types of errors or warnings.
error log:
ERROR [02/10 05:02:23 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 220, in run_step
self._write_metrics(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 255, in _write_metrics
all_metrics_dict = comm.gather(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 200, in gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1160, in all_gather
work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1570910687230/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.17.0.6]:17030
@CMobley7 @ppwwyyxx Do you guys have any idea about this? thanks~

So, running it with 1 GPU eventually failed because there were a few corrupt JPEGs in the dataset I was using. I re-downloaded the dataset, ran with both 1 GPU and 4 GPUs on 3 different boxes, and didn't receive the error again. Note also that I rebuilt my Docker images before running this second test; when installing torchvision 0.5.0, Pillow 7.0.0 was installed instead of 6.2.2. I think the ultimate cause was certain images in my dataset. So, like @roger1993, I'd suggest either cleaning your data or updating the read_image function further so that all images that cause an error are skipped. However, it would be nice to see the default read_image function updated so that problematic images are skipped instead of producing an arbitrary "Connection closed by peer" error. Should I close this issue since my problem is solved, or leave it open so that potential work on the read_image function can be tied to an issue?
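
One way to act on the "clean your data" suggestion is to scan the dataset before training. The sketch below uses only the standard library and a simple heuristic: a complete JPEG normally starts with the bytes FF D8 and ends with FF D9. The helper names are hypothetical, the check is an assumption rather than a full decode (some valid files carry trailing bytes and would be flagged), and the only detectron2 convention relied on is the `file_name` key of dataset dicts.

```python
import os

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic check: file exists, starts with SOI and ends with EOI."""
    try:
        if os.path.getsize(path) < 4:
            return False
        with open(path, "rb") as f:
            head = f.read(2)
            f.seek(-2, os.SEEK_END)
            tail = f.read(2)
    except OSError:
        # Missing or unreadable files are treated as bad.
        return False
    return head == JPEG_SOI and tail == JPEG_EOI


def split_by_readability(dataset_dicts, check=looks_like_complete_jpeg):
    """Split records into (good, bad) according to check(record["file_name"])."""
    good, bad = [], []
    for record in dataset_dicts:
        (good if check(record["file_name"]) else bad).append(record)
    return good, bad
```

Running this once over the training and validation records and logging the `bad` list should point at truncated downloads before a multi-GPU job is started; a stricter check could decode each image instead.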

Thanks, the error message is now more explicit

Dirty data or insufficient GPU resources may lead to the error above.
