In train_net.py I use register_coco_instances() to register my train and validation datasets.
tools/train_net.py --num-gpus 4 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml DATASETS.TRAIN ('mydataset_train',) DATASETS.TEST ('mydataset_val',) SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.01 OUTPUT_DIR /home/$USER/shared
within a Docker container.
[02/05 23:05:59 d2.utils.events]: eta: 9:07:04 iter: 3659 total_loss: 1.117 loss_cls: 0.345 loss_box_reg: 0.285 loss_mask: 0.343 loss_rpn_cls: 0.077 loss_rpn_loc: 0.105 time: 0.3771 data_time: 0.0084 lr: 0.010000 max_mem: 3402M
[02/05 23:06:07 d2.utils.events]: eta: 9:07:43 iter: 3679 total_loss: 1.123 loss_cls: 0.355 loss_box_reg: 0.285 loss_mask: 0.353 loss_rpn_cls: 0.065 loss_rpn_loc: 0.092 time: 0.3772 data_time: 0.0082 lr: 0.010000 max_mem: 3402M
[02/05 23:06:14 d2.utils.events]: eta: 9:08:00 iter: 3699 total_loss: 1.137 loss_cls: 0.329 loss_box_reg: 0.300 loss_mask: 0.348 loss_rpn_cls: 0.068 loss_rpn_loc: 0.081 time: 0.3773 data_time: 0.0083 lr: 0.010000 max_mem: 3402M
[02/05 23:06:22 d2.utils.events]: eta: 9:07:50 iter: 3719 total_loss: 1.175 loss_cls: 0.351 loss_box_reg: 0.297 loss_mask: 0.360 loss_rpn_cls: 0.066 loss_rpn_loc: 0.096 time: 0.3773 data_time: 0.0089 lr: 0.010000 max_mem: 3402M
[02/05 23:06:30 d2.utils.events]: eta: 9:07:50 iter: 3739 total_loss: 1.186 loss_cls: 0.334 loss_box_reg: 0.296 loss_mask: 0.353 loss_rpn_cls: 0.079 loss_rpn_loc: 0.111 time: 0.3774 data_time: 0.0079 lr: 0.010000 max_mem: 3402M
[02/05 23:06:38 d2.utils.events]: eta: 9:08:46 iter: 3759 total_loss: 1.160 loss_cls: 0.359 loss_box_reg: 0.283 loss_mask: 0.355 loss_rpn_cls: 0.070 loss_rpn_loc: 0.081 time: 0.3774 data_time: 0.0084 lr: 0.010000 max_mem: 3402M
[02/05 23:06:45 d2.utils.events]: eta: 9:08:46 iter: 3779 total_loss: 1.285 loss_cls: 0.364 loss_box_reg: 0.303 loss_mask: 0.376 loss_rpn_cls: 0.070 loss_rpn_loc: 0.092 time: 0.3775 data_time: 0.0083 lr: 0.010000 max_mem: 3402M
[02/05 23:06:53 d2.utils.events]: eta: 9:08:20 iter: 3799 total_loss: 1.227 loss_cls: 0.340 loss_box_reg: 0.300 loss_mask: 0.347 loss_rpn_cls: 0.084 loss_rpn_loc: 0.115 time: 0.3775 data_time: 0.0085 lr: 0.010000 max_mem: 3402M
ERROR [02/05 23:07:15 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 220, in run_step
self._write_metrics(metrics_dict)
File "/podc/src/detectron2/detectron2/engine/train_loop.py", line 255, in _write_metrics
all_metrics_dict = comm.gather(metrics_dict)
File "/podc/src/detectron2/detectron2/utils/comm.py", line 200, in gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/podc/src/detectron2/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1158, in all_gather
work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.19.0.2]:13729
[02/05 23:07:15 d2.engine.hooks]: Overall training speed: 3803 iterations in 0:24:16 (0.3829 s / it)
[02/05 23:07:15 d2.engine.hooks]: Total training time: 0:24:19 (0:00:02 on hooks)
------------------------ ---------------------------------------------------------
sys.platform linux
Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
numpy 1.18.1
detectron2 0.1 @/podc/src/detectron2/detectron2
detectron2 compiler GCC 7.4
detectron2 CUDA compiler 10.1
detectron2 arch flags sm_60
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.4.0 @/usr/local/lib/python3.6/dist-packages/torch
PyTorch debug build False
CUDA available True
GPU 0,1,2,3 Tesla P100-SXM2-16GB
CUDA_HOME /usr/local/cuda
NVCC Cuda compilation tools, release 10.1, V10.1.243
Pillow 6.2.2
torchvision 0.5.0 @/usr/local/lib/python3.6/dist-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
------------------------ ---------------------------------------------------------
PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
How reproducible is this? Is it reproducible even with a stock example (i.e. the COCO dataset)?
Since we've never seen such an error, it is probably very specific to your dataset or environment. If neither is accessible to us, it's unlikely we can provide reasonable help.
So, I've tested on three separate boxes, with 4 GPUs and with 1 GPU. The issue hasn't appeared when training on 1 GPU, at least not yet (that run is still training); however, I get the issue every time I train with more than 1 GPU, though so far I've only tried 4 GPUs. I'm testing with the stock COCO dataset from 2014, registered under a different name to make sure I can use my own datasets. I found this issue, https://github.com/pytorch/pytorch/issues/30439, which may be related, but probably isn't.
@ppwwyyxx @CMobley7 I ran into this problem too.
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 125654 bytes but got 14072. Skipping tag 37500
" Skipping tag %s" % (size, len(data), tag)
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:603: UserWarning: Metadata Warning, tag 282 had too many entries: 2, expected 1
% (tag, len(values))
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:603: UserWarning: Metadata Warning, tag 283 had too many entries: 2, expected 1
% (tag, len(values))
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:603: UserWarning: Metadata Warning, tag 34853 had too many entries: 9, expected 1
% (tag, len(values))
ERROR [02/07 10:46:26 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 220, in run_step
self._write_metrics(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 255, in _write_metrics
all_metrics_dict = comm.gather(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 200, in gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1160, in all_gather
work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1570910687230/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.1]:53475
The command I used is "python ./projects/PointRend/train_net.py --config-file ./prs/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco.yaml --num-gpus 4".
Have you solved this problem? How can it be fixed?
Thanks very much.
It happens when my evaluation dataset is fairly large (around 17,000 images at 1300x800 resolution).
@roger1993 have you seen warnings like these?
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 44 bytes but only got 40. Skipping tag 37510
" Skipping tag %s" % (size, len(data), tag)
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 8 bytes but only got 0. Skipping tag 41730
" Skipping tag %s" % (size, len(data), tag)
@roger1993 have you seen warnings like these?
[02/09 08:25:25 d2.utils.events]: eta: 17 days, 5:56:44 iter: 99 total_loss: 0.334 loss_cls: 0.024 loss_box_reg: 0.030 loss_mask: 0.109 loss_mask_point: 0.152 loss_rpn_cls: 0.002 loss_rpn_loc: 0.018 time: 5.8734 data_time: 3.7765 lr: 0.001998 max_mem: 11965M
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 44 bytes but only got 40. Skipping tag 37510
" Skipping tag %s" % (size, len(data), tag)
/opt/conda/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:784: UserWarning: Possibly corrupt EXIF data. Expecting to read 8 bytes but only got 0. Skipping tag 41730
" Skipping tag %s" % (size, len(data), tag)
Because my dataset comes from industry and is very dirty, using read_image in detectron2 caused training to fail, so I changed the read_image function from Pillow to OpenCV at https://github.com/facebookresearch/detectron2/blob/master/detectron2/data/detection_utils.py#L36. You can roughly refer to this issue: https://github.com/facebookresearch/detectron2/issues/788. I don't have the same problem as yours.
@roger1993 Thanks. I changed the read_image function in detectron2 and the warnings disappeared. But the problem still exists, with no other errors or warnings.
error log:
ERROR [02/10 05:02:23 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 220, in run_step
self._write_metrics(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/engine/train_loop.py", line 255, in _write_metrics
all_metrics_dict = comm.gather(metrics_dict)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 200, in gather
size_list, tensor = _pad_to_largest_tensor(tensor, group)
File "/data/s00378650/InstanceSegmentation/detectron2/detectron2-master/detectron2/utils/comm.py", line 126, in _pad_to_largest_tensor
dist.all_gather(size_list, local_size, group=group)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1160, in all_gather
work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1570910687230/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [172.17.0.6]:17030
@CMobley7 @ppwwyyxx Do you guys have any idea about this? thanks~
So, running it with 1 GPU eventually failed as well, because there were a few corrupt JPEGs in the dataset I was using. I re-downloaded the dataset, ran with both 1 GPU and 4 GPUs on 3 different boxes, and didn't get the error again. Note also that I rebuilt my Docker images before running this second test; when installing torchvision 0.5.0, Pillow 7.0.0 was installed instead of 6.2.2. I think the ultimate cause was certain images in my dataset. So, like @roger1993, I'd suggest either cleaning your data or making further updates to the read_image function so that all images that cause an error are skipped. However, it would be nice to see the default read_image function updated so that problematic images are skipped instead of producing an arbitrary "Connection closed by peer" error. Should I close this issue since my problem is solved, or leave it open so that potential work on the read_image function can be tied to an issue?
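For anyone wanting to try the "skip images that cause an error" idea, here is a minimal sketch of the pattern. It is not detectron2's code: `read_or_skip` and `load_readable` are hypothetical names, and `reader` stands for whatever function decodes an image in your setup (the stock read_image, an OpenCV-based replacement, etc.). It only shows how a failure can be turned into a logged skip that names the file.

```python
import logging

logger = logging.getLogger(__name__)


def read_or_skip(path, reader):
    """Call reader(path); on failure, log the file name and return None.

    `reader` is an assumption: any callable that takes a path and returns
    a decoded image, raising an exception when the file cannot be read.
    """
    try:
        return reader(path)
    except Exception as exc:  # decoders raise many different exception types
        logger.warning("Skipping unreadable image %s: %s", path, exc)
        return None


def load_readable(paths, reader):
    """Yield (path, image) pairs, leaving out files the reader failed on."""
    for path in paths:
        image = read_or_skip(path, reader)
        if image is not None:
            yield path, image
```

Running something like this over the dataset once, before training, would at least surface the bad file names in the log. Where exactly a skip would have to be wired into the training data pipeline depends on your setup and is not covered here.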
Thanks, the error message is now more explicit
Dirty data or insufficient GPU resources may lead to the error above.
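To check the dirty-data possibility before a long multi-GPU run, a rough pre-scan like the one below may help. It is a heuristic sketch using only the standard library, not something from detectron2: it flags JPEG files that are empty or that do not begin and end with the JPEG start-of-image and end-of-image markers. The function names are made up for illustration. A file can pass this check and still fail to decode, and a valid file with trailing bytes after the end marker will be flagged, so treat the output as a list of candidates to inspect.

```python
import os

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_truncated(path):
    """Return True if the file is too small or lacks the JPEG start/end markers."""
    if os.path.getsize(path) < 4:
        return True
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)  # jump to the last two bytes of the file
        tail = f.read(2)
    return head != JPEG_SOI or tail != JPEG_EOI


def find_suspect_jpegs(image_root):
    """Walk image_root and list .jpg/.jpeg files that fail the marker check."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(image_root):
        for name in sorted(filenames):
            if name.lower().endswith((".jpg", ".jpeg")):
                path = os.path.join(dirpath, name)
                if looks_truncated(path):
                    suspects.append(path)
    return suspects
```

Calling `find_suspect_jpegs` on the image directory of the registered dataset returns the paths worth re-downloading or removing from the annotation file.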