Maskrcnn-benchmark: Error when trying to train: RuntimeError: cuda runtime error (59) : device-side assert triggered

Created on 3 Dec 2018 · 4 comments · Source: facebookresearch/maskrcnn-benchmark

❓ Questions and Help

I'm trying to train on a custom dataset with train.py from the tools folder. I followed the instructions to build a consistent dataloader; cross-checking against the COCO dataloader, all the types and dimensions seem to be in order.

Running with CUDA_LAUNCH_BLOCKING=1 set before python3 train.py (to get more informative output), I get the following error:

2018-12-03 17:06:30,676 maskrcnn_benchmark.trainer INFO: Start training
/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=111 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "relational_rxn_graphs/detector/train.py", line 228, in <module>
    main()
  File "relational_rxn_graphs/detector/train.py", line 221, in main
    model = train(cfg, data_cfg, args.local_rank, args.distributed)
  File "relational_rxn_graphs/detector/train.py", line 71, in train
    arguments,
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 23, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py", line 55, in forward
    [class_logits], [box_regression]
  File "/ibm/gpfs-homes/ial/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py", line 139, in __call__
    classification_loss = F.cross_entropy(class_logits, labels)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/functional.py", line 1928, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/nn/functional.py", line 1771, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111

I have the following configuration:

PyTorch version: 1.0.0a0+5c89190
Is debug build: No
CUDA used to build PyTorch: 9.2.148

OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
GCC version: (GCC) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10 branch, based on subversion id 250395.
CMake version: version 2.8.12.2

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration:
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Nvidia driver version: 396.37
cuDNN version: Probably one of the following:
/usr/local/cudnn-8.0-v5.1/lib64/libcudnn.so.5.1.10
/usr/local/cudnn-8.0-v5.1/lib64/libcudnn_static.a
/usr/local/cudnn-8.0-v6.0/lib64/libcudnn.so.6.0.20
/usr/local/cudnn-8.0-v6.0/lib64/libcudnn_static.a
/usr/local/cudnn-9.0-v7.0/lib64/libcudnn.so.7.0.5
/usr/local/cudnn-9.0-v7.0/lib64/libcudnn_static.a
/usr/local/cudnn-9.1-v7.1.2/lib64/libcudnn.so.7.1.2
/usr/local/cudnn-9.1-v7.1.2/lib64/libcudnn_static.a
/usr/local/cudnn-9.2-v7.1.3/lib64/libcudnn.so.7.1.3
/usr/local/cudnn-9.2-v7.1.3/lib64/libcudnn_static.a
/usr/local/cudnn-9.2-v7.2.1/lib64/libcudnn.so.7.2.1
/usr/local/cudnn-9.2-v7.2.1/lib64/libcudnn_static.a

This is a continuation of #230; the initial bug seems to have been fixed by reinstalling the library and torch.

Nacho114

All 4 comments

This probably means that your class labels are larger than the number of outputs from the model.
Could you check that?

fmassa on 3 Dec 2018

👍6
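
One way to run this check before training is to scan every target label and compare it with the configured class count. The sketch below is plain Python and not part of maskrcnn-benchmark: `find_out_of_range_labels` and `labels_per_image` are hypothetical names, labels are assumed to be integers with 0 reserved for background, and `num_classes` is assumed to mirror the value of `MODEL.ROI_BOX_HEAD.NUM_CLASSES`.

```python
def find_out_of_range_labels(labels_per_image, num_classes):
    """Return (image_index, label) pairs that would violate `t >= 0 && t < n_classes`."""
    bad = []
    for image_index, labels in enumerate(labels_per_image):
        for label in labels:
            label = int(label)  # also accepts scalar tensor or numpy values
            if not 0 <= label < num_classes:
                bad.append((image_index, label))
    return bad


# Hypothetical labels for three images; the model is assumed to have 3 outputs.
labels_per_image = [[1, 2], [3], [2, -1]]
print(find_out_of_range_labels(labels_per_image, num_classes=3))
# -> [(1, 3), (2, -1)]
```

An empty list means every label fits the classifier output. Any entry points to an image whose annotation would trigger the device-side assert.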

It's working now.
Yes, that is what I suspected. I was using a dataloader that had a function returning nb_classes, and I did not realize that the number of classes also had to be added to the config. I assumed the dataloader was the standard way to get nb_classes... big mistake.
Thanks a lot!

Nacho114 on 4 Dec 2018
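
As a rough illustration of the value the config needs, the class count can be derived from the dataset labels rather than hard-coded. This is a minimal sketch with a hypothetical helper, `required_num_classes`, and it assumes foreground labels start at 1 because label 0 is reserved for background (the default of 81 for the 80 COCO categories follows the same convention).

```python
def required_num_classes(foreground_labels):
    """Smallest class count covering every label, with 0 kept for background."""
    labels = {int(label) for label in foreground_labels}
    if not labels:
        raise ValueError("no labels found in the dataset")
    if min(labels) < 1:
        raise ValueError("foreground labels are assumed to start at 1")
    return max(labels) + 1


# Hypothetical dataset with two object categories labelled 1 and 2.
num_classes = required_num_classes([1, 2, 2, 1])
print("MODEL.ROI_BOX_HEAD.NUM_CLASSES", num_classes)
# -> MODEL.ROI_BOX_HEAD.NUM_CLASSES 3
```

The printed key and value can then be written into the YAML config or passed as a command-line override, so the classifier head is built with enough outputs.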

@Nacho114 @fmassa I'm running into this error as well. In my config, I have ROI_BOX_HEAD.NUM_CLASSES set to 3. I'm using the R-50.pkl pretrained weights, so I figured I do not need to follow the instructions in #15. What else is missing? Do you need to set the number of classes within the dataset class? Any help would be appreciated.
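
One possible cause when NUM_CLASSES is already set is that the raw category ids in the annotations are not contiguous (for example 7 and 42), so they fall outside 0..2 even though there are only two categories. This is an assumption about the dataset, not something confirmed in the thread. A minimal sketch of remapping raw ids to 1..N inside the dataset class, using a hypothetical helper `build_contiguous_id_map`:

```python
def build_contiguous_id_map(category_ids):
    """Map arbitrary raw category ids to 1..N, leaving 0 free for background."""
    unique_ids = sorted(set(category_ids))
    return {raw_id: index + 1 for index, raw_id in enumerate(unique_ids)}


# Hypothetical raw ids taken from an annotation file.
raw_ids = [7, 42, 7]
id_map = build_contiguous_id_map(raw_ids)
remapped = [id_map[raw_id] for raw_id in raw_ids]

print(id_map)           # -> {7: 1, 42: 2}
print(remapped)         # -> [1, 2, 1]
print(len(id_map) + 1)  # class count including background -> 3
```

With the remapped labels, two categories plus background fit a setting of 3.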