DALI: NVJPEG error 5

Created on 7 Jun 2019 · 11 Comments · Source: NVIDIA/DALI

I used DALI to process ImageNet data in my training script, but the program always crashed during the validation stage of the first epoch. I am using PyTorch 1.1.0 and training ResNet18 from torchvision on 8x Titan Xp GPUs with torch.nn.DataParallel. My DALI usage follows https://github.com/NVIDIA/DALI/blob/master/docs/examples/pytorch/resnet50/main.py
The error message is as follows:

=> creating model 'resnet18'
DALI "gpu" variant
read 1281167 files from 1000 directories
read 50000 files from 1000 directories

Epoch: [1 | 120]
Processing |################################| (5005/5004) Data: 0.007s | Batch: 0.164s | Total: 0:13:40 | ETA: 0:00:00 | Loss: 4.9892 | top1:  10.8257 | top5:  25.4221
Processing |################                | (100/195) Data: 0.001s | Batch: 0.114s | Total: 0:00:11 | ETA: 0:00:11 | Loss: 3.8560 | top1:  21.2617 | top5:  44.8984
Traceback (most recent call last):
  File "imagenet.py", line 515, in <module>
    main()
  File "imagenet.py", line 332, in main
    val_loss, prec1 = validate(val_loader, model, criterion)
  File "imagenet.py", line 440, in validate
    for i, data in enumerate(val_loader):
  File "/home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 127, in __next__
    outputs.append(p._share_outputs())
  File "/home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 291, in _share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: Error in thread 14: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg_decoder_decoupled_api.h:315] NVJPEG error "5"
Stacktrace (7 entries):
[frame 0]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0xa953e) [0x7fbd0a98a53e]
[frame 1]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x28b517) [0x7fbd0ab6c517]
[frame 2]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x28c0c5) [0x7fbd0ab6d0c5]
[frame 3]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x183) [0x7fbd0aabccc3]
[frame 4]: /home/rll/anaconda3/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6(+0xb8678) [0x7fbd2b633678]
[frame 5]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fbd6b6116ba]
[frame 6]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbd6b34741d]

Current pipeline object is no longer valid.

However, validation completes successfully if the preceding training epoch is skipped:

=> creating model 'resnet18'
DALI "gpu" variant
read 1281167 files from 1000 directories
read 50000 files from 1000 directories

Epoch: [1 | 120]
Processing |################################| (196/195) Data: 0.001s | Batch: 0.251s | Total: 0:00:49 | ETA: 0:00:00 | Loss: 7.4309 | top1:  0.0977 | top5:  0.4743

bug

d-li14

Most helpful comment

That makes sense. NVJPEG error "5" means NVJPEG_STATUS_ALLOCATOR_FAILURE. Since you are using 32 threads per DALI instance, there are 32 nvJPEG instances running per GPU, and twice that if you are running both validation and training (which is the usual case).
Every nvJPEG instance (one per thread) keeps an internal buffer for image decoding. Each buffer is as large as the biggest image that nvJPEG instance has encountered, because the memory is only ever enlarged, never shrunk, to avoid expensive allocation operations. In the case of ImageNet, some images can require 200 MB of device memory to decode, which in your case can lead to more than 6 GB of device memory (for the training itself). Usually there is very little benefit in going above 4 threads per DALI instance when you are using a full GPU pipeline.
I don't know if we can do much beyond reporting the error code returned by nvJPEG.
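
A back-of-the-envelope sketch of the arithmetic above, assuming every decoder thread's buffer has grown to the 200 MB worst case mentioned in the comment. The function name and the `pipelines_per_gpu` parameter are hypothetical and only for illustration. Real usage depends on which images each thread has actually decoded.

```python
def worst_case_decoder_memory_mb(threads_per_pipeline,
                                 pipelines_per_gpu=1,
                                 peak_buffer_mb=200):
    """Upper-bound estimate of nvJPEG buffer memory on one GPU, in MB.

    Assumes one nvJPEG instance per thread and that each instance's buffer
    has grown to `peak_buffer_mb` (buffers grow but never shrink).
    """
    return threads_per_pipeline * pipelines_per_gpu * peak_buffer_mb


for threads in (32, 4):
    one_pipeline = worst_case_decoder_memory_mb(threads)
    # Training and validation pipelines both alive on the same GPU.
    train_and_val = worst_case_decoder_memory_mb(threads, pipelines_per_gpu=2)
    print(f"{threads:>2} threads: {one_pipeline} MB for one pipeline, "
          f"{train_and_val} MB with training and validation together")
```

Under these assumptions, 32 threads give 6400 MB for a single pipeline, which is consistent with the "> 6 GB" figure, while 4 threads stay at 800 MB.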

JanuszL on 17 Jun 2019

👍2

All 11 comments

Hi,
Thank you for the issue report. We have just started optimizing the nvJPEG operator and have encountered some problems with nvJPEG streams. We hope to have that fixed in the next few days. We will let you know when a new binary with the fix is available in the nightly/weekly release channel. Maybe this is what you have encountered. @mzient?

JanuszL on 7 Jun 2019

@JanuszL Thanks a lot for your contributions and timely reply! I will keep tracking the relevant changes in the nvJPEG operator and look forward to good news!

d-li14 on 7 Jun 2019

The probable fix for your problem was merged in https://github.com/NVIDIA/DALI/pull/962. Can you check the latest nightly build?

JanuszL on 14 Jun 2019

Thanks, but I'm sorry, my machines are currently busy running experiments. I will test the nightly build as soon as free GPUs are available.

d-li14 on 14 Jun 2019

I still saw this error with the nightly build. However, I figured out that it occurs when I set a much larger number of workers in the arguments (e.g. 32), in which case the memory usage on GPU 0 is also very large. With the default value of 4, the whole training and validation process runs smoothly without this error.

d-li14 on 15 Jun 2019

👍2

JanuszL on 17 Jun 2019

👍2

OK, got it!

d-li14 on 17 Jun 2019

I added some more verbose errors in https://github.com/NVIDIA/DALI/pull/983 - you should get the enum name with the error so it should give you some clue what went wrong.
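
For DALI builds that still report only the numeric code, a small lookup like the following sketch can translate it. The mapping mirrors the `nvjpegStatus_t` enum as I recall it from `nvjpeg.h`; only the entry for 5 is confirmed in this thread, so the rest is an assumption to check against the header of your nvJPEG version. The helper name `describe_nvjpeg_error` is hypothetical.

```python
import re

# Assumed to match nvjpegStatus_t in nvjpeg.h; verify against your header.
NVJPEG_STATUS_NAMES = {
    0: "NVJPEG_STATUS_SUCCESS",
    1: "NVJPEG_STATUS_NOT_INITIALIZED",
    2: "NVJPEG_STATUS_INVALID_PARAMETER",
    3: "NVJPEG_STATUS_BAD_JPEG",
    4: "NVJPEG_STATUS_JPEG_NOT_SUPPORTED",
    5: "NVJPEG_STATUS_ALLOCATOR_FAILURE",  # confirmed in this thread
    6: "NVJPEG_STATUS_EXECUTION_FAILED",
    7: "NVJPEG_STATUS_ARCH_MISMATCH",
    8: "NVJPEG_STATUS_INTERNAL_ERROR",
}


def describe_nvjpeg_error(message):
    """Return the enum name for an 'NVJPEG error "<n>"' message, or None."""
    match = re.search(r'NVJPEG error "(\d+)"', message)
    if match is None:
        return None
    code = int(match.group(1))
    return NVJPEG_STATUS_NAMES.get(code, f"unknown nvJPEG status {code}")


log_line = 'Critical error in pipeline: Error in thread 14: NVJPEG error "5"'
print(describe_nvjpeg_error(log_line))
```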

JanuszL on 17 Jun 2019

Sorry to reopen but related to this: what would be the issue when the DataLoader is run in --dali_cpu mode and this error occurs? Did the system run out of system memory (rather than GPU memory)?

Attila94 on 24 Mar 2020

@Attila94 NVJPEG error 5 is strictly related to the mixed-mode ImageDecoder, which relies on nvJPEG in its underlying implementation. --dali_cpu is just a switch in the example, so as long as the pipeline doesn't use the mixed-mode ImageDecoder, it should not cause this error.

JanuszL on 24 Mar 2020

Thanks for the quick reply! I now realise that the example always runs the validation pipeline in mixed/gpu mode, regardless of the --dali_cpu argument.

Attila94 on 24 Mar 2020
