I used DALI to process ImageNet data in my training script, but the program always crashed during the validation stage of the first epoch. The PyTorch version is 1.1.0 and I trained ResNet18 from torchvision on 8x Titan Xps using torch.nn.DataParallel. My DALI usage follows https://github.com/NVIDIA/DALI/blob/master/docs/examples/pytorch/resnet50/main.py
The error message is as follows:
=> creating model 'resnet18'
DALI "gpu" variant
read 1281167 files from 1000 directories
read 50000 files from 1000 directories
Epoch: [1 | 120]
Processing |################################| (5005/5004) Data: 0.007s | Batch: 0.164s | Total: 0:13:40 | ETA: 0:00:00 | Loss: 4.9892 | top1: 10.8257 | top5: 25.4221
Processing |################ | (100/195) Data: 0.001s | Batch: 0.114s | Total: 0:00:11 | ETA: 0:00:11 | Loss: 3.8560 | top1: 21.2617 | top5: 44.8984
Traceback (most recent call last):
File "imagenet.py", line 515, in <module>
main()
File "imagenet.py", line 332, in main
val_loss, prec1 = validate(val_loader, model, criterion)
File "imagenet.py", line 440, in validate
for i, data in enumerate(val_loader):
File "/home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 127, in __next__
outputs.append(p._share_outputs())
File "/home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 291, in _share_outputs
return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: Error in thread 14: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg_decoder_decoupled_api.h:315] NVJPEG error "5"
Stacktrace (7 entries):
[frame 0]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0xa953e) [0x7fbd0a98a53e]
[frame 1]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x28b517) [0x7fbd0ab6c517]
[frame 2]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x28c0c5) [0x7fbd0ab6d0c5]
[frame 3]: /home/rll/anaconda3/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x183) [0x7fbd0aabccc3]
[frame 4]: /home/rll/anaconda3/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6(+0xb8678) [0x7fbd2b633678]
[frame 5]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fbd6b6116ba]
[frame 6]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbd6b34741d]
Current pipeline object is no longer valid.
The validation process does complete if the preceding training epoch is skipped:
=> creating model 'resnet18'
DALI "gpu" variant
read 1281167 files from 1000 directories
read 50000 files from 1000 directories
Epoch: [1 | 120]
Processing |################################| (196/195) Data: 0.001s | Batch: 0.251s | Total: 0:00:49 | ETA: 0:00:00 | Loss: 7.4309 | top1: 0.0977 | top5: 0.4743
Hi,
Thank you for the issue report. We have just started optimizing the nvJPEG operator and have run into a problem with nvJPEG streams, which may be what you have encountered. We hope to have it fixed in the next few days, and we will let you know when a new binary with the fix is available in the nightly/weekly release channel. @mzient?
@JanuszL Thanks a lot for your contributions and timely reply! I will keep tracking the relevant changes in the nvJPEG operator. Looking forward to your good news!
The probable fix for your problem was merged in https://github.com/NVIDIA/DALI/pull/962. Can you check the latest nightly build?
Thanks, but I'm sorry to say that my machines are temporarily busy running experiments. I will test the nightly build as soon as free GPUs are available.
I still saw this error with the nightly build. However, I figured out that it occurs when I set a much larger number of workers in the arguments (e.g. 32), in which case the memory usage on GPU 0 is also very large. With the default value of 4, the whole training and validation process runs smoothly without this error message.
That makes sense. NVJPEG error "5" means NVJPEG_STATUS_ALLOCATOR_FAILURE. Since you are using 32 threads per DALI instance, there are 32 nvJPEG instances running per GPU, and double that if you are running both validation and training (which is the usual case).
Every nvJPEG instance (one per thread) keeps an internal buffer for image decoding. Each buffer is as big as the biggest image that the given nvJPEG instance has encountered, because the memory is only ever enlarged, never shrunk, to avoid expensive allocation operations. In the case of ImageNet, some images could require 200MB of device memory to decode, which in your case can lead to > 6GB of device memory (32 threads x 200MB, for the training alone). Usually, there is very little benefit in going above 4 threads per DALI instance when you are using the full GPU pipeline.
I don't know if we can do much beyond reporting the error code returned by nvJPEG.
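As a rough illustration of the arithmetic in the comment above, here is a minimal sketch of the worst-case estimate. It assumes every decoding thread's buffer has already grown to the 200MB figure quoted above, which is an upper bound rather than a measurement, and the helper name `worst_case_decode_bytes` is made up for this example.

```python
# Back-of-the-envelope estimate of nvJPEG decode-buffer memory per GPU.
# All inputs are assumptions taken from the discussion, not measurements.

MB = 1024 ** 2
GB = 1024 ** 3


def worst_case_decode_bytes(num_threads, max_buffer_bytes, num_pipelines=1):
    """Upper bound: each thread's buffer has grown to the largest image size."""
    return num_threads * max_buffer_bytes * num_pipelines


if __name__ == "__main__":
    for threads in (4, 32):
        # One pipeline (training only) versus two (training + validation).
        train_only = worst_case_decode_bytes(threads, 200 * MB)
        train_and_val = worst_case_decode_bytes(threads, 200 * MB, num_pipelines=2)
        print(f"{threads:>2} threads: {train_only / GB:.2f} GiB training only, "
              f"{train_and_val / GB:.2f} GiB training + validation")
```

Under these assumptions, 32 threads can reach about 6.25 GiB for the training pipeline alone and twice that once the validation pipeline starts, which would not fit next to the model on a 12GB Titan Xp. With 4 threads the same bound stays below 2 GiB.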
OK, got it!
I added some more verbose errors in https://github.com/NVIDIA/DALI/pull/983 - you should now get the enum name along with the error code, which should give you some clue about what went wrong.
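For anyone on an older build that still prints only the numeric code, a small lookup along these lines may help. This is a sketch, not DALI's implementation: the function name `describe_nvjpeg_error` is hypothetical, only code 5 is confirmed in this thread, and the other entries follow the `nvjpegStatus_t` enum as I understand it from the nvJPEG documentation, so check them against your own `nvjpeg.h`.

```python
import re

# nvjpegStatus_t values. Only 5 is confirmed by this thread; the others are
# an assumption based on the nvJPEG documentation and should be verified
# against the nvjpeg.h shipped with your CUDA toolkit.
NVJPEG_STATUS = {
    0: "NVJPEG_STATUS_SUCCESS",
    1: "NVJPEG_STATUS_NOT_INITIALIZED",
    2: "NVJPEG_STATUS_INVALID_PARAMETER",
    3: "NVJPEG_STATUS_BAD_JPEG",
    4: "NVJPEG_STATUS_JPEG_NOT_SUPPORTED",
    5: "NVJPEG_STATUS_ALLOCATOR_FAILURE",
    6: "NVJPEG_STATUS_EXECUTION_FAILED",
    7: "NVJPEG_STATUS_ARCH_MISMATCH",
    8: "NVJPEG_STATUS_INTERNAL_ERROR",
    9: "NVJPEG_STATUS_IMPLEMENTATION_NOT_SUPPORTED",
}


def describe_nvjpeg_error(message):
    """Return the enum name for an 'NVJPEG error "N"' message, or None."""
    match = re.search(r'NVJPEG error "?(\d+)"?', message)
    if match is None:
        return None  # the message does not contain an nvJPEG error code
    code = int(match.group(1))
    return NVJPEG_STATUS.get(code, f"unknown nvJPEG status {code}")


if __name__ == "__main__":
    line = 'Critical error in pipeline: Error in thread 14: NVJPEG error "5"'
    print(describe_nvjpeg_error(line))
```

The regular expression accepts the code with or without the surrounding quotes, since both forms appear in this thread.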
Sorry to reopen, but related to this: what would the issue be when the DataLoader is run in --dali_cpu mode and this error occurs? Did the machine run out of system memory (rather than GPU memory)?
@Attila94 NVJPEG error 5 is strictly related to ImageDecoder in mixed mode, which relies on nvJPEG in its underlying implementation. --dali_cpu is just a switch in the example, so as long as the pipeline doesn't use ImageDecoder in mixed mode, it should not cause this error.
Thanks for the quick reply! I now realise that the example always runs the validation pipeline in mixed/gpu mode, regardless of the --dali_cpu argument.