Hello, I keep hitting a "NVJPEG_STATUS_ALLOCATOR_FAILURE" crash when using DALI for CUDA 10.0 to read JPEG files. I am running on 2x 2080 Ti and set num_threads to 4 in the DALI Pipeline. The dataset is on a 7 TB, 7200 rpm HDD. I wonder whether there are too many threads while the hard disk reads too slowly, resulting in I/O conflicts. I would appreciate any advice on how to prevent this crash. Thanks a lot.
Traceback (most recent call last):
File "train.py", line 320, in
main()
File "train.py", line 171, in main
prec1 = validate(val_loader, model, criterion, device)
File "train.py", line 236, in validate
for i, data in enumerate(val_loader):
File "/home/yang/anaconda3/envs/pytorch/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 158, in __next__
outputs.append(p.share_outputs())
File "/home/yang/anaconda3/envs/pytorch/lib/python3.7/site-packages/nvidia/dali/pipeline.py", line 410, in share_outputs
return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: Error in thread 0: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg/decoupled_api/nvjpeg_decoder_decoupled_api.h:355] NVJPEG error "5" : NVJPEG_STATUS_ALLOCATOR_FAILURE n02130308/ILSVRC2012_val_00033687.JPEG
Stacktrace (7 entries):
Current pipeline object is no longer valid.
An error occurred in nvJPEG worker thread:
Error in thread 0: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg/decoupled_api/nvjpeg_decoder_decoupled_api.h:362] NVJPEG error "6" : NVJPEG_STATUS_EXECUTION_FAILED n02134084/ILSVRC2012_val_00021883.JPEG
Stacktrace (7 entries):
Hi,
In your case, you have simply run out of memory, so nvJPEG cannot allocate any more. To prevent this, as you pointed out yourself, you could reduce the number of threads in DALI (3-4 is usually reasonable) and also reduce the batch size you use in training.
We know how painful it is that DALI uses additional GPU memory, but once you move your processing to the GPU this cannot be avoided.
Thanks for the detailed reply. I logged GPU memory usage at regular intervals and found that it grew during training. I wonder whether this is a memory leak. If so, could you advise me how to clear the cache or avoid it? I didn't find any API for that in the documentation. Thanks a lot.
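For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of periodic GPU memory logging. It assumes `nvidia-smi` is on the PATH; the helper names `parse_memory_used` and `query_memory_used` are made up for illustration and are not part of DALI.

```python
import subprocess


def parse_memory_used(csv_text):
    """Turn nvidia-smi CSV output (one number per GPU, in MiB) into a list of ints."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used():
    """Ask nvidia-smi for the memory currently used on each GPU (MiB)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)
```

Calling `query_memory_used()` every few hundred iterations and printing the result should show whether usage keeps climbing or levels off after a few epochs.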
GPU memory usage will grow and should saturate within a couple of epochs (we reallocate memory to hold new combinations of images in the batch). It is very likely that the combination of the biggest images does not occur in the first epoch.
You cannot reduce the memory consumption itself, but you can try passing a hint to the GPU image decoder about the required memory - https://github.com/NVIDIA/DALI/blob/master/docs/examples/pytorch/resnet50/main.py#L89 (we have checked that for ImageNet this is the maximum amount of memory needed for the intermediate buffers of this operator).
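A rough sketch of what passing such a hint could look like. The padding values are what I recall from the linked ResNet50 example and should be checked against that file; `host_memory_padding` also comes from that example rather than from this thread, and the helper names `decoder_kwargs` and `build_decoder` are made up for illustration.

```python
# Assumed values: recalled from the linked ResNet50 example, verify before use.
DEVICE_MEMORY_PADDING = 211025920
HOST_MEMORY_PADDING = 140544512


def decoder_kwargs(use_gpu_decoder=True):
    """Build keyword arguments for the image decoder.

    The padding hints only apply to the GPU ("mixed") decoder, so they are
    set to 0 when decoding on the CPU.
    """
    return {
        "device": "mixed" if use_gpu_decoder else "cpu",
        "device_memory_padding": DEVICE_MEMORY_PADDING if use_gpu_decoder else 0,
        "host_memory_padding": HOST_MEMORY_PADDING if use_gpu_decoder else 0,
    }


def build_decoder(use_gpu_decoder=True):
    """Create the decoder operator; DALI is imported lazily so the helper above works without it."""
    import nvidia.dali.ops as ops
    import nvidia.dali.types as types

    return ops.ImageDecoder(output_type=types.RGB, **decoder_kwargs(use_gpu_decoder))
```

With the hint in place the buffers are allocated at their maximum size up front, so memory usage should start higher but stay flat instead of growing when a bigger image shows up.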
Thank you for your thorough and detailed response! With your help, I have figured out the reason for the memory growth. Thanks!
@PoonKinWang - so it was the ImageReader in your case?
I guess you are asking whether I use FileReader to read the images? Yes, I do. I also use it for ImageNet.
I meant ImageDecoder, sorry for the confusion.
No problem. Yes, I do, and thanks to your tip I noticed the explanation of its device_memory_padding parameter: when a bigger image is encountered, the internal buffer needs to be reallocated to decode it, just as you said.