I got an error similar to #1073 when reading an MXNet .rec file with DALI. I think the record file itself is okay, because I can read it and train my network with the MXNet code.
However, when I use DALI's MXNetReader to read the data and train the model in PyTorch, I get the error message below:
RuntimeError: Critical error in pipeline: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg/decoupled_api/nvjpeg_decoder_decoupled_api.h:224] [/opt/dali/dali/image/jpeg.cc:132] Assert on "jpeg::GetImageInfo(encoded_buffer, length, &width, &height, &components) == true" failed
Stacktrace (12 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0xe53fe) [0x7fbde7b943fe]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0xebd0d) [0x7fbde7b9ad0d]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Image::PeekShape() const+0x12) [0x7fbde7b97d72]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x18b057) [0x7fbde7c3a057]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x18e4fd) [0x7fbde7c3d4fd]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x264e15) [0x7fbde7d13e15]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x265739) [0x7fbde7d14739]
[frame 7]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x23f5d8) [0x7fbde7cee5d8]
[frame 8]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x2a48e6) [0x7fbde7d538e6]
[frame 9]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x107f310) [0x7fbde8b2e310]
[frame 10]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fbe4ba9b6db]
[frame 11]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fbe4bdd488f]
File: /media/ssd0/faces_emore/train.rec at index 16535993320
Here is the definition of my pipeline:
from os.path import join

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline


class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=True):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
        # MXNet rec reader; `args` is defined elsewhere in the training script (not shown)
        self.input = ops.MXNetReader(path=join(data_dir, "train.rec"), index_path=join(data_dir, "train.idx"),
                                     random_shuffle=True, shard_id=args.local_rank, num_shards=args.world_size)
        # Let the user decide which pipeline works best for the RN version they run
        dali_device = 'cpu' if dali_cpu else 'gpu'
        decoder_device = 'cpu' if dali_cpu else 'mixed'
        # This padding sets the size of the internal nvJPEG buffers to be able to handle all images from full-sized ImageNet
        # without additional reallocations
        device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
        host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
        self.decode = ops.ImageDecoder(device=decoder_device, output_type=types.RGB)
        self.resize = ops.Resize(device=dali_device, resize_x=112, resize_y=112, interp_type=types.INTERP_TRIANGULAR)
        self.cmnp = ops.CropMirrorNormalize(device=dali_device,
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(112, 112),
                                            image_type=types.RGB,
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        self.coin = ops.CoinFlip(probability=0.5)
        print('DALI "{0}" variant'.format(dali_device))

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.resize(images)
        output = self.cmnp(images.gpu(), mirror=rng)
        return [output, self.labels]
DALI version: 0.15.0.
Hi,
For some reason 'GetImageInfo' fails. Most likely it is some error in the image itself. You can try to extract the image using MXNet API and see if it can be displayed by any viewer.
It works in MXNet because MXNet reads and decodes sample by sample and assembles the batch only at the very end. So if any sample is bad, it can be skipped and another one can be read in its place.
In the case of DALI, we work on the whole batch from the very beginning, and we cannot read a replacement sample when one is faulty (decoding works on a batch that has already been created).
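One way to act on the suggestion to extract the image with the MXNet API is to scan the whole record file and flag every record whose payload is empty or does not start with the JPEG start-of-image marker. The following is a minimal sketch, not something DALI provides: the helper names (looks_like_jpeg, find_bad_samples, iter_rec_samples) are hypothetical, and the check is only a cheap heuristic, so a record that passes may still fail to decode. Records that carry only metadata and no image would be flagged as well.

```python
def looks_like_jpeg(payload):
    """Cheap sanity check: the payload must start with the JPEG SOI marker FF D8.

    An empty payload fails this check as well.
    """
    return payload[:2] == b"\xff\xd8"


def find_bad_samples(samples):
    """Return the keys of all (key, payload) pairs that fail the JPEG check."""
    return [key for key, payload in samples if not looks_like_jpeg(payload)]


def iter_rec_samples(rec_path, idx_path):
    """Yield (key, payload) for every record in an indexed MXNet record file.

    mxnet is imported lazily so the helpers above work without it installed.
    """
    import mxnet as mx

    record = mx.recordio.MXIndexedRecordIO(idx_path, rec_path, "r")
    try:
        for key in record.keys:
            # unpack() splits a raw record into its header and the image bytes
            _header, payload = mx.recordio.unpack(record.read_idx(key))
            yield key, payload
    finally:
        record.close()


# Example usage (paths are placeholders for your own files):
# bad = find_bad_samples(iter_rec_samples("train.rec", "train.idx"))
# print("suspicious record keys:", bad)
```

Any key reported this way can then be read again with read_idx and written to disk to check whether an image viewer can open it.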
@JanuszL Thanks for your response. After carefully checking sample by sample, I found a corrupted sample in the record file with an empty data field.
@JanuszL
Hello, I would like to ask: when reading MXNet rec data for PyTorch training, I get
Assert on "jpeg::GetImageInfo(encoded_buffer, length, &width, &height, &components) == true" failed
Current pipeline object is no longer valid
This is presumably caused by a faulty image, but the rec file is very large. Can DALI's data processing skip the failing sample directly?
@sky186
Based on the source code I guess the image is empty. Can you try to extract it and see if that is true?
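Since DALI builds the batch before decoding and cannot swap in a replacement sample, one possible workaround for a large file is to rewrite it once without the faulty records and point the reader at the cleaned copy. The sketch below is an assumption-laden illustration, not a DALI feature: copy_valid_records and clean_rec_file are hypothetical helpers, the validity test is only the JPEG start-of-image check, and the original record keys are kept as they are.

```python
def copy_valid_records(reader, writer, payload_of, is_valid):
    """Copy records from reader to writer, skipping invalid payloads.

    reader must expose `keys` and `read_idx(key)`; writer must expose
    `write_idx(key, raw)`. Returns the list of skipped keys.
    """
    skipped = []
    for key in reader.keys:
        raw = reader.read_idx(key)
        if is_valid(payload_of(raw)):
            writer.write_idx(key, raw)  # keep the raw record unchanged
        else:
            skipped.append(key)
    return skipped


def clean_rec_file(src_rec, src_idx, dst_rec, dst_idx):
    """Write a copy of an indexed MXNet record file without non-JPEG records.

    mxnet is imported lazily so copy_valid_records can be used without it.
    """
    import mxnet as mx

    reader = mx.recordio.MXIndexedRecordIO(src_idx, src_rec, "r")
    writer = mx.recordio.MXIndexedRecordIO(dst_idx, dst_rec, "w")
    try:
        return copy_valid_records(
            reader,
            writer,
            payload_of=lambda raw: mx.recordio.unpack(raw)[1],
            is_valid=lambda payload: payload[:2] == b"\xff\xd8",
        )
    finally:
        reader.close()
        writer.close()


# Example usage (paths are placeholders for your own files):
# skipped = clean_rec_file("train.rec", "train.idx", "train_clean.rec", "train_clean.idx")
# print("skipped record keys:", skipped)
```

The copy needs a single pass over the file, so it costs roughly one extra read and write of the dataset, after which the pipeline can read the cleaned file without changes other than the paths.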