DALI: Error when using a non-resized RecordIO file

Created on 19 Jul 2018 · 8 Comments · Source: NVIDIA/DALI

I'm testing the mxnet-resnet50.ipynb example by converting it into a python script.

I use a RecordIO file whose images were not resized, to preserve the original image quality.

Then I got the following error:

$ python mxnet-resnet50.py
Training pipeline epoch size: 1281167
Validation pipeline epoch size: 50000
Traceback (most recent call last):
  File "mxnet-resnet50.py", line 127, in <module>
    dali_val_iter = DALIClassificationIterator(valpipes, valpipes[0].epoch_size("Reader"))
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 151, in __init__
    data_layout)
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 70, in __init__
    self._first_batch = self.next()
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 92, in __next__
    outputs.append(p.run())
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 164, in run
    return self.outputs()
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 153, in outputs
    return self._pipe.Outputs()
RuntimeError: Critical error in pipeline: [/opt/dali/dali/pipeline/operators/fused/crop_mirror_normalize.cu:346] Assert on "H >= crop_h_" failed

My guess is that some of the original images have one edge shorter than 224px, which breaks the "H >= crop_h_" check. In MXNet we upscale the image before cropping if necessary: https://github.com/apache/incubator-mxnet/blob/master/src/io/image_aug_default.cc#L436
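
To make the guess above concrete, here is a minimal sketch of the size condition behind the failed assert. It is plain Python, not DALI or MXNet code; the helper name `too_small_for_crop` and the sample sizes are hypothetical and only illustrate which images would trip a 224x224 crop.

```python
def too_small_for_crop(sizes, crop_h=224, crop_w=224):
    """Return the indices of (height, width) pairs that cannot hold the crop.

    Hypothetical helper: it mirrors the idea of the "H >= crop_h_" check,
    not the actual DALI implementation.
    """
    return [i for i, (h, w) in enumerate(sizes) if h < crop_h or w < crop_w]


# Made-up image sizes as (height, width); not taken from ImageNet.
sizes = [(500, 375), (200, 300), (224, 224), (333, 180)]
print(too_small_for_crop(sizes))  # prints [1, 3]
```

Any image reported by a check like this needs to be upscaled (or the whole set resized) before a fixed-size crop can succeed.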

bug enhancement

hetong007

All 8 comments

Hi,
We will look into this. One solution that comes to mind is a conditional resize operator that upscales an image only if it is smaller than the requested size.
Internally tracked DALI-152.

JanuszL on 19 Jul 2018
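
As an illustration of the conditional resize idea proposed in the comment above, the size logic could look roughly like the sketch below. This is plain Python under stated assumptions: `conditional_upscale_size` is a hypothetical helper, not an existing DALI operator or the MXNet implementation, and it assumes the aspect ratio should be preserved when upscaling.

```python
def conditional_upscale_size(h, w, min_h, min_w):
    """Return the (height, width) an image should be resized to.

    The size is unchanged if the image already covers (min_h, min_w);
    otherwise both edges are scaled up by the same factor so that the
    result covers the requested size. Hypothetical helper for illustration.
    """
    if h >= min_h and w >= min_w:
        return h, w
    # Smallest uniform scale that makes both edges large enough.
    scale = max(float(min_h) / h, float(min_w) / w)
    # Clamp with max() to guard against rounding just below the minimum.
    new_h = max(min_h, int(round(h * scale)))
    new_w = max(min_w, int(round(w * scale)))
    return new_h, new_w


print(conditional_upscale_size(500, 375, 224, 224))  # prints (500, 375)
print(conditional_upscale_size(112, 200, 224, 224))  # prints (224, 400)
```

Images that are already large enough pass through untouched, so the original quality is kept wherever possible.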

This is basically an issue with the example when you use the original images: the training pipeline has a RandomResizedCrop operation, so it is immune to varying input sizes, but the validation pipeline has nothing comparable. We should add ops.Resize to the validation pipeline.

I will do it.

ptrendx on 19 Jul 2018

After adding ops.Resize, my validation pipeline looks like this:

class HybridValPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_gpus):
        super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id)
        self.input = ops.MXNetReader(path = [db_folder+"val.rec"], index_path=[db_folder+"val.idx"],
                                     random_shuffle = False, shard_id = device_id, num_shards = num_gpus)
        self.decode = ops.nvJPEGDecoder(device = "mixed", output_type = types.RGB)
        self.rs = ops.Resize(device = "gpu", resize_a = 256, resize_b = 256)
        self.cmnp = ops.CropMirrorNormalize(device = "gpu",
                                            output_dtype = types.FLOAT,
                                            output_layout = types.NCHW,
                                            crop = (224, 224),
                                            image_type = types.RGB,
                                            mean = [0.485 * 255,0.456 * 255,0.406 * 255],
                                            std = [0.229 * 255,0.224 * 255,0.225 * 255])

    def define_graph(self):
        self.jpegs, self.labels = self.input(name = "Reader")
        images = self.decode(self.jpegs)
        images = self.rs(images)
        output = self.cmnp(images)
        return [output, self.labels]

Then I hit another error, this time from nvJPEG:

$ python mxnet-resnet50.py
Training pipeline epoch size: 1281167
Validation pipeline epoch size: 50000
INFO:root:start with arguments Namespace(batch_size=1024, benchmark=0, data_nthreads=40, data_train='/data/imagenet/train-480-val-256-recordio/train.rec', data_train_idx='', data_val='/data/imagenet/train-480-val-256-recordio/val.rec', data_val_idx='', disp_batches=100, dtype='float16', gc_threshold=0.5, gc_type='none', gpus='0, 1, 2, 3, 4, 5, 6, 7', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0.0, min_random_scale=0.533, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1', num_classes=1000, num_epochs=1, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[18:44:13] src/kvstore/././comm.h:690: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[18:44:13] src/kvstore/././comm.h:699: .vvvv...
[18:44:13] src/kvstore/././comm.h:699: v.vv.v..
[18:44:13] src/kvstore/././comm.h:699: vv.v..v.
[18:44:13] src/kvstore/././comm.h:699: vvv....v
[18:44:13] src/kvstore/././comm.h:699: v....vvv
[18:44:13] src/kvstore/././comm.h:699: .v..v.vv
[18:44:13] src/kvstore/././comm.h:699: ..v.vv.v
[18:44:13] src/kvstore/././comm.h:699: ...vvvv.
[18:44:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Error status: 9
Where: At /dvs/p4/build/sw/rel/gpgpu/toolkit/r9.0/nvJPEG/source/CodecJPEG.cpp:247
Message: NPP Runtime failure: '#-2'
What: Error in the NPP API call
Traceback (most recent call last):
  File "mxnet-resnet50.py", line 201, in <module>
    fit.fit(args, sym, get_dali_iter)
  File "/home/ubuntu/dali/docs/examples/mxnet/demo/common/fit.py", line 312, in fit
    monitor=monitor)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 519, in fit
    next_data_batch = next(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 92, in __next__
    outputs.append(p.run())
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 164, in run
    return self.outputs()
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 153, in outputs
    return self._pipe.Outputs()
RuntimeError: Critical error in pipeline: Error in thread 1: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg_decoder.h:318] NVJPEG error "6"
Current pipeline object is no longer valid.

How should I configure nvJPEG? I have downloaded it and added it to my $PATH.

hetong007 on 19 Jul 2018

It looks like a problem in nvJPEG itself. We will look into this.

JanuszL on 19 Jul 2018

@hetong007 - as I understand it, this crash is fully reproducible?
Are you still using the RecordIO file with no image resize? I see "train-480-val-256-recordio" in the log.
I cannot reproduce this on my side, and I see that you hit the crash quite early.

JanuszL on 24 Jul 2018

@hetong007 - we are now able to reproduce the NPP "#-2" error on our end. Our NVJPEG and NPP teams are looking into it. Thanks for the report!

cliffwoolley on 25 Jul 2018

This should now be fixed in NVJPEG 0.1.3, which was published today: https://developer.nvidia.com/nvjpeg. We will pick up NVJPEG 0.1.3 in later builds of DALI's pre-built binaries, but if you are building DALI from source you can switch to it now by updating your copy of NVJPEG.

cliffwoolley on 7 Aug 2018

Thanks for the update! I'm not building DALI from source. I will definitely try it out when a pre-built binary is available.

hetong007 on 7 Aug 2018
