I'm testing the mxnet-resnet50.ipynb example by converting it into a Python script.
I use a RecordIO file built with no image resize, to preserve the images at their original quality.
Then I got the following error:
$ python mxnet-resnet50.py
Training pipeline epoch size: 1281167
Validation pipeline epoch size: 50000
Traceback (most recent call last):
  File "mxnet-resnet50.py", line 127, in <module>
    dali_val_iter = DALIClassificationIterator(valpipes, valpipes[0].epoch_size("Reader"))
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 151, in __init__
    data_layout)
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 70, in __init__
    self._first_batch = self.next()
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 92, in __next__
    outputs.append(p.run())
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 164, in run
    return self.outputs()
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 153, in outputs
    return self._pipe.Outputs()
RuntimeError: Critical error in pipeline: [/opt/dali/dali/pipeline/operators/fused/crop_mirror_normalize.cu:346] Assert on "H >= crop_h_" failed
My guess is that some of the original images have an edge shorter than 224px, which makes this check fail. In MXNet we upscale the image before cropping if necessary: https://github.com/apache/incubator-mxnet/blob/master/src/io/image_aug_default.cc#L436
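One way to test this guess is to scan the image sizes for any that would trip the assertion. The following is only a sketch: it assumes the sizes are already available as (height, width) pairs, and it assumes the width is checked the same way as the height. `find_too_small` and the sample sizes are hypothetical and are not part of DALI or MXNet.

```python
def find_too_small(sizes, crop_h=224, crop_w=224):
    """Return the (height, width) pairs smaller than the crop in either dimension.

    Mirrors the failing `H >= crop_h_` check; the width test is an assumption.
    """
    return [(h, w) for (h, w) in sizes if h < crop_h or w < crop_w]


# Hypothetical sizes, for illustration only.
sizes = [(500, 375), (200, 300), (224, 224), (333, 150)]
too_small = find_too_small(sizes)
print(too_small)
```

If the returned list is non-empty for the validation set, a fixed 224x224 crop cannot be taken from those images without resizing them first.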
Hi,
We will look into this. One solution that comes to mind is a conditional resize operator, which would upscale an image only if it is smaller than the requested size.
Internally tracked DALI-152.
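The size arithmetic behind such a conditional resize could look roughly like the sketch below. It shows one possible rule (keep the aspect ratio and scale up by the smallest factor that makes both edges large enough), which is not necessarily what MXNet does or what DALI would implement. `conditional_upscale_size` is a hypothetical helper, not an existing DALI operator.

```python
def conditional_upscale_size(h, w, min_h=224, min_w=224):
    """Return (new_h, new_w) for an image of size (h, w).

    The size is unchanged if the image already covers the crop; otherwise it
    is scaled up, keeping the aspect ratio, until both edges are large enough.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    if h >= min_h and w >= min_w:
        return h, w
    if min_h * w >= min_w * h:
        # Height needs the larger scale factor (min_h / h).
        # -(-a // b) is ceiling division for positive integers.
        return min_h, -(-w * min_h // h)
    # Width needs the larger scale factor (min_w / w).
    return -(-h * min_w // w), min_w


print(conditional_upscale_size(500, 375))
print(conditional_upscale_size(200, 300))
print(conditional_upscale_size(333, 150))
```

Images that are already large enough pass through untouched, so only the undersized ones pay the cost of a resize.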
This is basically an issue with the example when you use the original images: the training pipeline has a RandomResizedCrop operation in it, so it is immune to different input sizes, but the validation pipeline has nothing equivalent. We should add ops.Resize to the validation pipeline there.
I will do it.
After adding ops.Resize, my validation pipeline looks like this:
class HybridValPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_gpus):
        super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id)
        self.input = ops.MXNetReader(path = [db_folder+"val.rec"], index_path=[db_folder+"val.idx"],
                                     random_shuffle = False, shard_id = device_id, num_shards = num_gpus)
        self.decode = ops.nvJPEGDecoder(device = "mixed", output_type = types.RGB)
        self.rs = ops.Resize(device = "gpu", resize_a = 256, resize_b = 256)
        self.cmnp = ops.CropMirrorNormalize(device = "gpu",
                                            output_dtype = types.FLOAT,
                                            output_layout = types.NCHW,
                                            crop = (224, 224),
                                            image_type = types.RGB,
                                            mean = [0.485 * 255,0.456 * 255,0.406 * 255],
                                            std = [0.229 * 255,0.224 * 255,0.225 * 255])
    def define_graph(self):
        self.jpegs, self.labels = self.input(name = "Reader")
        images = self.decode(self.jpegs)
        images = self.rs(images)
        output = self.cmnp(images)
        return [output, self.labels]
Then I hit another error, this time from nvJPEG:
$ python mxnet-resnet50.py
Training pipeline epoch size: 1281167
Validation pipeline epoch size: 50000
INFO:root:start with arguments Namespace(batch_size=1024, benchmark=0, data_nthreads=40, data_train='/data/imagenet/train-480-val-256-recordio/train.rec', data_train_idx='', data_val='/data/imagenet/train-480-val-256-recordio/val.rec', data_val_idx='', disp_batches=100, dtype='float16', gc_threshold=0.5, gc_type='none', gpus='0, 1, 2, 3, 4, 5, 6, 7', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0.0, min_random_scale=0.533, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1', num_classes=1000, num_epochs=1, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[18:44:13] src/kvstore/././comm.h:690: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[18:44:13] src/kvstore/././comm.h:699: .vvvv...
[18:44:13] src/kvstore/././comm.h:699: v.vv.v..
[18:44:13] src/kvstore/././comm.h:699: vv.v..v.
[18:44:13] src/kvstore/././comm.h:699: vvv....v
[18:44:13] src/kvstore/././comm.h:699: v....vvv
[18:44:13] src/kvstore/././comm.h:699: .v..v.vv
[18:44:13] src/kvstore/././comm.h:699: ..v.vv.v
[18:44:13] src/kvstore/././comm.h:699: ...vvvv.
[18:44:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Error status: 9
Where: At /dvs/p4/build/sw/rel/gpgpu/toolkit/r9.0/nvJPEG/source/CodecJPEG.cpp:247
Message: NPP Runtime failure: '#-2'
What: Error in the NPP API call
Traceback (most recent call last):
  File "mxnet-resnet50.py", line 201, in <module>
    fit.fit(args, sym, get_dali_iter)
  File "/home/ubuntu/dali/docs/examples/mxnet/demo/common/fit.py", line 312, in fit
    monitor=monitor)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 519, in fit
    next_data_batch = next(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/plugin/mxnet.py", line 92, in __next__
    outputs.append(p.run())
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 164, in run
    return self.outputs()
  File "/usr/local/lib/python2.7/dist-packages/nvidia/dali/pipeline.py", line 153, in outputs
    return self._pipe.Outputs()
RuntimeError: Critical error in pipeline: Error in thread 1: [/opt/dali/dali/pipeline/operators/decoder/nvjpeg_decoder.h:318] NVJPEG error "6"
Current pipeline object is no longer valid.
How should I configure nvJPEG? I have it downloaded and on my $PATH.
It looks like a problem in nvJPEG itself. We will look into this.
@hetong007 - as I understand it, this crash is fully reproducible?
Do you still use the RecordIO file with no image resize? I ask because I see "train-480-val-256-recordio" in the log.
I cannot reproduce this on my side, and I see that you hit the crash quite early.
@hetong007 - we are now able to reproduce the NPP "#-2" error on our end. Our NVJPEG and NPP teams are looking into it. Thanks for the report!
This should now be fixed in NVJPEG 0.1.3, which was published today: https://developer.nvidia.com/nvjpeg. We will pick up NVJPEG 0.1.3 in later builds of DALI's pre-built binaries, but if you're building DALI from source you can switch to it now by updating your copy of NVJPEG.
Thanks for the update! I'm not building DALI from source. I will definitely try it out when a pre-built binary is available.