Dali: Illegal Memory access

Created on 12 Jun 2019 · 20 Comments · Source: NVIDIA/DALI

When I set device_id = d (where d is the GPU index) in my dali_tf.DALIIterator() call, it throws:
[/opt/dali/dali/util/cuda_utils.h:69] CUDA runtime api error "an illegal memory access was encountered". It only works if device_id is set to 0.
I have two GPUs on this system, and I see the same issue on another system with 8 GPUs.

for d in range(DEVICES):
    with tf.device('/gpu:%i' % d):
        image = daliop(serialized_pipeline=serialized_pipes[d],
                       shapes=[(batch_size, sequence_length, 3, 224, 224)],
                       dtypes=[tf.int32],
                       device_id=0)  # setting device_id to d throws illegal memory access; only 0 works

My whole code :

import os

import tensorflow as tf
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import nvidia.dali.plugin.tf as dali_tf


class VideoReaderPipeline(Pipeline):
    def __init__(self, batch_size, sequence_length, num_threads, device_id, files, crop_size, num_gpus):
        super(VideoReaderPipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        # Each GPU reads its own shard of the file list.
        self.reader = ops.VideoReader(device="gpu", filenames=files, sequence_length=sequence_length, normalized=False,
                                      random_shuffle=False, image_type=types.RGB, dtype=types.UINT8, initial_fill=16,
                                      shard_id=device_id, num_shards=num_gpus)

        self.crop = ops.CropCastPermute(device="gpu", crop=crop_size, output_layout=types.NHWC, output_dtype=types.FLOAT)
        self.uniform = ops.Uniform(range=(0.0, 1.0))
        self.transpose = ops.Transpose(device="gpu", perm=[0, 3, 1, 2])

    def define_graph(self):
        input = self.reader(name="Reader")
        cropped = self.crop(input, crop_pos_x=self.uniform(), crop_pos_y=self.uniform())
        output = self.transpose(cropped)
        return output

def get_batch_test_dali(args, ds_type):
    batch_size = args.batchsize
    file_root = '/home/dl/base-app/DALI_old/docs/examples/video/superres_pytorch/data_dir/720p/scenes/val'
    sequence_length = 2  # args.frames
    crop_size = args.crop_size
    DEVICES = args.DEVICES

    container_files = os.listdir(file_root)
    container_files = [file_root + '/' + f for f in container_files]
    pipelines = [VideoReaderPipeline(batch_size=batch_size,
                                     sequence_length=sequence_length,
                                     num_threads=2,
                                     device_id=device_id,
                                     files=container_files,
                                     crop_size=crop_size,
                                     num_gpus=DEVICES) for device_id in range(DEVICES)]
    serialized_pipes = [pipe.serialize() for pipe in pipelines]
    del pipelines
    images = []
    daliop = dali_tf.DALIIterator()
    for d in range(DEVICES):
        with tf.device('/gpu:%i' % d):
            image = daliop(serialized_pipeline=serialized_pipes[d],
                           shapes=[(batch_size, sequence_length, 3, 224, 224)],
                           dtypes=[tf.int32],
                           device_id=0)  # <== error here: can't pass d instead of 0

            images.append(image)
    return images

bug

qorbanpour

All 20 comments

Hi,
I cannot reproduce that with DALI 0.10. However, we recently fixed a problem with CUDA context handling in DALI. Could you retest this with the latest nightly build (https://github.com/NVIDIA/DALI#nightly-and-weekly-release-channels)?

JanuszL on 13 Jun 2019

I retested with the latest nightly build and got the same error. I have no idea what is causing it, and I would appreciate any suggestions.

/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: TITAN V, pci bus id: 0000:65:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1

Dali: (Dali): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-13 11:58:42.965014: I tensorflow/core/common_runtime/placer.cc:1059] Dali: (Dali)/job:localhost/replica:0/task:0/device:GPU:0
Dali_1: (Dali): /job:localhost/replica:0/task:0/device:GPU:1
2019-06-13 11:58:42.965027: I tensorflow/core/common_runtime/placer.cc:1059] Dali_1: (Dali)/job:localhost/replica:0/task:0/device:GPU:1

terminate called after throwing an instance of 'dali::CUDAError'
terminate called recursively
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted

qorbanpour on 13 Jun 2019

Could you provide a full repro script with all the arguments you use to run it?

JanuszL on 13 Jun 2019

script.zip

I have provided a simple stub for calling the iterator, together with the arguments. The stub is also taken from the DALI GitHub repo.

On the other machine with 8 GPUs the error was slightly different: it complained about an invalid resource handle.

Thank you for your help.

qorbanpour on 13 Jun 2019

Hi,
Thanks. I managed to reproduce that problem. Let me look into it.

JanuszL on 14 Jun 2019

It should be fixed by https://github.com/NVIDIA/DALI/pull/978. Once that change is merged, please check whether the next nightly build works for you.

JanuszL on 14 Jun 2019

#978 is not merged yet. I guess the error you see is the same one, just reported in a different place (assuming you are still testing the multi-GPU scenario with the VideoReader).

JanuszL on 18 Jun 2019

If it still doesn't work with the most recent build, please reopen.

JanuszL on 5 Aug 2019

@JanuszL
I'm using release 0.12.0 and still sometimes get "an illegal memory access was encountered". It happens at random when I switch the folder from which data is loaded during training. When I disable the Jitter op the error never happens, so that may be the source of the error.
The full traceback:

Dataset changed.
Image size: 224
Batch size: 224
/home/zakirov/datasets/imagenet_2012/raw_data/292/train/
read 1281166 files from 1000 directories
/home/zakirov/datasets/imagenet_2012/raw_data/292/validation/
read 50000 files from 1000 directories
Traceback (most recent call last):
  File "train.py", line 512, in <module>
    main()
  File "train.py", line 191, in main
    dm.set_epoch(epoch)
  File "train.py", line 349, in set_epoch
    self._set_data(cur_phase)
  File "train.py", line 354, in _set_data
    self.trn_dl, self.val_dl = self._load_data(**phase)
  File "train.py", line 382, in _load_data
    device_id=args.gpu, train=False, **kwargs)
  File "/home/zakirov/repoz/imagenet18/modules/dali_dataloader.py", line 99, in get_loader
    pipe.build()
  File "/home/zakirov/.local/lib/python3.5/site-packages/nvidia/dali/pipeline.py", line 231, in build
    self._pipe.Build(self._names_and_devices)
RuntimeError: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
terminate called after throwing an instance of 'dali::CUDAError'
  what():  CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)

The code is available here

bonlime on 15 Aug 2019

I tried tracing this bug down using CUDA_LAUNCH_BLOCKING=1, but it never happens in this case.
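For anyone repeating this experiment from inside Python rather than from the shell: a minimal sketch, assuming the variable has to be in the environment before the first CUDA call in the process, so it is set at the very top of the script before torch or nvidia.dali are imported. The helper name enable_blocking_launches is hypothetical, not part of any library.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Ask CUDA to run kernel launches synchronously (hypothetical helper)."""
    # Assumption: this must happen before the first CUDA call in the process.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


enable_blocking_launches()
# Import torch / nvidia.dali only after this point.
```

This should be equivalent to launching with `CUDA_LAUNCH_BLOCKING=1 python train.py`. Synchronous launches change timing, so a timing-dependent failure may stop reproducing, which would be consistent with the observation above.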

bonlime on 15 Aug 2019

Sometimes the traceback is different:

THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=211 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 512, in <module>
    main()
  File "train.py", line 193, in main
    train(dm.trn_dl, model, criterion, optimizer, scheduler, epoch)
  File "train.py", line 233, in train
    torch.cuda.synchronize()
  File "/home/zakirov/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 365, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/torch/csrc/cuda/Module.cpp:211
terminate called after throwing an instance of 'dali::CUDAError'
  what():  CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)

bonlime on 15 Aug 2019

The issue gets resolved if I explicitly delete old dataloaders and call torch.cuda.empty_cache() before creating new ones.
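A minimal sketch of that workaround, assuming the old loaders live as attributes on some owner object (trn_dl and val_dl are the names visible in the traceback above; release_loaders is a hypothetical helper). The torch import is guarded so the snippet also runs on a machine without PyTorch.

```python
import gc


def release_loaders(owner, attr_names=("trn_dl", "val_dl")):
    """Drop old loader references, then free cached GPU memory if torch is present."""
    for name in attr_names:
        if hasattr(owner, name):
            delattr(owner, name)
    # Force collection so the old pipelines are destroyed before new ones are built.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False  # torch not installed; only the references were dropped
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return True
```

It would be called right before building the new pipelines, for example `release_loaders(dm)` ahead of the `_load_data` call. It only helps if no other reference to the old loaders is still alive elsewhere.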

bonlime on 16 Aug 2019

Hi @bonlime,

Could you provide an example of a pipeline that crashes? Since you are using the Jitter op, I guess it is no longer a video pipeline. Different backtraces are usually the result of CUDA errors being caught during synchronization, and DALI has random operators, so the error can sometimes appear at random as well.

JanuszL on 17 Aug 2019

Hi @JanuszL
I am also facing the issue mentioned by @bonlime above when using the Jitter operator.
PyTorch and DALI versions being used:

pytorch                   1.6.0.dev20200413       py3.7_cuda10.1.243_cudnn7.6.3_0  
nvidia-dali               0.20.0                            pypi_0

DALI pipeline
Simply removing the Jitter operator from this pipeline fixes the issue

class ExternalSourcePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data, device_type="gpu", training=True):
        super(ExternalSourcePipeline, self).__init__(batch_size, num_threads, device_id, seed=34,prefetch_queue_depth={ "cpu_size": 10, "gpu_size": 2})
        self.input = nvidia_ops.ExternalSource()
        self.input_label = nvidia_ops.ExternalSource()
        self.decode = nvidia_ops.ImageDecoder(device="mixed" if device_type=="gpu" else "cpu", output_type=nvidia_types.RGB)
        self.training = training
        self.crop_loc = nvidia_ops.Uniform(range=(0.,1.))
        self.coin = nvidia_ops.CoinFlip(probability=0.5)
        self.resize = nvidia_ops.Resize(device=device_type, resize_shorter=256)
        self.crop_mirror_normalize = nvidia_ops.CropMirrorNormalize(device=device_type, crop=(224,224),mean=128,std=128,output_layout='HWC')
        self.jitter = nvidia_ops.Jitter(device="gpu", nDegree=2)
        self.transpose = nvidia_ops.Transpose(device="gpu",perm=(2,0,1))
        self.cast = nvidia_ops.Cast(device="gpu", dtype=nvidia_types.FLOAT)
        self.external_data = external_data
        self.iterator = iter(self.external_data)
        self.device_type = device_type

    def training_data_augmentation(self, images):
        images = self.crop_mirror_normalize(images, crop_pos_x=self.crop_loc(), crop_pos_y=self.crop_loc(),
                                            mirror=self.coin())
        if self.device_type!="gpu":
            images = images.gpu()
        images = self.jitter(images)
        return images

    def validation_data_augmentation(self, images):
        images = self.crop_mirror_normalize(images)
        if self.device_type!="gpu":
            images = images.gpu()
        return images

    def define_graph(self):
        self.jpegs = self.input()
        self.labels = self.input_label()
        images = self.decode(self.jpegs)
        images = self.resize(images)
        if self.training:
            images = self.training_data_augmentation(images)
        else:
            images = self.validation_data_augmentation(images)
        images = self.transpose(images)
        output = self.cast(images)
        return (output, self.labels)

    def iter_setup(self):
        try:
            (images, labels) = self.iterator.next()
            self.feed_input(self.jpegs, images)
            self.feed_input(self.labels, labels)
        except StopIteration:
            self.iterator = iter(self.external_data)
            raise StopIteration

Error Message

139900145878784 Exception in thread: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Traceback (most recent call last):
  File "training.py", line 189, in <module>
    main()
  File "training.py", line 185, in main
    trainer(0,args)
  File "training.py", line 53, in trainer
    train_loader, train_loader_len = create_data_loader(gpu, args, 'known_train_dataset.csv', batch_size=args.batch_size, shuffle_samples=True)
  File "training.py", line 40, in create_data_loader
    loader = PyTorchIterator(data_pipeline, size=external_iterator.size, last_batch_padded=True, fill_last_batch=False)
  File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 360, in __init__
    last_batch_padded = last_batch_padded)
  File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 162, in __init__
    self._first_batch = self.next()
  File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 259, in next
    return self.__next__()
  File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 212, in __next__
    device=category_device[category])
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'dali::CUDAError'
  what():  CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)

akshay-raj-dhamija on 14 Apr 2020

Hi,
I ran the following code with DALI built from the master branch:

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

image_dir = "/data/imagenet/train-jpeg/"
batch_size = 8

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
        self.input = ops.FileReader(file_root = image_dir)
        self.decode = ops.ImageDecoder(device = 'mixed', output_type = types.RGB)
        self.jitter = ops.Jitter(device="gpu", nDegree=2)
        self.crop_mirror_normalize = ops.CropMirrorNormalize(device="gpu", crop=(224,224),mean=128,std=128,output_layout='HWC')
        self.crop_loc = ops.Uniform(range=(0.,1.))
        self.coin = ops.CoinFlip(probability=0.5)
        self.resize = ops.Resize(
            device="gpu",
            resize_x=240,
            resize_y=240,
            min_filter=types.DALIInterpType.INTERP_TRIANGULAR)

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        images = self.resize(images)
        images = self.crop_mirror_normalize(images, crop_pos_x=self.crop_loc(), crop_pos_y=self.crop_loc(),
                                            mirror=self.coin())
        img = self.jitter(images)
        return (images, img)

pipe = SimplePipeline(batch_size, 4, 0)
pipe.build()
i = 0
while True:
    pipe.run()
    # Print progress every 100 iterations.
    if i % 100 == 0:
        print(i)
    i += 1

on ImageNet, and it works fine. I think I'm missing something from your setup.
Can you rework your code into something self-contained that I can just run, according to this guide? If you can share or point to any data that makes this problem reproducible, it would make debugging easier. Also, please recheck on the latest DALI version.

JanuszL on 14 Apr 2020

Hi, I hit the same error with the latest version, 0.20.0, when batch_size was 256, but I couldn't reproduce it with a smaller batch_size such as 128.

test_dali.zip

The script is attached; please replace train_dir with the standard ImageNet train directory.
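Since the failure seems to depend on batch size, one way to narrow it down is to sweep batch sizes with each trial in its own process, because an illegal memory access aborts the whole process. A minimal sketch, assuming the repro can be launched from the command line with the batch size as a parameter; first_failing_batch_size and fake_cmd are hypothetical names, and fake_cmd is only a stand-in that fails above 200 so the snippet runs without a GPU.

```python
import subprocess
import sys


def first_failing_batch_size(make_cmd, sizes):
    """Run one child process per batch size; return the first size whose process fails."""
    for size in sizes:
        # Isolate each trial: a CUDA illegal-address error kills the process it occurs in.
        result = subprocess.run(make_cmd(size),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            return size
    return None


def fake_cmd(size):
    # Stand-in for illustration only: exits non-zero for sizes above 200.
    # In practice this would build the command line for the real repro script.
    return [sys.executable, "-c", "import sys; sys.exit(int(%d > 200))" % size]


print(first_failing_batch_size(fake_cmd, [64, 128, 192, 256]))
```

Because the error is reported as intermittent elsewhere in this thread, a single passing run at a given size does not prove that size is safe; repeating each trial a few times would give a more reliable threshold.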