When I set device_id = d (where d is the GPU index) in my dali_tf.DALIIterator() call, it throws:
[/opt/dali/dali/util/cuda_utils.h:69] CUDA runtime api error "an illegal memory access was encountered". It only works if device_id is set to 0.
I have two GPUs on my system, and I see the same issue on another system with 8 GPUs.
for d in range(DEVICES):
    with tf.device('/gpu:%i' % d):
        image = daliop(serialized_pipeline=serialized_pipes[d],
                       shapes=[(batch_size, sequence_length, 3, 224, 224)],
                       dtypes=[tf.int32],
                       device_id=0)  # setting device_id to d throws illegal memory access; only 0 works
My whole code:
class VideoReaderPipeline(Pipeline):
    def __init__(self, batch_size, sequence_length, num_threads, device_id, files, crop_size, num_gpus):
        super(VideoReaderPipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.reader = ops.VideoReader(device="gpu", filenames=files, sequence_length=sequence_length, normalized=False,
                                      random_shuffle=False, image_type=types.RGB, dtype=types.UINT8, initial_fill=16,
                                      shard_id=device_id, num_shards=num_gpus)
        self.crop = ops.CropCastPermute(device="gpu", crop=crop_size, output_layout=types.NHWC, output_dtype=types.FLOAT)
        self.uniform = ops.Uniform(range=(0.0, 1.0))
        self.transpose = ops.Transpose(device="gpu", perm=[0, 3, 1, 2])
    def define_graph(self):
        input = self.reader(name="Reader")
        cropped = self.crop(input, crop_pos_x=self.uniform(), crop_pos_y=self.uniform())
        output = self.transpose(cropped)
        return output
def get_batch_test_dali(args, ds_type):
    batch_size = args.batchsize
    file_root = '/home/dl/base-app/DALI_old/docs/examples/video/superres_pytorch/data_dir/720p/scenes/val'
    sequence_length = 2  # args.frames
    crop_size = args.crop_size
    DEVICES = args.DEVICES
    container_files = os.listdir(file_root)
    container_files = [file_root + '/' + f for f in container_files]
    pipelines = [VideoReaderPipeline(batch_size=batch_size,
                                     sequence_length=sequence_length,
                                     num_threads=2,
                                     device_id=device_id,
                                     files=container_files,
                                     crop_size=crop_size,
                                     num_gpus=DEVICES) for device_id in range(DEVICES)]
    serialized_pipes = [pipe.serialize() for pipe in pipelines]
    del pipelines
    images = []
    daliop = dali_tf.DALIIterator()
    for d in range(DEVICES):
        with tf.device('/gpu:%i' % d):
            image = daliop(serialized_pipeline=serialized_pipes[d],
                           shapes=[(batch_size, sequence_length, 3, 224, 224)],
                           dtypes=[tf.int32],
                           device_id=0)  # <== error here: can't pass d instead of 0
            images.append(image)
    return images
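The intent of the loop above is for each GPU to get its own pipeline, TensorFlow device scope and device_id, all using the same index. A minimal sketch of that bookkeeping in plain Python follows; it makes no DALI or TensorFlow calls, and the helper name and dict keys are hypothetical names chosen for illustration only.

```python
def per_gpu_settings(num_gpus, batch_size, sequence_length, crop_hw=(224, 224)):
    """Return one settings dict per GPU, with matching device indices.

    The dict keys are hypothetical names chosen for this sketch.
    """
    settings = []
    for d in range(num_gpus):
        settings.append({
            "tf_device": "/gpu:%i" % d,  # TensorFlow device scope string
            "device_id": d,              # index intended for both the pipeline and the iterator op
            "shard_id": d,               # each GPU reads its own shard
            "num_shards": num_gpus,
            "shape": (batch_size, sequence_length, 3, crop_hw[0], crop_hw[1]),
        })
    return settings


if __name__ == "__main__":
    for s in per_gpu_settings(num_gpus=2, batch_size=4, sequence_length=2):
        print(s)
```

Keeping the three indices in one place makes it easy to spot the case reported here, where the iterator op is pinned to 0 while the pipeline and device scope use d.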
Hi,
I cannot reproduce that with DALI 0.10. However, we recently fixed a problem with CUDA context handling in DALI. Could you retest with the latest nightly build (https://github.com/NVIDIA/DALI#nightly-and-weekly-release-channels)?
I retested with the latest nightly build and got the same error. I have no idea what is causing it, and I would appreciate any suggestions.
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: TITAN V, pci bus id: 0000:65:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1
Dali: (Dali): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-13 11:58:42.965014: I tensorflow/core/common_runtime/placer.cc:1059] Dali: (Dali)/job:localhost/replica:0/task:0/device:GPU:0
Dali_1: (Dali): /job:localhost/replica:0/task:0/device:GPU:1
2019-06-13 11:58:42.965027: I tensorflow/core/common_runtime/placer.cc:1059] Dali_1: (Dali)/job:localhost/replica:0/task:0/device:GPU:1
terminate called after throwing an instance of 'dali::CUDAError'
terminate called recursively
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted
Could you provide a full repro script with all the arguments you are using to run it?
I have provided a simple stub for calling the iterator, with the arguments included; the stub is taken from the DALI GitHub repo.
On the other machine with 8 GPUs the error was slightly different: it complained about an invalid resource handle.
Thank you for your help.
Hi,
Thanks. I managed to reproduce that problem. Let me look into it.
It should be fixed by https://github.com/NVIDIA/DALI/pull/978. Please check whether the nightly build that follows the merge of that change works for you.
If it still doesn't work with the most recent build, please reopen.
@JanuszL
I'm using release 0.12.0 and still sometimes get "an illegal memory access was encountered". It just randomly happens sometimes when I try to switch the folder from which to load data during training. When I disable Jitter ops this error never happens, so maybe it's the source of the error.
The full traceback:
Dataset changed.
Image size: 224
Batch size: 224
/home/zakirov/datasets/imagenet_2012/raw_data/292/train/
read 1281166 files from 1000 directories
/home/zakirov/datasets/imagenet_2012/raw_data/292/validation/
read 50000 files from 1000 directories
Traceback (most recent call last):
File "train.py", line 512, in <module>
main()
File "train.py", line 191, in main
dm.set_epoch(epoch)
File "train.py", line 349, in set_epoch
self._set_data(cur_phase)
File "train.py", line 354, in _set_data
self.trn_dl, self.val_dl = self._load_data(**phase)
File "train.py", line 382, in _load_data
device_id=args.gpu, train=False, **kwargs)
File "/home/zakirov/repoz/imagenet18/modules/dali_dataloader.py", line 99, in get_loader
pipe.build()
File "/home/zakirov/.local/lib/python3.5/site-packages/nvidia/dali/pipeline.py", line 231, in build
self._pipe.Build(self._names_and_devices)
RuntimeError: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)
The code is available here
I tried tracing this bug down using CUDA_LAUNCH_BLOCKING=1, but it never happens in this case.
Sometimes the traceback is different:
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=211 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
File "train.py", line 512, in <module>
main()
File "train.py", line 193, in main
train(dm.trn_dl, model, criterion, optimizer, scheduler, epoch)
File "train.py", line 233, in train
torch.cuda.synchronize()
File "/home/zakirov/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 365, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/torch/csrc/cuda/Module.cpp:211
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)
The issue gets resolved if I explicitly delete old dataloaders and call torch.cuda.empty_cache() before creating new ones.
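A minimal sketch of that workaround, assuming the loaders are held by a single owner object: drop the old references, force garbage collection, empty PyTorch's CUDA cache, and only then build the replacements. The LoaderHolder class and make_loaders callable are hypothetical stand-ins for the training script's own code; torch.cuda.empty_cache() is the only PyTorch call used, and it is skipped when PyTorch or a GPU is not available.

```python
import gc


def empty_cuda_cache():
    """Call torch.cuda.empty_cache() if PyTorch and a GPU are available."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


class LoaderHolder:
    """Hypothetical owner of the current train/validation loaders."""

    def __init__(self, make_loaders):
        self._make_loaders = make_loaders  # callable returning (train_loader, val_loader)
        self.trn_dl = None
        self.val_dl = None

    def switch(self, **phase):
        # Drop the only references to the old loaders first ...
        self.trn_dl = None
        self.val_dl = None
        # ... then force collection and release cached GPU memory ...
        gc.collect()
        empty_cuda_cache()
        # ... and only then build the replacements.
        self.trn_dl, self.val_dl = self._make_loaders(**phase)
```

The ordering is the point of the sketch: if the new loaders are created while the old ones are still referenced, both sets of pipelines are alive on the GPU at the same time.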
Hi @bonlime,
Could you provide an example of a pipeline that crashes? Since you are using the Jitter op, I guess it is no longer a video pipeline. Different backtraces usually result from CUDA errors being caught at synchronization points, and since DALI has random operators, the error can also appear at random.
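Because CUDA kernels launch asynchronously, an error is reported by whichever later call happens to synchronize, which is why the backtrace moves around. The CUDA_LAUNCH_BLOCKING=1 environment variable makes launches synchronous so that the failing call is the one that raises. A minimal sketch for running a script that way from Python follows; the helper names are hypothetical, and the script path in the usage note is a placeholder.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script_args):
    """Run a Python script in a child process with CUDA_LAUNCH_BLOCKING=1."""
    return subprocess.run([sys.executable] + list(script_args), env=blocking_env())
```

For example, run_blocking(["train.py"]) would launch the training script with the variable set before any CUDA context exists. Synchronous launches also change timing, so a race-dependent failure may stop reproducing in this mode, as reported earlier in this thread.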
Hi @JanuszL
I am also facing this issue mentioned by @bonlime above when using the Jitter operator.
Pytorch & DALI versions being used:
pytorch 1.6.0.dev20200413 py3.7_cuda10.1.243_cudnn7.6.3_0
nvidia-dali 0.20.0 pypi_0
DALI pipeline
Simply removing the Jitter operator from this pipeline fixes the issue
class ExternalSourcePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data, device_type="gpu", training=True):
        super(ExternalSourcePipeline, self).__init__(batch_size, num_threads, device_id, seed=34,
                                                     prefetch_queue_depth={"cpu_size": 10, "gpu_size": 2})
        self.input = nvidia_ops.ExternalSource()
        self.input_label = nvidia_ops.ExternalSource()
        self.decode = nvidia_ops.ImageDecoder(device="mixed" if device_type == "gpu" else "cpu", output_type=nvidia_types.RGB)
        self.training = training
        self.crop_loc = nvidia_ops.Uniform(range=(0., 1.))
        self.coin = nvidia_ops.CoinFlip(probability=0.5)
        self.resize = nvidia_ops.Resize(device=device_type, resize_shorter=256)
        self.crop_mirror_normalize = nvidia_ops.CropMirrorNormalize(device=device_type, crop=(224, 224), mean=128, std=128, output_layout='HWC')
        self.jitter = nvidia_ops.Jitter(device="gpu", nDegree=2)
        self.transpose = nvidia_ops.Transpose(device="gpu", perm=(2, 0, 1))
        self.cast = nvidia_ops.Cast(device="gpu", dtype=nvidia_types.FLOAT)
        self.external_data = external_data
        self.iterator = iter(self.external_data)
        self.device_type = device_type
    def training_data_augmentation(self, images):
        images = self.crop_mirror_normalize(images, crop_pos_x=self.crop_loc(), crop_pos_y=self.crop_loc(),
                                            mirror=self.coin())
        if self.device_type != "gpu":
            images = images.gpu()
        images = self.jitter(images)
        return images
    def validation_data_augmentation(self, images):
        images = self.crop_mirror_normalize(images)
        if self.device_type != "gpu":
            images = images.gpu()
        return images
    def define_graph(self):
        self.jpegs = self.input()
        self.labels = self.input_label()
        images = self.decode(self.jpegs)
        images = self.resize(images)
        if self.training:
            images = self.training_data_augmentation(images)
        else:
            images = self.validation_data_augmentation(images)
        images = self.transpose(images)
        output = self.cast(images)
        return (output, self.labels)
    def iter_setup(self):
        try:
            (images, labels) = self.iterator.next()
            self.feed_input(self.jpegs, images)
            self.feed_input(self.labels, labels)
        except StopIteration:
            self.iterator = iter(self.external_data)
            raise StopIteration
Error Message
139900145878784 Exception in thread: CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Traceback (most recent call last):
File "training.py", line 189, in <module>
main()
File "training.py", line 185, in main
trainer(0,args)
File "training.py", line 53, in trainer
train_loader, train_loader_len = create_data_loader(gpu, args, 'known_train_dataset.csv', batch_size=args.batch_size, shuffle_samples=True)
File "training.py", line 40, in create_data_loader
loader = PyTorchIterator(data_pipeline, size=external_iterator.size, last_batch_padded=True, fill_last_batch=False)
File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 360, in __init__
last_batch_padded = last_batch_padded)
File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 162, in __init__
self._first_batch = self.next()
File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 259, in next
return self.__next__()
File "/home/adhamija/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 212, in __next__
device=category_device[category])
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (77):
an illegal memory access was encountered
Aborted (core dumped)
Hi,
I have run the following code with a DALI build from the master branch:
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
image_dir = "/data/imagenet/train-jpeg/"
batch_size = 8
class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.input = ops.FileReader(file_root=image_dir)
        self.decode = ops.ImageDecoder(device='mixed', output_type=types.RGB)
        self.jitter = ops.Jitter(device="gpu", nDegree=2)
        self.crop_mirror_normalize = ops.CropMirrorNormalize(device="gpu", crop=(224, 224), mean=128, std=128, output_layout='HWC')
        self.crop_loc = ops.Uniform(range=(0., 1.))
        self.coin = ops.CoinFlip(probability=0.5)
        self.resize = ops.Resize(
            device="gpu",
            resize_x=240,
            resize_y=240,
            min_filter=types.DALIInterpType.INTERP_TRIANGULAR)
    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        images = self.resize(images)
        images = self.crop_mirror_normalize(images, crop_pos_x=self.crop_loc(), crop_pos_y=self.crop_loc(),
                                            mirror=self.coin())
        img = self.jitter(images)
        return (images, img)
pipe = SimplePipeline(batch_size, 4, 0)
pipe.build()
i = 0
while 1:
    pipe.run()
    if i % 100 == 0:  # print progress every 100 iterations
        print(i)
    i += 1
on ImageNet and it works fine. I think I'm missing something from your setup.
Could you rework your code into something self-contained that I can just run, according to this guide? If you can share or point to any data that makes this problem reproducible, it would make debugging easier. Also, please recheck with the latest DALI version.
Hi, I hit the same error with the latest version, 0.20.0, when batch_size was 256, but I couldn't reproduce it with a smaller batch_size such as 128.
The script is attached; please replace train_dir with the standard ImageNet train directory.
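Since the crash depends on batch size and aborts the whole process, one way to narrow it down is to run the repro once per batch size in a child process and stop at the first non-zero exit. A minimal sketch under those assumptions follows; the function name is hypothetical, and make_cmd is a caller-supplied callable that builds the command line for a given batch size.

```python
import subprocess
import sys


def first_failing_batch_size(make_cmd, batch_sizes):
    """Run one child process per batch size; return the first (size, returncode) that fails."""
    for bs in batch_sizes:
        result = subprocess.run(make_cmd(bs))
        if result.returncode != 0:
            return bs, result.returncode
    return None
```

A call might look like first_failing_batch_size(lambda bs: [sys.executable, "repro.py", "--batch-size", str(bs)], [64, 128, 256]), where repro.py and its --batch-size flag are placeholders for the attached script. On POSIX systems a process killed by a signal, such as the abort seen here, reports a negative return code, so the non-zero check covers it.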
Hi,
I have reproduced the problem. Will investigate it further.
https://github.com/NVIDIA/DALI/pull/1914 should fix the problem; batch size was the key to reproducing this.
@JanuszL Great! I really appreciate your quick fix.
0.22 is available with the fix.