Pytorch: RuntimeError: CUDA error: an illegal memory access was encountered

Created on 15 Jun 2019 · 103 comments · Source: pytorch/pytorch

Hi, everyone!
I met a strange illegal memory access error. It happens randomly without any regular pattern.
The code is really simple. It is PointNet for point cloud segmentation. I don't think there is anything wrong in the code.

import torch
import torch.nn as nn
import torch.nn.functional as F
import os
class InstanceSeg(nn.Module):
    def __init__(self, num_points=1024):
        super(InstanceSeg, self).__init__()

        self.num_points = num_points

        self.conv1 = nn.Conv1d(9, 64, 1)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.conv3 = nn.Conv1d(64, 64, 1)
        self.conv4 = nn.Conv1d(64, 128, 1)
        self.conv5 = nn.Conv1d(128, 1024, 1)
        self.conv6 = nn.Conv1d(1088, 512, 1)
        self.conv7 = nn.Conv1d(512, 256, 1)
        self.conv8 = nn.Conv1d(256, 128, 1)
        self.conv9 = nn.Conv1d(128, 128, 1)
        self.conv10 = nn.Conv1d(128, 2, 1)
        self.max_pool = nn.MaxPool1d(num_points)

    def forward(self, x):
        batch_size = x.size()[0] # (x has shape (batch_size, 9, num_points))

        out = F.relu(self.conv1(x)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv2(out)) # (shape: (batch_size, 64, num_points))
        point_features = out

        out = F.relu(self.conv3(out)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv4(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
        global_feature = self.max_pool(out) # (shape: (batch_size, 1024, 1))

        global_feature_repeated = global_feature.repeat(1, 1, self.num_points) # (shape: (batch_size, 1024, num_points))
        out = torch.cat([global_feature_repeated, point_features], 1) # (shape: (batch_size, 1024+64=1088, num_points))

        out = F.relu(self.conv6(out)) # (shape: (batch_size, 512, num_points))
        out = F.relu(self.conv7(out)) # (shape: (batch_size, 256, num_points))
        out = F.relu(self.conv8(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv9(out)) # (shape: (batch_size, 128, num_points))

        out = self.conv10(out) # (shape: (batch_size, 2, num_points))

        out = out.transpose(2,1).contiguous() # (shape: (batch_size, num_points, 2))
        out = F.log_softmax(out.view(-1, 2), dim=1) # (shape: (batch_size*num_points, 2))
        out = out.view(batch_size, self.num_points, 2) # (shape: (batch_size, num_points, 2))

        return out

Num = 0
network = InstanceSeg()
network.cuda()
while(1):

    input0 = torch.randn(32, 3, 1024).cuda()
    input1 = torch.randn(32, 3, 1024).cuda()
    input2 = torch.randn(32, 3, 1024).cuda()
    input = torch.cat((input0, input1, input2), 1)

    out = network(input)
    Num = Num+1
    print(Num)

After a random number of steps, the error is raised. The error report is:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 58, in <module>
    input0 = torch.randn(32, 3, 1024).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

When I added "os.environ['CUDA_LAUNCH_BLOCKING'] = '1'" at the top of this script, the error report was changed to this

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 64, in <module>
    out = network(input)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 35, in forward
    out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

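For reference, a minimal sketch of how launch blocking can be enabled; the variable must be set before the first CUDA call in the process, otherwise it has no effect:

import os
# Force synchronous kernel launches so the Python traceback points at the op that actually failed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after setting the variable, before any CUDA work

The same effect can be had from the shell with CUDA_LAUNCH_BLOCKING=1 python frustum_pointnet.py.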
I know that incorrect indexing operations and some misuses of loss functions can lead to illegal memory access errors, but there is no such operation in this script.
I am quite sure this error is not caused by running out of memory, since only about 2 GB of GPU memory is used and I have 12 GB in total.

This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
GPU: Titan XP
Driver Version: 410.93
Python Version: 3.6
cuda Version: cuda_9.0.176_384.81_linux
cudnn Version: cudnn-9.0-linux-x64-v7.4.2.24
pytorch Version: pytorch-1.0.1-py3.6_cuda9.0.176_cudnn7.4.2_2

I have been stuck here for a long time.
In fact, it is not only this project: many other projects hit a similar error on my machine.
I don't think there is anything wrong with the code, since it runs correctly for some steps. Maybe the error is caused by the environment; I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestions.

Labels: cuda, triaged

Most helpful comment

@jzazo
Hi, I had a similar problem.
If I use device = torch.device("cuda:1"), I always get a RuntimeError: CUDA error: an illegal memory access was encountered error.

But when I set a specific GPU with torch.cuda.set_device(1), everything is fine.

All 103 comments

Could be the same cudnn bug fixed in 7.6. See https://github.com/pytorch/pytorch/issues/16831. Could you try pytorch 1.1?

@SsnL Thanks for your reply. I will do more trials and post the results here. This is really a weird error and very hard to debug.

@SsnL I updated the environment to pytorch 1.1, cuda 10.0, cudnn 7.6, but this error still happens.

Can't repro with pytorch 1.1/cuda10/cudnn7.6 after more than 5000 iterations (both V100 and P100, P100 should be similar to TitanXP).

Still having this problem

@zhixuanli are you seeing the same error using the latest PyTorch release (1.3.0)?
Could you post the setup you are using, so that we could try to reproduce this issue, since we weren't able to do so until now.

I met the same problem with a 2080 Ti. Reducing the batch size from 2 to 1 and reducing the number of gt boxes per image didn't help.
This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
GPU: 2080ti
Driver Version: 418.67
Python Version: 3.7
cuda Version: 10.1
cudnn Version: 7
pytorch Version: torch-1.1.0, torchvision-0.2.0

@ptrblck I tried PyTorch 1.3.0 and am still having the same problem.
Train log:

out of memory
invalid argument
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
File "tools/train_net.py", line 174, in
main()
File "tools/train_net.py", line 167, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 73, in train
arguments,
File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/engine/trainer.py", line 68, in do_train
loss_dict = model(images, targets)
File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(input, *kwargs)
File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(inputs[0], *kwargs[0])
File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(input, *kwargs)
File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(input, *kwargs)
File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 136, in forward
return self._forward_train(anchors, box_cls, box_regression, targets)
File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 143, in _forward_train
anchors, box_cls, box_regression, targets
File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/loss.py", line 172, in __call__
match_quality_matrix = boxlist_iou(targets_, anchors_)
File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/structures/rboxlist_ops.py", line 167, in boxlist_iou
overlaps_th = torch.tensor(overlaps).to(boxlist1.bbox.device) #[N, M]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb1e9515813 in /home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)

Setting CUDA_LAUNCH_BLOCKING to 1 didn't help.

Is this problem related to this one?
I am on Ubuntu 18.04, and I have tried pytorch 1.1.0, 1.2.0, 1.3.0 and CUDA 9.2, 10.0, 10.1 with Python 3.7.4 within a conda installation. The nvidia driver I am currently using is 440.26, but I have tried a bunch of them as well, none working.

In my case, I get the RuntimeError: CUDA error: an illegal memory access was encountered message when I run my code on gpu 1, but it runs fine on gpu 0:

gpu=1
device = torch.device(f"cuda:{gpu}" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(device)

Any ideas on how to try debug this?

@jzazo
Hi, I had a similar problem.
If I use device = torch.device("cuda:1"), I always get a RuntimeError: CUDA error: an illegal memory access was encountered error.

But when I set a specific GPU with torch.cuda.set_device(1), everything is fine.

I'm getting this error as well, but it seems to depend on my batch size. I don't encounter it on smaller batch sizes.
pytorch v 1.3.1 on a V100

@heiyuxiaokai
The first output points to an "out of memory" error.
Could you lower the batch size and rerun your code again?
Are you using the code snippet from the first post or another one?

@jzazo
The original script does not use apex, so this issue should be unrelated.

@kouohhashi @dan-nadler
Are you using the script from the first post or another one?

I still cannot reproduce the error for more than 20k iterations, so I would need (another) code snippet to reproduce this issue.

@ptrblck I am using a different script. Keeping the batch size down and moving the operations into functions seems to have solved it, though I'm staying around 80% GPU memory utilization. I had a handful of issues, though, so I'm not quite sure which change addressed which problem.

I tried this MNIST example.

I added the following lines at the beginning of script:

gpu = 1
device = torch.device(gpu if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(gpu)

and device = torch.device(gpu if use_cuda else "cpu") in the main function.
I get the following error: RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1570710718161/work/aten/src/THC/THCGeneral.cpp:216

It's a different error than what I was getting in my own script, but still the simple example does not run on gpu=1, but it does work on gpu=0.

I just remembered I followed this guide to move Xorg from the discrete GPU to Intel's integrated chip. Could this change be responsible for this strange behavior?
I will undo the change and report back the outcome.

I did the rollback and it didn't fix the issue. I once more removed nvidia drivers, installed them and cuda again, and I still get the error. I don't know how to find the source of the problem.

@dan-nadler the peak memory usage might have caused the OOM issue.

@jzazo I cannot reproduce this issue by adding your provided code to the MNIST example on an 8 GPU system (rerunning with different GPU ids).

What GPU are you using as GPU1? If it's the intel integrated chip, this won't work.
You would need a GPU which can execute CUDA code.

I have the Intel integrated card and 2x GTX 1080 Ti in an Ubuntu 18.04 system. When I get some time I will try to narrow down the problem. I don't have a clue what's causing it.

Have you solved this problem? I met the same one recently. I can run the code correctly on one machine, but the bug arises on my own computer, even though the two machines have the same 2080 Ti card with the same driver and the same conda environment. @xiaoxiangyeyuwangye

Same problem. Ubuntu 16.04, 2080 Ti, Driver Version: 440.33.01, CUDA Version: 10.2.

I'm having a potentially related issue as well. On a machine with 8 RTX 2080 Ti GPUs, one specific GPU (4) gives the CUDA illegal memory access issue when trying to copy from the GPU to the CPU:

# predicted = pytorch tensor on GPU
predicted = predicted.view(-1).detach().cpu().numpy()
# RuntimeError: CUDA error: an illegal memory access was encountered

Identical code runs fine on the other 7 GPUs but gives an error on this particular GPU after a random number of iterations.

Driver: 430.50
Ubuntu 18.04.3 LTS
CUDA: 10.1.243
cuDNN: 7.5.1

conda install
python: 3.7.4
pytorch:  1.1.0 py3.7_cuda10.1.243_cudnn7.6.3_0
cudatoolkit: 10.1.243
torchvision:  0.4.2

I haven't done too much playing around, but this happens fairly repeatably (usually within 20-30 minutes of running) only on this one particular GPU. Any developments about this issue before I start checking hardware?

@sicklife @bhaeffele Are you seeing this error using the code snippet from the first post on your setup?

Same problem here, happens when I try and call .to(device). CUDA 9.2, torch 0.4.0, torchvision 0.2.1.

I ran the code from the first post for 1e6 iterations without any errors on my "problematic" GPU. Still getting the error with my code on that GPU only.

@knagrecha
0.4.0 is quite old by now. Could you please update to the latest stable release (1.4.0) and retry your script? Feel free to create a new issue and ping me there in case you see the same error with any to('cuda') call. Or are you seeing this error with the code snippet from the first post?

@bhaeffele Could you post a (minimal) executable code snippet to reproduce this error?

input0 = torch.randn(32, 3, 1024).cuda()

try this

input0 = Variable(torch.randn(32, 3, 1024).cuda())

and dont forget

from torch.autograd import Variable

@hadypranoto Variables were deprecated in 0.4.0, so this should not be necessary.
However, I would still recommend updating to the latest stable release and rerunning the script.


I'm facing this problem too, but in my case the failing call is torch.zeros or torch.ones.

input0 = torch.randn(32, 3, 1024).cuda()

try this

input0 = Variable(torch.randn(32, 3, 1024).cuda())

and dont forget

from torch.autograd import Variable

This does not help me; it only delays the error.

I met the same problem. It happened randomly during distributed training on 4 GPUs, and always on GPU 0. (Ubuntu 16.04, 1080 Ti x 3 and Titan Xp x 1, Driver Version: 430.50, CUDA Version: 10.1, pytorch: 1.4.0a0+7f73f1d)

This also happens for me when I try to run PRNet.

RuntimeError: Caught RuntimeError in replica 0 on device 0.
...
...
H = torch.matmul(src_centered, src_corr_centered.transpose(2, 1).contiguous()).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered

Noteworthy is that device 0 is the same device that I use for my monitor. I wonder if other people also have problems with the GPU that is responsible for driving their monitor.

I'm currently running it with CUDA_VISIBLE_DEVICES=1 and will report back if I still get any problem.

Late edit: It didn't help. Trying on main GPU only.
Update again: Despite the error above, it runs fine on device 0 (2 epochs as of writing this). And as mentioned, device 0 is the same GPU that is used for my monitor.
Another update: After 7 epochs it crashed again with the same error as above.

Ubuntu 18.04
2 x GeForce GTX TITAN X
Driver Version: 440.64
Cuda version: 10.1
Torch version: 1.5.0 (nightly)

@jzazo
Hi, I had a similar problem.
If I use device = torch.device("cuda:1"), I always get a RuntimeError: CUDA error: an illegal memory access was encountered error.

But when I set a specific GPU with torch.cuda.set_device(1), everything is fine.

Have you solved your problem? I met the same problem; can you help me?

No, I haven't spent more time on this, and I couldn't fix it.

No, I haven't spent more time on this, and I couldn't fix it.

Have you ever tried to remove the GPU and reinstall it?

Never-ending problem...

Another update:
Running with CUDA_LAUNCH_BLOCKING=1 I was able to get some more information.

Exception has occurred: RuntimeError
cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorScatterGather.cu:72
  File "/home/grans/Documents/prnet2/model.py", line 535, in forward
    src_keypoints = torch.gather(src, dim=2, index=src_keypoints_idx)

[omitted]

I was running this in debug mode in VS Code so I could look into the stack trace and also interactively look at the variables, but nowhere in the stack trace was any cuda variable accessible. In fact, using cuda at all seems to give errors.

I tried creating a new tensor in the debug terminal, but this also resulted in illegal memory access error:

> mytensor = torch.Tensor([[1,2,3,4,5]])
> mytensor.numpy()
array([[1., 2., 3., 4., 5.]], dtype=float32)
> mytensor = torch.Tensor([[1,2,3,4,5]]).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

Note: This occurs regardless on which GPU i run it on.

Additional note:
The code seems to run fine with torch.backends.cudnn.enabled = False, and it is about twice as fast. 🤔
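For reference, a minimal sketch of what I mean; the flag just needs to be set before the model runs:

import torch

# Disable cuDNN globally; PyTorch falls back to its native CUDA kernels.
# Useful for ruling cuDNN out as the source of the illegal memory access.
torch.backends.cudnn.enabled = False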

In a seemingly unrelated PR (https://github.com/pytorch/pytorch/pull/36668) I have encountered the same error in PyTorch CI

@SebastianGrans which command are you using to train the model?
All three training cmds yield:

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128])

Also, are you installing all dependencies from the environment.yml?
If so, could you comment out the torch and torchvision packages, as they are quite old by now?

It would be great if one of the users here can check out https://github.com/pytorch/pytorch/pull/36668, build it, run pytest -sv test/test_autograd.py -x, and confirm whether they experience the same error. If so, we will finally have a way to reliably reproduce this error in the CI.

@Baranowski Illegal memory accesses can be created in various ways. I doubt your PR is related to the original question, which uses a 1D CNN.

I see, thanks for clarifying.

@ptrblck: I might have had that issue as well. Here's a repo of my current "working" code that doesn't give that error: repo.

Just to make sure everything was still working, I did a fresh clone and set up the virtual environment with pip's default torch and torchvision. Surprisingly, this works when running on both GPUs. (At least until it crashed due to issues with the svd solver topk, an issue which has been discussed in the original repo's issue tracker: here)

I then installed torch==1.5.0, torchvision==0.6.0.dev20200327+cu101. With the following command I get the RuntimeError: CUDA error ... at iteration 372/820.

python3 main.py --exp_name "gittest" --svd_on_gpu --batch_size 12

To add to the pile, I also have this error after 28 epochs of training imagenet:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train_imagenet.py", line 350, in <module>
    main()
  File "train_imagenet.py", line 79, in main
    main_worker(args.gpu, args.pretrained, args.arch, args.lr, args.momentum, args.weight_decay, args.resume, args.data, args.epochs, args.workers, args.batch_size, args.evaluate, args.print_freq)
  File "train_imagenet.py", line 170, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, gpu, print_freq)
  File "train_imagenet.py", line 217, in train
    losses.update(loss.detach().item(), images.size(0))
RuntimeError: CUDA error: an illegal memory access was encountered

Ubuntu 18.04
1 x GeForce RTX 2080 Water-cooled
Driver Version: 440.64
Cuda version: 10.2
Torch version: 1.4.0

@SebastianGrans
"Here's a repo of my current "working" code that doesn't give that error: repo."
Could you post a code snippet, which reproduces this error?

Your provided code runs without any errors using:

python3 main.py --exp_name "gittest" --svd_on_gpu --batch_size 12

@mgolub2
Which code are you running?

@ptrblck Interesting, because that crashes on my machine. I'll look into it...

@SebastianGrans I'll try to reproduce it on different systems and PyTorch versions.

I'm also experiencing a similar problem. It was fine before updating to CUDA 10.2 and PyTorch 1.5.

images = images.to(device, non_blocking=True)
labels = labels.to(device, non_blocking=True)
logits = self.model(images)
acc, _ = util.accuracy(logits, labels, topk=(1, 5))
acc_meters.update(acc.item(), labels.shape[0])
acc_meters.update(acc.item(), labels.shape[0])
RuntimeError: CUDA error: an illegal memory access was encountered

Ubuntu 19.10
1 x GeForce RTX 2080ti
Driver Version: 440.64
Cuda version: 10.2
Torch version: 1.5.0

I have the same problem at a random iteration with torch 1.5.0 and CUDA 10.1.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
correct = pred.eq(masks).sum().item()
RuntimeError: CUDA error: an illegal memory access was encountered

I also ran into this issue, and I find that when my dataset is small it does not appear. I guess this is a bug in PyTorch.

I would like to add that I'm encountering the same issue on PyTorch 1.3 with CUDA 10.0. I'm using 8 GPUs and the problem only occurs on GPU 4.

Same problem here, but when I changed the batch size from 16 to 12 the problem disappeared.
PyTorch 1.4, V100

@HanxunHuangLemonBear @yczhang1017 @XinMing0411 @greeneggsandyaml @curiosity2
If you are not running into the illegal memory access with the posted code from comment 1, please create a new issue with information about your script and setup (using the provided template).

Have you checked GPU memory usage?
In my case, illegal memory error meant out of GPU memory error.

I encountered the same problem when running a script for fine-tuning GPT-2. It is probably caused by some bug in CUDA or PyTorch.

CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)

Upgrading torch to 1.5.0 and cuda to 10.2 solves my problem.

@ptrblck Sorry for the delayed response, I was using the ImageNet training code from here: https://github.com/pytorch/examples/tree/master/imagenet.

I have not experienced the error since modifying the power limits on my GPU, so I'm no longer sure my crashes are related to the original issue here. I noticed that when running compute workloads, the RTX 2080 throttles heavily as soon as the temperature is above 50C and the power level is roughly above 140W. By limiting the GPU to 130W, it actually stays at ~1500MHz instead of dropping to 800MHz or even 300MHz, even if the temperature is above 60C. I've completed two runs this way without issue.

To hazard a guess, it might be that Nvidia also does something with the voltage when the GPU backs off, which might lead to some calculation going awry?

I am also on Torch 1.5.0 and CUDA 10.2 and I am still experiencing this problem. It seems to happen when I get above about 4 GB of memory usage on the GPU. I'm using a GTX 1080 and running a 2D convolutional neural net, if that matters.

CUDA error: an illegal memory access was encountered

I had the same problem and found that I had defined a layer (an nn module) in the inference file and hadn't moved it to the GPU:
layer = layer.to(args.device)
I don't know if that's the problem you have, but I thought this might help.

I have the same problem on pytorch 1.0 and CUDA 9.2 when training my segmentation code. I find that setting torch.backends.cudnn.deterministic = True solves the error, but I don't know the mechanism behind it.
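For reference, this is the kind of setting I mean, placed at the top of the training script (a sketch, not my actual code):

import torch

# Ask cuDNN to pick deterministic algorithms and skip the autotuning benchmark.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False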

@youth123 Did you change any other CUDA-related setting besides torch.backends.cudnn.deterministic, like torch.backends.cudnn.enabled, os.environ["CUDA_VISIBLE_DEVICES"] or .to(config.devices[0])?
I have the same problem. It has troubled me for a long time. Please help me. Thanks very much.

@tjusxh I tried the other settings and none worked except torch.backends.cudnn.deterministic. Maybe updating the PyTorch version is the quick solution.

Got this error today June 3rd, 2020 using pytorch 1.5. GPU is T4 on Debian. I had torch.backends.cudnn.deterministic = False

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    85/299     10.4G   0.05363   0.07044   0.03127    0.1553       779       640:  25%|████████████████                                               | 470/1849 [07:00<20:26,  1.12it/s]Traceback (most recent call last):
  File "train.py", line 395, in <module>
    train(hyp)
  File "train.py", line 262, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

"illegal memory access was encountered" is a generic error that can be caused by many different reasons. Unless your report is accompanies by a minimum runnable and reproducible example triggering this error, it is not actionable.

@ngimel yes I understand. Unfortunately it's extremely rare, so I can't supply code to reproduce, as I can't even reproduce it myself, but thought I would add to the statistics here.

The only other detail I have is I was using Nvidia Apex for training.

@tjusxh I tried the other settings and none worked except torch.backends.cudnn.deterministic. Maybe updating the PyTorch version is the quick solution.

Thank you very much. I solved the problem by switching to the CPU, which showed that my data was out of bounds.

@ngimel Do you have a suggestion for how we can achieve this? This bug seems very elusive.

Code that crashes on one GPU or computer, might run fine on another. Changing the batch size sometimes prevents a crash at a certain batch, only to appear later in the epoch. 😕

@tjusxh Can just switching to the CPU solve the issue when the data is out of bounds?

Traceback (most recent call last):
  File "train.py", line 104, in <module>
    train(model, train_iter, optimizer, criterion)
  File "train.py", line 28, in train
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)

I also came across this problem and hope to get help.

Maybe a clue: I have torch 1.5 fp32 code that raises this error for a large model and doesn't for a small model:

rp_bucket = rp_bucket.to(self.relative_attention_bias.weight.device)
RuntimeError: CUDA error: an illegal memory access was encountered

I'm getting this error as well. More details here.

I'm having better success with pytorch 1.6 (nightly), I recommend trying that.

I'm having better success with pytorch 1.6 (nightly), I recommend trying that.

Where is the pytorch 1.6? I can only find the 1.5.1

To hopefully help others that ended up here, the solution suggested here worked for me:

@jzazo
Hi, I had a similar problem.
If I use device = torch.device("cuda:1"), I always get a RuntimeError: CUDA error: an illegal memory access was encountered error.

But when I set a specific GPU with torch.cuda.set_device(1), everything is fine.

I just added a torch.cuda.set_device(<device_num>) before the rest of my code, and it worked. The error for me was that I was loading a model from a checkpoint using pytorch lightning and something was getting put on the default gpu (which was gpu 0).
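For reference, a minimal sketch of that workaround (the device index 1 is just an example):

import torch

torch.cuda.set_device(1)  # make GPU 1 the current device before anything else touches CUDA
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

model = torch.nn.Linear(10, 10).to(device)
x = torch.randn(4, 10, device=device)
out = model(x)  # everything, including implicitly created tensors, now lives on cuda:1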

I'm having better success with pytorch 1.6 (nightly), I recommend trying that.

Where is the pytorch 1.6? I can only find the 1.5.1

@LetsGoFir should just be:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch-nightly

There seem to be multiple possible causes. Just for reference, I encountered the same problem, but when I switched from one GPU server (cluster) to another, the error disappeared. So the device may be one possible cause.

Check whether you have torch.backends.cudnn.benchmark = True.
If so, use torch.backends.cudnn.benchmark = False, or comment out the torch.backends.cudnn.benchmark = True line.

@brucemuller This issue can still be found on PyTorch 1.6:

>>> import torch
>>> torch.__version__
'1.6.0'

And here is the coredump info:

Traceback (most recent call last):
  File "train.py", line 212, in <module>
    train(None)
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 335, in __call__
    self.process()
  File "train.py", line 163, in process
    self.processTrain()
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 294, in processTrain
    self.doBackward()
  File "train.py", line 139, in doBackward
    self.loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb3e291677d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fb3e2b66d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb3e2902b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb41c1990ea in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xe7 (0x7fb442bdfb97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Got the same issue on PyTorch 1.6 with an RTX 2070 on Ubuntu 18.04, CUDA 10.2.

Any solutions?
Please refer to my issue @ darknet: https://github.com/AlexeyAB/darknet/issues/6531

This https://github.com/PyTorchLightning/pytorch-lightning/issues/2085#issuecomment-678938629 may be useful for some cases.

The error happens to me at loss.backward() when I run my model on non-zero GPUs with apex. There is a data discrepancy caused by apex in this case. Setting CUDA_VISIBLE_DEVICES=2,3 and always indexing from 0 in the code tricks apex into aligning its internal data with the model, and avoids the problem for me.
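A minimal sketch of that trick (the GPU ids 2 and 3 are just an example); it has to run before CUDA is initialized:

import os

# Expose only physical GPUs 2 and 3 to this process; inside the script they
# then show up as cuda:0 and cuda:1, so all code can keep indexing from 0.
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'

import torch
device = torch.device('cuda:0')  # physical GPU 2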

In the Python interpreter, I can reproduce one of the scenarios:

root@880e1530b95c:~/examples# python
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> model = torch.nn.Linear(10, 10)
>>> x = torch.rand(15, 10)
>>> model(x)

now everything is ok.

>>> model.cuda()
Linear(in_features=10, out_features=10, bias=True)
>>> model(x.to(0))

now everything is ok.

>>> model(x)

Because x is on the CPU while the model is on the GPU, this reports an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 172, in __repr__
    return torch._tensor_str._str(self)
  File "/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py", line 372, in _str
    return _str_intern(self)
  File "/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py", line 352, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py", line 241, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py", line 89, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

But after this, if I try to move a tensor or module between GPU and CPU, it raises errors:

>>> x.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: an illegal memory access was encountered
>>> model.to('cpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 381, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 610, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered

Everything will be ok if I restart the python interpreter.

I downgraded pytorch to 1.4 and somehow the error is gone.

I also had the same problem, but if I run the whole code in one cell, there's no error. The error only appears if I split the code across different cells of a Jupyter notebook.

I also had this problem. I tested my code on both TensorFlow and PyTorch, and both showed the same error. It runs correctly for some steps; I don't know how to solve this problem. Can someone help me?

we've been testing this pretty rigorously on Lightning (master) and we don't seem to be having these issues... maybe try that?

https://pytorch-lightning.readthedocs.io/en/latest/new-project.html

The issue exists in the latest release (PyTorch 1.6.0, CUDA 10.1); it works perfectly fine with PyTorch 1.4.0, CUDA 10.1.
Anyone on Google Colab can use this:

!pip install torch==1.4.0 torchvision==0.5.0

I have the same issue.
It arose after I added pytorch tensorboard to my pipeline.
The reason it seems to have happened is that I forgot to push my data to the GPU before writing the graph to TensorBoard.

Here is a small "similar" example. It actually raises an error (which my training did not), but after running it, model.weight is inaccessible.

import torch as torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Linear(5,5)
data = torch.rand(20,5)  # .to('cuda')
model.to('cuda')
# model(data)

writer = SummaryWriter()

writer.add_graph(model, data)
writer.close()

Now try to access model.weight: the CUDA memory is corrupted.

In my training there was no error at runtime, only when I afterwards tried to access the weights of the model. This code results in roughly the same error.
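For comparison, a sketch of the same snippet with the input pushed to the GPU first, which is what the fix looked like in my pipeline:

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Linear(5, 5).to('cuda')
data = torch.rand(20, 5).to('cuda')  # input on the same device as the model

writer = SummaryWriter()
writer.add_graph(model, data)
writer.close()

print(model.weight)  # still accessible afterwards when the devices match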

@xparx please file a separate issue

I also got this error with Pytorch 1.6.0, Cuda 10.2 and Cudnn 7.6.5

Traceback (most recent call last):
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 240, in <module>
    episode_returns, timepoints, a_best, seed_best, R_best = agent(game=args.game, n_ep=args.n_ep, n_mcts=args.n_mcts,
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 214, in agent
    model.train(sb, Vb, pib)
  File "f:\pythonapps\alphazero_singleplayer\nn_model.py", line 49, in train
    vb, pib = vb.cuda(), pib.cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

I also got this error with Pytorch 1.6.0, Cuda 10.2 and Cudnn 7.6.5

Traceback (most recent call last):
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 240, in <module>
    episode_returns, timepoints, a_best, seed_best, R_best = agent(game=args.game, n_ep=args.n_ep, n_mcts=args.n_mcts,
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 214, in agent
    model.train(sb, Vb, pib)
  File "f:\pythonapps\alphazero_singleplayer\nn_model.py", line 49, in train
    vb, pib = vb.cuda(), pib.cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

https://github.com/pytorch/pytorch/issues/21819#issuecomment-698809496

https://github.com/pytorch/pytorch/issues/21819#issuecomment-638493770

"illegal memory access was encountered" is a generic error that can be caused by many different reasons. Unless your report is accompanies by a minimum runnable and reproducible example triggering this error, it is not actionable.

I also got this error with Pytorch 1.6.0, Cuda 10.2 and Cudnn 7.6.5

Traceback (most recent call last):
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 240, in <module>
    episode_returns, timepoints, a_best, seed_best, R_best = agent(game=args.game, n_ep=args.n_ep, n_mcts=args.n_mcts,
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 214, in agent
    model.train(sb, Vb, pib)
  File "f:\pythonapps\alphazero_singleplayer\nn_model.py", line 49, in train
    vb, pib = vb.cuda(), pib.cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

@ngimel sorry this error was on me, I was able to fix it.

I'm closing this issue. All users who see "illegal memory access" please open a new issue with reproduction script. "illegal memory access was encountered" is a generic error that can be caused by many different reasons. Unless your report is accompanied by a minimum runnable and reproducible example triggering this error, it is not actionable.

@xiaofeng-fan 's way works for me.
I downgraded pytorch to 1.4 and the error is gone.

This issue still exists for me, even with PyTorch 1.4 and CUDA 10.1.
I used a 2080 Ti.

This issue still exists for me, even with PyTorch 1.4 and CUDA 10.1.
I used a 2080 Ti.

https://github.com/pytorch/pytorch/issues/21819#issuecomment-702877382


I also got this error with Pytorch 1.6.0, Cuda 10.2 and Cudnn 7.6.5

Traceback (most recent call last):
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 240, in <module>
    episode_returns, timepoints, a_best, seed_best, R_best = agent(game=args.game, n_ep=args.n_ep, n_mcts=args.n_mcts,
  File "f:/pythonapps/alphazero_singleplayer/alphazero_pytorch.py", line 214, in agent
    model.train(sb, Vb, pib)
  File "f:\pythonapps\alphazero_singleplayer\nn_model.py", line 49, in train
    vb, pib = vb.cuda(), pib.cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

@ngimel sorry this error was on me, I was able to fix it.

Can you explain what your fix was?

I can say that the fix on my end was that I had an nn.EmbeddingBag that was receiving an out-of-range input (too large, off by one).
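For anyone hitting the same thing, a minimal sketch of the pitfall (indices must stay in [0, num_embeddings - 1]; the numbers are made up):

import torch
import torch.nn as nn

emb = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4).cuda()

good = torch.tensor([[0, 3, 9]], device='cuda')
emb(good)  # fine

# bad = torch.tensor([[0, 3, 10]], device='cuda')  # 10 is out of range (off by one)
# emb(bad)  # triggers a device-side assert / illegal memory access on the GPU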

Maybe helpful for someone: I had this problem and it was the result of multiplying a tensor on the GPU with a tensor on the CPU, and then trying to access the memory. Specifically, if you create a new tensor inline (e.g. torch.eye(3)), make sure that tensor is on the GPU. Perhaps torch.matmul() should throw an error if the input tensors are not on the same device.
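A minimal sketch of the pitfall I mean (assumes a CUDA device is available):

import torch

a = torch.randn(3, 3, device='cuda')
b = torch.eye(3)                      # created on the CPU by default
# c = torch.matmul(a, b)              # device mismatch: depending on the version this is a clear error or worse
c = torch.matmul(a, b.to(a.device))   # keeping both operands on the same device avoids the problem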

I had a similar experience as well: I had an nn.LayerNorm on the CPU receiving a tensor on the GPU as input; putting both on the same device fixed the problem.

@totomobile43 Thank you so much, I was running into this illegal memory access error as well, and I think I had the same problem - multiplying tensors and then accessing the results later would throw the error.

However, when I explicitly moved both my tensors onto cpu before doing the matrix multiplication, the error went away.

The issue exists in the latest release (PyTorch 1.6.0, CUDA 10.1); it works perfectly fine with PyTorch 1.4.0, CUDA 10.1.
Anyone on Google Colab can use this:

!pip install torch==1.4.0 torchvision==0.5.0

This fixed the issue for me, @prameth, thanks!

I upgraded pytorch to 1.7 and the error is gone.
