Apex: RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)

Created on 20 May 2019 · 30 Comments · Source: NVIDIA/apex

File "../ptx/fit_extension.py", line 386, in _train_epoch
scaled_loss.backward()
File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "../../apex/apex/amp/handle.py", line 125, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights
models_are_masters=False)
File "../../apex/apex/amp/scaler.py", line 113, in unscale
1./scale)
File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
*args)
RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)

When I run Amp on a single card, it produces the above error.
However, when I train on more than one card, it doesn't produce any error.

BERT

nlp520

All 30 comments

Do you have a minimal code sample that reproduces the error? Also, what is your environment (which pytorch version, which cuda version)?

mcarilli on 21 May 2019

compile:
torch.__version__ = 1.1.0
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
from /usr/local/cuda/bin

Pytorch binaries were compiled with Cuda 10.0.130

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

pytorch:1.1.0

nlp520 on 21 May 2019

I am using Apex to train BERT, and the error is raised in:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

nlp520 on 21 May 2019

What optimizer are you using? Also, how are you initializing Amp?

mcarilli on 21 May 2019

I use the BertAdam optimizer and initialize Amp with:
self.model, self.optimizer = amp.initialize(self.model, self.optimizer, opt_level=opt_level)

nlp520 on 24 May 2019

Are you using BertAdam from here? Also what value are you using for opt_level?

We've actually got some people right now working on optimizing BERT specifically. I'll let you know if we encounter anything similar.

mcarilli on 24 May 2019

I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.

tatsuhiko-inoue on 29 May 2019

👍13

I haven't used Apex/AMP before, so maybe there is some user error here. That said, I also seem to get an error when using a device other than the default device. The code at the end gives me:

RuntimeError: CUDA error: an illegal memory access was encountered

for opt_levels O1 and O2. In particular, I do not seem to get an error for opt_level O3.

Version information:

Apex commit: 8be5b6bedead620db636516d064db39f82052e01 (latest commit when I installed it)
torch.version.git_version = '20607a99a31ec5405ca6aa92bc7e7bf768b7bc43' (just installed latest stable using official instructions this morning)
Nvidia driver: 430.14
Running this in docker container based on: nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 (e25e57dde9ade23a377536df339be4d8410a7a7bcddb1e96b0e2db63ac088ed4)

import torch
import torchvision

from apex import amp

device = "cuda:1"
wantIllegalAccessException = True

if __name__ == '__main__':
  if not wantIllegalAccessException:
    torch.cuda.set_device(device)

  model = torchvision.models.resnet34().to(device)
  optimizer = torch.optim.Adam(model.parameters(), 1e-3)
  criterion = torch.nn.CrossEntropyLoss().to(device)

  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

  input = torch.randn(2, 3, 224, 224, device=device)
  target = torch.randint(0, 999, [input.shape[0]], device=device)

  output = model(input)
  loss = criterion(output, target)

  optimizer.zero_grad()
  with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
  optimizer.step()

svedi on 2 Jun 2019

👍1

In scaler.py there is the line self._overflow_buf = torch.cuda.IntTensor([0]), which creates the buffer on the default CUDA device. If the model is on another device, we then hit the error "CUDA error: an illegal memory access was encountered".

ReactiveCJ on 17 Jun 2019

👍3
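
To make the explanation above concrete, here is a minimal pre-flight check. It is only a sketch, not part of Apex, and it assumes the cause is exactly the one described: a buffer created on the current CUDA device while the model sits on a different GPU. The helper names parse_cuda_index and buffer_would_mismatch are hypothetical.

```python
def parse_cuda_index(device):
    """Return the GPU index of a device such as 'cuda:1'; a bare 'cuda' gives None."""
    kind, _, index = str(device).partition(":")
    if kind != "cuda":
        raise ValueError(f"not a CUDA device: {device!r}")
    return int(index) if index else None


def buffer_would_mismatch(model_device, current_index):
    """True if a tensor allocated on the current device would sit on a
    different GPU than the model (the situation described above)."""
    model_index = parse_cuda_index(model_device)
    return model_index is not None and model_index != current_index


# Hypothetical values: model on cuda:1 while the current device is still 0.
print(buffer_would_mismatch("cuda:1", 0))  # True
print(buffer_would_mismatch("cuda:1", 1))  # False
```

With PyTorch available, the arguments would typically be next(model.parameters()).device and torch.cuda.current_device(). Both helpers accept a torch.device as well as a string, because they only look at str(device).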

@ReactiveCJ is probably right about the source of the error. However, in general, when using multiple GPUs or manually trying to use a GPU other than the default, it's definitely best practice to call torch.cuda.set_device before you construct your model or call amp.initialize. Calling .to manually on your model is error-prone and might not catch everything (even if you aren't using Amp).

mcarilli on 19 Jun 2019

❤1

I encountered this problem myself as well, where device = torch.device('cuda:0') works, but device = torch.device('cuda:1') does not.

jzazo on 9 Oct 2019

The error occurs randomly, not at any particular epoch:

THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_104508condaconda-bld\pytorch_1572950778684\work\aten\srcTHC/generic/THCStorage.cpp line=39
error=700 : an illegal memory access was encountered
Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 244, in
D_A_loss.backward()
File "C:\Users\hadypranoto\Anaconda3envs\tensenv\lib\site-packages\torch\tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\hadypranoto\Anaconda3envs\tensenv\lib\site-packages\torch\autograd__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at C:\w\1\s\tmp_conda_3.7_104508condaconda-bld\pytorch_1572950778684\work\aten\srcTHC/generic/THCStorage.cpp:39

Sometimes I get an error like this instead, also occurring randomly:

Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 225, in
G_B_loss = MSE_loss(D_A_fake_decision, Variable(torch.ones(D_A_fake_decision.size()).cuda(0)))
RuntimeError: CUDA error: an illegal memory access was encountered

I can't tell whether this is a PyTorch bug or a bug in my code.

hadypranoto on 8 Nov 2019

Yep, same problem.

device = torch.device('cuda:0') works OK

device = torch.device('cuda:1') fails when calling scaled_loss.backward()

Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))

I'm guessing somewhere in your code, there are 2 references being kept to different devices.

Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.

DuaneNielsen on 23 Dec 2019

👍4

Memory may be getting swapped with the CPU or with other GPUs. Restarting CUDA or rebooting the computer might solve the problem.

middle-chunjie on 28 Dec 2019

I also encountered this error.
I think it may be because I used multiple GPUs. One module of my model is placed on another GPU, and I transfer my data to that GPU manually using code like p = p.to('cuda:1').
When I remove the Amp code, the problem goes away. It seems Apex does not support this setup well.

Aria-K-Alethia on 5 Feb 2020

👍3
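
For setups like the one above, where parts of a model live on different GPUs, it can help to see exactly which tensors sit where before calling amp.initialize. The following is a minimal sketch under that assumption. The function names devices_by_tensor and audit_model are hypothetical, and the code only reports placement; it does not make Amp work with a split model.

```python
from collections import defaultdict


def devices_by_tensor(named_tensors):
    """Group tensor names by the device they live on.

    `named_tensors` is any iterable of (name, tensor-like) pairs in which
    each tensor-like object exposes a `.device` attribute.
    """
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)


def audit_model(model):
    """Report which parameters and buffers of `model` sit on which device.

    Works with any object offering named_parameters() and named_buffers(),
    such as a torch.nn.Module.
    """
    named = list(model.named_parameters()) + list(model.named_buffers())
    return devices_by_tensor(named)
```

If audit_model(model) returns more than one key, the model spans several devices, which matches the situation described in the comment above.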

> I also encountered this error.
> I think it may be because I used multiple GPUs. One module of my model is placed on another GPU, and I transfer my data to that GPU manually using code like p = p.to('cuda:1').
> When I remove the Amp code, the problem goes away. It seems Apex does not support this setup well.

@mcarilli @nlp520
I have the same problem. Does Apex not support PyTorch's model parallelism (as opposed to DataParallel)?
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

ZhangMingHui123 on 14 Feb 2020

> Yep, same problem.
>
> device = torch.device('cuda:0') works OK
>
> device = torch.device('cuda:1') fails when calling scaled_loss.backward()
>
> Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))
>
> I'm guessing somewhere in your code, there are 2 references being kept to different devices.
>
> Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.

I am running someone else's code. The error always appears when I create a local variable such as:

t = torch.zeros(sizeoftensor).cuda()

Could it be caused by insufficient memory? It happens after a certain number of iterations, not at the beginning.

hadypranoto on 23 Feb 2020

Seeing this too while running pix2pixHD on two GPUs (with the --fp16 argument).

tripzero on 12 Mar 2020

setting torch.backends.cudnn.benchmark = False resolves the error for me

MittalShruti on 20 Mar 2020

> setting torch.backends.cudnn.benchmark = False resolves the error for me

Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.

tripzero on 21 Mar 2020

> > setting torch.backends.cudnn.benchmark = False resolves the error for me
>
> Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.

@tripzero same problem, have you found any other solution? thanks~

dekura on 24 Mar 2020

@dekura no dice. Tried 1 GPU and 2 GPUs. Tried changing optimization level to O2. :(. I can't even reproduce the 100% GPU result I was seeing earlier. Just Illegal Memory Access errors.

tripzero on 24 Mar 2020

I encountered this issue myself. I did not see the error with opt_level 'O0' but did see it with opt_level 'O1'. Per the suggestion of @tatsuhiko-inoue, I can use O1 on GPU 1 with the following:
torch.cuda.set_device(1)
device = torch.device('cuda:1')
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
optLv = 'O1'
net.to(device)
net, optimizer = amp.initialize(net, optimizer, opt_level=optLv)

Then train as usual, replacing loss.backward() with
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

matlabninja on 18 May 2020

@hadypranoto I encountered the same problem. Have you figured out why and how to solve it? Thanks!

JianYang93 on 4 Jun 2020

@JianYang93 @matlabninja @tripzero
Traceback (most recent call last):
File "train.py", line 104, in
train(model, train_iter, optimizer, criterion)
File "train.py", line 28, in train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)
I also came across this problem and hope to get help.

ll0iecas on 8 Jun 2020

@ll0iecas Sorry, I am in no way an expert on this, and I did not encounter this error in this particular package. FYI, my problem was caused by a batch size that was too large.

JianYang93 on 8 Jun 2020

👍1

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

BramVanroy on 8 Jun 2020

> @ll0iecas Did you explicitly set your device?
> torch.cuda.set_device(device)

I did, but nothing worked

ll0iecas on 10 Jun 2020

> > @ll0iecas Did you explicitly set your device?
> > torch.cuda.set_device(device)
>
> I did, but nothing worked

Hello, I also got this error, and I have no idea how to fix it. I explicitly set the device, but it doesn't work.

LeMei on 20 Aug 2020

> I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.

What do you mean? How do you specify the GPU for each process? Do you call torch.cuda.set_device() after each new variable is created?

cherepas on 27 Nov 2020
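
One possible reading of "specify the default GPU for each process", given only as a sketch: call torch.cuda.set_device() once near the start of each process, before the model is built and before amp.initialize, rather than after every new variable. The example assumes the launcher exports a LOCAL_RANK environment variable (torchrun does). The helper names local_gpu_index and pin_process_to_gpu are hypothetical.

```python
import os


def local_gpu_index(env=None, default=0):
    """Pick the GPU index for this process.

    Assumes the launcher exports LOCAL_RANK; otherwise falls back to
    `default`.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", default))


def pin_process_to_gpu(index=None):
    """Set the current CUDA device once, before building the model."""
    import torch  # lazy import: only needed when a GPU is actually pinned

    index = local_gpu_index() if index is None else index
    # Later CUDA allocations that do not name a device now land on this GPU.
    torch.cuda.set_device(index)
    # Pass the returned device to .to(device) when moving the model and data.
    return torch.device("cuda", index)
```

A training script would call device = pin_process_to_gpu() first, then construct the model and optimizer, and only then call amp.initialize, which is the ordering recommended earlier in the thread.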
