File "../ptx/fit_extension.py", line 386, in _train_epoch
scaled_loss.backward()
File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "../../apex/apex/amp/handle.py", line 125, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights
models_are_masters=False)
File "../../apex/apex/amp/scaler.py", line 113, in unscale
1./scale)
File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
*args)
RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)
I used a single card to run amp, and it produced the above error.
However, when I use more than one card to train, it doesn't produce any error.
Do you have a minimal code sample that reproduces the error? Also, what is your environment (which pytorch version, which cuda version)?
compile:
torch.__version__ = 1.1.0
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
from /usr/local/cuda/bin
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
I use apex to train BERT, and it produces the error in
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
What optimizer are you using? Also, how are you initializing Amp?
I use the BertAdam optimizer and initialize amp like this:
self.model, self.optimizer = amp.initialize(self.model, self.optimizer, opt_level=opt_level)
Are you using BertAdam from here? Also what value are you using for opt_level?
We've actually got some people right now working on optimizing BERT specifically. I'll let you know if we encounter anything similar.
I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
I haven't used Apex/AMP before, so maybe there is some user error here. That said, I also seem to get an error when using a device other than the default device. The code at the end gives me:
RuntimeError: CUDA error: an illegal memory access was encountered
for opt_levels O1 and O2. In particular, I do not seem to get an error for opt_level O3.
Version information:
8be5b6bedead620db636516d064db39f82052e01 (latest commit when I installed it)
torch.version.git_version = '20607a99a31ec5405ca6aa92bc7e7bf768b7bc43' (just installed latest stable using official instructions this morning)
e25e57dde9ade23a377536df339be4d8410a7a7bcddb1e96b0e2db63ac088ed4)
import torch
import torchvision
from apex import amp

device = "cuda:1"
# Set to False to make cuda:1 the default device first, which avoids the error.
wantIllegalAccessException = True

if __name__ == '__main__':
    if not wantIllegalAccessException:
        torch.cuda.set_device(device)
    model = torchvision.models.resnet34().to(device)
    optimizer = torch.optim.Adam(model.parameters(), 1e-3)
    criterion = torch.nn.CrossEntropyLoss().to(device)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    input = torch.randn(2, 3, 224, 224, device=device)
    target = torch.randint(0, 999, [input.shape[0]], device=device)
    output = model(input)
    loss = criterion(output, target)
    optimizer.zero_grad()
    # The illegal memory access is raised while leaving this block.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
In scaler.py there is the line self._overflow_buf = torch.cuda.IntTensor([0]), which initializes the buffer on the default CUDA device. If the model is on another device, we hit the error "CUDA error: an illegal memory access was encountered".
@ReactiveCJ is probably right about the source of the error. However, in general, when using multiple GPUs or manually trying to use a GPU other than the default, it's definitely best practice to call torch.cuda.set_device before you construct your model or call amp.initialize. Calling .to manually on your model is error-prone and might not catch everything (even if you aren't using Amp).
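One way to test this diagnosis is to list the parameters that do not live on the current default device before calling amp.initialize. The sketch below is not part of apex: params_off_default_device and its current_index argument are hypothetical names, and it assumes that internal buffers such as the one in scaler.py end up on whatever torch.cuda.current_device() reports.

```python
def params_off_default_device(model, current_index=None):
    """Return names of CUDA parameters that are not on the default CUDA device.

    Hypothetical diagnostic helper, not an apex or PyTorch API.
    """
    if current_index is None:
        import torch  # imported lazily so the check itself needs no GPU to test

        current_index = torch.cuda.current_device()
    return [
        name
        for name, param in model.named_parameters()
        if param.is_cuda and param.device.index != current_index
    ]
```

An empty list means the model and the default device agree. A non-empty list suggests calling torch.cuda.set_device for the model's GPU before building the model and calling amp.initialize.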
I encountered this problem myself as well, where device = torch.device('cuda:0') works, but device = torch.device('cuda:1') does not.
The error occurs randomly, not at a particular epoch.
THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_104508condaconda-bld\pytorch_1572950778684\work\aten\srcTHC/generic/THCStorage.cpp line=39
error=700 : an illegal memory access was encountered
Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 244, in
D_A_loss.backward()
File "C:\Users\hadypranoto\Anaconda3envs\tensenv\lib\site-packages\torch\tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\hadypranoto\Anaconda3envs\tensenv\lib\site-packages\torch\autograd__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at C:\w\1\s\tmp_conda_3.7_104508condaconda-bld\pytorch_1572950778684\work\aten\srcTHC/generic/THCStorage.cpp:39
Sometimes I get an error like this one, also occurring randomly:
Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 225, in
G_B_loss = MSE_loss(D_A_fake_decision, Variable(torch.ones(D_A_fake_decision.size()).cuda(0)))
RuntimeError: CUDA error: an illegal memory access was encountered
I'm not sure whether this is a PyTorch bug or a bug in my code.
Yep, same problem.
device = torch.device('cuda:0') works OK
device = torch.device('cuda:1') fails when calling scaled_loss.backward()
Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))
I'm guessing somewhere in your code, there are 2 references being kept to different devices.
Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.
Memory may be getting swapped to the CPU or to other GPUs. Restarting CUDA or rebooting the computer might solve the problem.
I also encountered this error.
I think it may be because I use multiple GPUs. One module of my model is placed on another GPU, and I transfer my data to that GPU manually with code like p = p.to('cuda:1').
When I remove the amp code, the problem goes away. It seems apex does not support this setup well.
@mcarilli @nlp520
I have the same problem. Does apex not support PyTorch's model parallelism?
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
(not the DataParallel)
I was running someone else's code. The error always appears when I create a local variable such as
t = torch.zeros(sizeoftensor).cuda()
Is it caused by insufficient memory? It only happens after a certain number of iterations, not at the beginning.
Seeing this also while running pix2pixHD on two GPUs (with the --fp16 argument).
Setting torch.backends.cudnn.benchmark = False resolves the error for me.
Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.
@tripzero same problem, have you found any other solution? thanks~
@dekura no dice. Tried 1 GPU and 2 GPUs. Tried changing optimization level to O2. :(. I can't even reproduce the 100% GPU result I was seeing earlier. Just Illegal Memory Access errors.
I encountered this issue myself. Did not see error on opt_level 'O0' but did see on opt_level 'O1'. Per the suggestion of @tatsuhiko-inoue, I can use O1 on GPU 1 with the following:
torch.cuda.set_device(1)
device = torch.device('cuda:1')
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
optLv = 'O1'
net.to(device)
net,optimizer = amp.initialize(net,optimizer,opt_level=optLv)
Then train as usual, replacing loss.backward with
with amp.scale_loss(loss,optimizer) as scaled_loss:
    scaled_loss.backward()
@hadypranoto I encountered the same problem. Have you figured out why and how to solve it? Thanks!
@JianYang93 @matlabninja @tripzero
Traceback (most recent call last):
File "train.py", line 104, in
train(model, train_iter, optimizer, criterion)
File "train.py", line 28, in train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)
I also came across this problem and hope to get help.
@ll0iecas Sorry, I am in no way an expert on this, and I encountered this error outside this particular package. FYI, my problem was caused by too large a batch size.
@ll0iecas Did you explicitly set your device?
torch.cuda.set_device(device)
I did, but nothing worked
Hello, I also got this error, and I have no idea how to fix it. I explicitly set the device, but it doesn't work.
I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
What do you mean? How do you specify the GPU for each process? Do you call torch.cuda.set_device() after each new variable is created?
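One reading of that advice, sketched under assumptions: in a one-process-per-GPU launch, each process calls torch.cuda.set_device once at startup, before any tensor, model, or amp.initialize call, rather than after every new variable. The helper names below are hypothetical, and the sketch assumes the launcher exports a LOCAL_RANK environment variable (torchrun does; other launchers may pass the rank differently).

```python
import os


def local_rank(environ=None):
    """Read this process's GPU index; assumes the launcher sets LOCAL_RANK."""
    environ = os.environ if environ is None else environ
    return int(environ.get("LOCAL_RANK", "0"))


def pin_process_to_gpu(environ=None):
    """Make this process's GPU the default CUDA device and return it."""
    import torch  # imported lazily so local_rank() is usable without PyTorch

    rank = local_rank(environ)
    # Call once, at startup: later default-device allocations land on this GPU.
    torch.cuda.set_device(rank)
    return torch.device("cuda", rank)
```

A training script would call pin_process_to_gpu() at the top of its main function and then move the model to the returned device before calling amp.initialize.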