No problem in PyTorch 1.2. Archive with code and data: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip
Windows 10 (1903), Python 3.7.4, RTX 2060 (driver version 436.48)
RuntimeError Traceback (most recent call last)
<ipython-input-6-68308ed1e055> in <module>
35 cum_loss.append(loss.item())
36
---> 37 loss.backward()
38 optimizer.step()
39
C:\Anaconda3\envs\torch13\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
148 products. Defaults to ``False``.
149 """
--> 150 torch.autograd.backward(self, gradient, retain_graph, create_graph)
151
152 def register_hook(self, hook):
C:\Anaconda3\envs\torch13\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101
RuntimeError: CUDA error: unspecified launch failure
cc @ezyang @gchanan @zou3519 @SsnL @albanD @gqchen @ngimel @peterjc123
Can you provide a minimal code example to reproduce? Please also copy and paste the output from our environment collection script. You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Hello @vincentqb,
Code example: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip
Output from the environment collection script:
Collecting environment information...
PyTorch version: 1.3.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Microsoft Windows 10 Enterprise
GCC version: Could not collect
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cudnn64_7.dll
Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.3.0
[pip] torchvision==0.4.1
[conda] blas 1.0 mkl
[conda] libblas 3.8.0 13_mkl conda-forge
[conda] libcblas 3.8.0 13_mkl conda-forge
[conda] liblapack 3.8.0 13_mkl conda-forge
[conda] mkl 2019.4 245
[conda] mkl-service 2.3.0 py37hb782905_0
[conda] pytorch 1.3.0 py3.7_cuda101_cudnn7_0 pytorch
[conda] torchvision 0.4.1 py37_cu101 pytorch
nvidia-smi:
Mon Oct 14 21:05:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 436.48 Driver Version: 436.48 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 63C P2 28W / N/A | 1103MiB / 6144MiB | 13% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6156 C C:\Anaconda3\envs\torch12\python.exe N/A |
+-----------------------------------------------------------------------------+
@alexeygolyshev is that a minimal example? Looks like there is a lot of code in there.
If you could reduce the size of the code, it would really help with finding what is the root cause, thanks !
Hello @albanD,
Yes, this is a minimal example. I don't think I can greatly reduce the code. I have already deleted the data preprocessing.
@albanD My inputs: [sentences, words, characters]. I have 2 varying dimensions: different number of words in a sentence and different number of characters in a word.
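For readers unfamiliar with that kind of input, here is a minimal illustrative sketch (not code from the archive above) of how two varying dimensions are typically padded before being fed to an RNN:

```python
# Illustrative only: padding one of the two varying dimensions -- characters per
# word -- with torch.nn.utils.rnn.pad_sequence; words per sentence are padded the same way.
import torch
from torch.nn.utils.rnn import pad_sequence

# one sentence = a list of words, each word = a tensor of character indices
sentence = [torch.tensor([3, 7, 1]), torch.tensor([5, 2]), torch.tensor([9, 4, 6, 8])]

# pad characters so every word has the same length -> shape (num_words, max_chars)
padded_words = pad_sequence(sentence, batch_first=True, padding_value=0)
print(padded_words.shape)  # torch.Size([3, 4])
```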
Unfortunately I don't have a setup with notebook available. Could you run your code with anomaly_mode enabled and post here the extended stack trace?
Hello, my computer system is the same as yours (Win10 (1903), Python 3.7.4, RTX 2060 (driver version 441.20), torch.version == 1.2.0).
I encountered the same problem as you. Have you solved it now?
You say there is no problem in PyTorch 1.2. Can you tell me all the version information for that setup?
CUDA? cuDNN? Python?
Hello @JYH9351,
I am currently using PyTorch 1.3.0 in production. I don't know why, but this helps:
with t.autograd.set_detect_anomaly(False):
    for epoch in range(epochs):
        ...
It crashes less frequently, and not in the first 2 epochs.
Does switching off the TDR settings help? https://zhuanlan.zhihu.com/p/38141415
No. TDR = 60. Ran it 2 times. It crashed in epochs 2 and 11. This error appears randomly.
with t.autograd.set_detect_anomaly(True) increases the time per epoch by 5x. In October, I waited several hours, but there was no error, so there is no extended stack trace.
Sometimes t.autograd.set_detect_anomaly(False) can increase the time without errors. But I am not sure. In October, I trained several networks with a 2-day uptime. But in later experiments, it also crashed randomly.
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 28 62 - 12 0 0 0 6801 960
0 30 62 - 11 0 0 0 6801 1155
0 32 63 - 20 7 0 0 6801 1155
0 32 62 - 13 1 0 0 6801 960
0 28 62 - 15 1 0 0 6801 960
0 28 62 - 16 1 0 0 6801 960
0 28 62 - 15 1 0 0 6801 960
0 28 63 - 14 1 0 0 6801 960
0 27 63 - 13 0 0 0 6801 960
0 28 62 - 11 3 0 0 6801 960
0 28 62 - 0 0 0 0 6801 960
0 12 62 - 0 0 0 0 810 345
0 5 61 - 0 0 0 0 405 345
I have to say that it is difficult to say where the problem is without the stacktrace including the exact crash site. But we may get that with the help of a RelWithDebInfo build and the attachment of the VS debugger. I could build one for you if you have trouble in building the project.
It would be great if you could prepare the debug build. I don't have much experience with building it.
Interesting.
I had this issue training a model from https://github.com/wgrathwohl/JEM with PyTorch 1.3
I used this command
python train_wrn_ebm.py --lr .0001 --dataset cifar10 --optimizer adam --p_x_weight 1.0 --p_y_given_x_weight 1.0 --p_x_y_weight 0.0 --sigma .03 --width 2 --depth 40 --save_dir ./experiments --plot_uncond --warmup_iters 1000
The error happened seemingly randomly in the middle of training. I am using Linux Mint, not Windows.
I would suggest that you try uninstalling the GPU driver with DDU and installing the driver that comes with the CUDA toolkit.
There are too many bugs with the Nvidia GPU driver on Windows 10.
I have run into this same issue and tried the suggestion of @kice of installing the driver from the cuda toolkit with no luck.
I am running into similar issues on my Windows machine. I have a simple pipeline for binary classification with an LSTM, and it shuts down at some epoch (it seems to be random).
My issue is also with an LSTM. Interestingly, when I add torch.autograd.set_detect_anomaly(True) to get a stack trace, it takes about 20% longer to train but doesn't fail. I will run a few more times to see if that is consistently true.
Same problem with LSTM + binary classification + error in random epoch on windows 10 + Pytorch 1.4
File "C:/Users/User/GoogleDrive/mad2-recommend/gnn/train.py", line 108, in main
train_model(train_loader, predict_score_net, optimizer)
File "C:/Users/User/GoogleDrive/mad2-recommend/gnn/train.py", line 41, in train_model
loss.backward()
File "C:\Users\User\Anaconda3\lib\site-packages\torch\tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\User\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: unspecified launch failure
update:
GRU has the same problem.
@shingyipcheung Are you able to replicate the error with torch.autograd.set_detect_anomaly(True) set in order to get a full stacktrace?
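For anyone unsure how to do that, here is a minimal, self-contained sketch (the tiny LSTM and random data are placeholders, not anyone's actual model) of enabling anomaly detection so the backward error is reported together with the forward call that created the failing op:

```python
# Minimal sketch: torch.autograd.set_detect_anomaly(True) makes backward() print
# the forward stack trace of the op that produced the failing gradient.
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # global switch; slows training noticeably

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
optimizer = torch.optim.Adam(model.parameters())

for step in range(10):
    x = torch.randn(4, 5, 8, device=device)  # (batch, seq_len, features)
    out, _ = model(x)
    loss = out.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()  # with anomaly mode on, a failure here also prints the forward trace
    optimizer.step()
```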
I'm having this issue as well! (EDIT: on the latest 1.4). The network will train for a while, then at some random point, the classifier will halt with this exception.
It is possible to reproduce by using FastAI AWD-LSTM transfer learning for text classification on a very large dataset: https://docs.fast.ai/text.html
After this happens, further CUDA operations result in the same error until the kernel is restarted.
I suspect a lot of this simply does not get tested on Windows. Professionally, I always use Linux for machine learning tasks. It just so happens that my only personal system with a GPU runs Windows and does not have space for a Linux install. Furthermore, "Ubuntu on Windows" does not support CUDA.
I have the same issue when I train with LSTM + classification; the error occurs at a random epoch on Windows 10 + Pytorch 1.4.
waiting for a solution
Raising priority based on user activity
@peterjc123 Would you like to look at this issue again? Thanks!
I'm facing the same issue when running on Win10 + Pytorch 1.4 with LSTM + classification. My GPU runs smoothly with other models like CNNs.
The same piece of code runs fine on a server with Ubuntu, so I guess this is some issue with Win10 compatibility.
Could you please test whether it is still a problem in the nightly package or the 1.5.0 package?
@peterjc123 the issue is resolved after I upgraded to the nightly version. Thanks for the help :)
Same problem, LSTM training random crash.
Pytorch 1.5.0, 1660Ti, 2600X, windows10, 32G
I encountered the same error with Pytorch 1.4.0, RTX 2060, Windows 10, also training an LSTM.
Trying to upgrade to Pytorch 1.5.0 now and will report again.
BTW, I did not observe this error on my development laptop with Pytorch 1.4 and a GTX 1050 Ti (I don't know if I was lucky enough or if it's specific to some GPUs).
Update:
After upgrading to Pytorch 1.5.0, the error still occurs randomly (with RTX 2060 on Windows 10).
Same issue here. On a Win10 machine with two 2080 Ti cards. The kernel crashes after the first few batches of a GRU model.
Pytorch: 1.5.0
Well, instead of describing the fact that it occurs, it would be better to provide some code for us to reproduce it on our side.
Hi @peterjc123, I'd like to, but it occurs at various positions in my code, so I cannot point to a snippet that reproduces it. Later I will work on minimizing my project for your team to test.
If it helps, I see the same thing with any Windows Nvidia driver above 431.68. Using 431.68 or below seems to be fine.
@roceh does running with torch.autograd.set_detect_anomaly(True) and a newer driver crash and produce a useful stack trace?
I have the same problem running CenterNet object detection code, with Ubuntu 18 + 2080 Super + CUDA 10. It runs fine for a few epochs and randomly crashes at some iteration. Any solution...?
@TWDH can you get a stacktrace from a crash?
Hi @peterjc123 ,
I am a newbie with pytorch & detectron2. I also encountered this problem last night. I use my own dataset (a small one: 180+ train images, 40+ val images), prepared in MS COCO format.
I am using Docker, and the container setup matches the Dockerfile in the detectron2 repo.
https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile
And here is my PC config:
Attached is the train.py I use to train detectron2.
train_py.zip
Please help. Thanks.
Hi @cclo-astri. Thanks for reporting. Most of the other reports here have been for windows. Could you try to report the stacktrace from the failure by using torch.autograd.set_detect_anomaly(True) and rerunning the code? It would also help to know which version of PyTorch, Python, and Cuda you are using
Hi @mattip
After I changed the batch_size and num_workers of the dataloader to 1, the error seems to be gone (at least the training could run for nearly 5 hours, but then the PC suddenly hung and the training was incomplete). The ETA of the training time also decreased; why?
Before:
-- cfg.DATALOADER.NUM_WORKERS = 2
-- cfg.SOLVER.IMS_PER_BATCH = 2
-- ETA: 9.5 hours, max_mem: 5.8GBytes (nvidia-smi shows 6.8GBytes)
-- Failed after 1.5 hours
After:
-- cfg.DATALOADER.NUM_WORKERS = 1
-- cfg.SOLVER.IMS_PER_BATCH = 1
-- ETA: 5 hours, max_mem: 3.5GBytes (nvidia-smi shows 4.5GBytes)
And here is the detailed information about my docker environment:
For further version information of installed packages, please refer to: https://hub.docker.com/layers/nvidia/cuda/10.1-cudnn7-devel/images/sha256-557de4ba2cb674029ffb602bed8f748d44d59bb7db9daa746ea72a102406d3ec?context=explore
Thanks.
I just got this error today after updating my NVIDIA drivers to 445.87 (I haven't updated them for a year at least). I'm using a GTX 1060 (6Gb), Cuda compilation tools, release 9.0, V9.0.176, pytorch 1.5.0+cu92, cudnn 7.3.0
My LSTM was training fine before. With a set seed, it always crashes at the same time (at the middle of epoch 52). I first get the error below
r_out, (h_out, c_out) = self.rnn(x)
File "C:\Users\willi\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\willi\Anaconda3\lib\site-packages\torch\nn\modules\rnn.py", line 570, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
But then when I try to train again without killing the kernel, I get :
File "C:\Users\willi\Anaconda3\lib\site-packages\torchtext\data\iterator.py", line 156, in __iter__
yield Batch(minibatch, self.dataset, self.device)
File "C:\Users\willi\Anaconda3\lib\site-packages\torchtext\databatch.py", line 34, in __init__
setattr(self, name, field.process(batch, device=device))
File "C:\Users\willi\Anaconda3\lib\site-packages\torchtext\data\field.py", line 237, in process
tensor = self.numericalize(padded, device=device)
File "C:\Users\willi\Anaconda3\lib\site-packages\torchtext\data\field.py", line 359, in numericalize
var = torch.tensor(arr, dtype=self.dtype, device=device)
RuntimeError: CUDA error: unspecified launch failure
Update : tried rolling back my NVIDIA drivers to 442.59 and the error still appears at epoch 60.
This related issue mentions a fix to our problem, which consistently works on my machine: https://github.com/pytorch/pytorch/issues/21819. Specifically this comment.
System:
I'm experiencing similar issues; however, anomaly detection is enabled in this case. The model is a simple two-layer (convolutional) revnet using less than a gigabyte of VRAM. The system itself uses between 300 and 800 MB.
The errors pop up with both CUDA 10.2 and 11.0, giving the same tracebacks in both runs.
Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\autograd\function.py", line 77, in apply
return self._forward_cls.backward(self, *args)
File "C:\ProgramData\Anaconda3\lib\site-packages\memcnn-1.3.2-py3.7.egg\memcnn\models\revop.py", line 83, in backward
temp_output = ctx.fn(*detached_inputs)
File "C:\ProgramData\Anaconda3\lib\site-packages\memcnn-1.3.2-py3.7.egg\memcnn\models\additive.py", line 65, in forward
gmd = self.Gm.forward(y1)
File "C:\Users\UserName\Documents\Project\pytorch\model.py", line 121, in forward
sort = self.sort_conv(inp)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 349, in forward
return self._conv_forward(input, self.weight)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 346, in _conv_forward
self.padding, self.dilation, self.groups)
(print_stack at ..\torch\csrc\autograd\python_anomaly_mode.cpp:60)
Warning: Error detected in InvertibleCheckpointFunctionBackward. Traceback of forward call that caused the error:
File ".\main.py", line 14, in <module>
model.fit()
File "C:\Users\UserName\Documents\Project\pytorch\main.py", line 129, in fit
out = self.model(src)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 100, in forward
input = module(input)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\memcnn-1.3.2-py3.7.egg\memcnn\models\revop.py", line 183, in forward
*(xin + tuple([p for p in self._fn.parameters() if p.requires_grad])))
(print_stack at ..\torch\csrc\autograd\python_anomaly_mode.cpp:60)
Traceback (most recent call last):
File ".\main.py", line 14, in <module>
model.fit()
File "C:\Users\UserName\Documents\Project\pytorch\main.py", line 131, in fit
err.backward()
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (operator () at C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/CUDAScalar.cu:19)
(no backtrace available)
When using Chromium (but not Brave, Chrome, Firefox, or even games on ultra settings), however, it won't ever run the first few batches and instead goes straight to this:
Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
File ".\main.py", line 14, in <module>
model.fit()
File "C:\Users\UserName\Documents\Project\pytorch\main.py", line 129, in fit
out = self.model(src)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 100, in forward
input = module(input)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\UserName\Documents\Project\pytorch\model.py", line 121, in forward
sort = self.sort_conv(inp)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 349, in forward
return self._conv_forward(input, self.weight)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 346, in _conv_forward
self.padding, self.dilation, self.groups)
(print_stack at ..\torch\csrc\autograd\python_anomaly_mode.cpp:60)
Traceback (most recent call last):
File ".\main.py", line 14, in <module>
model.fit()
File "C:\Users\UserName\Documents\Project\pytorch\main.py", line 131, in fit
err.backward()
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: unspecified launch failure (operator () at C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/CUDAScalar.cu:19)
(no backtrace available)
Curiously, it then fixed itself after waiting for six hours, processing roughly 20,000 batches in one hour, only to crash immediately after finishing them.
I've compiled the binaries with debug info.
https://5833189-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp36-cp36m-win_amd64.whl
https://5833191-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp37-cp37m-win_amd64.whl
https://5833196-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp38-cp38-win_amd64.whl
You can install them and then get some more info using cuda-memcheck.
:: PythonRoot in the line below refers to the directory of your Python installation
:: e.g. C:\Python37
set _NT_ALT_SYMBOL_PATH=[PythonRoot]\Lib\site-packages\torch\lib
cuda-memcheck python your-script.py
With cuda-memcheck python bug.py, I get an OOM. Epoch 0 never ends, but memory keeps growing.
So I ran plain python bug.py. I see a 2x speedup: 10 seconds per epoch vs 20 seconds in PyTorch 1.5. But:
epoch: 254
Traceback (most recent call last):
File "bug.py", line 124, in <module>
main()
File "bug.py", line 104, in main
loss.backward()
File "C:\Anaconda3\envs\torch16\lib\site-packages\torch\tensor.py", line 184, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Anaconda3\envs\torch16\lib\site-packages\torch\autograd\__init__.py", line 125, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Exception raised from _cudnn_rnn_backward_input at ..\aten\src\ATen\native\cudnn\RNN.cpp:923 (most recent call first):
00007FFD84CE087200007FFD84C85FAB c10.dll!caffe2::TypeMeta::_typeMetaDataInstance<unsigned char> [<unknown file> @ <unknown line number>]
00007FFC91A959B600007FFC8C1EC160 torch_cuda.dll!THCudaShortTensor_set4d [<unknown file> @ <unknown line number>]
00007FFC91AC19F000007FFC8C1EC160 torch_cuda.dll!THCudaShortTensor_set4d [<unknown file> @ <unknown line number>]
00007FFC91ABFAAF00007FFC8C1EC160 torch_cuda.dll!THCudaShortTensor_set4d [<unknown file> @ <unknown line number>]
00007FFC91B7C43800007FFC8C1EC160 torch_cuda.dll!THCudaShortTensor_set4d [<unknown file> @ <unknown line number>]
00007FFC91B8BD3D00007FFC8C1EC160 torch_cuda.dll!THCudaShortTensor_set4d [<unknown file> @ <unknown line number>]
00007FFD3CAABF3A00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CAE625E00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CAABD2200007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CB347E500007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3FAC814D00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3FAD573D00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CAABF3A00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CAE625E00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CAABD2200007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3CB347E500007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3F8D35D100007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD3F89F55900007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD40055CA900007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD4005764A00007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD4005EE1900007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD4005E94200007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD60C21C3B00007FFD608D6622 torch_python.dll!THPVariable_Wrap [<unknown file> @ <unknown line number>]
00007FFD40042F3400007FFD345D6D2D torch_cpu.dll!caffe2::ExternalDataProto::MaybeArenaPtr [<unknown file> @ <unknown line number>]
00007FFD8E62D9F200007FFD8E62D980 ucrtbase.dll!o_strncat_s [<unknown file> @ <unknown line number>]
00007FFD8F817BD400007FFD8F817BC0 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFD916ECEE100007FFD916ECEC0 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]
I get a lot of random errors, like this.
In my case, I fixed this bug by doing 2 things:
1. I changed the riser for my card (1060 6GB connected to a PCI-E 8x slot via a riser).
2. I changed the PCI-E slot to 16x.
All the errors are gone. I think my problem was specific and not fully connected to this topic, but these actions may help someone.
When using an LSTM and CUDA, I saw the same error when I used my entire dataset.
But your assistance helped me: I used torch.autograd.set_detect_anomaly(True), and with that statement set I could use the full dataset. Thank you.
We are also training LSTMs and experiencing similar issues. We have tried various environment configurations:
(Note: Windows had the latest updates applied.)
1) Windows 10, Nvidia driver CUDA 10.1, Pytorch 1.5 CUDA 10.1
2) Windows 10, Nvidia driver CUDA 10.2, Pytorch 1.5 CUDA 10.2
3) Windows 10, Nvidia driver CUDA 11, Pytorch 1.5 CUDA 10.2
and all of the above with Pytorch 1.5.1.
4) Ubuntu Linux 20.04, Nvidia driver CUDA 10.1, Pytorch 1.5 CUDA 10.1
5) Ubuntu Linux 20.04, Nvidia driver CUDA 10.2, Pytorch 1.5 CUDA 10.2
Training always fails at a random epoch with an unspecified launch failure, or unknown error.
Encountered the same problem (crashes at random epochs during training) yesterday with LSTM networks on an NVIDIA GTX 1070 and Windows 10.
Solved the problem by updating the drivers to 451.48. Unfortunately I don't know which drivers I had when I was getting the crashes.
I have a similar problem, but I think I have had it since I updated to 451.48.
EDIT: I tried both the April (445.87) and May (446.14) drivers, but ended up with the same results. I don't think the drivers are the problem for me.
Not sure who to contact, but I'm fairly confident I can recreate this issue.
Essentially, if I choose an output method where token=1 or 2, it raises this error; however, this does not occur with the concat method (token=3).
Optimizer: AdamW
Criterion: CrossEntropyLoss()
Finally, the error seems to be raised within my evaluate function at this line:
total_acc += pred.eq(label.view_as(pred)).sum().item()
Please let me know if you need anything else.
RNN architecture Code:
```python
class L_Rec_RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, hu, n_layers=1, token=3, use_cuda=False):
        super(L_Rec_RNN, self).__init__()
        self.name = "Language RNN"
        self.hu_1 = hu
        self.classes = num_classes
        self.hidden_size = hidden_size
        self.layers = n_layers
        self.token = token
        self.use_cuda = use_cuda
        self.dr = 0.3
        # condition to multiply the input to the first FC depending on the token method that is used to feed it
        if token < 3:
            mult = 1
        else:
            mult = 2
        # [batch_size, 201, 552]
        self.rnn = nn.GRU(input_size, hidden_size, self.layers, batch_first=True)
        self.lin = nn.Sequential(
            nn.Linear(mult * self.hidden_size, self.hu_1),
            nn.ReLU(),
            nn.Dropout(self.dr),
            nn.Linear(self.hu_1, self.classes),
        )

    def forward(self, x):
        # pre prep
        x = x.squeeze(1)  # for 1d convolutions
        hs = torch.zeros(self.layers, x.size(0), self.hidden_size)
        if self.use_cuda and torch.cuda.is_available():
            hs = hs.cuda()
        # calling the RNN
        x, _ = self.rnn(x, hs)
        # various methods for choosing the output hidden layer
        # Method 1
        if self.token == 1:
            x = x[:, -1, :]
        # Method 2
        elif self.token == 2:
            x = torch.max(x, dim=1)[0]
        # Method 3
        else:
            x = torch.cat([torch.max(x, dim=1)[0], torch.mean(x, dim=1)], dim=1)
        # calling the FC layers
        x = self.lin(x)
        return x
```
Evaluate Function Code:
```python
def evaluate(model, data, criterion=nn.CrossEntropyLoss(), get_loss=False, use_cuda=False):
    total_loss = 0.0
    total_acc = 0.0
    total_epoch = 0
    counter = 0
    for spec, label in data:
        if use_cuda and torch.cuda.is_available():
            spec = spec.cuda()
            label = label.cuda()
        out = model(spec)
        loss = criterion(out, label)
        pred = out.max(1, keepdim=True)[1]
        total_acc += pred.eq(label.view_as(pred)).sum().item()
        total_loss += loss.item()
        total_epoch += len(label)
        counter += 1
    acc = float(total_acc) / total_epoch
    loss = float(total_loss) / counter
    if get_loss:
        return acc, loss
    else:
        return acc
```
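One note on the line total_acc += pred.eq(label.view_as(pred)).sum().item(): CUDA kernels are launched asynchronously, so .item() (which copies to the CPU and synchronizes) is often just where an earlier kernel failure gets reported, not necessarily where it originates. A small illustrative sketch of that behaviour:

```python
# Illustrative only: asynchronous CUDA errors typically surface at the next
# synchronization point (.item(), .cpu(), torch.cuda.synchronize()), which can
# be far away from the op that actually failed.
import torch

if torch.cuda.is_available():
    pred = torch.randn(32, 10, device="cuda").argmax(dim=1, keepdim=True)
    label = torch.randint(0, 10, (32,), device="cuda")
    correct = pred.eq(label.view_as(pred)).sum()  # still a GPU tensor; launches are async
    total = correct.item()  # synchronizes here; pending kernel errors surface at calls like this
    print(total)
```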
I don't know if this helps, but I'm getting that error when switching from samples that are 200^3 to samples that are 256^3, even though I have plenty of memory to store them. The program errors out when calculating one of my personal metrics.
I'm using pytorch 1.5 with CUDA 9.2.
I have been struggling with this problem for quite some time now, but I am unable to resolve it. However, there are some things I noticed about the problem.
1> It happens only when running on the GPU. There is no problem with the same code when all the operations are done on the CPU.
2> The problem is machine specific. It runs perfectly on some other machines (even on GPU) but crashes on mine. This shows that the code is fine and that the problem lies with the machine, the software, or some mismatch between them.
3> Most people seem to mention that the problem happens when doing binary classification with an LSTM. This is my case as well.
By the way, I reinstalled my Windows, Anaconda, Pytorch, GPU drivers, everything but the problem remains. However, the same code works perfectly on my collaborator's machine. I am using
Python 3.7
Conda 4.8.3
Pytorch 1.5.1
Torchvision 0.6.1
CUDA 10.2.89
NVIDIA 451.67
This is a serious problem, and I ask the Pytorch developers to look into it seriously. I am attaching my code for others to take a look at.
Same issue. Using Pytorch 1.5.1 and GTX 1050 ti
The same error occurs in neural machine translation (NMT) with OpenNMT-py, using pytorch 1.5.1 and cuda 10.2.
I think this is an error with LSTM cells.
Now I am working with transformers, and I will report the results.
Same issue. Using Pytorch 1.2.0 and GTX 2060s cuda 10.0
Same issue with pytorch 1.3.1 on Quadro RTX 8000, and similarly to others training a model with an LSTM layer. Also, trying with different seeds would sometimes crash with "cuDNN error: CUDNN_STATUS_INTERNAL_ERROR".
I tried to fix with set_detect_anomaly(True) and the process just got stuck on the same epoch where it previously crashed and seemed to be stuck in some backend loop - cuda showed 100% utilisation with occasional dips to ~55%.
In the end it seems like I finally managed to get around the issue completely by disabling cudnn with "torch.backends.cudnn.enabled = False", but I'm guessing this might lead to sub-optimal performance and potentially other issues?
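For reference, a minimal sketch of that workaround (the tiny LSTM below is just a stand-in, not anyone's actual model); with the flag off, RNN ops fall back to the non-cuDNN kernels, which is usually slower:

```python
# Sketch of the torch.backends.cudnn.enabled = False workaround mentioned above.
# The flag must be set before the forward/backward passes so the non-cuDNN RNN
# kernels are used instead of the cuDNN ones.
import torch
import torch.nn as nn

torch.backends.cudnn.enabled = False  # disables cuDNN globally (slower, avoids the cuDNN RNN path)

device = "cuda" if torch.cuda.is_available() else "cpu"
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True).to(device)
x = torch.randn(8, 20, 32, device=device)
out, _ = lstm(x)
out.mean().backward()
```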
@ngimel,
judging from the comments by JoshuaSv2 and hendrycks, it looks like this is not a Windows-specific issue.
https://github.com/pytorch/pytorch/issues/39872 is a similar issue.
Following up on my earlier comment: this error does not occur with the transformer model. It's just an LSTM-related error.
I have the same problem with pytorch 1.6.0, GTX 1060, on Windows 10.
Traceback (most recent call last):
File "train.py", line 126, in <module>
print(' Loss = %s' % epoch_train(
File "C:\Users\mehrd\Jupyter\SQLNet-master\sqlnet\utils.py", line 148, in epoch_train
score = model.forward(q_seq, col_seq, col_num, pred_entry,
File "C:\Users\mehrd\Jupyter\SQLNet-master\sqlnet\model\seq2sql.py", line 123, in forward
x_emb_var, x_len = self.embed_layer.gen_x_batch(q, col)
File "C:\Users\mehrd\Jupyter\SQLNet-master\sqlnet\model\modules\word_embedding.py", line 76, in gen_x_batch
val_inp = val_inp.cuda()
RuntimeError: CUDA error: unspecified launch failure
I faced the same problem with an LSTM model, but after setting the environment variable CUDA_LAUNCH_BLOCKING=1 before running the script I no longer get the error. This was suggested in this post for debugging purposes only, but like the answer there my code runs fine as well.
Any idea why this is the case?
Working with pytorch 1.5.0 and CUDA 10.1 on Windows 10.
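A minimal sketch of that workaround for anyone who wants to try it; the variable has to be in the environment before the first CUDA call (setting it at the very top of the script or in the shell both work), and it serializes kernel launches, so expect a slowdown:

```python
# Sketch of the CUDA_LAUNCH_BLOCKING workaround/debug aid: kernel launches become
# synchronous, so errors are raised at the real call site instead of at a later
# synchronization point. Must be set before CUDA is initialized.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
y = x @ x  # with blocking launches, a failing kernel would raise here, not later
print(y.sum().item())
```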
Hello, I recently faced and solved this issue on my Windows machine.
In my case, this issue was invoked by Windows Timeout Detection and Recovery (TDR), which shuts down CUDA kernels that fail to respond in time.
The fix is as follows:
This should do the trick on Windows 10. Hope this helps.
PS. setting CUDA_LAUNCH_BLOCKING=1 also solves the issue but comes at a heavy performance penalty.
@YinPing-Cho For me, this made it better, but it still did not completely solve the issue. I eventually set it to 10000 seconds.
Just FYI, I was also failing repeatedly with RuntimeError: CUDA error: unspecified launch failure when training a sound event detection model on a P100 on Google Colab. In my case the solution was simply changing num_workers in the dataloader from 2 to 0. I don't know how this error is related to num_workers, and it was very hard to debug.
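A minimal sketch of that change (the TensorDataset below is a random stand-in, not the sound event data):

```python
# Sketch of the num_workers workaround described above: with num_workers=0 the
# data loading happens in the main process instead of worker subprocesses.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 64), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)  # was num_workers=2

for features, labels in loader:
    pass  # training step goes here
```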
This also helped me, but did not completely resolve the issue. Combined with @YinPing-Cho's fix, I was able to completely train an AWD-LSTM model on the second try a while ago. In my experience though, these just made the issue rarer.
This actually reduced the error in my environment: Windows 10, CUDA 10.0, pytorch 1.4. I don't know why, but one of the reasons might be that setting num_workers to 0 initializes some internal setting every iteration. Some warnings (in my case, a deprecation warning from nn.Softmax) show up every iteration when I make num_workers 0.
By the way, before this change, RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED, in addition to RuntimeError: CUDA error: unspecified launch failure, stopped my program randomly after some iterations.
After 30000 iterations, the error finally occurred...
This reduces the frequency of the error but is not the solution for me.
Exception has occurred: RuntimeError
cuDNN error: CUDNN_STATUS_MAPPING_ERROR (getCudnnHandle at ..aten\src\ATen\cudnn\Handle.cpp:45)
(no backtrace available)
cudnn = 7.6.5
cudatoolkit = 10.2.89
pytorch = 1.5.1
Exception has occurred: RuntimeError
CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) (gemm at ..aten\src\ATen\cuda\CUDABlas.cpp:165)
(no backtrace available)
Same error on windows, training an LSTM on a GTX 2080 TI. Happens with both Pytorch 1.5 and 1.6.
Very annoying as it seems random, and the training is completely broken when it happens.
Switching from 1.6 to 1.5 and downgrading my Nvidia driver to 431.86 fixed the error for me.
Same error while training an LSTM with a big batch size on Windows; I was getting random crashes after 1 to 20 epochs. Setting torch.backends.cudnn.enabled = False fixed the issue.
Pytorch 1.5.1
Cuda 10.2.89
CuDNN 7.6.5
GTX 1070 - MSI Gaming X - Driver 445.75
Windows 10 Pro 1909 build 18363.1016
@lucas-emery, did you try extending the TDR delay or disabling TDR as described in https://developer.download.nvidia.com/NsightVisualStudio/2.2/Documentation/UserGuide/HTML/Content/Timeout_Detection_Recovery.htm
@mszhanyi I did try extending the TDR to 60 seconds. I was able to run a 13-hour training session after setting the TDR and restarting my PC, but the backprop time was also faster (down from 1 min to 10-15 secs); I guess it was just a coincidence and cudnn chose a different algorithm.
After that I stopped the training to update a function and when I tried to resume I couldn't get past 20 epochs without a crash, sometimes "illegal memory" and sometimes "launch failure" the backprop time went back up to 1 minute. I reverted my changes and tried to train a new model from scratch but it crashed between 1 and 20 epochs with the same errors. After setting torch.backends.cudnn.enabled = False with no code changes and no reboot it stopped crashing and backprop time went down to 20 secs. That training session lasted 12 hours with no errors.
I did two more 4 hour sessions without problems today.
@lucas-emery, could you provide a simplified script so that I could reproduce it?
@mszhanyi I'm afraid it won't be possible; it's a very complex model on a reinforcement learning task. I'll let you know if I find anything else. I'll try to get something reproducible after I finish.
The error started appearing after I increased my batch size to 1k with an unroll length of 32.
I'm getting this issue on my RTX 3080, and I can't even downgrade PyTorch because older versions don't support RTX 3000.
These two fixes worked for me, but both have a performance penalty:
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
torch.backends.cudnn.enabled = False
Same issue on 3090 + Windows 10 + CUDA 11 + PyTorch (Stable & Nightly).
These fixes worked for me, too.
$env:CUDA_LAUNCH_BLOCKING=1 increases the training time by 500%.
torch.backends.cudnn.enabled = False increases the training time by 20%.
We are facing the same issue. Tried on Ubuntu 18.04, Nvidia K80, M60, V100, all with the same pytorch version 1.6.0, cuda 11.
Applying the fix below doesn't help either... :(
```
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
torch.backends.cudnn.enabled = False
```