I am training a variant of U-Net with joint classification and semantic segmentation using apex amp at opt-level O1. Training crashes after I explicitly cast box_coord_tensor in the roi_pool call.
rois = roi_pool(
    input=classification_feature_map_tensor,  # FLOAT16
    boxes=box_coord_tensor.half(),  # FLOAT32 IF NOT CAST EXPLICITLY
    output_size=roi_size,
    spatial_scale=1,
)
The issue is that classification_feature_map_tensor arrives as float16 because it is handled by amp, while box_coord_tensor comes from the input batch and is float32. However, roi_pool requires both tensors to have the same dtype and otherwise throws:
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)
But if I cast box_coord_tensor to float16, CUDA throws the illegal memory access error below.
File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
self.unscale_python(model_grads, master_grads, scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
self.dynamic)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
Is there anything else I could try? So far every attempt has resulted in the error above.
When in doubt, always prefer casting to FP32. In this case (I think) you're calling into a custom torchvision op that may not have an FP16 implementation. Cast both inputs to FP32 instead of FP16 and it should work.
I cast everything to float32:
rois = roi_pool(
    input=classification_feature_map_tensor.float(),
    boxes=box_coord_tensor.float(),
    output_size=self.roi_size,
    spatial_scale=1,
)
The roi_pool call now passes, but an exception is thrown in apex here:
with amp.scale_loss(loss, self.optimizer) as scaled_loss:
    scaled_loss.backward()  # exception is thrown
inside the training loop below:
for epoch in range(1, self.num_epochs + 1):
    logger.info(f"running epoch {epoch}")
    avg_train_loss = 0
    self.model.train()
    for step, sample_batch in enumerate(self.train_data, start=1):
        sample_batch = self._sample_to_device(sample_batch)
        self.optimizer.zero_grad()
        doc_id_batch = sample_batch[DOC_ID]
        logits_dict = self.model(sample_batch)
        loss = self.criterion(logits_dict, sample_batch)
        with amp.scale_loss(loss, self.optimizer) as scaled_loss:
            scaled_loss.backward()  # exception is thrown
        self.optimizer.step()
        avg_train_loss += loss.item()
    epoch_end_time = timeit.default_timer()
    epoch_time = epoch_end_time - epoch_start_time
Below are some training logs with O2 just before the crash. Note that epoch 1 already completed with a nan loss.
2019-11-04 10:35:43,186 - INFO - __main__ - starting training
2019-11-04 10:35:43,186 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 10:35:43,190 - INFO - net.train.trainer - running epoch 1
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
2019-11-04 10:35:53,378 - INFO - net.train.trainer - epoch 1; average train loss nan; processed 10 batches in 10.19 seconds, 1.02 sec per batch on average
2019-11-04 10:35:53,379 - INFO - net.train.trainer - epoch 1; starting validation
2019-11-04 10:35:56,085 - INFO - net.train.trainer - epoch 1: validation loss nan
2019-11-04 10:35:56,085 - INFO - net.train.trainer - epoch 1: validation loss did not decrease, patience left 9
2019-11-04 10:35:56,085 - INFO - net.train.trainer - running epoch 2
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16.0
(...)
File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
models_are_masters=False)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
self.unscale_python(model_grads, master_grads, scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
self.dynamic)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
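Because the average train loss in the log above is already nan at the end of epoch 1, one way to narrow this down is to stop at the first non-finite loss instead of letting the scaler keep skipping steps. The following is only a minimal sketch: `check_finite` is a hypothetical helper, not part of apex or PyTorch, and the plain floats stand in for the `loss.item()` values from the training loop.

```python
import math


def check_finite(name: str, value: float, step: int) -> float:
    """Raise as soon as a scalar (e.g. loss.item()) is nan or inf."""
    if not math.isfinite(value):
        raise FloatingPointError(f"{name} is {value} at step {step}")
    return value


# Illustrative placeholder values standing in for loss.item() per step.
losses = [3.71, 2.41, float("nan")]

first_bad = None
try:
    for step, value in enumerate(losses, start=1):
        check_finite("loss", value, step)
except FloatingPointError as err:
    first_bad = str(err)

print(first_bad)  # loss is nan at step 3
```

Called right after `loss = self.criterion(...)`, a check like this would show which batch first produces a non-finite loss, before the backward pass runs.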
Now with O3 we get a little further, and the crash occurs while accumulating the loss.
Selected optimization level O3: Pure FP16 training.
Defaults for this optimization level are:
enabled : True
opt_level : O3
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : False
master_weights : False
loss_scale : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O3
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : False
master_weights : False
loss_scale : 1.0
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-04 13:19:25,347 - INFO - __main__ - starting training
2019-11-04 13:19:25,347 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 13:19:25,351 - INFO - net.train.trainer - running epoch 1
2019-11-04 13:19:35,604 - INFO - net.train.trainer - epoch 1; average train loss 3.7108697175979612; processed 10 batches in 10.25 seconds, 1.03 sec per batch on average
2019-11-04 13:19:35,605 - INFO - net.train.trainer - epoch 1; starting validation
2019-11-04 13:19:38,362 - INFO - net.train.trainer - epoch 1: validation loss 3.0665794213612876
2019-11-04 13:19:38,362 - INFO - net.train.trainer - epoch 1: better model found, new best validation loss: 3.0665794213612876
2019-11-04 13:19:38,367 - INFO - net.train.trainer - running epoch 2
2019-11-04 13:19:48,451 - INFO - net.train.trainer - epoch 2; average train loss 2.4132291316986083; processed 10 batches in 10.08 seconds, 1.01 sec per batch on average
2019-11-04 13:19:48,451 - INFO - net.train.trainer - epoch 2; starting validation
2019-11-04 13:19:51,411 - INFO - net.train.trainer - epoch 2: validation loss 2.798730452855428
2019-11-04 13:19:51,411 - INFO - net.train.trainer - epoch 2: better model found, new best validation loss: 2.798730452855428
2019-11-04 13:19:51,416 - INFO - net.train.trainer - running epoch 3
...
File "/home/user/net/train/trainer.py", line 138, in train
avg_train_loss += loss.item()
RuntimeError: CUDA error: an illegal memory access was encountered
Running the training with CUDA_LAUNCH_BLOCKING=1 gives us:
trained_model_state, optimizer_state, metrics = trainer.train()
File "/home/user/net/train/trainer.py", line 131, in train
scaled_loss.backward()
File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
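CUDA kernels are launched asynchronously, so without CUDA_LAUNCH_BLOCKING=1 the Python traceback can point at a later synchronizing call (such as `loss.item()` in the O3 run) rather than at the kernel that actually failed. If setting the variable on the command line is inconvenient, it can also be set from Python. This is a sketch that assumes the assignment runs before CUDA is initialized; putting it before `import torch` is the safe choice. `launch_blocking_enabled` is a hypothetical helper used only to confirm the setting.

```python
import os

# Assumption: this executes before the first CUDA call in the process,
# ideally before `import torch`, otherwise it may have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Return True if synchronous kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print(launch_blocking_enabled())  # True
```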
Could it be related to the cuBLAS documentation excerpt below? And does this mean we are running out of memory? nvidia-smi shows only 50% GPU usage.
2.1.10. GEMM Algorithms Numerical Behavior
Some GEMM algorithms split the computation along the dimension K to increase the GPU occupancy, especially when the dimension K is large compared to dimensions M and N. When this type of algorithm is chosen by the cuBLAS heuristics or explicitly by the user, the results of each split is summed deterministically into the resulting matrix to get the final result.
For the routines cublas<t>gemmEx and cublasGemmEx, when the compute type is greater than the output type, the sum of the split chunks can potentially lead to some intermediate overflows thus producing a final resulting matrix with some overflows. Those overflows might not have occurred if all the dot products had been accumulated in the compute type before being converted at the end in the output type.
This computation side-effect can be easily exposed when the computeType is CUDA_R_32F and Atype, Btype and Ctype are in CUDA_R_16F.
I don't think it's running out of memory. With O1, for the backward pass (https://github.com/NVIDIA/apex/issues/580#issuecomment-549171867) does it error on the very first backward pass? And what is the exception trace that is thrown?
Correct, with O1 it fails on the first backward pass. With O2 it finishes two epochs and with O3 finishes three epochs. With O0 it does not crash.
Below is the run with O1 opt-level.
CUDA_LAUNCH_BLOCKING=1 python train.py --config-file config/config.gin --log-level INFO
2019-11-04 19:29:08,258 - INFO - __main__ - setting random seed to 42
2019-11-04 19:29:08,258 - INFO - __main__ - setting up train data
2019-11-04 19:29:08,264 - INFO - __main__ - split data with valid fraction 0.2 --> # train data: 40, # valid data: 10
2019-11-04 19:29:08,268 - INFO - net.utils.class_weights - calculating class weights with c=1.04 for box weights and c=1.04 for segmentation weights
2019-11-04 19:29:16,816 - INFO - net.utils.class_weights - calculated box class weights: tensor([ 1.5608, 21.2831, 22.9914, 16.3494, 23.2191, 21.6754, 25.2760, 25.3858,
23.1732, 25.0054, 19.9499, 10.7810, 19.6184, 20.9051])
2019-11-04 19:29:16,817 - INFO - net.utils.class_weights - calculated segmentation class weights: tensor([0.0821, 0.1714, 0.1662, 0.1396, 0.1677, 0.1864, 0.1912, 0.2489, 0.1080])
2019-11-04 19:29:16,832 - INFO - __main__ - setting up loss function
2019-11-04 19:29:16,832 - INFO - __main__ - combining loss by sum with box loss weight 1.0 and segmentation loss weight 1.0
2019-11-04 19:29:16,832 - INFO - __main__ - setting up model
2019-11-04 19:29:16,891 - INFO - __main__ - setting up trainer instance
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-04 19:29:22,263 - INFO - __main__ - starting training
2019-11-04 19:29:22,263 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 19:29:22,263 - INFO - net.train.trainer - running epoch 1
...
File "train.py", line 267, in train
trained_model_state, optimizer_state, metrics = trainer.train()
File "/home/user/net/train/trainer.py", line 132, in train
scaled_loss.backward()
File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
According to these docs, CUBLAS_STATUS_EXECUTION_FAILED means "the function failed to launch on the GPU". I wonder what the possible reasons could be, since the function launches on the GPU several times before it crashes.
Batch size does not change the behavior. I also tried nightly PyTorch builds, with the same results, and different machines (GTX 1070 and GTX 1080 Ti), with the same error. The apex imagenet example network runs without errors, though, so it is something specific to our model.
@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
If so, could it be related to this issue?
If you are using CUDA10.0, could you update to 10.1, please, and check, if it's working?
I get a similar error in the forward pass. After some batches, it fails with one of the following errors.
Sometimes it is error 1, and sometimes it is error 2 or error 3.
Sometimes the error is thrown after processing the 1st batch and sometimes at the 2nd, 9th, 13th, 17th, or 21st batch.
Error 1
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Error 2
RuntimeError: CUDA error: device-side assert triggered
Error 3
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258
Maybe this issue discussion can bring more perspective to it.
I managed to train the model without crashing (at least up to the 10th epoch) with batch_size=1 and opt-level O2. Every other combination leads to an exception.
batch_size=1, opt-level=O1 --> crashes after a couple of epochs
batch_size=1, opt-level=O2 --> works fine
batch_size=1, opt-level=O3 --> crashes after a couple of epochs
batch_size=2, opt-level=O1 --> crashes after a couple of epochs
batch_size=2, opt-level=O2 --> crashes after a couple of epochs
batch_size=2, opt-level=O3 --> crashes after a couple of epochs
Unfortunately, even though I am able to train with O2, the loss is still nan right after the first epoch :(
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-09 19:31:13,427 - INFO - __main__ - starting training
2019-11-09 19:31:13,427 - INFO - unet.train.trainer - starting training of model, going to train 100 epochs
2019-11-09 19:31:13,429 - INFO - unet.train.trainer - running epoch 1
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
2019-11-09 19:31:23,699 - INFO - unet.train.trainer - epoch 1; average train loss nan; processed 40 batches in 10.27 seconds, 0.26 sec per batch on average
2019-11-09 19:31:23,699 - INFO - unet.train.trainer - epoch 1; starting validation
2019-11-09 19:31:26,067 - INFO - unet.train.trainer - epoch 1: validation loss nan
2019-11-09 19:31:26,068 - INFO - unet.train.trainer - epoch 1: validation loss did not decrease, patience left 9
2019-11-09 19:31:26,068 - INFO - unet.train.trainer - running epoch 2
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.725290298461914e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.313225746154785e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9103830456733704e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8189894035458565e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7763568394002505e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.440892098500626e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1102230246251565e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3877787807814457e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.469446951953614e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.673617379884035e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
2019-11-09 19:31:36,441 - INFO - unet.train.trainer - epoch 2; average train loss nan; processed 40 batches in 10.37 seconds, 0.26 sec per batch on average
2019-11-09 19:31:36,442 - INFO - unet.train.trainer - epoch 2; starting validation
2019-11-09 19:31:38,790 - INFO - unet.train.trainer - epoch 2: validation loss nan
2019-11-09 19:31:38,791 - INFO - unet.train.trainer - epoch 2: validation loss did not decrease, patience left 8
2019-11-09 19:31:38,791 - INFO - unet.train.trainer - running epoch 3
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.710505431213761e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.470329472543003e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0587911840678754e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6469779601696886e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.617444900424222e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6543612251060553e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1359030627651384e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0339757656912846e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2924697071141057e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2311742677852644e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.077935669463161e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0194839173657902e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.524354896707238e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1554436208840472e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.944304526105059e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.860761315262648e-32
2019-11-09 19:31:49,216 - INFO - unet.train.trainer - epoch 3; average train loss nan; processed 40 batches in 10.43 seconds, 0.26 sec per batch on average
2019-11-09 19:31:49,217 - INFO - unet.train.trainer - epoch 3; starting validation
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - epoch 3: validation loss nan
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - epoch 3: validation loss did not decrease, patience left 7
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - running epoch 4
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.465190328815662e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.162975822039155e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.62964972193618e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.407412430484045e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.018531076210112e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.52316384526264e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.4039548065783e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.350988701644575e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.877471754111438e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.346839692639297e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8367099231598242e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.591774807899561e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.739718509874451e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4349296274686127e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.793662034335766e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
2019-11-09 19:32:02,018 - INFO - unet.train.trainer - epoch 4; average train loss nan; processed 40 batches in 10.42 seconds, 0.26 sec per batch on average
2019-11-09 19:32:02,018 - INFO - unet.train.trainer - epoch 4; starting validation
2019-11-09 19:32:04,435 - INFO - unet.train.trainer - epoch 4: validation loss nan
2019-11-09 19:32:04,436 - INFO - unet.train.trainer - epoch 4: validation loss did not decrease, patience left 6
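The log above shows the loss scale being halved on every skipped step until it is far below 1, which usually means the gradients are already NaN/Inf rather than merely outside the fp16 range. A minimal pure-Python sketch of that halving logic follows; it is not apex's implementation, and the class name, `min_scale` floor, and `looks_diverged` helper are hypothetical names used only for illustration.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, grow after a clean streak."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0,
                 growth_interval=2000, min_scale=None):
        self.scale = init_scale
        self.backoff = backoff
        self.growth = growth
        self.growth_interval = growth_interval
        self.min_scale = min_scale  # hypothetical floor; None means no floor
        self._clean_steps = 0

    def update(self, grads_are_finite):
        """Adjust the scale; return True if the optimizer step should be skipped."""
        if not grads_are_finite:
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff
            if self.min_scale is not None:
                self.scale = max(self.scale, self.min_scale)
            self._clean_steps = 0
            return True
        # Clean step: grow the scale again after a long enough streak.
        self._clean_steps += 1
        if self._clean_steps % self.growth_interval == 0:
            self.scale *= self.growth
        return False


def looks_diverged(scaler, threshold=1.0):
    """Heuristic: a scale below `threshold` suggests NaN/Inf gradients."""
    return scaler.scale < threshold
```

With no floor, twenty consecutive overflows take the default scale from 65536 down to 0.0625, at which point `looks_diverged` returns True and it is more useful to stop and inspect the model than to keep skipping steps.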
I have cuda 10.1.243-2, torchvision 0.4.2-3 and pytorch 1.3.0 installed.
@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
If so, could it be related to this issue?
If you are using CUDA10.0, could you update to 10.1, please, and check, if it's working?
I cannot reproduce the bug, the code below works fine on my machine.
torch.zeros((16*2**20 - 512)//2 + 1, 1, dtype=torch.float16, device='cuda:0') @ torch.zeros(1, 2, dtype=torch.float16, device='cuda:0')
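For reference, the shape in that one-liner can be worked out by hand. The short calculation below only does the arithmetic; why this particular size matters is not stated in the thread, so the comparison against 16 MiB minus 512 bytes is an observation, not an explanation.

```python
FP16_BYTES = 2  # size of one float16 element

rows = (16 * 2**20 - 512) // 2 + 1   # first dimension of the left operand
lhs_bytes = rows * 1 * FP16_BYTES    # the left operand has shape (rows, 1)
boundary = 16 * 2**20 - 512          # the constant that appears in the expression

print(rows)                   # 8388353
print(lhs_bytes)              # 16776706
print(lhs_bytes - boundary)   # 2
```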
@tastyminerals @someAdjectiveNoun
Could you try to post a (small) code snippet to reproduce this issue?
The problem is solved now. It was caused by the BioBERT model I was using; the BERT implementation in PyTorch runs smoothly, so the problem seems to come from BioBERT.
Unfortunately, you would need our custom dataset, which we cannot share. We are using a unet model.
I pulled recent apex master and reran the experiments. Now the previously working configuration (batch_size=1, opt-level=O2) has stopped working and crashes right after the first epoch.
However, now there are some useful debug messages.
With O1:
Traceback (most recent call last):
File "train.py", line 339, in main
train()
File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1073, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/pavel/.local/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1050, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "train.py", line 269, in train
trained_model_state, optimizer_state, metrics = trainer.train()
File "/home/pavel/dev/gini/multi-modal-ner/mumo/train/trainer.py", line 127, in train
scaled_loss.backward()
File "/home/pavel/miniconda3/envs/gini_torch/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/home/pavel/.local/lib/python3.7/site-packages/apex/amp/handle.py", line 127, in scale_loss
should_skip = False if delay_overflow_check else loss_scaler.update_scale()
File "/home/pavel/.local/lib/python3.7/site-packages/apex/amp/scaler.py", line 200, in update_scale
self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered
With O3:
File "train.py", line 343, in <module>
main()
File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "train.py", line 339, in main
train()
File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1073, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/pavel/.local/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1050, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "train.py", line 269, in train
trained_model_state, optimizer_state, metrics = trainer.train()
File "/home/pavel/dev/gini/multi-modal-ner/mumo/train/trainer.py", line 131, in train
avg_train_loss += loss.item()
RuntimeError: CUDA error: an illegal memory access was encountered
In call to configurable 'train' (<function train at 0x7f0a8829b840>)
With CUDA_LAUNCH_BLOCKING=1 prepended to the command:
File "train.py", line 339, in main
train()
File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1073, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/pavel/.local/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1050, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "train.py", line 269, in train
trained_model_state, optimizer_state, metrics = trainer.train()
File "/home/pavel/dev/gini/multi-modal-ner/mumo/train/trainer.py", line 127, in train
scaled_loss.backward()
File "/home/pavel/.local/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/pavel/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
In call to configurable 'train' (<function train at 0x7fd08eb2e840>)
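CUDA kernels are launched asynchronously, so the earlier tracebacks point at the next synchronization point (`loss.item()`, `_overflow_buf.item()`) rather than at the kernel that actually failed. Setting `CUDA_LAUNCH_BLOCKING=1` makes the error surface at the failing call, as in the last traceback. A minimal sketch of setting it from inside the script instead of on the command line follows; the helper name is hypothetical.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    This has to run before the first CUDA call in the process; the safest
    place is before importing torch.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_blocking_cuda_launches()
# import torch  # import (and touch the GPU) only after the variable is set
```

This slows training down noticeably, so it is meant for debugging runs only.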
Here is the chunk of training code.
for epoch in range(1, self.num_epochs + 1):
    logger.info(f"running epoch {epoch}")
    avg_train_loss = 0
    epoch_start_time = timeit.default_timer()
    # set model to training mode, validation switches some things like dropout off
    self.model.train()
    for step, sample_batch in enumerate(self.train_data, start=1):
        sample_batch = self._sample_to_device(sample_batch)
        self.optimizer.zero_grad()
        doc_id_batch = sample_batch[DOC_ID]
        logits_dict = self.model(sample_batch)  # unet with 1 encoder and 1 decoder
        loss = self.criterion(logits_dict, sample_batch)  # SGD + momentum
        logger.debug(
            f"epoch {epoch}: step {step}; loss {loss.item()}; doc ids {doc_id_batch}"
        )
        with amp.scale_loss(loss, self.optimizer) as scaled_loss:
            scaled_loss.backward()
        self.optimizer.step()
        avg_train_loss += loss.item()
    epoch_end_time = timeit.default_timer()
    epoch_time = epoch_end_time - epoch_start_time
    avg_train_loss /= len(self.train_data)
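Because the loop adds every `loss.item()` to `avg_train_loss`, a single NaN batch turns the whole epoch average into NaN, which matches the "average train loss nan" log lines earlier in the thread. A small pure-Python sketch of a tracker that keeps the average finite and records the first offending batch is shown below; the class and its fields are hypothetical and not part of the original trainer.

```python
import math


class FiniteLossTracker:
    """Running average of finite losses; remembers the first non-finite batch."""

    def __init__(self):
        self.total = 0.0
        self.count = 0
        self.first_bad = None  # (epoch, step, doc_ids) of the first NaN/Inf loss

    def add(self, loss_value, epoch, step, doc_ids=None):
        """Record one batch loss; return False if it was NaN or Inf."""
        if math.isfinite(loss_value):
            self.total += loss_value
            self.count += 1
            return True
        if self.first_bad is None:
            self.first_bad = (epoch, step, doc_ids)
        return False

    @property
    def average(self):
        """Mean of the finite losses seen so far (NaN if there were none)."""
        return self.total / self.count if self.count else float("nan")
```

In the loop above this would be called as `tracker.add(loss.item(), epoch, step, doc_id_batch)`, and `tracker.first_bad` then names the documents to inspect first.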
I'm getting the same results trying to run pix2pixHD training on Quadro RTX 6000
@jbartolozzi Quadro RTX 6000 has like 24GB of GPU memory? ... good lord. Did you try to use different batch sizes? Does it crash with batch_size = 1? Does it crash if you reduce the input image resolution?
With opt-level=O0 there's no crashing.
These results are with a batch size of 1.
Yeah, opt-level=O0 doesn't crash because it does not modify the model in any way; it is a dry run. But it looks like this ticket won't be solved in the near future.
Like @jbartolozzi, I tried pix2pixHD with CUDA 10.2 and am getting the same results.
https://github.com/NVIDIA/apex/issues/580#issuecomment-549299087 may be fixed by https://github.com/pytorch/pytorch/pull/37569. The fix has been in master for a while, but did not make 1.5.1.
I still recommend moving to torch.cuda.amp. However, if the above PR is the right diagnosis, the problem is not in apex, but in Pytorch's FP16 gemv implementation, so you'll have to update Pytorch whether you choose apex or torch.cuda.amp.
You may want to try the PyTorch nightly builds, as it looks like 1.6 is just around the corner...
Just tried pytorch 1.7.0 nightly. While I didn't get a CUBLAS_STATUS_EXECUTION_FAILED error, I did get the "Gradient overflow" and my GAN started producing black images :|.
Make sure you're following the guidance for multiple models/losses/optimizers. (retain_graph in that snippet is present because the two backward passes share some graph sections, it has nothing to do with amp. You may not need retain_graph for your own multi-model network.)
An example GAN training-loop step with proper torch.cuda.amp control flow can be found here, courtesy of @vfdev-5 (https://twitter.com/pytorch_ignite/status/1262721636844920832).
If that doesn't work, file an issue with a minimal repro on Pytorch github and tag me.
Got the same stack trace as above (RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)) on pytorch 1.4.x and 1.5.x.
With the latest nightly build (1.7.0.dev20200709), Cuda V10.1.243, apex master (https://github.com/NVIDIA/apex/commit/1ff54b8fed441c39dac181091b44fecdca31a403) and cudnn 7.6.3_0, everything seems to work fine (no overflows or segfaults) using the apex API on cycleGAN (https://github.com/seovchinnikov/pytorch-CycleGAN-and-pix2pix).
@mcarilli I converted my training loop to use torch.cuda.amp instead of apex. It runs... but there doesn't seem to be any indication that it's actually using 16-bit floats. Memory usage is identical to the non-fp16 run, as is the speed. Do you know if there is a way to verify that amp is working with fp16 correctly?
Here's my modified code from pix2pixHD:
amp_scaler = GradScaler(enabled=opt.fp16)
with autocast(enabled=opt.fp16):
    ############## Forward Pass ######################
    losses, generated = model(Variable(data['label']), inst_map,
                              Variable(data['image']), Variable(data['feat']), infer=save_fake)
    # sum per device losses
    losses = [torch.mean(x) if not isinstance(x, int)
              else x for x in losses]
    loss_dict = dict(zip(model.module.loss_names, losses))
    # calculate final loss scalar
    loss_D = (loss_dict['D_fake'] + loss_dict['D_real']) * 0.5
    loss_G = loss_dict['G_GAN'] + \
        loss_dict.get('G_GAN_Feat', 0) + loss_dict.get('G_VGG', 0)
############### Backward Pass ####################
# update generator weights
optimizer_G.zero_grad()
amp_scaler.scale(loss_G).backward()
amp_scaler.step(optimizer_G)
# if opt.fp16:
#     with amp.scale_loss(loss_G, optimizer_G) as scaled_loss:
#         scaled_loss.backward()
# else:
#     loss_G.backward()
# optimizer_G.step()
# update discriminator weights
optimizer_D.zero_grad()
amp_scaler.scale(loss_D).backward()
amp_scaler.step(optimizer_D)
amp_scaler.update()
Update: when using DataParallel, I need to decorate the forward method of my module with @autocast. It works now... for a while, and then I start getting nan losses :(.
I get a similar error with the forward pass. After some batches, it gives the following error(s).
Sometimes it is error 1 and sometimes it is error 2 or error 3.
Sometimes the error is thrown after processing 1st batch and sometimes at 2nd,9th or 13th, 17th, 21st batch.
Error 1
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Error 2
RuntimeError: CUDA error: device-side assert triggered
Error 3
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258
Maybe this issue discussion can bring more perspective to it.