Apex: RuntimeError: cuda runtime error (74) : misaligned address at /pytorch/aten/src/THC/THCTensorCopy.cu:84

Created on 13 Jan 2019 · 14 Comments · Source: NVIDIA/apex

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=84 error=74 : misaligned address
Traceback (most recent call last):
File "full_main.py", line 394, in <module>
full_main()
File "full_main.py", line 184, in full_main
loss = train(train_loader, model, criterion, optimizer, epoch, log_training,args.fp16)
File "full_main.py", line 232, in train
output = model(input_var)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/work/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: cuda runtime error (74) : misaligned address at /pytorch/aten/src/THC/THCTensorCopy.cu:84
terminate called after throwing an instance of 'at::Error'
what(): CUDA error: invalid device pointer (CudaCachingDeleter at /pytorch/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7ff6e3f710d4 in /home/work/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7ff6e40107df in /home/work/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAHalfTensor::~CUDAHalfTensor() + 0x9 (0x7ff64c972ab9 in /home/work/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::generated::CudnnConvolutionBackward::~CudnnConvolutionBackward() + 0x5d (0x7ff6ebc06afd in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::autograd::generated::CudnnBatchNormBackward::~CudnnBatchNormBackward() + 0x74 (0x7ff6ebc06474 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #11: torch::autograd::generated::ThresholdBackward1::~ThresholdBackward1() + 0x66 (0x7ff6ebc05cb6 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #12: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #13: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #14: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #15: torch::autograd::generated::MaxPool2DWithIndicesBackward::~MaxPool2DWithIndicesBackward() + 0x88 (0x7ff6ebc06fa8 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #16: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #17: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #18: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #19: torch::autograd::generated::CudnnConvolutionBackward::~CudnnConvolutionBackward() + 0x73 (0x7ff6ebc06b13 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #20: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #21: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #22: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #23: torch::autograd::generated::CudnnBatchNormBackward::~CudnnBatchNormBackward() + 0x74 (0x7ff6ebc06474 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #24: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #25: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #26: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #27: torch::autograd::generated::ThresholdBackward1::~ThresholdBackward1() + 0x66 (0x7ff6ebc05cb6 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #28: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #29: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #30: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #31: torch::autograd::generated::CudnnConvolutionBackward::~CudnnConvolutionBackward() + 0x73 (0x7ff6ebc06b13 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #32: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #33: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #34: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #35: torch::autograd::generated::CudnnBatchNormBackward::~CudnnBatchNormBackward() + 0x74 (0x7ff6ebc06474 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #36: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #37: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #38: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #39: torch::autograd::generated::ThresholdBackward1::~ThresholdBackward1() + 0x66 (0x7ff6ebc05cb6 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #40: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #41: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #42: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #43: torch::autograd::generated::MaxPool2DWithIndicesBackward::~MaxPool2DWithIndicesBackward() + 0x88 (0x7ff6ebc06fa8 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #44: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #45: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #46: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #47: torch::autograd::generated::AvgPool2DBackward::~AvgPool2DBackward() + 0x67 (0x7ff6ebc065d7 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #48: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #49: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #50: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #51: torch::autograd::generated::CudnnConvolutionBackward::~CudnnConvolutionBackward() + 0x73 (0x7ff6ebc06b13 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #52: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #53: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #54: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #55: torch::autograd::generated::CudnnBatchNormBackward::~CudnnBatchNormBackward() + 0x74 (0x7ff6ebc06474 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #56: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #57: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #58: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #59: torch::autograd::generated::ThresholdBackward1::~ThresholdBackward1() + 0x66 (0x7ff6ebc05cb6 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #60: torch::autograd::deleteFunction(torch::autograd::Function) + 0x47 (0x7ff6eb9f8257 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #61: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7ff6eb624ea5 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #62: torch::autograd::Function::~Function() + 0xfe (0x7ff6eb6f2f4e in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #63: torch::autograd::generated::CatBackward::~CatBackward() + 0x72 (0x7ff6ebc02132 in /home/work/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

Aborted

Solacex

Most helpful comment

It supports multi-GPU training.
When I run the code on 4 Pascal GPUs, it throws the same error as yours.
When I run the code on 1 Pascal GPU, it throws no error.
When I run the code on 1 or 4 Turing GPUs, it throws no error.
When I run the code on a DGX-1 (8 x Volta GPUs), it throws no error.
I think there is a problem specific to multi-GPU Pascal setups.
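
To check which architecture each visible GPU belongs to, something like the following sketch may help. It assumes PyTorch with CUDA is installed; `arch_name` and `describe_gpus` are hypothetical helper names, and the mapping covers only the architectures mentioned in this thread.

```python
def arch_name(major, minor):
    """Map a CUDA compute capability to an architecture name (partial mapping)."""
    if major == 6:
        return "Pascal"   # e.g. Tesla P100 (6.0), GTX 1080 Ti (6.1)
    if (major, minor) in ((7, 0), (7, 2)):
        return "Volta"    # e.g. Tesla V100 (7.0)
    if (major, minor) == (7, 5):
        return "Turing"   # e.g. RTX 2080 Ti (7.5)
    return "other"


def describe_gpus():
    """Return (device name, architecture) for every GPU visible to PyTorch."""
    import torch  # imported lazily so arch_name works without PyTorch

    result = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        result.append((torch.cuda.get_device_name(index), arch_name(major, minor)))
    return result
```

Calling `describe_gpus()` on the failing machine and on a working one makes it easy to confirm whether the failures line up with Pascal cards.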

seongwook-ham on 14 Jan 2019

👍2

All 14 comments

I also see the same issue with CUDA 9 on Pascal GPUs (1 x Titan X Pascal and 3 x 1080 Ti).
But when I test the same code on Turing (Titan RTX and 3 x 2080 Ti) or Volta (DGX-1, 8 x Tesla V100) with CUDA 9, it does not throw the error.
I think it does not fully support the Pascal architecture.

seongwook-ham on 13 Jan 2019

But my environment is a Tesla P100 with CUDA 9.0, which the documentation lists as supported by this library.

Solacex on 13 Jan 2019

Can you please run with the CUDA_LAUNCH_BLOCKING=1 environment variable set, to see exactly where the error is coming from?
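
One way to do this, sketched in Python, is to start the training script in a child process whose environment has CUDA_LAUNCH_BLOCKING=1, so kernel launches run synchronously and the traceback points at the call that actually failed. `blocking_env` and `run_blocking` are hypothetical helper names; full_main.py is the script named in the traceback above.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *args):
    """Run a Python script in a child process with CUDA_LAUNCH_BLOCKING=1."""
    command = [sys.executable, script, *args]
    return subprocess.run(command, env=blocking_env())


# Example (not executed here): run_blocking("full_main.py")
```

From a shell, the equivalent is to prefix the usual command: CUDA_LAUNCH_BLOCKING=1 python full_main.py. Expect the run to be slower while the variable is set.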

ngimel on 13 Jan 2019

Thank you! I will run it as you suggested and report the result.

Solacex on 13 Jan 2019

Does this library support multi-GPU training? Could this be the reason for the crash?

Solacex on 13 Jan 2019

@Solacex @Coderx7 I noticed you're using DataParallel rather than DistributedDataParallel. Also, the error is coming from a Python operation that's called by DataParallel.

We typically recommend using DistributedDataParallel with one GPU device per process. Can you give this a try?

You have two options:
https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel
Simple example with PyTorch's built-in DDP
https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel
Simple example with Apex's DDP
Ignore the use of FP16_Optimizer, that's unrelated.
Also, note the use of torch.distributed.launch to spawn the processes in run.sh. The use of torch.distributed.launch requires calling init_process_group with init_method='env://' (see here). This use of torch.distributed.launch is identical for both Apex DDP and PyTorch DDP.

Both work well for most use cases, and their APIs are similar. The most important difference is:

torch.nn.parallel.DistributedDataParallel requires you to specify device_ids and output_device in order to use one process per device (see here)
apex.parallel.DistributedDataParallel assumes that you are only using one process per device (see here)
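
The difference can be sketched as follows. This is only an illustration under assumptions: the process group is already initialized, the model already lives on the GPU with index `local_rank`, and `ddp_kwargs` and `wrap_model` are hypothetical helper names.

```python
def ddp_kwargs(local_rank, use_apex):
    """Keyword arguments for the DDP wrapper when running one process per GPU."""
    if use_apex:
        # Apex DDP assumes one process per device, so nothing extra is passed.
        return {}
    # PyTorch DDP is told explicitly which single device this process owns.
    return {"device_ids": [local_rank], "output_device": local_rank}


def wrap_model(model, local_rank, use_apex=False):
    """Wrap a model that already lives on GPU `local_rank` in the chosen DDP."""
    if use_apex:
        from apex.parallel import DistributedDataParallel as DDP
    else:
        from torch.nn.parallel import DistributedDataParallel as DDP
    return DDP(model, **ddp_kwargs(local_rank, use_apex))
```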

mcarilli on 15 Jan 2019

👍1

Well, you mean it does not support multi-GPU computing on one machine, right?

Solacex on 17 Jan 2019

@Solacex
I've tested plain DataParallel on multi-GPU Turing (custom machine) and Volta (DGX-1) setups, and it worked well. On a multi-GPU Pascal setup it did not work.
When I tested @mcarilli's solution (distributed training) on multiple Pascal GPUs, it worked well.
This solution does not mean you have to use multiple machines. You can use it on one machine with a bash script that launches multiple processes:

#!/bin/bash
# --nproc_per_node: set this to your number of GPUs (4 here)
# run_pretrain_dist.py: replace with your own distributed training script
python -m torch.distributed.launch --nproc_per_node=4 run_pretrain_dist.py

Each process uses 1 GPU. This script launches 4 processes, so it uses 4 GPUs on one machine.
Your script also has to be modified for distributed training by handling local_rank and using a distributed data loader.
See what mcarilli says.
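
For the data loader part, a rough sketch using PyTorch's DistributedSampler is shown here. It assumes init_process_group has already been called in each process; `make_distributed_loader` and `shard_indices` are hypothetical names, and `shard_indices` only illustrates the basic partitioning idea (the real sampler also shuffles and pads).

```python
def shard_indices(num_samples, world_size, rank):
    """Illustrative only: every world_size-th sample, starting at index `rank`."""
    return list(range(rank, num_samples, world_size))


def make_distributed_loader(dataset, batch_size, num_workers=4):
    """Build a DataLoader whose sampler gives each process its own data shard."""
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # The sampler reads rank and world size from the initialized process group.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,  # do not also pass shuffle=True when a sampler is given
        num_workers=num_workers,
        pin_memory=True,
    )
    return loader, sampler
```

Calling sampler.set_epoch(epoch) at the start of each epoch gives a different shuffle per epoch.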

seongwook-ham on 17 Jan 2019

👍1

@Solacex DistributedDataParallel supports multi-GPU computing on one machine, or on many machines. The recommended practice is to launch one process per device, as in @seongwook-ham's answer and run.sh from my example (https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple/distributed_apex).

Also, note that to use DistributedDataParallel, each process needs to parse the arguments to receive its local rank, and set the device based on its local rank.
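
A minimal sketch of that step, assuming the processes are spawned by torch.distributed.launch (which passes --local_rank to each one) and that the NCCL backend is available; `parse_local_rank` and `init_distributed` are hypothetical helper names.

```python
import argparse


def parse_local_rank(argv=None):
    """Read the --local_rank argument that torch.distributed.launch passes in."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores the training script's other arguments.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def init_distributed(local_rank):
    """Bind this process to its GPU and join the process group."""
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)
    # env:// reads the rendezvous settings exported by torch.distributed.launch.
    dist.init_process_group(backend="nccl", init_method="env://")
```

In the training script this would run once at startup, before the model is moved to the GPU and wrapped in DistributedDataParallel.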

mcarilli on 17 Jan 2019

I see a similar problem with the same error message. What resolved it for me was using 1 GPU instead of multiple GPUs (no data parallelism). I lose the power of multiple GPUs, but I can get a result. I am also using PyTorch.
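
One way to restrict a run to a single GPU without touching the model code is to hide the other devices through CUDA_VISIBLE_DEVICES. This is a sketch of that workaround, not a fix for the underlying error; `single_gpu_env` is a hypothetical helper name, and the variable has to be set before CUDA is initialized in the process.

```python
import os


def single_gpu_env(gpu_index=0, base=None):
    """Return a copy of the environment in which only one GPU is visible."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Example (not executed here), launching a training script on GPU 0 only:
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=single_gpu_env(0))
# "train.py" is a placeholder for your own script.
```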