I'm getting a segmentation fault when trying to train a model with amp.
torch version '1.0.1.post2'
cudnn version 7.4.2
@mcarilli What could cause those issues?
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Segmentation fault (core dumped)
:[System Logs]:
:Mar 29 12:23:22 server kernel: python3.6[95957]: segfault at b ip 00007f6e08bc664c sp 00007ffe63a06200 error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7f6e08bb3000+64000]
:Mar 29 12:23:22 server abrt-hook-ccpp[96042]: Process 95957 (python3.6) of user 992434 killed by SIGSEGV - dumping core
:Mar 29 12:26:09 server kernel: python3.6[99444]: segfault at b ip 00007f502c4b764c sp 00007ffd04c510a0 error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7f502c4a4000+64000]
:Mar 29 12:26:09 server abrt-hook-ccpp[99522]: Process 99444 (python3.6) of user 992434 killed by SIGSEGV - dumping core
:Mar 29 12:44:58 server kernel: python3.6[106167]: segfault at b ip 00007fa96640c64c sp 00007ffce94225f0 error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7fa9663f9000+64000]
:Mar 29 12:44:58 server abrt-hook-ccpp[106244]: Process 106167 (python3.6) of user 992434 killed by SIGSEGV - dumping core
I assume that cudnn has to be enabled, right? At least that's what the assertion here is saying
https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py#L126
Yes, cudnn should be enabled. Also, I got an email saying that you had segfaults coming from Eigen in a Tensorflow site-package, so I have no idea what to make of that.
@mcarilli cudnn is enabled. Regarding the email, I accidentally posted some "old" log messages. That's how it ended up there.
What can I do regarding the issue? Is there a way to get some verbose output from apex?
Is it segfaulting within amp.initialize? Also, is this a single-process run or a multiprocess run?
It happens after initialization during training while calling optimizer.step()
I am using 2 GPUs.
To sandbox out potential distributed issues: does the problem reproduce in a single-process run?
Do you have an example jupyter notebook or script that I could test quick and easy?
I also tried with 1 GPU. The same thing is happening.
I've got the imagenet example and simple distributed example, currently working on a GAN example as well...
Do you get a python backtrace, or just a segfault? Also what optimizer are you using?
I was playing around with the FP16_Optimizer and that's when it crashes in optimizer.step().
Actually, the segmentation fault is happening when running scaled_loss.backward(). I am using the PyTorch implementation of Adam.
I am only getting a segfault.
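When a run dies with only "Segmentation fault" and no Python backtrace, the standard-library `faulthandler` module can at least show which Python line was executing at the time of the crash. The sketch below is a generic debugging aid, not something apex provides; the helper name `enable_crash_tracebacks` is made up for illustration.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=sys.stderr):
    """Ask the interpreter to dump Python tracebacks on fatal signals.

    Covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL. `stream` must be a
    real file object (it needs a file descriptor).
    Returns True if the fault handler is active afterwards.
    """
    if not faulthandler.is_enabled():
        # all_threads=True prints the traceback of every Python thread.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # Hypothetical placement: call this at the very top of the training
    # script, before the model and optimizer are built.
```

The same effect is available without code changes by starting the script with `python -X faulthandler`. This only reports the Python-level frames; the native frames inside the extension still have to come from the coredump.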
Oh, don't use FP16_Optimizer directly, that's deprecated and basically dead code at this point. Use nothing but the new API:
https://nvidia.github.io/apex/amp.html
and pass the Pytorch Adam optimizer that you construct directly to amp.initialize.
For a straight-line model this is all you have to do:
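```
import torch
from apex import amp

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.Adam(model.parameters(), ...)
# Hand the plain PyTorch model and optimizer to amp.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
# Backpropagate through the scaled loss instead of the raw loss.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...
```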
That's the way I did it, and it's crashing inside of scaled_loss.backward().
It's crashing somewhere inside of
Variable._execution_engine.run_backward(
    tensors, grad_tensors, retain_graph, create_graph,
    allow_unreachable=True)
Can you get a backtrace from your coredump?
@mcarilli, @ngimel here is the coredump backtrace: https://gist.github.com/che85/5b7989ad11f3d7aacbb85c2174f100e5
Thanks for your help.
Here is the installation output as well:
sudo python36 setup.py install --cuda_ext --cpp_ext
torch.__version__ = 1.0.1.post2
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
from /usr/local/cuda/bin
Pytorch binaries were compiled with Cuda 10.0.130
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/apex
copying build/lib.linux-x86_64-3.6/apex/__init__.py -> build/bdist.linux-x86_64/egg/apex
creating build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.6/apex/RNN/RNNBackend.py -> build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.6/apex/RNN/__init__.py -> build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.6/apex/RNN/cells.py -> build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.6/apex/RNN/models.py -> build/bdist.linux-x86_64/egg/apex/RNN
creating build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/__init__.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/__version__.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/_amp_state.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/_initialize.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/amp.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/compat.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/frontend.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/handle.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/opt.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/rnn_compat.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/scaler.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/utils.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.6/apex/amp/wrap.py -> build/bdist.linux-x86_64/egg/apex/amp
creating build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.6/apex/amp/lists/__init__.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.6/apex/amp/lists/functional_overrides.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.6/apex/amp/lists/tensor_overrides.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.6/apex/amp/lists/torch_overrides.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
creating build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.6/apex/fp16_utils/__init__.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.6/apex/fp16_utils/fp16_optimizer.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.6/apex/fp16_utils/fp16util.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.6/apex/fp16_utils/loss_scaler.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
creating build/bdist.linux-x86_64/egg/apex/multi_tensor_apply
copying build/lib.linux-x86_64-3.6/apex/multi_tensor_apply/__init__.py -> build/bdist.linux-x86_64/egg/apex/multi_tensor_apply
copying build/lib.linux-x86_64-3.6/apex/multi_tensor_apply/multi_tensor_apply.py -> build/bdist.linux-x86_64/egg/apex/multi_tensor_apply
creating build/bdist.linux-x86_64/egg/apex/normalization
copying build/lib.linux-x86_64-3.6/apex/normalization/__init__.py -> build/bdist.linux-x86_64/egg/apex/normalization
copying build/lib.linux-x86_64-3.6/apex/normalization/fused_layer_norm.py -> build/bdist.linux-x86_64/egg/apex/normalization
creating build/bdist.linux-x86_64/egg/apex/optimizers
copying build/lib.linux-x86_64-3.6/apex/optimizers/__init__.py -> build/bdist.linux-x86_64/egg/apex/optimizers
copying build/lib.linux-x86_64-3.6/apex/optimizers/fp16_optimizer.py -> build/bdist.linux-x86_64/egg/apex/optimizers
copying build/lib.linux-x86_64-3.6/apex/optimizers/fused_adam.py -> build/bdist.linux-x86_64/egg/apex/optimizers
creating build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/LARC.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/__init__.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/distributed.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/multiproc.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/optimized_sync_batchnorm.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/optimized_sync_batchnorm_kernel.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/sync_batchnorm.py -> build/bdist.linux-x86_64/egg/apex/parallel
copying build/lib.linux-x86_64-3.6/apex/parallel/sync_batchnorm_kernel.py -> build/bdist.linux-x86_64/egg/apex/parallel
creating build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.6/apex/reparameterization/__init__.py -> build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.6/apex/reparameterization/reparameterization.py -> build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.6/apex/reparameterization/weight_norm.py -> build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.6/apex_C.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/amp_C.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/fused_adam_cuda.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/syncbn.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
byte-compiling build/bdist.linux-x86_64/egg/apex/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/RNNBackend.py to RNNBackend.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/cells.py to cells.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/models.py to models.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/__version__.py to __version__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/_amp_state.py to _amp_state.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/_initialize.py to _initialize.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/amp.py to amp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/compat.py to compat.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/frontend.py to frontend.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/handle.py to handle.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/opt.py to opt.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/rnn_compat.py to rnn_compat.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/scaler.py to scaler.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/utils.py to utils.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/wrap.py to wrap.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/functional_overrides.py to functional_overrides.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/tensor_overrides.py to tensor_overrides.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/torch_overrides.py to torch_overrides.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/fp16_optimizer.py to fp16_optimizer.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/fp16util.py to fp16util.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/loss_scaler.py to loss_scaler.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/multi_tensor_apply/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/multi_tensor_apply/multi_tensor_apply.py to multi_tensor_apply.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/normalization/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/normalization/fused_layer_norm.py to fused_layer_norm.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/fp16_optimizer.py to fp16_optimizer.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/fused_adam.py to fused_adam.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/LARC.py to LARC.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/distributed.py to distributed.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/multiproc.py to multiproc.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/optimized_sync_batchnorm.py to optimized_sync_batchnorm.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/optimized_sync_batchnorm_kernel.py to optimized_sync_batchnorm_kernel.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/sync_batchnorm.py to sync_batchnorm.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/sync_batchnorm_kernel.py to sync_batchnorm_kernel.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/reparameterization/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/reparameterization/reparameterization.py to reparameterization.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/reparameterization/weight_norm.py to weight_norm.cpython-36.pyc
creating stub loader for apex_C.cpython-36m-x86_64-linux-gnu.so
creating stub loader for amp_C.cpython-36m-x86_64-linux-gnu.so
creating stub loader for fused_adam_cuda.cpython-36m-x86_64-linux-gnu.so
creating stub loader for syncbn.cpython-36m-x86_64-linux-gnu.so
creating stub loader for fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/apex_C.py to apex_C.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/amp_C.py to amp_C.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/fused_adam_cuda.py to fused_adam_cuda.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/syncbn.py to syncbn.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/fused_layer_norm_cuda.py to fused_layer_norm_cuda.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.amp_C.cpython-36: module references __file__
__pycache__.apex_C.cpython-36: module references __file__
__pycache__.fused_adam_cuda.cpython-36: module references __file__
__pycache__.fused_layer_norm_cuda.cpython-36: module references __file__
__pycache__.syncbn.cpython-36: module references __file__
creating 'dist/apex-0.1-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing apex-0.1-py3.6-linux-x86_64.egg
removing '/usr/local/lib64/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib64/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg
Extracting apex-0.1-py3.6-linux-x86_64.egg to /usr/local/lib64/python3.6/site-packages
apex 0.1 is already the active version in easy-install.pth
Installed /usr/local/lib64/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg
Processing dependencies for apex==0.1
Finished processing dependencies for apex==0.1
@che85 I'm currently in the process of refactoring the treatment of optimizers under the hood, which will also affect the operation of the backward context manager. I plan to have this merged/publicly usable by Monday. Can we revisit at that time?
Awesome! Sounds like a plan. Could you get back to me once it's done? Thanks a lot.
You know what? I think I just figured it out. I recompiled apex with gcc 5.3.1 (previously 4.8.5) and now it's training! If everything works fine, I will just close this issue.
Great!
@che85 You saved my day and my job!! I changed gcc from 4.8.5 to 5.4.0 and, magically, everything worked. Thank you.
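Since both reports above point at the compiler used to build the extensions, a quick check of the installed gcc before running `setup.py install --cuda_ext --cpp_ext` may save some time. The following is a minimal sketch: the function names are made up for illustration, and the `(5, 0, 0)` threshold is only an assumption drawn from this thread (4.8.5 segfaulted, 5.3.1 and 5.4.0 worked), not a documented requirement.

```python
import re
import subprocess

# Assumption based only on the reports in this thread; the exact cutoff
# between "segfaults" and "works" was never established.
SUSPECT_BELOW = (5, 0, 0)


def parse_gcc_version(text):
    """Turn output such as '4.8.5' or '9' into a (major, minor, patch) tuple."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part or 0) for part in match.groups())


def is_suspect(version):
    """True if the version is older than the builds reported to work here."""
    return version < SUSPECT_BELOW


def installed_gcc_version(gcc="gcc"):
    """Query the compiler found on PATH via `gcc -dumpversion`."""
    result = subprocess.run(
        [gcc, "-dumpversion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_gcc_version(result.stdout)
```

For example, `is_suspect(installed_gcc_version())` would return True on the gcc 4.8.5 machine described above. Note that `sudo` can change `PATH`, so the compiler picked up during `sudo python36 setup.py install` may differ from the one found in an interactive shell.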