I am running into an exception when loading a model on CPU in one of the example scripts. This seems to be related to loading FusedLayerNorm from apex, even when --no_cuda has been set.
https://github.com/huggingface/pytorch-pretrained-BERT/blob/8da280ebbeca5ebd7561fd05af78c65df9161f92/pytorch_pretrained_bert/modeling.py#L154
Or is this working for anybody else?
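For context, modeling.py tries apex first and only falls back to a pure-PyTorch LayerNorm when the import fails, so the fused CUDA kernel is used whenever apex is installed, regardless of --no_cuda. Roughly paraphrased from the linked line:

import torch
import torch.nn as nn

try:
    # apex's fused kernel is imported whenever the package is installed,
    # even if we intend to run on CPU
    from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
except ImportError:
    class BertLayerNorm(nn.Module):
        """Pure-PyTorch LayerNorm, only reached when apex is absent."""
        def __init__(self, hidden_size, eps=1e-12):
            super(BertLayerNorm, self).__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.bias = nn.Parameter(torch.zeros(hidden_size))
            self.variance_epsilon = eps

        def forward(self, x):
            u = x.mean(-1, keepdim=True)
            s = (x - u).pow(2).mean(-1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.variance_epsilon)
            return self.weight * x + self.bias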
Example:
run_classifier.py --data_dir glue/CoLA --task_name CoLA --do_train --do_eval --bert_model bert-base-cased --max_seq_length 32 --train_batch_size 12 --learning_rate 2e-5 --num_train_epochs 2.0 --output_dir /tmp/mrpc_output/ --no_cuda
Exception:
[...]
File "/home/mp/miniconda3/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/normalization/fused_layer_norm.py", line 19, in forward
input_, self.normalized_shape, weight_, bias_, self.eps)
RuntimeError: input must be a CUDA tensor (layer_norm_affine at apex/normalization/csrc/layer_norm_cuda.cpp:120)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fe35f6e4cc5 in /home/mp/miniconda3/envs/bert/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: layer_norm_affine(at::Tensor, c10::ArrayRef<long>, at::Tensor, at::Tensor, double) + 0x4bc (0x7fe3591456ac in /home/mp/miniconda3/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x18db4 (0x7fe359152db4 in /home/mp/miniconda3/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x16505 (0x7fe359150505 in /home/mp/miniconda3/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #12: THPFunction_do_forward(THPFunction*, _object*) + 0x15c (0x7fe38fb7db7c in /home/mp/miniconda3/envs/bert/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Hi @tholor, apex is a GPU-specific extension.
What kind of use case do you have in which apex is installed but there is no GPU? (Also, fp16 doesn't work on CPU; it's currently not supported by PyTorch.)
The two cases in which I came across this:
1) testing whether some code works on both GPU and CPU (on a GPU machine with apex installed)
2) training/debugging small sample models on my laptop. It has a small "toy GPU" with only 2 GB of RAM, so I usually use the CPU there.
I agree that these are edge cases, but I thought the --no_cuda flag was intended for exactly such cases?
I see. It's a bit tricky because apex is loaded by default whenever it can be found, and this loading happens deep inside the library itself, not in the examples (here). I don't think it's worth adding specific logic inside the library's loading code to handle such a case.
I guess the easiest solution in your case is to have two Python environments (with conda or virtualenv) and switch to the one in which apex is not installed when you don't want to use the GPU.
Feel free to re-open the issue if this doesn't solve your problem.
Sure, then it's not worth the effort.
@thomwolf a solution would be to check torch.cuda.is_available() around the import; apex can then be disabled by launching with CUDA_VISIBLE_DEVICES=-1.
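A quick sanity check of that mechanism (a minimal sketch; the variable must be set before torch initializes CUDA):

import os
# hide all GPUs; must happen before torch touches CUDA for the first time
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch
print(torch.cuda.is_available())  # False, so a guarded apex import would be skipped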
Is this also related to the fact that the tests fail when apex is installed?
def forward(self, input, weight, bias):
input_ = input.contiguous()
weight_ = weight.contiguous()
bias_ = bias.contiguous()
output, mean, invvar = fused_layer_norm_cuda.forward_affine(
> input_, self.normalized_shape, weight_, bias_, self.eps)
E RuntimeError: input must be a CUDA tensor (layer_norm_affine at apex/normalization/csrc/layer_norm_cuda.cpp:120)
E frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f754d802021 in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/torch/lib/libc10.so)
E frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f754d8018ea in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/torch/lib/libc10.so)
E frame #2: layer_norm_affine(at::Tensor, c10::ArrayRef<long>, at::Tensor, at::Tensor, double) + 0x6b9 (0x7f754a8aafe9 in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
E frame #3: <unknown function> + 0x19b9d (0x7f754a8b8b9d in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
E frame #4: <unknown function> + 0x19d1e (0x7f754a8b8d1e in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
E frame #5: <unknown function> + 0x16971 (0x7f754a8b5971 in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
E <omitting python frames>
E frame #13: THPFunction_do_forward(THPFunction*, _object*) + 0x15c (0x7f7587d411ec in /lium/buster1/caglayan/anaconda/envs/bert/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
../../lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/normalization/fused_layer_norm.py:21: RuntimeError
_______________________________________________________________________________ OpenAIGPTModelTest.test_default
Hello @artemisart,
What do you mean by "disable apex by CUDA_VISIBLE_DEVICES=-1"? I tried that, but the import still works at this line.
@LamDang You can set the environment variable CUDA_VISIBLE_DEVICES=-1 to disable CUDA in PyTorch (e.g. launch your script in bash with CUDA_VISIBLE_DEVICES=-1 python script.py), and then wrap the apex import in an if torch.cuda.is_available() check in the script.
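A minimal sketch of the guarded import (using torch.nn.LayerNorm as a stand-in fallback; the library's own fallback class would work the same way):

import torch
import torch.nn as nn

BertLayerNorm = nn.LayerNorm  # default to the pure-PyTorch implementation
if torch.cuda.is_available():
    try:
        # the fused kernel is CUDA-only, so only try it when a GPU is visible
        from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
    except ImportError:
        pass  # apex not installed; keep the fallback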
Hi all, I came across this issue when my GPU memory was fully loaded and I had to run some inference at the same time. For this kind of temporary need, the simplest solution for me is just to touch an empty apex.py in the working directory before the run (it shadows the installed package, so the import fails and the code falls back to the standard LayerNorm) and remove it afterwards.
Re-opening this to remember to wrap the apex import with an if torch.cuda.is_available() check in the next release, as advocated by @artemisart.
Hello, I pushed a pull request to solve this issue upstream: https://github.com/NVIDIA/apex/pull/256
Update: it has been merged into apex.
Yes please, I also struggle with apex in CPU mode. I have wrapped BertModel in my own object, and when I try to load the pretrained GPU model with torch.load(model, map_location='cpu'), it shows 'no module named apex'. But if I install apex, I get a "no CUDA" error (I'm on a CPU machine in the inference phase).
Well, it should be solved in apex now. What is the exact error message you have?
By the way, not using apex is also fine; don't worry about it if you don't need it.
I got
model = torch.load(model_file, map_location='cpu')
result = unpickler.load()
ModuleNotFoundError: No module named 'apex'
model_file is an object pretrained on GPU with a BertModel field, but I want to unpickle it in CPU mode.
Try using the PyTorch-recommended serialization practice (saving/loading the state dict):
https://pytorch.org/docs/stable/notes/serialization.html
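Something like the following (a minimal sketch; bert_weights.pt is a placeholder filename and I'm assuming the wrapped model is a plain BertModel). A state dict contains only tensors, so no apex classes need to be importable on the CPU machine:

import torch
from pytorch_pretrained_bert import BertModel

# On the GPU machine: save only the weights, never the pickled object graph
model = BertModel.from_pretrained("bert-base-cased")
torch.save(model.state_dict(), "bert_weights.pt")

# On the CPU machine: rebuild the same architecture, then load the weights
cpu_model = BertModel.from_pretrained("bert-base-cased")
cpu_model.load_state_dict(torch.load("bert_weights.pt", map_location="cpu"))
cpu_model.eval()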
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.