Maskrcnn-benchmark: build error: apex

Created on 23 Apr 2019 · 31 Comments · Source: facebookresearch/maskrcnn-benchmark

❓ Questions and Help

When I install apex using the command "python setup.py install --cuda_ext --cpp_ext", I get the following error:

torch.__version__  =  1.1.0.dev20190422
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
from /usr/local/cuda/bin

Pytorch binaries were compiled with Cuda 9.0.176

running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'amp_C' extension
gcc -pthread -B /home/dy113/anaconda3/envs/maskrcnn_benchmark/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include/TH -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/include/python3.7m -c csrc/amp_C_frontend.cpp -o build/temp.linux-x86_64-3.7/csrc/amp_C_frontend.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/usr/local/cuda/bin/nvcc -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include/TH -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dy113/anaconda3/envs/maskrcnn_benchmark/include/python3.7m -c csrc/multi_tensor_scale_kernel.cu -o build/temp.linux-x86_64-3.7/csrc/multi_tensor_scale_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
csrc/type_shim.h(13): error: class "at::Type" has no member "scalarType"

1 error detected in the compilation of "/tmp/tmpxft_0000270c_00000000-6_multi_tensor_scale_kernel.cpp1.ii".
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

Can anyone help me? Thank you very much!
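The log prints both the CUDA release reported by nvcc and the CUDA version the PyTorch binaries were compiled with. A minimal sketch of making that comparison yourself before a build, assuming you pass in the text of `nvcc --version` and the string from `torch.version.cuda`; the helper names are hypothetical and not part of apex:

```python
import re
from typing import Optional, Tuple


def nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def torch_cuda_release(torch_cuda: str) -> Tuple[int, int]:
    """Turn a string such as '9.0.176' (as torch.version.cuda reported here) into (major, minor)."""
    major, minor = torch_cuda.split(".")[:2]
    return int(major), int(minor)


def cuda_versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """True if nvcc and the PyTorch binaries agree on the CUDA major.minor release."""
    release = nvcc_release(nvcc_output)
    return release is not None and release == torch_cuda_release(torch_cuda)


# nvcc banner copied from the log above.
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176"""

print(nvcc_release(sample))                     # (9, 0)
print(cuda_versions_match(sample, "9.0.176"))   # True
print(cuda_versions_match(sample, "10.0.130"))  # False
```

On a real machine you could feed it `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` (Python 3.7+) and `torch.version.cuda`, which is `None` for CPU-only builds. Matching versions did not prevent the `scalarType` error above, so this only rules out one class of problem.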

dependency bug


wxd1995

All 31 comments

I got the same error, too.

Tegala on 23 Apr 2019

What is your gcc version?

Shao-kun-Zhang on 23 Apr 2019

> What is your gcc version?

Mine is gcc 5.2.

Tegala on 23 Apr 2019

According to this issue, it seems that apex can only be installed with CUDA 10. My gcc version is 7.3, my Python version is 3.6, and my PyTorch version is 1.0.0. It works.

Yuliang-Zou on 24 Apr 2019

Same error with CUDA 10 on Ubuntu 16.
I tried gcc 5 and gcc 7, and Python 2.7 and Python 3.6.

mel-2445 on 24 Apr 2019

A workaround is to downgrade to a PyTorch nightly from a few days ago:
conda install pytorch-nightly=1.0.0.dev20190404 cudatoolkit=10.0 -c pytorch
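To tell whether an installed nightly is older or newer than the pinned 1.0.0.dev20190404 build, the build date can be read from the version string. A minimal sketch, assuming nightly versions follow the `X.Y.Z.devYYYYMMDD` pattern seen in this thread (the original log printed `1.1.0.dev20190422`); `nightly_build_date` is a hypothetical helper:

```python
import re
from datetime import date
from typing import Optional


def nightly_build_date(version: str) -> Optional[date]:
    """Return the build date in a nightly version string such as '1.1.0.dev20190422'."""
    match = re.search(r"\.dev(\d{4})(\d{2})(\d{2})", version)
    if match is None:
        return None  # release versions such as '1.0.1' carry no date
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


pinned = nightly_build_date("1.0.0.dev20190404")     # the build pinned in the workaround
installed = nightly_build_date("1.1.0.dev20190422")  # the version from the original log
print(installed)           # 2019-04-22
print(installed > pinned)  # True: newer than the pinned build
```

In practice you would pass `torch.__version__` instead of the literal string.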

mel-2445 on 25 Apr 2019

👍9

cc @mcarilli, are you aware of any recent breakages of apex with the latest PyTorch nightly?

fmassa on 25 Apr 2019

There's already an issue in Apex (#267) with a PR (#272) that fixes it, which apparently will be merged soon.

So if you need an immediate fix, use the scalar_type branch of ptrblck's fork.

mdsmith-cim on 25 Apr 2019

Thanks @mdsmith-cim for the concise, correct summary. Our fix will be merged tomorrow at the latest (I have some other commitments so I may not have time to review it in detail today).

mcarilli on 25 Apr 2019

@mdsmith-cim Hi, sorry to bother you. I still get the same error after cloning your apex. Why is this? I look forward to your reply.

Shao-kun-Zhang on 26 Apr 2019

@zskadazhang Did you check out the scalar_type branch?

MC-devel-staudt on 26 Apr 2019

@DavidSPumpkins Oh! It works. Thank you very much!

Shao-kun-Zhang on 26 Apr 2019

@zskadazhang Good to hear it's working!
Please tag me if you run into issues related to this branch.

However, we should merge it to apex/master today so you can pull from the master branch again.

ptrblck on 26 Apr 2019

The PR was merged so the build should work again using apex master. :)

ptrblck on 26 Apr 2019

👍3

With torch 1.1.0.dev20190425 and the latest apex fix, I still get an error when I try to compile with python setup.py install --cuda_ext --cpp_ext. I'm using gcc 5.5. Can anyone please help? Much appreciated!
```
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11080): error: argument of type "void *" is incompatible with parameter of type "float *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11089): error: argument of type "void *" is incompatible with parameter of type "float *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11100): error: argument of type "void *" is incompatible with parameter of type "float *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11109): error: argument of type "void *" is incompatible with parameter of type "double *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11120): error: argument of type "void *" is incompatible with parameter of type "double *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11129): error: argument of type "void *" is incompatible with parameter of type "double *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11140): error: argument of type "void *" is incompatible with parameter of type "double *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11149): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11160): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11169): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11180): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11189): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11200): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11209): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11220): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11229): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11240): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11249): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11260): error: argument of type "void *" is incompatible with parameter of type "int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11269): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11280): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11289): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(11300): error: argument of type "void *" is incompatible with parameter of type "long long *"

92 errors detected in the compilation of "/tmp/tmpxft_00004034_00000000-6_multi_tensor_scale_kernel.cpp1.ii".
error: command '/usr/local/cuda-9.0/bin/nvcc' failed with exit status 1
```

chengruizhe on 27 Apr 2019

👍4

Sorry, I don't know why. My environment is CUDA 10.1 and GCC 7.3, and I got apex from @mdsmith-cim. Maybe you can ask him.

Shao-kun-Zhang on 27 Apr 2019

@chengruizhe I've never seen this error before. Does it point to a particular line in the file?

mcarilli on 28 Apr 2019

@chengruizhe @mcarilli
Could it be related to the gcc version?
Based on this information, Ubuntu 16.04, for example, should use GCC 5.3.1 for CUDA 9.0.
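Checking this starts with knowing exactly which gcc version the build picks up. A minimal sketch that parses the first line of `gcc --version`, assuming the usual `gcc (<distribution info>) X.Y.Z ...` banner; the function name and the sample banner are illustrative, not taken from the reporter's machine:

```python
import re
from typing import Optional, Tuple


def gcc_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Parse the version from the first line of `gcc --version`.

    The version is taken as the first dotted number after the closing
    parenthesis, e.g. 'gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609'.
    """
    lines = version_output.splitlines()
    if not lines:
        return None
    match = re.search(r"\)\s+(\d+(?:\.\d+)*)", lines[0])
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Illustrative banner for the gcc 5.5 reported above.
found = gcc_version("gcc (GCC) 5.5.0\nCopyright (C) 2015 Free Software Foundation, Inc.")
print(found)               # (5, 5, 0)
print(found == (5, 3, 1))  # False: not the version listed for Ubuntu 16.04 + CUDA 9.0
```

On a real system the input would come from `subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout` (Python 3.7+). Because the result is a tuple, bounds such as `(5, 3, 1)` can be compared directly with `<` or `==`.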

ptrblck on 28 Apr 2019

AttributeError: 'AmpState' object has no attribute 'opt_properties'

Has anyone else run into this problem? I built apex and maskrcnn-benchmark successfully without any errors.
My versions are:
CUDA 9.0, gcc 5.2, pytorch-nightly 1.1 (CentOS)
(I can run it successfully on Ubuntu with CUDA 9.0, gcc 5.2, and pytorch-nightly 1.0.0.)

Tegala on 29 Apr 2019

@Tegala Are you building apex from source or are you using an older version of apex?

ptrblck on 29 Apr 2019

> @Tegala Are you building apex from source or are you using an older version of apex?

Thanks for your reply! I use these commands to build apex:

git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext

And I am sure I am using the latest version of apex. This is strange...

Tegala on 29 Apr 2019


The error output is:

2019-04-30 06:11:31,877 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "tools/train_net.py", line 177, in <module>
    main()
  File "tools/train_net.py", line 170, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 76, in train
    arguments,
  File "/home/hjz/projects/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 79, in do_train
    with amp.scale_loss(losses, optimizer) as scaled_losses:
  File "/home/hjz/perl5/anaconda3/envs/deephaj-env/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/hjz/perl5/anaconda3/envs/deephaj-env/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/handle.py", line 78, in scale_loss
    if not _amp_state.opt_properties.enabled:
AttributeError: 'AmpState' object has no attribute 'opt_properties'

Tegala on 29 Apr 2019

Thanks for the information!
I'm trying to reproduce this issue. CC @mcarilli

ptrblck on 29 Apr 2019

@Tegala Amp requires that

model, optimizer = amp.initialize(model, optimizer, opt_level=...)

be called before any invocation of

with amp.scale_loss(losses, optimizer) as scaled_loss:

If your code is somehow invoking with amp.scale_loss without ever invoking amp.initialize, the above error will result.

mcarilli on 29 Apr 2019

😄1

> @Tegala Amp requires that
> model, optimizer = amp.initialize(model, optimizer, opt_level=...)
> be called before any invocation of
> with amp.scale_loss(losses, optimizer
>
> If your code somehow invokes with amp.scale_loss without invoking amp.initialize, the above error will result.

Thanks so much!
I checked again and found it was just as you said; now it works!

Tegala on 30 Apr 2019

👍2

https://github.com/SeanNaren/deepspeech.pytorch/issues/376

git clone --recursive https://github.com/NVIDIA/apex.git
cd apex && pip install .

This worked for me.

aashokvardhan on 10 May 2019

👍1

@aashokvardhan pip install . will perform a Python-only install, which is not ideal for performance. You should install with cuda and c++ extensions via

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

and only fall back to pip install . if the extension build doesn't work.
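The recommended order (extension build first, Python-only install only as a fallback) can be scripted. A minimal sketch, assuming it runs from the root of the apex checkout and that pip accepts the `--global-option` flags exactly as given in the comment (these may need adjusting on newer pip versions); `install_apex` and its `run` hook are hypothetical:

```python
import subprocess
import sys
from typing import Callable, List

# Commands from the comment above, run through the current interpreter's pip.
EXTENSION_INSTALL = [
    sys.executable, "-m", "pip", "install", "-v", "--no-cache-dir",
    "--global-option=--cpp_ext", "--global-option=--cuda_ext", ".",
]
PYTHON_ONLY_INSTALL = [sys.executable, "-m", "pip", "install", "."]


def install_apex(run: Callable[[List[str]], int] = subprocess.call) -> str:
    """Try the C++/CUDA extension build first; fall back to a Python-only install."""
    if run(EXTENSION_INSTALL) == 0:
        return "extensions"
    # The Python-only install works but is not ideal for performance.
    if run(PYTHON_ONLY_INSTALL) == 0:
        return "python-only"
    raise RuntimeError("both apex install attempts failed")


# Run from the root of the apex checkout:
# print(install_apex())
```

The `run` parameter exists only so the fallback logic can be exercised without actually invoking pip; by default it is `subprocess.call`, which returns the command's exit code.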

mcarilli on 10 May 2019

👍3

I encountered this on Ubuntu 18.04 with CUDA 9.0 and Ubuntu's default GCC/G++ (version 7). The CUDA compiler is incompatible with GCC >= 6.4.

I solved it by installing GCC 5 and G++ 5 (sudo apt install gcc-5 g++-5) and giving them higher priority with update-alternatives:

sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 10
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 10

After this, Apex installed fine using the default instructions.
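After update-alternatives, /usr/bin/gcc is a chain of symlinks, so one way to confirm which compiler a plain `gcc` call now runs is to resolve that chain. A minimal sketch using only the standard library; `resolved_compiler` is a hypothetical helper, and the exact final path depends on the distribution:

```python
import os
import shutil
from typing import Optional


def resolved_compiler(name: str, search_path: Optional[str] = None) -> Optional[str]:
    """Return the real file that `name` resolves to on PATH, following symlinks.

    After update-alternatives, /usr/bin/gcc typically points to
    /etc/alternatives/gcc, which in turn points to the selected versioned
    binary, so the resolved path shows which compiler `gcc` actually runs.
    """
    found = shutil.which(name, path=search_path)
    if found is None:
        return None  # not on PATH (or not executable)
    return os.path.realpath(found)


for compiler in ("gcc", "g++"):
    print(compiler, "->", resolved_compiler(compiler))
```

If the printed paths still end in a version-7 binary, the new alternatives did not take priority.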

joskaaaa on 14 Jun 2019

My solution, on Ubuntu 18.04 with CUDA 9.0, PyTorch 1.1.0, and Python 3.6:

sudo ln -sf /usr/bin/gcc-5 /usr/local/cuda-9.0/bin/gcc
sudo ln -sf /usr/bin/g++-5 /usr/local/cuda-9.0/bin/g++

ying-tiger-cai on 12 Dec 2019

> pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

I am trying to install apex on Windows 10. I cloned apex from its repo, and when I run the above command I get this error: ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found. Do you have any idea how to resolve this?
python 3.6
gcc 5.3.0
torch 1.0.1

Mahhos on 29 Jan 2020

@Mahhos Could you check your current working directory for the setup.py file?
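pip's message means the directory it was given (`.`, the current directory) contains neither of the two files it names, which most likely means the command was run outside the cloned apex folder. A minimal sketch of the check suggested here, using only `pathlib`; the function name is hypothetical:

```python
from pathlib import Path
from typing import Union


def is_installable_project(directory: Union[str, Path] = ".") -> bool:
    """Return True if `directory` contains setup.py or pyproject.toml.

    These are the two files named in pip's "is not installable" error.
    """
    root = Path(directory)
    return (root / "setup.py").is_file() or (root / "pyproject.toml").is_file()


print("current directory:", Path.cwd().resolve())
print("installable:", is_installable_project())
```

If this prints `False`, change into the apex directory (the one containing setup.py) before rerunning the pip command.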

ptrblck on 5 Feb 2020
