I'm compiling jaxlib with CUDA 10.0 on Ubuntu 18.04. The build fails with the following error:
$ python3 build/build.py --enable_cuda --cuda_path /usr/local/cuda-10.0/ --cudnn_path /usr/local/cuda-10.0/ --enable_march_native
[...]
ERROR: /home/clem/.cache/bazel/_bazel_clem/ffaac3f7c6ad1cb26f04f1933452eef6/external/nccl_archive/BUILD.bazel:53:1: error while parsing .d file: /home/clem/.cache/bazel/_bazel_clem/ffaac3f7c6ad1cb26f04f1933452eef6/execroot/__main__/bazel-out/k8-opt/bin/external/nccl_archive/_objs/device_lib/prod_i32_reduce_scatter.cu.d (No such file or directory)
nvcc fatal : Could not open input file /tmp/tmpxft_00000004_00000000-6_prod_i32_reduce_scatter.cu.compute_35.cpp1.ii
Target //build:install_xla_in_source_tree failed to build
INFO: Elapsed time: 278.116s, Critical Path: 69.60s
INFO: 1281 processes: 1281 linux-sandbox.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
Traceback (most recent call last):
  File "build/build.py", line 331, in <module>
    main()
  File "build/build.py", line 326, in main
    [":install_xla_in_source_tree", os.getcwd()])
  File "build/build.py", line 50, in shell
    output = subprocess.check_output(cmd)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['./bazel-0.24.1-linux-x86_64', 'run', '--verbose_failures=true', '--config=opt', '--config=mkl_open_source_only', '--config=cuda', ':install_xla_in_source_tree', '/home/clem/git/jax/build']' returned non-zero exit status 1.
Above this error message there are only compiler warnings, and no errors that would explain why a file is not being created. Am I missing something? Or might there be a file name bug? Thanks a lot for your help!
I'm on a fresh Ubuntu 18.04.2 install with CUDA 10.0, cuDNN, and driver version 410.48.
Full log
I saw this too. It seems to be nondeterministic and related to nvcc, but I didn't have time to track down the problem. Try running the build again, and it should make more progress.
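For anyone who wants to automate the "run the build again" workaround, here is a minimal sketch. It assumes the failure is transient and that a rerun picks up where the previous one stopped, as described above. The helper name `run_with_retries` and its parameters are hypothetical and not part of `build/build.py`.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=10, delay_seconds=1.0):
    """Run cmd repeatedly until it exits with status 0.

    Returns the number of attempts used on success and raises
    RuntimeError if every attempt fails. (Hypothetical helper.)
    """
    for attempt in range(1, max_attempts + 1):
        print(f"Attempt {attempt}/{max_attempts}: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        # Pause briefly before retrying, but not after the final attempt.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Example with the command from the original report (run from the jax checkout):
# run_with_retries([
#     sys.executable, "build/build.py", "--enable_cuda",
#     "--cuda_path", "/usr/local/cuda-10.0/",
#     "--cudnn_path", "/usr/local/cuda-10.0/",
#     "--enable_march_native",
# ])
```

This only papers over the nondeterministic failure; it does not address whatever causes nvcc to lose its intermediate file.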
Thanks for the advice. I had to restart the compilation ~10 times and finally it finished.
However, after installing jaxlib and jax the xla backend does not find my GPU and falls back to CPU. Could this be related?
No, I think the two are unrelated. One is a build problem, the other is a runtime problem.
Are you sure it's using the right jaxlib (i.e., the one you just built)? You can install it locally with pip install -e jax/build.
(You might also try a prebuilt jaxlib wheel; there are links to CUDA 10 wheels in the JAX GitHub README.md.)
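One way to check which jaxlib Python actually picks up is to ask the import system for its location. The sketch below is an illustration only: `locate_module` and `report_jax_install` are hypothetical helper names, and the device listing assumes a jax version that provides `jax.devices()`.

```python
import importlib.util


def locate_module(name):
    """Return the path Python would import `name` from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin is not None:
        return spec.origin
    # Namespace packages have no origin; fall back to their search path.
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


def report_jax_install():
    """Print where jax and jaxlib are loaded from, then the visible devices."""
    paths = {name: locate_module(name) for name in ("jax", "jaxlib")}
    for name, path in paths.items():
        print(f"{name}: {path or 'not installed'}")
    if all(paths.values()):
        import jax  # imported lazily so the check works without jax installed
        # Assumption: this jax version exposes jax.devices().
        print("devices:", jax.devices())
```

Calling `report_jax_install()` in the same interpreter used for the experiments shows whether jaxlib resolves to the local build directory or to a wheel in site-packages.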
Thank you again, I'll look into it.
I have tried the prebuilt jaxlib wheels without success¹. The README states that their GPU support is experimental, which is why I tried building it myself.
¹ _without my GPU being detected, that is_
I've had this bug for a few months too, but never got around to reporting it. If I resume compilation, it usually progresses until another failure on a file in that directory. (But as reported above, it does eventually work.) My guess is that it's a race condition on a lot of requirements in nccl. My build machine has 20 threads (bazel uses all of them), and it happens most of the time.
I don't have any issues detecting my GPU with CUDA 10.0 and cuDNN 7.5 on Ubuntu 18.04.
Thank you @kroq-gar78 for the information. I posted a separate issue on that as it apparently does not correlate to the compilation error.
Edit: It indeed does not, see #993.
PR #1096 seems to have fixed this for me. Hope that helps!