Hi,
I have been having trouble getting JAX to work on CentOS 7 and was wondering if I could have some help. If I manage to get it working I'll document how I did it here.
First, some build system stats:
Currently Loaded Modulefiles:
1) cuda/10.1.105_418.39 3) gcc/8.3.0 5) slurm/18.08.8
2) cudnn/v7.6.2-cuda-10.1 4) lib/openblas/0.2.19-haswell 6) openmpi/1.10.7-hfi
conda: 4.8.1
python: 3.7.6
Following @shoyer's advice in #1948, I switched to a self-built Bazel. This eliminated some issues, but now I have others.
The advice in #1659 of setting the correct cuda path was not applicable to me.
For the record I have TensorFlow working with GPUs from a wheel. I also have JAX working for CPU-only from the conda-forge version.
Here is my current command and problem:
Command:
python build/build.py --enable_march_native --enable_cuda --cuda_path /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --cudnn_path /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1 --bazel_path /mnt/home/mcranmer/Downloads/bazel/output/bazel 2>&1 > bazel_build_log.txt
Output:
WARNING: Output base '/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd' is on NFS. This may lead to surprising failures and undetermined behavior.
Starting local Bazel server and connecting to it...
WARNING: Output base '/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd' is on NFS. This may lead to surprising failures and undetermined behavior.
INFO: Options provided by the client:
Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'run' from /mnt/home/mcranmer/Downloads/jax/.bazelrc:
Inherited 'build' options: --repo_env PYTHON_BIN_PATH=/mnt/home/mcranmer/miniconda3/envs/main2/bin/python --python_path=/mnt/home/mcranmer/miniconda3/envs/main2/bin/python --repo_env TF_NEED_CUDA=1 --distinct_host_configuration=false --copt=-Wno-sign-compare -c opt --apple_platform_type=macos --macos_minimum_os=10.9 --announce_rc --define=no_aws_support=true --define=no_gcp_support=true --define=no_hdfs_support=true --define=no_kafka_support=true --define=no_ignite_support=true --define=grpc_no_ares=true --spawn_strategy=standalone --strategy=Genrule=standalone --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --action_env CUDA_TOOLKIT_PATH=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --action_env CUDNN_INSTALL_PATH=/cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1
INFO: Found applicable config definition build:opt in file /mnt/home/mcranmer/Downloads/jax/.bazelrc: --copt=-march=native --host_copt=-march=native
INFO: Found applicable config definition build:mkl_open_source_only in file /mnt/home/mcranmer/Downloads/jax/.bazelrc: --define=tensorflow_mkldnn_contraction_kernel=1
INFO: Found applicable config definition build:cuda in file /mnt/home/mcranmer/Downloads/jax/.bazelrc: --crosstool_top=@local_config_cuda//crosstool:toolchain --define=using_cuda=true --define=using_cuda_nvcc=true
Loading:
Loading: 0 packages loaded
currently loading: build
INFO: Call stack for the definition of repository 'local_config_cuda' which is a cuda_configure (rule definition at /mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl:1306:18):
- /mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/tensorflow/workspace.bzl:87:5
- /mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/tensorflow/workspace.bzl:77:5
- /mnt/home/mcranmer/Downloads/jax/WORKSPACE:46:1
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
Traceback (most recent call last):
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1304
_create_local_cuda_repository(<1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1006, in _create_local_cuda_repository
_get_cuda_config(repository_ctx)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 729, in _get_cuda_config
find_cuda_config(repository_ctx, <1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 709, in find_cuda_config
auto_configure_fail(<1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 340, in auto_configure_fail
fail(<1 more arguments>)
Cuda Configuration Error: Failed to run find_cuda_config.py: Could not find any cublas_api.h matching version '' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/cm/local/apps/cmd/lib'
'/cm/local/apps/mysql++/current/lib'
'/lib'
'/lib64'
'/opt/dell/srvadmin/lib64'
'/opt/dell/srvadmin/lib64/openmanage'
'/opt/dell/srvadmin/lib64/openmanage/smpop'
'/opt/dell/toolkit/bin'
'/usr'
'/usr/lib64//bind9-export'
'/usr/lib64/R/lib'
'/usr/lib64/atlas'
'/usr/lib64/dyninst'
'/usr/lib64/mysql'
'/usr/lib64/octave/3.8.2'
'/usr/lib64/tcl8.5'
'/usr/lib64/vtk'
ERROR: Skipping ':install_xla_in_source_tree': no such package '@local_config_cuda//cuda': Traceback (most recent call last):
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1304
_create_local_cuda_repository(<1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1006, in _create_local_cuda_repository
_get_cuda_config(repository_ctx)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 729, in _get_cuda_config
find_cuda_config(repository_ctx, <1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 709, in find_cuda_config
auto_configure_fail(<1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 340, in auto_configure_fail
fail(<1 more arguments>)
Cuda Configuration Error: Failed to run find_cuda_config.py: Could not find any cublas_api.h matching version '' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/cm/local/apps/cmd/lib'
'/cm/local/apps/mysql++/current/lib'
'/lib'
'/lib64'
'/opt/dell/srvadmin/lib64'
'/opt/dell/srvadmin/lib64/openmanage'
'/opt/dell/srvadmin/lib64/openmanage/smpop'
'/opt/dell/toolkit/bin'
'/usr'
'/usr/lib64//bind9-export'
'/usr/lib64/R/lib'
'/usr/lib64/atlas'
'/usr/lib64/dyninst'
'/usr/lib64/mysql'
'/usr/lib64/octave/3.8.2'
'/usr/lib64/tcl8.5'
'/usr/lib64/vtk'
WARNING: Target pattern parsing failed.
ERROR: no such package '@local_config_cuda//cuda': Traceback (most recent call last):
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1304
_create_local_cuda_repository(<1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 1006, in _create_local_cuda_repository
_get_cuda_config(repository_ctx)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 729, in _get_cuda_config
find_cuda_config(repository_ctx, <1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 709, in find_cuda_config
auto_configure_fail(<1 more arguments>)
File "/mnt/home/mcranmer/.cache/bazel/_bazel_mcranmer/aa4ae3007acde055b5e4c61fdbbf8dbd/external/org_tensorflow/third_party/gpus/cuda_configure.bzl", line 340, in auto_configure_fail
fail(<1 more arguments>)
Cuda Configuration Error: Failed to run find_cuda_config.py: Could not find any cublas_api.h matching version '' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/cm/local/apps/cmd/lib'
'/cm/local/apps/mysql++/current/lib'
'/lib'
'/lib64'
'/opt/dell/srvadmin/lib64'
'/opt/dell/srvadmin/lib64/openmanage'
'/opt/dell/srvadmin/lib64/openmanage/smpop'
'/opt/dell/toolkit/bin'
'/usr'
'/usr/lib64//bind9-export'
'/usr/lib64/R/lib'
'/usr/lib64/atlas'
'/usr/lib64/dyninst'
'/usr/lib64/mysql'
'/usr/lib64/octave/3.8.2'
'/usr/lib64/tcl8.5'
'/usr/lib64/vtk'
INFO: Elapsed time: 618.998s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully (0 packages loaded)
     _   _  __  __
    | | / \ \ \/ /
 _  | |/ _ \ \  /
| |_| / ___ \/  \
 \___/_/   \/_/\_\
Bazel binary path: /mnt/home/mcranmer/Downloads/bazel/output/bazel
Python binary path: /mnt/home/mcranmer/miniconda3/envs/main2/bin/python
MKL-DNN enabled: yes
-march=native: yes
CUDA enabled: yes
CUDA toolkit path: /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
CUDNN library path: /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1
Building XLA and installing it in the jaxlib source tree...
/mnt/home/mcranmer/Downloads/bazel/output/bazel run --verbose_failures=true --config=opt --config=mkl_open_source_only --config=cuda --define=xla_python_enable_gpu=true :install_xla_in_source_tree /mnt/home/mcranmer/Downloads/jax/build
Traceback (most recent call last):
File "build/build.py", line 351, in <module>
main()
File "build/build.py", line 346, in main
shell(command)
File "build/build.py", line 50, in shell
output = subprocess.check_output(cmd)
File "/mnt/home/mcranmer/miniconda3/envs/main2/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/mnt/home/mcranmer/miniconda3/envs/main2/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/mnt/home/mcranmer/Downloads/bazel/output/bazel', 'run', '--verbose_failures=true', '--config=opt', '--config=mkl_open_source_only', '--config=cuda', '--define=xla_python_enable_gpu=true', ':install_xla_in_source_tree', '/mnt/home/mcranmer/Downloads/jax/build']' returned non-zero exit status 1.
I've also seen some other build issues - they seem to come and go. I'll document them if I save the log next time.
Please let me know if you have any tips.
Thanks!
Miles
Do you know if you have cublas installed on your system? That's what your error message is about.
It should be all there:
(main2) ➜ ls | grep cublas
cublas_api.h
cublas.h
cublasLt.h
cublas_v2.h
cublasXt.h
(main2) ➜ pwd
/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/targets/x86_64-linux/include
Looking through https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/cuda_configure.bzl, I started to wonder if some of these environment variables would be needed. I set this one:
export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
and now I can get a bit further in the build process. Will update this post when it's finished (hopefully successfully).
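Before starting another long build, it may help to check whether a candidate CUDA root actually exposes cublas_api.h in one of the subdirectories named in the error message. The following is a minimal sketch under stated assumptions: has_cublas_header is a hypothetical helper name, it only checks the literal subdirectories from the log (it skips the 'include/*-linux-gnu' glob), and it is not the logic of find_cuda_config.py itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of JAX or TensorFlow): print the first
# cublas_api.h found under a candidate CUDA root, checking the literal
# subdirectories that appear in the error message above.
has_cublas_header() {
  local root="$1" sub
  for sub in "" include include/cuda extras/CUPTI/include include/cuda/CUPTI; do
    # ${sub:+$sub/} expands to "sub/" when sub is non-empty, else to nothing.
    if [ -f "${root}/${sub:+$sub/}cublas_api.h" ]; then
      echo "${root}/${sub:+$sub/}cublas_api.h"
      return 0
    fi
  done
  return 1
}

# Example use (the path is the one from this thread; adjust for your system):
#   if has_cublas_header /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39; then
#     export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
#   fi
```

If the helper prints nothing for the folder you expect, the header probably lives somewhere the search does not look, and pointing TF_CUDA_PATHS at a different root is worth trying.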
Okay, I finally built JAX on CentOS 7 and confirmed GPU support!!
(I use the Bazel built by the JAX installer, rather than using one I built from source as I previously tried)
Here's my solution.
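Have CUDA, cuDNN, gcc, and OpenMPI installed. Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables.
Find the CUDA install folder. If you echo $LD_LIBRARY_PATH, it should be the folder before lib64 for the CUDA files. Mine is here: /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39/lib64.
Set TF_CUDA_PATHS to that folder:
export TF_CUDA_PATHS=/cm/shared/sw/pkg/devel/cuda/10.1.105_418.39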
This is the folder with the following (yours might be slightly different, but should have include and lib64):
(main2) ➜ ~ ls /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39
bin extras lib64 NsightCompute-2019.1 nvml share tools
doc include libnsight nsightee_plugins nvvm src version.txt
EULA.txt jre libnvvp NsightSystems-2018.3 samples targets
Find the cuDNN install folder. Mine is /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1. This folder contains include and lib64.
Clone JAX, then run the build from the repository root:
git clone --depth=1 https://github.com/google/jax
python build/build.py --enable_march_native --enable_cuda --cuda_path /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 --cudnn_path /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1
For your system, change /cm/shared/sw/pkg/devel/cuda/10.1.105_418.39 to the folder you set in TF_CUDA_PATHS, and /cm/shared/sw/pkg/devel/cudnn/v7.6.2-cuda-10.1 to the folder you found for cuDNN.
Finally, run pip install -e build and then pip install -e .
Thank you for sharing these great instructions!
This didn't work for me on CentOS 7; it looks like a Bazel issue. Did you build Bazel from source or just install it with yum? And which version of Bazel did you use?
I think the JAX installer will build its own Bazel. I ended up just using that one.
Ah I see, thanks for clarifying - got confused by your comment in the original post that you tried an externally built one.
Oops, sorry, let me make it clear in the instructions for future users!
Hi,
Thank you for sharing.
I couldn't get it to work.
Could you explain what you mean by "Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables."?
Also, I don't have a cudnn directory. The cudnn.h file is in /usr/include and I'm using that.
"Make sure the relevant folders are on your LIBRARY_PATH/LD_LIBRARY_PATH/CPATH/PATH environment variables."?
He means that the folders for all the requirements from the step above ("CUDA, cuDNN, gcc, openMPI installed.") should be on those environment variables. In your case, if you run echo $LD_LIBRARY_PATH, you should see all the folders that contain those prerequisites, including the one for cuDNN. For example, on my server:
Currently Loaded Modules:
1) CUDA/10.0.130 3) GCCcore/8.3.0 5) binutils/2.32-GCCcore-8.3.0 7) numactl/2.0.12-GCCcore-8.3.0 9) libxml2/2.9.9-GCCcore-8.3.0 11) hwloc/2.0.3-GCCcore-8.3.0
2) cuDNN/7.6.4.38-CUDA-10.0.130 4) zlib/1.2.11-GCCcore-8.3.0 6) GCC/8.3.0-2.32 8) XZ/5.2.4-GCCcore-8.3.0 10) libpciaccess/0.14-GCCcore-8.3.0 12) OpenMPI/4.0.1-GCC-8.3.0-2.32
And:
echo $LD_LIBRARY_PATH
/blablab/el7/OpenMPI/4.0.1-GCC-8.3.0-2.32/lib:/blbalba/el7/hwloc/2.0.3-GCCcore-8.3.0/lib:/appl/opt/libpciaccess/0.14-GCCcore-8.3.0/lib:/blbalba/libxml2/2.9.9-GCCcore-8.3.0/lib:/blbalba/XZ/5.2.4-GCCcore-8.3.0/lib:/blbalba/numactl/2.0.12-GCCcore-8.3.0/lib:/blbalba/binutils/2.32-GCCcore-8.3.0/lib:/blbalba/zlib/1.2.11-GCCcore-8.3.0/lib:/blbalba/GCCcore/8.3.0/lib64:/blbalba/GCCcore/8.3.0/lib:/blbalba/cuDNN/7.6.4.38-CUDA-10.0.130/lib64:/blbalba/CUDA/10.0.130/nvvm/lib64:/blbalba/CUDA/10.0.130/extras/CUPTI/lib64:/blbalba/CUDA/10.0.130/lib64:/blbalba/centos/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/lib:/blbalba/centos/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/lib:/blbalba/centos/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib:
So you can see the cuDNN path here is: /blbalba/cuDNN/7.6.4.38-CUDA-10.0.130
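Reading the install root out of a long LD_LIBRARY_PATH by eye is error-prone, so here is a rough sketch of doing it with standard tools. It assumes a hypothetical helper name, roots_from_library_path, and assumes the install root is simply the entry with a trailing /lib64 or /lib removed, which matches the paths shown in this thread but may not hold everywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list candidate install roots from a colon-separated
# library path. Entries are matched case-insensitively against a pattern,
# and a trailing /lib64 or /lib component is stripped.
# Usage: roots_from_library_path PATTERN [PATH_LIST]
# PATH_LIST defaults to the current LD_LIBRARY_PATH.
roots_from_library_path() {
  local pattern="$1" path_list="${2:-${LD_LIBRARY_PATH:-}}"
  printf '%s\n' "$path_list" \
    | tr ':' '\n' \
    | grep -i -- "$pattern" \
    | sed -E 's#/lib(64)?/?$##' \
    | sort -u
}

# Example use:
#   roots_from_library_path cudnn
```

Treat the output as candidates only. A pattern such as cuda will also match a cuDNN folder whose name contains the CUDA version, so check that the folder you pick really contains include and lib64 before passing it to the build.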