Host - CentOS7:
$ uname -a
Linux XXX 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -qa | grep singularity
singularity-2.6.0-1.1.el7.x86_64
singularity-runtime-2.6.0-1.1.el7.x86_64
On prior kernels (well, I have only tested on 3.10.0-693.21.1.el7.x86_64...) I can run a singularity container with tensorflow:
$ singularity exec --nv <image> python
Python 3.6.3 (default, Mar 20 2018, 13:50:41)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf; print(tf.__version__)
1.11.0
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
and I would see all my GPUs.
The container is based on the docker image nvidia/cuda:9.0-cudnn7-runtime-centos7 and has SCL python 3.6.3 with a pip install of tensorflow.
However, under kernel 3.10.0-862.14.4.el7.x86_64 the same image and commands result in
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-10-06 21:37:45.602641: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-10-06 21:37:45.619262: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2018-10-06 21:37:45.619351: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: ocio-gpu50.slac.stanford.edu
2018-10-06 21:37:45.619400: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: YYYY
2018-10-06 21:37:45.619477: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: Invalid argument: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
2018-10-06 21:37:45.619561: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.42.0
Device mapping: no known devices.
2018-10-06 21:37:45.625107: I tensorflow/core/common_runtime/direct_session.cc:291] Device mapping:
I also tried upgrading the nvidia drivers to the latest 390.87, but the same error persists.
As a result I cannot use tensorflow-gpu inside the container under the new kernel.
@yee379 - I also ran into the cuInit: CUDA_ERROR_UNKNOWN, but it was not due to our cluster's kernel version. Frankly, we do not know the cause of the error, but we did find a workaround: run a simple CUDA-enabled pytorch snippet (outside of a container) before running tensorflow.
import torch
print(torch.rand(2,3).cuda())
would you mind trying this potential solution?
Hey @kaczmarj, yep, confirmed. Although in my case I run
python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))'
i.e. upon a fresh reboot, if I run a singularity container with the exact same code, it errors out. However, if I then run the same command on the native host (which succeeds) and then run the singularity container again with the above code... it works!
I was wondering if some microcode got pushed when I upgraded my kernel from
3.10.0-693.21.1.el7.x86_64
to
3.10.0-862.14.4.el7.x86_64
Is this from the /dev/nvidia-uvm device that doesn't get automatically created?
See: https://github.com/sylabs/singularity/issues/1441#issuecomment-379029288
This looks to be a similar issue...
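To check the device-node theory on a freshly rebooted host, something like the following sketch might help. The list of expected nodes is an assumption made for illustration (nvidia0 presumes at least one GPU), and missing_nvidia_nodes is a hypothetical helper, not part of singularity or the NVIDIA tooling.

```python
import os

# Device nodes a CUDA program typically expects to find.
# ASSUMPTION: this list is illustrative; nvidia0 presumes at least one GPU.
EXPECTED_NODES = ("nvidiactl", "nvidia0", "nvidia-uvm")


def missing_nvidia_nodes(dev_dir="/dev", expected=EXPECTED_NODES):
    """Return the names in `expected` that do not exist under dev_dir."""
    return [
        name
        for name in expected
        if not os.path.exists(os.path.join(dev_dir, name))
    ]


# Example: print(missing_nvidia_nodes()) on the host right after a reboot,
# then again after running any CUDA program natively, and compare.
```

If nvidia-uvm shows up in the first result but not the second, that would be consistent with the node not being created until something on the host initializes CUDA.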
I confirm that running python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))' on the host before running the actual singularity container resolved the cuInit: CUDA_ERROR_UNKNOWN error, thank you @kaczmarj and @yee379.
Also, just running nvidia-modprobe -u -c=0 on the host, as suggested on one forum, does the trick and does not require tensorflow to be installed system-wide on the host!
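For anyone wanting to script this (e.g. in a job prolog), here is a minimal sketch that only calls nvidia-modprobe when the node is absent. It assumes /dev/nvidia-uvm is the node in question, as suggested earlier in the thread; ensure_uvm_device and its run parameter are hypothetical names introduced here so the logic can be exercised without a GPU.

```python
import os
import subprocess

# The exact command quoted above in this thread.
MODPROBE_CMD = ["nvidia-modprobe", "-u", "-c=0"]


def ensure_uvm_device(dev_path="/dev/nvidia-uvm", run=subprocess.run):
    """Invoke nvidia-modprobe if dev_path is missing.

    ASSUMPTION: dev_path is the node whose absence triggers the cuInit error.
    Returns True if the command was invoked, False if the node already existed.
    """
    if os.path.exists(dev_path):
        return False
    # check=True raises CalledProcessError if nvidia-modprobe fails.
    run(MODPROBE_CMD, check=True)
    return True
```

Calling ensure_uvm_device() on the host before singularity exec --nv should be harmless when the node already exists, since nothing is executed in that case.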
Closing this as a dupe of #1441