Singularity: tensorflow singularity container with gpu fails with new kernel on centos 7

Created on 7 Oct 2018 · 5 Comments · Source: hpcng/singularity

Version of Singularity:

Host - CentOS7:

$ uname -a
Linux XXX 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -qa | grep singularity
singularity-2.6.0-1.1.el7.x86_64
singularity-runtime-2.6.0-1.1.el7.x86_64

Expected behavior

on prior kernels (well, i have only tested on 3.10.0-693.21.1.el7.x86_64...) i can run a singularity container with tensorflow:

$ singularity exec --nv <image> python
Python 3.6.3 (default, Mar 20 2018, 13:50:41)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf; print(tf.__version__)
1.11.0
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

and i would see all my gpus.

the container is built from the docker image nvidia/cuda:9.0-cudnn7-runtime-centos7 and has SCL python 3.6.3 plus a pip install of tensorflow.

Actual behavior

however under kernel 3.10.0-862.14.4.el7.x86_64 the same image and commands result in

>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-10-06 21:37:45.602641: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-10-06 21:37:45.619262: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2018-10-06 21:37:45.619351: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: ocio-gpu50.slac.stanford.edu
2018-10-06 21:37:45.619400: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: YYYY
2018-10-06 21:37:45.619477: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: Invalid argument: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
2018-10-06 21:37:45.619561: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.42.0
Device mapping: no known devices.
2018-10-06 21:37:45.625107: I tensorflow/core/common_runtime/direct_session.cc:291] Device mapping:

i also tried upgrading the nvidia drivers to the latest 390.87 but the same error persists.

as a result i cannot use tensorflow-gpu under the new kernel.

Hacktoberfest Question help wanted

Source

yee379

Most helpful comment

Hey @kaczmarj yep - confirmed. although in my case i run

python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))'

i.e. after a fresh reboot, if i run the singularity container with the exact same code, it errors out. however, if i then run the same command on the native host (which succeeds) and then run the singularity container again with the above code... it works!

i was wondering if some microcode got pushed when i upgraded my kernel from

3.10.0-693.21.1.el7.x86_64

to

3.10.0-862.14.4.el7.x86_64

yee379 on 15 Oct 2018

👍3 🎉2

All 5 comments

@yee379 - i also ran into the cuInit: CUDA_ERROR_UNKNOWN, but it was not due to our cluster's kernel version. frankly, we do not know the cause of the error, but we did find a workaround: run a simple cuda-enabled pytorch snippet (outside of a container) before running tensorflow.

import torch
print(torch.rand(2,3).cuda())

would you mind trying this potential solution?

kaczmarj on 13 Oct 2018


Is this from the /dev/nvidia-uvm device that doesn't get automatically created?

See: https://github.com/sylabs/singularity/issues/1441#issuecomment-379029288

This looks to be a similar issue...
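
A quick way to test that theory on the host is to look for the device nodes before launching the container. The following is only a minimal sketch: the list of expected nodes (/dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia0) and the helper name missing_device_nodes are assumptions for illustration, and the exact set of nodes depends on the driver and the number of GPUs.

```python
import os
import stat

# Assumed set of nodes for a single-GPU host; adjust for your machine.
EXPECTED_NODES = ["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"]


def missing_device_nodes(paths=EXPECTED_NODES):
    """Return the paths that are absent or are not character devices."""
    missing = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Path does not exist (or cannot be stat'ed).
            missing.append(path)
            continue
        if not stat.S_ISCHR(mode):
            missing.append(path)
    return missing


if __name__ == "__main__":
    absent = missing_device_nodes()
    if absent:
        print("missing NVIDIA device nodes:", ", ".join(absent))
    else:
        print("all expected NVIDIA device nodes are present")
```

If /dev/nvidia-uvm shows up as missing after a fresh reboot and appears once a CUDA program has run on the host, that would be consistent with the behaviour described above.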

jmstover on 15 Oct 2018

I confirm that running python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))' on the host before starting the singularity container resolved the cuInit: CUDA_ERROR_UNKNOWN error. Thank you @kaczmarj and @yee379.

Also, just running nvidia-modprobe -u -c=0 on the host, as suggested on one forum, does the trick and does not require tensorflow to be installed system-wide on the host!
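
For anyone who wants to script this, here is a rough sketch of a wrapper that runs nvidia-modprobe on the host and then starts the container. The function names build_commands and run_with_uvm are made up for this example, and it assumes that both nvidia-modprobe and singularity are on the PATH and that the calling user is allowed to run them.

```python
import shutil
import subprocess


def build_commands(image, inner_cmd):
    """Build the host-side modprobe command and the container command."""
    # -u loads nvidia-uvm and creates its device node; -c=0 creates /dev/nvidia0.
    modprobe = ["nvidia-modprobe", "-u", "-c=0"]
    run = ["singularity", "exec", "--nv", image] + list(inner_cmd)
    return modprobe, run


def run_with_uvm(image, inner_cmd):
    """Create the device nodes on the host, then run inner_cmd in the container."""
    modprobe, run = build_commands(image, inner_cmd)
    if shutil.which(modprobe[0]) is None:
        raise RuntimeError("nvidia-modprobe not found on PATH")
    # Fail early if the device nodes could not be created.
    subprocess.run(modprobe, check=True)
    # Return the container's exit code to the caller.
    return subprocess.run(run, check=False).returncode
```

It could be called as run_with_uvm("<image>", ["python", "-c", "import tensorflow as tf"]), with the image path filled in for your own setup.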