singularity 🚀 - Nvidia GPU support with different driver versions?

Hi @thiell,

Singularity can already support GPUs natively, but it takes a few custom bind-path configurations and possibly a custom LD_LIBRARY_PATH. It isn't as transparent as doing it automatically (which we also hope to do soon), but it does work very well (so I'm told, as I haven't done it myself).

Thanks!

gmkurtzer on 17 Apr 2017

Thanks @gmkurtzer!

thiell on 17 Apr 2017

My pleasure. Let me know how it goes for ya!

gmkurtzer on 17 Apr 2017

Hiya @thiell. Until this is supported by Singularity, you might try using this script:

https://github.com/NIH-HPC/gpu4singularity

It should update the drivers in an existing container or install drivers if none exist.

GodloveD on 18 Apr 2017

Heya guys... I wrote a quick "hack" to experiment with what is necessary for proper GPU support inside of containers. It currently exists in the development branch; the code is in /etc/singularity/init and is triggered by the command line switch --nv.

Would you mind testing that and telling me how it works for you?

Thank you!

gmkurtzer on 19 Apr 2017

Hi @gmkurtzer!

What's the status of the NVIDIA contribution for this? I heard they were trying to get something in sync with the 2.3 release.

kcgthb on 19 Apr 2017

This is a great idea and it should work. But there are a few kinks to work out. See my comment on https://github.com/singularityware/singularity/commit/364895f664ef288307c068416eeb0467bf67be7d.

GodloveD on 19 Apr 2017

@kcgthb: Yes, Nvidia has been working on a generic container library, and their idea is to have Singularity link to that. But Singularity runs as SUID by default (and needs to in order to get access to all critical features), so any external library we link to would technically open up a security vector. Instead I am hoping to (with their help) implement the necessary features directly within Singularity proper.

@GodloveD: Thanks! I will start working on moving it to C where I should be able to address your second point shortly.

gmkurtzer on 19 Apr 2017

@gmkurtzer Awesome. Yes, I think this idea should work. I tried a similar thing many months ago, but I bind-mounted entire directories and not individual files. This gave me limited success (nvidia-smi worked), but, as you can probably guess, CUDA failed because it saw the wrong libraries on the path (I think). Anyhow, I'm pretty confident that this will work if you circumvent the overlay fs gotcha.
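
A minimal sketch of the difference described here: binding each driver library as an individual file instead of exposing a whole host directory. The -B flag is the bind option used elsewhere in this thread; the library paths, the /usr/local/nvidia/lib target directory, and the helper name bind_args are hypothetical, and this is not a description of how --nv itself is implemented.

```python
import posixpath


def bind_args(host_libs, container_dir):
    """Build one `-B src:dest` pair per library file (not per directory)."""
    args = []
    for src in host_libs:
        dest = posixpath.join(container_dir, posixpath.basename(src))
        args += ["-B", f"{src}:{dest}"]
    return args


# Hypothetical host paths; substitute whatever actually exists on the host.
libs = ["/usr/lib64/libcuda.so.1", "/usr/lib64/libnvidia-ml.so.1"]
print(bind_args(libs, "/usr/local/nvidia/lib"))
```

The returned list is meant to be spliced into a singularity command line, so only the named files become visible in the container and nothing else from the host directory can shadow the container's own libraries.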

GodloveD on 19 Apr 2017

@gmkurtzer: thanks for the details, very interesting, I didn't realize the security implications. Wasn't there a plan to move away from SUID and use a narrower set of privileges (with CAPABILITIES or something)?

I'm wondering about the workaround you're proposing, though: doesn't this defeat the very purpose of containerization? I mean, if we expose and use host-level libs within the container, it will potentially be a source of conflicts and mismatches, don't you think? To me, the main goal of running an application in a container is to be able to abstract from the host's software and be self-sufficient in terms of libraries. Having a mix of host-level and container-level system libraries seems orthogonal to that.

Plus, it will still require that the apps and libs within the container are compatible with the host libs. For GPU applications, it means that they need to be linked against a CUDA/NVML/driver version that precisely matches what's installed on the host. Which limits the portability of such containers.

In the end, the workaround may add some convenience, but I don't think it really solves the problem of portability. Instead of having to install in the container the precise CUDA/NVML versions that match the host's, users will be able to just use what's installed on the host, but their application will still have to match those libs.

I believe that NVIDIA's solution was to introduce some level of library abstraction that is supposed to relax those lib version constraints, and allow containers and hosts to have somewhat different versions installed, so containers could be portable across a wider range of host settings.

So, I'm not saying the workaround is not interesting nor useful, but I hope it won't make the need for a more generic solution (one that could relax container/host versioning constraints) less important. :)

kcgthb on 19 Apr 2017

@kcgthb I think this is going to give you much more flexibility than you think (provided it works). I apologize if I am misunderstanding, but I think you might be conflating a few different ideas here.

If you dynamically bind the NVIDIA libs into the container the way that @gmkurtzer is suggesting, you are guaranteed to have an exact match between the binaries and libraries within the container and those on the host system. The way I implemented it in my gpu4singularity script you have to be finicky and match things precisely, but this will do it for you. You will still have to install CUDA, cuDNN and anything else you need to run on top of the driver inside the container. But that is always going to remain the case. And CUDA is flexible. You don't have to make sure your CUDA version matches the driver version precisely (there is no such thing). You just have to make sure your driver is recent enough to support the CUDA you are trying to run. I actually think this "workaround" might be a fully-fledged elegant solution once it's all coded up. :smile_cat:
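
The "recent enough" rule can be sketched as a simple version comparison. The function names are hypothetical and the version numbers below are illustrative only; the real minimum driver version for a given CUDA release has to come from NVIDIA's release notes.

```python
def parse_version(text):
    """Turn '367.48' or '361.93.03' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, minimum_required):
    """True if the host driver is at least the minimum the CUDA release needs."""
    return parse_version(driver_version) >= parse_version(minimum_required)


# Illustrative numbers only, not an authoritative compatibility table.
print(driver_supports("375.26", "367.48"))
print(driver_supports("361.93.03", "367.48"))
```

The check is one-directional on purpose: a newer driver passes for an older CUDA requirement, which is the flexibility described above.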

GodloveD on 19 Apr 2017

Hi @kcgthb,

Your point about library compatibility and portability between distros is a good one, but the Nvidia library does not and can not fix that. The libraries that they build (as I understand it) are built against reasonably old and thus compatible versions of glibc, so we get general compatibility across most (if not all) current distributions of Linux.

@GodloveD Can you test the latest development branch? I moved things to the C backend, and I've updated the frontend considerably, so now there is a dedicated action_argparser.sh which does the library discovery. Eventually, we can move that to C as well, but for the time being, this works fine and is easily compatible (as long as the libraries are found via ld's standard paths).

Thanks!

gmkurtzer on 20 Apr 2017

OK, so after installing https://github.com/singularityware/singularity/commit/cb5d6c5ccb46dc386f9401c24dabf86f39d84698, I tried this:

singularity exec -n docker://tensorflow/tensorflow:latest-gpu python -m tensorflow.models.image.mnist.convolutional

and I got this (after all the fancy explosions):

ERROR  : Could not write to /tmp/.singularity-runtime.ka4aN7De/libs/libnvidia-tls.so.367.48: Permission denied
ERROR  : Failed creating file at /tmp/.singularity-runtime.ka4aN7De/libs/libnvidia-tls.so.367.48: Permission denied
ABORT  : Retval = 255

GodloveD on 20 Apr 2017

That is a very weird place for it to error out. I can't seem to replicate it. Can you send me the full debug output as well as the permissions for that directory (/tmp/.singularity-runtime.ka4aN7De/libs)? Thanks!

gmkurtzer on 20 Apr 2017

Oh, I forgot to mention, to see the leftover sessiondir, you must invoke Singularity like this:

SINGULARITY_NOSESSIONCLEANUP=1 singularity -d exec -n docker://ubuntu >/tmp/debug 2>&1

Then, in the debug output file, you will see where the sessiondir is. Can you show me the permissions of both the sessiondir itself and the libs/ directory within it?
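
One way to collect those permissions, as a minimal sketch: the helper names describe and report are hypothetical, and the session directory path has to be taken from the debug output.

```python
import os
import stat


def describe(path):
    """Return mode string, owner uid and group gid for a path, like `ls -ld`."""
    info = os.stat(path)
    return f"{stat.filemode(info.st_mode)} uid={info.st_uid} gid={info.st_gid} {path}"


def report(sessiondir):
    """Describe the session directory and its libs/ subdirectory if present."""
    lines = [describe(sessiondir)]
    libs = os.path.join(sessiondir, "libs")
    if os.path.isdir(libs):
        lines.append(describe(libs))
    return lines
```

Calling report() with the sessiondir path from /tmp/debug and printing the returned lines gives the same information as running ls -ld on both directories.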

Thanks!

gmkurtzer on 20 Apr 2017

Prob should have thought to give you that info in the first place. :stuck_out_tongue_winking_eye:

I pm-ed you the info on slack.

GodloveD on 20 Apr 2017

I must be doing something very stupid. Every time I try to run something like

/path/to/singularity shell -nv docker://tensorflow/tensorflow:latest-gpu

I get

ERROR: Unknown option: -nv

And yes, I installed the dev branch, and even explicitly tried the 1a64d3b. I also set the SINGULARITY_INCLUDEGPU env variable to 1 at configure/compile time and runtime.

Since the exec command mentioned -n, I tried that too (and double dashes --nv etc).

Clues?

davidedelvento on 5 May 2017

hey @davidedelvento could you double check the version?

singularity --version

I just tested with a (slightly dated) and then updated development branch, and I couldn't produce this error:

https://asciinema.org/a/2n8lxkukfa0z2o6pcxp1yki9g?speed=3

vsoch on 6 May 2017

Use either --nv (2 dashes) or -n. ☺

GodloveD on 6 May 2017

good catch @GodloveD !

vsoch on 6 May 2017

Thanks @vsoch for the message. I should have mentioned that on my own...
Thanks also @GodloveD for that; however, those were among the options I had already tried, and they did not work (I retried them just now, just in case).

I installed many versions, but I deleted a few of them. For the ones that I did not delete, the output of --version is one of 2.2.1, 2.2.99-HEAD.g1a64d3b or 2.2.99. Are any of these supposed to work?

davidedelvento on 6 May 2017

I think your install might be off.. can you try removing everything from the old version completely, there is actually a lot hiding:

$ sudo rm -rf /usr/local/libexec/singularity
$ sudo rm -rf /usr/local/etc/singularity
$ sudo rm -rf /usr/local/include/singularity
$ sudo rm -rf /usr/local/lib/singularity
$ sudo rm -rf /usr/local/var/lib/singularity/
$ sudo rm /usr/local/bin/singularity
$ sudo rm /usr/local/bin/run-singularity
$ sudo rm /usr/local/etc/bash_completion.d/singularity 
$ sudo rm /usr/local/man/man1/singularity.1

Then, when you download and install, just do this in a fresh place:

$ git clone -b development https://github.com/singularityware/singularity.git
$ cd singularity
$ ./autogen.sh
$ ./configure --prefix=/usr/local --sysconfdir=/etc
$ make
$ sudo make install

then do singularity --version. You should see something that looks like 2.2.99-development.g945c6ee - the ones above don't seem quite right to me.

vsoch on 6 May 2017

@vsoch Thanks for the suggestion, I will try that and let you know. Just one detail which might be important: my prefix is not, and cannot be, /usr/local because I'm not root on the machine where I'm installing. Therefore I can delete everything in my prefix in one shot. I'm mentioning this just in case the --nv hack makes implicit assumptions about the prefix.

davidedelvento on 6 May 2017

When you install Singularity as root it creates an SUID binary that's used to escalate privileges for some operations. I believe that the --nv option may require privilege escalation because all of the libraries must be bind mounted into the container regardless of whether or not the appropriate places to mount them already exist. This might be different if you have access to an overlay file system(?) but what I'm trying to say is that if you are installing Singularity on a machine where you don't have root access --nv may not work.
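
A quick way to check whether a given install has such a binary, as a minimal sketch: the helper name is_setuid_root is hypothetical, and the thread does not name the SUID binary, so the path to inspect is left to the reader.

```python
import os
import stat


def is_setuid_root(path):
    """True if the file is owned by root (uid 0) and has the setuid bit set."""
    info = os.stat(path)
    return info.st_uid == 0 and bool(info.st_mode & stat.S_ISUID)
```

A binary installed by an unprivileged user into a private prefix would be expected to return False here, which is consistent with privileged operations failing for that install.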

GodloveD on 6 May 2017

I think @GodloveD might have hit the nail on the head. To test, could you try running with --debug, and then without those additional args?

vsoch on 6 May 2017

And do you have a workstation where you can be root to create and bootstrap images?

vsoch on 6 May 2017

@vsoch Yes, I have a workstation where I can be root, however that one does not have GPUs. Does it matter?

davidedelvento on 7 May 2017

Well, it won't replicate the error entirely, so we can't really test it. What we can do is figure out whether it has to do with your installation, specifically the fact that you don't have sudo (and nothing to do with GPU/Nvidia at all). If you can, try the same command in your (non-sudo install) location with the non-GPU image:

/path/to/singularity shell docker://tensorflow/tensorflow:latest

If that doesn't work, we know for sure it's the install, and perhaps you can ask a system admin to install it for you?

And if you were to try the above on your (sudo) workstation, it would work too :)

Probably either way, your best bet is going to be to wait for a version (2.3) that your admin is comfortable installing, and have him/her install it. Probability says it's a him, lol.

vsoch on 7 May 2017

@vsoch I tried to run the non-gpu version. It downloaded a lot of stuff and it took quite a few minutes. After that, it failed with

ERROR  : User namespace not supported, and program not running privileged.
ABORT  : Retval = 255

My comments on this:

- The check should be done before the download.
- One of the design goals of Singularity is not requiring root like Docker does, and now you're telling me "well, we do too"? I understand this is less than Docker requires, but my sysadmins (yep, an all-male group) will not install it.

Thanks a lot for your help sorting this out, really appreciated.

davidedelvento on 8 May 2017

I apologize for the misunderstanding. To run Singularity on a shared resource, an administrator generally has to install the base software; after that, users are empowered to run and use images without root. The idea is that software that would normally require root to install can live and run in an image that the user created off of the cluster. The exception is creating an image and then importing from Docker, which does not require root. Singularity does not let the user escalate to root as Docker does. That is the primary difference.

I hope that your group gives it a second go at some point! Feel free to close the issue if it's resolved for now, or wait for feedback from others.

vsoch on 8 May 2017

Hi @gmkurtzer and sorry for the delay!!

Thanks for adding --nv! I tried it (using 2.2.99-development.g945c6ee) and played a bit with docker://tensorflow/tensorflow:latest-gpu...

The good news is that it seems to work well in jobs that allocate a single GPU (in slurm, so a single GPU device is visible - note: we're using device cgroup).

Example (single GPU allocation):

Singularity tensorflow:latest-gpu:~> python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> with tf.device('/gpu:0'):
...     a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
...     b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
...     c = tf.matmul(a, b)
... 
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2017-05-10 02:23:25.640377: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-10 02:23:25.640407: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-10 02:23:25.640416: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-10 02:23:26.163631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-05-10 02:23:26.163686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-05-10 02:23:26.163696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-05-10 02:23:26.163712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:85:00.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:85:00.0
2017-05-10 02:23:26.227264: I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:85:00.0

>>> print(sess.run(c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-05-10 02:23:47.507606: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
b: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-05-10 02:23:47.507659: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/gpu:0
a: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-05-10 02:23:47.507679: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]

However, for jobs using multiple GPUs, even when effectively using only one GPU in TensorFlow, I get the following error messages:

Singularity tensorflow:latest-gpu:~> python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> with tf.device('/gpu:0'):
...     a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
...     b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
...     c = tf.matmul(a, b)
... 
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2017-05-10 02:27:16.126277: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-10 02:27:16.126330: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-10 02:27:16.126353: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-10 02:27:16.139414: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_UNKNOWN
2017-05-10 02:27:16.139496: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: xs-0007
2017-05-10 02:27:16.139523: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: xs-0007
2017-05-10 02:27:16.139594: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Invalid argument: expected %d.%d or %d.%d.%d form for driver version; got "1"
2017-05-10 02:27:16.139657: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  361.93.03  Tue Sep 27 22:40:25 PDT 2016
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) 
"""
2017-05-10 02:27:16.139718: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 361.93.3
Device mapping: no known devices.
2017-05-10 02:27:16.143122: I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:

I tried with several different GPU indexes, same issue.
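
To compare what the job may actually open on the host and inside the container, a minimal sketch follows. It assumes the usual /dev/nvidiaN naming for GPU device nodes and assumes that a device cgroup denial shows up as a failed open; the helper name openable_gpu_nodes is hypothetical, and this is only a diagnostic aid, not an explanation of the cuInit error.

```python
import glob
import os
import re


def openable_gpu_nodes(dev_dir="/dev"):
    """Map each numbered nvidia node under dev_dir to whether it can be opened."""
    result = {}
    for path in glob.glob(os.path.join(dev_dir, "nvidia*")):
        if not re.fullmatch(r"nvidia\d+", os.path.basename(path)):
            continue  # skip nvidiactl, nvidia-uvm and similar control nodes
        try:
            os.close(os.open(path, os.O_RDWR))
            result[path] = True
        except OSError:
            result[path] = False
    return result
```

Running it once on the host inside the allocation and once inside the container started with --nv shows whether both see the same set of usable devices.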

thiell on 10 May 2017

thanks for testing this @thiell !! :dango:

vsoch on 10 May 2017

Hey, so I have done a few more tests and the above error does appear very often when allocating a lot of GPUs (>= 12), but sometimes it does actually work... grr, I don't like that. But I think we have to be careful as it could be a TensorFlow issue. Indeed the error looks very similar to https://github.com/tensorflow/tensorflow/issues/2239 which, in the case of Docker, was apparently fixed by using nvidia-docker instead. However, please note that we do have version 1.0 of TensorFlow compiled and installed on this cluster and it works well even with up to 16 GPUs, but it is not the latest version.

I also quickly tested docker://romanbilyi/torch-gpu using singularity --nv with multiple GPUs and didn't get any error when loading cutorch and doing very simple stuff.

thiell on 10 May 2017

Just to restart the discussion about strict library match requirements:

Your point about library compatibility and portability between distros is a good one, but the Nvidia library does not and can not fix that. The libraries that they build (as I understand it) are built against reasonably old and thus compatible versions of glibc, so we get general compatibility across most (if not all) current distributions of Linux.

I was not referring to glibc or Linux distribution compatibility, but to the possibility of having less strict matching requirements among the components of the NVIDIA stack (driver, libraries, CUDA) between host and containers. The nvidia-docker devs lay it out pretty well here and explain it much better than I could.

kcgthb on 11 May 2017

That is very, very close to how it is getting done in Singularity today with the "experimental" GPU support. We are also getting the GPU libraries from the ld.so.cache as they do, so things work very similarly, and those libraries will indeed be matched to the host kernel (assuming the host's GPU support is working).
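
A minimal sketch of discovering libraries through the ld.so.cache, using the output of ldconfig -p. This is not Singularity's actual implementation; the function names and the libcuda/libnvidia name prefixes are assumptions chosen for illustration.

```python
import re
import subprocess

# Matches lines such as: "\tlibcuda.so.1 (libc6,x86-64) => /usr/lib64/libcuda.so.1"
ENTRY = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)$")


def find_libs(ldconfig_output, prefixes=("libcuda", "libnvidia")):
    """Return {soname: path} for cache entries whose name starts with a prefix."""
    found = {}
    for line in ldconfig_output.splitlines():
        match = ENTRY.match(line)
        if match and match.group(1).startswith(prefixes):
            # Keep the first entry for each name, in the order the cache lists them.
            found.setdefault(match.group(1), match.group(3))
    return found


def read_cache():
    """Run `ldconfig -p`, which prints the contents of ld.so.cache."""
    return subprocess.run(["ldconfig", "-p"], capture_output=True,
                          text=True, check=True).stdout
```

Calling find_libs(read_cache()) on a GPU host returns whatever driver libraries the dynamic linker already knows about, which is why this approach only works when the libraries are registered in ld's standard paths.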

gmkurtzer on 12 May 2017

That seems close enough indeed, but I think there may be additional bits missing. For instance, the NVML-related tools, such as nvidia-smi, which heavily depend on the driver, are not available within the container.
nvidia-docker mounts them at runtime, as mentioned in https://github.com/NVIDIA/nvidia-docker/issues/274

kcgthb on 2 Jun 2017

Hi @kcgthb,

Is nvidia-smi a statically compiled binary? If not, can you run the following commands for me:

$ ldd `which nvidia-smi`
$ objdump -T `which nvidia-smi` | grep GLIBC_ | sed -e 's/^.*\(GLIBC_[^ ]*\).*$/\1/' | sort | uniq

My concern is that mounting a binary like this from the host system breaks portability... But on the other hand, we are doing it with libraries already! If you did a bind of nvidia-smi at Singularity runtime, does that work? e.g.:

$ singularity shell -B /usr/bin/nvidia-smi --nv ......

gmkurtzer on 3 Jun 2017

Hi Greg,

Nope, it's dynamically linked:

$ ldd `which nvidia-smi`
        linux-vdso.so.1 =>  (0x00007ffd32fae000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f84af507000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f84af302000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f84aef41000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f84aed39000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f84af730000)
$ objdump -T `which nvidia-smi` | grep GLIBC_ | sed -e 's/^.*\(GLIBC_[^ ]*\).*$/\1/' | sort | uniq
GLIBC_2.2.5

So yes, portability may not be great, but as you said, it's already the case for libraries.
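
The glibc check above can also be sketched in Python for use on saved objdump -T output. The function name glibc_versions is hypothetical; it approximates the grep/sed/sort/uniq pipeline, except that the pattern stops at the first character that is not a letter, digit, underscore or dot.

```python
import re


def glibc_versions(objdump_output):
    """Collect the distinct GLIBC_* version tags from `objdump -T` output."""
    versions = set()
    for line in objdump_output.splitlines():
        tags = re.findall(r"GLIBC_[\w.]+", line)
        if tags:
            # Like the sed expression, keep the last tag found on the line.
            versions.add(tags[-1])
    return sorted(versions)
```

A short result list with only old tags, as in the GLIBC_2.2.5 output above, suggests the binary should load against the glibc of most container images.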

And binding the binary with -B does work:

$ singularity exec -B /usr/bin/nvidia-smi --nv cuda.img nvidia-smi -L
GPU 0: Tesla P40 (UUID: GPU-0e07a660-2201-322e-1456-18a213fc2983)
GPU 1: Tesla P40 (UUID: GPU-42b925fd-0dee-ec3c-627e-3024d9376024)

kcgthb on 5 Jun 2017

I have an aversion to making the nvidia-smi program bound into the container by default, but I will if it is deemed to be completely necessary (as opposed to having nvidia-smi already within the container with the Nvidia linked applications).

gmkurtzer on 8 Jun 2017

I get the feeling, but:

- that's how NVIDIA does this in nvidia-docker
- nvidia-smi is a very common tool that most users would expect to find in their containers
- it's shipped as part of the NVIDIA X11 driver, not CUDA, so it's tightly coupled to the NVML and the kernel module, which should be part of the host OS stack, not the container's.

So, although I understand that it may feel weird, I think that would be necessary to avoid having to install matching NVML/drivers in the containers.

kcgthb on 8 Jun 2017

What Kilian said :-)

GodloveD on 8 Jun 2017

Can you test [development 25cd7a1] and let me know if that works for ya when only using the --nv option?

Thanks!

gmkurtzer on 9 Jun 2017

Yup, tested, works.
Thanks!

kcgthb on 9 Jun 2017

Excellent, thank you for testing. I'm gonna close this issue. YAY!

gmkurtzer on 9 Jun 2017

Could we get this last patch included in a stable 2.3.x version? thx!

thiell on 5 Jul 2017

@gmkurtzer is this included in 2.3.1?

vsoch on 5 Jul 2017

Nope, at the moment it is only in development branch. I can include it into master which will ensure it gets into the next release.

gmkurtzer on 6 Jul 2017

I just tried the --nv option and found that it does not work if nvidia-smi is not installed in the default paths. For instance, the GPU driver on our cluster is in /cm/local/apps/cuda/libs/current/, and the following errors appear when the --nv option is used:
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container

renganxu on 30 Aug 2017

I suspect that Singularity is simply searching your path but because your path is sanitized it is only looking in the usual locations. I don't know how we could cope with this because the path is sanitized for security reasons. Perhaps we could provide an env var that would allow you to specify to search in a non-standard location?

In the development branch there is a config file that allows you to specify what things to search for. Maybe that is the answer. I'll play with it and see what I can come up with.
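
A minimal sketch of the idea of searching extra, explicitly listed locations in addition to the sanitized path. The helper name find_tool and the extra_dirs parameter are hypothetical and do not correspond to any existing Singularity option or config key.

```python
import os
import shutil

# The sanitized search path shown in the warning above.
SANITIZED_PATH = "/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin"


def find_tool(name, extra_dirs=()):
    """Look for an executable in the sanitized path, then in extra_dirs."""
    search = os.pathsep.join([SANITIZED_PATH, *extra_dirs])
    return shutil.which(name, path=search)
```

With a site-specific directory passed in extra_dirs, a lookup for nvidia-smi could succeed while the default path itself stays sanitized; where that list of directories should come from is exactly the open question here.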

GodloveD on 31 Aug 2017

Hi, I have the same problem, with drivers installed in /cm/local/apps/cuda-driver/libs/current/bin/
My containers were working previously with the --nv option.
Have there been any developments on this issue?

jcbowden on 20 Nov 2017

Hi @jcbowden. There is an open PR that addresses this and a few other issues with the --nv option: https://github.com/singularityware/singularity/pull/1082. It is not good to go yet because the implementation is not quite right. Basically, I am grepping the config file in sh. I really need to write a small program in C that will return values from the config file instead.

GodloveD on 20 Nov 2017

Ok, great to know the work is in progress. I can match the driver versions when I build the container for now and that gets stuff working. It will be good to not have to rebuild at every driver upgrade though so I'm looking forward to your efforts. Our cluster uses Bright Cluster Management software that tends to put things in unusual places so we have to work around that.

jcbowden on 21 Nov 2017

Just an observation that I still get this:
which: no nvidia-smi in (/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin)
WARNING: Could not find the Nvidia SMI binary to bind into container

when I use --nv. My nvidia-smi is also in a Bright CM path that starts /cm/shared
If there is a workaround I could put in my recipe that would be fine. I tried adding a bind in the %post section to no avail.

chrisreidy on 22 Feb 2018

Hi @chrisreidy - thanks for your note. I've also seen this issue on a Bright CM install, and it is addressed by the PR mentioned above that has now been merged into development. Coming to a new release of Singularity soon.

dtrudg on 22 Feb 2018

@dctrud Thanks David. I can live with it until then

chrisreidy on 22 Feb 2018
