Nixpkgs: Cannot build pytorch

Created on 4 Sep 2018 · 30 comments · Source: NixOS/nixpkgs

Issue description

Upon building pytorch, I am getting the following error:

RPATH of binary /nix/store/pg7wfc8gw05w48fi1gw7n7njw6b0crad-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so contains a forbidden reference to /build

Steps to reproduce

Run nix-shell on the following input:

{ bootstrap ? import <nixpkgs> {} }:

let pkgs_source = fetchTarball "https://github.com/NixOS/nixpkgs/archive/0e7ba35ddc51ee4d40a66efb45e991f9ce2dcab3.tar.gz";
    overlays = [(self: super: {
      haskellPackages = super.haskellPackages.extend (selfHS: superHS: {
      });
    })];
    config = {
      allowUnfree = true;
      cudaSupport = true;
    };
    pkgs = import pkgs_source {inherit overlays; inherit config;};
    py = pkgs.python3.buildEnv.override {
      extraLibs =  with pkgs.python3Packages;
        [
         pytorch
        ];
      ignoreCollisions = true;};
in
  pkgs.stdenv.mkDerivation {
    name = "sh-env";
    buildInputs = [py];
    shellHook = ''
      export LANG=en_US.UTF-8
      export PYTHONIOENCODING=UTF-8
    '';
  }
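
To enter the environment, save the expression above to a file (the rest of this thread refers to it as pytorch.nix) and run:

nix-shell pytorch.nix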

Technical details

 - system: `"x86_64-linux"`
 - host os: `Linux 4.9.5-200.fc25.x86_64, Fedora, 25 (Server Edition)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.0.4`
 - channels(sharid): `"nixpkgs-18.09pre149867.4826f9b828a"`
 - nixpkgs: `/home/xbejea/.nix-defexpr/channels/nixpkgs`

All 30 comments

@teh I see that Hydra can build the package; I am wondering whether my configuration is incorrect or I am doing something wrong somehow.

@jyp
I was unaware of #45773 and independently tried bumping the package version, and stumbled upon the RPATH issue too. I manually checked the binary and found an entry referring to the build directory. I guess the authors tried to reference a library in the same directory and, in addition to $ORIGIN, added an explicit absolute path.
In my case I ended with this hack:

preFixup = ''
    # Hack: drop the first two RPATH entries (the ones pointing at /build)
    # and put $ORIGIN back in front, using patchelf.
    function join_by { local IFS="$1"; shift; echo "$*"; }
    function strip2 {
      echo "FILE" $1
      echo OLD_RPATH
      patchelf --print-rpath $1
      # Split the old RPATH on ':' and keep everything after the first two
      # entries. (Note: setting IFS here later turned out to garble the
      # result; see the follow-up PR below.)
      IFS=':'
      read -ra RP <<< `patchelf --print-rpath $1`
      RP_NEW=`join_by : ''${RP[@]:2}`
      patchelf --set-rpath \$ORIGIN:''${RP_NEW} $1
      echo NEW_RPATH
      echo $RP_NEW
      echo ACTUAL_RP
      patchelf --print-rpath $1
    }

    for f in `find ''${out} -name 'libcaffe2*.so'`
    do
      strip2 $f
    done
  '';
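
For comparison, here is a minimal, shellcheck-friendly sketch of the same idea that filters out the offending /build entries instead of unconditionally dropping the first two components. It assumes patchelf is on PATH and that $out points at the package output; it is only an illustration, not the code from the eventual PR:

# Sketch only: remove RPATH entries under /build, keeping $ORIGIN in front.
strip_build_rpath() {
  local f=$1 old new
  old=$(patchelf --print-rpath "$f")
  # Split on ':', drop anything under /build, re-join with ':'.
  new=$(printf '%s\n' "$old" | tr ':' '\n' | grep -v '^/build' | paste -sd: -)
  patchelf --set-rpath "\$ORIGIN:$new" "$f"
}

find "$out" -name 'libcaffe2*.so' | while read -r f; do
  strip_build_rpath "$f"
done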

Also, during a local build I see the test script fail while downloading some testing tensors. Is this some kind of Nix build isolation mechanism in action?

@akamaus Did you submit a PR with this fix?
@teh What would be your opinion of said fix?

@akamaus if you submit a PR, please run it through shellcheck first (shellcheck -s bash ..).

@jyp OK with the @akamaus fix if it unblocks both of you. I'd need more time to understand how this broke in the first place and would probably arrive at a similar solution.

I'm puzzled. The binary pytorch package on nixos-unstable somehow doesn't have the incorrect RPATH entry (tested at ca2ba44cab4). Moreover, quite a large number of self tests were disabled, and I'm not sure they were all failing because of conflicts with the build system. For example, during my experiments I stumbled upon https://github.com/pytorch/pytorch/issues/11133
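
For reference, a quick way to check whether a given copy of the library carries such an entry is to print its RPATH with patchelf; the store path below is only illustrative:

# Look for /build entries in the RPATH of the installed library
patchelf --print-rpath /nix/store/...-pytorch-0.4.1/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so \
  | tr ':' '\n' | grep /build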

@akamaus I tried your patch (over the current master branch), and got:

checking for references to /build in /nix/store/wma1asa1jycc7qiwkqslnnjv31x572wj-python3.6-pytorch-0.4.1...
running install tests
Traceback (most recent call last):
  File "test/run_test.py", line 14, in <module>
    import torch
  File "/nix/store/wma1asa1jycc7qiwkqslnnjv31x572wj-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
    from torch._C import *
ImportError: libcusparse.so.9.1: cannot open shared object file: No such file or directory
builder for '/nix/store/vi58yjc0ckzlcnqg49n1l88igp7alrfs-python3.6-pytorch-0.4.1.drv' failed with exit code 1
cannot build derivation '/nix/store/v7fdmkdkc8grvrxz42c1csk0kasy8vnl-python3-3.6.6-env.drv': 1 dependencies couldn't be built
error: build of '/nix/store/v7fdmkdkc8grvrxz42c1csk0kasy8vnl-python3-3.6.6-env.drv' failed

Hello @jyp
Unfortunately, there was a small error related to the IFS special variable in the script which totally garbled the result. I fixed it and opened a pull request; see https://github.com/NixOS/nixpkgs/pull/46562

@akamaus
In my build of your PR, many tests fail. The first issue seems to be:

warning: no library file corresponding to '/nix/store/ghn6k0ccfiiqbzchf22yzybry0d29p4x-cudatoolkit-9.1-cudnn-7.0.5/lib/libcudnn.so.7' found (skipping)

Then eventually:

Ran 41 tests in 480.707s

FAILED (failures=32, skipped=9)
Traceback (most recent call last):
  File "test/run_test.py", line 345, in <module>
    main()
  File "test/run_test.py", line 337, in main
    raise RuntimeError(message)
RuntimeError: test_distributed failed!
builder for '/nix/store/3ixx8mmr19waff4vdjg1rjpgq9gh3bg2-python3.6-pytorch-0.4.1.drv' failed with exit code 1
cannot build derivation '/nix/store/p8b2ml0qscdlyzaxgkz5ni6vf16g1wag-python3-3.6.6-env.drv': 1 dependencies couldn't be built

@jyp
That's weird. Could you please paste the full log from after the build is done and the tests have started? Maybe there is some dynamic logic inside. I had to disable some tests because of strange SSL-related errors, but test_distributed worked for me. I've triggered a rebuild to be absolutely sure.

My logs follow. Something seems to be wrong with CUDA: some tests say it's unavailable, but it works if I try it manually.

running install tests
No CUDA runtime is found, using CUDA_HOME='/nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit'
Running test_autograd ...
............................................................................................s.........s............................................................................................................................s..................................................s........................................................s................................................................................................................................................................ss..................................................s..........................................................................................s...........................s......................................................................................................................................s......................s.....
----------------------------------------------------------------------
Ran 831 tests in 120.034s

OK (skipped=12)
Ninja is not available. Skipping C++ extensions test. Install ninja with 'pip install ninja' or 'conda install ninja'.
Running test_cpp_extensions ...
Running test_c10d ...
.THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
.THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
.ss......
----------------------------------------------------------------------
Ran 11 tests in 40.138s

OK (skipped=2)
Running test_cuda ...

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK
CUDA not available, skipping tests
Running test_distributed ...
s..s.s......s..s...s.s...........s..s....
----------------------------------------------------------------------
Ran 41 tests in 4.259s

OK (skipped=9)
s..s.s......s..s...s.s...........s..s....
----------------------------------------------------------------------
Ran 41 tests in 10.985s

OK (skipped=9)
s..s.s......s..s...s.sss.ssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 4.687s

OK (skipped=25)
s..s.s......s..s...s.sss.ssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 7.568s

OK (skipped=25)
sssssssssssssssssssssssssssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 0.036s

OK (skipped=41)
sssssssssssssssssssssssssssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 0.035s

OK (skipped=41)
Running test_distributions ...
ssss.s.s...................................s...ss....ss.......ssss...sssssss.ssss..ss....s..ss.ss.s.s.s..s...sssss....s..sss....................sss.s.s...........
----------------------------------------------------------------------
Ran 162 tests in 8.497s

OK (skipped=51)
Running test_indexing ...
.........................sss..................
----------------------------------------------------------------------
Ran 46 tests in 0.048s

OK (skipped=3)
Running test_jit ...
.........s.....ss.sss..s.s..ssss.s......x..x.sss.x.s...s.........s...ss................................................................................................................................................................................................................................................................................................................................................................................................................................................ss..........................................................................................................................................................................................................................................................................................................................s.................................x........................................................x...........................s..........................
----------------------------------------------------------------------
Ran 965 tests in 30.019s

OK (skipped=25, expected failures=5)
Running test_legacy_nn ...
..ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s......s.s.ss.s.s..s.s..s.s.s..s.s.s.s.s..s.s.s.s.s.s..s.s.s.s..s.s.ss.s.s.s.s.s.s.s..s.s.s.s.s.s..ss...ss.s..ss.s.s.s..s.s..ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s....s.s.s.s.s.s..s.s..s.s.s..ss.s.s.s.s.s.s..s.ss.s.s.s.s.s.s.s.s.s.s.s.s..s.s.s.s.s.ss.s.s.s.s.s.s..s.s..s.s.ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.s.s.s.s.s...s.s.....
----------------------------------------------------------------------
Ran 424 tests in 82.879s

OK (skipped=199)
Running test_multiprocessing ...
ssss..ss..........s..
----------------------------------------------------------------------
Ran 21 tests in 8.968s

OK (skipped=7)
Running test_nccl ...

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK
CUDA not available, skipping tests
Running test_nn ...
s....s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.s.s.s.s.ss..ss.s.s.s.s.sss.s.s.sss.sss.s.s.sss.s.s.sss.sss.s.s.s.s.s.s.s.s.s.s.s.s..s.ss.s.s.s.s.s.s.s..s.ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.ss.s.s.ss.s.s..s.sss..s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.sss.sss.sss.sss.sss.sss.s....s.s..s.s.s.ss.s..s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.sss.sss.s.sss.s.sss.sss.sss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.sss.s.s.sss.sss.sss.s.sss.s.s.sss.s.s.s.s..s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.sss.s.s.sss.sss.sss.sss.sss.sss.sss.sss.s.s.s.s.s.s.s.s.s.s.s..s....sss.s.ssssss.s.s.sss.sss.s.sss.s.sss..sss.s.ssssss.sss.s.sss.s.sss.s.sss.sss.sss.s.sss..s.s.s.s.s.s.sss.sss.sss.ssssss.sss.sss.sss.sss.sss..s.s.s.s.s.sss.sss.sss.sss.s.s.s.s.s.s.s...s.s.sss.sss..sssss.s.s.s..s.ss.s.s.s.s.s.s.s.s.....s.s.sss.s.s.sss.sss.sss.sss.s.sss..ss.s.s.s.s..s.s.ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
.s.s.s..ss..ss..ss..ss.s.s.s.s.s.s.s.s.s.s.s..ss..ss.s.s.......s.s.....s.s............sss.....s.s.....s.....ssssssssssssssssss.....s...s....sss..........s............s.s.s.s.s.s....s/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/test/common.py:547: RuntimeWarning: could not download test file 'https://download.pytorch.org/test_data/linear.pt'
  warnings.warn(msg, RuntimeWarning)
s.......s/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/functional.py:1006: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
.s.......s....s........ss....sss.s.s.s.s.s.s............/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/functional.py:1890: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
..........s................s.s..s..s.ss.s.
----------------------------------------------------------------------
Ran 1173 tests in 265.555s

OK (skipped=600)
Running test_optim ...
................................
----------------------------------------------------------------------
Ran 32 tests in 57.472s

OK

[nix-shell:~/nixpkgs-staging]$ python
Python 3.6.6 (default, Jun 27 2018, 05:47:41) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([1,2,3]).to('cuda')
tensor([1, 2, 3], device='cuda:0')

@akamaus
Here is the complete log:
pytorch.log

And the shell.nix file:

{ bootstrap ? import <nixpkgs> {} }:
let 
    pkgs_source = bootstrap.fetchFromGitHub 
        owner = "akamaus";
        repo = "nixpkgs";
        rev = "master-pytorch-fix";
        sha256 = "1a12d4lzbs4f7rphc1zkjsmz4iv4ldazrf5jhmjby92pn05mqs79";
      };
    overlays = [];
    config = {
      allowUnfree = true;
      cudaSupport = true;
    };
    pkgs = import pkgs_source {inherit overlays; inherit config;};
    py = pkgs.python3.buildEnv.override {
      extraLibs =  with pkgs.python3Packages;
        [
         pytorch
        ];
      ignoreCollisions = true;};
in
  pkgs.stdenv.mkDerivation {
    name = "sh-env";
    buildInputs = [py];
    shellHook = ''
      export LANG=en_US.UTF-8
      export PYTHONIOENCODING=UTF-8
    '';
  }

I suppose that I am doing something wrong, but I can't figure out what.

@jyp are you still testing this on Fedora? In your comment I think it should be:

-    pkgs_source = bootstrap.fetchFromGitHub 
+    pkgs_source = bootstrap.fetchFromGitHub {

@teh Fedora it is, indeed.

Sorry about the brace. I removed an inane comment that was on this line and the brace came off with it.

@jyp,
I skimmed through your logs; the suspicious line I saw is:
Error: failed to parse the list of possible procesors in /sys/devices/system/cpu/possible

In my case that file contains this:

$ cat /sys/devices/system/cpu/possible
0-11

Are you using a single-processor system or some kind of virtualization?

@akamaus Thanks to your analysis I could find the problem. I was building in a sandbox, which was probably hiding /sys/ from the build system, so my build goes through in non-sandboxed mode. However, pytorch still does not load:

[xbejea@lark tmp]$ nix-shell --pure pytorch.nix
bash: warning: setlocale: LC_TIME: cannot change locale (en_US.UTF-8)

[nix-shell:~/tmp]$ python
Python 3.6.6 (default, Jun 27 2018, 05:47:41) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
    from torch._C import *
ImportError: /nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: nvrtcGetProgramLogSize

Could you load pytorch?
nvrtcGetProgramLogSize appears to come from CUDA; perhaps CUDA somehow becomes unavailable in the runtime environment (even though it was available during the build)?

@jyp
I'm able to reproduce your problem. Previously I tested without the --pure flag, and that was a mistake. I'm trying to understand what's going on. ldd shows identical dependencies for the _C.cpython-36m-x86_64-linux-gnu.so library in both environments.
strace -e file python -c 'import torch' 2>&1 | grep -i cuda shows identical accesses to library files too.
I stumbled upon a somewhat similar issue: https://github.com/pytorch/pytorch/pull/3455. But it's marked as fixed almost a year ago.

@jyp
Looks like I have more or less understood what's going on. Torch's C wrapper _C.cpython-36m-x86_64-linux-gnu.so contains some symbols which are not resolved by the library's direct dependencies:

0000000000000000         *UND*  0000000000000000              nvrtcGetProgramLogSize
0000000000000000         *UND*  0000000000000000              nvrtcCompileProgram
0000000000000000         *UND*  0000000000000000              nvrtcCreateProgram
0000000000000000         *UND*  0000000000000000              nvrtcGetErrorString
0000000000000000         *UND*  0000000000000000              nvrtcDestroyProgram
0000000000000000         *UND*  0000000000000000              nvrtcGetProgramLog
0000000000000000         *UND*  0000000000000000              nvrtcGetPTXSize
0000000000000000         *UND*  0000000000000000              nvrtcGetPTX

They come from libnvrtc.so.9.1, which is a dependency of _nvrtc.cpython-36m-x86_64-linux-gnu.so. The latter library is loaded opportunistically in torch.__init__, with any ImportError silently swallowed:

try:
    import torch._nvrtc
except ImportError:
    pass

Besides other things, it depends on the CUDA driver library coming from the currently loaded driver:

ldd /nix/store/6grsp98w0xbzf9ajizv5n1s82z115n7f-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so |grep cuda  
        libcuda.so.1 => /run/opengl-driver/lib/libcuda.so.1 (0x00007f2a03173000)
        libnvrtc.so.9.1 => /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/libnvrtc.so.9.1 (0x00007f2a01a45000)
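
The same kind of analysis can be reproduced with standard binutils tools; a rough sketch, with the store paths shortened for readability:

# Which symbols does the main extension leave undefined?
nm -D --undefined-only .../torch/_C.cpython-36m-x86_64-linux-gnu.so | grep nvrtc

# Which libraries satisfy the optional _nvrtc module, and where do they come from?
ldd .../torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so | grep -E 'nvrtc|cuda'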

In the user's environment there is an LD_LIBRARY_PATH pointing to the driver's directory:

echo $LD_LIBRARY_PATH                                                                                                                                                                     
/run/opengl-driver/lib:/run/opengl-driver-32/lib

In a pure shell, LD_LIBRARY_PATH is empty.
So basically we need to set up a proper value for that variable.
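
On NixOS, where the driver libraries live under /run/opengl-driver, one way to do that from the repro's shellHook would be roughly the following (a sketch, not a committed fix; inside a Nix ''...'' string the ${...} expansion would have to be escaped as ''${...}):

# shellHook sketch: make the host driver's libcuda.so.1 visible at runtime
export LD_LIBRARY_PATH=/run/opengl-driver/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}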

@akamaus I'm impressed :+1: by your debugging. Also I don't really get why things are done like this in torch.

@jyp
One question is how we're supposed to fix it. Should anything be expected to work in a pure environment? Is just fixing up that variable the proper way? @xeji ?

Sorry, I'm not familiar with the details of the opengl-related things, so I'm not sure about the best way to fix this. What you might try is to use a program wrapper that appends /run/opengl-driver/lib to LD_LIBRARY_PATH. Just search for uses of wrapProgram in nixpkgs.
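
A rough sketch of what that could look like in a package's postFixup, assuming makeWrapper is in nativeBuildInputs and there is an entry point to wrap (the path below is illustrative):

# postFixup sketch: let the wrapped program find the host driver at runtime
wrapProgram "$out/bin/some-entry-point" \
  --prefix LD_LIBRARY_PATH : /run/opengl-driver/lib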

@xeji Thanks for this.

However, @akamaus, shouldn't the rpath of _nvrtc.cpython-36m-x86_64-linux-gnu.so be changed instead of changing LD_LIBRARY_PATH with a wrapper? (I am really not an expert in this, but as far as I understand, this is what I have been doing so far in the tensorflow package.)

Ok, I see now that I was in error: libcuda.so.1 is supposed to come from the system-wide installed driver, not from the Nix store.

Going back to my problem: I am trying to run pytorch on a Fedora machine, without the --pure option. Here is what ldd says outside the nix-shell:

ldd /nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so  | grep cuda
    libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f0aee316000)
    libnvrtc.so.9.1 => /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/libnvrtc.so.9.1 (0x00007f0aecbe8000)

So, libcuda.so.1 is found, presumably at the correct location (system-wide installed driver). Indeed:

ldconfig -p | grep cuda
    libicudata.so.57 (libc6,x86-64) => /lib64/libicudata.so.57
    libcuda.so.1 (libc6,x86-64) => /lib64/libcuda.so.1
    libcuda.so.1 (libc6) => /lib/libcuda.so.1
    libcuda.so (libc6,x86-64) => /lib64/libcuda.so
    libcuda.so (libc6) => /lib/libcuda.so

Let's now repeat that from within the shell:

ldd /nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so  | grep cuda
    libcuda.so.1 => not found
    libnvrtc.so.9.1 => /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/libnvrtc.so.9.1 (0x00007fdd92433000)

libcuda.so.1 is not found. ldconfig is now less than helpful:

ldconfig -p
ldconfig: Can't open cache file /nix/store/2qgjpsn1zkf0clvrrjympwf6ar2dx83r-glibc-2.27/etc/ld.so.cache
: No such file or directory

And python does not like to be fed an LD_LIBRARY_PATH:

LD_LIBRARY_PATH=/lib64 python
/nix/store/56nrxy58wbhvs2sy3rir1jqa68p0kkm5-bash-4.4-p23/bin/bash: relocation error: /lib64/libc.so.6: symbol _dl_starting_up version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference

So, how is one supposed to provide this system library (libcuda.so.1) within a nix-shell?

LD_LIBRARY_PATH=/lib64 overwrites the search path with only /lib64; appending /lib64 to the existing LD_LIBRARY_PATH might be better.
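
For completeness, the append form looks like this (as the next comment points out, it may still pull in too much from the host):

# Append the host's /lib64 instead of replacing the whole search path
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/lib64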

@jyp
I guess part of the problem with your LD_LIBRARY_PATH=/lib64 hack might be that it massively overrides the libraries being used. Try running ldd python to see which libraries are actually loaded; maybe that explains the /lib64/libc.so.6 complaints. Maybe LD_PRELOAD would be a better fit :open_mouth:

Also, later on you may encounter an nvidia kernel module vs. userspace library (cuda or nvidia, I can't remember exactly) version mismatch, so be prepared. I had to make sure the nvidia version in nixpkgs was exactly the same as the version installed on your Fedora. I saw this with TF and Ubuntu.
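
A quick way to check that the kernel module and the userspace driver agree is to compare their reported versions; a diagnostic sketch using standard NVIDIA tooling:

# Kernel-side driver version
cat /proc/driver/nvidia/version
# Userspace driver version as reported by the NVIDIA tools
nvidia-smi --query-gpu=driver_version --format=csv,noheader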

@akamaus It works!

LD_PRELOAD=/lib64/libcuda.so.1:/lib64/libnvidia-fatbinaryloader.so.390.67 python
Python 3.6.6 (default, Jun 27 2018, 05:47:41) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> 

I am quite aware of the version-mismatch problem --- I've been the maintainer of the TF package for many months. I don't think there is any way around it, though, since the driver is loaded in the kernel. (But perhaps you'll surprise me again...)

To wrap up: I am ready to close this issue. It would be nice to document all of this somewhere in the nixpkgs manual, though.

@jyp
On the one hand I have a feeling of incompleteness. On the other hand, it looks like the result of an ideological conflict between the user-space isolation provided by Nix and the singleton nature of the kernel, so a proper resolution is probably out of the scope of any particular cuda-using package. It seems to me that libcuda is similar to glibc in its role and should be tackled in a similar way. So, let's close this.

Note that as of CUDA 10 the situation has changed somewhat: https://docs.nvidia.com/deploy/cuda-compatibility/index.html

Previously, the host nvidia.ko driver and the coupled libcuda.so.1 driver had to be equivalent -- they were tightly woven and had to come from the same driver package, as there were no compatibility guarantees. But with CUDA 10, there is a measure of compatibility between newer libcuda.so.1 versions -- in other words, you can deploy a new libcuda.so.1 with an older CUDA 10 nvidia.ko.

The push for this is likely part of Nvidia's GPU containerization push -- people want to use arbitrary containers with arbitrary CUDA userspace libraries, but there can only be a single host kernel offering GPU resources (with a particular nvidia.ko), which the container cannot control.

I believe this should ease the requirements for Nix style builds, because now we can simply pick a random linuxPackages.nvidia_x11 package and use the accompanying libcuda.so.1 in order to provide a runtime driver. In fact with CUDA 10, we could in theory remove libcuda.so from linuxPackages itself since the underlying driver package is less important. (We could offer a separate expression for the userspace CUDA driver component, for instance.)

I currently have something like this working: a full PyTorch 1.0 training application that can run inside of a container using the coupled libcuda.so.1 driver from linuxPackages.nvidia_x11 -- In fact I am running this on machines with different kernels entirely from the nvidia_x11 package; my host machine is linuxPackages_4_19, but it does have nvidia.ko installed of course, and the container hardcodes LD_PRELOAD to lib/libcuda.so.1 in the linuxPackages.nvidia_x11 package.

This works, although I haven't quite gotten GPU acceleration working inside a container yet; the libraries themselves all load fine, and I think it's pretty close. (I need to set up a machine with the Nvidia Container Runtime so I can actually debug it, but I'm not enthused about supporting that in NixOS...)

In a nutshell: I think this means it may be possible, in the long run, to avoid the use of /run/opengl-driver/ in the rpath, etc., at least for limited-compatibility cases (or at least to reliably work around it, as I have).
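
As a concrete illustration of the container setup described above (a sketch, not the exact configuration used; the store path is illustrative and the driver must be at least as new as the host's nvidia.ko):

# Preload the userspace CUDA driver taken from a linuxPackages.nvidia_x11 build
export LD_PRELOAD=/nix/store/...-nvidia-x11-<version>/lib/libcuda.so.1
python train.py   # e.g. a PyTorch 1.0 training script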

(triage) So there have been quite a few comments. My glance-over reading of it is that basically there is no longer an issue, and that there might have been a need for documentation, but with cuda 10 there won't be this need any longer. Am I understanding correctly? If so, I think this could be closed.

@Ekleog, pytorch builds fine for me, both with and without cuda, so I think this issue should be closed.

@akamaus Thanks!

Let's close then, feel free to reopen and/or ask for a reopening :)
