Upon building pytorch, I am getting the following error:
RPATH of binary /nix/store/pg7wfc8gw05w48fi1gw7n7njw6b0crad-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so contains a forbidden reference to /build
To reproduce, run nix-shell on the following input:
{ bootstrap ? import <nixpkgs> {} }:
let
  pkgs_source = fetchTarball "https://github.com/NixOS/nixpkgs/archive/0e7ba35ddc51ee4d40a66efb45e991f9ce2dcab3.tar.gz";
  overlays = [
    (self: super: {
      haskellPackages = super.haskellPackages.extend (selfHS: superHS: {
      });
    })
  ];
  config = {
    allowUnfree = true;
    cudaSupport = true;
  };
  pkgs = import pkgs_source { inherit overlays; inherit config; };
  py = pkgs.python3.buildEnv.override {
    extraLibs = with pkgs.python3Packages; [
      pytorch
    ];
    ignoreCollisions = true;
  };
in
pkgs.stdenv.mkDerivation {
  name = "sh-env";
  buildInputs = [ py ];
  shellHook = ''
    export LANG=en_US.UTF-8
    export PYTHONIOENCODING=UTF-8
  '';
}
- system: `"x86_64-linux"`
- host os: `Linux 4.9.5-200.fc25.x86_64, Fedora, 25 (Server Edition)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.0.4`
- channels(sharid): `"nixpkgs-18.09pre149867.4826f9b828a"`
- nixpkgs: `/home/xbejea/.nix-defexpr/channels/nixpkgs`
@teh I see that hydra can build the package, so I am wondering if my configuration is incorrect or if I am doing something wrong somehow.
@jyp
I was unaware of #45773 and independently tried bumping the package version, and stumbled upon the RPATH issue too. I manually checked the binary and found there is an entry referring to the build directory. I guess the authors tried to reference a library in the same dir and, in addition to $ORIGIN, added an explicit absolute directory.
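Roughly, the check looks like this (just a sketch; point it at the built libcaffe2_gpu.so, e.g. inside a --keep-failed build tree):
# List the RPATH entries one per line; any entry pointing into the build
# directory is the forbidden reference the checker complains about.
patchelf --print-rpath path/to/libcaffe2_gpu.so | tr ':' '\n'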
In my case I ended with this hack:
preFixup = ''
  function join_by { local IFS="$1"; shift; echo "$*"; }
  function strip2 {
    echo "FILE" $1
    echo OLD_RPATH
    patchelf --print-rpath $1
    IFS=':'
    read -ra RP <<< `patchelf --print-rpath $1`
    RP_NEW=`join_by : ''${RP[@]:2}`
    patchelf --set-rpath \$ORIGIN:''${RP_NEW} $1
    echo NEW_RPATH
    echo $RP_NEW
    echo ACTUAL_RP
    patchelf --print-rpath $1
  }
  for f in `find ''${out} -name 'libcaffe2*.so'`
  do
    strip2 $f
  done
'';
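An alternative sketch (not what I actually used) would be to filter out the offending entries by name instead of dropping the first two. Shell shown un-escaped; inside a Nix '' string the ${...} would need the usual ''${ escaping:
# Drop any RPATH entry that points into the build directory, keep the rest,
# and prepend $ORIGIN as before.
for f in $(find "$out" -name 'libcaffe2*.so'); do
  old=$(patchelf --print-rpath "$f")
  new=$(echo "$old" | tr ':' '\n' | grep -v '^/build' | paste -sd ':' -)
  patchelf --set-rpath "\$ORIGIN:$new" "$f"
done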
Also, during the local build I see a test script failure while downloading some test tensors. Is it some kind of Nix build isolation mechanism in action?
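Presumably checking whether the daemon builds in a sandbox would confirm this, since network access is blocked inside the sandbox:
# Check the sandbox setting on the build machine.
grep -i sandbox /etc/nix/nix.conf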
@akamaus Did you submit a PR with this fix?
@teh What would be your opinion of said fix?
@akamaus If you submit a PR, please run it through shellcheck first (shellcheck -s bash ...).
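For example, with the fixup snippet saved to a file (the file name here is just illustrative):
# Lint the snippet as bash before opening the PR.
shellcheck -s bash prefixup.sh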
@jyp I'm OK with the @akamaus fix if it unblocks both of you. I'd need more time to understand how this broke in the first place and would probably arrive at a similar solution.
I'm puzzled. The binary pytorch package on nixos-unstable somehow doesn't have the incorrect RPATH entry (tested at ca2ba44cab4). Moreover, quite a number of self-tests were disabled, and I'm not sure they were all failing because of conflicts with the build system. For example, during my experiments I stumbled upon this: https://github.com/pytorch/pytorch/issues/11133
@akamaus I tried your patch (over current master branch), and got:
checking for references to /build in /nix/store/wma1asa1jycc7qiwkqslnnjv31x572wj-python3.6-pytorch-0.4.1...
running install tests
Traceback (most recent call last):
File "test/run_test.py", line 14, in <module>
import torch
File "/nix/store/wma1asa1jycc7qiwkqslnnjv31x572wj-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
from torch._C import *
ImportError: libcusparse.so.9.1: cannot open shared object file: No such file or directory
builder for '/nix/store/vi58yjc0ckzlcnqg49n1l88igp7alrfs-python3.6-pytorch-0.4.1.drv' failed with exit code 1
cannot build derivation '/nix/store/v7fdmkdkc8grvrxz42c1csk0kasy8vnl-python3-3.6.6-env.drv': 1 dependencies couldn't be built
error: build of '/nix/store/v7fdmkdkc8grvrxz42c1csk0kasy8vnl-python3-3.6.6-env.drv' failed
Hello @jyp
Unfortunately, there was a small error related to the IFS special variable in the script, which totally garbled the result. I fixed it and opened a pull request, see https://github.com/NixOS/nixpkgs/pull/46562
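For reference, the usual way to avoid that class of bug is to scope IFS to the single read invocation instead of setting it globally, roughly:
# IFS only applies to this read, so the ':' separator does not leak into the
# rest of the fixup script.
IFS=':' read -ra RP <<< "$(patchelf --print-rpath "$1")"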
@akamaus
In my build of your PR, many tests fail. The first issue seems to be:
warning: no library file corresponding to '/nix/store/ghn6k0ccfiiqbzchf22yzybry0d29p4x-cudatoolkit-9.1-cudnn-7.0.5/lib/libcudnn.so.7' found (skipping)
Then eventually:
Ran 41 tests in 480.707s
FAILED (failures=32, skipped=9)
Traceback (most recent call last):
File "test/run_test.py", line 345, in <module>
main()
File "test/run_test.py", line 337, in main
raise RuntimeError(message)
RuntimeError: test_distributed failed!
builder for '/nix/store/3ixx8mmr19waff4vdjg1rjpgq9gh3bg2-python3.6-pytorch-0.4.1.drv' failed with exit code 1
cannot build derivation '/nix/store/p8b2ml0qscdlyzaxgkz5ni6vf16g1wag-python3-3.6.6-env.drv': 1 dependencies couldn't be built
@jyp
That's weird. Could you please paste the full log after the build is done and the tests have started? Maybe there is some dynamic logic inside. I had to disable some tests because of strange SSL-related errors, but test_distributed worked for me. I've triggered a rebuild to be absolutely sure.
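If it helps, the full log of the failed build can also be pulled from the store afterwards, e.g. for the derivation from your error message:
# Print the stored build log for the failing derivation.
nix-store --read-log /nix/store/3ixx8mmr19waff4vdjg1rjpgq9gh3bg2-python3.6-pytorch-0.4.1.drv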
My logs follow. Something seems to be wrong with CUDA: some tests say it's unavailable, but it works if I try it manually.
running install tests
No CUDA runtime is found, using CUDA_HOME='/nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit'
Running test_autograd ...
............................................................................................s.........s............................................................................................................................s..................................................s........................................................s................................................................................................................................................................ss..................................................s..........................................................................................s...........................s......................................................................................................................................s......................s.....
----------------------------------------------------------------------
Ran 831 tests in 120.034s
OK (skipped=12)
Ninja is not available. Skipping C++ extensions test. Install ninja with 'pip install ninja' or 'conda install ninja'.
Running test_cpp_extensions ...
Running test_c10d ...
.THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
.THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
THCudaCheck FAIL file=/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/aten/src/THC/THCGeneral.cpp line=74 error=35 : CUDA driver version is insufficient for CUDA runtime version
.ss......
----------------------------------------------------------------------
Ran 11 tests in 40.138s
OK (skipped=2)
Running test_cuda ...
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
CUDA not available, skipping tests
Running test_distributed ...
s..s.s......s..s...s.s...........s..s....
----------------------------------------------------------------------
Ran 41 tests in 4.259s
OK (skipped=9)
s..s.s......s..s...s.s...........s..s....
----------------------------------------------------------------------
Ran 41 tests in 10.985s
OK (skipped=9)
s..s.s......s..s...s.sss.ssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 4.687s
OK (skipped=25)
s..s.s......s..s...s.sss.ssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 7.568s
OK (skipped=25)
sssssssssssssssssssssssssssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 0.036s
OK (skipped=41)
sssssssssssssssssssssssssssssssssssssssss
----------------------------------------------------------------------
Ran 41 tests in 0.035s
OK (skipped=41)
Running test_distributions ...
ssss.s.s...................................s...ss....ss.......ssss...sssssss.ssss..ss....s..ss.ss.s.s.s..s...sssss....s..sss....................sss.s.s...........
----------------------------------------------------------------------
Ran 162 tests in 8.497s
OK (skipped=51)
Running test_indexing ...
.........................sss..................
----------------------------------------------------------------------
Ran 46 tests in 0.048s
OK (skipped=3)
Running test_jit ...
.........s.....ss.sss..s.s..ssss.s......x..x.sss.x.s...s.........s...ss................................................................................................................................................................................................................................................................................................................................................................................................................................................ss..........................................................................................................................................................................................................................................................................................................................s.................................x........................................................x...........................s..........................
----------------------------------------------------------------------
Ran 965 tests in 30.019s
OK (skipped=25, expected failures=5)
Running test_legacy_nn ...
..ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s......s.s.ss.s.s..s.s..s.s.s..s.s.s.s.s..s.s.s.s.s.s..s.s.s.s..s.s.ss.s.s.s.s.s.s.s..s.s.s.s.s.s..ss...ss.s..ss.s.s.s..s.s..ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s....s.s.s.s.s.s..s.s..s.s.s..ss.s.s.s.s.s.s..s.ss.s.s.s.s.s.s.s.s.s.s.s.s..s.s.s.s.s.ss.s.s.s.s.s.s..s.s..s.s.ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.s.s.s.s.s...s.s.....
----------------------------------------------------------------------
Ran 424 tests in 82.879s
OK (skipped=199)
Running test_multiprocessing ...
ssss..ss..........s..
----------------------------------------------------------------------
Ran 21 tests in 8.968s
OK (skipped=7)
Running test_nccl ...
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
CUDA not available, skipping tests
Running test_nn ...
s....s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.s.s.s.s.ss..ss.s.s.s.s.sss.s.s.sss.sss.s.s.sss.s.s.sss.sss.s.s.s.s.s.s.s.s.s.s.s.s..s.ss.s.s.s.s.s.s.s..s.ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.ss.s.s.ss.s.s..s.sss..s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s..s.s.sss.sss.sss.sss.sss.sss.s....s.s..s.s.s.ss.s..s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.sss.sss.s.sss.s.sss.sss.sss.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.sss.s.s.sss.sss.sss.s.sss.s.s.sss.s.s.s.s..s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.s.sss.s.s.sss.sss.sss.sss.sss.sss.sss.sss.s.s.s.s.s.s.s.s.s.s.s..s....sss.s.ssssss.s.s.sss.sss.s.sss.s.sss..sss.s.ssssss.sss.s.sss.s.sss.s.sss.sss.sss.s.sss..s.s.s.s.s.s.sss.sss.sss.ssssss.sss.sss.sss.sss.sss..s.s.s.s.s.sss.sss.sss.sss.s.s.s.s.s.s.s...s.s.sss.sss..sssss.s.s.s..s.ss.s.s.s.s.s.s.s.s.....s.s.sss.s.s.sss.sss.sss.sss.s.sss..ss.s.s.s.s..s.s.ss.s.s.s.s.s.s.s.s.s.s.s.s.s.s/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
.s.s.s..ss..ss..ss..ss.s.s.s.s.s.s.s.s.s.s.s..ss..ss.s.s.......s.s.....s.s............sss.....s.s.....s.....ssssssssssssssssss.....s...s....sss..........s............s.s.s.s.s.s....s/tmp/nix-build-python3.6-pytorch-0.4.1.drv-0/source/test/common.py:547: RuntimeWarning: could not download test file 'https://download.pytorch.org/test_data/linear.pt'
warnings.warn(msg, RuntimeWarning)
s.......s/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/functional.py:1006: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
.s.......s....s........ss....sss.s.s.s.s.s.s............/nix/store/5bx3bif1v3g0ryzjvbc3lmk9jpz3vfs1-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/nn/functional.py:1890: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
..........s................s.s..s..s.ss.s.
----------------------------------------------------------------------
Ran 1173 tests in 265.555s
OK (skipped=600)
Running test_optim ...
................................
----------------------------------------------------------------------
Ran 32 tests in 57.472s
OK
[nix-shell:~/nixpkgs-staging]$ python
Python 3.6.6 (default, Jun 27 2018, 05:47:41)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([1,2,3]).to('cuda')
tensor([1, 2, 3], device='cuda:0')
@akamaus
Here is the complete log:
pytorch.log
And the shell.nix file:
{ bootstrap ? import <nixpkgs> {} }:
let
  pkgs_source = bootstrap.fetchFromGitHub
    owner = "akamaus";
    repo = "nixpkgs";
    rev = "master-pytorch-fix";
    sha256 = "1a12d4lzbs4f7rphc1zkjsmz4iv4ldazrf5jhmjby92pn05mqs79";
  };
  overlays = [];
  config = {
    allowUnfree = true;
    cudaSupport = true;
  };
  pkgs = import pkgs_source { inherit overlays; inherit config; };
  py = pkgs.python3.buildEnv.override {
    extraLibs = with pkgs.python3Packages; [
      pytorch
    ];
    ignoreCollisions = true;
  };
in
pkgs.stdenv.mkDerivation {
  name = "sh-env";
  buildInputs = [ py ];
  shellHook = ''
    export LANG=en_US.UTF-8
    export PYTHONIOENCODING=UTF-8
  '';
}
I suppose that I am doing something wrong, but I can't figure out what.
@jyp Are you still testing this on Fedora? In your comment I think it should be:
- pkgs_source = bootstrap.fetchFromGitHub
+ pkgs_source = bootstrap.fetchFromGitHub {
@teh Fedora it is, indeed.
Sorry about the brace. I removed some inane comment that was on this line, and the brace came off with it.
@jyp,
I skimmed through your logs; the suspicious line I saw is
Error: failed to parse the list of possible procesors in /sys/devices/system/cpu/possible
In my case that file contains this:
$ cat /sys/devices/system/cpu/possible
0-11
Are you using a single-processor system or some kind of virtualization?
@akamaus Thanks to your analysis I could find the problem. I was building in a sandbox, which was probably hiding /sys/ from the build system. So my build goes through in non-sandboxed mode. However, pytorch still does not load:
[xbejea@lark tmp]$ nix-shell --pure pytorch.nix
bash: warning: setlocale: LC_TIME: cannot change locale (en_US.UTF-8)
[nix-shell:~/tmp]$ python
Python 3.6.6 (default, Jun 27 2018, 05:47:41)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
from torch._C import *
ImportError: /nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: nvrtcGetProgramLogSize
Could you load pytorch? nvrtcGetProgramLogSize appears to come from CUDA; perhaps somehow CUDA becomes unavailable in the runtime environment (even though it was available at build time)?
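For what it's worth, a rough way to see which CUDA library exports that symbol (a sketch, assuming the toolkit's libraries live under lib64 of the cudatoolkit path from the logs above):
# Look for the missing nvrtc symbol among the cudatoolkit shared libraries.
for lib in /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/*.so*; do
  nm -D "$lib" 2>/dev/null | grep -qw nvrtcGetProgramLogSize && echo "$lib"
done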
@jyp
I'm able to reproduce your problem. Previously I tested without the --pure flag, and that was a mistake. I'm trying to understand what's going on. ldd shows identical dependencies for the _C.cpython-36m-x86_64-linux-gnu.so library in both environments, and strace -e file python -c 'import torch' 2>&1 | grep -i cuda shows identical accesses to library files too.
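A sketch of that comparison, for anyone reproducing (the .so path is the one from the traceback above; adjust to your own store paths):
# Diff the dynamic-linker view outside vs. inside the pure shell.
so=/nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
diff <(ldd "$so") <(nix-shell --pure pytorch.nix --run "ldd $so")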
I stumbled upon a somewhat similar issue: https://github.com/pytorch/pytorch/pull/3455. But it's marked as fixed almost a year ago.
@jyp
Looks like I more or less understood what's going on. Torch's C wrapper _C.cpython-36m-x86_64-linux-gnu.so contains some symbols which are not resolved by the library's direct dependencies:
0000000000000000 *UND* 0000000000000000 nvrtcGetProgramLogSize
0000000000000000 *UND* 0000000000000000 nvrtcCompileProgram
0000000000000000 *UND* 0000000000000000 nvrtcCreateProgram
0000000000000000 *UND* 0000000000000000 nvrtcGetErrorString
0000000000000000 *UND* 0000000000000000 nvrtcDestroyProgram
0000000000000000 *UND* 0000000000000000 nvrtcGetProgramLog
0000000000000000 *UND* 0000000000000000 nvrtcGetPTXSize
0000000000000000 *UND* 0000000000000000 nvrtcGetPTX
They come from libnvrtc.so.9.1 which is a dependency of _nvrtc.cpython-36m-x86_64-linux-gnu.so. And the latter library is cowardly loaded in torch.__init__:
try:
    import torch._nvrtc
except ImportError:
    pass
And among other things, it depends on the CUDA library coming from the currently loaded driver:
ldd /nix/store/6grsp98w0xbzf9ajizv5n1s82z115n7f-python3.6-pytorch-0.4.1/lib/python3.6/site-packages/torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so |grep cuda
libcuda.so.1 => /run/opengl-driver/lib/libcuda.so.1 (0x00007f2a03173000)
libnvrtc.so.9.1 => /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/libnvrtc.so.9.1 (0x00007f2a01a45000)
In the user's environment, there is an LD_LIBRARY_PATH pointing to the driver's directory:
echo $LD_LIBRARY_PATH
/run/opengl-driver/lib:/run/opengl-driver-32/lib
In a pure shell, LD_LIBRARY_PATH is empty.
So basically we need to set up a proper value for that variable.
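For instance (just a sketch, assuming the NixOS driver path /run/opengl-driver/lib), exporting something like this before starting Python:
# Make the host driver libraries visible inside the pure shell, the way an
# impure NixOS session already does via LD_LIBRARY_PATH.
export LD_LIBRARY_PATH="/run/opengl-driver/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"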
@akamaus I'm impressed :+1: by your debugging. Also I don't really get why things are done like this in torch.
@jyp
One question is how we're supposed to fix it. Should things be expected to work in a pure environment? Is just fixing that variable the proper way? @xeji?
Sorry, I'm not familiar with the details of the opengl-related things, so I'm not sure about the best way to fix this. What you might try is to use a program wrapper that appends /run/opengl-driver/lib to LD_LIBRARY_PATH. Just search for uses of wrapProgram in nixpkgs.
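A rough sketch of what that could look like in a fixup phase (requires makeWrapper among the build inputs; the wrapped entry point here is purely illustrative):
# Prepend the driver directory to LD_LIBRARY_PATH for the installed program.
wrapProgram "$out/bin/some-entry-point" \
  --prefix LD_LIBRARY_PATH : /run/opengl-driver/lib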
@xeji Thanks for this.
However, @akamaus, shouldn't the rpath of _nvrtc.cpython-36m-x86_64-linux-gnu.so be changed instead of changing LD_LIBRARY_PATH using a wrapper? (I am really not an expert in this, but as far as I understand, this is what I have been doing so far in the tensorflow package.)
Ok, I see now that I was in error. libcuda.so.1 is supposed to come from the system-wide installed driver, not from the nix store.
Going back to my problem. I am trying to run pytorch on a Fedora machine, without the --pure option. Here is what ldd says outside the nix-shell:
ldd /nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so | grep cuda
libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f0aee316000)
libnvrtc.so.9.1 => /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/libnvrtc.so.9.1 (0x00007f0aecbe8000)
So, libcuda.so.1 is found, presumably at the correct location (system-wide installed driver). Indeed:
ldconfig -p | grep cuda
libicudata.so.57 (libc6,x86-64) => /lib64/libicudata.so.57
libcuda.so.1 (libc6,x86-64) => /lib64/libcuda.so.1
libcuda.so.1 (libc6) => /lib/libcuda.so.1
libcuda.so (libc6,x86-64) => /lib64/libcuda.so
libcuda.so (libc6) => /lib/libcuda.so
Let's now repeat that from within the shell:
ldd /nix/store/djiwnlid23xwc5kig4z04f8w5zxjfzs4-python3-3.6.6-env/lib/python3.6/site-packages/torch/_nvrtc.cpython-36m-x86_64-linux-gnu.so | grep cuda
libcuda.so.1 => not found
libnvrtc.so.9.1 => /nix/store/1636k54rcgcqc77z0gww0d8xg0dlrr2h-cudatoolkit-9.1.85.1-unsplit/lib64/libnvrtc.so.9.1 (0x00007fdd92433000)
libcuda.so.1 is not found. ldconfig is now less than helpful:
ldconfig -p
ldconfig: Can't open cache file /nix/store/2qgjpsn1zkf0clvrrjympwf6ar2dx83r-glibc-2.27/etc/ld.so.cache: No such file or directory
And python does not like to be fed an LD_LIBRARY_PATH:
LD_LIBRARY_PATH=/lib64 python
/nix/store/56nrxy58wbhvs2sy3rir1jqa68p0kkm5-bash-4.4-p23/bin/bash: relocation error: /lib64/libc.so.6: symbol _dl_starting_up version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference
So, how is one supposed to provide this system library (libcuda.so.1) within a nix-shell?
LD_LIBRARY_PATH=/lib64 overwrites the path with only /lib64; appending /lib64 to the existing LD_LIBRARY_PATH might be better.
@jyp
I guess part of the problem with your LD_LIBRARY_PATH=/lib64 hack might be that it massively overrides the libraries used. Try running ldd python to see which libraries are actually loaded; maybe that explains the /lib64/libc.so.6 complaints. Maybe LD_PRELOAD would be a better fit :open_mouth:
Also, later on you may encounter a version mismatch between the nvidia kernel module and the userspace (cuda or nvidia, I can't remember exactly) library, so be prepared. I had to make sure the nvidia version in nixpkgs was exactly the same as the installed driver version; I saw it with TF and Ubuntu.
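A quick way to check for such a mismatch (a sketch, assuming the Fedora driver installs its libraries under /lib64 as the ldd output above suggests):
# Kernel-side driver version reported by the loaded nvidia.ko module.
cat /proc/driver/nvidia/version
# Userspace driver libraries; the version is part of the file names.
ls /lib64/libnvidia*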
@akamaus It works!
LD_PRELOAD=/lib64/libcuda.so.1:/lib64/libnvidia-fatbinaryloader.so.390.67 python
Python 3.6.6 (default, Jun 27 2018, 05:47:41)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>
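To avoid typing that each time, the same preload could presumably be baked into the shell environment (a sketch; the .so.390.67 suffix matches the driver version on this particular machine):
# Preload the host driver libraries so torch._nvrtc can resolve libcuda.so.1.
export LD_PRELOAD="/lib64/libcuda.so.1:/lib64/libnvidia-fatbinaryloader.so.390.67${LD_PRELOAD:+:$LD_PRELOAD}"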
I am quite aware of the version mismatch problem --- I've been the maintainer of the TF package for many months. I don't think there is any way around it, though, since the driver is loaded in the kernel. (But perhaps you'll surprise me again...)
To wrap up: I am ready to close this issue. It would be nice to document all this somewhere in the nixpkgs manual, though.
@jyp
On the one hand I have a feeling of incompleteness. On the other hand, it looks like this is the result of an ideological conflict between the user-space isolation provided by Nix and the singleton nature of the kernel. So a proper resolution is probably out of scope for any particular cuda-using package. It seems to me that libcuda is similar to glibc in its role and should be tackled in a similar way. So, let's close this.
Note that as of CUDA 10 the situation has changed somewhat: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
Previously, the host nvidia.ko driver and the coupled libcuda.so.1 driver had to be equivalent -- they were tightly woven and had to come from the same driver package, as there were no compatibility guarantees. But with CUDA 10, there is a measure of compatibility between newer libcuda.so.1 versions -- in other words, you can deploy a new libcuda.so.1 with an older CUDA 10 nvidia.ko.
The push for this is likely part of Nvidia's GPU containerization push -- people want to use arbitrary containers with arbitrary CUDA userspace libraries, but there can only be a single host kernel offering GPU resources (with a particular nvidia.ko), which the container cannot control.
I believe this should ease the requirements for Nix-style builds, because now we can simply pick a random linuxPackages.nvidia_x11 package and use the accompanying libcuda.so.1 in order to provide a runtime driver. In fact, with CUDA 10 we could in theory remove libcuda.so from linuxPackages itself, since the underlying driver package is less important. (We could offer a separate expression for the userspace CUDA driver component, for instance.)
I currently have something like this working: a full PyTorch 1.0 training application that can run inside of a container using the coupled libcuda.so.1 driver from linuxPackages.nvidia_x11 -- in fact I am running this on machines with entirely different kernels from the nvidia_x11 package; my host machine is linuxPackages_4_19, but it does have nvidia.ko installed of course, and the container hardcodes LD_PRELOAD to lib/libcuda.so.1 in the linuxPackages.nvidia_x11 package.
This works fine, although I haven't quite gotten GPU acceleration working inside a container yet -- the libraries themselves all work fine, so I think it's pretty close. (I need to set up a machine with the Nvidia Container Runtime so I can actually debug it, but I'm not enthused about supporting that in NixOS...)
In a nutshell: I think this means it may be possible to avoid the usage of /run/opengl-driver/ in the rpath, etc., at least for limited-compatibility cases (or at least reliably work around it, as I have), in the long run.
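For anyone who wants to try the preload approach described above, a rough sketch (the attribute and the lib/libcuda.so.1 path are the ones mentioned in this comment; everything else is illustrative):
# Build (or fetch) the userspace driver package and preload its libcuda.
drv=$(nix-build '<nixpkgs>' -A linuxPackages.nvidia_x11 --no-out-link)
LD_PRELOAD="$drv/lib/libcuda.so.1" python -c 'import torch; print(torch.cuda.is_available())'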
(triage) So there have been quite a few comments. My glance-over reading of it is that basically there is no longer an issue, and that there might have been a need for documentation but with CUDA 10 there won't be this need any longer. Am I understanding correctly? If so, I think this could be closed.
@Ekleog, pytorch builds fine for me, both with and without cuda. So I think this issue should be closed.
@akamaus Thanks!
Let's close then, feel free to reopen and/or ask for a reopening :)