Describe the bug
I have an RTX 2080 Ti, which needs CUDA 10.0. I can see the card via nvidia-smi etc., but I can't convince PyTorch that it's available from within nix-shell.
To Reproduce
Steps to reproduce the behavior:
test_cuda.py:
#!/usr/bin/env nix-shell
#!nix-shell pytorch-cuda10.nix -i python3
import torch
print("Cuda: {}".format(torch.cuda.is_available()))
pytorch-cuda10.nix:
with import <nixpkgs> {};
let
  unstable = import <unstable> { config.allowUnfree = true; };
in
(let
  python = let
    packageOverrides = self: super: {
      pytorch = super.pytorch.override {
        cudaSupport = true;
        cudatoolkit = unstable.cudatoolkit_10;
        cudnn = unstable.cudnn_cudatoolkit_10;
        magma = unstable.magma;
      };
    };
  in unstable.python3.override { inherit packageOverrides; self = python; };
in python.withPackages (ps: [ ps.pytorch ])).env
Expected behavior
I expect ./test_cuda.py to print Cuda: True, but instead it prints Cuda: False.
Metadata
Please run nix run nixpkgs.nix-info -c nix-info -m and paste the result.
$ nix run nixpkgs.nix-info -c nix-info -m
- system: `"x86_64-linux"`
- host os: `Linux 5.3.14, NixOS, 19.09.1476.72a2ced2523 (Loris)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3`
- channels(root): `"nixos-19.09.1498.0322870203c, unstable-20.03pre204199.3140fa89c51"`
- channels(matt): `""`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
Maintainer information:
# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
I don't know if this is helpful, but the following is my (somewhat pared down) system-wide configuration.nix:
{ config, pkgs, lib, ... }:
let
  unstable = import <unstable> { config.allowUnfree = true; };
in
{
  imports =
    [ # Include the results of the hardware scan.
      ./hardware-configuration.nix
    ];

  nixpkgs.config.allowUnfree = true;

  boot.kernelPackages = unstable.linuxPackages_5_3;
  boot.kernelParams = [ "nordrand" ];

  # Use the systemd-boot EFI boot loader.
  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;

  hardware.enableAllFirmware = true;

  environment.systemPackages = with pkgs; [
    cudatoolkit_10
  ];

  services.xserver.videoDrivers = [ "nvidia" ];

  systemd.services.nvidia-control-devices = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig.ExecStart = "${unstable.linuxPackages_5_3.nvidia_x11.bin}/bin/nvidia-smi";
  };

  # This value determines the NixOS release with which your system is to be
  # compatible, in order to avoid breaking some software such as database
  # servers. You should change this only after NixOS release notes say you
  # should.
  system.stateVersion = "19.09"; # Did you read the comment?
}
I'm not quite sure what exactly I was doing wrong, but I figured this out:
{ pkgs ? import <nixpkgs> {
    config.allowUnfree = true;
  } }:
let
  unstable = import <unstable> { config.allowUnfree = true; };
  nvidia_x11 = unstable.linuxPackages_5_3.nvidia_x11;
  cudatoolkit = unstable.cudatoolkit_10_0;
  cudnn = unstable.cudnn_cudatoolkit_10_0;
  python = unstable.python3.withPackages (ps: [
    ps.pytorchWithCuda
  ]);
in pkgs.stdenv.mkDerivation {
  name = "cuda-env-shell";
  buildInputs = with pkgs; [ python ];
  shellHook = ''
    export LD_LIBRARY_PATH="${nvidia_x11}/lib"
  '';
}
(at which point the test_cuda.py example prints Cuda: True)
I'm not sure if this is expected behavior or a bug, or maybe missing documentation - should the LD_LIBRARY_PATH be set by default somewhere?
@nuance What does your system /etc/nixos/configuration.nix look like?
Normally PyTorch will find libcuda.so.1 via /run/opengl-driver/lib/libcuda.so.1. The /run/opengl-driver symlink is created by /etc/tmpfiles.d/nixos.conf when hardware.opengl.enable is true (implied by services.xserver.enable = true), and should include the libcuda.so.1 symlink if "nvidia" is in services.xserver.videoDrivers.
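Concretely, for the headless case that means something like the following in configuration.nix (a minimal sketch based on the mechanism described above, using the option names as of NixOS 19.09; keep the rest of your config, e.g. the nvidia-control-devices unit from the original post, alongside it):

{ config, pkgs, ... }:
{
  # Having "nvidia" here is what makes /run/opengl-driver include
  # libcuda.so.1; it does not by itself require running an X server.
  services.xserver.videoDrivers = [ "nvidia" ];

  # Creates the /run/opengl-driver symlink (via /etc/tmpfiles.d/nixos.conf),
  # the path PyTorch searches for libcuda.so.1 at runtime.
  hardware.opengl.enable = true;
}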
I'm not running an xserver (this is a headless box), and hardware.opengl.enable and services.xserver.enable were unset.
I added the line hardware.opengl.enable = true; to my configuration.nix and can now run the previous shell w/out the shellHook definition.
(is that expected? I'm not sure if I'm even using OpenGL when using pytorch, as I'd assume it's just using CUDA etc.?)
My understanding is that PyTorch with CUDA doesn't literally use OpenGL, but it uses the same symlink mechanism configured by the hardware.opengl.enable option to expose libcuda. Perhaps the option could be given a more accurate name, an alias, or better documentation.
The weird part to me is that tensorflow seems to find the GPU / CUDA without that flag set - is it a byproduct of how it's being linked?
It looks like tensorflow pulls in nvidia_x11 as a build input instead of locating it through the runtime symlink, which means you need to rebuild tensorflow for every kernel update and every new Nvidia driver release. Seems like a worse tradeoff to me.
Hi! I'm new to Nix; is there a way to add linuxPackages.nvidia_x11 and LD_LIBRARY_PATH as runtime dependencies of pytorchWithCuda, at the nixpkgs level?
I just ran into this problem on a headless NixOS installation and first used the shellHook, then hardware.opengl.enable; both work. However, given that nvidia-smi reported no error even without the hardware.opengl.enable option, I expected that just requesting pytorchWithCuda in shell.nix would be enough to get things going.
Thanks
@newkozlukov hardware.opengl.enable is that way. The reason it's not needed for nvidia-smi is that nvidia-smi is part of the nvidia_x11 package.
To improve the UX, maybe hardware.opengl.enable should be auto-enabled when you add nvidia to services.xserver.videoDrivers? Right now it's only auto-enabled by services.xserver.enable, but if you're enabling the nvidia driver without enabling X, it's almost certainly because you want to use CUDA.
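For illustration, a hypothetical change to the NixOS Nvidia module along those lines could be as small as this (a sketch of the idea, not an actual patch):

{ config, lib, ... }:
{
  # Hypothetical: default hardware.opengl.enable to true whenever the
  # proprietary Nvidia driver is selected, so headless CUDA works out of
  # the box; users could still override it explicitly.
  hardware.opengl.enable = lib.mkDefault
    (lib.elem "nvidia" config.services.xserver.videoDrivers);
}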
@andersk Regarding UX: when initially configuring NixOS, I actually felt hesitant to add the services.xserver.videoDrivers stanza, because it's not immediately clear whether it would pull in the Xorg server and friends (which was undesired). If I may say so, I'd rather have a separate option that triggers modprobe nvidia (and, apparently, opengl too). It would probably have simplified troubleshooting if the nixos.wiki nvidia/cuda section pointed to hardware.opengl as well (I just noticed that the tensorflow page points to hardware.opengl.setLdLibraryPath).
It's still not exactly the same as declaring nvidia_x11 a "runtime dependency" (if Nix has such a notion) of pytorchWithCuda: we set LD_LIBRARY_PATH / enable the symlink in the user's configuration.nix, not in pkgs/.../pytorch/default.nix. I skimmed the Nix manual, but I'm still not quite sure: if I were to override pytorch locally, would it be possible to declare in the derivation itself that certain environment variables are to be propagated to the shell?
No environment variable is needed. Just configure services.xserver.videoDrivers and hardware.opengl.enable.
It's reasonable to expect that you need some configuration: the Nvidia driver is a kernel module, and the point of Nix being purely functional is that a derivation can't reconfigure the system just by existing.
It is confusing (but true) that the options you want for using CUDA without X and OpenGL are named with xserver and opengl; this could be improved, as could the documentation.
Hm, I guess, in part, I'm asking because I wanted to figure out whether it's currently possible at all.
But! hardware.opengl.enable creates an additional symlink which, as the shellHook variant demonstrates, is not strictly required. With the shellHook, the minimal global configuration is to load the kernel module and to add the systemd unit for the control devices. So, if pytorchWithCuda could propagate the paths to mkShell, we could modify even less global state.
Afaiu, when nixpkgs ships applications, it uses buildFHSEnv or patchelf to solve dynamic linking problems. What about libraries? When I say in shell.nix that I want to use a certain library (e.g. pytorch), what other uses are expected, if not to be able to find it and dynamically link against it? Do I understand correctly that the mkDerivation mechanism does not allow this? Are there some "less pure" mechanisms that would?
Thanks!
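For comparison, the FHS-env route mentioned above looks roughly like this (a sketch using the buildFHSUserEnv helper; the package selection is only an example, and the bundled driver libraries would still have to match the running kernel module):

with import <nixpkgs> { config.allowUnfree = true; };

# Sketch: an FHS-style sandbox where the driver's libraries end up under a
# conventional /usr/lib, so ordinary dynamic linking finds libcuda.so.1
# without any NixOS-level symlink.
buildFHSUserEnv {
  name = "pytorch-fhs";
  targetPkgs = pkgs: with pkgs; [
    (python3.withPackages (ps: [ ps.pytorchWithCuda ]))
    linuxPackages.nvidia_x11
  ];
  runScript = "python3";
}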
Oh, I see, propagation of environment to reverse-dependencies in nix is implemented with setup hooks. Is that why you'd rather stick with hardware.opengl.enable?
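For illustration only, a setup hook carrying the driver path might look like this (hypothetical; as the next comment explains, this hard-codes a specific driver build and is not how nixpkgs handles it):

with import <nixpkgs> { config.allowUnfree = true; };

# Hypothetical: a derivation whose only purpose is to carry a setup hook.
# Anything that puts it in buildInputs (e.g. a mkShell/mkDerivation
# environment) gets LD_LIBRARY_PATH pointed at this particular nvidia_x11
# build when stdenv sources the hook; that is exactly the driver/kernel
# coupling argued against below.
makeSetupHook { name = "cuda-driver-hook"; } (writeText "cuda-driver-hook.sh" ''
  export LD_LIBRARY_PATH="${linuxPackages.nvidia_x11}/lib''${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
'')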
@newkozlukov What problem do you think it would solve for the PyTorch derivation to declare an LD_LIBRARY_PATH environment variable?
It would not obviate the need to edit the system configuration. Fundamentally, something needs to tell the Nvidia kernel driver to load (and which one: nvidia, nvidiaBeta, nvidiaLegacy304, nvidiaLegacy340, or nvidiaLegacy390).
It would not retain the PyTorch derivation's independence from the kernel version and Nvidia driver version that allows the extra rebuilds to be avoided. PyTorch would need to know these things in order to tell you what to set the variable to!
It would not work outside of a build environment that runs setup hooks. For example, you would be unable to install a PyTorch-using application with nix-env.
Fundamentally, if PyTorch knew where to look for the Nvidia library, it would just look there; there'd be no reason for it to ask its dependents to set an environment variable override that forces it to look there. But it doesn't know, so we configure a symlink to help it, the same symlink that's used by every OpenGL application in nixpkgs, for exactly the same reason.
The only reason anyone is even talking about the environment variable override is that it's too easy to misconfigure the system with the symlink missing. That's easy to fix.