Describe the bug
I have an RTX 2080 Ti, which needs CUDA 10.0. I can see the card via nvidia-smi etc., but I can't convince PyTorch that it's available from within nix-shell.
To Reproduce
Steps to reproduce the behavior:
test_cuda.py:
#!/usr/bin/env nix-shell
#!nix-shell pytorch-cuda10.nix -i python3
import torch
print("Cuda: {}".format(torch.cuda.is_available()))
pytorch-cuda10.nix:
with import <nixpkgs> {};
let
  unstable = import <unstable> { config.allowUnfree = true; };
in
(let
  python = let
    packageOverrides = self: super: {
      pytorch = super.pytorch.override {
        cudaSupport = true;
        cudatoolkit = unstable.cudatoolkit_10;
        cudnn = unstable.cudnn_cudatoolkit_10;
        magma = unstable.magma;
      };
    };
  in unstable.python3.override { inherit packageOverrides; self = python; };
in python.withPackages (ps: [ ps.pytorch ])).env
Expected behavior
I expect ./test_cuda.py to print Cuda: True, but instead it prints Cuda: False.
Metadata
Please run nix run nixpkgs.nix-info -c nix-info -m and paste the result.
$ nix run nixpkgs.nix-info -c nix-info -m
- system: `"x86_64-linux"`
- host os: `Linux 5.3.14, NixOS, 19.09.1476.72a2ced2523 (Loris)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3`
- channels(root): `"nixos-19.09.1498.0322870203c, unstable-20.03pre204199.3140fa89c51"`
- channels(matt): `""`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
Maintainer information:
# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
I don't know if this is helpful, but the following is my (somewhat pared down) system-wide configuration.nix:
{ config, pkgs, lib, ... }:
let
  unstable = import <unstable> { config.allowUnfree = true; };
in
{
  imports =
    [ # Include the results of the hardware scan.
      ./hardware-configuration.nix
    ];

  nixpkgs.config.allowUnfree = true;

  boot.kernelPackages = unstable.linuxPackages_5_3;
  boot.kernelParams = [ "nordrand" ];

  # Use the systemd-boot EFI boot loader.
  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;

  hardware.enableAllFirmware = true;

  environment.systemPackages = with pkgs; [
    cudatoolkit_10
  ];

  services.xserver.videoDrivers = [ "nvidia" ];

  systemd.services.nvidia-control-devices = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig.ExecStart = "${unstable.linuxPackages_5_3.nvidia_x11.bin}/bin/nvidia-smi";
  };

  # This value determines the NixOS release with which your system is to be
  # compatible, in order to avoid breaking some software such as database
  # servers. You should change this only after NixOS release notes say you
  # should.
  system.stateVersion = "19.09"; # Did you read the comment?
}
I'm not quite sure what exactly I was doing wrong, but I figured this out:
{ pkgs ? import <nixpkgs> {
    config.allowUnfree = true;
  } }:
let
  unstable = import <unstable> { config.allowUnfree = true; };
  nvidia_x11 = unstable.linuxPackages_5_3.nvidia_x11;
  cudatoolkit = unstable.cudatoolkit_10_0;
  cudnn = unstable.cudnn_cudatoolkit_10_0;
  python = unstable.python3.withPackages (ps: [
    ps.pytorchWithCuda
  ]);
in pkgs.stdenv.mkDerivation {
  name = "cuda-env-shell";
  buildInputs = with pkgs; [ python ];
  shellHook = ''
    export LD_LIBRARY_PATH="${nvidia_x11}/lib"
  '';
}
(at which point the test_cuda.py example prints Cuda: True)
I'm not sure if this is expected behavior or a bug, or maybe missing documentation - should the LD_LIBRARY_PATH be set by default somewhere?
@nuance What does your system /etc/nixos/configuration.nix look like?
Normally PyTorch will find libcuda.so.1 via /run/opengl-driver/lib/libcuda.so.1. The /run/opengl-driver symlink is created by /etc/tmpfiles.d/nixos.conf when hardware.opengl.enable is true (implied by services.xserver.enable = true), and should include the libcuda.so.1 symlink if "nvidia" is in services.xserver.videoDrivers.
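Concretely, for the headless case that means something like the following in configuration.nix (a minimal sketch based on the mechanism described above, using the option names as of NixOS 19.09; keep the rest of your config, e.g. the nvidia-control-devices unit from the original post, alongside it):

{ config, pkgs, ... }:
{
  # Having "nvidia" here is what makes /run/opengl-driver include
  # libcuda.so.1; it does not by itself require running an X server.
  services.xserver.videoDrivers = [ "nvidia" ];

  # Creates the /run/opengl-driver symlink (via /etc/tmpfiles.d/nixos.conf),
  # the path PyTorch searches for libcuda.so.1 at runtime.
  hardware.opengl.enable = true;
}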
I'm not running an xserver (this is a headless box), and hardware.opengl.enable and services.xserver.enable were unset.
I added the line hardware.opengl.enable = true; to my configuration.nix and can now run the previous shell w/out the shellHook definition.
(is that expected? I'm not sure if I'm even using OpenGL when using pytorch, as I'd assume it's just using CUDA etc.?)
My understanding is that PyTorch with CUDA doesn't literally use OpenGL, but it uses the same symlink mechanism configured by the hardware.opengl.enable option to expose libcuda. Perhaps the option could be given a more accurate name, an alias, or better documentation.
The weird part to me is that tensorflow seems to find the GPU / CUDA without that flag set - is it a byproduct of how it's being linked?
It looks like tensorflow pulls in nvidia_x11 as a build input instead of locating it through the runtime symlink, which means you need to rebuild tensorflow for every kernel update and every new Nvidia driver release. Seems like a worse tradeoff to me.
Hi! I'm new to Nix; is there a way to add linuxPackages.nvidia_x11 and LD_LIBRARY_PATH as runtime dependencies of pytorchWithCuda, at the nixpkgs level?
I just ran into this problem on a headless NixOS installation and first used the shellHook, then hardware.opengl.enable; both work. However, given that nvidia-smi reported no error even without the hardware.opengl.enable option, I expected that just requesting pytorchWithCuda in shell.nix would be enough to get things going.
Thanks
@newkozlukov hardware.opengl.enable is that way. The reason it's not needed for nvidia-smi is that nvidia-smi is part of the nvidia_x11 package.
To improve the UX, maybe hardware.opengl.enable should be auto-enabled when you add nvidia to services.xserver.videoDrivers? Right now it's only auto-enabled by services.xserver.enable, but if you're enabling the nvidia driver without enabling X, it's almost certainly because you want to use CUDA.
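For illustration, a hypothetical change to the NixOS Nvidia module along those lines could be as small as this (a sketch of the idea, not an actual patch):

{ config, lib, ... }:
{
  # Hypothetical: default hardware.opengl.enable to true whenever the
  # proprietary Nvidia driver is selected, so headless CUDA works out of
  # the box; users could still override it explicitly.
  hardware.opengl.enable = lib.mkDefault
    (lib.elem "nvidia" config.services.xserver.videoDrivers);
}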
@andersk Regarding UX: when initially configuring NixOS, I actually felt hesitant to add the services.xserver.videoDrivers stanza, because it's not immediately clear whether it would pull in the Xorg server and friends (which was undesired). If I may say so, I'd rather have a separate option that triggers modprobe nvidia (and, apparently, opengl too). It would probably have simplified troubleshooting if the nixos.wiki nvidia/cuda section pointed to hardware.opengl as well (I just noticed that the tensorflow page points to hardware.opengl.setLdLibraryPath).
It's still not exactly the same as declaring nvidia_x11 a "runtime dependency" (if Nix has such a notion) of pytorchWithCuda: we set LD_LIBRARY_PATH / enable the symlink in the user's configuration.nix, not in pkgs/.../pytorch/default.nix. I skimmed the Nix manual, but I'm still not quite sure: if I were to override pytorch locally, would it be possible to declare in the derivation itself that certain environment variables are to be propagated to the shell?
No environment variable is needed. Just configure services.xserver.videoDrivers and hardware.opengl.enable.
It's reasonable to expect that you need some configuration: the Nvidia driver is a kernel module, and the point of Nix being purely functional is that a derivation can't reconfigure the system just by existing.
It is confusing (but true) that the options you want for using CUDA without X and OpenGL are named with xserver and opengl; this could be improved, as could the documentation.
Hm, I guess, in part, I'm asking because I wanted to figure out whether it's currently possible at all.
But! hardware.opengl.enable creates an additional symlink which, as the shellHook variant demonstrates, is not strictly required. With the shellHook, the minimal global configuration is to load the kernel module and to add the systemd unit for the control devices. So, if pytorchWithCuda could propagate the paths to mkShell, we could modify even less global state.
Afaiu, when nixpkgs ships applications, it uses buildFHSEnv or patchelf to solve dynamic linking problems. What about libraries? When I say in shell.nix that I want to use a certain library (e.g. pytorch), what other uses are expected, if not to be able to find it and dynamically link against it? Do I understand correctly that the mkDerivation mechanism does not allow this? Are there some "less pure" mechanisms that would?
Thanks!
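For comparison, the FHS-env route mentioned above looks roughly like this (a sketch using the buildFHSUserEnv helper; the package selection is only an example, and the bundled driver libraries would still have to match the running kernel module):

with import <nixpkgs> { config.allowUnfree = true; };

# Sketch: an FHS-style sandbox where the driver's libraries end up under a
# conventional /usr/lib, so ordinary dynamic linking finds libcuda.so.1
# without any NixOS-level symlink.
buildFHSUserEnv {
  name = "pytorch-fhs";
  targetPkgs = pkgs: with pkgs; [
    (python3.withPackages (ps: [ ps.pytorchWithCuda ]))
    linuxPackages.nvidia_x11
  ];
  runScript = "python3";
}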
Oh, I see, propagation of environment to reverse-dependencies in nix is implemented with setup hooks. Is that why you'd rather stick with hardware.opengl.enable?
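For illustration only, a setup hook carrying the driver path might look like this (hypothetical; as the next comment explains, this hard-codes a specific driver build and is not how nixpkgs handles it):

with import <nixpkgs> { config.allowUnfree = true; };

# Hypothetical: a derivation whose only purpose is to carry a setup hook.
# Anything that puts it in buildInputs (e.g. a mkShell/mkDerivation
# environment) gets LD_LIBRARY_PATH pointed at this particular nvidia_x11
# build when stdenv sources the hook; that is exactly the driver/kernel
# coupling argued against below.
makeSetupHook { name = "cuda-driver-hook"; } (writeText "cuda-driver-hook.sh" ''
  export LD_LIBRARY_PATH="${linuxPackages.nvidia_x11}/lib''${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
'')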
@newkozlukov What problem do you think it would solve for the PyTorch derivation to declare an LD_LIBRARY_PATH environment variable?
It would not obviate the need to edit the system configuration. Fundamentally, something needs to tell the Nvidia kernel driver to load (and which one: nvidia, nvidiaBeta, nvidiaLegacy304, nvidiaLegacy340, or nvidiaLegacy390).
It would not retain the PyTorch derivation's independence from the kernel version and Nvidia driver version that allows the extra rebuilds to be avoided. PyTorch would need to know these things in order to tell you what to set the variable to!
It would not work outside of a build environment that runs setup hooks. For example, you would be unable to install a PyTorch-using application with nix-env.
Fundamentally, if PyTorch knew where to look for the Nvidia library, it would just look there; there'd be no reason for it to ask its dependents to set an environment variable override that forces it to look there. But it doesn't know, so we configure a symlink to help it, the same symlink that's used by every OpenGL application in nixpkgs, for exactly the same reason.
The only reason anyone is even talking about the environment variable override is that it's too easy to misconfigure the system with the symlink missing. That's easy to fix.