Nixpkgs: Tensorflow build fails on master

Created on 10 Nov 2017 · 36 Comments · Source: NixOS/nixpkgs

Issue description

On 18.03-Impala the package pythonPackages.tensorflow fails to build, whether via pythonPackages, python3Packages or python36Packages (it failed both on my machine and on @fpletz's machine).

The error message looks like this:

$ nix-build -A pythonPackages.tensorflow
...
____Loading package: tensorflow/tools/pip_package
____Loading package: @bazel_tools//tools/cpp
____Loading package: @bazel_tools//tools/jdk
____Loading package: @local_config_xcode//
____Loading package: @local_jdk//
____Loading package: @local_config_cc//
____Loading complete.  Analyzing...
____Loading package: tensorflow/python/tools
____Loading package: @nccl_archive//
____Loading package: tensorflow/python
ERROR: /tmp/nix-build-python2.7-tensorflow-1.3.1.drv-0/source/tensorflow/tools/pip_package/BUILD:100:1: no such package '@zlib_archive//': BUILD file not found on package path and referenced by '//tensorflow/tools/pip_package:licenses'.
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted.
____Elapsed time: 4.076s
builder for '/nix/store/1chcdjqqnljpwd8xznwf7pql02s2h38x-python2.7-tensorflow-1.3.1.drv' failed with exit code 1
error: build of '/nix/store/1chcdjqqnljpwd8xznwf7pql02s2h38x-python2.7-tensorflow-1.3.1.drv' failed
$ nix-build -A python3Packages.tensorflow
...
____Loading package: tensorflow/tools/pip_package
____Loading package: @bazel_tools//tools/cpp
____Loading package: @bazel_tools//tools/jdk
____Loading package: @local_config_xcode//
____Loading package: @local_jdk//
____Loading package: @local_config_cc//
____Loading complete.  Analyzing...
____Loading package: tensorflow/contrib/slim
____Loading package: tensorflow/python
____Loading package: tensorflow/contrib/tensor_forest
____Loading package: @grpc//
____Loading package: tensorflow/contrib/timeseries
ERROR: /tmp/nix-build-python3.6-tensorflow-1.3.1.drv-0/source/tensorflow/tools/pip_package/BUILD:100:1: no such package '@png_archive//': BUILD file not found on package path and referenced by '//tensorflow/tools/pip_package:licenses'.
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted.
____Elapsed time: 4.044s
builder for '/nix/store/2hqd8afh58mvr72301apqh5yv6x57bql-python3.6-tensorflow-1.3.1.drv' failed with exit code 1
error: build of '/nix/store/2hqd8afh58mvr72301apqh5yv6x57bql-python3.6-tensorflow-1.3.1.drv' failed

The first commit which affects tensorflow and is not incorporated in 17.09 is https://github.com/NixOS/nixpkgs/commit/1f2a18d9163f75c1001a04157f195557b0c24f8a#diff-4c48ecaa454daa000c372b8b2ca7cfbe by @abbradar.
However this might be a transitive issue caused by Bazel.

Steps to reproduce

Run nix-build -A pythonPackages.tensorflow or nix-build -A python3Packages.tensorflow on master.

When pinning NixOS to 17.09 it works perfectly fine.
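
For comparison, here is a minimal sketch of one way to pin to 17.09. It assumes the release-17.09 branch tarball on GitHub is an acceptable stand-in for the channel the reporter used; the file name pinned.nix is hypothetical.

```nix
# pinned.nix (hypothetical file name): build tensorflow from a 17.09 nixpkgs.
let
  # Assumption: the tip of the release-17.09 branch. For a reproducible
  # result, pin a specific commit and pass a sha256 as well.
  pinned = import (builtins.fetchTarball
    "https://github.com/NixOS/nixpkgs/archive/release-17.09.tar.gz") { };
in
  pinned.pythonPackages.tensorflow
```

Building this with nix-build pinned.nix and then building the same attribute from a master checkout gives a side-by-side comparison of the two package sets.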

Technical details

All 36 comments

note: I'm just trying to "learn" tensorflow, so I'm definitely not an expert about this...

Hm, I guess this is because of sandboxing. Can you try enabling it and see if it works?

Unfortunately not:

$ nix run nixpkgs.pythonPackages.tensorflow --sandbox
builder for '/nix/store/ikrfvs0blbi9ibvw7y1s0nkv3l7ikbcr-python2.7-tensorflow-1.3.1.drv' failed with exit code 1; last 10 log lines:
  ____Loading package: tensorflow/contrib/boosted_trees
  ____Loading package: tensorflow/contrib/cluster_resolver
  ____Loading package: tensorflow/python/saved_model
  ____Loading package: tensorflow/contrib/signal
  ____Loading package: tensorflow/core
  ____Loading package: @protobuf//
  ____Loading package: third_party/hadoop
  ERROR: /tmp/nix-build-python2.7-tensorflow-1.3.1.drv-0/source/tensorflow/tools/pip_package/BUILD:100:1: no such package '@lmdb//': BUILD file not found on package path and referenced by '//tensorflow/tools/pip_package:licenses'.
  ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted.
  ____Elapsed time: 4.051s
[0 built (1 failed), 0.0 MiB DL]
error: build of '/nix/store/ikrfvs0blbi9ibvw7y1s0nkv3l7ikbcr-python2.7-tensorflow-1.3.1.drv' failed

I can reproduce this on master; strange, but most likely something else has changed that broke the build. Either way, I have a 1.4 update ready that builds for me.

If 1.4 builds, I think we don't need any further investigation here, do we?

Let's leave this open until 1.4 lands.

In case this needs to be fixed prior to 1.4, c3255fe8ec326d2c8fe9462d49ed83aa64d3e68f appears to be the commit that breaks this, though there seem to be some glibc 2.26 issues too (at least with CUDA).

@abbradar what's needed to merge your tensorflow branch?

Successfully built it without cuda support with this patch on top of your tensorflow-new branch:

From 1eb8b76515e50aad1c7fbc3690f75df19567d418 Mon Sep 17 00:00:00 2001
From: Robin Gloster <[email protected]>
Date: Thu, 7 Dec 2017 19:45:17 +0100
Subject: [PATCH] tensorflow: correctly optionalize cuda + fix deps hash

---
 pkgs/development/python-modules/tensorflow/default.nix | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pkgs/development/python-modules/tensorflow/default.nix b/pkgs/development/python-modules/tensorflow/default.nix
index 8b59916f009..fcf6fbfe6e6 100644
--- a/pkgs/development/python-modules/tensorflow/default.nix
+++ b/pkgs/development/python-modules/tensorflow/default.nix
@@ -71,7 +71,7 @@ let
       mkdir -p "$PYTHON_LIB_PATH"
     '';

-    NIX_CFLAGS_COMPILE = cudatoolkit.ccFlags;
+    NIX_CFLAGS_COMPILE = lib.optional cudaSupport cudatoolkit.ccFlags;
     NIX_LDFLAGS = lib.optionals cudaSupport [ "-lcublas" "-lcudnn" "-lcuda" "-lcudart" ];

     hardeningDisable = [ "all" ];
@@ -89,7 +89,7 @@ let
         rm -rf $bazelOut/external/{bazel_tools,\@bazel_tools.marker,local_*,\@local_*}
       '';

-      sha256 = "0sq0a7vsajzqwxgg82xw1q74n7vdq37n9d5z7p0c8gzpmyw7mgc9";
+      sha256 = "10k7i61ya33dcy98i0s7r8f1d4s4rwjl5myfyiyr46skjpzydxdv";
     };

     buildAttrs = {
-- 
2.15.0

Does it fail if you build with CUDA?

Built both with the patch above.

ping @abbradar

My build demands the same hash as @globin's. Because tensorflow is broken in master anyway, I'd propose to go with this patch until we have a better understanding of what's going on. (The alternative would be to roll back to a wheel-based pre-built tensorflow.)

Actually, I did not manage to rebase the branch in question onto master. (The patch of bazel does not apply any more, and the simplest fix yields a version which can't build tensorflow.) It seems that in the absence of @abbradar we'll have to return to the wheel-based build.

After some more investigation: the bazel patch is applied neither in the tensorflow-new branch nor in master. (In fact it won't apply.) After rebasing, applying it is attempted but fails. Fixing or disabling the patch leads to a build failure for tensorflow.

I went back to the wheel-based build (locally). For the record here is the nix file that I'm using.

{ stdenv
, symlinkJoin
, lib
, fetchurl
, buildPythonPackage
, isPy3k, isPy35, isPy36, isPy27
, cudaSupport ? false
, cudatoolkit ? null
, cudnn ? null
, linuxPackages ? null
, tensorflow-tensorboard
, six
, protobuf
, numpy
, mock
, backports_weakref
, absl-py
, zlib
, python
}:

assert cudaSupport -> cudatoolkit != null
                   && cudnn != null
                   && linuxPackages != null;

# unsupported combination
assert ! (stdenv.isDarwin && cudaSupport);

# tensorflow is built from a downloaded wheel, because the upstream
# project's build system is an arcane beast based on
# bazel. Untangling it and building the wheel from source is an open
# problem.

buildPythonPackage rec {
  pname = "tensorflow";
  version = "1.5.0rc1";
  name = "${pname}-${version}";
  format = "wheel";
  disabled = ! (isPy35 || isPy36 || isPy27);

  # cudatoolkit is split (see https://github.com/NixOS/nixpkgs/commit/bb1c9b027d343f2ce263496582d6b56af8af92e6)
  # However this means that libcusolver is not loadable by tensor flow. So we undo the split here.
  cudatoolkit_joined = symlinkJoin {
    name = "unsplit_cudatoolkit";
    paths = [ cudatoolkit.out
              cudatoolkit.lib ];};

  src = let
      tfurl = sys: proc: pykind:
        let
          tfpref = if proc == "gpu"
            then "gpu/tensorflow_gpu"
            else "cpu/tensorflow";
        in
        "https://storage.googleapis.com/tensorflow/${sys}/${tfpref}-${version}-${pykind}.whl";
      dls =
        {
        darwin.cpu = {
          py2 = {
            url = tfurl "mac" "cpu" "py2-none-any" ;
            sha256 = "0nkymqbqjx8rsmc8vkc26cfsg4hpr6lj9zrwhjnfizvkzbbsh5z4";
          };
          py3 = {
            url = tfurl "mac" "cpu" "py3-none-any" ;
            sha256 = "1rj4m817w3lajnb1lgn3bwfwwk3qwvypyx11dim1ybakbmsc1j20";
          };
        };
        linux-x86_64.cpu = {
          py2 = {
            url = tfurl "linux" "cpu" "cp27-none-linux_x86_64";
            sha256 = "09pcyx0yfil4dm6cij8n3907pfgva07a38avrbai4qk5h6hxm8w9";
          };
          py35 = {
            url = tfurl "linux" "cpu" "cp35-cp35m-linux_x86_64";
            sha256 = "0p10zcf41pi33bi025fibqkq9rpd3v0rrbdmc9i9yd7igy076a07";
          };
          py36 = {
            url = tfurl "linux" "cpu" "cp36-cp36m-linux_x86_64";
            sha256 = "1qm8lm2f6bf9d462ybgwrz0dn9i6cnisgwdvyq9ssmy2f1gp8hxk";
          };
        };
        linux-x86_64.cuda = {
          py2 = {
            url = tfurl "linux" "gpu" "cp27-none-linux_x86_64";
            sha256 = "10yyyn4g2fsv1xgmw99bbr0fg7jvykay4gb5pxrrylh7h38h6wah";
          };
          py35 = {
            url = tfurl "linux" "gpu" "cp35-cp35m-linux_x86_64";
            sha256 = "0icwnhkcf3fxr6bmbihqzipnn4pxybd06qv7l3k0p4xdgycwzmzk";
          };
          py36 = {
            url = tfurl "linux" "gpu" "cp36-cp36m-linux_x86_64";
            sha256 = "16n8fx8h66jy07p93fvny8knq8ri1i2svm2sbw9fq44lhrhqi4az";
          };
        };
      };
    in
    fetchurl (
      if stdenv.isDarwin then
        if isPy3k then
          dls.darwin.cpu.py3
        else
          dls.darwin.cpu.py2
      else
        if isPy35 then
          if cudaSupport then
            dls.linux-x86_64.cuda.py35
          else
            dls.linux-x86_64.cpu.py35
        else if isPy36 then
          if cudaSupport then
            dls.linux-x86_64.cuda.py36
          else
            dls.linux-x86_64.cpu.py36
        else
          if cudaSupport then
            dls.linux-x86_64.cuda.py2
          else
            dls.linux-x86_64.cpu.py2
    );

  propagatedBuildInputs =
    [ numpy six protobuf mock backports_weakref absl-py ]
    ++ lib.optional (!isPy36) tensorflow-tensorboard
    ++ lib.optionals cudaSupport [ cudatoolkit_joined cudnn stdenv.cc ];

  # tensorflow-gpu depends on tensorflow_tensorboard, which cannot be
  # built at the moment (some of its dependencies do not build
  # [html5lib9999999 (seven nines) -> tensorboard], and it depends on an old version of
  # bleach) Hence we disable dependency checking for now.
  installFlags = lib.optional isPy36 "--no-dependencies";

  # Note that we need to run *after* the fixup phase because the
  # libraries are loaded at runtime. If we run in preFixup then
  # patchelf --shrink-rpath will remove the cuda libraries.
  postFixup = let
    rpath = stdenv.lib.makeLibraryPath
      (if cudaSupport then
        [ stdenv.cc.cc.lib zlib cudatoolkit_joined cudnn
          linuxPackages.nvidia_x11 ]
      else
        [ stdenv.cc.cc.lib zlib ]
      );
  in
  ''
    rrPath="$out/${python.sitePackages}/tensorflow/:${rpath}"
    internalLibPath="$out/${python.sitePackages}/tensorflow/python/_pywrap_tensorflow_internal.so"
    find $out -name '*.so' -exec patchelf --set-rpath "$rrPath" {} \;
  '';

  doCheck = false;

  meta = with stdenv.lib; {
    description = "TensorFlow helps the tensors flow";
    homepage = http://tensorflow.org;
    license = licenses.asl20;
    maintainers = with maintainers; [ jyp ];
    platforms = with platforms; if cudaSupport then linux else linux ++ darwin;
  };
}

@jyp I'm trying this out. Which nixpkgs commit does this build on top of?

@timsears d2d1a2dfbabaf723ebc2102a3c7baa5138303bc2

Note that this was quick and dirty --- only one hash is updated (py 3.6 with cuda).

@jyp thanks. For others (or myself when I forget :-) I got tensorflowWithCuda running using @jyp's wheel-based build. I am using a local nixpkgs GitHub checkout under my home directory, tracking nixos-unstable. Tested with commit 5402412b97247bcc plus the following changes.

  1. Replace ~/nixpkgs/pkgs/development/python-modules/tensorflow/default.nix with the expression provided above.
  2. Edit ~/nixpkgs/pkgs/top-level/python-packages.nix. Make the expression for tensorflow look like

tensorflow = callPackage ../development/python-modules/tensorflow rec {
  cudaSupport = pkgs.config.cudaSupport or false;
  cudatoolkit = pkgs.cudatoolkit9;
  cudnn = pkgs.cudnn_cudatoolkit9;
};

(Note: This uses a recent version of cudnn that you have to upload with nix-prefetch-url.)
For completeness' sake, here are the versions I ended up with. Loading cuda after nix packaging can be very flaky.

nix-repl> cudnn.version
"7.0.5"
nix-repl> cudatoolkit.version
"9.0.176"
nix-repl> cudnn_cudatoolkit9
«derivation /nix/store/1al281ymrdlhj2z4nhbnycsclwf9yh5n-cudatoolkit-9.0-cudnn-7.0.5.drv»

  3. Here's a shell.nix that will build an environment good for testing.

{ pkgs ? (import <nixpkgs>) {} }: 
with pkgs;

stdenv.mkDerivation {
  name = "MachineLearningEnv";
  buildInputs = [
    pandoc # or other command line tools you may need
    ]
    ++ (with python36Packages; [
    tensorflowWithCuda
    #tensorflow # provides the cpu version
    jupyter
    Keras
    pillow
    widgetsnbextension
    scikitlearn
    seaborn
    matplotlib
    ]);
}

Related to @jyp's comment, I should note that I had to change one of the hashes in the expression. That change is not reflected in the expression above.

I was planning to make a PR with the above patch for tensorflow 1.5.0, but unfortunately I got:

RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb

I tried to track this down but it is daunting to even get a mapping from API versions to numpy versions. Updating to numpy 1.14 did not change the error.

Actually, tensorflow 1.5.0 now requires numpy 1.14, but nixpkgs has an issue with that version: https://github.com/NixOS/nixpkgs/issues/33559

Some weirdness:

  • If I change the numpy version in nixpkgs, the build goes through but it does not actually change the version of numpy. Apparently there is a weird fallback mechanism in fetchPypi which makes it pick the wrong version as long as the sha256 matches.
  • On PyPI, tensorflow claims to depend on numpy 1.13 only.

PR is available in #34418 but cannot be merged at the moment. If you care, please help with the blocking issues: #33559

@jyp Although your PR is closed, regarding Numpy: you could override numpy in the tensorflow derivation, e.g.

let
  numpy_1_14 = numpy.overridePythonAttrs(oldAttrs: rec {
    version = "1.14.0";
    pname = "numpy";
    name = "${pname}-${version}";
    src = python.pkgs.fetchPypi {
      inherit pname version;
      extension = "zip";
      sha256 = "1ywrq31sy8hkgis1sv9kgac53v2478r1i01442s0f8r1bf9l7rix";
    };
  });
in
  ...
   buildInputs = [ numpy_1_14 ];
  ...

Hacky but it should work. I'm going to try and tackle building it this weekend as I have a project which depends on it.

Regarding the sha not triggering a re-fetch, it does that with fetchurl and fetchFromGitHub as well. If the sha is the same, there isn't any reason to try fetching again. For quick testing I usually change the sha by one character to force a re-fetch; the mismatch error that gets printed then shows the actual hash, which you can copy.
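
A minimal sketch of that trick, reusing the numpy 1.14.0 source from the override above. The file name refetch.nix is hypothetical, and python3.pkgs.fetchPypi is assumed to be available in the nixpkgs on your NIX_PATH. Fixed-output derivations are looked up by name and hash, which is why bumping only the version keeps handing back the old source.

```nix
# refetch.nix (hypothetical file name): force a re-download to learn the real hash.
with import <nixpkgs> { };

python3.pkgs.fetchPypi {
  pname = "numpy";
  version = "1.14.0";
  extension = "zip";
  # Deliberately wrong: the hash quoted above with its last character
  # changed from "x" to "z". No store path exists for this hash, so Nix
  # has to download the file and compare.
  sha256 = "1ywrq31sy8hkgis1sv9kgac53v2478r1i01442s0f8r1bf9l7riz";
}
```

Running nix-build refetch.nix should fail with a hash mismatch that reports the hash of what was actually downloaded; copy that value back into the expression.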

@lukeadams
I have another PR with 1.4 which should be mergeable now. If you support the revert to a wheel-based build please comment about it in the PR.

https://github.com/NixOS/nixpkgs/pull/34420

@abbradar is back working on the bazel build. Some fixes went in and he thinks that the current failure is due to other factors.

Fixed in https://github.com/NixOS/nixpkgs/commit/94ebc13a6ac5c6448a932ca48ae9e2bd9ce755ea -- the core issue was tensorflow after all; because I tested exclusively CUDA builds, this went unnoticed. Let's leave this open until Hydra builds the package.

awesome, thanks!

@abbradar isn't the hydra server building bin.nix and not the default.nix? Or, at least, on master in python-packages.nix it says

  tensorflow =
    if stdenv.isDarwin
    then callPackage ../development/python-modules/tensorflow/bin.nix { }
    else callPackage ../development/python-modules/tensorflow/bin.nix rec {
      ...
    };

If I change this to use bazel based default.nix file I get

nix-build -A pythonPackages.tensorflow
...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
........
Loading:
Loading: 0 packages loaded
Analyzing: target //tensorflow/tools/pip_package:build_pip_package (2 packages loaded)
Analyzing: target //tensorflow/tools/pip_package:build_pip_package (65 packages loaded)
Analyzing: target //tensorflow/tools/pip_package:build_pip_package (115 packages loaded)
ERROR: /build/output/external/jpeg/BUILD:122:12: Illegal ambiguous match on configurable attribute "deps" in @jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted:

/build/output/external/jpeg/BUILD:122:12: Illegal ambiguous match on configurable attribute "deps" in @jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
INFO: Elapsed time: 4.480s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (124 packages loaded)
FAILED: Build did NOT complete successfully (124 packages loaded)
builder for '/nix/store/lq5lr44hp3c61qbfafllwvin93zsx7p4-tensorflow-build-1.5.0.drv' failed with exit code 1
cannot build derivation '/nix/store/lfazj9kahmpgqmbp8ncgxd0647f1ckvn-python2.7-tensorflow-1.5.0.drv': 1 dependencies couldn't be built
error: build of '/nix/store/lfazj9kahmpgqmbp8ncgxd0647f1ckvn-python2.7-tensorflow-1.5.0.drv' failed

@jyp @timsears the bin.nix build in master also fails to work with python.withPackages. Checking out master and then doing

nix-build -E 'with import ./. { }; python.withPackages (p: with p; [tensorflow])'
building '/nix/store/zj8adg9ri2jfmr0xg86rlk0i9nk0sw85-python-2.7.15-env.drv'...
collision between `/nix/store/z9a2y31akz53df8g9d6f5klr7kdzgmvb-python2.7-tensorflow-tensorboard-1.7.0/bin/.tensorboard-wrapped' and `/nix/store/wmxg0d5zw88a9m8zk1pmza3ga05npgzc-python2.7-tensorflow-1.7.1/bin/.tensorboard-wrapped'
builder for '/nix/store/zj8adg9ri2jfmr0xg86rlk0i9nk0sw85-python-2.7.15-env.drv' failed with exit code 25
error: build of '/nix/store/zj8adg9ri2jfmr0xg86rlk0i9nk0sw85-python-2.7.15-env.drv' failed

I'm guessing this is due to tensorflow-tensorboard being a propagated-build-input of tensorflow.

@jyp @timsears unpacked the tensorflow wheel and poked around in it. Found this purelib/tensorflow/tools/pip_package/setup.py

CONSOLE_SCRIPTS = [
    'freeze_graph = tensorflow.python.tools.freeze_graph:main',
    'toco_from_protos = tensorflow.contrib.lite.toco.python.toco_from_protos:main',
    'toco = tensorflow.contrib.lite.toco.python.toco_wrapper:main',
    'saved_model_cli = tensorflow.python.tools.saved_model_cli:main',
    # We need to keep the TensorBoard command, even though the console script
    # is now declared by the tensorboard pip package. If we remove the
    # TensorBoard command, pip will inappropriately remove it during install,
    # even though the command is not removed, just moved to a different wheel.
    'tensorboard = tensorboard.main:run_main',
]

From the sounds of that, possibly the tensorboard executable should just be erased as a post step?

Possibly #42783 is a way out of this.

With regard to the bin.nix variant and tensorboard, I added

rm $out/bin/tensorboard $out/bin/.tensorboard-wrapped

to the postFixup phase script of tensorflow and it solved the pythonPackages issue (tested both the GPU and CPU versions under Linux).
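
For anyone who wants to try this without editing nixpkgs, the same removal can be sketched as an override. This is untested and assumes a nixpkgs revision where tensorflow is the wheel-based bin.nix variant that ships these two files; the file name tf-env.nix is hypothetical, and rm -f is used here (instead of plain rm) so the build does not fail where the files are absent.

```nix
# tf-env.nix (hypothetical file name): python environment with the duplicate
# tensorboard launcher removed from the tensorflow output.
with import <nixpkgs> { };

python.withPackages (p: [
  (p.tensorflow.overridePythonAttrs (old: {
    # Append to the existing postFixup so the rpath patching still runs first.
    postFixup = (old.postFixup or "") + ''
      rm -f $out/bin/tensorboard $out/bin/.tensorboard-wrapped
    '';
  }))
])
```

Build it with nix-build tf-env.nix. The tensorboard command then comes only from the tensorflow-tensorboard package, which is what the setup.py comment quoted earlier suggests is intended.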

@twhitehead Please submit a PR -- this issue is about something else entirely (and closed).

@jyp will do. Also found issue #42809 about the bazel build issue I noticed too, so will continue that discussion there.
