Description
While trying to package TensorFlow 2.2.0 for ROCm, I observed an issue that I think might be related to the way `build-bazel-package` works. It appears that when fetching sources for a Bazel build, `bazel fetch` is run. I am seeing the following behavior:
```
...
ERROR: /build/output/external/mkl_dnn/BUILD.bazel:53:1: no such package '@mkl_dnn//third_party/mkl_dnn': BUILD file not found in directory 'third_party/mkl_dnn' of external repository @mkl_dnn. Add a BUILD file to a directory to mark it as a package. and referenced by '@mkl_dnn//:mkl_dnn'
...
ERROR: /build/output/external/mkl_dnn/BUILD.bazel:53:1: no such package '@mkl_dnn//third_party/mkl_dnn': BUILD file not found in directory 'third_party/mkl_dnn' of external repository @mkl_dnn. Add a BUILD file to a directory to mark it as a package. and referenced by '@mkl_dnn//:mkl_dnn'
...
ERROR: /build/output/external/mkl_dnn/BUILD.bazel:53:1: no such package '@mkl_dnn//third_party/mkl_dnn': BUILD file not found in directory 'third_party/mkl_dnn' of external repository @mkl_dnn. Add a BUILD file to a directory to mark it as a package. and referenced by '@mkl_dnn//:mkl_dnn'
...
```
I am not the first to observe this error, as seen in this issue.
My assumption (possibly a poor one) is that `bazel fetch` fetches all dependencies for TensorFlow, even if those dependencies are not needed for the given build parameters (or something in both my own and @mjlbach's derivations is/was wrong). I assume that the build system is smarter than this, and that a standard `bazel build` downloads only what it needs given the build flags. Following that thought, would the behavior change if deps were switched to use `bazel build --build=false <flags>` instead of `bazel fetch`? If this did in fact download only what was needed for the build, packages would likely need to add switch statements over each combination of optional inputs to select the correct sha256.
I have not attempted to pursue this change myself, as I:
To Reproduce
Steps to reproduce the behavior:
bazel as packaged in nixpkgs to meet the minimum Bazel version needed.
Expected behavior
Correctly resolve the Bazel deps derivation.
Notify
@mboes
@uri-canva
@timokau
@abbradar
@kalbasit
@matthewbauer
@andrew-d
@mjlbach
@flokli
@andir
@acowley
Metadata
Result of `nix-shell -p nix-info --run "nix-info -m"`:
- system: `"x86_64-linux"`
- host os: `Linux 5.4.39, NixOS, 20.09pre225264.683c68232e9 (Nightingale)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3.4`
- channels(root): `"nixos-20.09pre225264.683c68232e9"`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
Last time I had this issue, it was resolved by dropping the bazel version down to the one matching TF 1.15.1. On unstable, the current (default) bazel version is 2.1.0; we could try overriding to 2.0.0, which is technically what is required for TF 2.2. We added a version argument to buildBazelPackage the last time I unbroke the bazel build on tensorflow, so that we could keep additional bazels around, since this seems to happen fairly often.
For reference, I think even the 2.1 derivation builds with bazel 0.29. I see you overrode this; was 2.2 failing to build with the older bazel?
I did get a working deps with bazel_1, but wanted to ensure that I was at least using the minimum version TensorFlow claims to need. Yes, 0.29 was failing with some issue I don't recall. I think it was also in deps.
I overrode buildPhase, and my suspicions about fetch were correct. This successfully builds deps, and I would recommend that this change be considered for the next version of Bazel that is packaged.
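For reference, here is a rough sketch of the kind of override I mean. This is illustrative only: the exact attribute names, variables, and bazel options are assumptions, and whether the phase is passed via fetchAttrs or by overriding the fetch derivation directly depends on buildBazelPackage internals.

```nix
# Illustrative only: roughly what the fetch step would run instead of the
# default `bazel fetch`. $bazelFlags and $bazelTarget stand in for whatever
# flags/targets the package passes to the real build; the output_base /
# output_user_root options used by the default phase are omitted for brevity.
fetchAttrs = {
  buildPhase = ''
    bazel build --build=false $bazelFlags $bazelTarget
  '';
  # Placeholder: the fixed-output hash for this particular flag combination.
  sha256 = "<fetch output hash>";
};
```

The point is just that the fetch step sees the same flags as the real build, so it only downloads the external repositories that configuration actually needs.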
That sounds like a bug in bazel, though. Is it reported anywhere?
@uri-canva I would be more inclined to believe that it's a TensorFlow bug, and that it exists because fetch without flags is probably not as frequently used as build with flags.
If `build --build=false` actually fetches the dependencies, then I'm for this change. It ends up making the deps derivation smaller anyway by fetching only what it needs. @Wulfsta have you checked whether it works?
@kalbasit it appears to do what I expected here. I don't have a less hardware-specific derivation to demonstrate with, but to test it you can just copy that overridden phase to any other package (I did remove a flag used for darwin builds, so that would need to be uncommented).
The obvious problem with this change is that it requires a potentially large number of hashes in a switch statement: I think the count is the product of the cardinalities of the option sets of all the optional inputs that affect the fetch phase.
That's not a big issue in practice, is it? We don't share fetch derivations across multiple build derivations, so chances are that if you're changing anything in the build derivation, you were going to change the fetch derivation as well anyway.
It very well might be, given a handful of inputs. Here is what the TensorFlow derivation does for cuda vs. CPU builds. If you keep adding boolean cases, you end up with 2^n hashes to account for, where n is the number of inputs. I obviously don't know much about what typical Bazel projects look like, but that number grows quickly. It is at least something to keep in mind, since this change would increase the granularity of the fetch phase.
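To put a number on it: with just two boolean inputs that affect fetching, the switch already needs four hashes. A hypothetical sketch (`cudaSupport` is the flag the TensorFlow derivation actually switches on; `mklSupport` and the hash strings are made up for illustration):

```nix
# Hypothetical per-configuration hash selection for the fetch derivation.
# With n boolean fetch-affecting inputs this grows as 2^n entries.
fetchAttrs = {
  sha256 =
    if cudaSupport && mklSupport then "<hash: cuda + mkl>"
    else if cudaSupport then "<hash: cuda only>"
    else if mklSupport then "<hash: mkl only>"
    else "<hash: cpu only>";
};
```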