Description
While trying to package TensorFlow 2.2.0 for ROCm, I observed an issue that I think might be related to the way `build-bazel-package` works. It appears that when fetching sources for a Bazel build, `bazel fetch` is run. I am seeing the following behavior:
```
...
ERROR: /build/output/external/mkl_dnn/BUILD.bazel:53:1: no such package '@mkl_dnn//third_party/mkl_dnn': BUILD file not found in directory 'third_party/mkl_dnn' of external repository @mkl_dnn. Add a BUILD file to a directory to mark it as a package. and referenced by '@mkl_dnn//:mkl_dnn'
...
ERROR: /build/output/external/mkl_dnn/BUILD.bazel:53:1: no such package '@mkl_dnn//third_party/mkl_dnn': BUILD file not found in directory 'third_party/mkl_dnn' of external repository @mkl_dnn. Add a BUILD file to a directory to mark it as a package. and referenced by '@mkl_dnn//:mkl_dnn'
...
ERROR: /build/output/external/mkl_dnn/BUILD.bazel:53:1: no such package '@mkl_dnn//third_party/mkl_dnn': BUILD file not found in directory 'third_party/mkl_dnn' of external repository @mkl_dnn. Add a BUILD file to a directory to mark it as a package. and referenced by '@mkl_dnn//:mkl_dnn'
...
```
I am not the first to observe this error, as seen in this issue.
My assumption (possibly a poor one) is that `bazel fetch` fetches all dependencies for TensorFlow, even if those dependencies are not needed for the given build parameters (or something in both my own and @mjlbach's derivations is/was wrong). I assume that the build system is smarter than this, and that a standard `bazel build` downloads only what it needs given the build flags. Following that thought, would the behavior change if deps were switched to use `bazel build --build=false <flags>` instead of `bazel fetch`? If this did in fact download only what was needed for the build, packages would likely need to add switch statements over each combination of optional inputs to select the correct sha256.
I have not attempted to pursue this change myself, as I:
To Reproduce
Steps to reproduce the behavior:
bazel as packaged in nixpkgs to meet the minimum Bazel version needed.
Expected behavior
Correctly resolve the Bazel deps derivation.
Notify
@mboes
@uri-canva
@timokau
@abbradar
@kalbasit
@matthewbauer
@andrew-d
@mjlbach
@flokli
@andir
@acowley
Metadata
Result of `nix-shell -p nix-info --run "nix-info -m"`:
- system: `"x86_64-linux"`
- host os: `Linux 5.4.39, NixOS, 20.09pre225264.683c68232e9 (Nightingale)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3.4`
- channels(root): `"nixos-20.09pre225264.683c68232e9"`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
Last time I had this issue, it was resolved by dropping the bazel version down to the one matching TF 1.15.1. On unstable, the current (default) bazel version is 2.1.0; we could try overriding to 2.0.0, which is technically what is required for TF 2.2. We added a version argument to buildBazelPackage the last time I unbroke the bazel build on tensorflow, so that we could keep additional bazels around, since this seems to happen fairly often.
For reference, I think even the 2.1 derivation builds with bazel 0.29. I see you overrode this; was 2.2 failing to build with the older bazel?
I did get a working deps with bazel_1, but wanted to ensure that I was at least using the minimum version TensorFlow claims to need. Yes, 0.29 was failing with some issue I don't recall. I think it was also in deps.
I overrode buildPhase, and my suspicions about fetch were correct. This successfully builds deps, and I would recommend that this change be considered for the next version of Bazel that is packaged.
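For reference, here is a rough sketch of the kind of override I mean. This is illustrative only: the exact attribute names, variables, and bazel options are assumptions, and whether the phase is passed via fetchAttrs or by overriding the fetch derivation directly depends on buildBazelPackage internals.

```nix
# Illustrative only: roughly what the fetch step would run instead of the
# default `bazel fetch`. $bazelFlags and $bazelTarget stand in for whatever
# flags/targets the package passes to the real build; the output_base /
# output_user_root options used by the default phase are omitted for brevity.
fetchAttrs = {
  buildPhase = ''
    bazel build --build=false $bazelFlags $bazelTarget
  '';
  # Placeholder: the fixed-output hash for this particular flag combination.
  sha256 = "<fetch output hash>";
};
```

The point is just that the fetch step sees the same flags as the real build, so it only downloads the external repositories that configuration actually needs.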
That sounds like a bug in bazel, though. Is it reported anywhere?
@uri-canva I would be more inclined to believe that it's a TensorFlow bug, and that it exists because fetch without flags is probably not as frequently used as build with flags.
If `build --build=false` actually fetches the dependencies, then I'm for this change. It ends up making the deps derivation smaller anyway by fetching only what it needs. @Wulfsta have you checked whether it works?
@kalbasit it appears to do what I expected here. I don't have a less hardware-specific derivation to demonstrate with, but to test it you can just copy that overridden phase to any other package (I did remove a flag used for darwin builds, so that would need to be uncommented).
The obvious problem with this change is that it requires a potentially large number of hashes in a switch statement: I think the count is the product of the cardinalities of the option sets of all the optional inputs that affect the fetch phase.
That's not a big issue in practice, is it? We don't share fetch derivations across multiple build derivations, so chances are that if you're changing anything in the build derivation, you were going to change the fetch derivation as well anyway.
It very well might be, given a handful of inputs. Here is what the TensorFlow derivation does for cuda vs. CPU builds. If you keep adding boolean cases, you end up with 2^n hashes to account for, where n is the number of inputs. I obviously don't know much about what typical Bazel projects look like, but that number grows quickly. It is at least something to keep in mind, since this change would increase the granularity of the fetch phase.
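To put a number on it: with just two boolean inputs that affect fetching, the switch already needs four hashes. A hypothetical sketch (`cudaSupport` is the flag the TensorFlow derivation actually switches on; `mklSupport` and the hash strings are made up for illustration):

```nix
# Hypothetical per-configuration hash selection for the fetch derivation.
# With n boolean fetch-affecting inputs this grows as 2^n entries.
fetchAttrs = {
  sha256 =
    if cudaSupport && mklSupport then "<hash: cuda + mkl>"
    else if cudaSupport then "<hash: cuda only>"
    else if mklSupport then "<hash: mkl only>"
    else "<hash: cpu only>";
};
```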