Bazel: Error with external dependencies: some change are not seen

Created on 15 Jul 2016  Â·  49Comments  Â·  Source: bazelbuild/bazel

%> bazel version
Build label: 0.2.1-2016-04-08 (@447f7f3)
Build target: bazel-out/local_linux-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Apr 8 17:35:41 2016 (1460136941)
Build timestamp: 1460136941
Build timestamp as int: 1460136941

My project has an external dependency on a tar.gz hosted in s3 (new_http_archive). This dependency has several different build targets that I depend on in one of my packages (runtime dependency, so using the data attribute). I wanted to unify those targets into a single filegroup so that I could depend on them easily in multiple places. In the build file for my external dependency I created a filegroup that lists each of the relevant targets as both srcs and data. When I attempt to build the dependent package I receive an error about the new filegroup not existing:

ERROR: /path/to/workspace/dependent/BUILD:48:1: no such target '@dependency//:system': target 'system' not declared in package '' defined by /path/to/cache/dir/external/dependency/BUILD and referenced by '//dependent:runtime'.

The only way to get rid of this error is to expunge the build tree and rebuild. It seems that the state of the external dependency build targets are not reevaluated if the external dependency has not changed. I believe this to be an error in Bazel's handling of external dependencies because changes in the build file should be visible to other build targets without first destroying the state of the build tree.

misc > misc bug under investigation

Most helpful comment

Sorry for the delay! Thanks for the repro, it is reproducible for me. On first glance, it looks like some sort of bad interaction between new_local_repository and bind(). By changing your example to local_repository, removing the bind(), and copying the BUILD file into the repo I "fixed" the issue, but obviously something is very wrong here. I'll continue investigating.

All 49 comments

0.2.1 is rather old release, can you update? We fixed a lot of invalidation issues since then.

Closing for now, feel free to ping if you have more issues.

We're still seeing problems even on 0.3.0. For example:

INFO: Found 1 target...
ERROR: missing input file '@<redacted>//:<redacted>'.
ERROR: <redacted>/073241073e47b03b6e3590c37cd049f8/external/<redacted>/BUILD:3:1: @<redacted>//:<redacted>: missing input file '@<redacted>//:<redacted>'.
Target <redacted> failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: <redacted>73241073e47b03b6e3590c37cd049f8/external/<redacted>/BUILD:3:1 1 input file(s) do not exist.
INFO: Elapsed time: 21.521s, Critical Path: 4.59s

This was caused by changing a new_local_repository pointing at the local filesystem to a new_remote_archive that pulls a tarball from a remote host. The BUILD file for the dependency was also updated because the paths inside the tarball were slightly different than the local repository.

I found the caveats with caching of external dependencies but running bazel fetch //... did _not_ fix it. I ended up fixing it by adding a version number to the repo name and updating the bind call so the path at //external:thing didn't change. This is a somewhat acceptable workaround but isn't intuitive.

Is this expected behavior that we need to work around, are we doing something wrong, or is this a bazel bug that will eventually be fixed?

That is not expected and I cannot reproduce. Do you have a way to reproduce that you should share?

I've reproduced it fairly consistently on a large production repo but I haven't been able to reproduce it in a test repo that I can share. I didn't get a chance to do a stripped down test changing a new_local_repository to a new_http_archive yet so it could be something with that. When I get some time I'll try to properly reproduce it and report back.

I was bitten by this issue again: this time the CROSSTOOL was changed to remove a copt, that same copt was added to targets in an external dep, but building it resulted in errors because the copt was _not_ present when rebuilding the external dep (verified by inspecting the command that failed). Performing an clean --expunge_async and rebuilding with --subcommands succeeded and _did_ have the required copt.

It seems like build files for external dependencies are aggressively cached and updating the files on disk does not always invalidate that cache. I'm not familiar with Bazel internals so I'm only speculating, though.

In any case, this issue continues to plague my project with bizarre errors that are hard to reproduce and ultimately boil down to "bazel doesn't handle external dependencies well". Oddly enough, we consistently ran into this particular in CI for ~15 builds and could not reproduce it locally. As soon as we add a bazel clean --expunge_async to the CI procedure, it passes. Potential workarounds include:

  • bazel clean when updating external dependencies

    • May or may not be necessary.

    • Slows down CI and increases download bandwidth for external deps.

    • Errors linger and crop up over the next few weeks as developers update their local checkouts.

  • Consider external dependencies immutable and update the name anytime it changes e.g. @library => @library_v2.

    • Unclear that this is necessary.

    • You end up with ugly names.

We're on Bazel 0.3.0. Unfortunately, I have not been able to get a small test repository to reproduce the issue but I also only tried to do so for about half an hour with no luck.

+1 for reopening this, the poor tracking of changes in external deps has been a huge pain for us.

+1 for reopening this as well. Just happened to me, and cost me a half day's work.

This cost me about 4 hours today.

I'm trying to put together a simple reproduction for this. So far my repro case is using new_http_archive with a tarball and an external cc_library target.

The basic things I've tried that haven't worked:

  • adding a second cc_library target to the build file for the external target (without changing the tarball), and depending on it from another target
  • Adding a defines to the target in the build file for the external target

I've tried pulling in a variant of our CROSSTOOL file and the compiler definition (because that affects how C/C++ are built), but I'm still not able to reproduce this in isolation.

I'm going to try to build a reproduction of this within our repository, even if I can't share it, just to prove to myself that there is a reproducible bug here.

Spent a few hours trying to reproduce this in our repository, changing the defines parameter of a sample external rule between builds, but bazel always picks up the changes correctly.

More interestingly, I've tried variations on issues that have provoked this before, but bazel always rebuilds correctly in those cases too.

We've seen this most often on our CI server, so I'm wondering if there's some option that we pass to bazel during the CI build that might be provoking this. Our CI runs with sandboxing turned on, so I'm going to try the same change set with sandboxing on.

I've managed to reproduce this within our repository on a small test case. I'm going to try to extract this into something that doesn't depend on our entire repository.

I haven't been able to extract my test case into something that I can share yet, but it involves manually resetting the timestamp on the build file that has changed, and I suspect it's provoking on of the less-stringent code paths in https://github.com/bazelbuild/bazel/blob/df036962c94eeac0ccb6d1359d6f9681195b45ef/src/main/java/com/google/devtools/build/lib/actions/cache/DigestUtils.java#L173-L187

Of note, that logic hasn't changed between our version of bazel and the current version of bazel, so I suspect this is still present in the current version of bazel.

Finally got this into a share-able state. To reproduce just check out https://github.com/trainman419/1525 and run the repro.sh script

Bump. I've had several people confirm this internally using the script that I provided; has anyone from bazel had time to look into this?

@kchodorow bump. This continues to be a problem for me, and I've done my best to provide a reliable test case for it. Can you please at least try the repro and confirm that you can also reproduce this?

Ping. Can someone take a look at this please?

Sorry for the delay! Thanks for the repro, it is reproducible for me. On first glance, it looks like some sort of bad interaction between new_local_repository and bind(). By changing your example to local_repository, removing the bind(), and copying the BUILD file into the repo I "fixed" the issue, but obviously something is very wrong here. I'll continue investigating.

Thanks! For the record we've also reproduced this on bazel 0.4.3

(side-note: this does not repro with my mac, but on my linux workstation it does)

Could that be because of the FileValue that does not change (timestamp does not change and size does not change, and since the file system does not support fast checksum, we do not use checksum)?

(side-note: this does not repro with my mac, but on my linux workstation it does)

We can repro this on Linux, too but so far not on macOS. Both are 0.4.3.

That is because the date command is non working on Mac OS. I am pretty sure
we can repro on Mac OS if we fix the repro script.

On Wed, Jan 18, 2017, 9:10 AM Andreas Bergmeier notifications@github.com
wrote:

(side-note: this does not repro with my mac, but on my linux workstation
it does)

We can repro this on Linux, too but so far not on macOS.

—
You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub
https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273409978,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADjHfxN4SuIB_88GSv0Lb7bGfy7BGhtEks5rTcjugaJpZM4JM7ou
.

I don't think I've ever elicited so many emojis :)

Damien seems to be correct. I can actually reproduce it without external repositories:

$ touch -m -d "$now" repro/repro_1525.c
$ bazel run //repro:repro_1525
...
doA() = 1337
$ sed -i 's/"doA() =/"doB() =/' repro/repro_1525.c
$ touch -m -d "$now" repro/repro_1525.c
$ bazel run //repro:repro_1525
...
doA() = 1337

So... that's not good.

That's normal. We do not hash all function for speed reason, we use a fast approx of metadata, if those metadata doesn't change, we cannot see the change. So the metadata include the file size and the timestamp and other fast information. There is no good answer to it :/

/cc @ulfjack @janakdr

One option to consider might be digesting BUILD files unconditionally.
There are not that many of them in an average build, and it seems like
Bazel users are auto-generating them in ways that are particularly good at
exposing this deficiency.

On Wed, Jan 18, 2017 at 12:59 PM, Damien Martin-Guillerez <
[email protected]> wrote:

That's normal. We do not hash all function for speed reason, we use a fast
approx of metadata, if those metadata doesn't change, we cannot see the
change. So the metadata include the file size and the timestamp and other
fast information. There is no good answer to it :/

/cc @ulfjack https://github.com/ulfjack @janakdr
https://github.com/janakdr

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273551187,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AJqlcK1XKXfS3nNiVNaRIvtLCuojXCBwks5rTlMWgaJpZM4JM7ou
.

Also if we activate watchfs it will reduce largely the file that we need hash

True, although Bazel will still have to hash all the relevant files on a
clean build.

On Wed, Jan 18, 2017 at 1:12 PM, Damien Martin-Guillerez <
[email protected]> wrote:

Also if we activate watchfs it will reduce largely the file that we need
hash

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273554512,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AJqlcOuDy3jqkUmEypHPjY41TZfH5QHHks5rTlX1gaJpZM4JM7ou
.

Yes :(

If --watchfs is enabled, can we skip the hashing?

I theory we could but that would only cover non notified files during the
lifetime of the server. We need a closer perf analysis.

On Thu, Jan 19, 2017, 9:00 AM Ulf Adams notifications@github.com wrote:

If --watchfs is enabled, can we skip the hashing?

—
You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub
https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273706157,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADjHf_r_ltMmuDCn2hlHq1IkngwMuTgjks5rTxg4gaJpZM4JM7ou
.

Does watchfs depend on inotify (on Linux)?
We have enough BUILD files currently, that inotify limits may be exhausted.

Yes it does and on FSEvents on OSX. If it reach the limit it will restat
everything

On Thu, Jan 19, 2017, 9:19 AM Andreas Bergmeier notifications@github.com
wrote:

Does watchfs depend on inotify (on Linux)?
We have enough BUILD files currently, that inotify limits may be
exhausted.

—
You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub
https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273709203,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADjHf_ohKr4i7bxYdhfNxxRJHZeQwrZkks5rTxyVgaJpZM4JM7ou
.

We actually managed to diagnose a strange bug yesterday for someone where
Bazel wasn't noticing updates to files in the tree (any updates) on OS X.
It seems that this fallback behavior is buggy on os x. User had several
copies of sublime open which uses FSEvents too. So some overall system
limit is likely being exhausted. But repeated calls to bazel with watchfs
enabled finished immediately as if no files have changed. (bazel version
0.3.1)

On Thu, Jan 19, 2017 at 12:20 AM, Damien Martin-Guillerez <
[email protected]> wrote:

Yes it does and on FSEvents on OSX. If it reach the limit it will restat
everything

On Thu, Jan 19, 2017, 9:19 AM Andreas Bergmeier notifications@github.com
wrote:

Does watchfs depend on inotify (on Linux)?
We have enough BUILD files currently, that inotify limits may be
exhausted.

—
You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub
<https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273709203
,
or mute the thread
ohKr4i7bxYdhfNxxRJHZeQwrZkks5rTxyVgaJpZM4JM7ou>

.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273709447,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AWCkRhF0bxx1fpSAxFeBvBveYDwjNnFDks5rTxzhgaJpZM4JM7ou
.

There was a bug in the code, that I think got fixed post 0.3.1. Can you try the latest release?

I don't really know what to do next with that bug so giving the hot potato to Ulf :p

https://github.com/bazelbuild/bazel/issues/1525#issuecomment-273304683

Thanks! For the record we've also reproduced this on bazel 0.4.3

I can reproduce with the repro.sh script, but not with @kchodorow's instructions.

Oh, I see. I did it wrong, I can reproduce.

Now I can't reproduce it anymore, neither way. I was trying to test a fix and I can't reproduce the problem. :-(

Ok, I can reproduce and I have a fix.

Is there a PR out for this fix? I keep bumping into this bug and it's causing a lot of grief!

Thanks!

Unfortunately, there's a problem, so I can't commit my change.

My change is to use ctime in addition to mtime to detect file changes. On Linux we have a high-precision ctime value which is not under control of the user, and which gets updated on every change to the file. Unfortunately, Bazel calls chmod on output files after action completion after it records the ctime, and the chmod call modifies the ctime. So the next build, Bazel thinks all output files are modified. We still get cache hits, because the content digest doesn't change, but it's still a problem.

Ideally, Bazel would not call chmod after action completion, but it'll take me some more time to figure out how to change that.

Not exactly a fix but what about touching the file first and if necessary doing a chmod after? It will probably break action if they do not expect files to be present.

The best plan I have is to move the place where we call chmod closer to when the file is generated and before we record the file metadata.

Any further updates here?

Not really. I talked to Lukacs, and he brought up some questions that I didn't know the answer to. I need to find some time to dig even deeper.

The basic approach for fixing the problem is to make FileContentsProxy use ctime in addition to mtime to represent file contents.

Unfortunately, after an action is complete, SkyframeActionExecutor calls chmod on all output files, which changes the ctime. The chmod calls can only happen after the action is complete, as the files may not exist earlier. However, the ActionMetadataHandler caches the FileContentsProxy from the first time getMetadata is called for an output file, and this sometimes happens before the action is complete. In particular, it happens when ActionMetadataHandler.injectDigest is called as well as when an action internally creates multiple Spawn instances.

The only solution I see to this conundrum is to move the chmod calls to the ActionMetadataHandler, such that we always call chmod before we create the FileContentsProxy, regardless of whether we need to get the metadata during an action or afterwards. Note that SkyframeActionExecutor already calls ActionMetadataHandler for every output file; if there was a previous call for that output file, it's already chmodded, otherwise we chmod it then.

Okay, I think I've got a fix now. I'm out next week, so the fix may only get merged the week after.

@ulfjack: Any update on this issue?

I have two pending commits in code review. I pinged the reviewer to take another look. Sorry for the delay - I was out for a week, and also had some urgent other stuff to attend to.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

GaofengCheng picture GaofengCheng  Â·  3Comments

ob picture ob  Â·  3Comments

ensonic picture ensonic  Â·  3Comments

kastiglione picture kastiglione  Â·  3Comments

f1recracker picture f1recracker  Â·  3Comments