I'm noticing that download requests on Bazel 0.12.0 (and 0.13.0) succeed with broken URLs, at least for java_import_external. These correctly fail in 0.11.0.
git clone https://github.com/google/bazel-common
bazel build third_party/java/truth
Switch the artifact line to artifact = "com.google.truth:truth:0.blahblah",
bazel build third_party/java/truth
Note that this succeeds, even seemingly after bazel clean --expunge. If you inspect @com_google_truth_truth//:com_google_truth_truth's jar (i.e. bazel-bazel-common/external/com_google_truth_truth/truth-0.blahblah.jar), the maven information in that jar will still say version 0.39 (the original truth version).
Is it possible that the new caching feature is not working correctly?
I think this may be related to the fact that I haven't changed the sha256. If I do that, I get the right error (either that the mirrors are down if the version is 0.blahblah or that the checksum is incorrect if it's set to a real version, like 0.40). Is there a lookup in some cache based on hash that ignores the URL entirely?
OS: Linux
bazel info release: release 0.13.0
Your analysis is correct. This is a consequence of using content-addressable storage (CAS) as the backend for the repository cache. All the CAS is aware of is the SHA1, and nothing else, for now.
Note that for the remote cache and action cache it is sufficient to build the key from the SHA1 alone, because the key is the SHA1 of all the inputs gathered for the action.
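To make the consequence concrete, here is a minimal Python sketch of a hash-keyed download cache. This is only an illustration of the idea, not Bazel's actual code; the cache path and function are simplified stand-ins:

    import hashlib
    import os
    import shutil
    import urllib.request

    CACHE = os.path.expanduser("~/.cache/repos/v1/content_addressable/sha1")

    def fetch(urls, expected_sha1, dest):
        # The cache key is the expected hash alone; on a hit the URLs are
        # never consulted, so a broken or renamed URL goes unnoticed.
        entry = os.path.join(CACHE, expected_sha1, "file")
        if os.path.exists(entry):
            shutil.copy(entry, dest)
            return
        for url in urls:  # cache miss: try the mirrors in order
            data = urllib.request.urlopen(url).read()
            if hashlib.sha1(data).hexdigest() == expected_sha1:
                os.makedirs(os.path.dirname(entry), exist_ok=True)
                with open(entry, "wb") as f:
                    f.write(data)
                shutil.copy(entry, dest)
                return
        raise IOError("no mirror served a file with sha1 " + expected_sha1)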
Given this rule:
maven_jar(
    name = "com_google_guava_guava",
    artifact = "com.google.guava:guava:18.0",
    sha1 = "cce0823396aa693798f8882e64213b1772032b09",
)
and given that the repository cache is enabled by default in the most recent Bazel versions, we get this entry:
$ sha1sum /home/davido/.cache/bazel/_bazel_davido/cache/repos/v1/content_addressable/sha1/ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca/file
ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca
the file content is just guava:
$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel guava-18.0.jar
If we run bazel clean --expunge, change the artifact to artifact = "com.google.guava:guavabljblja:18.0", and re-run the build:
$ bazel build @com_google_guava_guava//...
INFO: Analysed 2 targets (8 packages loaded).
INFO: Found 2 targets...
INFO: Elapsed time: 2.502s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
it works, and the only thing that changed is the file name:
$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel guavabljblja-18.0.jar
The reason for this behavior is the way the code is currently organized: the original artifact ID is not part of the CAS key and is entirely lost. All we can consult is the SHA1 and the file content.
The Gerrit Code Review build tool chain (Buck and Bazel) solved this problem by combining the artifact ID into the CAS key. We use our own maven_jar Skylark implementation, which is curl based: [1]. Our CAS key implementation doesn't suffer from the artifact mutation problem you are describing in this issue: [2]:
$ ls ~/.gerritcodereview/bazel-cache/downloaded-artifacts/ | grep guava
guava-24.1-jre.jar-96c528475465aeb22cce60605d230a7e67cebd7b
guava-24.1-jre-src.jar-310ad448ea9f117b6cc3ee9642285922e5b681fd
guava-retrying-2.0.0.jar-974bc0a04a11cc4806f7c20a34703bd23c34e7f4
guava-retrying-2.0.0-src.jar-0a8e9267e624d0b6ea5f5bc0a502b01ee84a8f6f
[1] https://github.com/GerritCodeReview/bazlets/blob/master/tools/maven_jar.bzl
[2] http://paste.openstack.org/show/720257
The situation is even worse if the artifact is changed to a valid value, say the version is bumped from "com.google.guava:guava:18.0" to "com.google.guava:guava:19.0", without changing the sha1. The user may think that the newer version of the artifact was downloaded and all is fine, but actually nothing happened: the old artifact is retrieved from the cache and the file is renamed to "guava-19.0.jar", but it has the content of Guava 18.0.
Enabling the repository cache by default in recent Bazel versions should probably be reconsidered until this behaviour is fixed.
I'm not seeing how you affect the CAS key - can you point me to some line numbers?
Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?
I guess that the build will fail for someone (likely on CI, where there's no cache?) so it's not awful, but it's definitely unexpected and seems to go against reproducibility/correctness of a build.
cc: @jart for the implications on java_import_external.
What is ~/.gerritcodereview/bazel-cache? The way I generally solve this problem with java_import_external is by not using local caches. mirror.bazel.build is the cache.
What is ~/.gerritcodereview/bazel-cache?
As I said, we use our own version of maven_jar, which uses a download_file.py script based on curl.
Here is the line where the combined CAS key <artifact>-<sha1> is constructed: https://github.com/GerritCodeReview/bazlets/blob/master/tools/download_file.py#L74.
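In Python terms, the scheme amounts to something like this (a simplified sketch, not the actual download_file.py code):

    import os

    def cache_entry(cache_dir, artifact_file, sha1):
        # The artifact file name is part of the cache key, so changing the
        # artifact coordinates without changing the sha1 can no longer
        # produce a stale hit; it simply misses the cache.
        return os.path.join(cache_dir, "%s-%s" % (artifact_file, sha1))

    # e.g. cache_entry("~/.gerritcodereview/bazel-cache/downloaded-artifacts",
    #                  "guava-24.1-jre.jar",
    #                  "96c528475465aeb22cce60605d230a7e67cebd7b")

This matches the entries in the directory listing above, such as guava-24.1-jre.jar-96c528475465aeb22cce60605d230a7e67cebd7b.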
The way I generally solve this problem with java_import_external is by not using local caches.
The repository cache was enabled by default in https://github.com/bazelbuild/bazel/commit/6e0933886d3c6b7f68075da4bdb08500ce2b6f86. I'm not aware of any way to disable it, unless you downgrade or patch Bazel.
Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?
No, there is not. See this design proposal for garbage collection of the repository cache.
@davido If I'm able to help rules_closure meet Gerrit's requirements (w/ hesitancy on C++11 https://github.com/bazelbuild/rules_closure/pull/251) would you consider migrating to java_import_external which uses Bazel's Downloader?
If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16
@jart
[...] would you consider migrating to java_import_external which uses Bazel's Downloader?
java_import_external is still missing some important features, like handling the sources classifier. The Gerrit toolchain depends on that. For one, we need to provide sources for JSNI support in GWT; for another, we generate the Eclipse IDE's .classpath file from our build definition, which references the sources artifacts for debugging purposes. I know we could duplicate all artifacts and retrieve sources and binaries in two steps, but that is awkward.
None of the current Bazel external dependency management options handle that use case, or they try to but have problems. Most notably, the native maven_jar rule doesn't store/retrieve the sources classifier artifact in the repository cache: #4798, which I'm trying to fix in https://bazel-review.googlesource.com/#/c/bazel/+/53950.
In our own maven_jar implementation, for this WORKSPACE file content:
maven_jar(
    name = "guava",
    artifact = "com.google.guava:guava:24.1-jre",
    sha1 = "96c528475465aeb22cce60605d230a7e67cebd7b",
)
we would download these files:
$ ls /home/davido/.cache/bazel/_bazel_davido/5c01f4f713b675540b8b424c5c647f63/execroot/gerrit/external/guava/jar/
BUILD guava-24.1-jre.jar guava-24.1-jre-src.jar
and would generate this:
java_import(
    name = 'jar',
    jars = ['guava-24.1-jre.jar'],
    srcjar = "guava-24.1-jre-src.jar",
)

java_import(
    name = 'neverlink',
    jars = ['guava-24.1-jre.jar'],
    neverlink = 1,
)

java_import(
    name = 'src',
    jars = ['guava-24.1-jre-src.jar'],
)
And, as I mentioned above, our own CAS implementation, located in ~/.gerritcodereview/bazel-cache, just works.
If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16
Thanks for sharing, I will have a look.
//CC @aehlig @jin.
I think there are two approaches to fix this repository cache problem by storing the artifact/URL in the CAS, so that it can be verified later, when the artifact is retrieved from the cache, and a mismatch reported: [...]
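The common idea behind both, recording the origin next to the content and checking it on retrieval, could look roughly like this (a hypothetical sketch with a made-up file layout, not a concrete patch):

    import os

    def store(entry_dir, data, url):
        # Save the content together with the URL it was fetched from.
        os.makedirs(entry_dir, exist_ok=True)
        with open(os.path.join(entry_dir, "file"), "wb") as f:
            f.write(data)
        with open(os.path.join(entry_dir, "url"), "w") as f:
            f.write(url)

    def retrieve(entry_dir, url):
        # On a cache hit, compare the requested URL against the recorded
        # one and warn about a mismatch instead of silently serving it.
        with open(os.path.join(entry_dir, "url")) as f:
            recorded = f.read()
        if recorded != url:
            print("WARNING: cached file was originally fetched from %s, "
                  "not %s; check your artifact definition." % (recorded, url))
        with open(os.path.join(entry_dir, "file"), "rb") as f:
            return f.read()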
@dkelmer I think this will affect workspace resolving, since the resolver is short-circuited to the incorrect JAR in the CAS? Or is this exactly the issue that the resolving pass is supposed to pick out?
//CC @aehlig @jin.
I think there are two approaches to fix this repository cache problem by storing the artifact/URL in the CAS, so that it can be verified later, when the artifact is retrieved from the cache, and a mismatch reported: [...]
From a repository cache point of view, this is all working as intended. One of the design points of the download cache is to avoid downloads, even if upstream moves (including implicit moves, if URLs are overwritten by local mirrors). Also, for the downloader there is no such thing as "the" URL, as it is called with a list of alternative URLs and tries to fetch a file with the given hash from whichever of those URLs is best reachable.
As the whole problem is pretty maven specific (for http_archive the desired semantics really is "give me a file with that hash, by whatever means you can produce it"), it should, in my opinion, be fixed (if at all) within maven_jar.
@aehlig In the issue that got closed yesterday I suggested a way of keeping the file hash and redownloading when the maven jar dep gets bumped. Realistically, maven jars only change when the artifact version changes. Proxies rewriting URLs shouldn't be a part of it, right?
Proxies rewriting URLs shouldn't be a part of it, right?
I'm not worried about proxies changing URLs. But my argument was that the same archive of an open-source project might be mirrored at different locations. For example, the following URLs (among many many others) all refer to the same file with SHA256 sum '0ba5403111f8e5ea22be7d51ab74c8ccb576dc30ddfbf18a46cb51f9139790ab'. Moreover, depending on the tradition, different projects would consider different ones of those "the canonical URL" for that file. On top of that, there might as well be an internal mirror in a big organisation. Still, it seems reasonable to not download the file again from each mirror if we have it already.
That's why I argued that a solution would probably have to be maven specific.
Note, however, that those kinds of confusion will go away once we make progress with the WORKSPACE.resolved proposal (see the bazel sync command), which will make the resolved hashes explicit, similar to the make makesum step known from ports trees.
We should go a route similar to what is suggested in #5250.
Our export procedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to aff8319e25f599f57a2098939a7e4ccc000d8ead (basically censoring away the commit message), so I'm mentioning by hand that this change is related to this issue as well as #5045.
Our export procedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to [aff8319]
That seems like a bug.
Huh, I just updated the github.com/square/bazel_maven_repository README 5 minutes ago about this issue, and wrote a blog post.
http://www.geekinasuit.com/2019/02/bazely-thinking-and-tale-of-content.html
I stand by the conclusion that Bazel privileges the content (the hash) over the address, and I've made it a caution in my Bazel Maven integration. I was thinking of schemes to add more validation, so that you can't silently make the mistake of bumping the version without bumping the hash. But part of it is that Bazel workspace maintainers need to think of the sha as the key, not the list of URLs, or they'll run into this over and over.
The current state is that, whenever a repository fetch fails, we get a message of the form
INFO: Repository 'ext' used the following cache hits instead of downloading the corresponding file
* Hash 'd458...' for http://example.com/...
If the definition of 'ext' was updated, verify that the hashes were also updated.
which points the user to the possible error in the definition. It also shows the location (including call stack) where the repository was defined.
@dslomov, when tagging this issue as "bad error messaging" you stated
We should go a route similar to what is suggested in #5250.
That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash. The latter is meaningful for maven, where a canonical artifactId exists (cc @dkelmer), but is less clear for the generic interface ctx.download. Issue #5250, however, does not propose any changes to the error messaging. So I wonder what was meant by that comment.
That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash.
Why not do both? Bazel could make a file named sha256(downloadUrl + sha256(fileContent)), which is a symlink to the current sha256(fileContent) file name. That would mean Bazel would have to download the file again when the URL changes, but if the content stays the same it doesn't need to store it again. It would also work for the non-Maven case.
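A rough sketch of that symlink scheme (hypothetical code, just to spell out the file naming):

    import hashlib
    import os

    def register(cache_dir, url, data):
        # One file per unique content, named after the content hash.
        content_hash = hashlib.sha256(data).hexdigest()
        content_path = os.path.join(cache_dir, content_hash)
        if not os.path.exists(content_path):
            with open(content_path, "wb") as f:
                f.write(data)
        # One symlink per (url, content) pair: a changed URL misses the
        # cache and forces a re-download, but identical content is only
        # stored once.
        link = os.path.join(
            cache_dir,
            hashlib.sha256((url + content_hash).encode()).hexdigest())
        if not os.path.exists(link):
            os.symlink(content_path, link)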
Currently discussed: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md
Proposed implementation (presuming the proposal gets accepted)
I'm crazy excited about this. I've managed to screw up dependencies more than once because of this issue, forgetting to update the hash. I've started using my maven deps without caching in the short term, just to make sure.