I'm noticing that download requests on Bazel 0.12.0 (and 0.13.0) succeed with broken URLs, at least for java_import_external. These correctly fail in 0.11.0.
git clone https://github.com/google/bazel-common
bazel build third_party/java/truth
Switch the artifact line to artifact = "com.google.truth:truth:0.blahblah",
bazel build third_party/java/truth
Note that this succeeds, even seemingly after bazel clean --expunge. If you inspect @com_google_truth_truth//:com_google_truth_truth's jar (i.e. bazel-bazel-common/external/com_google_truth_truth/truth-0.blahblah.jar), the maven information in that jar will still say version 0.39 (the original truth version).
Is it possible that the new caching feature is not working correctly?
I think this may be related to the fact that I haven't changed the sha256. If I do that, I get the right error (either that the mirrors are down if the version is 0.blahblah or that the checksum is incorrect if it's set to a real version, like 0.40). Is there a lookup in some cache based on hash that ignores the URL entirely?
OS: Linux
bazel info release: release 0.13.0
Your analysis is correct. This is a consequence of using content-addressable storage (CAS) as the backend for the repository cache. All the CAS is aware of is the SHA1, and nothing else, for now.
Note that for the remote cache and action cache it is sufficient to build the key from the SHA1 alone, because the key is the SHA1 of all the inputs gathered for the action.
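To make the consequence concrete, here is a minimal Python sketch of a hash-keyed download cache. This is only an illustration of the idea, not Bazel's actual code; the cache path and function are simplified stand-ins:

    import hashlib
    import os
    import shutil
    import urllib.request

    CACHE = os.path.expanduser("~/.cache/repos/v1/content_addressable/sha1")

    def fetch(urls, expected_sha1, dest):
        # The cache key is the expected hash alone; on a hit the URLs are
        # never consulted, so a broken or renamed URL goes unnoticed.
        entry = os.path.join(CACHE, expected_sha1, "file")
        if os.path.exists(entry):
            shutil.copy(entry, dest)
            return
        for url in urls:  # cache miss: try the mirrors in order
            data = urllib.request.urlopen(url).read()
            if hashlib.sha1(data).hexdigest() == expected_sha1:
                os.makedirs(os.path.dirname(entry), exist_ok=True)
                with open(entry, "wb") as f:
                    f.write(data)
                shutil.copy(entry, dest)
                return
        raise IOError("no mirror served a file with sha1 " + expected_sha1)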
Given this rule:
maven_jar(
    name = "com_google_guava_guava",
    artifact = "com.google.guava:guava:18.0",
    sha1 = "cce0823396aa693798f8882e64213b1772032b09",
)
and given that the repository cache is enabled by default in the most recent Bazel versions, we get this entry:
$ sha1sum /home/davido/.cache/bazel/_bazel_davido/cache/repos/v1/content_addressable/sha1/ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca/file
ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca
the file content is just guava:
$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel guava-18.0.jar
If we run bazel clean --expunge, change the artifact to artifact = "com.google.guava:guavabljblja:18.0", and re-run the build:
$ bazel build @com_google_guava_guava//...
INFO: Analysed 2 targets (8 packages loaded).
INFO: Found 2 targets...
INFO: Elapsed time: 2.502s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
it works, and the only thing that changed is the file name:
$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel guavabljblja-18.0.jar
The reason for this behavior is the way the code is currently organized: the original artifact ID is not part of the CAS key and is entirely lost. All we can consult is the SHA1 and the file content.
The Gerrit Code Review build tool chain (Buck and Bazel) solved this problem by combining the artifact ID into the CAS key. We use our own maven_jar Skylark implementation, which is curl based: [1]. Our CAS key implementation doesn't suffer from the artifact mutation problem you are describing in this issue: [2]:
$ ls ~/.gerritcodereview/bazel-cache/downloaded-artifacts/ | grep guava
guava-24.1-jre.jar-96c528475465aeb22cce60605d230a7e67cebd7b
guava-24.1-jre-src.jar-310ad448ea9f117b6cc3ee9642285922e5b681fd
guava-retrying-2.0.0.jar-974bc0a04a11cc4806f7c20a34703bd23c34e7f4
guava-retrying-2.0.0-src.jar-0a8e9267e624d0b6ea5f5bc0a502b01ee84a8f6f
[1] https://github.com/GerritCodeReview/bazlets/blob/master/tools/maven_jar.bzl
[2] http://paste.openstack.org/show/720257
The situation is even worse if the artifact is changed to a valid value, say the version is bumped from "com.google.guava:guava:18.0" to "com.google.guava:guava:19.0", without changing the sha1. The user may think that the newer version of the artifact was downloaded and all is fine, but actually nothing happened: the old artifact is retrieved from the cache and the file is renamed to "guava-19.0.jar", but it has the content of Guava 18.0.
Enabling the repository cache by default in recent Bazel versions should probably be reconsidered until this behaviour is fixed.
I'm not seeing how you affect the CAS key - can you point me to some line numbers?
Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?
I guess that the build will fail for someone (likely on CI, where there's no cache?) so it's not awful, but it's definitely unexpected and seems to go against reproducibility/correctness of a build.
cc: @jart for the implications on java_import_external.
What is ~/.gerritcodereview/bazel-cache? The way I generally solve this problem with java_import_external is by not using local caches. mirror.bazel.build is the cache.
What is ~/.gerritcodereview/bazel-cache?
As I said, we use our own version of maven_jar, which uses a download_file.py script based on curl.
Here is the line where the combined CAS key <artifact>-<sha1> is constructed: https://github.com/GerritCodeReview/bazlets/blob/master/tools/download_file.py#L74.
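In Python terms, the scheme amounts to something like this (a simplified sketch, not the actual download_file.py code):

    import os

    def cache_entry(cache_dir, artifact_file, sha1):
        # The artifact file name is part of the cache key, so changing the
        # artifact coordinates without changing the sha1 can no longer
        # produce a stale hit; it simply misses the cache.
        return os.path.join(cache_dir, "%s-%s" % (artifact_file, sha1))

    # e.g. cache_entry("~/.gerritcodereview/bazel-cache/downloaded-artifacts",
    #                  "guava-24.1-jre.jar",
    #                  "96c528475465aeb22cce60605d230a7e67cebd7b")

This matches the entries in the directory listing above, such as guava-24.1-jre.jar-96c528475465aeb22cce60605d230a7e67cebd7b.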
The way I generally solve this problem with java_import_external is by not using local caches.
The repository cache was enabled by default in https://github.com/bazelbuild/bazel/commit/6e0933886d3c6b7f68075da4bdb08500ce2b6f86. I'm not aware of any way to disable it, unless you downgrade or patch Bazel.
Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?
No, there is not. See this design proposal for garbage collection of the repository cache.
@davido If I'm able to help rules_closure meet Gerrit's requirements (w/ hesitancy on C++11 https://github.com/bazelbuild/rules_closure/pull/251) would you consider migrating to java_import_external which uses Bazel's Downloader?
If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16
@jart
[...] would you consider migrating to java_import_external which uses Bazel's Downloader?
java_import_external is still missing some important features, like handling the sources classifier. The Gerrit toolchain depends on that. For one, we need to provide sources for JSNI support in GWT; for another, we generate the Eclipse IDE's .classpath file from our build definition, which references the sources artifacts for debugging purposes. I know we could duplicate all artifacts and retrieve sources and binaries in two steps, but that is awkward.
None of the current Bazel external dependency management options handle that use case, or they try to but have problems. Most notably, the native maven_jar rule doesn't store/retrieve the sources classifier artifact in the repository cache: #4798, which I'm trying to fix in https://bazel-review.googlesource.com/#/c/bazel/+/53950.
In our own maven_jar implementation, for this WORKSPACE file content:
maven_jar(
    name = "guava",
    artifact = "com.google.guava:guava:24.1-jre",
    sha1 = "96c528475465aeb22cce60605d230a7e67cebd7b",
)
we would download these files:
$ ls /home/davido/.cache/bazel/_bazel_davido/5c01f4f713b675540b8b424c5c647f63/execroot/gerrit/external/guava/jar/
BUILD guava-24.1-jre.jar guava-24.1-jre-src.jar
and would generate this:
java_import(
    name = 'jar',
    jars = ['guava-24.1-jre.jar'],
    srcjar = "guava-24.1-jre-src.jar",
)

java_import(
    name = 'neverlink',
    jars = ['guava-24.1-jre.jar'],
    neverlink = 1,
)

java_import(
    name = 'src',
    jars = ['guava-24.1-jre-src.jar'],
)
And, as I mentioned above, our own CAS implementation, located in ~/.gerritcodereview/bazel-cache, just works.
If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16
Thanks for sharing, I will have a look.
//CC @aehlig @jin.
I think there are two approaches to fix this repository cache problem by storing the artifact/URL in the CAS, so that it can be verified later, when the artifact is retrieved from the cache, and a mismatch reported: [...]
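The common idea behind both, recording the origin next to the content and checking it on retrieval, could look roughly like this (a hypothetical sketch with a made-up file layout, not a concrete patch):

    import os

    def store(entry_dir, data, url):
        # Save the content together with the URL it was fetched from.
        os.makedirs(entry_dir, exist_ok=True)
        with open(os.path.join(entry_dir, "file"), "wb") as f:
            f.write(data)
        with open(os.path.join(entry_dir, "url"), "w") as f:
            f.write(url)

    def retrieve(entry_dir, url):
        # On a cache hit, compare the requested URL against the recorded
        # one and warn about a mismatch instead of silently serving it.
        with open(os.path.join(entry_dir, "url")) as f:
            recorded = f.read()
        if recorded != url:
            print("WARNING: cached file was originally fetched from %s, "
                  "not %s; check your artifact definition." % (recorded, url))
        with open(os.path.join(entry_dir, "file"), "rb") as f:
            return f.read()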
@dkelmer I think this will affect workspace resolving, since the resolver is short-circuited to the incorrect JAR in the CAS? Or is this exactly the issue that the resolving pass is supposed to pick out?
//CC @aehlig @jin.
I think there are two approaches to fix this repository cache problem by storing the artifact/URL in the CAS, so that it can be verified later, when the artifact is retrieved from the cache, and a mismatch reported: [...]
From a repository cache point of view, this is all working as intended. One of the design points of the download cache is to avoid downloads, even if upstream moves (including implicit moves, if URLs are overwritten by local mirrors). Also, for the downloader there is no such thing as "the" URL, as it is called with a list of alternative URLs and tries to fetch a file with the given hash from whichever of those URLs is best reachable.
As the whole problem is pretty maven specific (for http_archive the desired semantics really is "give me a file with that hash, by whatever means you can produce it"), it should, in my opinion, be fixed (if at all) within maven_jar.
@aehlig In the issue that got closed yesterday I suggested a way of keeping the file hash and redownloading when the maven jar dep gets bumped. Realistically, maven jars only change when the artifact version changes. Proxies rewriting URLs shouldn't be a part of it, right?
Proxies rewriting URLs shouldn't be a part of it, right?
I'm not worried about proxies changing URLs. But my argument was that the same archive of an open-source project might be mirrored at different locations. For example, the following URLs (among many many others) all refer to the same file with SHA256 sum '0ba5403111f8e5ea22be7d51ab74c8ccb576dc30ddfbf18a46cb51f9139790ab'. Moreover, depending on the tradition, different projects would consider different ones of those "the canonical URL" for that file. On top of that, there might as well be an internal mirror in a big organisation. Still, it seems reasonable to not download the file again from each mirror if we have it already.
That's why I argued that a solution would probably have to be maven specific.
Note, however, that those kinds of confusion will go away once we make progress with the WORKSPACE.resolved proposal (see the bazel sync command), which will make the resolved hashes explicit, similar to the make makesum step known from ports trees.
We should go a route similar to what is suggested in #5250.
Our export procedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to aff8319e25f599f57a2098939a7e4ccc000d8ead (basically censoring away the commit message), so I'm mentioning by hand that this change is related to this issue as well as #5045.
Our export procedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to [aff8319]
That seems like a bug.
Huh, I just updated the github.com/square/bazel_maven_repository README 5 minutes ago about this issue, and wrote a blog post.
http://www.geekinasuit.com/2019/02/bazely-thinking-and-tale-of-content.html
I stand by the conclusion that Bazel privileges the content (the hash) over the address, and I've made it a caution in my Bazel Maven integration. I was thinking of schemes to add more validation, so that you can't silently make the mistake of bumping the version without bumping the hash. But part of it is that Bazel workspace maintainers need to think of the sha as the key, not the list of URLs, or they'll run into this over and over.
The current state is that, whenever a repository fetch fails, we get a message of the form
INFO: Repository 'ext' used the following cache hits instead of downloading the corresponding file
* Hash 'd458...' for http://example.com/...
If the definition of 'ext' was updated, verify that the hashes were also updated.
which points the user to the possible error in the definition. It also shows the location (including call stack) where the repository was defined.
@dslomov, when tagging this issue as "bad error messaging" you stated
We should go a route similar to what is suggested in #5250.
That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash. The latter is meaningful for maven, where a canonical artifactId exists (cc @dkelmer), but is less clear for the generic interface ctx.download. Issue #5250, however, does not propose any changes to the error messaging. So I wonder what was meant by that comment.
That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash.
Why not do both? Bazel could make a file named sha256(downloadUrl + sha256(fileContent)), which is a symlink to the current sha256(fileContent) file name. That would mean Bazel would have to download the file again when the URL changes, but if the content stays the same it doesn't need to store it again. It would also work for the non-Maven case.
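A rough sketch of that symlink scheme (hypothetical code, just to spell out the file naming):

    import hashlib
    import os

    def register(cache_dir, url, data):
        # One file per unique content, named after the content hash.
        content_hash = hashlib.sha256(data).hexdigest()
        content_path = os.path.join(cache_dir, content_hash)
        if not os.path.exists(content_path):
            with open(content_path, "wb") as f:
                f.write(data)
        # One symlink per (url, content) pair: a changed URL misses the
        # cache and forces a re-download, but identical content is only
        # stored once.
        link = os.path.join(
            cache_dir,
            hashlib.sha256((url + content_hash).encode()).hexdigest())
        if not os.path.exists(link):
            os.symlink(content_path, link)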
Currently discussed: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md
Proposed implementation (presuming the proposal gets accepted)
I'm crazy excited about this. I've managed to screw up dependencies more than once because of this issue, forgetting to update the hash. I've started using my maven deps without caching in the short term, just to make sure.