Bazel: Remote cache hashes sometimes depend on user's absolute path to workspace

Created on 12 May 2017 · 18 Comments · Source: bazelbuild/bazel

Description of the problem / feature request / question:

In my organization's pretty complex build system, it seems that if two users have the same workspace, and we are using remote caching, some of the build artifacts will only download from the cache if the two users have the exact same absolute path to their workspace.

For example, suppose Alice checks out her org's workspace into ~/code on her machine, and Bob also checks it out into ~/code on his machine. But since they have different user names, ~ is different, so Alice has it in /home/alice/code and Bob has it in /home/bob/code. In that situation, some of the artifacts that Alice's build uploaded are not downloaded from the cache by Bob's build.

I think this is because, when runfiles directories are created, the manifest files for those directories contain absolute paths to the destinations of the symlinks, even if those destinations are under bazel-out. So e.g. on my Mac, the manifest might contain a path like /private/var/tmp/_bazel_mikemorearty/9c7cd8557030180b585ae77b9f68b5a2/....

I think that those manifest files are sometimes included among the list of input files (not sure if that's the right term) for some build actions. So those absolute paths end up changing the hash that Bazel looks for in the remote cache.

If that theory is correct, then one possible fix might be to write the manifest files in a different format, such that the second item on each line might include some variable names such as $BAZEL_OUT or something.
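To make the idea concrete, here is a minimal sketch of that proposal, assuming each manifest line has the form "link target" separated by a single space. The function name, the $BAZEL_OUT placeholder, and the sample paths are all hypothetical; this is not how Bazel writes manifests.

```python
def relativize_manifest(manifest_text, output_base, placeholder="$BAZEL_OUT"):
    """Rewrite the second item of each 'link target' line to use a placeholder.

    The placeholder and the manifest layout are assumptions made for this
    sketch only; Bazel does not write its manifests this way.
    """
    prefix = output_base.rstrip("/") + "/"
    rewritten = []
    for line in manifest_text.splitlines():
        link, sep, target = line.partition(" ")
        # Only touch targets that really live under the given output base.
        if sep and target.startswith(prefix):
            target = placeholder + "/" + target[len(prefix):]
        rewritten.append(link + sep + target)
    return "".join(entry + "\n" for entry in rewritten)


# Made-up manifests for the same target, built by two different users.
alice = "ws/tool1 /home/alice/out/bin/tool1\n"
bob = "ws/tool1 /home/bob/out/bin/tool1\n"

alice_rel = relativize_manifest(alice, "/home/alice/out")
bob_rel = relativize_manifest(bob, "/home/bob/out")

# After the rewrite, both users end up with byte-identical manifest text.
print(alice_rel == bob_rel)
```

With the user-specific prefix replaced, the two manifests would hash to the same digest, which is the property the proposal is after.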

I'm not sure if this is limited to rules that create "middleman" actions (I don't really understand those), or to actions that create persistent workers (my code does that).

If possible, provide a minimal example to reproduce the problem:

I haven't had time to reduce this to a simple test case, but I can provide some log files and some snippets from our build files that may help.

I took our code and put copies of it into two different directories on my machine, and then built a target, with appropriate command-line arguments to use a remote cache. I also hacked the Bazel sources, adding this debugging line (along with some others, but this is the important one):

diff --git a/src/main/java/com/google/devtools/build/lib/skyframe/ActionExecutionFunction.java b/src/main/java/com/google/devtools/build/lib/skyframe/ActionExecutionFunction.java
index 538ef9000..80f70031a 100644
--- a/src/main/java/com/google/devtools/build/lib/skyframe/ActionExecutionFunction.java
+++ b/src/main/java/com/google/devtools/build/lib/skyframe/ActionExecutionFunction.java
@@ -622,6 +622,7 @@ public class ActionExecutionFunction implements SkyFunction, CompletionReceiver
             inputArtifactData.put(input, treeValue.getSelfData());
           } else {
             Preconditions.checkState(value instanceof FileArtifactValue, depsEntry);
+            System.err.println(">>>" + input + " " + value);
             inputArtifactData.put(input, (FileArtifactValue) value);
           }
         }

First I did a clean build of one directory, and then of the other. The files log1.txt and log2.txt contain the slightly edited output for the two builds.

My build includes a tool called tsc. If you diff those two files, you will see that tsc's runfiles have different hashcodes, such as this line (and there are other similar ones in the diff) -- HASH just represents where I did a search-and-replace in these log files so that I could diff them, but notice that the digests at the ends of the lines are different:

418c419
< >>>Artifact:[[/private/var/tmp/_bazel_mikemorearty/HASH/execroot/asana2_or_3]bazel-out/local-fastbuild/internal]_middlemen/tools_Sweb_Stsc-runfiles FileArtifactValue{digest=[16, -75, 124, 40, -70, 43, 115, 55, 51, 0, 61, -18, 13, -60, 92, -21], mtime=-1, size=1}
---
> >>>Artifact:[[/private/var/tmp/_bazel_mikemorearty/HASH/execroot/asana2_or_3]bazel-out/local-fastbuild/internal]_middlemen/tools_Sweb_Stsc-runfiles FileArtifactValue{digest=[62, 60, -82, 53, -61, -17, 117, -13, 93, -122, -88, 50, -43, -51, -40, 19], mtime=-1, size=1}

The Adding to hasher lines are from some other debugging code in RemoteSpawnAction.java. You can see that bazel-out/local-fastbuild/internal/_middlemen/tools_Sweb_Stsc-runfiles has a different hash in the two builds.
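The search-and-replace step can also be scripted. The following is a rough sketch, assuming the output base always looks like _bazel_USER followed by a 32-character hex directory, as in the paths in this report. The helper names and the sample log lines are made up for illustration.

```python
import difflib
import re

# Matches e.g. "_bazel_mikemorearty/9c7cd8557030180b585ae77b9f68b5a2".
OUTPUT_BASE = re.compile(r"(_bazel_[^/\s]+)/[0-9a-f]{32}")


def normalize(line):
    """Replace the per-workspace hash directory with the literal HASH."""
    return OUTPUT_BASE.sub(r"\1/HASH", line)


def diff_logs(lines1, lines2):
    """Diff two logs after normalizing, with no context lines."""
    a = [normalize(line) for line in lines1]
    b = [normalize(line) for line in lines2]
    return list(difflib.unified_diff(a, b, "log1.txt", "log2.txt",
                                     n=0, lineterm=""))


# Made-up log lines: only the first artifact has a different digest.
base1 = "/private/var/tmp/_bazel_user/" + "a" * 32
base2 = "/private/var/tmp/_bazel_user/" + "b" * 32
log1 = [">>>[" + base1 + "/execroot/ws]bazel-out/x digest=1111",
        ">>>[" + base1 + "/execroot/ws]bazel-out/y digest=3333"]
log2 = [">>>[" + base2 + "/execroot/ws]bazel-out/x digest=2222",
        ">>>[" + base2 + "/execroot/ws]bazel-out/y digest=3333"]

diff = diff_logs(log1, log2)
for entry in diff:
    print(entry)
```

Only lines whose digests really differ survive the diff; lines that differed solely in the output base path drop out.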

Environment info

Operating System:

OSX 10.12.4. Should reproduce on any OS.

Bazel version (output of bazel info release):

0.4.5

under investigation

mmorearty

All 18 comments

Is there any documentation on what is included/excluded in a cache key?

I'm selling my company on bazel specifically for its remote caching features. I don't think we'll have these same issues because the code location is always the same on dev machines, but I'm wondering what other factors contribute to cache keys, things like environment variables and build arguments. In some builds we add debug flags in CI for debugging, e.g. --verbose_explanations --explain explain.out --profile build.prof, and omit them on dev machines. Do debug flags also contribute to these cache keys?

softprops on 12 May 2017

👍1

I would expect everything to contribute to the cache key that (potentially) changes the resulting binary. After all the point is to not repeat work, but to still be able to get exactly the same output for the same input.

MarkusTeufelberger on 12 May 2017

I would expect everything to contribute to the cache key that (potentially) changes the resulting binary.

I wouldn't expect build profiling output to contribute to the binary, so that works for me. That's just me guessing, though. I'd rather not guess and have better documentation to rely on :)

softprops on 12 May 2017

Can you check with 0.5.0 (possibly with the release candidate?). I wonder if 40bf169dfad2a3285d255c9100450b29be63202c has fixed this.

The cache key is computed from the command line, the input files, the output files, and the environment variables for each action. We should always be using relative paths, not absolute paths for the cache key.
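As a rough illustration of that description (not Bazel's actual implementation), the sketch below hashes the four ingredients into one key and shows why relative paths matter. The function names, the SHA-1 choice, the field encoding, and all sample paths and digests are assumptions made for this example.

```python
import hashlib
import posixpath


def action_key(argv, input_digests, output_paths, env):
    """Hash command line, inputs, outputs and environment into one hex key."""
    h = hashlib.sha1()

    def add_section(label, fields):
        # Length-prefix every field so that field boundaries stay unambiguous.
        for field in [label, str(len(fields))] + list(fields):
            data = field.encode("utf-8")
            h.update(str(len(data)).encode("ascii") + b":" + data)

    add_section("argv", argv)
    add_section("inputs",
                [x for item in sorted(input_digests.items()) for x in item])
    add_section("outputs", sorted(output_paths))
    add_section("env", [x for item in sorted(env.items()) for x in item])
    return h.hexdigest()


def relative_to(root, digests_by_abs_path):
    """Re-key a {absolute path: digest} map by path relative to root."""
    return {posixpath.relpath(path, root): digest
            for path, digest in digests_by_abs_path.items()}


argv = ["javac", "Source.java"]
outputs = ["bazel-out/Source.class"]
env = {"PATH": "/usr/bin"}
alice_inputs = {"/home/alice/code/Source.java": "d1"}
bob_inputs = {"/home/bob/code/Source.java": "d1"}

# Keys built from absolute paths differ between the two users ...
abs_alice = action_key(argv, alice_inputs, outputs, env)
abs_bob = action_key(argv, bob_inputs, outputs, env)
# ... while keys built from workspace-relative paths agree.
rel_alice = action_key(argv, relative_to("/home/alice/code", alice_inputs), outputs, env)
rel_bob = action_key(argv, relative_to("/home/bob/code", bob_inputs), outputs, env)
print(abs_alice == abs_bob, rel_alice == rel_bob)
```

Any absolute path that leaks into one of the four ingredients, directly or through the contents of an input file, would break the equality in the second comparison.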

ulfjack on 12 May 2017

Unfortunately, if workers are enabled for a specific action type, we never use the remote cache right now. I have a series of pending changes to fix that, but they haven't landed yet.

ulfjack on 12 May 2017

@ulfjack, this appears to be fixed in release-0.5.0! (I tested it at commit 63600d7e0.)

mmorearty on 13 May 2017

The cache key is computed from the command line, the input files, the output files, and the environment variables for each action.

Does that mean that a bazel build with command-line flags like --verbose_explanations --explain explain.out --profile build.prof would yield a different cache key than a bazel build without them?

softprops on 13 May 2017

No. Only the command line of the specific action is part of the cache key for that action. Think "javac Source.java" will not be equal to "javac -g:none Source.java". If you run bazel build --copt=-O2, you still get cache hits for all the Java actions.
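A small sketch of that point, under the assumption that a build-level flag only reaches the command lines of the actions it applies to. The plan_actions helper and the gcc/javac command lines are invented for illustration and the key covers only the command line, not inputs or environment.

```python
import hashlib


def command_key(argv):
    """Hash one action's command line, length-prefixing each argument."""
    h = hashlib.sha1()
    for arg in argv:
        data = arg.encode("utf-8")
        h.update(str(len(data)).encode("ascii") + b":" + data)
    return h.hexdigest()


def plan_actions(copts):
    """Hypothetical planner: copts end up only on the C compile action."""
    return {
        "java": ["javac", "Source.java"],
        "cc": ["gcc"] + list(copts) + ["-c", "source.c"],
    }


plain = {name: command_key(argv) for name, argv in plan_actions([]).items()}
tuned = {name: command_key(argv) for name, argv in plan_actions(["-O2"]).items()}

# The Java action keeps its key; only the C action's key changes.
print(plain["java"] == tuned["java"], plain["cc"] == tuned["cc"])
```

Flags such as --explain or --profile, which never appear on any action's command line, would by the same reasoning leave every key unchanged.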

ulfjack on 13 May 2017

Excellent!

softprops on 13 May 2017

Ulf,
Is there a hash key doc? If so any chance for a link?
I think it would help future visitors of this issue

ittaiz on 16 May 2017

👍3

I'm updating the remote/README.md right now, and will move it to the public docs when I'm done.

ulfjack on 22 May 2017

I'll attach the relevant changes to #904.

ulfjack on 22 May 2017

Thanks @ulfjack looks great!

softprops on 26 May 2017

It seems this is still happening... I'm not sure what I did wrong when I said it was not happening in 0.5.0.

I narrowed it down a bit. When a middleman is needed, there are a couple of middleman actions that are created:

runfiles_artifacts, whose inputs are all the files that are part of the runfiles, and
runfiles, which has only two inputs: the above-mentioned runfiles_artifacts action, and the outputManifest file (.../foo.runfiles/MANIFEST).

That outputManifest is the problem. As discussed above, it's a physical file on disk, and its digest is a hash of its contents. But its contents include absolute paths that include both the username and a hash of the path to the project, e.g.
/private/var/tmp/_bazel_mikemorearty/9c7cd8557030180b585ae77b9f68b5a2/...
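To illustrate why that matters, here is a minimal sketch that hashes two manifests differing only in the output base. The one-line manifest layout, the user names, and the 32-character directory names are made up, and SHA-1 is used only for illustration.

```python
import hashlib


def manifest_for(output_base):
    """Build a one-entry manifest; the entry layout is simplified and made up."""
    target = output_base + "/execroot/ws/bazel-out/local-fastbuild/bin/tool1"
    return "ws/tool1 " + target + "\n"


def sha1_hex(text):
    """Digest of the full file contents, absolute paths included."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Same project, but a different user name and workspace hash in the path.
ci = sha1_hex(manifest_for("/private/var/tmp/_bazel_ci/" + "a" * 32))
dev = sha1_hex(manifest_for("/private/var/tmp/_bazel_dev/" + "b" * 32))

print(ci == dev)
```

The logical content of the two manifests is the same, yet the digests disagree, so anything keyed on the manifest digest would disagree as well.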

So e.g. our CI machine builds with one username, and my local builds are with a different username, so the remote cache entries are not shared between them for anything that depends on runfiles that match the above scenario.

You can find the code for this in RunfilesSupport.java:

  private Artifact createRunfilesMiddleman(ActionConstructionContext context,
      Artifact artifactsMiddleman, Artifact outputManifest) {
    return context.getAnalysisEnvironment().getMiddlemanFactory().createRunfilesMiddleman(
        context.getActionOwner(), owningExecutable,
        //////////// NEXT LINE: outputManifest is one of the two inputs to this action
        ImmutableList.of(artifactsMiddleman, outputManifest),
        context.getConfiguration().getMiddlemanDirectory(context.getRule().getRepository()),
        "runfiles");
  }

mmorearty on 16 Jun 2017

I can't reproduce the problem. It's true that the output manifests contain absolute paths, but they should never be in any cache keys.

ulfjack on 16 Jun 2017

Okay I was finally able to make a small repro case. You need three files:

WORKSPACE: empty

BUILD:

py_binary(
    name = "tool1",
    srcs = ["tool1.py"],
)

tool1.py:

print("this is tool1")

Make two copies of this little project, e.g. one in directory test1 and another in test2.
In each dir, build with:

bazel build --host_jvm_args=-Dbazel.DigestFunction=SHA1 [REMOTE CACHE ARGS] //:tool1

If you look at the traffic to your remote cache, you will see that some of the cache files for the two projects are different.

To get further confirmation that this is because of the contents of the runfiles manifest files: In ActionCacheChecker.java, add this (the easiest way is to save this to a file and then git apply mypatch):

diff --git a/src/main/java/com/google/devtools/build/lib/actions/ActionCacheChecker.java b/src/main/java/com/google/devtools/build/lib/actions/ActionCacheChecker.java
index a7f0b82..99b3fea 100644
--- a/src/main/java/com/google/devtools/build/lib/actions/ActionCacheChecker.java
+++ b/src/main/java/com/google/devtools/build/lib/actions/ActionCacheChecker.java
@@ -18,6 +18,7 @@ import com.google.common.base.Predicate;
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.Iterables;
+import com.google.common.hash.HashCode;
 import com.google.devtools.build.lib.actions.ActionAnalysisMetadata.MiddlemanType;
 import com.google.devtools.build.lib.actions.cache.ActionCache;
 import com.google.devtools.build.lib.actions.cache.ActionCache.Entry;
@@ -434,6 +435,13 @@ public class ActionCacheChecker {
       }
     }

+    int i = 1;
+    String actionDescription  = action.getMnemonic() + ": " + action.prettyPrint();
+    for (Artifact input : action.getInputs()) {
+      System.err.println(">>>input #" + (i++) + " for [" + actionDescription  + "]: " +
+          HashCode.fromBytes(metadataHandler.getMetadataMaybe(input).digest) + " " + input.getExecPath());
+    }
+
     metadataHandler.setDigestForVirtualArtifact(middleman, entry.getFileDigest());
     if (changed) {
       actionCache.put(cacheKey, entry);

This will cause it to print all the input artifacts for each middleman rule.

For me, when I built in the first directory, the output included this:

>>>input #1 for [Middleman: runfiles for //:tool1]: ddccd589e821d933e854501db36dae2b bazel-out/local-fastbuild/internal/_middlemen/tool1-runfiles_artifacts
>>>input #2 for [Middleman: runfiles for //:tool1]: 39ee68c18e32d11afb8ca1bcb510f2cf9ab0a4fb bazel-out/local-fastbuild/bin/tool1.runfiles/MANIFEST

Notice that the MANIFEST file is one of the inputs. If you do shasum bazel-out/local-fastbuild/bin/tool1.runfiles/MANIFEST you will get the same value, 39ee68c18e32d11afb8ca1bcb510f2cf9ab0a4fb, so it's the hash of the entire contents of the file, including the absolute paths.

And in the second directory it had a different hash for the MANIFEST file:

>>>input #1 for [Middleman: runfiles for //:tool1]: ddccd589e821d933e854501db36dae2b bazel-out/local-fastbuild/internal/_middlemen/tool1-runfiles_artifacts
>>>input #2 for [Middleman: runfiles for //:tool1]: ba67503883caa67069afef538a7df446902f30c7 bazel-out/local-fastbuild/bin/tool1.runfiles/MANIFEST

mmorearty on 16 Jun 2017

What you're describing is working as intended. The ActionCacheChecker works at a different level than the remote cache, and it handles these files containing absolute paths, but that doesn't mean that the files are uploaded to the remote cache, or used as inputs to any action other than the runfiles symlink tree action.

You'd need to instrument the RemoteSpawnStrategy:

diff --git a/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnStrategy.java b/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnStrategy.java
index 98af3056d..2f9d663dc 100644
--- a/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnStrategy.java
+++ b/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnStrategy.java
@@ -57,6 +57,7 @@ import java.util.List;
 import java.util.Map;
 import java.util.SortedMap;
 import java.util.TreeSet;
+import java.util.logging.Logger;

 /**
  * Strategy that uses a distributed cache for sharing action input and output files. Optionally this
@@ -67,6 +68,8 @@ import java.util.TreeSet;
   contextType = SpawnActionContext.class
 )
 final class RemoteSpawnStrategy implements SpawnActionContext {
+  private static final Logger logger = Logger.getLogger(RemoteSpawnStrategy.class.getName());
+
   private final Path execRoot;
   private final StandaloneSpawnStrategy standaloneStrategy;
   private final boolean verboseFailures;
@@ -144,8 +147,10 @@ final class RemoteSpawnStrategy implements SpawnActionContext {
       RemoteActionCache remoteCache,
       ActionKey actionKey)
       throws ExecException, InterruptedException {
+    logger.info("Executing locally");
     standaloneStrategy.exec(spawn, actionExecutionContext);
     if (remoteOptions.remoteUploadLocalResults && remoteCache != null && actionKey != null) {
+      logger.info("Uploading result");
       ArrayList<Path> outputFiles = new ArrayList<>();
       for (ActionInput output : spawn.getOutputFiles()) {
         Path outputFile = execRoot.getRelative(output.getExecPathString());
@@ -292,6 +297,7 @@ final class RemoteSpawnStrategy implements SpawnActionContext {
               : null;
       boolean acceptCachedResult = this.remoteOptions.remoteAcceptCached;
       if (result != null) {
+        logger.info("Cache hit");
         // We don't cache failed actions, so we know the outputs exist.
         // For now, download all outputs locally; in the future, we can reuse the digests to
         // just update the TreeNodeRepository and continue the build.
@@ -303,6 +309,7 @@ final class RemoteSpawnStrategy implements SpawnActionContext {
           acceptCachedResult = false; // Retry the action remotely and invalidate the results.
         }
       }
+      logger.info("Cache miss");

       if (workExecutor == null) {
         execLocally(spawn, actionExecutionContext, remoteCache, actionKey);

Using that patch, and building a C++ binary, I get this output:

170619 07:58:08.992:I 604 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec] Cache miss
170619 07:58:08.992:I 604 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.execLocally] Executing locally
170619 07:58:09.008:I 604 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.execLocally] Uploading result
170619 07:58:09.089:I 604 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec] Cache miss
170619 07:58:09.089:I 604 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.execLocally] Executing locally
170619 07:58:09.091:I 600 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec] Cache miss
170619 07:58:09.092:I 600 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.execLocally] Executing locally
170619 07:58:09.096:I 600 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.execLocally] Uploading result
170619 07:58:09.109:I 604 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.execLocally] Uploading result

Running the same build in a copy of the client results in this log:

170619 08:02:30.443:I 603 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec] Cache hit
170619 08:02:30.484:I 597 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec] Cache hit
170619 08:02:30.485:I 603 [com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec] Cache hit

(I should probably merge a change like this (with a bit more detail), so that it's easier to debug possible issues in the future.)

For py_binary / py_test rules, unless you're using generated Python code (e.g., for protocol buffers), there's no remote step at all - the runfiles tree creation is completely local - so you won't see either a "cache hit" or "cache miss" line.

ulfjack on 19 Jun 2017

The reason the runfiles tree creation runs locally is that there's no point in caching that or running it remotely (at this point in time); it consists of creating a few symlinks in the local output directory.

ulfjack on 19 Jun 2017
