Bazel: Unexpected IO error (Not a directory)

Created on 2 Mar 2018  路  37Comments  路  Source: bazelbuild/bazel

ERROR: /Users/buildkite/builds/darwin-x86-64-1-1/bazel/re2/BUILD:26:1: Couldn't build
file _objs/re2/re2/filtered_re2.o: C++ compilation of rule '//:re2' failed: Unexpected IO error.:
/private/var/tmp/_bazel_buildkite/90b05a586a8f6522c740e54d1334c7a0/execroot/
com_googlesource_code_re2/external/local_config_cc/wrapped_clang (Not a directory)

We see this error frequently on macOS. @mhlopko suggested that it's a somewhat known bug in external repositories.

cc: @aehlig @dslomov

P0 apple team-Remote-Exec team-Rules-CPP

Most helpful comment

I have finally found the problem and solution to this bug.The issue is that we always create Path objects using execRoot.getRelativePath(rootRelativePath). Now that mostly works except for external dependencies, because these aren't symlinked under the execroot but only below their artifact root, which happens to be the output base.

The solution is to create the path object for an Artifact only using its ArtifactRoot (Path p = artifact.getRoot().getRoot().getRelative(rootRelativePath) and to only use the exec path for ActionInput objects.

I ll send out a fix.

All 37 comments

Is anyone investigating?

I just tried to collect evidence that this is an instance of existing issue, but I cannot find any. But I vaguely remember strange issues with skylark repositories on mac like this one.

Thanks. Marcel, may I assign this to you or is there someone for whom it's a better fit?

I'd suggest @aehlig (for external repositories expertise) or @philwo (for mac expertise). I don't plan to investigate this issue further.

Thanks! Assigning to @philwo because @aehlig is out on vacation.

Probably the same error is on re2 project in this build: https://buildkite.com/bazel/bazel-with-downstream-projects-bazel/builds/174#76e869fc-616e-4893-a2d6-f86403c64dc0

failed: Unexpected IO error.: /private/var/tmp/_bazel_buildkite/5325b99c04dd286d6e24d5b36b141d80/execroot/com_googlesource_code_re2/external/bazel_tools/tools/test/test-setup.sh (No such file or directory)

I have managed to get a stacktrace for the error:

ERROR: /Users/buchgr/code/bazel/src/main/cpp/util/BUILD:126:1: C++ compilation of rule '//src/main/cpp/util:strings' failed: Unexpected IO error.java.io.FileNotFoundException: /private/var/tmp/_bazel_buchgr/5de787c83f067e12ca3f7ef44fb23d3f/execroot/io_bazel/external/local_config_cc/wrapped_ar (No such file or directory)
    at com.google.devtools.build.lib.unix.NativePosixFiles.stat(Native Method)
    at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:177)
    at com.google.devtools.build.lib.unix.UnixFileSystem.isExecutable(UnixFileSystem.java:248)
    at com.google.devtools.build.lib.vfs.Path.isExecutable(Path.java:839)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.getOrComputeDirectory(TreeNodeRepository.java:375)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.computeMerkleDigests(TreeNodeRepository.java:407)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.computeMerkleDigests(TreeNodeRepository.java:405)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.computeMerkleDigests(TreeNodeRepository.java:405)
    at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:98)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:91)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:64)
    at com.google.devtools.build.lib.rules.cpp.SpawnGccStrategy.execWithReply(SpawnGccStrategy.java:65)
    at com.google.devtools.build.lib.rules.cpp.CppCompileAction.execute(CppCompileAction.java:1187)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:892)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:823)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:112)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:690)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:644)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:414)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:440)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:194)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:347)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:355)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
: /private/var/tmp/_bazel_buchgr/5de787c83f067e12ca3f7ef44fb23d3f/execroot/io_bazel/external/local_config_cc/wrapped_ar (No such file or directory)

After Bazel crashes, the file in question exists however. The error is hard to reproduce and I have only ever managed to reproduce it when the machine was under load. The error only seems to happen for files from remote repositories.

Some more instances of the error:

https://buildkite.com/bazel/publish-bazel-binaries/builds/50#fc59eefc-b8d3-4c53-9854-5df86d2061d1
/Users/buildkite/builds/buildkite-macpro-1-1/bazel/publish-bazel-binaries/src/main/cpp/util/BUILD:32:1: Couldn't build file src/main/cpp/util/_objs/file/src/main/cpp/util/file.o: C++ compilation of rule '//src/main/cpp/util:file' failed: Unexpected IO error.: /private/var/tmp/_bazel_buildkite/e8261f14c63cbece55e044e3e278423a/execroot/io_bazel/external/local_config_cc/cc_wrapper.sh (Not a directory)

https://buildkite.com/bazel/bazel-bazel/builds/1058#a5d34e61-0b0e-47f8-a6cf-ddbb4c774628
ERROR: /Users/buildkite/builds/buildkite-macpro-2-1/bazel/bazel-bazel/src/main/cpp/util/BUILD:96:1: Couldn't build file src/main/cpp/util/_objs/logging/src/main/cpp/util/logging.o: C++ compilation of rule '//src/main/cpp/util:logging' failed: Unexpected IO error.: /private/var/tmp/_bazel_buildkite/7f1cd7ef09bcdfc83a912d002941c64f/execroot/io_bazel/external/local_config_cc/wrapped_clang (Not a directory)

The code that triggers this error is

execRoot.getRelative(input.getExecPathString()).isExecutable()

with exectRoot being a Path object and Path.isExecutable() leading to the stat system call.

cc: @ulfjack

It's also quite possible that this is a bug in NativePosixFileSystem, as it only ever appears on macOS. I have never seen this error on Windows or Linux. I suggest setting io.bazel.EnableJni=0 for macOS on Buildkite and see if the error continues to happen.

So far, I think we've only seen this error for executable files, right? I wonder if it's something in MacOS, maybe related to its verification of executable files?

@ulfjack IIUC all output files of an action are marked executable in Bazel?

None of the paths posted so far contain "bazel-out". I'm not sure how exactly the files under external/ come into existence - the paths are related to external repositories, but that doesn't really help us narrow it down. It could be coincidence that we only see it happen for external repositories paths. The remote cache is implicitly sorting the files by name, so maybe it's just the external/ paths always get sorted first or something like that.

I have disabled the NativePosixFileSystem and this error then happens also with the JavaIoFilesystem.

ERROR: /Users/buchgr/code/bazel/src/main/cpp/util/BUILD:83:1: C++ compilation of rule '//src/main/cpp/util:port' failed: Unexpected IO error.java.io.FileNotFoundException: /private/var/tmp/_bazel_buchgr/5de787c83f067e12ca3f7ef44fb23d3f/execroot/io_bazel/external/local_config_cc/wrapped_clang_pp (No such file or directory)
    at com.google.devtools.build.lib.vfs.JavaIoFileSystem.isExecutable(JavaIoFileSystem.java:164)
    at com.google.devtools.build.lib.vfs.Path.isExecutable(Path.java:839)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.getOrComputeDirectory(TreeNodeRepository.java:375)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.computeMerkleDigests(TreeNodeRepository.java:407)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.computeMerkleDigests(TreeNodeRepository.java:405)
    at com.google.devtools.build.lib.remote.TreeNodeRepository.computeMerkleDigests(TreeNodeRepository.java:405)
    at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:98)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:91)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:64)
    at com.google.devtools.build.lib.rules.cpp.SpawnGccStrategy.execWithReply(SpawnGccStrategy.java:65)
    at com.google.devtools.build.lib.rules.cpp.CppCompileAction.execute(CppCompileAction.java:1187)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:892)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:823)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$900(SkyframeActionExecutor.java:112)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:690)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:644)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:414)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:440)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:194)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:347)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:355)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
: /private/var/tmp/_bazel_buchgr/5de787c83f067e12ca3f7ef44fb23d3f/execroot/io_bazel/external/local_config_cc/wrapped_clang_pp (No such file or directory)

@aehlig @dslomov can you comment on how files from external repositories come into existence in the execroot?

@ulfjack any ideas on how to proceed?

The next thing to do is rule out the likely answers. Does the file exist when we do the call?

Also, if we're writing the file, are we properly closing it afterwards?

I added the following code before constructing the TreeNodeRepository.

    for (ActionInput input : inputMap.values()) {
      Path p = execRoot.getRelative(input.getExecPathString());
      try {
        p.isExecutable();
      } catch (IOException e) {
        File f = new File(p.getPathString());
        report(Event.debug("stacktrace: " + Throwables
            .getStackTraceAsString(e) + ", exists: " + f.isFile()));
      }
    }

According to java.io.File.isFile() all files for which Path.isExecutable() threw an exception do indeed exist. What's more interesting is that after having added this code, there are no more failures in the TreeNodeRepository. So either the call to Path.isExecutable() or File.isFile() seems to influence the result of the subsequent Path.isExecutable() call. I'd assume that both calls translate to stat. I ll try to find out more.

The following patch makes the error go away, which makes me conclude that it's a problem with both our FileSystem implementations.

diff --git a/src/main/java/com/google/devtools/build/lib/remote/TreeNodeRepository.java b/src/main/java/com/google/devtools/build/lib/remote/TreeNodeRepository.java
index 7767cb8c7b..c1a4f34817 100644
--- a/src/main/java/com/google/devtools/build/lib/remote/TreeNodeRepository.java
+++ b/src/main/java/com/google/devtools/build/lib/remote/TreeNodeRepository.java
@@ -39,6 +39,7 @@ import com.google.devtools.build.lib.vfs.Symlinks;
 import com.google.devtools.remoteexecution.v1test.Digest;
 import com.google.devtools.remoteexecution.v1test.Directory;
 import com.google.protobuf.ByteString;
+import java.io.File;
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Arrays;
@@ -372,7 +373,7 @@ public final class TreeNodeRepository {
             b.addFilesBuilder()
                 .setName(entry.getSegment())
                 .setDigest(DigestUtil.getFromInputCache(input, inputFileCache))
-                .setIsExecutable(execRoot.getRelative(input.getExecPathString()).isExecutable());
+                .setIsExecutable(new File(execRoot.getRelative(input.getExecPathString()).getPathString()).canExecute());
           }
         } else {
           Digest childDigest = Preconditions.checkNotNull(treeNodeDigestCache.get(child));

Hmm, looks like this issue is also in 0.12.0. Now it's causing presubmit to fail sometimes.
https://buildkite.com/bazel/google-bazel-presubmit/builds/1645

@meteorcloudy Ohh, that's very exciting! Does it possibly mean that we only introduced this between 0.11.0 and 0.12.0? I always thought that it exists much longer already, but somehow we didn't trigger it before.

Is it possible we can make some progress on this issue this week? I'd like to have 0.13.0 by the end of the month.

I unfortunately don't have time to look into it this week. Maybe @philwo ?

Hmm, I understand this is a very tricky bug.
Considering it also exists in 0.12.0, do you think it's okay to remove it as a release blocker for 0.13.0?
@philwo @buchgr

I think it should not be a release blocker. It probably also does exist before 0.12.0 but we didn't notice it as it only ever happens with remote caching / execution on macOS and that has simply not been tested much by us or anyone else.

SGTM

Hi,

we currently don't know why it's happening and I don't have the time to investigate this more. It's affecting quite a lot of users though. In the mean time, I propose to be pragmatic and to always mark all files as executable [1]. I am not proposing to close this bug.

Do you have any objections @ola-rozenfeld @werkt @ulfjack ?

[1] https://source.bazel.build/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/TreeNodeRepository.java;l=380?q=TreeNodeRepository.java

My branch https://github.com/werkt/bazel/tree/bazel-0.13.0-remote-execution has a commit (0d84285810aecaf992d6289e0fab5b57668345fc) which I believe may have some influence on this. If someone can run it (it will not have the 'always mark files executable' commit) and verify if the problem can no longer be observed on a Mac, I will rebase it, revert 3e3b71a as a part of it, and submit as a PR.

This happened again today in Bazel CI: https://buildkite.com/bazel/bazel-with-downstream-projects-bazel/builds/254#43161f5d-12a1-4577-a1b9-8ede3f58b5ab

We still see this every now and then on buildkite. @philwo spent some time on this last week, I'm not sure he was able to find the culprit.

Is anyone looking at this bug?

Unfortunately I don't have time to look into this, but it's a really annoying bug :( Also, I don't really know where to even start looking into this. Would be glad for any help.

Happened again here: https://buildkite.com/bazel/rules-docker-docker/builds/966#9d04d913-880b-421b-9b01-c0161174cd18

I have finally found the problem and solution to this bug.The issue is that we always create Path objects using execRoot.getRelativePath(rootRelativePath). Now that mostly works except for external dependencies, because these aren't symlinked under the execroot but only below their artifact root, which happens to be the output base.

The solution is to create the path object for an Artifact only using its ArtifactRoot (Path p = artifact.getRoot().getRoot().getRelative(rootRelativePath) and to only use the exec path for ActionInput objects.

I ll send out a fix.

Wow! Thanks!! 馃榾

I鈥檓 curious: Why does it only sometimes fail though (and mostly on macOS)? Is there a race somewhere that lets this work most of the time even though the path is wrong?

Wow, that's quite impressive. How did you find this bug?

Also, it seems to be easy to do the wrong thing and hard to do the right thing, i.e. I would never guess that artifact.getRoot() then another getRoot() and finally getRelative() is the way to go. Do you have any suggestion about simplifying the API or somehow making it easy to create Artifacts the right way?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

f1recracker picture f1recracker  路  3Comments

kastiglione picture kastiglione  路  3Comments

ob picture ob  路  3Comments

cyberbono3 picture cyberbono3  路  3Comments

sandipmgiri picture sandipmgiri  路  3Comments