Using a Process to invoke an is_executable=True script can result in:
Error launching process: Os { code: 26, kind: Other, message: "Text file busy" }
We have tests that cover this behavior: https://github.com/pantsbuild/pants/blob/49f5a0aa8b68a9f079481d7f6a682b217ad9742b/src/python/pants/engine/process_test.py#L250-L277 ... so it's possible that this is non-deterministic.
Grr this blocks #10442 https://travis-ci.com/github/pantsbuild/pants/jobs/366894459#L776.
To reproduce, apply this diff:
diff --git a/src/python/pants/engine/process.py b/src/python/pants/engine/process.py
index 2dbb58714..0001946d2 100644
--- a/src/python/pants/engine/process.py
+++ b/src/python/pants/engine/process.py
@@ -336,12 +336,12 @@ class BinaryPaths(EngineAware):
@rule(desc="Find binary path")
async def find_binary(request: BinaryPathRequest) -> BinaryPaths:
# TODO(John Sirois): Replace this script with a statically linked native binary so we don't
- # depend on either /bin/bash being available on the Process host.
- # TODO(#10507): Running the script directly from a shebang sometimes results in a "Text file
- # busy" error.
- script_path = "./script.sh"
+ # depend on either `/usr/bin/env` or bash being available on the Process host.
+ script_path = "./find_binary.sh"
script_content = dedent(
"""
+ #!/usr/bin/env bash
+
set -euo pipefail
if command -v which > /dev/null; then
@@ -363,7 +363,7 @@ async def find_binary(request: BinaryPathRequest) -> BinaryPaths:
Process(
description=f"Searching for `{request.binary_name}` on PATH={search_path}",
input_digest=script_digest,
- argv=["/bin/bash", script_path, request.binary_name],
+ argv=[script_path, request.binary_name],
env={"PATH": search_path},
),
)
Then run something like ./pants run build-support/bin/check_headers.py --no-process-execution-use-local-cache --no-pantsd. The flags at the end are to force you to re-search the python binary to use to run the Pex.
I suspect that this only happens on Linux. I haven't been able to get this error locally on macOS. On my Ubuntu VM, I hit a different error that none of the interpreters were discovered in the first place with this change, rather than hitting Error Code 26. In CI, we've hit Error Code 26 a few times.
The current hypothesis is that we're not calling fsync enough, which is called sync in Rust. See https://doc.rust-lang.org/std/fs/struct.File.html#method.sync_all.
We call fsync right now once on the input digest:
If you follow materialize_directory, you'll see that we call .sync_all() at the end:
The hypothesis is that for some reason that is not enough. We want to experiment if calling fsync again right here would fix things:
(I'm not sure what that snippet will be. Stu had an idea).
It seems likely that this is docker related. This thread is interesting: https://github.com/moby/moby/issues/9547
Will see if the nuclear option (a global sync) resolves this.
It seems likely that this is docker related.
That sounds plausible, but I'm not sure Docker is it. We saw this error on the lint shard _before_ RBE was used, i.e. when running ./pants run build-support/bin/get_rbe_token.py. That happens before RBE is used. That lint shard does not use Docker and only uses the default Travis environment. (Unless, under-the-hood, Travis is running Docker?)
Still happens :/ https://travis-ci.com/github/pantsbuild/pants/jobs/369865059#L952
Yeah, see: https://stackoverflow.com/questions/37288453/calling-fsync2-after-close2
That and its pointers say (open, write, close, open, fsync, close) != (open, fsync, close) on various Linux kernels. We'll need to _always_ call /bin/sync when present to get this hack working for CI and then I'm working on a fix that tries to retain perf and still do (open, write, fsync, close) by doing (open, write) x N .... -> (fsync, close) x N
Wrapping things up here:
For completeness, some postscript on #10577:
Although controlling forks in a central location and serializing fork/exec against any binaries we write out solves the issue for code we control, it does not solve the general problem when we do not control all code (libraries we link against). As such, #10577 added a bounded retry loop to protect against uncontrolled forks.
An alternative approach requiring no explicit locks or retry loop workarounds would be to write out binaries in a seperate multithreaded process that does no forking at all. We'd need to wire up a communication channel (pipes) to send digests that need to be written out and be able to await completion of the writing. That form of solution would probably pair well with our still unused brfs fuse filesystem. To use that filesystem, we'll want to keep it mounted at least as longs as pantsd stays up if not longer. Its easy to imagine a second Pants daemon with a lifttime >= pantsd that:
Note that there is a fork+exec here, but only 1 and on daemon startup before an event loop is ever started. Note also that we finally leverage brfs which should significantly improve digest materialization performance - no data copying, just symlinks that point directly into lmdb memory.
Would brfs even suffer from this, given that you never "open" the executable file for writing? It just "exists" from the get go when read by a client?
Would brfs even suffer from this, given that you never "open" the executable file for writing? It just "exists" from the get go when read by a client?
You're right, it would not since, critically, we would use it asymmetrically and write directly to LMDB (not through the brfs filesystem) and only read from the brfs filesystem. If we wrote through the brfs filesystem interface we'd have the same problem.
Most helpful comment
Wrapping things up here:
https://github.com/golang/go/issues/22220
https://github.com/golang/go/issues/22315
10577 works around 3 with a fork+exec lock.