Pants: Silent crash or failure when remoting.

Created on 25 Mar 2020 · 14Comments · Source: pantsbuild/pants

The subject and labels may be misleading - there is not much to go on here. For posterity in case this becomes frequent though, details follow.

Seen on a master CI burn in the 'Integration tests - V2 (Python 3.6)' shard:

$ travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6
/home/travis/.pex/installed_wheels/b310034b9aeb62e68d8f271e153def66197e9d0d/toml-0.10.0-py2.py3-none-any.whl/toml/decoder.py:47: DeprecationWarning: invalid escape sequence \.
  TIME_RE = re.compile("([0-9]{2}):([0-9]{2}):([0-9]{2})(\.([0-9]{3,6}))?")
17:53:30 [WARN] Unmatched glob from tests/python/pants_test/backend/jvm/tasks:ivy_resolve_integration's `sources` field: "tests/python/pants_test/backend/jvm/tasks/test_ivy_resolve_integration.py"
17:53:38 [INFO] Starting tests: contrib/buildrefactor/tests/python/pants_test/contrib/buildrefactor:buildozer_integration
...
17:53:38 [INFO] Starting tests: tests/python/pants_test/goal:run_tracker_integration
[=== 00:00 CI BEGINS ===]
[=== 00:00 Setting interpreter constraints to ['CPython==3.6.*'] ===]
IntegrationTestsV2Remote
[=== 00:36 Running integration tests via V2 remote strategy. ===]
Integration test failure (V2 remote)
travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py error="exit status 1"
The command "travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6" exited with 1.

The combination of travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py and no output from the test command after having scheduled all tests is the interesting bit. Seems like a Pant crash with no details at all.

bug engine remoting and caching

Source

jsirois

👍1

All 14 comments

See again on https://github.com/pantsbuild/pants/pull/9398.

jsirois on 25 Mar 2020

I've often seen Pants output that ~20 tests have succeeded, and then it will silently crash. I believe I have only seen this in the remote execution shards.

Eric-Arellano on 25 Mar 2020

I got output for the first time!

06:23:09 [INFO] Tests succeeded: tests/python/pants_test/backend/jvm/tasks/reports
06:23:11 [INFO] Tests succeeded: contrib/cpp/tests/python/pants_test/contrib/cpp:cpp_toolchain
E0327 06:23:32.125356601 6544 chttp2_transport.c:754] client stream 1 still included in list 1

Eric-Arellano on 27 Mar 2020

I believe that I saw this again.

EDIT: And again, but this time with:

memory allocation of 190881989 bytes failed

EDIT: https://github.com/rust-lang/rust/issues/51245 allows for causing this to panic rather than just aborting, which would get us the stacktrace attempting to allocate that much. It seems to be a strangely consistent size though (~190 MB). Possibly a particular input/output we have that is 190MB...?

stuhood on 11 Apr 2020

I encountered this today with no extra information.

Eric-Arellano on 16 Apr 2020

Some whack a mole style debug output turned up some culprits: https://api.travis-ci.com/v3/job/320301504/log.txt . We're producing some very large outputs it seems. Will look this evening.

03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)

stuhood on 17 Apr 2020

❤1

Waited too long to investigate and those digests appear to be gone from the remote store. Will try again to repro.

stuhood on 21 Apr 2020

The 190MB digest is for a native_engine.so. Not too surprising, in retrospect.

The next question might be to ask why we would ever be holding multiple copies of it at once (AFAIK, we shouldn't be.) Will need to do some local debugging.

stuhood on 21 Apr 2020

👍1

I haven't been able to reproduce this locally yet under a profiling, but I've been brewing a hypothesis that we're racing to upload 32 copies of things in some cases when running integration tests. Will confirm that, but if so there are a few tacks:

Add a generalized async semaphore to throttle how much we hold in memory during these operations
~~Avoid holding these in memory while we upload them (they're in LMDB: no need to actually hold them in userspace while uploading)~~
Add some per-file memoization to de-dupe uploads

stuhood on 12 May 2020

As a lay observer, #3 sounds the most compelling because it would presumably remove unnecessary work.

Eric-Arellano on 12 May 2020

Encountered this morning https://api.travis-ci.com/v3/job/332126695/log.txt:

18:56:46:483 [INFO] Starting tests: tests/python/pants_test/backend/project_info/tasks:export_dep_as_jar_integration
memory allocation of 194079374 bytes failed

And then it fails.

Eric-Arellano on 13 May 2020

3. Add some per-file memoization to de-dupe uploads

I imagine this taking the form of a RemoteCasTracker which is basically:

enum RemoteState  {
  ProbablyUploaded,
  Uploading(SharedFuture<()>),
}

struct RemoteCasTracker {
  state: LruCache<Digest, RemoteState>
}

Each time an upload is attempted, we'd check the RemoteCasTracker - if it's ProbablyUploaded we'll skip it, if it's Uploading we'll return a clone of the upload future, and if we get definite information that it isn't uploaded we'll trigger an upload and set it to the Uploading state.

illicitonion on 16 May 2020

No repros of this since #9793. We'll likely still want to do one or both of the other things from https://github.com/pantsbuild/pants/issues/9395#issuecomment-627462097 at some point, but will wait a bit longer before closing this and opening a followup.