The subject and labels may be misleading - there is not much to go on here. For posterity in case this becomes frequent though, details follow.
Seen on a master CI burn in the 'Integration tests - V2 (Python 3.6)' shard:
$ travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6
/home/travis/.pex/installed_wheels/b310034b9aeb62e68d8f271e153def66197e9d0d/toml-0.10.0-py2.py3-none-any.whl/toml/decoder.py:47: DeprecationWarning: invalid escape sequence \.
TIME_RE = re.compile("([0-9]{2}):([0-9]{2}):([0-9]{2})(\.([0-9]{3,6}))?")
17:53:30 [WARN] Unmatched glob from tests/python/pants_test/backend/jvm/tasks:ivy_resolve_integration's `sources` field: "tests/python/pants_test/backend/jvm/tasks/test_ivy_resolve_integration.py"
17:53:38 [INFO] Starting tests: contrib/buildrefactor/tests/python/pants_test/contrib/buildrefactor:buildozer_integration
...
17:53:38 [INFO] Starting tests: tests/python/pants_test/goal:run_tracker_integration
[=== 00:00 CI BEGINS ===]
[=== 00:00 Setting interpreter constraints to ['CPython==3.6.*'] ===]
IntegrationTestsV2Remote
[=== 00:36 Running integration tests via V2 remote strategy. ===]
Integration test failure (V2 remote)
travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py error="exit status 1"
The command "travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6" exited with 1.
The combination of travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py and no output from the test command after having scheduled all tests is the interesting bit. Seems like a Pant crash with no details at all.
See again on https://github.com/pantsbuild/pants/pull/9398.
I've often seen Pants output that ~20 tests have succeeded, and then it will silently crash. I believe I have only seen this in the remote execution shards.
I got output for the first time!
06:23:09 [INFO] Tests succeeded: tests/python/pants_test/backend/jvm/tasks/reports
06:23:11 [INFO] Tests succeeded: contrib/cpp/tests/python/pants_test/contrib/cpp:cpp_toolchain
E0327 06:23:32.125356601 6544 chttp2_transport.c:754] client stream 1 still included in list 1
I believe that I saw this again.
EDIT: And again, but this time with:
memory allocation of 190881989 bytes failed
EDIT: https://github.com/rust-lang/rust/issues/51245 allows for causing this to panic rather than just aborting, which would get us the stacktrace attempting to allocate that much. It seems to be a strangely consistent size though (~190 MB). Possibly a particular input/output we have that is 190MB...?
I encountered this today with no extra information.
Some whack a mole style debug output turned up some culprits: https://api.travis-ci.com/v3/job/320301504/log.txt . We're producing some very large outputs it seems. Will look this evening.
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
Waited too long to investigate and those digests appear to be gone from the remote store. Will try again to repro.
The 190MB digest is for a native_engine.so. Not too surprising, in retrospect.
The next question might be to ask why we would ever be holding multiple copies of it at once (AFAIK, we shouldn't be.) Will need to do some local debugging.
I haven't been able to reproduce this locally yet under a profiling, but I've been brewing a hypothesis that we're racing to upload 32 copies of things in some cases when running integration tests. Will confirm that, but if so there are a few tacks:
As a lay observer, #3 sounds the most compelling because it would presumably remove unnecessary work.
Encountered this morning https://api.travis-ci.com/v3/job/332126695/log.txt:
18:56:46:483 [INFO] Starting tests: tests/python/pants_test/backend/project_info/tasks:export_dep_as_jar_integration
memory allocation of 194079374 bytes failed
And then it fails.
3. Add some per-file memoization to de-dupe uploads
I imagine this taking the form of a RemoteCasTracker which is basically:
enum RemoteState {
ProbablyUploaded,
Uploading(SharedFuture<()>),
}
struct RemoteCasTracker {
state: LruCache<Digest, RemoteState>
}
Each time an upload is attempted, we'd check the RemoteCasTracker - if it's ProbablyUploaded we'll skip it, if it's Uploading we'll return a clone of the upload future, and if we get definite information that it isn't uploaded we'll trigger an upload and set it to the Uploading state.
No repros of this since #9793. We'll likely still want to do one or both of the other things from https://github.com/pantsbuild/pants/issues/9395#issuecomment-627462097 at some point, but will wait a bit longer before closing this and opening a followup.
I imagine this taking the form of a RemoteCasTracker which is basically:
Might also be able to get away with an async Mutex around the result value, which might make it easier to recover from failure.
No repros in a few weeks! I've moved the optimization ideas to #9960, and am resolving this as fixed.