https://ci.nodejs.org/job/node-test-commit-aix/11376/nodes=aix61-ppc64/console
not ok 934 parallel/test-http2-createwritereq
  ---
  duration_ms: 1.818
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
I guess I'll start by pinging the test author: @apapirovski
Signal 11 = segmentation fault. I'm guessing AIX on CI is not configured to preserve core files, but let's ask. Ping @nodejs/build.
And ping @nodejs/platform-aix for additional troubleshooting and/or suggestions on how to proceed.
Is there any way to run stress tests at a given git SHA or something like that? https://github.com/nodejs/node/pull/17406 or https://github.com/nodejs/node/pull/17718 would seem like possible causes to me
In the past, I've created branches at specific SHAs, pushed them to my fork (possibly to my master branch, I can't remember if I got it to work off another branch or not), and run the stress test off of my fork.
I guess it might make sense to run a stress test against master to see if this is reproducible.
Running it serially: https://ci.nodejs.org/job/node-stress-single-test/1574/nodes=aix61-ppc64/
Running it in parallel (96 processes): https://ci.nodejs.org/job/node-stress-single-test/1575/nodes=aix61-ppc64/
Ran it 2000 times locally with no luck reproducing the crash. Either wait for @apapirovski, seek hints on possible causes, or gain login access to CI. /cc @mhdawson @gibfahn
The test does not seem to use much memory (a few bytes of data are transported through http2 and validated to make sure they pass through properly with differing encodings), which rules out native memory constraints as a source of trouble (a known difference between the CI and local AIX boxes that has led to some failures in the past).
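For context, here is a minimal sketch of the pattern the test exercises (this is not the actual test, which lives at test/parallel/test-http2-createwritereq.js): a handful of bytes written with explicit encodings, echoed back by the server, and checked on the client.

```js
'use strict';
// Rough sketch only: the real test covers several encodings and inputs.
const assert = require('assert');
const http2 = require('http2');

// Server simply echoes whatever the client sends.
const server = http2.createServer((req, res) => {
  req.pipe(res);
});

server.listen(0, () => {
  const client = http2.connect(`http://localhost:${server.address().port}`);
  const req = client.request({ ':method': 'POST' });

  let received = '';
  req.setEncoding('utf8');
  req.on('data', (chunk) => received += chunk);
  req.on('end', () => {
    // Validate that the bytes made the round trip intact.
    assert.strictEqual(received, 'abc');
    client.close();
    server.close();
  });

  // Only a handful of bytes cross the wire, written with differing encodings.
  req.write('ab', 'utf8');
  req.end('c', 'latin1');
});
```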
Reproduced it in CI stress tests, so at least it's reproducible in that environment. Running serially, it got 31 failures in 9999 runs:
9999 OK: 9968 NOT OK: 31 TOTAL: 9999
+ '[' 31 '!=' 0 ']'
+ echo The test is flaky.
The test is flaky.
+ exit 1
Build step 'Conditional steps (multiple)' marked build as failure
Notifying upstream projects of job completion
Finished: FAILURE
...so running under load is not a precondition for failure.
I don't think that test does anything particularly weird. If I had to guess, other http2 tests that write stuff might be flaky on that platform too. It's reasonably likely that the recent changes to http2 are responsible. I'll try to have a look when I have time but as @addaleax mentioned, a good place to start is running a couple of stress tests against master rewound to the two PRs above.
Thanks @apapirovski. @Trott - is it (artificially rewinding to the two PRs and running stress tests on the result) something which can be done through the existing build jobs or custom scripts?
In terms of isolation: both PRs contain a considerable amount of code changes, which makes it difficult for me to manually locate the crash reason. On the other hand, even if we isolate the issue to one of these PRs, the failure reason still needs to be worked out bottom-up (starting from the crashing context). So I would prefer (in my way of problem determination) performing core dump analysis directly on the CI machine.
❌ = failure, ✅ = success
❌ Stress test rewound to e0e6b68 (right after #17718 landed): https://ci.nodejs.org/job/node-stress-single-test/1576/nodes=aix61-ppc64/
✅ Stress test rewound to e3b44b6 (one commit before the above, so that only the first of the two commits from #17718 has landed): https://ci.nodejs.org/job/node-stress-single-test/1577/nodes=aix61-ppc64/
✅ Stress test rewound to e554bc8 (right before #17718): https://ci.nodejs.org/job/node-stress-single-test/1579/nodes=aix61-ppc64/
@gireeshpunathil Stress tests aren't done running yet, but it sure looks like the issue is probably with e0e6b68. (The stress test at that commit failed, and the stress tests at each of the two commits before it are both clean so far.)
Stress tests are done and the evidence does point to e0e6b68....
Thanks @Trott, for the info.
Btw, I could reproduce this locally on Linux, so it isn’t AIX-specific … it’s my patch that caused this so I’ll try to figure out what’s going on here :)
Thanks @addaleax, that is good news, in that we don't need access to the CI. How frequent is the crash? Similar to what we saw on AIX?
It’s less than 1 in 1000 – I think that’s about the same rate.
From some initial poking around, it seems like one of the Http2Session instances is garbage collected while it is handling data from the socket (i.e. inside nghttp2_session_mem_recv()).
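A rough sketch of the kind of harness that can make this easier to hit locally (my own guess at a stress setup, not the reproduction that ended up in the PR below) is to force garbage collection aggressively while http2 request data is still in flight, e.g. with node --expose-gc:

```js
'use strict';
// Hypothetical GC-stress harness (run with `node --expose-gc gc-stress.js`).
// It does not target the exact window, but collecting aggressively while many
// short-lived sessions are still processing socket data makes a
// "collected while in use" Http2Session far more likely to show up as a crash.
const http2 = require('http2');

const server = http2.createServer((req, res) => {
  req.pipe(res);
});

server.listen(0, () => {
  const port = server.address().port;

  // Keep collecting while the requests below are being handled.
  const timer = setInterval(() => global.gc(), 1);

  let pending = 200;
  for (let i = 0; i < 200; i++) {
    const client = http2.connect(`http://localhost:${port}`);
    const req = client.request({ ':method': 'POST' });
    req.resume();                 // drain the echoed body
    req.on('close', () => {
      client.close();
      if (--pending === 0) {
        clearInterval(timer);
        server.close();
      }
    });
    req.end('some payload');
  }
});
```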
Reproducible test case + fix @ https://github.com/nodejs/node/pull/17863
Removing the test label here because it's not an issue with the test itself.