Node: Flaky parallel/test-trace-events-fs-sync

Created on 30 May 2018 · 51 comments · Source: nodejs/node

Test failing occasionally on FreeBSD with different assertion failures:

https://ci.nodejs.org/job/node-test-commit-freebsd/18147/ (nodejs/node#20962)

assert.js:80
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: fs.sync.access:
    at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:131:10)
    at Module._compile (internal/modules/cjs/loader.js:702:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
    at Module.load (internal/modules/cjs/loader.js:612:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
    at Function.Module._load (internal/modules/cjs/loader.js:543:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
    at startup (internal/bootstrap/node.js:239:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:573:3)

https://ci.nodejs.org/job/node-test-commit-freebsd/18103/ (nodejs/node#21003)

assert.js:80
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: fs.sync.ftruncate:
    at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:131:10)
    at Module._compile (internal/modules/cjs/loader.js:702:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
    at Module.load (internal/modules/cjs/loader.js:612:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
    at Function.Module._load (internal/modules/cjs/loader.js:543:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
    at startup (internal/bootstrap/node.js:239:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:573:3)

https://ci.nodejs.org/job/node-test-commit-freebsd/18143/ (nodejs/node#21032)

assert.js:80
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: fs.sync.readlink:
    at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:131:10)
    at Module._compile (internal/modules/cjs/loader.js:702:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
    at Module.load (internal/modules/cjs/loader.js:612:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
    at Function.Module._load (internal/modules/cjs/loader.js:543:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
    at startup (internal/bootstrap/node.js:239:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:573:3)
Labels: CI / flaky test, help wanted, trace_events

Most helpful comment

I fixed a thread race upstream (https://github.com/v8/v8/commit/602aeb40d0e9b76e7206b2f0f5d7eaf10d82d22c). I'm running a build with that fix picked up. So far no failures in 200 runs. Previously I was seeing a failure in ~50 runs.

All 51 comments

Seems like this assertion should print the entire proc object rather than just proc.status. This way we could see proc.error, proc.signal, proc.stderr and other things that might be useful to see when the test fails.
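A minimal sketch of what that could look like (illustrative only; the real test's child arguments and label differ):

'use strict';
const assert = require('assert');
const util = require('util');
const { spawnSync } = require('child_process');

// Illustrative: spawn a child roughly the way the test does, and put the whole
// spawnSync result into the assertion message so signal/stdout/stderr are
// visible on failure instead of just the exit status.
const name = 'fs.sync.access'; // the fs.sync.* label being checked
const proc = spawnSync(process.execPath,
                       ['-e', 'require("fs").accessSync(__filename)']);
assert(proc.status === 0, `${name}:\n${util.inspect(proc)}`);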

Not just FreeBSD. Here it is failing on LinuxONE:

https://ci.nodejs.org/job/node-test-commit-linuxone/1863/nodes=rhel72-s390x/console

03:04:07 not ok 1843 parallel/test-trace-events-fs-sync
03:04:07   ---
03:04:07   duration_ms: 2.122
03:04:07   severity: fail
03:04:07   exitcode: 1
03:04:07   stack: |-
03:04:07     assert.js:80
03:04:07       throw new AssertionError(obj);
03:04:07       ^
03:04:07     
03:04:07     AssertionError [ERR_ASSERTION]: fs.sync.rmdir: 
03:04:07         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:131:10)
03:04:07         at Module._compile (internal/modules/cjs/loader.js:702:30)
03:04:07         at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
03:04:07         at Module.load (internal/modules/cjs/loader.js:612:32)
03:04:07         at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
03:04:07         at Function.Module._load (internal/modules/cjs/loader.js:543:3)
03:04:07         at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
03:04:07         at startup (internal/bootstrap/node.js:261:19)
03:04:07         at bootstrapNodeJSCore (internal/bootstrap/node.js:595:3)
03:04:07   ...

I updated the title of this issue to remove "on FreeBSD", as it appears to be more widespread than that.

Stress test to confirm this is reproducible on FreeBSD:

https://ci.nodejs.org/job/node-stress-single-test/1906/nodes=freebsd11-x64/

stderr is empty and the assertion swallows all the other info that might be useful (status, stdout, signal).

PR to improve output: https://github.com/nodejs/node/pull/21120

Failure on LinuxONE:

19:19:16 not ok 1832 parallel/test-trace-events-fs-sync
19:19:16   ---
19:19:16   duration_ms: 1.458
19:19:16   severity: fail
19:19:16   exitcode: 1
19:19:16   stack: |-
19:19:16     assert.js:80
19:19:16       throw new AssertionError(obj);
19:19:16       ^
19:19:16     
19:19:16     AssertionError [ERR_ASSERTION]: fs.sync.fchmod: 
19:19:16         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:131:10)
19:19:16         at Module._compile (internal/modules/cjs/loader.js:702:30)
19:19:16         at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
19:19:16         at Module.load (internal/modules/cjs/loader.js:612:32)
19:19:16         at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
19:19:16         at Function.Module._load (internal/modules/cjs/loader.js:543:3)
19:19:16         at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
19:19:16         at startup (internal/bootstrap/node.js:261:19)
19:19:16         at bootstrapNodeJSCore (internal/bootstrap/node.js:595:3)
19:19:16   ...

Stress test now that we have more useful output: https://ci.nodejs.org/job/node-stress-single-test/1918/

So... we've got a SIGSEGV... :|

09:15:45 not ok 53 parallel/test-trace-events-fs-sync
09:15:45   ---
09:15:45   duration_ms: 0.340
09:15:45   severity: fail
09:15:45   exitcode: 1
09:15:45   stack: |-
09:15:45     assert.js:80
09:15:45       throw new AssertionError(obj);
09:15:45       ^
09:15:45     
09:15:45     AssertionError [ERR_ASSERTION]: fs.sync.access:
09:15:45     { status: null,
09:15:45       signal: 'SIGSEGV',
09:15:45       output: [ null, '', '' ],
09:15:45       pid: 83533,
09:15:45       stdout: '',
09:15:45       stderr: '' }
09:15:45         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-stress-single-test/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:136:10)
09:15:45         at Module._compile (internal/modules/cjs/loader.js:702:30)
09:15:45         at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
09:15:45         at Module.load (internal/modules/cjs/loader.js:612:32)
09:15:45         at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
09:15:45         at Function.Module._load (internal/modules/cjs/loader.js:543:3)
09:15:45         at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
09:15:45         at startup (internal/bootstrap/node.js:267:19)
09:15:45         at bootstrapNodeJSCore (internal/bootstrap/node.js:600:3)

This could be related to the trace events failures we've previously seen on Windows. ping @jasnell @ofrobots since one of you previously mentioned some segfaults you saw in the trace events code and were planning to fix?

I'm investigating some SIGSEGVs in tracing today.

There are a few flaws in the choreography of the tracing event loop on shutdown that cause the segfault: basically, the libuv threadpool mutex gets destroyed as part of the static destructors, but the tracing agent shutdown happens after that. It tries to do some I/O to write the terminating JSON characters, which ends up crashing. I'm still looking into what might be the best way to fix this.

This explains the segfault. It is not clear to me whether the other original failures are related to this same underlying issue, but first things first.

/cc @eugeneo

Failed on LinuxONE again today.

https://ci.nodejs.org/job/node-test-commit-linuxone/2386/nodes=rhel72-s390x/console

10:12:08 not ok 1848 parallel/test-trace-events-fs-sync
10:12:08   ---
10:12:08   duration_ms: 1.462
10:12:08   severity: fail
10:12:08   exitcode: 1
10:12:08   stack: |-
10:12:08     assert.js:80
10:12:08       throw new AssertionError(obj);
10:12:08       ^
10:12:08     
10:12:08     AssertionError [ERR_ASSERTION]: fs.sync.fstat:
10:12:08     { status: null,
10:12:08       signal: 'SIGSEGV',
10:12:08       output: [ null, '', '' ],
10:12:08       pid: 51890,
10:12:08       stdout: '',
10:12:08       stderr: '' }
10:12:08         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:136:10)
10:12:08         at Module._compile (internal/modules/cjs/loader.js:702:30)
10:12:08         at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
10:12:08         at Module.load (internal/modules/cjs/loader.js:612:32)
10:12:08         at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
10:12:08         at Function.Module._load (internal/modules/cjs/loader.js:543:3)
10:12:08         at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
10:12:08         at startup (internal/bootstrap/node.js:237:19)
10:12:08         at bootstrapNodeJSCore (internal/bootstrap/node.js:561:3)
10:12:08   ...

Failed again on LinuxONE today.

https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel72-s390x/3018/console

22:34:36 not ok 1910 parallel/test-trace-events-fs-sync
22:34:36   ---
22:34:36   duration_ms: 3.131
22:34:36   severity: fail
22:34:36   exitcode: 1
22:34:36   stack: |-
22:34:36     assert.js:80
22:34:36       throw new AssertionError(obj);
22:34:36       ^
22:34:36     
22:34:36     AssertionError [ERR_ASSERTION]: fs.sync.stat:
22:34:36     { status: null,
22:34:36       signal: 'SIGSEGV',
22:34:36       output: [ null, '', '' ],
22:34:36       pid: 60920,
22:34:36       stdout: '',
22:34:36       stderr: '' }
22:34:36         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
22:34:36         at Module._compile (internal/modules/cjs/loader.js:689:30)
22:34:36         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
22:34:36         at Module.load (internal/modules/cjs/loader.js:599:32)
22:34:36         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
22:34:36         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
22:34:36         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
22:34:36         at startup (internal/bootstrap/node.js:266:19)
22:34:36         at bootstrapNodeJSCore (internal/bootstrap/node.js:596:3)
22:34:36   ...

LinuxONE again:

https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel72-s390x/3551/console

21:48:50 not ok 1954 parallel/test-trace-events-fs-sync
21:48:50   ---
21:48:50   duration_ms: 2.975
21:48:50   severity: fail
21:48:50   exitcode: 1
21:48:50   stack: |-
21:48:50     assert.js:84
21:48:50       throw new AssertionError(obj);
21:48:50       ^
21:48:50     
21:48:50     AssertionError [ERR_ASSERTION]: fs.sync.open:
21:48:50     { status: null,
21:48:50       signal: 'SIGSEGV',
21:48:50       output: [ null, '', '' ],
21:48:50       pid: 24620,
21:48:50       stdout: '',
21:48:50       stderr: '' }
21:48:50         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
21:48:50         at Module._compile (internal/modules/cjs/loader.js:689:30)
21:48:50         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
21:48:50         at Module.load (internal/modules/cjs/loader.js:599:32)
21:48:50         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
21:48:50         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
21:48:50         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
21:48:50         at startup (internal/bootstrap/node.js:266:19)
21:48:50         at bootstrapNodeJSCore (internal/bootstrap/node.js:596:3)
21:48:50   ...

LinuxONE again:

https://ci.nodejs.org/job/node-test-commit-linuxone/3651/nodes=rhel72-s390x/console

22:22:00 not ok 1982 parallel/test-trace-events-fs-sync
22:22:00   ---
22:22:00   duration_ms: 3.631
22:22:00   severity: fail
22:22:00   exitcode: 1
22:22:00   stack: |-
22:22:00     assert.js:84
22:22:00       throw new AssertionError(obj);
22:22:00       ^
22:22:00     
22:22:00     AssertionError [ERR_ASSERTION]: fs.sync.unlink:
22:22:00     { status: null,
22:22:00       signal: 'SIGSEGV',
22:22:00       output: [ null, '', '' ],
22:22:00       pid: 3019,
22:22:00       stdout: '',
22:22:00       stderr: '' }
22:22:00         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
22:22:00         at Module._compile (internal/modules/cjs/loader.js:689:30)
22:22:00         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
22:22:00         at Module.load (internal/modules/cjs/loader.js:599:32)
22:22:00         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
22:22:00         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
22:22:00         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
22:22:00         at startup (internal/bootstrap/node.js:266:19)
22:22:00         at bootstrapNodeJSCore (internal/bootstrap/node.js:596:3)
22:22:00   ...

Can anybody reproduce this locally, ideally on Linux? I haven’t had success so far…

More failures, which I dup-reported in #22238. All are SIGSEGV, but the failing event varies (e.g. chmod / symlink / mkdtemp / ...).

assert.js:84
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: fs.sync.chmod:
{ status: null,
  signal: 'SIGSEGV',
  output: [ null, '', '' ],
  pid: 57470,
  stdout: '',
  stderr: '' }
    at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
    at Module._compile (internal/modules/cjs/loader.js:689:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
    at Module.load (internal/modules/cjs/loader.js:599:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
    at Function.Module._load (internal/modules/cjs/loader.js:530:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
    at startup (internal/bootstrap/node.js:257:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:594:3)

running -j 8 parallel/test-t* X 200

https://ci.nodejs.org/job/node-test-commit-linuxone/4108/nodes=rhel72-s390x/console

13:27:15 not ok 1860 parallel/test-trace-events-fs-sync
13:27:15   ---
13:27:15   duration_ms: 1.822
13:27:15   severity: fail
13:27:15   exitcode: 1
13:27:15   stack: |-
13:27:15     assert.js:84
13:27:15       throw new AssertionError(obj);
13:27:15       ^
13:27:15     
13:27:15     AssertionError [ERR_ASSERTION]: fs.sync.open:
13:27:15     { status: null,
13:27:15       signal: 'SIGSEGV',
13:27:15       output: [ null, '', '' ],
13:27:15       pid: 17232,
13:27:15       stdout: '',
13:27:15       stderr: '' }
13:27:15         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
13:27:15         at Module._compile (internal/modules/cjs/loader.js:689:30)
13:27:15         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
13:27:15         at Module.load (internal/modules/cjs/loader.js:599:32)
13:27:15         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
13:27:15         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
13:27:15         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
13:27:15         at startup (internal/bootstrap/node.js:257:19)
13:27:15         at bootstrapNodeJSCore (internal/bootstrap/node.js:628:3)
13:27:15   ...

FreeBSD today, SIGSEGV as usual.

https://ci.nodejs.org/job/node-test-commit-freebsd/19955/nodes=freebsd11-x64/console

00:09:42 not ok 1894 parallel/test-trace-events-fs-sync # TODO : Fix flaky test
00:09:42   ---
00:09:42   duration_ms: 1.740
00:09:42   severity: flaky
00:09:42   exitcode: 1
00:09:42   stack: |-
00:09:42     assert.js:84
00:09:42       throw new AssertionError(obj);
00:09:42       ^
00:09:42     
00:09:42     AssertionError [ERR_ASSERTION]: fs.sync.lchown:
00:09:42     { status: null,
00:09:42       signal: 'SIGSEGV',
00:09:42       output: [ null, '', '' ],
00:09:42       pid: 48848,
00:09:42       stdout: '',
00:09:42       stderr: '' }
00:09:42         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:139:10)
00:09:42         at Module._compile (internal/modules/cjs/loader.js:689:30)
00:09:42         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
00:09:42         at Module.load (internal/modules/cjs/loader.js:599:32)
00:09:42         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
00:09:42         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
00:09:42         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
00:09:42         at startup (internal/bootstrap/node.js:257:19)
00:09:42         at bootstrapNodeJSCore (internal/bootstrap/node.js:636:3)
00:09:42   ...

@ofrobots Anything further to report on this? Trying to get a feel for whether there's work going on that we're just not seeing here in this issue, or whether things have stalled out...

Labeling help wanted.

https://ci.nodejs.org/job/node-test-commit-linuxone/4521/nodes=rhel72-s390x/console

03:03:23 not ok 1994 parallel/test-trace-events-fs-sync # TODO : Fix flaky test
03:03:23   ---
03:03:23   duration_ms: 4.13
03:03:23   severity: flaky
03:03:23   exitcode: 1
03:03:23   stack: |-
03:03:23     assert.js:84
03:03:23       throw new AssertionError(obj);
03:03:23       ^
03:03:23     
03:03:23     AssertionError [ERR_ASSERTION]: fs.sync.unlink:
03:03:23     { status: null,
03:03:23       signal: 'SIGSEGV',
03:03:23       output: [ null, '', '' ],
03:03:23       pid: 59157,
03:03:23       stdout: '',
03:03:23       stderr: '' }
03:03:23         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
03:03:23         at Module._compile (internal/modules/cjs/loader.js:689:30)
03:03:23         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
03:03:23         at Module.load (internal/modules/cjs/loader.js:599:32)
03:03:23         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
03:03:23         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
03:03:23         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
03:03:23         at startup (internal/bootstrap/node.js:259:19)
03:03:23         at bootstrapNodeJSCore (internal/bootstrap/node.js:729:3)
03:03:23   ...

This continues to fail over and over on multiple platforms. Here are two more from the same node-daily-master run.

https://ci.nodejs.org/job/node-test-commit-freebsd/20195/nodes=freebsd11-x64/console

00:10:20 not ok 1897 parallel/test-trace-events-fs-sync # TODO : Fix flaky test
00:10:20   ---
00:10:20   duration_ms: 1.929
00:10:20   severity: flaky
00:10:20   exitcode: 1
00:10:20   stack: |-
00:10:20     assert.js:84
00:10:20       throw new AssertionError(obj);
00:10:20       ^
00:10:20     
00:10:20     AssertionError [ERR_ASSERTION]: fs.sync.link:
00:10:20     { status: null,
00:10:20       signal: 'SIGSEGV',
00:10:20       output: [ null, '', '' ],
00:10:20       pid: 46494,
00:10:20       stdout: '',
00:10:20       stderr: '' }
00:10:20         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:139:10)
00:10:20         at Module._compile (internal/modules/cjs/loader.js:689:30)
00:10:20         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
00:10:20         at Module.load (internal/modules/cjs/loader.js:599:32)
00:10:20         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
00:10:20         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
00:10:20         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
00:10:20         at startup (internal/bootstrap/node.js:264:19)
00:10:20         at bootstrapNodeJSCore (internal/bootstrap/node.js:739:3)
00:10:20   ...

https://ci.nodejs.org/job/node-test-commit-linuxone/4693/nodes=rhel72-s390x/console:

00:03:44 not ok 2000 parallel/test-trace-events-fs-sync # TODO : Fix flaky test
00:03:44   ---
00:03:44   duration_ms: 4.206
00:03:44   severity: flaky
00:03:44   exitcode: 1
00:03:44   stack: |-
00:03:44     assert.js:84
00:03:44       throw new AssertionError(obj);
00:03:44       ^
00:03:44     
00:03:44     AssertionError [ERR_ASSERTION]: fs.sync.write:
00:03:44     { status: null,
00:03:44       signal: 'SIGSEGV',
00:03:44       output: [ null, '', '' ],
00:03:44       pid: 11197,
00:03:44       stdout: '',
00:03:44       stderr: '' }
00:03:44         at Object.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:139:10)
00:03:44         at Module._compile (internal/modules/cjs/loader.js:689:30)
00:03:44         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
00:03:44         at Module.load (internal/modules/cjs/loader.js:599:32)
00:03:44         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
00:03:44         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
00:03:44         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
00:03:44         at startup (internal/bootstrap/node.js:264:19)
00:03:44         at bootstrapNodeJSCore (internal/bootstrap/node.js:739:3)
00:03:44   ...

Bisecting starting with b797103e2635bc059a7d5b6ed4fcc0a771a61432 (which should fail):

https://ci.nodejs.org/job/node-stress-single-test/2002/nodes=freebsd11-x64/console

That didn't work. Let's try again with -j 8 --repeat 16

https://ci.nodejs.org/job/node-stress-single-test/2003/

@Trott I was able to repro with -j 8 parallel/test-t* x 200 repeats on LinuxONE.
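For reference, that roughly corresponds to an invocation of the test runner like the following (a sketch, assuming tools/test.py's -j and --repeat options as used elsewhere in this thread):

# Roughly the stress invocation described above; glob/paths are illustrative.
python tools/test.py -j 8 --repeat 200 parallel/test-t*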

@refack Only did a spot check, but looks like it's always test-tls-server-verify that fails and never test-trace-events-fs-sync on that commit.

Trying b797103 on both LinuxONE and FreeBSD using -j 8 parallel/test-t*:

https://ci.nodejs.org/job/node-stress-single-test/2006/

> @refack Only did a spot check, but looks like it's always test-tls-server-verify that fails and never test-trace-events-fs-sync on that commit.

Yep. No fails in test-trace-events-fs-sync in 09c63460ebda30c7d5e7f40532a311fb2139803a

@Trott's b797103 failed on LinuxONE

next bisect step 09c63460ebda30c7d5e7f40532a311fb2139803a - https://ci.nodejs.org/job/node-stress-single-test/nodes=rhel72-s390x/2010/

Found a fail in the original commit (around iteration 311) 09c6346 - https://ci.nodejs.org/job/node-stress-single-test/nodes=rhel72-s390x/2010

> Found a fail in the original commit (around iteration 311) 09c6346 - https://ci.nodejs.org/job/node-stress-single-test/nodes=rhel72-s390x/2010

For anyone else looking for it, it's above iteration 311, in iteration 310 and looks like this:

not ok 187 parallel/test-trace-events-fs-sync
  ---
  duration_ms: 0.528
  severity: fail
  exitcode: 1
  stack: |-
    assert.js:80
      throw new AssertionError(obj);
      ^

    AssertionError [ERR_ASSERTION]: fs.sync.access:
    { status: null,
      signal: 'SIGSEGV',
      output: [ null, '', '' ],
      pid: 21920,
      stdout: '',
      stderr: '' }
        at Object.<anonymous> (/data/iojs/build/workspace/node-stress-single-test/nodes/rhel72-s390x/test/parallel/test-trace-events-fs-sync.js:136:10)
        at Module._compile (internal/modules/cjs/loader.js:702:30)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:713:10)
        at Module.load (internal/modules/cjs/loader.js:612:32)
        at tryModuleLoad (internal/modules/cjs/loader.js:551:12)
        at Function.Module._load (internal/modules/cjs/loader.js:543:3)
        at Function.Module.runMain (internal/modules/cjs/loader.js:744:10)
        at startup (internal/bootstrap/node.js:267:19)
        at bootstrapNodeJSCore (internal/bootstrap/node.js:600:3)
  ...

OK, so the problem existed when the test was introduced and may have gotten worse since then. Not sure what to do at this point other than try to enlist the help of knowledgeable people... @nodejs/trace-events @addaleax @andreasmadsen

https://ci.nodejs.org/job/node-test-commit-freebsd/20292/nodes=freebsd11-x64/console

03:10:57 not ok 1900 parallel/test-trace-events-fs-sync
03:10:57   ---
03:10:57   duration_ms: 2.326
03:10:57   severity: fail
03:10:57   exitcode: 1
03:10:57   stack: |-
03:10:57     assert.js:84
03:10:57       throw new AssertionError(obj);
03:10:57       ^
03:10:57     
03:10:57     AssertionError [ERR_ASSERTION]: fs.sync.open:
03:10:57     { status: null,
03:10:57       signal: 'SIGSEGV',
03:10:57       output: [ null, '', '' ],
03:10:57       pid: 12448,
03:10:57       stdout: '',
03:10:57       stderr: '' }
03:10:57         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:139:10)
03:10:57         at Module._compile (internal/modules/cjs/loader.js:689:30)
03:10:57         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
03:10:57         at Module.load (internal/modules/cjs/loader.js:599:32)
03:10:57         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
03:10:57         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
03:10:57         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
03:10:57         at startup (internal/bootstrap/node.js:264:19)
03:10:57         at bootstrapNodeJSCore (internal/bootstrap/node.js:739:3)
03:10:57   ...

Also kind of puzzling: The test is marked as flaky, but that run above is red. (Maybe node-daily-master doesn't account for flakes on purpose or something? But I've definitely seen it yellow so....)

> Also kind of puzzling: The test is marked as flaky, but that run above is red. (Maybe node-daily-master doesn't account for flakes on purpose or something? But I've definitely seen it yellow so....)

From https://ci.nodejs.org/job/node-test-commit-freebsd/20292/nodes=freebsd11-x64/consoleFull

03:05:00 /usr/local/bin/python tools/test.py -j 2 -p tap --logfile test.tap \
03:05:00    --mode=release --flaky-tests=run \
03:05:00     default addons addons-napi doctool

I think other jobs run with --flaky-tests=dontcare.

https://ci.nodejs.org/job/node-test-commit-freebsd/configure:

# get ccache to reuse cache across workspaces
CCACHE_BASEDIR=$(pwd)

CC="ccache cc"
CXX="ccache c++"
NODE_TEST_DIR=${HOME}/node-tmp
FLAKY_TESTS=dontcare
gmake run-ci -j $(getconf NPROCESSORS_ONLN)

Should FLAKY_TESTS (and NODE_TEST_DIR) be exported since run-ci runs make recursively?
https://github.com/nodejs/node/blob/2bf4697ff4be6a510410be1f1e67549be347614b/Makefile#L489-L490
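For illustration, exporting them in that CI script might look something like this (a sketch of the same script shown above, not a verified fix):

# Hypothetical tweak: export the variables so recursive make invocations
# under run-ci inherit them.
export CCACHE_BASEDIR=$(pwd)
export CC="ccache cc"
export CXX="ccache c++"
export NODE_TEST_DIR=${HOME}/node-tmp
export FLAKY_TESTS=dontcare
gmake run-ci -j $(getconf NPROCESSORS_ONLN)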

@nodejs/build

The hits just keep coming with this one...

https://ci.nodejs.org/job/node-test-commit-freebsd/20323/nodes=freebsd11-x64/console

00:53:20 not ok 1869 parallel/test-trace-events-fs-sync
00:53:20   ---
00:53:20   duration_ms: 3.710
00:53:20   severity: fail
00:53:20   exitcode: 1
00:53:20   stack: |-
00:53:20     assert.js:84
00:53:20       throw new AssertionError(obj);
00:53:20       ^
00:53:20     
00:53:20     AssertionError [ERR_ASSERTION]: fs.sync.utimes:
00:53:20     { status: null,
00:53:20       signal: 'SIGSEGV',
00:53:20       output: [ null, '', '' ],
00:53:20       pid: 29493,
00:53:20       stdout: '',
00:53:20       stderr: '' }
00:53:20         at Object.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/test/parallel/test-trace-events-fs-sync.js:139:10)
00:53:20         at Module._compile (internal/modules/cjs/loader.js:689:30)
00:53:20         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
00:53:20         at Module.load (internal/modules/cjs/loader.js:599:32)
00:53:20         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
00:53:20         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
00:53:20         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
00:53:20         at startup (internal/bootstrap/node.js:270:19)
00:53:20         at bootstrapNodeJSCore (internal/bootstrap/node.js:746:3)
00:53:20   ...

/ping @nodejs/testing hoping for ideas or someone who feels they have the appropriate skills and time to address this.

This is far and away our most frequent and vexing failing test (as far as I can tell anyway). Given its long-standing nature, the frequency with which it occurs, and the apparent situation that no one is working on it (probably because it's not a simple test problem but more likely a bug in C++ somewhere), I'm considering bringing it to the TSC to figure out what to do.

Looks like it might be vastly more reproducible on FreeBSD than on LinuxONE if you use -J test/parallel/test-t*. In https://ci.nodejs.org/job/node-stress-single-test/2013/, those parameters have so far produced 122 failures in 437 runs. In comparison, LinuxONE shows 3 failures in 538 runs.

Anyone on @nodejs/build up for figuring out how to get a FreeBSD core file and debugging it? @nodejs/platform-freebsd

I am able to look at this today. Can I get access to a FreeBSD machine?

From the core dump:

Stack:

#0  0x00000000013cd8ac in v8::platform::tracing::TracingController::GetCategoryGroupName ()
#1  0x00000000013cce9d in v8::platform::tracing::JSONTraceWriter::AppendTraceEvent ()
#2  0x000000000097b56c in node::tracing::NodeTraceWriter::AppendTraceEvent ()
#3  0x000000000097787d in node::tracing::Agent::AppendTraceEvent ()
#4  0x000000000097a46f in node::tracing::NodeTraceBuffer::Flush ()
#5  0x00000000013cd4dc in v8::platform::tracing::TracingController::StopTracing ()
#6  0x000000000097717e in node::tracing::Agent::Disconnect ()
#7  0x00000000008c42bc in node::Start ()
#8  0x000000000087e095 in _start ()
#9  0x000000080224a000 in ?? ()
#10 0x0000000000000000 in ?? ()

v8::platform::tracing::TracingController::GetCategoryGroupName

(gdb) disas _ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh
0x00000000013cd8a0 <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+0>:    push   rbp
0x00000000013cd8a1 <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+1>:    mov    rbp,rsp
0x00000000013cd8a4 <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+4>:    mov    eax,0x2255a40
0x00000000013cd8a9 <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+9>:    sub    rdi,rax
0x00000000013cd8ac <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+12>:   mov    rax,QWORD PTR [rdi*8+0x223d750]
0x00000000013cd8b4 <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+20>:   pop    rbp
0x00000000013cd8b5 <_ZN2v88platform7tracing17TracingController20GetCategoryGroupNameEPKh+21>:   ret

Corresponding to

const char* TracingController::GetCategoryGroupName(
    const uint8_t* category_group_enabled) {
  // Calculate the index of the category group by finding
  // category_group_enabled in g_category_group_enabled array.
  uintptr_t category_begin =
      reinterpret_cast<uintptr_t>(g_category_group_enabled);
  uintptr_t category_ptr = reinterpret_cast<uintptr_t>(category_group_enabled);
  // Check for out of bounds category pointers.
  DCHECK(category_ptr >= category_begin &&
         category_ptr < reinterpret_cast<uintptr_t>(g_category_group_enabled +
                                                    MAX_CATEGORY_GROUPS));
  uintptr_t category_index =
      (category_ptr - category_begin) / sizeof(g_category_group_enabled[0]);
  return g_category_groups[category_index];
}

rdi is 0xfffffffffddaa5c0, which happens to be the value of 0 - $rax. IOW, rdi, the first parameter (category_group_enabled) passed by the caller, was nullptr.
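As a quick sanity check on that arithmetic (illustrative, emulating the 64-bit wraparound of the sub rdi,rax above with BigInt):

// 0 - 0x2255a40 wrapped to 64 bits reproduces the observed rdi value,
// consistent with category_group_enabled having been nullptr.
const mask = (1n << 64n) - 1n;
const gCategoryGroupEnabled = 0x2255a40n;  // the constant loaded into eax
const rdiAfterSub = (0n - gCategoryGroupEnabled) & mask;
console.log(rdiAfterSub.toString(16));     // 'fffffffffddaa5c0'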

I'm looking into how and when that is possible.

I fixed a thread race upstream (https://github.com/v8/v8/commit/602aeb40d0e9b76e7206b2f0f5d7eaf10d82d22c). I'm running a build with that fix picked up. So far no failures in 200 runs. Previously I was seeing a failure in ~50 runs.

> Should FLAKY_TESTS (and NODE_TEST_DIR) be exported since run-ci runs make recursively?

@richardlau They should be exported AFAICT because of the way Jenkins manipulates scripts.
tl;dr fixed.

Bad news: It turns out that the test has variable intermittency – sometimes it fails only once in 500 runs. The fix in https://github.com/nodejs/node/issues/21038#issuecomment-419245452 is not sufficient.

Good news: I have a fairly good idea what is going wrong. It is a thread race that causes a crash on shutdown. At shutdown time, we try to flush all the remaining trace events in the trace buffer to the trace file. One (or more) of these traces happens to be uninitialized – its name is null. Looking at how the trace event objects are created:

uint64_t TracingController::AddTraceEvent(
    char phase, const uint8_t* category_enabled_flag, const char* name,
    const char* scope, uint64_t id, uint64_t bind_id, int num_args,
    const char** arg_names, const uint8_t* arg_types,
    const uint64_t* arg_values,
    std::unique_ptr<v8::ConvertableToTraceFormat>* arg_convertables,
    unsigned int flags) {
  uint64_t handle;
  TraceObject* trace_object = trace_buffer_->AddTraceEvent(&handle);
  if (trace_object) {
    trace_object->Initialize(
        phase, category_enabled_flag, name, scope, id, bind_id, num_args,
        arg_names, arg_types, arg_values, arg_convertables, flags,
        CurrentTimestampMicroseconds(), CurrentCpuTimestampMicroseconds());
  }
  return handle;
}

This is one of the two possible paths where a trace is allocated. We allocate an object in the buffer and then initialize it immediately. Since the trace is uninitialized at the time of the crash, there must be a thread in the system that has just allocated the trace but hasn't reached the point of initializing it.

With some more debugging I was able to catch this in a core dump:

(gdb) bt
#0  0x0000000803481959 in xdr_ypresponse () from /lib/libc.so.7
#1  0x000000080350a109 in _gettimeofday () from /lib/libc.so.7
#2  0x000000080350a1f8 in _gettimeofday () from /lib/libc.so.7
#3  0x00000008034f33f1 in clock_gettime () from /lib/libc.so.7
#4  0x00000000009fcee6 in uv__hrtime (type=<value optimized out>) at ../deps/uv/src/unix/posix-hrtime.c:33
#5  0x0000000000978229 in node::tracing::TracingController::CurrentTimestampMicroseconds ()
#6  0x00000000013cd634 in v8::platform::tracing::TracingController::AddTraceEvent ()
#7  0x000000000093fd07 in node::(anonymous namespace)::PlatformWorkerThread ()
#8  0x000000080318fc06 in pthread_create () from /lib/libthr.so.3
#9  0x0000000000000000 in ?? ()

This is a background worker thread that is trying to initialize a trace while the main thread is trying to flush the traces.

(gdb) bt
#0  0x000000080347947a in thr_kill () from /lib/libc.so.7
#1  0x0000000803479444 in raise () from /lib/libc.so.7
#2  0x00000008034793b9 in abort () from /lib/libc.so.7
#3  0x0000000000979dfd in node::tracing::InternalTraceBuffer::Flush ()
#4  0x000000000097a3ca in node::tracing::NodeTraceBuffer::Flush ()
#5  0x00000000013cd4ec in v8::platform::tracing::TracingController::StopTracing ()
#6  0x000000000097724e in node::tracing::Agent::Disconnect ()
#7  0x00000000008c42bc in node::Start ()
#8  0x000000000087e095 in _start ()
#9  0x000000080224a000 in ?? ()
#10 0x0000000000000000 in ?? ()

There is a secondary bug that manifests this way:

#0  0x000000080347947a in thr_kill () from /lib/libc.so.7
#1  0x0000000803479444 in raise () from /lib/libc.so.7
#2  0x00000008034793b9 in abort () from /lib/libc.so.7
#3  0x00000000009faa94 in uv_mutex_lock (mutex=<value optimized out>) at ../deps/uv/src/unix/thread.c:275
#4  0x0000000000979aec in node::tracing::InternalTraceBuffer::AddTraceEvent ()
#5  0x00000000013cd613 in v8::platform::tracing::TracingController::AddTraceEvent ()
#6  0x000000000093fd07 in node::(anonymous namespace)::PlatformWorkerThread ()
#7  0x000000080318fc06 in pthread_create () from /lib/libthr.so.3
#8  0x0000000000000000 in ?? ()

The other threads:

  7 process 101305  0x000000080319db2c in pthread_cleanup_pop () from /lib/libthr.so.3
  6 process 101251  0x000000080319db2c in pthread_cleanup_pop () from /lib/libthr.so.3
  5 process 101221  0x000000080319db2c in pthread_cleanup_pop () from /lib/libthr.so.3
  4 process 101178  0x000000080319db2c in pthread_cleanup_pop () from /lib/libthr.so.3
  3 process 101175  0x000000080342c968 in _umtx_op () from /lib/libc.so.7
  2 process 100503  0x000000080319db2c in pthread_cleanup_pop () from /lib/libthr.so.3
* 1 process 100427  0x000000080347947a in thr_kill () from /lib/libc.so.7

I think this might be due to a thread trying to acquire a mutex that has already been destroyed.

https://github.com/nodejs/node/pull/22812 has landed. Hopefully it fixes it. There might still be some other thread races, but I haven't been able to observe any so far in my testing.

@ofrobots 🎉 Thank you so much!

I'll open a PR to unmark the test as flaky and close this issue.

It's still flaky, but in a different way (SIGABRT rather than SIGSEGV). Will open a new issue, but ref this one...
