(roachtest).alterpk-tpcc failed on master@954fe69d554162aec0fbc001aad1fe5103d8df13:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200306-1790595/alterpk-tpcc/run_1
test_runner.go:756: test timed out (10h0m0s)
(roachtest).alterpk-tpcc failed on master@752dea867f3aeb142e98c22f8d320ce19041aa8d:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200307-1793930/alterpk-tpcc/run_1
test_runner.go:756: test timed out (10h0m0s)
(roachtest).alterpk-tpcc failed on master@dfa5bd527ae7d7373dd03c62118df87a87a77130:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200308-1795062/alterpk-tpcc/run_1
test_runner.go:756: test timed out (10h0m0s)
cc @lucy-zhang is this failure related to the new sc + jobs PR? Looking at the logs, it seems like there is some deadlock while trying to run the schema change. Each node has these repeated logs:
I200308 08:44:55.878309 181 sql/schema_changer.go:973 [n1,scExec] schema change on customer (60 v11) mutation 2 starting execution...
I200308 08:44:55.878579 181 sql/schema_changer.go:1025 [n1,scExec] schema change on customer (60 v11) mutation 2: another node is currently operating on this table
and no progress is made for 10 hours
954fe69d554162aec0fbc001aad1fe5103d8df13 doesn't have those changes (and I rolled them back on master anyway, for the time being, to deal with a problem with migrations).
That log message is coming from here:
https://github.com/cockroachdb/cockroach/blob/954fe69d554162aec0fbc001aad1fe5103d8df13/pkg/sql/schema_changer.go#L1026
So it's a problem with trying to acquire a schema change lease, which hopefully should be solved once I do merge that PR. (I don't know what caused the deadlock, though.) Let's stress this test again once that PR is merged.
(roachtest).alterpk-tpcc failed on master@c473f40078994551cebcbe00fdbf1fa388957658:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200309-1796240/alterpk-tpcc/run_1
test_runner.go:756: test timed out (10h0m0s)
I'm investigating this issue a little more and don't understand whats happening in the logs.
We see here
I200309 07:40:00.803524 188887 sql/backfill.go:176 [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] Running backfill for "customer", v=11, m=2
That a backfill has started for our index. However, after that we see that all the nodes start trying to grab the lease:
Then, we see that validation for the index backfills have completed:
I200309 08:35:16.147475 1091744 sql/backfill.go:1332 [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] validation: index customer/customer_c_w_id_c_d_id_c_id_key row count = 15000000, time so far 20.41641794s
I200309 08:35:17.720733 1091745 sql/backfill.go:1332 [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] validation: index customer/customer_idx_rewrite_for_primary_key_change row count = 15000000, time so far 21.988191096s
I200309 08:35:18.303016 28 server/status/runtime.go:498 [n1] runtime stats: 8.7 GiB RSS, 485 goroutines, 224 MiB/5.1 GiB/1.6 GiB GO alloc/idle/total, 4.3 GiB/5.5 GiB CGO alloc/total, 2577.0 CGO/sec, 326.6/11.7 %(u/s)time, 0.0 %gc (19x), 781 MiB/3.5 MiB (r/w)net
I200309 08:35:21.004309 1091746 sql/backfill.go:1376 [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] validation: table customer row count = 15000000, took 25.276541782s
However, it then doesn't look like we make it past SchemaChanger.done, because we then see these log messages:
I200309 07:41:06.716590 30 sql/schema_changer.go:973 [n1,scExec] schema change on customer (59 v11) mutation 2 starting execution...
I200309 07:41:06.728168 30 sql/schema_changer.go:1025 [n1,scExec] schema change on customer (59 v11) mutation 2: another node is currently operating on this table
But it also doesn't look like we started to rollback the schema change either.
@rohany Now that the new schema change job is merged, I think we should wait a few days and see if we get any more failures. The thing that's happening here (being unable to acquire a schema change lease) shouldn't be possible anymore, since schema change leases won't exist, though maybe there's some deeper underlying problem that would also cause problems with the new schema change job.
(roachtest).alterpk-tpcc failed on master@72c4a1bd411f2f82bf9aaa22883821a946614148:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200310-1799071/alterpk-tpcc/run_1
test_runner.go:779: test timed out (10h0m0s)
I think this failure above still doesn't have the thew sc jobs commit yet.
(roachtest).alterpk-tpcc failed on master@5570c01402796edb7cd06eb8ce7f615371f22d42:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200311-1801614/alterpk-tpcc/run_1
test_runner.go:779: test timed out (10h0m0s)
Its unclear from these logs what exactly is going wrong, though the situation still seems similar to the logs before the schema changes jobs pr. One thing is that 500 warehouses is probably too big to be running on a 3 node 4 cpu cluster, but it doesn't explain the hang.
(roachtest).alterpk-tpcc failed on master@793a9200c16693aff32aa6a4dd9d8bbcbddb30aa:
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:747
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_082527.690_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:747
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1804460-1583998462-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200312 08:25:29.248938 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 27.383279ms
I200312 08:25:37.613422 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 8.364418946s
I200312 08:25:39.721125 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.10765782s
I200312 08:26:23.891371 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 44.170123384s
I200312 08:28:16.731792 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m52.840272696s
I200312 08:35:42.761535 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 7m26.029484569s
Error: check failed: 3.3.2.6: driver: bad connection
Error: exit status 1
stdout::
- exit status 1
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:801: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1804460-1583998462-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
3: 4164
2: 4753
1: dead
Error: 1: dead
(roachtest).alterpk-tpcc failed on master@69dc87d68addedf2fabfb2b14c098cfb35b5f3d0:
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_074140.622_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1806460-1584083876-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200313 07:41:42.211690 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 34.938153ms
I200313 07:41:49.842967 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 7.631204223s
I200313 07:41:51.237037 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.393985549s
I200313 07:43:56.072945 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 2m4.835669018s
I200313 07:45:41.145017 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m45.07188352s
I200313 09:04:01.195185 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 1h18m20.049886518s
Error: check failed: 3.3.2.6: driver: bad connection
Error: exit status 1
stdout::
- exit status 1
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:807: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1806460-1584083876-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
2: 5069
3: 4474
1: dead
Error: 1: dead
Hmm, latest failure just looks like a random node failure. Haven't seen the deadlock in the past 2 runs...
(roachtest).alterpk-tpcc failed on master@33d71472dc01cbc5064b3c5e1fcd666a33f606de:
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_075854.318_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1808695-1584256913-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200315 07:58:55.922470 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 41.619839ms
I200315 07:59:01.348431 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 5.425894287s
I200315 07:59:02.701637 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.353158729s
I200315 08:01:09.808336 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 2m7.106621238s
I200315 08:03:05.947362 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m56.138906477s
I200315 08:07:29.970423 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 4m24.022889884s
Error: check failed: 3.3.2.6: driver: bad connection
Error: exit status 1
stdout::
- exit status 1
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:807: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1808695-1584256913-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
3: 4249
2: 4172
1: dead
Error: 1: dead
Latest failure is another repeat of #45975 (but with the hash aggregator OOMing).
(roachtest).alterpk-tpcc failed on master@d8bdc938177ea4b5226b95ee6cc51af0951054fc:
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_082437.670_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1809758-1584345030-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200316 08:24:39.270006 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 35.92733ms
I200316 08:24:49.904259 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 10.634146263s
I200316 08:24:51.275544 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.371235148s
I200316 08:25:37.289066 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 46.013472956s
I200316 08:27:30.049392 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m52.760047412s
I200316 08:39:43.494368 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 12m13.444756007s
Error: check failed: 3.3.2.6: driver: bad connection
Error: exit status 1
stdout::
- exit status 1
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:807: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1809758-1584345030-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
3: 4200
1: dead
2: 4236
Error: 1: dead
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@5a3d0c9539a671f0e55b680d3021b18dde9d190d:
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_080831.452_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1811809-1584430043-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200317 08:08:33.108975 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 39.073387ms
I200317 08:08:39.075546 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 5.966527849s
I200317 08:08:39.821899 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 746.309056ms
I200317 08:10:48.846461 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 2m9.024454483s
I200317 08:12:48.632427 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m59.785851001s
I200317 08:17:34.047775 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 4m45.415230721s
Error: check failed: 3.3.2.6: driver: bad connection
Error: exit status 1
stdout::
- exit status 1
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:807: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1811809-1584430043-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
2: 4233
3: 4197
1: dead
Error: 1: dead
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@b5f030223fbcf22e806c48a3c46e74a73a54f50f:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200318-1814553/alterpk-tpcc/run_1
test_runner.go:800: test timed out (10h0m0s)
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
@rohany The hash aggregator OOM bug wouldn't have caused this most recent timeout, right? If that's the case, then the failure is concerning.
Yup, this is different.
I'm looking at the logs, and I don't see my newly added log that marks when the backfill is completed ("Completed backfill for ....") for the primary key change, but I see that both the validation steps have finished (in validateIndexes), so that gives us somewhere to look where the hang is.
(roachtest).alterpk-tpcc failed on master@d571624ae4f833c49c717728a74cc7be78a791f0:
cluster.go:1935,alterpk.go:163,test_runner.go:753: error with attached stack trace:
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_081031.299_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1816964-1584603715-12-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200319 08:10:32.914531 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 56.33867ms
I200319 08:10:46.288969 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 13.374378691s
I200319 08:10:47.721635 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.432598089s
I200319 08:11:32.886678 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 45.164991168s
I200319 08:14:03.467908 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m30.580935102s
I200319 08:17:03.106152 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 2m59.637891081s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: exit status 1
stdout::
- exit status 1
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
cc @yuzefovich ever seen this oom trace?
fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x42f99c6, 0x16)
/usr/local/go/src/runtime/panic.go:774 +0x72 fp=0x7faf8abed898 sp=0x7faf8abed868 pc=0x79a7e2
runtime.sysMap(0xc244000000, 0x4000000, 0x7508798)
/usr/local/go/src/runtime/mem_linux.go:169 +0xc5 fp=0x7faf8abed8d8 sp=0x7faf8abed898 pc=0x7835e5
runtime.(*mheap).sysAlloc(0x74c5980, 0x2000, 0x2000, 0x200)
/usr/local/go/src/runtime/malloc.go:701 +0x1cd fp=0x7faf8abed980 sp=0x7faf8abed8d8 pc=0x7766fd
runtime.(*mheap).grow(0x74c5980, 0x1, 0xffffffff)
/usr/local/go/src/runtime/mheap.go:1255 +0xa3 fp=0x7faf8abed9d0 sp=0x7faf8abed980 pc=0x7916a3
runtime.(*mheap).allocSpanLocked(0x74c5980, 0x1, 0x75087a8, 0xc007b3a000)
/usr/local/go/src/runtime/mheap.go:1170 +0x266 fp=0x7faf8abeda50 sp=0x7faf8abed9d0 pc=0x791536
runtime.(*mheap).alloc_m(0x74c5980, 0x1, 0xc007b3001a, 0x4)
/usr/local/go/src/runtime/mheap.go:1022 +0xc2 fp=0x7faf8abedaa0 sp=0x7faf8abeda50 pc=0x790dd2
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:1093 +0x4c fp=0x7faf8abedad8 sp=0x7faf8abedaa0 pc=0x7c640c
runtime.systemstack(0x7fafa0c906d0)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x66 fp=0x7faf8abedae0 sp=0x7faf8abedad8 pc=0x7c8a46
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1146 fp=0x7faf8abedae8 sp=0x7faf8abedae0 pc=0x79eda0
goroutine 281570 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc000a85a10 sp=0xc000a85a08 pc=0x7c89d0
runtime.(*mheap).alloc(0x74c5980, 0x1, 0x101001a, 0x0)
/usr/local/go/src/runtime/mheap.go:1092 +0x8a fp=0xc000a85a60 sp=0xc000a85a10 pc=0x7910ba
runtime.(*mcentral).grow(0x74c63b8, 0x0)
/usr/local/go/src/runtime/mcentral.go:255 +0x7b fp=0xc000a85aa0 sp=0xc000a85a60 pc=0x78300b
runtime.(*mcentral).cacheSpan(0x74c63b8, 0xc000a85b40)
/usr/local/go/src/runtime/mcentral.go:106 +0x2fe fp=0xc000a85b00 sp=0xc000a85aa0 pc=0x782b2e
runtime.(*mcache).refill(0x7fafa1fe2d98, 0x1a)
/usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc000a85b20 sp=0xc000a85b00 pc=0x7825d5
runtime.(*mcache).nextFree(0x7fafa1fe2d98, 0xc000a85b1a, 0x0, 0x0, 0x0)
/usr/local/go/src/runtime/malloc.go:854 +0x87 fp=0xc000a85b58 sp=0xc000a85b20 pc=0x776f27
runtime.mallocgc(0xc0, 0x3ff5f00, 0x1, 0x4)
/usr/local/go/src/runtime/malloc.go:1022 +0x793 fp=0xc000a85bf8 sp=0xc000a85b58 pc=0x777863
runtime.makeslice(0x3ff5f00, 0x4, 0x4, 0x4)
/usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc000a85c28 sp=0xc000a85bf8 pc=0x7b10ac
github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerOutput).popRowsLocked.func1(0xc00d05d980, 0x4b85fa0, 0xc01b568340, 0xc000a85d20, 0xc000a85e28, 0x10, 0x10, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:161 +0x1f9 fp=0xc000a85cd8 sp=0xc000a85c28 pc=0x2b58e19
github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerOutput).popRowsLocked(0xc00d05d980, 0x4b85fa0, 0xc01b568340, 0xc000a85e28, 0x10, 0x10, 0xc000a85e28, 0x10, 0x10, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:166 +0x18a fp=0xc000a85d38 sp=0xc000a85cd8 pc=0x2b5212a
github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerBase).Start.func1(0x4b85fa0, 0xc01b568340, 0xc008808780, 0xc00d05d980, 0xc011518670)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:317 +0x6e4 fp=0xc000a85fb8 sp=0xc000a85d38 pc=0x2b59614
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000a85fc0 sp=0xc000a85fb8 pc=0x7caaa1
created by github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerBase).Start
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:280 +0x9f
(on node 2 for more details)
The trace doesn't help really. It just means the node OOMed. We need profiles to understand what's going on here... but doesn't alter PK do a bunch of index builds? Maybe this is the same root cause as the IndexBackfiller OOM we were looking at recently...
This was during tpcc checks -- probably 3.3.2.7 (where we ran into the hash aggregator oom)
This happened during 3.3.2.6, but I'm confused why I was able to run the query manually when using binaries with hash aggregator memory monitoring fix (which the last failure also had in the binary). (I guess I wasn't altering the primary key, so maybe Jordan's suggestion is a good path to explore.)
On the side note, we seem to be missing memory accounting in rowflow.routerOutput as well :/
(roachtest).alterpk-tpcc failed on master@bcffc157e3af316c2bfcf4850e67205010425211:
cluster.go:1935,alterpk.go:163,test_runner.go:753: error with attached stack trace:
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_074153.168_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1819517-1584688629-01-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200320 07:41:54.802818 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 68.0927ms
I200320 07:42:06.647382 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 11.844515639s
I200320 07:42:08.828918 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.181486691s
I200320 07:43:08.591889 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 59.762923849s
I200320 07:45:34.879397 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m26.287442343s
I200320 07:50:06.000105 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 4m31.120607137s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: exit status 1
stdout::
- exit status 1
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@322f83364bc0807c3147af0631cb7317c584ce43:
cluster.go:1935,alterpk.go:163,test_runner.go:753: error with attached stack trace:
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_073103.139_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1821791-1584774229-06-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200321 07:31:04.800590 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 81.949461ms
I200321 07:31:13.169282 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 8.368647632s
I200321 07:31:14.673128 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.503771706s
I200321 07:32:19.538194 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m4.865018024s
I200321 07:34:41.273991 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m21.735694083s
I200321 07:39:26.693407 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 4m45.419344785s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: exit status 1
stdout::
- exit status 1
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@055561809b95488bff2cad19422e7f4a7472e3a2:
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_080246.528_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1823985-1585035031-03-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200324 08:02:48.179678 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 60.790866ms
I200324 08:03:02.929553 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 14.749830337s
I200324 08:03:04.543073 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.613450229s
I200324 08:04:08.837255 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m4.294110837s
I200324 08:06:37.344327 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m28.507014474s
I200324 08:13:25.374945 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 6m48.030557786s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: exit status 1
stdout::
- exit status 1
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1823985-1585035031-03-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
2: 4048
1: 4902
3: dead
Error: 3: dead
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
cc @yuzefovich this stack:
runtime stack:
runtime.throw(0x4303f41, 0x16)
/usr/local/go/src/runtime/panic.go:774 +0x72 fp=0x7f29a8dca898 sp=0x7f29a8dca868 pc=0x79a7e2
runtime.sysMap(0xc238000000, 0x4000000, 0x751ae38)
/usr/local/go/src/runtime/mem_linux.go:169 +0xc5 fp=0x7f29a8dca8d8 sp=0x7f29a8dca898 pc=0x7835e5
runtime.(*mheap).sysAlloc(0x74d8000, 0x2000, 0x7f0000000000, 0x341fbd7)
/usr/local/go/src/runtime/malloc.go:701 +0x1cd fp=0x7f29a8dca980 sp=0x7f29a8dca8d8 pc=0x7766fd
runtime.(*mheap).grow(0x74d8000, 0x1, 0xffffffff)
/usr/local/go/src/runtime/mheap.go:1255 +0xa3 fp=0x7f29a8dca9d0 sp=0x7f29a8dca980 pc=0x7916a3
runtime.(*mheap).allocSpanLocked(0x74d8000, 0x1, 0x751ae48, 0x328c264)
/usr/local/go/src/runtime/mheap.go:1170 +0x266 fp=0x7f29a8dcaa50 sp=0x7f29a8dca9d0 pc=0x791536
runtime.(*mheap).alloc_m(0x74d8000, 0x1, 0xc014bf003a, 0x7cc5f6)
/usr/local/go/src/runtime/mheap.go:1022 +0xc2 fp=0x7f29a8dcaaa0 sp=0x7f29a8dcaa50 pc=0x790dd2
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:1093 +0x4c fp=0x7f29a8dcaad8 sp=0x7f29a8dcaaa0 pc=0x7c640c
runtime.systemstack(0x7f2b62600c10)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x66 fp=0x7f29a8dcaae0 sp=0x7f29a8dcaad8 pc=0x7c8a46
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1146 fp=0x7f29a8dcaae8 sp=0x7f29a8dcaae0 pc=0x79eda0
goroutine 355172 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc014bff948 sp=0xc014bff940 pc=0x7c89d0
runtime.(*mheap).alloc(0x74d8000, 0x1, 0xc01401003a, 0x181f17c)
/usr/local/go/src/runtime/mheap.go:1092 +0x8a fp=0xc014bff998 sp=0xc014bff948 pc=0x7910ba
runtime.(*mcentral).grow(0x74d9238, 0x0)
/usr/local/go/src/runtime/mcentral.go:255 +0x7b fp=0xc014bff9d8 sp=0xc014bff998 pc=0x78300b
runtime.(*mcentral).cacheSpan(0x74d9238, 0xc237ffbc08)
/usr/local/go/src/runtime/mcentral.go:106 +0x2fe fp=0xc014bffa38 sp=0xc014bff9d8 pc=0x782b2e
runtime.(*mcache).refill(0x7f2b639f86d0, 0x3a)
/usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc014bffa58 sp=0xc014bffa38 pc=0x7825d5
runtime.(*mcache).nextFree(0x7f2b639f86d0, 0x3a, 0xc014bffac8, 0x7e731c, 0xc00048e6c0)
/usr/local/go/src/runtime/malloc.go:854 +0x87 fp=0xc014bffa90 sp=0xc014bffa58 pc=0x776f27
runtime.mallocgc(0x300, 0x425e960, 0x7ea301, 0xc22b95c8a0)
/usr/local/go/src/runtime/malloc.go:1022 +0x793 fp=0xc014bffb30 sp=0xc014bffa90 pc=0x777863
runtime.newobject(0x425e960, 0xc22b95c8a0)
/usr/local/go/src/runtime/malloc.go:1151 +0x38 fp=0xc014bffb60 sp=0xc014bffb30 pc=0x777c58
github.com/cockroachdb/cockroach/pkg/storage.glob..func15(0x6dfffc0, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/rocksdb.go:1668 +0x2d fp=0xc014bffb80 sp=0xc014bffb60 pc=0x182991d
sync.(*Pool).Get(0x6dfffc0, 0xc014bffc70, 0x7775d4)
/usr/local/go/src/sync/pool.go:148 +0xa6 fp=0xc014bffbc8 sp=0xc014bffb80 pc=0x7e9636
github.com/cockroachdb/cockroach/pkg/storage.newRocksDBBatch(0xc00048e600, 0xc237ffbd01, 0xc00048e600)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/rocksdb.go:1676 +0x31 fp=0xc014bffbf0 sp=0xc014bffbc8 pc=0x181d141
github.com/cockroachdb/cockroach/pkg/storage.(*RocksDB).NewWriteOnlyBatch(0xc00048e600, 0xc237feed00, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/rocksdb.go:1121 +0x30 fp=0xc014bffc18 sp=0xc014bffbf0 pc=0x1819900
github.com/cockroachdb/cockroach/pkg/storage.(*rocksDBMapBatchWriter).Flush(0xc036515ef0, 0x556000000076e4a5, 0xc00a878380)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/disk_map.go:231 +0x88 fp=0xc014bffc48 sp=0xc014bffc18 pc=0x17cf9d8
github.com/cockroachdb/cockroach/pkg/sql/rowcontainer.(*DiskRowContainer).newIterator(0xc021769080, 0x4b92fe0, 0xc02e055a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowcontainer/disk_row_container.go:285 +0x53 fp=0xc014bffcb0 sp=0xc014bffc48 pc=0x1fbd0a3
github.com/cockroachdb/cockroach/pkg/sql/rowcontainer.(*DiskRowContainer).NewIterator(0xc021769080, 0x4b92fe0, 0xc02e055a00, 0xc237fc0c80, 0x4)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowcontainer/disk_row_container.go:293 +0x74 fp=0xc014bffd40 sp=0xc014bffcb0 pc=0x1fbd1f4
github.com/cockroachdb/cockroach/pkg/sql/rowcontainer.(*hashDiskRowBucketIterator).Reset(0xc035294600, 0x4b92fe0, 0xc02e055a00, 0xc237fc0c80, 0x4, 0x4, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowcontainer/hash_row_container.go:614 +0x17e fp=0xc014bffde0 sp=0xc014bffd40 pc=0x1fc031e
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).readProbeSide(0xc004a4e000, 0x3, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/hashjoiner.go:469 +0x15c fp=0xc014bffe98 sp=0xc014bffde0 pc=0x20eb8bc
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).Next(0xc004a4e000, 0xb, 0x4b92fe0, 0xc02e055a00, 0x4b92fe0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/hashjoiner.go:239 +0x278 fp=0xc014bfff00 sp=0xc014bffe98 pc=0x20eaa18
github.com/cockroachdb/cockroach/pkg/sql/execinfra.Run(0x4b92fe0, 0xc02e055a00, 0x4bb6f20, 0xc004a4e000, 0x4b713e0, 0xc034017c00)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:170 +0x35 fp=0xc014bfff40 sp=0xc014bfff00 pc=0x20c0a05
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBase).Run(0xc004a4e000, 0x4b92fe0, 0xc02e055a00)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:749 +0x92 fp=0xc014bfff80 sp=0xc014bfff40 pc=0x20c87e2
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal.func1(0xc02e055b40, 0x3, 0x4, 0x4b92fe0, 0xc02e055a00, 0xc026fc27e0, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:316 +0x5c fp=0xc014bfffa8 sp=0xc014bfff80 pc=0x20d9b6c
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc014bfffb0 sp=0xc014bfffa8 pc=0x7caaa1
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal
/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:315 +0x2d9
It doesn't seem like a problem necessarily with the hash joiner, but an OOM somewhere else right?
Here is the last memory usage log message:
7.7 GiB/279 MiB/8.9 GiB GO alloc/idle/total, 3.8 GiB/4.9 GiB CGO
I'm curious about cgo usage. I assume it comes from backfilling the newly created indexes? Do we have memory accounting for that?
At this point the backfill is over though, so I don't think it is from backfilling the indexes?
I'll take another look at the memory accounting today.
We need to figure out how to get memory profiles for these runs. It's going
to be hard to figure out what's going on without them.
On Tue, Mar 24, 2020 at 11:49 AM yuzefovich notifications@github.com
wrote:
I'll take another look at the memory accounting today.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/cockroachdb/cockroach/issues/45812#issuecomment-603320983,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAKWLP7HBK6I5ZZJ5WMM4LRJDJANANCNFSM4LDFKUJQ
.
I'll do manually what roachtest does, so I should be able to see the memory profiles. Last time I did this, but I did not perform the altering of primary key while running tpcc, and the cluster was happy.
(roachtest).alterpk-tpcc failed on master@6c3c5d4a9a0b837538a558fa22dcd4ff712fd8f5:
cluster.go:1935,alterpk.go:163,test_runner.go:753: error with attached stack trace:
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_075500.038_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1826302-1585121307-05-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200325 07:55:01.649465 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 39.28485ms
I200325 07:55:07.798830 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 6.149313448s
I200325 07:55:10.097753 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.298875719s
I200325 07:57:21.497403 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 2m11.399605056s
I200325 07:59:39.041365 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m17.543862907s
I200325 08:12:04.625868 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 12m25.584429082s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: exit status 1
stdout::
- exit status 1
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@b7d098425ff43372784ed7c34ee25e5e5fc1a58b:
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1829260-1585208956-03-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200326 08:21:58.026513 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 52.66758ms
I200326 08:22:06.550151 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 8.523578183s
I200326 08:22:08.830075 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.279870397s
I200326 08:24:15.241985 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 2m6.411824562s
I200326 08:26:50.757047 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m35.514980722s
I200326 08:32:36.381382 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 5m45.624246206s
Error: check failed: 3.3.2.6: driver: bad connection
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1829260-1585208956-03-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
2: 4147
3: 4128
1: dead
Error: UNCLASSIFIED_PROBLEM:
- 1: dead
main.glob..func13
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
main.wrap.func1
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
main.main
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@c0855e9fcf979cca1094d7edeeacb9882ddc9f96:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_075050.774_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:163
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1832621-1585293773-05-n4cpu4:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200327 07:50:52.365340 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 35.578016ms
I200327 07:51:01.671880 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 9.306486243s
I200327 07:51:03.107883 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.435945914s
I200327 07:51:58.271277 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 55.163350847s
I200327 07:54:20.617575 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m22.346151521s
I200327 07:54:24.297042 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 3.679388219s
Error: check failed: 3.3.2.6: pq: root: memory budget exceeded: 10240 bytes requested, 3940648960 currently allocated, 3940653056 bytes in budget
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
cc @yuzefovich we did it
(roachtest).alterpk-tpcc failed on master@0e9f07ae35ad39e8789bc51c49066c123968a666:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_083138.913_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1834610-1585381212-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200328 08:31:40.845573 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 176.538299ms
I200328 08:31:49.254756 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 8.409136135s
I200328 08:31:50.705584 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.450780288s
I200328 08:33:44.004075 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m53.298444796s
I200328 08:35:49.143329 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m5.13905376s
I200328 08:35:51.319327 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 2.17591357s
Error: check failed: 3.3.2.6: pq: root: memory budget exceeded: 10240 bytes requested, 3781232640 currently allocated, 3781237760 bytes in budget
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@92495ffb2182671b3338a8c6cd2734cb6de5d392:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_080101.998_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1836230-1585466755-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200329 08:01:03.631991 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 61.353397ms
I200329 08:01:09.326129 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 5.694082806s
I200329 08:01:10.052244 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 726.066567ms
I200329 08:03:03.149468 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m53.097079484s
I200329 08:04:58.394098 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m55.244551363s
I200329 08:05:01.097850 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 2.703662347s
Error: check failed: 3.3.2.6: pq: root: memory budget exceeded: 10240 bytes requested, 3781232640 currently allocated, 3781235712 bytes in budget
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@8a50e9f474c84e46fa822206b87c53e972631884:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_080900.034_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1838888-1585639742-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200331 08:09:01.669070 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 51.992566ms
I200331 08:09:09.860917 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 8.19175785s
I200331 08:09:11.261369 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.400374712s
I200331 08:09:45.698863 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 34.437322929s
I200331 08:11:40.682089 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m54.98306478s
I200331 08:11:43.178949 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 2.496722442s
Error: check failed: 3.3.2.6: pq: root: memory budget exceeded: 10240 bytes requested, 3781232640 currently allocated, 3781237760 bytes in budget
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@9c590593565908c97b81fe697a6f31414807daf8:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_090514.658_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1841461-1585727592-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200401 09:05:16.588595 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 164.281717ms
I200401 09:05:22.298175 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 5.709508525s
I200401 09:05:23.773111 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 1.474875528s
I200401 09:07:17.703149 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m53.929940877s
I200401 09:09:18.163879 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m0.460596846s
I200401 09:17:49.228624 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 8m31.064556462s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
46317 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003181957_v20.1.0-beta.3 release-blocker
46136 roachtest: alterpk-tpcc failed C-test-failure O-roachtest O-robot branch-provisional_202003112118_v20.1.0-beta.3 release-blocker
powered by pkg/cmd/internal/issues
Alright, #46700 has been backported, so crossing my fingers that tonight's run will succeed.
(roachtest).alterpk-tpcc failed on master@51e8960053a8224856234ffc1fb610c3843f7d7d:
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1845254-1585812734-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200402 08:26:48.156220 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 151.264861ms
I200402 08:26:57.618460 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 9.462149145s
I200402 08:26:59.804825 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.186324352s
I200402 08:27:59.275771 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 59.4708659s
I200402 08:30:18.058951 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m18.783032566s
I200402 08:33:51.421063 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 3m33.361830337s
Error: check failed: 3.3.2.6: driver: bad connection
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1845254-1585812734-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
2: 4341
1: dead
3: 4587
Error: UNCLASSIFIED_PROBLEM:
- 1: dead
main.glob..func13
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
main.wrap.func1
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
main.main
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
Ugh :(
fatal error: runtime: out of memory
...
goroutine 2669945 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc005657a10 sp=0xc005657a08 pc=0x7c89d0
runtime.(*mheap).alloc(0x750dac0, 0x1, 0xc1eb01001a, 0x0)
/usr/local/go/src/runtime/mheap.go:1092 +0x8a fp=0xc005657a60 sp=0xc005657a10 pc=0x7910ba
runtime.(*mcentral).grow(0x750e4f8, 0x0)
/usr/local/go/src/runtime/mcentral.go:255 +0x7b fp=0xc005657aa0 sp=0xc005657a60 pc=0x78301b
runtime.(*mcentral).cacheSpan(0x750e4f8, 0x5)
/usr/local/go/src/runtime/mcentral.go:106 +0x2fe fp=0xc005657b00 sp=0xc005657aa0 pc=0x782b3e
runtime.(*mcache).refill(0x7fe8153991f0, 0x1a)
/usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc005657b20 sp=0xc005657b00 pc=0x7825e5
runtime.(*mcache).nextFree(0x7fe8153991f0, 0x7fe7fe45c01a, 0x5, 0x5, 0x0)
/usr/local/go/src/runtime/malloc.go:854 +0x87 fp=0xc005657b58 sp=0xc005657b20 pc=0x776f37
runtime.mallocgc(0xc0, 0x401d860, 0xc14f5fbf01, 0xc14f5fbf60)
/usr/local/go/src/runtime/malloc.go:1022 +0x793 fp=0xc005657bf8 sp=0xc005657b58 pc=0x777873
runtime.makeslice(0x401d860, 0x4, 0x4, 0x4)
/usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc005657c28 sp=0xc005657bf8 pc=0x7b10ac
github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerOutput).popRowsLocked.func1(0xc02553eb80, 0x4bb4860, 0xc0268aba80, 0xc005657d20, 0xc005657e28, 0x10, 0x10, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:161 +0x1f9 fp=0xc005657cd8 sp=0xc005657c28 pc=0x2b77259
github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerOutput).popRowsLocked(0xc02553eb80, 0x4bb4860, 0xc0268aba80, 0xc005657e28, 0x10, 0x10, 0xc005657e28, 0x10, 0x10, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:166 +0x18a fp=0xc005657d38 sp=0xc005657cd8 pc=0x2b7049a
github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerBase).Start.func1(0x4bb4860, 0xc0268aba80, 0xc034f4ef00, 0xc02553eb80, 0xc007099ab0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:317 +0x6e4 fp=0xc005657fb8 sp=0xc005657d38 pc=0x2b77a54
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc005657fc0 sp=0xc005657fb8 pc=0x7caaa1
created by github.com/cockroachdb/cockroach/pkg/sql/rowflow.(*routerBase).Start
/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowflow/routers.go:280 +0x9f
I think I saw that we're not doing memory accounting in some place in routerOutput when I looked at this a couple weeks ago, but I thought it wouldn't matter (unless, of course, the consumer of the outputs is very slow). I'll take another look today.
@yuzefovich this failures looks like it is on the beta 3 brach, which doesn't have recent fixes right?
Why do you say beta 3? It says "master", no?
Oh nevermind i misread the related issues linked as tags on this one. My bad.
Hm, another possible place where we're not fully accounting for the memory is the automatic stats collection, I think I saw sampleAggregator in the stack trace of every crash (possibly, though, we just have a stats collection job running all the time).
Does stats collection use an internal executor? iirc those aren't hooked up to memory monitors.
There are two separate processors that perform the stats collection, so it shouldn't be using an internal executor.
@rytaft I'm trying to understand the lifecycle of the memory used for histogram data, and here is what I see:
sampleAggregator.mainLoop we're getting the sample values that are stored SampleReservoirwriteResults where we might call generateHistogram for some column of sampled valuesgenerateHistogram copies all sampled values into values (which is not accounted for) and then calls stats.EquiDepthHistogram which divides all the sampled values into buckets and encodes the upper bound sample value for each bucketencoded values are stored into HistogramData.Buckets and are also unaccounted forwriteResults, we're deleting old statistic and inserting a new oneInsertNewStat we're marshaling HistogramData which will also allocate memory that we don't account for; we're also using an internal executor to actually insert the marshaled histogram data (which might make an additional copy or something, I don't know)InsertNewStat returns, we should lose all references to the temporary unaccounted memory I mentioned, so it should be given to GC.Am I missing something here? It seems like there could be a spike in this unaccounted memory, but I don't have a feeling for how big that spike could be.
Eh, the OOM can be reliably reproduced on a 3 node roachprod cluster with default n1-standard-4 nodes after importing 500 warehouses and running the query 3.3.2.6 manually (so the alteration of primary key is not needed). I tried to add more logging to try to understand where we're not accounting for the memory properly but still don't see it.
Here is an explain analyze (on an occasion it didn't crash) of the query for the context on amount of data flying around:

We seem to be estimating the memory usage of the aggregators relatively good now. For example, here is one of the heap profiles when first stage aggregator finishes and the second one is about to:

And at about this time in the logs we have:
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-mem-5: bytes usage increases to 1.0 GiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-funcs-mem-3, bytes usage max 170 MiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-mem-3, bytes usage max 1.9 GiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-funcs-mem-5: bytes usage increases to 128 MiB (+10240)
In all of the crashes I've observed we successfully get to the point where hash join is being performed and actually get pretty far in the join's execution. However, RSS jumps quite significantly (say from 13 GiB RSS, 203 goroutines, 7.3 GiB/2.5 GiB/10 GiB GO alloc/idle/total to 14 GiB RSS, 201 goroutines, 9.2 GiB/720 MiB/10 GiB GO alloc/idle/total in 10 seconds), but this jump is not reflected in the heap profiles. It is possible that OS reports RSS with a delay, but I'm very puzzled by this.
I'm tired of looking at this issue and wondering whether someone else should take a stab at it. Here are a few suggestions:
set cluster setting sql.stats.automatic_collection.enabled=false;
select crdb_internal.set_vmodule('hashjoiner=1,bytes_usage=1,hash_row_container=2,row_based_flow=1');
roachprod start yahor-test:1-3 --env="COCKROACH_MEMPROF_INTERVAL=3s"
Hi @yuzefovich -- sorry for my slow reply to you earlier -- since you've disabled auto stats, does this mean you've ruled out histograms as the cause? (You can also explicitly disable histograms using set cluster setting sql.stats.histogram_collection.enabled = false;.)
I still think there is memory misaccounting while performing automatic stats collection, but the crashes occur (seems like with lower likelihood) with auto stats disabled, so I don't think they are the root of cause of these OOMs.
Ok thanks! I'll investigate the memory accounting of stats collection on Monday (unless there is some urgency to do it sooner...).
Thanks! I don't think it's urgent since it seems to me that generally we're accounting for "permanently" used memory while automatically collecting stats, but I'm concerned about "temporary" allocations that could spike - I just don't know how big the spikes can be.
(roachtest).alterpk-tpcc failed on master@82fec00c83d4bfe35b906264ccb568568cec15b7:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_082033.550_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1851668-1585985862-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200404 08:20:35.396562 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 94.007016ms
I200404 08:20:43.585080 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 8.188439543s
I200404 08:20:46.075089 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.489950798s
I200404 08:22:28.782320 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m42.707109563s
I200404 08:24:41.019144 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m12.236664801s
I200404 08:32:19.429694 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 7m38.410327593s
Error: check failed: 3.3.2.6: driver: bad connection
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_080522.569_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1853455-1586071414-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200405 08:05:24.183469 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 53.342763ms
I200405 08:05:31.785387 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 7.601842022s
I200405 08:05:34.392096 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.606617715s
I200405 08:06:40.709953 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m6.317813813s
I200405 08:08:40.770217 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m0.060106663s
I200405 08:16:52.169000 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 8m11.398593236s
Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
(roachtest).alterpk-tpcc failed on master@beac4a53e0e2e2236eb5957f67abc1bf476ad1b6:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_075641.938_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1854541-1586157738-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200406 07:56:43.703421 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 66.908502ms
I200406 07:56:49.005046 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 5.30152617s
I200406 07:56:51.492729 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.487643063s
I200406 07:58:07.460154 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m15.967279962s
I200406 08:00:05.841061 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 1m58.380748316s
I200406 11:12:44.203253 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 3h12m38.361899276s
Error: check failed: 3.3.2.6: driver: bad connection
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
Unfortunately, this crash seems like a regression. Using 19.2.5 I created 3 node cluster with default node sizes, imported 500 warehouses, waited for auto stats to be collected, and then ran first query from 3.3.2.6 check manually twice, and both times it succeeded in about 4 minutes. Then I stopped the cluster and restarted it using the current master, but when I executed the query, it crashed on the first try. This issue deserves more investigation. cc @jordanlewis
(roachtest).alterpk-tpcc failed on master@2032dafccfa311c7538960e974953cb9dc1d4e50:
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: output in %s
-- arg 1: <string>
- output in run_081029.241_n4_workload_check_tpcc:
- error with attached stack trace:
main.execCmd
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
main.(*cluster).RunL
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
main.(*cluster).RunE
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
main.(*cluster).Run
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
main.registerAlterPK.func3
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
main.(*testRunner).runTest.func2
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
- error with embedded safe details: %s returned:
stderr:
%s
stdout:
%s
-- arg 1: <string>
-- arg 2: <string>
-- arg 3: <string>
- /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1857109-1586244496-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
stderr:
I200407 08:10:30.874469 1 workload/tpcc/tpcc.go:372 check 3.3.2.1 took 45.088823ms
I200407 08:10:41.353822 1 workload/tpcc/tpcc.go:372 check 3.3.2.2 took 10.479249493s
I200407 08:10:43.551308 1 workload/tpcc/tpcc.go:372 check 3.3.2.3 took 2.19743782s
I200407 08:11:54.442385 1 workload/tpcc/tpcc.go:372 check 3.3.2.4 took 1m10.891018778s
I200407 08:13:56.825404 1 workload/tpcc/tpcc.go:372 check 3.3.2.5 took 2m2.382769241s
I200407 08:17:27.545161 1 workload/tpcc/tpcc.go:372 check 3.3.2.6 took 3m30.719600577s
Error: check failed: 3.3.2.6: driver: bad connection
Error: DEAD_ROACH_PROBLEM:
- error with user detail: Node 4. Command with error:
```
./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
```
- exit status 1
stdout::
- exit status 30
Artifacts: /alterpk-tpcc See this test on roachdashMore
Related:
powered by pkg/cmd/internal/issues
Here is EXPLAIN ANALYZE of the query on 19.2.5 version:
https://cockroachdb.github.io/distsqlplan/decode.html#eJzEmG1P20oWx9_vp7DmFaxMPOfM44m0EtyWvYuWJl2g2tvdi6IU-0LUJGZto1tU9buv7ISS2GMzeSh904LtmfnPf86c3zl8Zfn_pqzPTn97f35yNghOBifnH_9zGhy8Pbu8uvzX-WFwcHl6fvrmKkhHf44mcRiko3j5_-LfdDq6mRfB3y-G74LfWZrFSfY7C4YXb08vgl8-OocFb08v3xwGp7-9OX1_FZycnz8vMn36fPr0_XS0WOkmfZgXB389XKxUrTOaTuZJ8OvF8MP7cqmD1tGHK3ravlmIYiGbp3EyGM-SnPX_y4CFDFnIBLsO2X2W3iR5nmblq6_Vh2fxF9bnIZvM7x-K8vF1yG7SLGH9r6yYFNOE9dnV-NM0uUjGcZJFnIUsTorxZFpNf59NZuPs8bjaDgvZ5f14nveDo4hHEAnOeTCex4EM0uIuyXIWsuFD0Q-ORXiM4TGEx4aFLEv_zIMsGcf9QEnJSZR682I8nQbFZJb0A4CeRSm1Kif49FgkT5-jNMG7yS_s-lvI0odiuYNn4Z8eg7txfrcuuVwWw1KCZNffrqulbhPWh2_hdpaI9fmfT3aUF-nN59Efn0eT-Mvx8_NVnyhCbgRGFJlIII90ZRjoumNL0et2SZLKKC5rfknbI6sMmLpf0LPBrw2_ni3ATSw4ub3NkttxkWaRdjvMQnYy-DgaDK9Ggw_n5wfHcNh4hM1Honz0ZvhhcDW6GP778uDQa9Mww54lZQ2Wu56NvwSzZJZmj8FDnsT9QAvaPlTW4kS0mvQ828O8Ou0kXpuvmqXDRtq_jZcf3o3OBlcHx_Kweflk7fIJSYrbmq84A-ihAEBwGyv5_u6g3MHbcCv7_zHO7y6TYngfAa5rvHq8T_orOZ6FbJr8URysaD_8Wza5vVt_VL-xC5vLkcGq15yDBf70Zi2QRU9IFFyWdlfzuwcuXq0flbA9gwplNTYv0iyJg3wSJ_2g-tp9ftXxLd7Ek_zz8rmyjmN9Pim131swSI_S-whU_Uyda-u1tcE_U4MXvL6z6yhCE0EkBVZJWfhSTKPWyhoXxZTV2KCYlT-KYht4o7anWGQoIjDEqx8i5EgRHkUAIpLcalP9RJGyOpILvCkvvCluFYKqZ6Qy05NBoWWTb7QB314wZyUx29fkW-uuwfbQGLDGnYeV_iGAgz0CDuCnE05achJOkLGvT7jNzN2ZcOIVCSeJDPI2wikLKMhJuOeBLYRDkBrglQmH_pkU_SgDWpc1_5IzOpIRKKAX4SKQUFqoxbDqkdYSdSMjyh3K3m62bOCI3IEtjQbpKBI6skIQL38wkSxps8BKg9AtXZMmDZbqFmIPBdfU6DI365pesGUlG5rX7Zqcm4aZ7FmllBQtXZP6IVDBfUKF_3SoKCnqxuIMVI9rkuq1mbKZtzszRb5u18SJ3EyBnhBKKd3WNS0HtjAFlDCLo_pJXZPD5Iskv0_neeLVE_HymJL4Nlkce54-ZDfJ-yy9qZZZ_DqsxlVVR5zkxeKtXPxyNn96lRfjau5ljKcPRbkxsIBAxumF5portE03gEurjOasijN_TcD9RSkDLlFCWuCm-itRTRT_nt7zZF70y2ZkeWQbSVTeEjVo6U6unEDh9gqhrpCvKsQ1gbw-GDu3Jzx3ZzRp49wdakkkqzqusb3GTrrFgG-MGm01d6pRigsufLwWbq-7FZKvQMNBkUug5Fbuok9scMmxVZ_WFix3BqspEyg67hNyMqDQcardmmqXvEuUUmBdoozVVmufO47buFa74522SXKmIdAkjN1Boey847UkVB-sOgeD7s4Qeh_cQE7ccmdIgSBDFrSHOdZtTrdCX4ogt0bplqgHro10Rb0EMCjtxqI8uYHcWkQnb0EbK610Ztaab8btm-mMC9sdFnYv4LBgOHeT2yihPTaH4N5ctz5vllhQVgpn1gGllU_UtqTqboG-KLEgkJxZUZCw1o-7tCdqaCOd90cKxRX4YK0l_XUL3AQhpNGpkARKzR3F68Z2bYALLdBZBWiyxnqVAS1-QaNq7eLFisLVyWXPBP90zt6oOLdqJThZC-6rZbkhn8tPPd1iQLdE786Ck-HaeUZgNJKUzoipi-TbifTtLThpRS3lpDKk3IVHU06juF2vE0Q3EKC7ivJvJdAKZ1QglxJbWokaE9oMf6HQ828wOJEzjwDXioT2yyMvyPHuJjRI6WQogSZNPoxqyyONunNbMlhykkEJEruA4QWBm5DBoLs8RctReP0FYUuNG_ACjLs-FdYK4brmQIRlmnKo6u4KfBFBS0Rcf_vL_wMAAP__8os0VQ==