I wrote a customer sysbench test, the lua script file is named issue.lua, and the content is:
#!/usr/bin/env sysbench
function thread_init()
drv = sysbench.sql.driver()
con = drv:connect()
con:query([[
prepare stmt from "
select * from t1, t2
where t1.a = t2.a
and t2.a = ?
and t2.b = 1
"
]])
end
function thread_done()
con:disconnect()
end
function event()
con:query("set @a = 1")
con:query("execute stmt using @a")
end
function prepare()
local drv = sysbench.sql.driver()
local con = drv:connect()
con:query("DROP TABLE IF EXISTS t1")
con:query([[
CREATE TABLE t1 (
a DOUBLE,
b DOUBLE,
c DOUBLE DEFAULT NULL,
PRIMARY KEY (a, b)
)
]])
con:query([[INSERT INTO t1(a, b) VALUES (1, 1)]])
con:query([[INSERT INTO t1(a, b) VALUES (1, 2)]])
con:query([[INSERT INTO t1(a, b) VALUES (2, 1)]])
con:query([[INSERT INTO t1(a, b) VALUES (2, 2)]])
con:query([[INSERT INTO t1(a, b) VALUES (3, 1)]])
con:query([[INSERT INTO t1(a, b) VALUES (3, 2)]])
con:query("DROP TABLE IF EXISTS t2")
con:query([[
CREATE TABLE t2 (
a DOUBLE,
b DOUBLE DEFAULT NULL,
c DOUBLE DEFAULT NULL,
UNIQUE KEY idx_a (a)
)
]])
con:query([[INSERT INTO t2(a, b) VALUES(1, 1)]])
con:query([[INSERT INTO t2(a, b) VALUES(2, 1)]])
con:query([[INSERT INTO t2(a, b) VALUES(3, 1)]])
end
function cleanup()
local drv = sysbench.sql.driver()
local con = drv:connect()
con:query("DROP TABLE IF EXISTS t1")
con:query("DROP TABLE IF EXISTS t2")
end
Each connection constantly print a panic log as following when the sysbench runs:
[2020/04/25 16:47:17.266 +08:00] [ERROR] [index_lookup_merge_join.go:300] ["panic in outerMergeWorker.run"] [conn=3] [r="\"invalid memory address or nil pointer dereference\""] ["stack trace"="github.com/pingcap/tidb/executor.(*outerMergeWorker).run.func1\n\t/Users/jianzhang.zj/Code/tidb/executor/ index_lookup_merge_join.go:302\nruntime.gopanic\n\t/opt/goroot/src/runtime/panic.go:679\nruntime.panicmem\n\t/opt/goroot/src/runtime/panic.go:199\nruntime.sigpanic\n\t/opt/goroot/src/runtime/signal_unix.go:394\ngithub.com/pingcap/tidb/session.(*TxnState).GetSnapshot\n\t<autogenerated>:1\ngithub.com/pingcap/tidb/ executor.(*PointGetExecutor).Next\n\t/Users/jianzhang.zj/Code/tidb/executor/point_get.go:141\ngithub.com/pingcap/tidb/executor.Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:249\ngithub.com/pingcap/tidb/executor.(*SelectionExec).Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:1212\ngithub.com/ pingcap/tidb/executor.Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:249\ngithub.com/pingcap/tidb/executor.(*outerMergeWorker).buildTask\n\t/Users/jianzhang.zj/Code/tidb/executor/index_lookup_merge_join.go:357\ngithub.com/pingcap/tidb/executor.(*outerMergeWorker).run\n\t/Users/jianzhang.zj/Code/tidb/ executor/index_lookup_merge_join.go:307"]
step 1: start a tidb-server with mocktikv and plan cache enabled
step 2: prepare table and data via the sysbench test:
sysbench --report-interval=1 --time=300 --threads=4 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql issue prepare
sysbench --report-interval=1 --time=300 --threads=4 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql issue run
panic as described above
tidb-server -V or run select tidb_version(); on TiDB)➜ ./bin/tidb-server -V
Release Version: v4.0.0-beta.2-352-g5ede18f10
Git Commit Hash: 5ede18f10eedfe2e3690d7728dec3ffa4b0af2d5
Git Branch: master
UTC Build Time: 2020-04-25 08:41:26
GoVersion: go1.13.10
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
@zz-jason I can't reproduce this problem in my environment using your script and commands. I added INL_MERGE_JOIN(t1,t2) to the prepared statement, the panic was still not reproduced.
I have reproduced the panic in a SMP environment. Change the --threads=4 in the sysbench script to --thread=1 can reproduce the problem as well, but after I turned off the plan cache, the panic disappeared. However, the problem is not caused by plan cache actually.
When plan cache is disabled, the plan for the query is HashJoin instead of IndexMergeJoin, because the constant propagation would simplify the t1.a = t2.a in the query, so there is no column equal condition which can be used as index join lookup key. When plan cache is enabled, the constant propagation for col = const which contains parameters is forbidden, so the INL_MERGE_JOIN hint can take effect. The panic is caused by PointGet in IndexMergeJoin which runs in SMP context.
The root cause of the panic was introduced by https://github.com/pingcap/tidb/pull/14775 originally, and the code which triggers the panic was introduced by https://github.com/pingcap/tidb/pull/15400, but the panic is actually exposed by a recent PR https://github.com/pingcap/tidb/pull/15627.
RCA of the panic: in code snippet of PointGetExecutor::Next:
136 e.txn, err = e.ctx.Txn(false)
137 if err != nil {
138 return err
139 }
140 if e.txn.Valid() && txnCtx.StartTS == txnCtx.GetForUpdateTS() {
141 e.snapshot = e.txn.GetSnapshot()
142 } else {
143 e.snapshot, err = e.ctx.GetStore().GetSnapshot(kv.Version{Ver: snapshotTS})
144 if err != nil {
145 return err
146 }
147 }
the e.txn is of struct TxnState during runtime of the query. The panic happens occasionally, when it doesn't happen, the code path falls into line 143, while when the panic happens, the code path always falls into line 141. The panic is caused by the fact that e.txn.(*TxnState).Transction becomes nil in line 141, however since it passes the check of Valid() in line 140, e.txn.(*TxnState).Transction cannot be nil then. The only explanation for this pretty weird phenomenon is that, e.txn.(*TxnState).Transction is set nil by other threads during the execution time window of line 140 and line 141.
I turned on the -race flag when compiling tidb-server, and found out the data race:
WARNING: DATA RACE
Read at 0x00c000dc0210 by goroutine 109:
github.com/pingcap/tidb/session.(*TxnState).Valid()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/txn.go:125 +0x3a
github.com/pingcap/tidb/executor.(*PointGetExecutor).Next()
/Users/cauchy/go/src/github.com/pingcap/tidb/executor/point_get.go:140 +0x25c
github.com/pingcap/tidb/executor.Next()
/Users/cauchy/go/src/github.com/pingcap/tidb/executor/executor.go:249 +0x167
github.com/pingcap/tidb/executor.(*SelectionExec).Next()
/Users/cauchy/go/src/github.com/pingcap/tidb/executor/executor.go:1212 +0x6e5
github.com/pingcap/tidb/executor.Next()
/Users/cauchy/go/src/github.com/pingcap/tidb/executor/executor.go:249 +0x167
github.com/pingcap/tidb/executor.(*outerMergeWorker).buildTask()
/Users/cauchy/go/src/github.com/pingcap/tidb/executor/index_lookup_merge_join.go:357 +0xa0c
github.com/pingcap/tidb/executor.(*outerMergeWorker).run()
/Users/cauchy/go/src/github.com/pingcap/tidb/executor/index_lookup_merge_join.go:307 +0xdb
Previous write at 0x00c000dc0210 by goroutine 107:
github.com/pingcap/tidb/session.(*session).doCommit.func1()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/txn.go:201 +0x71
github.com/pingcap/tidb/session.(*session).doCommit()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:396 +0x990
github.com/pingcap/tidb/session.(*session).doCommitWithRetry()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:454 +0x19f
github.com/pingcap/tidb/session.(*session).CommitTxn()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:513 +0x163
github.com/pingcap/tidb/session.finishStmt()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/tidb.go:211 +0x48b
github.com/pingcap/tidb/session.runStmt()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/tidb.go:293 +0x4cd
github.com/pingcap/tidb/session.(*session).ExecuteStmt()
/Users/cauchy/go/src/github.com/pingcap/tidb/session/session.go:1163 +0xa7b
github.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt()
/Users/cauchy/go/src/github.com/pingcap/tidb/server/driver_tidb.go:248 +0x8e
github.com/pingcap/tidb/server.(*clientConn).handleStmt()
/Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:1290 +0xc0
github.com/pingcap/tidb/server.(*clientConn).handleQuery()
/Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:1278 +0x15e
github.com/pingcap/tidb/server.(*clientConn).dispatch()
/Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:901 +0x620
github.com/pingcap/tidb/server.(*clientConn).Run()
/Users/cauchy/go/src/github.com/pingcap/tidb/server/conn.go:715 +0x364
github.com/pingcap/tidb/server.(*Server).onConn()
/Users/cauchy/go/src/github.com/pingcap/tidb/server/server.go:415 +0xfe5
The reason of the data race is that: PointGet is outer child of IndexMergeJoin, during IndexMergeJoinExecutor::Open, the outer worker goroutine is started to fetch rows by PointGetExecutor::Next, while the main goroutine returns back to runStmt -> finishStmt .. -> doCommit -> changeToInvalid, which would reset the TxnState.Transaction to nil, thus panic happens in the worker goroutine.
I guess there may exist other problems / panics behind this single symptom described in this issue, because PointGet now may be running in parallel with other goroutines(not only the main goroutine in this case, but other goroutines, e.g, workers of IndexJoin / IndexHashJoin?), while the original implementation of its execution seems not to consider the possibility of concurrent running? We'd better do a full check of these potential concurrent situations to make sure PointGet is safe there.
I checked out the commit before https://github.com/pingcap/tidb/pull/15400, and the panic still happened, but in another code place, e.g, when calling TxnState.IsReadOnly in point get execution.
I prefer @lzmhhh123 @SunRunAway you guys to fix this problem because you know more about the potential risks of the concurrent execution than me.
BTW, another problem exposed by this issue is that, in https://github.com/pingcap/tidb/blob/master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.
BTW, another problem exposed by this issue is that, in https://github.com/pingcap/tidb/blob/master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.
@qw4990 PTAL
BTW, another problem exposed by this issue is that, in
master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.@qw4990 PTAL
@SunRunAway the issue title is not reflecting the remaining unresolved problem. I think we can create another issue for that and close this one.
BTW, another problem exposed by this issue is that, in
master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.@qw4990 PTAL
@SunRunAway the issue title is not reflecting the remaining unresolved problem. I think we can create another issue for that and close this one.
Thanks for the reminding, #18070 and #18572 are reflecting this problem and both of them are resolved now. I'm going to close this issue.
Most helpful comment
I have reproduced the panic in a SMP environment. Change the
--threads=4in the sysbench script to--thread=1can reproduce the problem as well, but after I turned off the plan cache, the panic disappeared. However, the problem is not caused by plan cache actually.When plan cache is disabled, the plan for the query is
HashJoininstead ofIndexMergeJoin, because the constant propagation would simplify thet1.a = t2.ain the query, so there is no column equal condition which can be used as index join lookup key. When plan cache is enabled, the constant propagation forcol = constwhich contains parameters is forbidden, so theINL_MERGE_JOINhint can take effect. The panic is caused by PointGet in IndexMergeJoin which runs in SMP context.The root cause of the panic was introduced by https://github.com/pingcap/tidb/pull/14775 originally, and the code which triggers the panic was introduced by https://github.com/pingcap/tidb/pull/15400, but the panic is actually exposed by a recent PR https://github.com/pingcap/tidb/pull/15627.
RCA of the panic: in code snippet of
PointGetExecutor::Next:the
e.txnis of structTxnStateduring runtime of the query. The panic happens occasionally, when it doesn't happen, the code path falls into line 143, while when the panic happens, the code path always falls into line 141. The panic is caused by the fact thate.txn.(*TxnState).Transctionbecomes nil in line 141, however since it passes the check ofValid()in line 140,e.txn.(*TxnState).Transctioncannot be nil then. The only explanation for this pretty weird phenomenon is that,e.txn.(*TxnState).Transctionis set nil by other threads during the execution time window of line 140 and line 141.I turned on the
-raceflag when compiling tidb-server, and found out the data race:The reason of the data race is that: PointGet is outer child of IndexMergeJoin, during
IndexMergeJoinExecutor::Open, the outer worker goroutine is started to fetch rows byPointGetExecutor::Next, while the main goroutine returns back torunStmt -> finishStmt .. -> doCommit -> changeToInvalid, which would reset theTxnState.Transactionto nil, thus panic happens in the worker goroutine.I guess there may exist other problems / panics behind this single symptom described in this issue, because
PointGetnow may be running in parallel with other goroutines(not only the main goroutine in this case, but other goroutines, e.g, workers ofIndexJoin/IndexHashJoin?), while the original implementation of its execution seems not to consider the possibility of concurrent running? We'd better do a full check of these potential concurrent situations to make surePointGetis safe there.I checked out the commit before https://github.com/pingcap/tidb/pull/15400, and the panic still happened, but in another code place, e.g, when calling
TxnState.IsReadOnlyin point get execution.I prefer @lzmhhh123 @SunRunAway you guys to fix this problem because you know more about the potential risks of the concurrent execution than me.
BTW, another problem exposed by this issue is that, in https://github.com/pingcap/tidb/blob/master/executor/index_lookup_merge_join.go#L300, the outer worker only prints an ERROR log without reporting this error to upper level, so the query actually can succeed with empty result, which is unexpected. Please fix this as well.