Tidb: [plan cache] panic on `explain for connection`

Created on 25 Apr 2020  Â·  3Comments  Â·  Source: pingcap/tidb

Bug Report

I wrote a customer sysbench test, the lua script file is named issue.lua, and the content is:

#!/usr/bin/env sysbench

function thread_init()
    drv = sysbench.sql.driver()
    con = drv:connect()
    con:query([[
        prepare stmt from "
            select * from t1, t2
            where t1.a = t2.a
            and t2.a = ?
            and t2.b = 'b'
            and t2.c = 'c'
        "
    ]])
end

function thread_done()
    con:disconnect()
end

function event()
    con:query("set @a = 1")
    con:query("execute stmt using @a")
end

function prepare()
    local drv = sysbench.sql.driver()
    local con = drv:connect()

    con:query("DROP TABLE IF EXISTS t1")
    con:query([[
        CREATE TABLE t1 (
            a     BIGINT,
            b     VARCHAR(40),
            col3  VARCHAR(40) DEFAULT NULL,
            col4  VARCHAR(40) DEFAULT NULL,
            col5  VARCHAR(40) DEFAULT NULL,
            col6  VARCHAR(40) DEFAULT NULL,
            col7  VARCHAR(40) DEFAULT NULL,
            col8  VARCHAR(40) DEFAULT NULL,
            col9  VARCHAR(40) DEFAULT NULL,
            col10 VARCHAR(40) DEFAULT NULL,
            PRIMARY KEY (a, b)
        )
    ]])

    con:query([[INSERT INTO t1(a, b) VALUES (1, "1")]])
    con:query([[INSERT INTO t1(a, b) VALUES (1, "2")]])
    con:query([[INSERT INTO t1(a, b) VALUES (2, "1")]])
    con:query([[INSERT INTO t1(a, b) VALUES (2, "2")]])
    con:query([[INSERT INTO t1(a, b) VALUES (3, "1")]])
    con:query([[INSERT INTO t1(a, b) VALUES (3, "2")]])

    con:query("DROP TABLE IF EXISTS t2")
    con:query([[
        CREATE TABLE t2 (
            a     BIGINT,
            b     VARCHAR(40),
            c     VARCHAR(40),
            col4  VARCHAR(40) DEFAULT NULL,
            col5  VARCHAR(40) DEFAULT NULL,
            col6  VARCHAR(40) DEFAULT NULL,
            col7  VARCHAR(40) DEFAULT NULL,
            col8  VARCHAR(40) DEFAULT NULL,
            col9  VARCHAR(40) DEFAULT NULL,
            col10 VARCHAR(40) DEFAULT NULL,
            UNIQUE KEY idx_a (a)
        )
    ]])

    con:query([[INSERT INTO t2(a, b, c) VALUES(1, "b", "c")]])
    con:query([[INSERT INTO t2(a, b, c) VALUES(2, "b", "c")]])
    con:query([[INSERT INTO t2(a, b, c) VALUES(3, "b", "c")]])
end

function cleanup()
    local drv = sysbench.sql.driver()
    local con = drv:connect()
    con:query("DROP TABLE IF EXISTS t1")
    con:query("DROP TABLE IF EXISTS t2")
end

NOTE: the content is different from it described in #16817.

1. What did you do?

  • step 1: step 1: start a tidb-server with mocktikv and plan cache enabled

  • step 2: prepare table and data via the sysbench test:

sysbench --report-interval=1 --time=300 --threads=4 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql issue prepare
  • step 3: run the sysbench test:
sysbench --report-interval=1 --time=300 --threads=4 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql issue run
  • step 4: got one of the connection id through log file, open a MySQL client and run explain for connection xx to see the execution plan on that connection, I got an error in the return result:
TiDB([email protected]:test) > explain for connection 2;
ERROR 1105 (HY000): runtime error: index out of range [0] with length 0

and the panic stack in tidb log file is:

[2020/04/25 17:24:11.449 +08:00] [ERROR] [adapter.go:123] ["execute sql panic"] [conn=6] [sql="explain for connection 2"] [stack="github.com/pingcap/tidb/executor.(*recordSet).Next.func1\n\t/Users/jianzhang.zj/Code/tidb/executor/adapter.go:123\nruntime.gopanic\n\t/opt/goroot/src/runtime/panic.go:679\nruntime.goPanicIndex\n\t/opt/goroot/src/runtime/panic.go:75\ngithub.com/pingcap/tidb/expression.(*ParamMarker).GetUserVar\n\t/Users/jianzhang.zj/Code/tidb/expression/constant.go:80\ngithub.com/pingcap/tidb/expression.(*Constant).getLazyDatum\n\t/Users/jianzhang.zj/Code/tidb/expression/constant.go:186\ngithub.com/pingcap/tidb/expression.(*Constant).EvalString\n\t/Users/jianzhang.zj/Code/tidb/expression/constant.go:264\ngithub.com/pingcap/tidb/expression.(*builtinCastStringAsRealSig).evalReal\n\t/Users/jianzhang.zj/Code/tidb/expression/builtin_cast.go:1158\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).EvalReal\n\t/Users/jianzhang.zj/Code/tidb/expression/scalar_function.go:354\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).Eval\n\t/Users/jianzhang.zj/Code/tidb/expression/scalar_function.go:323\ngithub.com/pingcap/tidb/expression.(*Constant).getLazyDatum\n\t/Users/jianzhang.zj/Code/tidb/expression/constant.go:190\ngithub.com/pingcap/tidb/expression.(*Constant).Eval\n\t/Users/jianzhang.zj/Code/tidb/expression/constant.go:199\ngithub.com/pingcap/tidb/expression.(*Constant).ExplainInfo\n\t/Users/jianzhang.zj/Code/tidb/expression/explain.go:65\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).explainInfo\n\t/Users/jianzhang.zj/Code/tidb/expression/explain.go:38\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).ExplainInfo\n\t/Users/jianzhang.zj/Code/tidb/expression/explain.go:28\ngithub.com/pingcap/tidb/expression.sortedExplainExpressionList\n\t/Users/jianzhang.zj/Code/tidb/expression/explain.go:119\ngithub.com/pingcap/tidb/expression.SortedExplainExpressionList\n\t/Users/jianzhang.zj/Code/tidb/expression/explain.go:109\ngithub.com/pingcap/tidb/planner/core.(*PhysicalSelection).ExplainInfo\n\t/Users/jianzhang.zj/Code/tidb/planner/core/explain.go:282\ngithub.com/pingcap/tidb/planner/core.(*Explain).prepareOperatorInfo\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:953\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:839\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:903\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:886\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:886\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:928\ngithub.com/pingcap/tidb/planner/core.(*Explain).RenderResult\n\t/Users/jianzhang.zj/Code/tidb/planner/core/common_plans.go:821\ngithub.com/pingcap/tidb/executor.(*ExplainExec).generateExplainInfo\n\t/Users/jianzhang.zj/Code/tidb/executor/explain.go:106\ngithub.com/pingcap/tidb/executor.(*ExplainExec).Next\n\t/Users/jianzhang.zj/Code/tidb/executor/explain.go:53\ngithub.com/pingcap/tidb/executor.Next\n\t/Users/jianzhang.zj/Code/tidb/executor/executor.go:249\ngithub.com/pingcap/tidb/executor.(*recordSet).Next\n\t/Users/jianzhang.zj/Code/tidb/executor/adapter.go:126\ngithub.com/pingcap/tidb/server.(*tidbResultSet).Next\n\t/Users/jianzhang.zj/Code/tidb/server/driver_tidb.go:369\ngithub.com/pingcap/tidb/server.(*clientConn).writeChunks\n\t/Users/jianzhang.zj/Code/tidb/server/conn.go:1436\ngithub.com/pingcap/tidb/server.(*clientConn).writeResultset\n\t/Users/jianzhang.zj/Code/tidb/server/conn.go:1402\ngithub.com/pingcap/tidb/server.(*clientConn).handleStmt\n\t/Users/jianzhang.zj/Code/tidb/server/conn.go:1309\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/Users/jianzhang.zj/Code/tidb/server/conn.go:1278\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/Users/jianzhang.zj/Code/tidb/server/conn.go:901\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/Users/jianzhang.zj/Code/tidb/server/conn.go:715\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/Users/jianzhang.zj/Code/tidb/server/server.go:415"]

2. What did you expect to see?

no panic, executes successfully.

3. What did you see instead?

the error and panic described in the above.

4. What version of TiDB are you using? (tidb-server -V or run select tidb_version(); on TiDB)

➜ ./bin/tidb-server -V
Release Version: v4.0.0-beta.2-352-g5ede18f10
Git Commit Hash: 5ede18f10eedfe2e3690d7728dec3ffa4b0af2d5
Git Branch: master
UTC Build Time: 2020-04-25 08:41:26
GoVersion: go1.13.10
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
epiplan-cache severitmajor siplanner typbug

Most helpful comment

To completely avoid the data race of stored plan in ProcessInfo, we can serialize the plan into string when storing it into ProcessInfo, and we can discard all these session context changes.

All 3 comments

I found the cause of the problem, but I encountered some difficulties in solving it, seek help..

A Simple way to reproduce

  1. Create the following two files in a directory.

runner.lua

function thread_init()
    con = sysbench.sql.driver():connect()
    con:query("drop table if exists t0;")
    con:query("create table t0(a int);")
    con:query([[prepare stmt from "select * from t0 where a = ?";]])
    con:query("set @a = 1")
end

function event()
    con:query("execute stmt using @a")
end

function thread_done()
    con:disconnect()
end

watcher.lua

function thread_init()
    con = sysbench.sql.driver():connect()
end

function event()
    con:query("explain for connection 1")
end

function thread_done()
    con:disconnect()
end
  1. Set PreparedPlanCache.Enabled is true in config.go, compile and run tidb.

  2. Start a terminal, switch to the lua's directory, bench runner:

sysbench --report-interval=1 --time=3000 --threads=1 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql runner run

If connectionId of runner is not 1, please modify the watcher.lua.

  1. Start another terminal, switch to the lua's directory, bench watcher:
sysbench --report-interval=1 --time=3000 --threads=1 --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --mysql-db=test --db-driver=mysql watcher run

then the watcher failed, error info:

ERROR 1105 (HY000): runtime error: index out of range [0] with length 0
github.com/pingcap/tidb/executor.(*recordSet).Next.func1\n\t/mnt/d/project/pingcap/tidb/executor/adapter.go:123\nruntime.gopanic\n\t/home/littlefall/go/src/runtime/panic.go:967\nruntime.goPanicIndex\n\t/home/littlefall/go/src/runtime/panic.go:88\ngithub.com/pingcap/tidb/expression.(*ParamMarker).GetUserVar\n\t/mnt/d/project/pingcap/tidb/expression/constant.go:80\ngithub.com/pingcap/tidb/expression.(*Constant).getLazyDatum\n\t/mnt/d/project/pingcap/tidb/expression/constant.go:186\ngithub.com/pingcap/tidb/expression.(*Constant).EvalString\n\t/mnt/d/project/pingcap/tidb/expression/constant.go:264\ngithub.com/pingcap/tidb/expression.(*builtinCastStringAsRealSig).evalReal\n\t/mnt/d/project/pingcap/tidb/expression/builtin_cast.go:1182\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).EvalReal\n\t/mnt/d/project/pingcap/tidb/expression/scalar_function.go:354\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).Eval\n\t/mnt/d/project/pingcap/tidb/expression/scalar_function.go:323\ngithub.com/pingcap/tidb/expression.(*Constant).getLazyDatum\n\t/mnt/d/project/pingcap/tidb/expression/constant.go:190\ngithub.com/pingcap/tidb/expression.(*Constant).Eval\n\t/mnt/d/project/pingcap/tidb/expression/constant.go:199\ngithub.com/pingcap/tidb/expression.(*Constant).ExplainInfo\n\t/mnt/d/project/pingcap/tidb/expression/explain.go:65\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).explainInfo\n\t/mnt/d/project/pingcap/tidb/expression/explain.go:38\ngithub.com/pingcap/tidb/expression.(*ScalarFunction).ExplainInfo\n\t/mnt/d/project/pingcap/tidb/expression/explain.go:28\ngithub.com/pingcap/tidb/expression.sortedExplainExpressionList\n\t/mnt/d/project/pingcap/tidb/expression/explain.go:119\ngithub.com/pingcap/tidb/expression.SortedExplainExpressionList\n\t/mnt/d/project/pingcap/tidb/expression/explain.go:109\ngithub.com/pingcap/tidb/planner/core.(*PhysicalSelection).ExplainInfo\n\t/mnt/d/project/pingcap/tidb/planner/core/explain.go:283\ngithub.com/pingcap/tidb/planner/core.(*Explain).prepareOperatorInfo\n\t/mnt/d/project/pingcap/tidb/planner/core/common_plans.go:957\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/mnt/d/project/pingcap/tidb/planner/core/common_plans.go:843\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/mnt/d/project/pingcap/tidb/planner/core/common_plans.go:907\ngithub.com/pingcap/tidb/planner/core.(*Explain).explainPlanInRowFormat\n\t/mnt/d/project/pingcap/tidb/planner/core/common_plans.go:932\ngithub.com/pingcap/tidb/planner/core.(*Explain).RenderResult\n\t/mnt/d/project/pingcap/tidb/planner/core/common_plans.go:825\ngithub.com/pingcap/tidb/executor.(*ExplainExec).generateExplainInfo\n\t/mnt/d/project/pingcap/tidb/executor/explain.go:106\ngithub.com/pingcap/tidb/executor.(*ExplainExec).Next\n\t/mnt/d/project/pingcap/tidb/executor/explain.go:53\ngithub.com/pingcap/tidb/executor.Next\n\t/mnt/d/project/pingcap/tidb/executor/executor.go:249\ngithub.com/pingcap/tidb/executor.(*recordSet).Next\n\t/mnt/d/project/pingcap/tidb/executor/adapter.go:126\ngithub.com/pingcap/tidb/server.(*tidbResultSet).Next\n\t/mnt/d/project/pingcap/tidb/server/driver_tidb.go:369\ngithub.com/pingcap/tidb/server.(*clientConn).writeChunks\n\t/mnt/d/project/pingcap/tidb/server/conn.go:1448\ngithub.com/pingcap/tidb/server.(*clientConn).writeResultset\n\t/mnt/d/project/pingcap/tidb/server/conn.go:1414\ngithub.com/pingcap/tidb/server.(*clientConn).handleStmt\n\t/mnt/d/project/pingcap/tidb/server/conn.go:1321\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/mnt/d/project/pingcap/tidb/server/conn.go:1290\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/mnt/d/project/pingcap/tidb/server/conn.go:901\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/mnt/d/project/pingcap/tidb/server/conn.go:715\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/mnt/d/project/pingcap/tidb/server/server.go:415

Problem Reason

The reason for the problem is that ParamMaker will be reset during the query process. If it has been reseted and it has not been added, explain for connection accessing the array will cross the boundary.

In essence, it is a data race problem.

pull/16206 proposes a good method to try to solve this problem as follow, it checks whether value-get is in the execution of an explain statement. If in, it returns a empty value.

func (c *Constant) getLazyDatum() (dt types.Datum, isLazy bool, err error) {
    if p := c.ParamMarker; p != nil {
        if p.ctx.GetSessionVars().StmtCtx.InExplainStmt {
            // Since `ParamMarker` is not nil only in prepare/execute context, the query must be `explain for connection` when coming here.
            // The PreparedParams may have been reset already, to avoid panic, we just use the pre-evaluated datum for this constant.
            return dt, false, nil
        }
        dt = p.GetUserVar()
        isLazy = true
        return
    } else if c.DeferredExpr != nil {
        dt, err = c.DeferredExpr.Eval(chunk.Row{})
        isLazy = true
        return
    }
    return
}

// GetType implements Expression interface.
func (c *Constant) GetType() *types.FieldType {
    if p := c.ParamMarker; p != nil && !p.ctx.GetSessionVars().StmtCtx.InExplainStmt {
        // GetType() may be called in multi-threaded context, e.g, in building inner executors of IndexJoin,
        // so it should avoid data race. We achieve this by returning different FieldType pointer for each call.
        tp := types.NewFieldType(mysql.TypeUnspecified)
        dt := p.GetUserVar()
        types.DefaultParamTypeForValue(dt.GetValue(), tp)
        return tp
    }
    return c.RetType
}

But it doesn't work because sctx in c.ParamMarker is the sctx of the query plan, not the sctx of the explain statement.

Possible Solution

Solution 1

Use the same method as 16206, but upload the explain sctx through Plan.ExplainInfo, SortedExplainExpressionList, Expression.ExplainInfo, Expression.Eval, builtinFunc.Eval, so many eval functions, and even expression.GetType.

I tried to do it, but found that the shortcomings are:

  1. Need to add parameters to very many functions and calls.
  2. Some update will be a bit ugly. For example, if you just want a type of constant, you need to cal GetType(nil).
  3. Some functions already have a sctx parameter, which may cause confusion.
  4. In the explain result, the varaible will display null.

Solution 2

A simple method is not to upload sctx, but before determining the value, first determine whether it will cross the boundary. If it will, returns a empty value and generates a warning.

func (d *ParamMarker) GetUserVar() (datum types.Datum) {
    if d.order < len(sessionVars.PreparedParams) {
        datum = sessionVars.PreparedParams[d.order]
    } else {
        sessionVars.StmtCtx.AppendWarning(errors.New(fmt.Sprintf("There is another session with explain query for this executing plan, it will get a null value for variables.")))
        logutil.BgLogger().Warn("A query plan is being executed, please try again.")
    }
}

The disadvantages of this solutions are:

  1. This solution still has a small probability to cause the array to cross the boundary. A lock needs to be added, which will affect the performance.
  2. The warning is actually added to the session of the query plan, not the explain session, which is counterintuitive.

Solution 3

The third method is to combine the above two, upload sctx + lock + return warning, you can feedback the warning to the session where explain, but the cost is the most.

Seek for help

I am confused about which method should be used. Which one do you think is better? Is there a more better way?

@zz-jason @eurekaka @lysu

To completely avoid the data race of stored plan in ProcessInfo, we can serialize the plan into string when storing it into ProcessInfo, and we can discard all these session context changes.

To completely avoid the data race of stored plan in ProcessInfo, we can serialize the plan into string when storing it into ProcessInfo, and we can discard all these session context changes.

Thanks! I will try it later.

Was this page helpful?
0 / 5 - 0 ratings