Tidb: 1105 runtime error: invalid memory address or nil pointer dereference

Created on 6 Aug 2020  路  21Comments  路  Source: pingcap/tidb

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Can not really reproduce, it does not happen all the time.

2. What did you expect to see? (Required)

To not receive a error

3. What did you see instead (Required)

General error: 1105 runtime error: invalid memory address or nil pointer dereference

[2020/08/06 11:11:05.744 +02:00] [ERROR] [misc.go:88] ["panic in the recoverable goroutine"] [r="\"invalid memory address or nil pointer dereference\""] ["stack trace"="github.com/pingcap/tidb/util.WithRecovery.func1\n\t/home/jenkins/age
nt/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/util/misc.go:90\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/go/src/runtime
/signal_unix.go:394\ngithub.com/pingcap/tidb/session.(TxnState).GetSnapshot\n\t:1\ngithub.com/pingcap/tidb/executor.(BatchPointGetExec).initialize\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/ti
db/executor/batch_point_get.go:128\ngithub.com/pingcap/tidb/executor.(BatchPointGetExec).Next\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/batch_point_get.go:89\ngithub.com/pingcap/tidb/executor.N
ext\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/executor.go:248\ngithub.com/pingcap/tidb/executor.(
HashJoinExec).fetchBuildSideRows\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/
pingcap/tidb/executor/join.go:257\ngithub.com/pingcap/tidb/executor.(*HashJoinExec).fetchAndBuildHashTable.func1\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/join.go:675\ngithub.com/pingcap/tidb/ut
il.WithRecovery\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/util/misc.go:93"]

4. Affected version (Required)

v 4.00

5. Root Cause Analysis

PrioritP1 componenexecutor severitcritical typbug

All 21 comments

/label component/executor

github.com/pingcap/tidb/util.WithRecovery.func1
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/util/misc.go:90
runtime.gopanic
    /usr/local/go/src/runtime/panic.go:679
runtime.panicmem
    /usr/local/go/src/runtime/panic.go:199
runtime.sigpanic
    /usr/local/go/src/runtime/signal_unix.go:394
github.com/pingcap/tidb/session.(*TxnState).GetSnapshot
    :1
github.com/pingcap/tidb/executor.(*BatchPointGetExec).initialize
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/batch_point_get.go:128
github.com/pingcap/tidb/executor.(*BatchPointGetExec).Next
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/batch_point_get.go:89
github.com/pingcap/tidb/executor.Next
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/executor.go:248
github.com/pingcap/tidb/executor.(*HashJoinExec).fetchBuildSideRows
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/join.go:257
github.com/pingcap/tidb/executor.(*HashJoinExec).fetchAndBuildHashTable.func1
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/join.go:675
github.com/pingcap/tidb/util.WithRecovery
    /home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/util/misc.go:93

@cberescu could you please give more information on this issue? e.g. how do you find it and how it will likely to occur?

It is hard to duplicate it, it appears in a complex query, i will attach the query if you think it may help here :

[2020/08/06 11:11:05.756 +02:00] [ERROR] [conn.go:728] ["command dispatched failed"] [conn=8355] [connInfo="id:8355, addr:78.46.249.6:60308 status:10, collation:utf8_general_ci, user:alex"] [command=Query] [status="inTxn:0, autocom
mit:1"] [sql="SELECT ROUND(AVG(users_ratings.value),1) as avgScore, COUNT(users_ratings.value) as totalRatings,verification_requests.last_update as verifiedLastUpdate, profiles.id_user as id,\nprofiles.id as idp,profiles.age,profiles.cou
ntry,profiles.name, profiles.show_price,profiles.show_skype,profiles.score,profiles.min_sum, images.id as idImage,images.ext as imgExt,\nvideos.video_link as videoLink,videos.video_id as videoId,users.username,users.verified,profiles.ins
ert_date as register_date,users.paypermin,\nIF((devices.status=\"online\") AND ((devices.services_status=\"online\") OR (devices.services_status=\"\") OR (devices.version=\"1.4.3/3.0.7\"))   ,1,0) as dstatus,\n\nIFNULL((SELECT username F
ROM users AS studios WHERE studios.id=users.id_parent AND users.id IN (174569,1154157,1151609,1150540,1145234,1145005,1142218,136995,1134450,1134163,1125097,1121530,1117395,1109043,1108471,1107323,1102165,1099851,98445,1097807,1097713,10
93495,1090391,1088241,1084185,1078454,73863,1071451,1055248,52403,1049316,1038740,36081,34029,987778,984532,980564,974397,968443,965822,961565,959942,957710,957146,956951,951304,929905,924820,921862,912553)),\"n/a\") AS studioId,\n\nIFNU
LL((SELECT profiles_types_options.value\n    FROM profiles_types_links\n    INNER JOIN profiles_types_options ON profiles_types_options.id=profiles_types_links.id_option\n        and profiles_types_options.id_type=(SELECT id FROM profile
s_types where name=\"category\" AND id_domain='1')\n    WHERE id_profile=profiles.id AND id_profile IN (28652,166896,166599,167024,165855,166516,165796,29006,166410,164512,163461,163189,162756,161902,162108,161680,161332,160929,14619,161
143,160916,160520,160404,159921,159531,159038,10489,158344,156726,7232,156058,155014,4796,4295,152494,152542,151465,151097,149353,149116,148193,147908,156388,147701,147458,146566,143536,142666,142378,161146)\n        ),\"n/a\") AS catego
ry,\n\n\nIFNULL((SELECT SUBSTRING_INDEX(GROUP_CONCAT(profiles_tags.name),\",\",3)\n    FROM profiles_tags_links\n    INNER JOIN profiles_tags ON profiles_tags.id=profiles_tags_links.id_tag\n    WHERE profiles_tags_links.id_profile=profil
es.id AND profiles_tags_links.id_profile IN (28652,166896,166599,167024,165855,166516,165796,29006,166410,164512,163461,163189,162756,161902,162108,161680,161332,160929,14619,161143,160916,160520,160404,159921,159531,159038,10489,158344,
156726,7232,156058,155014,4796,4295,152494,152542,151465,151097,149353,149116,148193,147908,156388,147701,147458,146566,143536,142666,142378,161146) ),\n        \"n/a\") AS top3tags,\n\nIFNULL((SELECT SUBSTRING_INDEX(GROUP_CONCAT(CONCAT(
profile_images.id,\".\",profile_images.ext)),\",\",3)\n    FROM images as profile_images\n    WHERE profile_images.id_user=profiles.id_user AND profile_images.type=\"gallery\" AND profile_images.id_user IN (174569,1154157,1151609,1150540
,1145234,1145005,1142218,136995,1134450,1134163,1125097,1121530,1117395,1109043,1108471,1107323,1102165,1099851,98445,1097807,1097713,1093495,1090391,1088241,1084185,1078454,73863,1071451,1055248,52403,1049316,1038740,36081,34029,987778,
984532,980564,974397,968443,965822,961565,959942,957710,957146,956951,951304,929905,924820,921862,912553)\n    ),\"\") AS 3images\n\nFROM profiles\nLEFT JOIN users_ratings ON users_ratings.id_user_rated=profiles.id_user AND users_ratings
.value > 0 AND users_ratings.id_user_rated IN (174569,1154157,1151609,1150540,1145234,1145005,1142218,136995,1134450,1134163,1125097,1121530,1117395,1109043,1108471,1107323,1102165,1099851,98445,1097807,1097713,1093495,1090391,1088241,10
84185,1078454,73863,1071451,1055248,52403,1049316,1038740,36081,34029,987778,984532,980564,974397,968443,965822,961565,959942,957710,957146,956951,951304,929905,924820,921862,912553)\nLEFT JOIN verification_requests ON verification_reque
sts.id_user=profiles.id_user AND verification_requests.id_user IN (174569,1154157,1151609,1150540,1145234,1145005,1142218,136995,1134450,1134163,1125097,1121530,1117395,1109043,1108471,1107323,1102165,1099851,98445,1097807,1097713,109349
5,1090391,1088241,1084185,1078454,73863,1071451,1055248,52403,1049316,1038740,36081,34029,987778,984532,980564,974397,968443,965822,961(len: 6837)"] [txn_mode=PESSIMISTIC] [err="runtime error: invalid memory address or nil pointer derefe
rence\ngithub.com/pingcap/tidb/executor.(*HashJoinExec).fetchAndBuildHashTable.func2\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/join.go:678\ngithub.com/pingcap/tidb/util.WithRecovery.func1\n\t/ho
me/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/util/misc.go:85\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/
go/src/runtime/signal_unix.go:394\ngithub.com/pingcap/tidb/session.(*TxnState).GetSnapshot\n\t<autogenerated>:1\ngithub.com/pingcap/tidb/executor.(*BatchPointGetExec).initialize\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.
com/pingcap/tidb/executor/batch_point_get.go:128\ngithub.com/pingcap/tidb/executor.(*BatchPointGetExec).Next\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/batch_point_get.go:89\ngithub.com/pingcap/t
idb/executor.Next\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/executor.go:248\ngithub.com/pingcap/tidb/executor.(*HashJoinExec).fetchBuildSideRows\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/s
rc/github.com/pingcap/tidb/executor/join.go:257\ngithub.com/pingcap/tidb/executor.(*HashJoinExec).fetchAndBuildHashTable.func1\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/executor/join.go:675\ngithub.com/p
ingcap/tidb/util.WithRecovery\n\t/home/jenkins/agent/workspace/tidb_v4.0.0/go/src/github.com/pingcap/tidb/util/misc.go:93\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

If it is anything else i can give or require please let me know i am available to help in any way i can.

@cberescu thank you for being so responsive, I'll be looking into this issue, more over, could you please a more specific version number? Say, the output of select tidb_version()

Hi, the version was 4.0.0 , i have now updated to 4.0.3 , I can give a try and see it this happens with this version also.

Hi, the version was 4.0.0 , i have now updated to 4.0.3 , I can give a try and see it this happens with this version also.

Cool, thanks a lot. By the way, how frequent will this error pops up?

I can not tell you exactly how often it pops up, it did 2 times pretty close to each other within an hour, after that i moved the query back to mysql. At this moment it is running for about 3 hours and nothing has happened. I will let it run for the next 24 or 48 hours and if nothing happens then i think we can conclude the issue was fixed with the 4.0.3 version .

I will keep you informed if anything happens or it doesn't. Thank you.

Cool, thank you so much for your warm heart, let's work it out together!

Looking forward to your update!

It looks like a data race when BatchPointGet is executed in multi-thread context. You can enable -race flag when building tidb-server, and then try the reproduction again.

Hey guys,

I have let the query where i found the issue to be processed by tidb for the last 48h and i did not see for the error to appear again. So it seems maybe there was a fix between version 4.0.0 and 4.0.3

I will anyway keep it running like this for the time being.

Hey guys,

I have let the query where i found the issue to be processed by tidb for the last 48h and i did not see for the error to appear again. So it seems maybe there was a fix between version 4.0.0 and 4.0.3

I will anyway keep it running like this for the time being.

Could you try @eurekaka 's suggestion, by rebuilding the tidb-server with -race flag and then continue the reproduction?

Hey @ichn-hu unfortunately i do not know how to do that, i used the tiup to deploy the cluster. Is this something i can do with it? If you have a documentation about how i can do it i can deploy a instance with the flag set and send all traffic to it.

It looks like a data race when BatchPointGet is executed in multi-thread context. You can enable -race flag when building tidb-server, and then try the reproduction again.

Hi @eurekaka , would you please elaborate for @cberescu ?

Hey @ichn-hu unfortunately i do not know how to do that, i used the tiup to deploy the cluster. Is this something i can do with it? If you have a documentation about how i can do it i can deploy a instance with the flag set and send all traffic to it.

Hi @cberescu , sorry for a late reply, basically you need to compile & build the tidb server by yourself with go's race detector turned on.

To compile the tidb-server, you need to have a go environment, then clone the code repository, and use

WITH_RACE=1 make server

to compile the tidb server with the race detector.

And you could then test against the compiled binary file, which will detect race condition on the fly.

Let me know if you have any more questions.

More on this, you might not need to use tiup to test the binary file in a cluster, you could just ./bin/tidb-server to start a single tidb server in a standalone mode (you don't even need tikv) if your test case is not that huge.

If you need to use tiup in any case, if you are using tiup playground, you could specify a binary file path, see tiup playground --help; if you are using tiup cluster, you might need to do some research on how to specifiy a compiled binary to deploy.

for now i do not seem to receive the error anymore, could it have been fixed maybe in one of releases since 4.0.0? As i said the error was received on the 4.0.0 version.

for now i do not seem to receive the error anymore, could it have been fixed maybe in one of releases since 4.0.0? As i said the error was received on the 4.0.0 version.

Probably. Perhaps we could consider close this issue? @cberescu feel free to reopen it as long as the problem accur again, thanks for your patience and help~

@SunRunAway would you please help close this issue?

After investigation, this issue is confirmed to be resolved by later PRs, namely https://github.com/pingcap/tidb/pull/18145/, and the cause is identified here: https://github.com/pingcap/tidb/issues/16817#issuecomment-620969926 . In short, the BatchPointGetExec tries to get a snapshot from the txnState, which after checking validity, might be concurrently changed before snapshot = txn.GetSnapshot(), resulted in null pointer dereference.

@cberescu as long as you are using later release versions, such as 4.0.5, you are now free from such panic.

that is great news, thanks for your explanation and patience with me.

Was this page helpful?
0 / 5 - 0 ratings