TiDB: ANALYZE TABLE causes SIGSEGV on latest trunk

Created on 26 Mar 2020 · 12 Comments · Source: pingcap/tidb

Consider the following statements:

CREATE TABLE t0(c0 INT UNIQUE, c2 INT UNIQUE);
REPLACE INTO t0(c0, c2) VALUES (0, 0), (0, 0), (0, 1);
ANALYZE TABLE t0; -- SIGSEGV

Unexpectedly, the ANALYZE causes tidb-server to crash:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1d1920a]

goroutine 52 [running]:
github.com/pingcap/tidb/store/tikv/latch.(*Latches).releaseSlot(0xc0002f51a0, 0xc002445e00, 0x0)
    /home/manuel/research/projects/tidb/store/tikv/latch/latch.go:183 +0x17a
github.com/pingcap/tidb/store/tikv/latch.(*Latches).release(0xc0002f51a0, 0xc002445e00, 0xc00016ff80, 0x0, 0x0, 0xc000085f80, 0x0, 0x0)
    /home/manuel/research/projects/tidb/store/tikv/latch/latch.go:167 +0x70
github.com/pingcap/tidb/store/tikv/latch.(*LatchesScheduler).run(0xc000352440)
    /home/manuel/research/projects/tidb/store/tikv/latch/scheduler.go:56 +0xf4
created by github.com/pingcap/tidb/store/tikv/latch.NewScheduler
    /home/manuel/research/projects/tidb/store/tikv/latch/scheduler.go:43 +0x129

I can reproduce this bug on the latest trunk version:

mysql> select tidb_version();
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v4.0.0-beta.2-75-ga6de0e38d-dirty
Git Commit Hash: a6de0e38d49c97671d316590c8c945eb518ca2b2
Git Branch: master
UTC Build Time: 2020-03-26 12:11:33
GoVersion: go1.13.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I cannot reproduce this on the latest release version (5.7.25-TiDB-v3.0.12), which is why I assume that this is not a security vulnerability, and report the issue here on GitHub.

component/statistics severity/major type/bug

All 12 comments

Hi @mrigger, could you provide the cluster configuration? I can not reproduce this issue on my laptop.

Or could you provide the full panic log of tidb-server?

Hi @mrigger, could you provide the cluster configuration? I can not reproduce this issue on my laptop.

Here is the content of the configuration file for pd-server:

initial-cluster-state = "new"

enable-prevote = true
lease = 3
namespace-classifier = "table"
tso-save-interval = "3s"

[security]
cacert-path = ""
cert-path = ""
key-path = ""

[log]
level = "info"

[log.file]

[metric]

[schedule]
enable-one-way-merge = false
leader-schedule-limit = 4
max-pending-peer-count = 16
max-snapshot-count = 3
max-store-down-time = "30m"
merge-schedule-limit = 8
region-schedule-limit = 64
replica-schedule-limit = 64
split-merge-interval = "1h"

[replication]
location-labels = []
max-replicas = 1

Here is the one for tidb-server:

port = 4001
lease = "0"

# When create table, split a separated region for it. It is recommended to
# turn off this option if there will be a large number of tables created.
split-table = false

[log]
level = "error"

[status]
status-port = 10081

[performance]
stats-lease = "0"

[txn-local-latches]
enabled = true
capacity = 204800

And here is the one for tikv-server:

[storage]

[raftstore]
raft-store-max-leader-lease = "4s"
raft-election-timeout-ticks = 5
use-delete-range = true
consistency-check-interval = "60s"

[coprocessor]

[rocksdb]

[raftdb]

pd version: 2f53e1e41ccaa3ab222fbf3cfc031b774407be5f
tikv version: 2a3784a65ebc3fef005ad58272790d5830eeb304
tidb version: a6de0e38d49c97671d316590c8c945eb518ca2b2.

Or could you provide the full panic log of tidb-server?

Do you mean the following?

[2020/03/26 15:58:41.459 +01:00] [INFO] [printer.go:41] ["Welcome to TiDB."] ["Release Version"=v4.0.0-beta.2-75-ga6de0e38d-dirty] ["Git Commit Hash"=a6de0e38d49c97671d316590c8c945eb518ca2b2] ["Git Branch"=master] ["UTC Build Time"="2020-03-26 12:11:33"] [GoVersion=go1.13.4] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2020/03/26 15:58:41.460 +01:00] [INFO] [printer.go:54] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"0.0.0.0\",\"port\":4001,\"cors\":\"\",\"store\":\"mocktikv\",\"path\":\"/tmp/tidb\",\"socket\":\"\",\"lease\":\"0\",\"run-ddl\":true,\"split-table\":false,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/tidb/tmp-storage\",\"oom-action\":\"cancel\",\"mem-quota-query\":1073741824,\"enable-streaming\":false,\"enable-batch-dml\":false,\"txn-local-latches\":{\"enabled\":true,\"capacity\":204800},\"lower-case-table-names\":2,\"server-version\":\"\",\"log\":{\"level\":\"error\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"asdf\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"tidb-slow.log\",\"slow-threshold\":300,\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"record-plan-in-slow-log\":1},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"require-secure-transport\":false,\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10081,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"stats-lease\":\"0\",\"stmt-count-limit\":5000,\"feedback-probability\":0.05,\"query-feedback-limit\":1024,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"cross-join\":true,\"run-auto-analyze\":true},\"prepared-plan-cache\":{\"enabled\":false,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":
0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"commit-timeout\":\"41s\",\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"copr-cache\":{\"enabled\":false,\"capacity-mb\":0,\"admission-max-result-mb\":0,\"admission-min-process-ms\":0}},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"\",\"load\":\"\"},\"pessimistic-txn\":{\"enable\":true,\"max-retry-count\":256},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"stmt-summary\":{\"enable\":true,\"enable-internal-query\":false,\"max-stmt-count\":200,\"max-sql-length\":4096,\"refresh-interval\":1800,\"history-size\":24},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":4096,\"new_collations_enabled_on_first_bootstrap\":false,\"experimental\":{\"allow-auto-random\":false,\"allow-expression-index\":false},\"enable-dynamic-config\":true}"]

don't use use-delete-range = true in TiKV.

I think we must remove this ASAP @zhangjinpeng1987

Thanks! I've removed use-delete-range = true and will continue fuzzing without it. However, locally, I can still reproduce the issue described above, even without this flag.

Are there any recommended config files for testing? Or should I just run without config files?

Thanks @mrigger, I reproduced this error when enabled = true is set under [txn-local-latches] in the TiDB server configuration. @coocood @tiancaiamao Please help us locate the root cause.

don't use use-delete-range = true in TiKV.

I think we must remove this ASAP @zhangjinpeng1987

@siddontang We have already removed this config from the config template file. Do you mean we should also remove it from the code?

The local latch in TiDB is no longer recommended. We have disabled this feature in https://github.com/pingcap/tidb/pull/15765. Sorry for not disabling it in time. Could you please check if this is still an issue after updating TiDB or turning off local latch?

Confirmed: after turning off txn-local-latches, the panic disappears.
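
For reference, turning the feature off should amount to flipping the enabled key in the tidb-server configuration file quoted earlier in this thread. The fragment below is only a sketch based on that file; the capacity value is carried over unchanged from it.

```toml
[txn-local-latches]
# Disable the local latch that triggered the panic in this report.
enabled = false
capacity = 204800
```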

Hi @mrigger, I'm going to close this issue now. Feel free to reopen it once it appears again.

This bug is caused by a memory overwrite.
The latch module constructs a complex hash-like structure, and each node stores

key, value

The API provided by the latch module is:

func (scheduler *LatchesScheduler) Lock(startTS uint64, keys [][]byte) *Lock {
func (scheduler *LatchesScheduler) UnLock(lock *Lock) {

In Lock(startTS, keys), keys is only a shallow copy, and it is later used as the node key in the latch's internal structure.

If the caller later modifies the underlying memory of keys, the hash-like node is modified unexpectedly,
and the corrupted data structure eventually causes the panic.
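
The aliasing problem described above can be illustrated with a minimal, self-contained Go sketch. It is not the TiDB latch code: lockShallow and lockDeep are hypothetical helpers that only mimic "store the caller's keys" with and without copying the bytes, and a plain slice stands in for the hash-like structure.

```go
package main

import "fmt"

// lockShallow (hypothetical) keeps the caller's keys as-is: only the
// slice headers are copied, so each stored key still points at the
// caller's memory.
func lockShallow(keys [][]byte) [][]byte {
	stored := make([][]byte, len(keys))
	copy(stored, keys) // copies slice headers, not the underlying bytes
	return stored
}

// lockDeep (hypothetical) copies the bytes of every key, so later
// writes by the caller cannot reach the stored keys.
func lockDeep(keys [][]byte) [][]byte {
	stored := make([][]byte, len(keys))
	for i, k := range keys {
		stored[i] = append([]byte(nil), k...)
	}
	return stored
}

func main() {
	buf := []byte("key-a")
	keys := [][]byte{buf}

	shallow := lockShallow(keys)
	deep := lockDeep(keys)

	// The caller reuses its buffer after the keys were handed over.
	copy(buf, "key-b")

	fmt.Printf("shallow: %s\n", shallow[0]) // the stored key changed underneath us
	fmt.Printf("deep:    %s\n", deep[0])    // the stored key is unaffected
}
```

In the shallow case the stored key silently changes from key-a to key-b, which is the kind of unexpected modification that could corrupt a structure indexed by that key. Copying the bytes is one possible defense; this thread does not say which fix was actually applied.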
