Victoriametrics: vmstorage crash

Created on 5 Jun 2020  路  11Comments  路  Source: VictoriaMetrics/VictoriaMetrics

Describe the bug

2020-06-05T06:03:00.721Z    info    VictoriaMetrics@/lib/storage/partition.go:1169  merged 2645139584 rows in 145.442 seconds at 18186958 rows/sec to "/data2/vmstorage/data/small/2020_06/2645139584_7939461_20200604000000.000_20200605055202.000_1615906BA0D72853"; sizeBytes: 3393939808
2020-06-05T06:03:28.017Z    info    VictoriaMetrics@/app/vmstorage/transport/server.go:219  closing vmselect conn from 10.152.6.134:51702
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8005c1]

goroutine 62 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*TagFilters).Reset(...)
    github.com/VictoriaMetrics/VictoriaMetrics@/lib/storage/tag_filters.go:93
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*vmselectRequestCtx).setupTfss(0xc076f108c0, 0xc076cc6000, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics@/app/vmstorage/transport/server.go:778 +0x291
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectSearchQuery(0xc0001b5e30, 0xc076f108c0, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics@/app/vmstorage/transport/server.go:717 +0x2bb
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectRequest(0xc0001b5e30, 0xc076f108c0, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics@/app/vmstorage/transport/server.go:469 +0x428
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectConn(0xc0001b5e30, 0xc00027bb60, 0x4, 0xabb479)
    github.com/VictoriaMetrics/VictoriaMetrics@/app/vmstorage/transport/server.go:333 +0xbf
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).RunVMSelect.func1(0xc0001b5e30, 0xb71de0, 0xc01c1e04e0)
    github.com/VictoriaMetrics/VictoriaMetrics@/app/vmstorage/transport/server.go:225 +0x362
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).RunVMSelect
    github.com/VictoriaMetrics/VictoriaMetrics@/app/vmstorage/transport/server.go:190 +0x217
2020-06-05T06:03:30.439Z    info    VictoriaMetrics@/lib/logger/flag.go:11  build version: vmstorage-20200213-163134-tags-v1.33.1-cluster-0-gafecb344
2020-06-05T06:03:30.439Z    info    VictoriaMetrics@/lib/logger/flag.go:12  command line flags

I found this error in stderr while the vmstorage process was restarting. I don't know what query is causing this panic yet, maybe we can do some protection to prevent a single query from causing the entire process to crash.

Version
vmstorage-20200521-151404-tags-v1.35.6-cluster-0-gdcbdc009f

bug

All 11 comments

The following log line suggests you run vmstorage v1.33.1-cluster, not v1.35.6-cluster:

2020-06-05T06:03:30.439Z info VictoriaMetrics@/lib/logger/flag.go:11 build version: vmstorage-20200213-163134-tags-v1.33.1-cluster-0-gafecb344

Could you update all the cluster components to the latest release and check whether the crash is triggered again? Note that communication protocol between vminsert and vmstorage has been changed in backwards-imcompatible way at v1.35.0, so these components may emit errors about imcompabitle protocol during the upgrade when only a part of components is upgraded. These errors should stop after the upgrade is complete.

image
I just tried to upgrade to v1.35.6 and that didn't solve the problem, I just posted an earlier log.
I'll try the latest version now.

image
The latest version v1.36.3 has same problem.

@faceair , thanks for the updated info!

Could you try building vmstorage from the commit e4cef1b678473c33fcd11b8557338066e88ec5b1 and verify whether the crash is triggered there? See instructions on how to build cluster components.

2020-06-05T09:40:02.408Z    info    VictoriaMetrics/lib/storage/partition.go:1161   merged 505121513 rows across 13568497 blocks in 32.124 seconds at 15724232 rows/sec to "/data1/vmstorage/data/small/2020_06/505121513_10125733_20200605000000.000_20200605093915.000_16159BBFE24EA3B5"; sizeBytes: 906273046
2020-06-05T09:50:11.435Z    info    VictoriaMetrics/app/vmstorage/transport/server.go:229   closing vmselect conn from 10.152.6.134:35858
2020-06-05T09:50:11.435Z    info    VictoriaMetrics/app/vmstorage/transport/server.go:229   closing vmselect conn from 10.152.6.134:35854
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7c6a3a]

goroutine 134 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*TagFilters).Reset(...)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/tag_filters.go:137
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*vmselectRequestCtx).setupTfss(0xc05657e000, 0xc0cdf50000, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:883 +0x34a
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectSearchQuery(0xc000214480, 0xc05657e000, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:820 +0x2c2
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectRequest(0xc000214480, 0xc05657e000, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:510 +0x38c
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectConn(0xc000214480, 0xc032588090, 0x4, 0xa5b760)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:373 +0xbf
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).RunVMSelect.func1(0xc000214480, 0xb0dca0, 0xc054fda3c0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:235 +0x36f
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).RunVMSelect
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:200 +0x21e
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7c6a3a]

goroutine 467 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*TagFilters).Reset(...)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/tag_filters.go:137
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*vmselectRequestCtx).setupTfss(0xc0000c9b00, 0xc07cf12000, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:883 +0x34a
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectSearchQuery(0xc000214480, 0xc0000c9b00, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:820 +0x2c2
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectRequest(0xc000214480, 0xc0000c9b00, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:510 +0x38c
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).processVMSelectConn(0xc000214480, 0xc04ffd17a0, 0x4, 0xa5b760)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:373 +0xbf
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).RunVMSelect.func1(0xc000214480, 0xb0dca0, 0xc0536e1b40)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:235 +0x36f
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport.(*Server).RunVMSelect
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/transport/server.go:200 +0x21e
2020-06-05T09:50:13.427Z    info    VictoriaMetrics/lib/logger/flag.go:11   build version: vmstorage-20200605-091548-heads-cluster-0-ge4cef1b6
2020-06-05T09:50:13.427Z    info    VictoriaMetrics/lib/logger/flag.go:12   command line flags

The new version does not fix this problem.

Could you look into vmselect error logs around the time when vmstorage process crashes? These logs could contain information about the query, which triggers vmstorage process crash.

Could you also clarify steps to reproduce the issue? As I understood it is reproduced during graceful shutdown of vmstorage. Or is it reproduced when vmstorage runs in normal mode when a certain query is received from vmselect?

It looks like it was caused by this big series match query.

2020-06-05T09:50:13.699Z    warn    VictoriaMetrics/app/vmselect/main.go:317    error in "/select/0/prometheus/api/v1/series?match%5B%5D=span%7B__span_g1__%3D%22qa-go-rpc%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22qa-go-web%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22qa-go-rpc-offline%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22qa-go-workers%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22qa-go-actors-router%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22qa-go-actor-workers%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22remove-question-member%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22qa-auto-publish-router%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D&match%5B%5D=span%7B__span_g1__%3D%22baidupan-reverse-proxy%22%2C+__span_g2__%3D%22_all%22%2C+__span_g3__%3D%22client%22%2C+__span_g6__%3D%22_all%22%2C+__span_g8__%3D%22count%22%7D": cannot fetch data for "AccountID=0, ProjectID=0, MinTimestamp=2020-06-05 09:45:10 +0000 UTC, MaxTimestamp=2020-06-05 09:50:10 +0000 UTC, TagFilters=[\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-go-rpc\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-go-web\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-go-rpc-offline\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-go-workers\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-go-actors-router\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-go-actor-workers\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"remove-question-member\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"qa-auto-publish-router\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n{Key=\"\", Value=\"span\", IsNegative: false, IsRegexp: false}{Key=\"__span_g1__\", Value=\"baidupan-reverse-proxy\", IsNegative: false, IsRegexp: false}{Key=\"__span_g2__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g3__\", Value=\"client\", IsNegative: false, IsRegexp: false}{Key=\"__span_g6__\", Value=\"_all\", IsNegative: false, IsRegexp: false}{Key=\"__span_g8__\", Value=\"count\", IsNegative: false, IsRegexp: false}\n]": error occured during search: cannot perform search on vmstorage promdb03.pek01.rack.zhihu.com:9102: cannot execute rpcName="search_v3" on vmstorage "10.137.162.66:9102" with timeout 30.000 seconds; the timeout can be adjusted with `-search.maxQueryDuration` command-line flag: cannot read error message: cannot read 8 bytes with data size: EOF; read only 0 bytes

@faceair , could you try building vmstorage from 01719f4949834c3613817a69bc6ecec91d628537 and verify whether it fixes the crash?

Looks like the latest version is fine, no more crashes, thanks for the quick fix!

Another question is why does the query cause a crash? We may need to add a recovery function to here

The recovery function may hide bugs like this one, so they will stay unnoticed for long time and may induce hard-to-debug indirect bugs. This reduces code quality in the long run. It is better to crash on unexpected bug like this one, so it could be noticed and fixed quickly.

The root cause of the bug was imporper assumption that a = append(a, ptr) initializes non-nil pointers outside the len(a), which belong to cap(a).

Let's keep this bug open until the bugfix is included into the next release of VictoriaMetrics.

The bugfix has been included in v1.37.0. Closing the issue as fixed.

Was this page helpful?
0 / 5 - 0 ratings