Relevant system information:
Describe the bug
Upon upgrade to version 2.1.0, we now get a segmentation fault error in one of our queries.
We upgraded from 1.7.4.
Error happens when I run this query:
INSERT INTO rollup_1d_node_metric(time, cluster_id, node_id, cpu_cost, total_cost)
SELECT time_bucket('1 day', m.end_time),
cluster_id,
node_id,
sum(cpu_cost),
sum(total_cost)
FROM node_metric mm
INNER JOIN metric m ON mm.metric_id = m.id
WHERE mm.end_time BETWEEN time_bucket('1 day', '2019-07-27 10:03:47.177372'::timestamptz) AND CURRENT_TIMESTAMP
AND m.end_time BETWEEN time_bucket('1 day', '2019-07-27 10:03:47.177372'::timestamptz) AND CURRENT_TIMESTAMP
GROUP BY 1, 2, 3
ON CONFLICT (time, cluster_id, node_id)
DO UPDATE
SET cpu_cost = excluded.cpu_cost,
total_cost = excluded.total_cost;
Stack trace:
elijah@elijah-ThinkPad-T480:~/Programs/timescaledb-2.1.0/build$ sudo coredumpctl gdb
PID: 20295 (postgres)
UID: 124 (postgres)
GID: 131 (postgres)
Signal: 11 (SEGV)
Timestamp: Thu 2021-03-18 15:22:06 -04 (35s ago)
Command Line: postgres: 11/main: elijah replex 127.0.0.1(48846) INSERT
Executable: /usr/lib/postgresql/11/bin/postgres
Control Group: /system.slice/system-postgresql.slice/[email protected]
Unit: [email protected]
Slice: system-postgresql.slice
Boot ID: 685e0b94f39444b5895002698dd4c99a
Machine ID: f7db69a1e32e431eb89aa481c8a36d31
Hostname: elijah-ThinkPad-T480
Storage: /var/lib/systemd/coredump/core.postgres.124.685e0b94f39444b5895002698dd4c99a.20295.1616095326000000.zst
Message: Process 20295 (postgres) of user 124 dumped core.
Stack trace of thread 20295:
#0 0x00007fe57d7986e8 ts_chunk_insert_state_create (timescaledb-2.1.0.so + 0x306e8)
#1 0x00007fe57d79596f ts_chunk_dispatch_get_chunk_insert_state (timescaledb-2.1.0.so + 0x2d96f)
#2 0x00007fe57d795d1e chunk_dispatch_exec (timescaledb-2.1.0.so + 0x2dd1e)
#3 0x00005655436b619f ExecProcNode (postgres + 0x2b419f)
#4 0x000056554369250c ExecProcNode (postgres + 0x29050c)
#5 0x00005655437e4e27 ProcessQuery (postgres + 0x3e2e27)
#6 0x00005655437e5130 PortalRunMulti (postgres + 0x3e3130)
#7 0x00005655437e5c4e PortalRun (postgres + 0x3e3c4e)
#8 0x00005655437e30c6 exec_execute_message (postgres + 0x3e10c6)
#9 0x0000565543771bee BackendRun (postgres + 0x36fbee)
#10 0x0000565543772a88 PostmasterMain (postgres + 0x370a88)
#11 0x000056554350ce84 main (postgres + 0x10ae84)
#12 0x00007fe782669cb2 __libc_start_main (libc.so.6 + 0x28cb2)
#13 0x000056554350cf2e _start (postgres + 0x10af2e)
GNU gdb (Ubuntu 9.2-0ubuntu2) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/11/bin/postgres...
Reading symbols from /usr/lib/debug/.build-id/0d/08baadc301560a5ca98f1f69aa35464511f4ec.debug...
[New LWP 20295]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 11/main: elijah replex 127.0.0.1(48846) INSERT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fe57d7986e8 in ts_chunk_insert_state_create () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
(gdb) bt
#0 0x00007fe57d7986e8 in ts_chunk_insert_state_create () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
#1 0x00007fe57d79596f in ts_chunk_dispatch_get_chunk_insert_state () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
#2 0x00007fe57d795d1e in chunk_dispatch_exec () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
#3 0x00005655436b619f in ExecProcNode (node=0x56554422e088) at ./build/../src/include/executor/executor.h:248
#4 ExecModifyTable (pstate=0x56554422d700) at ./build/../src/backend/executor/nodeModifyTable.c:2025
#5 0x000056554369250c in ExecProcNode (node=0x56554422d320) at ./build/../src/include/executor/executor.h:248
#6 ExecutePlan (execute_once=<optimized out>, dest=0x565543b75480 <donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x56554422d320,
estate=0x56554422cb20) at ./build/../src/backend/executor/execMain.c:1712
#7 standard_ExecutorRun (queryDesc=0x5655441ecce0, direction=<optimized out>, count=0, execute_once=<optimized out>)
at ./build/../src/backend/executor/execMain.c:353
#8 0x00005655437e4e27 in ProcessQuery (plan=0x565544274958,
sourceText=0x56554420e1e8 "INSERT INTO rollup_1d_node_metric(time, cluster_id, node_id, cpu_cost, total_cost, memory_bytes,\n", ' ' <repeats 34 times>, "memory_used_percent, cpu_cores, cpu_used_percent, disk_costs, discoun"..., params=0x0, queryEnv=0x0,
dest=<optimized out>, completionTag=0x7ffec8874bf0 "") at ./build/../src/backend/tcop/pquery.c:161
#9 0x00005655437e5130 in PortalRunMulti (portal=portal@entry=0x565544195530, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x565543b75480 <donothingDR>, dest@entry=0x5655440f68e0,
altdest=0x565543b75480 <donothingDR>, altdest@entry=0x5655440f68e0, completionTag=completionTag@entry=0x7ffec8874bf0 "")
at ./build/../src/backend/tcop/pquery.c:1286
#10 0x00005655437e5c4e in PortalRun (portal=0x565544195530, count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimized out>, dest=0x5655440f68e0, altdest=0x5655440f68e0, completionTag=0x7ffec8874bf0 "")
at ./build/../src/backend/tcop/pquery.c:799
#11 0x00005655437e30c6 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x565544149328, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:2036
#12 0x0000565543771bee in BackendRun (port=0x565544146de0) at ./build/../src/backend/postmaster/postmaster.c:4409
#13 BackendStartup (port=0x565544146de0) at ./build/../src/backend/postmaster/postmaster.c:4073
#14 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1728
#15 0x0000565543772a88 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1401
#16 0x000056554350ce84 in main (argc=5, argv=0x5655440f0fa0) at ./build/../src/backend/main/main.c:228
(gdb)
I did a little digging and error does not occur when I change ON CONFLICT DO UPDATE to ON CONFLICT DO NOTHING.
Also, when I change our build system to use timescaledb 2.0.2, error does not occur.
Could you provide a minimal testcase that fails on 2.1.0 in a fresh database.
In the debug trace above, can you show the line that triggers the segfault?
Here is the query plan generated for the query:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Custom Scan (HypertableInsert) (cost=0.00..0.04 rows=1 width=296) |
| -> Insert on public.rollup_1d_node_metric (cost=0.00..0.04 rows=1 width=296) |
| Conflict Resolution: UPDATE |
| Conflict Arbiter Indexes: rollup_1d_node_metric_pkey |
| -> Custom Scan (ChunkDispatch) (cost=0.00..0.04 rows=1 width=296) |
| Output: "*SELECT*".time_bucket, "*SELECT*".cluster_id, "*SELECT*".node_id, "*SELECT*".sum, "*SELECT*".sum_1, "*SELECT*".avg, "*SELECT*".avg_1, "*SELECT*".avg_2, "*SELECT*".avg_3, "*SELECT*".sum_2, "*SELECT*".avg_4 |
| -> HashAggregate (cost=0.00..0.03 rows=1 width=296) |
| Output: (time_bucket('1 day'::interval, end_time)), cluster_id, node_id, sum(cpu_cost), sum(total_cost), avg(memory_bytes), avg(memory_used_percent), avg(cpu_cores), avg(cpu_used_percent), sum(disk_costs), avg(discount)|
| Group Key: time_bucket('1 day'::interval, end_time), cluster_id, node_id |
| -> Result (cost=0.00..0.00 rows=0 width=212) |
| Output: time_bucket('1 day'::interval, end_time), cluster_id, node_id, cpu_cost, total_cost, memory_bytes, memory_used_percent, cpu_cores, cpu_used_percent, disk_costs, discount |
| One-Time Filter: false |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
How do I get the line in the stack trace that triggers the segfault? Also, I can look into the minimal example later but I don't have much time right now.
@eloyekunle To get the line in the stack trace, you can use frame to select the frame you want to investigate and list to list where it broke. This is assuming that you have source available at the path that you built it for. I note that there is no line number in the backtrace, so it might be a problem with getting the line.
awesome, thanks!
The bug is triggered by the atthasmissing flag being set on a column, which happens if you alter a column to add a default and there are already rows that did not have a value for this column. To clear the flag, you can run VACUUM FULL on the hypertable.
You only need to vacuum the hypertable and not the chunks, so if the hypertable has a lot of data you can disable the routing of vacuum to chunks by setting timescaledb.restoring to ON using timescaledb_pre_restore and then run the VACUUM FULL on the hypertable.
Don't forget to enable timescaledb.restoring after by running timescaledb_post_restore.
Most helpful comment
The bug is triggered by the
atthasmissingflag being set on a column, which happens if you alter a column to add a default and there are already rows that did not have a value for this column. To clear the flag, you can runVACUUM FULLon the hypertable.You only need to vacuum the hypertable and not the chunks, so if the hypertable has a lot of data you can disable the routing of vacuum to chunks by setting
timescaledb.restoringtoONusingtimescaledb_pre_restoreand then run theVACUUM FULLon the hypertable.Don't forget to enable
timescaledb.restoringafter by runningtimescaledb_post_restore.