Timescaledb: segmentation fault upon upgrade to 2.1.0

Created on 18 Mar 2021 · 7Comments · Source: timescale/timescaledb

Relevant system information:

OS: Ubuntu 20.10
PostgreSQL version: 11
TimescaleDB version: 2.1.0
Installation method: Docker, apt install and built from source

Describe the bug
Upon upgrade to version 2.1.0, we now get a segmentation fault error in one of our queries.

We upgraded from 1.7.4.

Error happens when I run this query:

INSERT INTO rollup_1d_node_metric(time, cluster_id, node_id, cpu_cost, total_cost)
SELECT time_bucket('1 day', m.end_time),
       cluster_id,
       node_id,
       sum(cpu_cost),
       sum(total_cost)
FROM node_metric mm
         INNER JOIN metric m ON mm.metric_id = m.id
WHERE mm.end_time BETWEEN time_bucket('1 day', '2019-07-27 10:03:47.177372'::timestamptz) AND CURRENT_TIMESTAMP
  AND m.end_time BETWEEN time_bucket('1 day', '2019-07-27 10:03:47.177372'::timestamptz) AND CURRENT_TIMESTAMP
GROUP BY 1, 2, 3
ON CONFLICT (time, cluster_id, node_id)
    DO UPDATE
    SET cpu_cost            = excluded.cpu_cost,
        total_cost          = excluded.total_cost;

Stack trace:

elijah@elijah-ThinkPad-T480:~/Programs/timescaledb-2.1.0/build$ sudo coredumpctl gdb
           PID: 20295 (postgres)
           UID: 124 (postgres)
           GID: 131 (postgres)
        Signal: 11 (SEGV)
     Timestamp: Thu 2021-03-18 15:22:06 -04 (35s ago)
  Command Line: postgres: 11/main: elijah replex 127.0.0.1(48846) INSERT
    Executable: /usr/lib/postgresql/11/bin/postgres
 Control Group: /system.slice/system-postgresql.slice/[email protected]
          Unit: [email protected]
         Slice: system-postgresql.slice
       Boot ID: 685e0b94f39444b5895002698dd4c99a
    Machine ID: f7db69a1e32e431eb89aa481c8a36d31
      Hostname: elijah-ThinkPad-T480
       Storage: /var/lib/systemd/coredump/core.postgres.124.685e0b94f39444b5895002698dd4c99a.20295.1616095326000000.zst
       Message: Process 20295 (postgres) of user 124 dumped core.

                Stack trace of thread 20295:
                #0  0x00007fe57d7986e8 ts_chunk_insert_state_create (timescaledb-2.1.0.so + 0x306e8)
                #1  0x00007fe57d79596f ts_chunk_dispatch_get_chunk_insert_state (timescaledb-2.1.0.so + 0x2d96f)
                #2  0x00007fe57d795d1e chunk_dispatch_exec (timescaledb-2.1.0.so + 0x2dd1e)
                #3  0x00005655436b619f ExecProcNode (postgres + 0x2b419f)
                #4  0x000056554369250c ExecProcNode (postgres + 0x29050c)
                #5  0x00005655437e4e27 ProcessQuery (postgres + 0x3e2e27)
                #6  0x00005655437e5130 PortalRunMulti (postgres + 0x3e3130)
                #7  0x00005655437e5c4e PortalRun (postgres + 0x3e3c4e)
                #8  0x00005655437e30c6 exec_execute_message (postgres + 0x3e10c6)
                #9  0x0000565543771bee BackendRun (postgres + 0x36fbee)
                #10 0x0000565543772a88 PostmasterMain (postgres + 0x370a88)
                #11 0x000056554350ce84 main (postgres + 0x10ae84)
                #12 0x00007fe782669cb2 __libc_start_main (libc.so.6 + 0x28cb2)
                #13 0x000056554350cf2e _start (postgres + 0x10af2e)

GNU gdb (Ubuntu 9.2-0ubuntu2) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/11/bin/postgres...
Reading symbols from /usr/lib/debug/.build-id/0d/08baadc301560a5ca98f1f69aa35464511f4ec.debug...
[New LWP 20295]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 11/main: elijah replex 127.0.0.1(48846) INSERT                      '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe57d7986e8 in ts_chunk_insert_state_create () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
(gdb) bt
#0  0x00007fe57d7986e8 in ts_chunk_insert_state_create () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
#1  0x00007fe57d79596f in ts_chunk_dispatch_get_chunk_insert_state () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
#2  0x00007fe57d795d1e in chunk_dispatch_exec () from /usr/lib/postgresql/11/lib/timescaledb-2.1.0.so
#3  0x00005655436b619f in ExecProcNode (node=0x56554422e088) at ./build/../src/include/executor/executor.h:248
#4  ExecModifyTable (pstate=0x56554422d700) at ./build/../src/backend/executor/nodeModifyTable.c:2025
#5  0x000056554369250c in ExecProcNode (node=0x56554422d320) at ./build/../src/include/executor/executor.h:248
#6  ExecutePlan (execute_once=<optimized out>, dest=0x565543b75480 <donothingDR>, direction=<optimized out>, numberTuples=0, 
    sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x56554422d320, 
    estate=0x56554422cb20) at ./build/../src/backend/executor/execMain.c:1712
#7  standard_ExecutorRun (queryDesc=0x5655441ecce0, direction=<optimized out>, count=0, execute_once=<optimized out>)
    at ./build/../src/backend/executor/execMain.c:353
#8  0x00005655437e4e27 in ProcessQuery (plan=0x565544274958, 
    sourceText=0x56554420e1e8 "INSERT INTO rollup_1d_node_metric(time, cluster_id, node_id, cpu_cost, total_cost, memory_bytes,\n", ' ' <repeats 34 times>, "memory_used_percent, cpu_cores, cpu_used_percent, disk_costs, discoun"..., params=0x0, queryEnv=0x0, 
    dest=<optimized out>, completionTag=0x7ffec8874bf0 "") at ./build/../src/backend/tcop/pquery.c:161
#9  0x00005655437e5130 in PortalRunMulti (portal=portal@entry=0x565544195530, isTopLevel=isTopLevel@entry=true, 
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x565543b75480 <donothingDR>, dest@entry=0x5655440f68e0, 
    altdest=0x565543b75480 <donothingDR>, altdest@entry=0x5655440f68e0, completionTag=completionTag@entry=0x7ffec8874bf0 "")
    at ./build/../src/backend/tcop/pquery.c:1286
#10 0x00005655437e5c4e in PortalRun (portal=0x565544195530, count=9223372036854775807, isTopLevel=<optimized out>, 
    run_once=<optimized out>, dest=0x5655440f68e0, altdest=0x5655440f68e0, completionTag=0x7ffec8874bf0 "")
    at ./build/../src/backend/tcop/pquery.c:799
#11 0x00005655437e30c6 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x565544149328, dbname=<optimized out>, 
    username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:2036
#12 0x0000565543771bee in BackendRun (port=0x565544146de0) at ./build/../src/backend/postmaster/postmaster.c:4409
#13 BackendStartup (port=0x565544146de0) at ./build/../src/backend/postmaster/postmaster.c:4073
#14 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1728
#15 0x0000565543772a88 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1401
#16 0x000056554350ce84 in main (argc=5, argv=0x5655440f0fa0) at ./build/../src/backend/main/main.c:228
(gdb)

bug need-more-info segfault

Source

eloyekunle

Most helpful comment

The bug is triggered by the atthasmissing flag being set on a column, which happens if you alter a column to add a default and there are already rows that did not have a value for this column. To clear the flag, you can run VACUUM FULL on the hypertable.

You only need to vacuum the hypertable and not the chunks, so if the hypertable has a lot of data you can disable the routing of vacuum to chunks by setting timescaledb.restoring to ON using timescaledb_pre_restore and then run the VACUUM FULL on the hypertable.

Don't forget to enable timescaledb.restoring after by running timescaledb_post_restore.

mkindahl on 26 Mar 2021

👍2

All 7 comments

I did a little digging and error does not occur when I change ON CONFLICT DO UPDATE to ON CONFLICT DO NOTHING.

Also, when I change our build system to use timescaledb 2.0.2, error does not occur.

eloyekunle on 18 Mar 2021

Could you provide a minimal testcase that fails on 2.1.0 in a fresh database.

svenklemm on 20 Mar 2021

In the debug trace above, can you show the line that triggers the segfault?

mkindahl on 22 Mar 2021

Here is the query plan generated for the query:

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Custom Scan (HypertableInsert)  (cost=0.00..0.04 rows=1 width=296)                                                                                                                                                                             |
|  ->  Insert on public.rollup_1d_node_metric  (cost=0.00..0.04 rows=1 width=296)                                                                                                                                                               |
|        Conflict Resolution: UPDATE                                                                                                                                                                                                            |
|        Conflict Arbiter Indexes: rollup_1d_node_metric_pkey                                                                                                                                                                                   |
|        ->  Custom Scan (ChunkDispatch)  (cost=0.00..0.04 rows=1 width=296)                                                                                                                                                                    |
|              Output: "*SELECT*".time_bucket, "*SELECT*".cluster_id, "*SELECT*".node_id, "*SELECT*".sum, "*SELECT*".sum_1, "*SELECT*".avg, "*SELECT*".avg_1, "*SELECT*".avg_2, "*SELECT*".avg_3, "*SELECT*".sum_2, "*SELECT*".avg_4            |
|              ->  HashAggregate  (cost=0.00..0.03 rows=1 width=296)                                                                                                                                                                            |
|                    Output: (time_bucket('1 day'::interval, end_time)), cluster_id, node_id, sum(cpu_cost), sum(total_cost), avg(memory_bytes), avg(memory_used_percent), avg(cpu_cores), avg(cpu_used_percent), sum(disk_costs), avg(discount)|
|                    Group Key: time_bucket('1 day'::interval, end_time), cluster_id, node_id                                                                                                                                                   |
|                    ->  Result  (cost=0.00..0.00 rows=0 width=212)                                                                                                                                                                             |
|                          Output: time_bucket('1 day'::interval, end_time), cluster_id, node_id, cpu_cost, total_cost, memory_bytes, memory_used_percent, cpu_cores, cpu_used_percent, disk_costs, discount                                    |
|                          One-Time Filter: false                                                                                                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

How do I get the line in the stack trace that triggers the segfault? Also, I can look into the minimal example later but I don't have much time right now.

eloyekunle on 22 Mar 2021

@eloyekunle To get the line in the stack trace, you can use frame to select the frame you want to investigate and list to list where it broke. This is assuming that you have source available at the path that you built it for. I note that there is no line number in the backtrace, so it might be a problem with getting the line.

mkindahl on 22 Mar 2021

awesome, thanks!

eloyekunle on 25 Mar 2021