Timescaledb: Last() function is not optimal and scans all rows in a chunk

Created on 10 Jul 2018 · 17Comments · Source: timescale/timescaledb

The last() function used below is not optimal and scans all rows instead of shortcircuiting
SELECT stage_id, last(outputrecsrate,time) as m1_rate from stage_datapoint_v2 inner join series_v2 on series_v2.dp_id = stage_datapoint_v2.id and org_id = {} and job_Id = {} and sdc_id = {} and time > now() - INTERVAL '7 days' group by stage_id

Currently the workaround is to use a lateral join instead:
select stage_id, m1_rate from (select stage_id, dp_id from series_v2 where org_id = {} and job_Id = {} and sdc_id = {}) s1 inner join lateral (SELECT outputrecsrate as m1_rate from {} where id = s1.dp_id and time > timestamp {} - INTERVAL {} order by time desc limit 1) s2 on true

enhancement

Source

viragkothari

👍8

Most helpful comment

@dianasaur323 hi, is this issue still open ?

k30r88 on 21 Jan 2020

👍5

All 17 comments

@wmay you also liked this issue. are you having a similar problem?

dianasaur323 on 10 Jul 2018

@dianasaur323 I need to get the time of the most recent data in my database to decide what new data to add. If I use a command like

select max(time) from table;

it's painfully slow, as TimescaleDB searches through all the table chunks for the maximum. I'm sure TimescaleDB could use chunk information to make this query much faster, though I don't know how hard that would be to implement. After seeing this post I also tried

select last(time, time) from table;

and it seems to be equally slow. At the moment my workaround is to subset the data by time before running max(), when I can-- similar to @viragkothari's workaround (if I understand the post correctly). It would definitely be convenient if TimescaleDB dealt with these queries more intelligently.

wmay on 10 Jul 2018

@wmay noted! we will include your feedback when we make a fix for this. I think @viragkothari also used limit 1 so that you don't scan everything, in addition to subsetting the data by time.

dianasaur323 on 10 Jul 2018

@wmay Can you post the EXPLAIN call of the top? I believe offhand that max should be able to use an index (unless you don't actually have an index just on time)?

Now, the above should check each chunk, because it doesn't know how to do constraint exclusion with the above command, but the lookup on each chunk should be quite fast if using a simple time index.

mfreed on 22 Jul 2018

For context, the primary use of last has been when applied to a time_bucket aggregate, which is why it currently doesn't use an index for lookup.

mfreed on 22 Jul 2018

@mfreed You're right, I only have a compound index of time with another column. That didn't even occur to me.

wmay on 22 Jul 2018

@wmay it seems we have a similar use case - getting the last inserted data point by time. Our tables typically use a 2-column constraint UNIQUE (time, id). Using max or last hasn't occurred to me before, therefore my "last data point" queries look as follows: SELECT time FROM meters ORDER BY time DESC LIMIT 1;. As far as I understood, SELECT statements do make use of chunk tables so this approach should be faster.

AndyMender on 22 Jul 2018

@AndyMender Your suggested query takes about the same amount of time as max and last on my table. This is with TimescaleDB 0.10.1.

wmay on 22 Jul 2018

@wmay You might try the following function and see if this works for you, for less then 10 chunks the max approach is probably faster but for a lot of chunks the function should pull ahead because the function iterates over chunks and stops on the first hit while max without time constraint will always do an index scan in all chunks. The function could also be optimized to drop the dynamic sql by hardcoding the table name.

CREATE OR REPLACE FUNCTION last_insert(hypertable text, time_column text)
RETURNS timestamptz AS
$$
DECLARE
  last timestamptz;
  chunk RECORD;
BEGIN

  FOR chunk IN
    SELECT c.schema_name,c.table_name
FROM _timescaledb_catalog.chunk c
INNER JOIN _timescaledb_catalog.hypertable h ON h.id = c.hypertable_id AND h.table_name = hypertable
ORDER BY c.id DESC
  LOOP
    EXECUTE 'SELECT max('|| quote_ident(time_column)||') FROM ONLY ' || quote_ident(chunk.schema_name) || '.' || quote_ident(chunk.table_name)  INTO last;
    IF last IS NOT NULL THEN
      RETURN last;
    END IF;
  END LOOP;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql STABLE STRICT;

Use like this:

SELECT last_insert('cpu','time');

svenklemm on 10 Aug 2018

👍1

With 1.1.0, we introduced an optimization for first() last() so that it would use indexes. Quick note that this will not work for queries with group bys. Would be great to hear if you see improved performance with this release.

dianasaur323 on 14 Dec 2018

Explicitly tagging @viragkothari and @wmay if they have any additional thoughts/comments.

mfreed on 4 Jan 2019

I triple this issue, I want to find the last inserted timestamp for each symbol where I am trying to store 1m ohlc data and the table looks quite simple

symbol text
ts timestamptz
o,h,l,c,v in real
primary key(symbol, ts)

But when I do a

select s, max(ts) from ohlc group by s

it scans every chunk
I did read the statement above where @dianasaur323 says it will not work with group bys
How do I run such a query to only contain the last chunk

explain select last(ts, ts ) from ohlc_60 ;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Aggregate  (cost=404.31..404.32 rows=1 width=8)
   ->  Append  (cost=0.00..369.69 rows=13847 width=8)
         ->  Seq Scan on ohlc_60  (cost=0.00..0.00 rows=1 width=8)
         ->  Seq Scan on _hyper_77_250_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_251_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_252_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_253_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_254_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_255_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_256_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_257_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_258_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_259_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_260_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_261_chunk  (cost=0.00..18.90 rows=890 width=8)
         ->  Seq Scan on _hyper_77_262_chunk  (cost=0.00..17.00 rows=700 width=8)
         ->  Seq Scan on _hyper_77_263_chunk  (cost=0.00..7.29 rows=329 width=8)
         ->  Seq Scan on _hyper_77_264_chunk  (cost=0.00..7.27 rows=327 width=8)
         ->  Seq Scan on _hyper_77_265_chunk  (cost=0.00..7.33 rows=333 width=8)
         ->  Seq Scan on _hyper_77_266_chunk  (cost=0.00..3.11 rows=111 width=8)
         ->  Seq Scan on _hyper_77_267_chunk  (cost=0.00..7.33 rows=333 width=8)
         ->  Seq Scan on _hyper_77_268_chunk  (cost=0.00..7.33 rows=333 width=8)
         ->  Seq Scan on _hyper_77_269_chunk  (cost=0.00..17.00 rows=700 width=8)

slidenerd on 6 Mar 2019

Sorry about the slow response. I didn't have access to the database when I was mentioned earlier. I don't have much in the way of a comment, just adding a test case here.

I put together a database of weather instrument data with 255 million rows which is adding about 350,000 new rows each day. It's currently using postgres 9.6.12. I want to get the newest time to help identify new data, so I use max(time) or last(time, time) to find that. I have a primary key on time and lidar_id. The TimescaleDB chunks have a time period of 1 day.

I tried the queries before and after updating to the most recent version of TimescaleDB. I started with timescaledb v. 0.11.0, and max(time) takes about 22 minutes on my primary data table:

lidar=# \dx timescaledb
                                    List of installed extensions
    Name     | Version | Schema |                            Description                            
-------------+---------+--------+-------------------------------------------------------------------
 timescaledb | 0.11.0  | public | Enables scalable inserts and complex queries for time-series data
(1 row)

lidar=# \timing
Timing is on.
lidar=# select max(time) from profiles;
           max           
-------------------------
 2019-04-08 16:24:16.747
(1 row)

Time: 1325840.121 ms (22:05.840)

Then I upgraded to timescaledb v. 1.3.0, and it appears to be about the same:

lidar=# \dx timescaledb
                                    List of installed extensions
    Name     | Version | Schema |                            Description                            
-------------+---------+--------+-------------------------------------------------------------------
 timescaledb | 1.3.0   | public | Enables scalable inserts and complex queries for time-series data
(1 row)

lidar=# \timing
Timing is on.
lidar=# select max(time) from profiles;
           max           
-------------------------
 2019-04-08 16:24:16.747
(1 row)

Time: 1321221.151 ms (22:01.221)

The result of explain is a mixture of sequential scans and index only scans:

explain select max(time) from profiles;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=57375858.53..57375858.54 rows=1 width=8)
   ->  Append  (cost=0.00..56733675.78 rows=256873100 width=8)
         ->  Seq Scan on _hyper_1_100_chunk  (cost=0.00..5752.21 rows=22121 width=8)
         ->  Seq Scan on _hyper_1_101_chunk  (cost=0.00..5757.41 rows=22141 width=8)
         ->  Seq Scan on _hyper_1_102_chunk  (cost=0.00..5756.39 rows=22139 width=8)
         ->  Seq Scan on _hyper_1_103_chunk  (cost=0.00..5752.22 rows=22122 width=8)
         ->  Seq Scan on _hyper_1_105_chunk  (cost=0.00..5755.35 rows=22135 width=8)
         ->  Seq Scan on _hyper_1_106_chunk  (cost=0.00..5738.70 rows=22070 width=8)
         ->  Seq Scan on _hyper_1_107_chunk  (cost=0.00..5723.10 rows=22010 width=8)
         ->  Seq Scan on _hyper_1_108_chunk  (cost=0.00..5711.65 rows=21965 width=8)
         ->  Seq Scan on _hyper_1_109_chunk  (cost=0.00..5718.96 rows=21996 width=8)
         ->  Seq Scan on _hyper_1_10_chunk  (cost=0.00..5672.34 rows=22134 width=8)
         ->  Seq Scan on _hyper_1_110_chunk  (cost=0.00..5717.90 rows=21990 width=8)
         ->  Seq Scan on _hyper_1_111_chunk  (cost=0.00..5706.45 rows=21945 width=8)
         ->  Seq Scan on _hyper_1_112_chunk  (cost=0.00..5755.33 rows=22133 width=8)
         ->  Seq Scan on _hyper_1_113_chunk  (cost=0.00..5736.63 rows=22063 width=8)
         ->  Seq Scan on _hyper_1_114_chunk  (cost=0.00..5751.20 rows=22120 width=8)
         ->  Seq Scan on _hyper_1_115_chunk  (cost=0.00..5737.67 rows=22067 width=8)
[ ... about a thousand lines omitted ... ]
         ->  Index Only Scan using "2628_15857_profiles_pkey" on _hyper_1_2628_chunk  (cost=0.42..16284.47 rows=356270 width=8)
         ->  Index Only Scan using "2630_15863_profiles_pkey" on _hyper_1_2630_chunk  (cost=0.42..15988.93 rows=335767 width=8)
         ->  Index Only Scan using "2632_15869_profiles_pkey" on _hyper_1_2632_chunk  (cost=0.42..16853.13 rows=393647 width=8)
         ->  Seq Scan on _hyper_1_2634_chunk  (cost=0.00..163166.40 rows=347140 width=8)
         ->  Index Only Scan using "2636_15881_profiles_pkey" on _hyper_1_2636_chunk  (cost=0.42..158273.90 rows=349773 width=8)
         ->  Index Only Scan using "2638_15887_profiles_pkey" on _hyper_1_2638_chunk  (cost=0.42..16095.83 rows=342094 width=8)
         ->  Index Only Scan using "2640_15893_profiles_pkey" on _hyper_1_2640_chunk  (cost=0.42..11238.72 rows=262553 width=8)
(1108 rows)

Time: 1667.347 ms (00:01.667)

I got the same explanation for last(time, time), so I didn't try to run it.

wmay on 10 May 2019

@svenklemm we fixed this issue in 1.4 right?

dianasaur323 on 16 Jul 2019

@dianasaur323 hi, is this issue still open ?

k30r88 on 21 Jan 2020

👍5

Closing this since first()/last() have optimizations now that don't require scanning all rows under certain conditions. Feel free to reopen if the issue is still relevant.

svenklemm on 13 Oct 2020

Closing this since first()/last() have optimizations now that don't require scanning all rows under certain conditions. Feel free to reopen if the issue is still relevant.

Hi @svenklemm Could you explain or detail these new optimizations and what version of Timescale they are in?