Timescaledb: Timescale never outperforms Postgres in my tests

Created on 20 Aug 2017  路  2Comments  路  Source: timescale/timescaledb

Hello ...
I am currently evaluating this extensions. Problem is, I fail to see any performance gain on any query.

Setup is

  • MacOS Sierra, timescale installed through brew
  • 30 Mio data points, 46 data columns (I now, that is a lot .. but this is how our data looks like)
  • This covers about 2 months worth of data
  • date is not really a timestamp but a date field (so selectivity there is rather poor)

Schema for the plain old postgres table is

        Column         |            Type             |           Modifiers           | Storage | Stats target | Description
-----------------------+-----------------------------+-------------------------------+---------+--------------+-------------
 date                  | timestamp with time zone    |                               | plain   |              |
 order_id              | integer                     |                               | plain   |              |
 campaign_id           | integer                     | not null                      | plain   |              |
 contentunit_id        | integer                     |                               | plain   |              |
 contentunitgroup_id   | integer                     | default 0                     | plain   |              |
 int_creative_id       | integer                     |                               | plain   |              |
 banner_id             | integer                     |                               | plain   |              |
 a_views               | integer                     |                               | plain   |              |
 r_views               | integer                     |                               | plain   |              |
 clicks                | integer                     |                               | plain   |              |
 tpm                   | double precision            |                               | plain   |              |
(more columns)
Indexes:
 "campaign_x_content_unit_old" btree (campaign_id, contentunit_id)
 "content_unit_old" btree (campaign_id)
  "date_old" btree (date)

timeseries was created with SELECT create_hypertable('timeseries', 'date','campaign_id', 3);

(exactly the same columns as above)
Indexes:
    "campaign_x_content_unit" btree (campaign_id, contentunit_id, date DESC)
    "timeseries_campaign_id_date_idx" btree (campaign_id, date DESC)
    "timeseries_date_idx" btree (date DESC)
Child tables: _timescaledb_internal._hyper_6_2_chunk,
              _timescaledb_internal._hyper_6_3_chunk,
            (+ 22 chunks)

Now doing something like this

select date, campaign_id,contentunit_id,avg(clicks),avg(r_views),avg(a_views) 
from timeseries where campaign_id = 2063006 group by date,campaign_id,contentunit_id order by date asc;

Yields 59k rows.
With the timeseries hypertable I get a response time of 3085,819 ms, Postgres gives me 2786,844 ms. Basically any query I do is about as fast or a little bit slower than plain Postgres.

Explain on Postgres gives me

 GroupAggregate  (cost=1544247.84..1591701.08 rows=1207177 width=112)
   Group Key: date, campaign_id, contentunit_id
   ->  Sort  (cost=1544247.84..1548008.93 rows=1504437 width=28)
         Sort Key: date, contentunit_id
         ->  Bitmap Heap Scan on static_campaign_raw  (cost=32639.95..1317891.64 rows=1504437 width=28)
               Recheck Cond: (campaign_id = 2063006)
               ->  Bitmap Index Scan on content_unit_old  (cost=0.00..32263.84 rows=1504437 width=0)
                     Index Cond: (campaign_id = 2063006)
(8 rows)

Timeseries

 GroupAggregate  (cost=128634.23..128641.66 rows=200 width=112)
   Group Key: times.date, times.campaign_id, times.contentunit_id
   ->  Sort  (cost=128634.23..128634.79 rows=225 width=28)
         Sort Key: times.date, times.contentunit_id
         ->  Append  (cost=0.00..128625.44 rows=225 width=28)
               ->  Seq Scan on times  (cost=0.00..0.00 rows=1 width=28)
                     Filter: ((campaign_id = 2063006) AND (_timescaledb_internal.get_partition_for_key((campaign_id)::text) = 60668338))
               ->  Bitmap Heap Scan on _hyper_19_29_chunk  (cost=28.65..3923.21 rows=5 width=28)
                     Recheck Cond: (campaign_id = 2063006)
                     Filter: (_timescaledb_internal.get_partition_for_key((campaign_id)::text) = 60668338)
                     ->  Bitmap Index Scan on "119-campaign_x_content_unit"  (cost=0.00..28.65 rows=1096 width=0)
                           Index Cond: (campaign_id = 2063006)
                    (plus 5 more Bitmap Heap scans over different chunks)

I fail to see why the queries with timeseriesdb are not any faster, to my understanding this should be exactly the type that see performance gains (according to the docs)

Also I find the number of chunks it scans way too high. Events for that campaign are between 2017-05-17 00:00:00+02 and 2017-07-31 00:00:00+02, so should it be at most 2 if the default time for a chunk if 1 month. I also tried to increase the chunk size (SELECT create_hypertable('times', 'date','campaign_id', 1, chunk_time_interval=>7777000000000);), so I get less sub-tables but still no visible performance gain

Any idea what I am doing wrong here?

Bests

Most helpful comment

Hi @mfreed ... thank you very much for the detailed help. With my datasets I can now confirm this:

  • For larger timespans (2 months) I get a query speed improvement of about factor 2 against plain postgres
  • For smaller timespans (2 weeks) it is about 4 to 5 times faster.

Very impressive!

All 2 comments

Hi @wirtsi:

A key way you'll get better performance in Timescale is to build smaller chunks (and therefore indexes on those chunks), touch less data in queries, etc.

So three things:

  1. Your query seems to actually touch all your data, so it's not getting any opportunity to do constraint exclusion and better take advantage of its chunking.
  2. The default config that we ship with is 1 month, so at best your test dataset will have 2 chunks.
  3. Generally better performance through date chunking, not spatial chunking, which you generally won't get better performance from unless you are using multiple disks. More explanation of chunking best practices here.

I'd try the following instead to test:

  • Set your chunk period to something like 1 week:
SELECT create_hypertable('timeseries', 'date', chunk_time_interval=>604800000000);

Then add the appropriate indexes as you seen to have in your table above.

  • Try a query to a 10-day subrange:
SELECT date, campaign_id, contentunit_id, avg(clicks), avg(r_views), avg(a_views)
  FROM timeseries 
  WHERE campaign_id = 2063006 AND date <= '2017-07-31' AND date >= '2017-07-21'
  GROUP BY date, campaign_id, contentunit_id
  ORDER BY date ASC;

If you are continuing to see weirdness, it might be easier to have this type of conversation on Slack.

Hi @mfreed ... thank you very much for the detailed help. With my datasets I can now confirm this:

  • For larger timespans (2 months) I get a query speed improvement of about factor 2 against plain postgres
  • For smaller timespans (2 weeks) it is about 4 to 5 times faster.

Very impressive!

Was this page helpful?
0 / 5 - 0 ratings