Hello ...
I am currently evaluating this extensions. Problem is, I fail to see any performance gain on any query.
Setup is
Schema for the plain old postgres table is
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+-------------------------------+---------+--------------+-------------
date | timestamp with time zone | | plain | |
order_id | integer | | plain | |
campaign_id | integer | not null | plain | |
contentunit_id | integer | | plain | |
contentunitgroup_id | integer | default 0 | plain | |
int_creative_id | integer | | plain | |
banner_id | integer | | plain | |
a_views | integer | | plain | |
r_views | integer | | plain | |
clicks | integer | | plain | |
tpm | double precision | | plain | |
(more columns)
Indexes:
"campaign_x_content_unit_old" btree (campaign_id, contentunit_id)
"content_unit_old" btree (campaign_id)
"date_old" btree (date)
timeseries was created with SELECT create_hypertable('timeseries', 'date','campaign_id', 3);
(exactly the same columns as above)
Indexes:
"campaign_x_content_unit" btree (campaign_id, contentunit_id, date DESC)
"timeseries_campaign_id_date_idx" btree (campaign_id, date DESC)
"timeseries_date_idx" btree (date DESC)
Child tables: _timescaledb_internal._hyper_6_2_chunk,
_timescaledb_internal._hyper_6_3_chunk,
(+ 22 chunks)
Now doing something like this
select date, campaign_id,contentunit_id,avg(clicks),avg(r_views),avg(a_views)
from timeseries where campaign_id = 2063006 group by date,campaign_id,contentunit_id order by date asc;
Yields 59k rows.
With the timeseries hypertable I get a response time of 3085,819 ms, Postgres gives me 2786,844 ms. Basically any query I do is about as fast or a little bit slower than plain Postgres.
Explain on Postgres gives me
GroupAggregate (cost=1544247.84..1591701.08 rows=1207177 width=112)
Group Key: date, campaign_id, contentunit_id
-> Sort (cost=1544247.84..1548008.93 rows=1504437 width=28)
Sort Key: date, contentunit_id
-> Bitmap Heap Scan on static_campaign_raw (cost=32639.95..1317891.64 rows=1504437 width=28)
Recheck Cond: (campaign_id = 2063006)
-> Bitmap Index Scan on content_unit_old (cost=0.00..32263.84 rows=1504437 width=0)
Index Cond: (campaign_id = 2063006)
(8 rows)
Timeseries
GroupAggregate (cost=128634.23..128641.66 rows=200 width=112)
Group Key: times.date, times.campaign_id, times.contentunit_id
-> Sort (cost=128634.23..128634.79 rows=225 width=28)
Sort Key: times.date, times.contentunit_id
-> Append (cost=0.00..128625.44 rows=225 width=28)
-> Seq Scan on times (cost=0.00..0.00 rows=1 width=28)
Filter: ((campaign_id = 2063006) AND (_timescaledb_internal.get_partition_for_key((campaign_id)::text) = 60668338))
-> Bitmap Heap Scan on _hyper_19_29_chunk (cost=28.65..3923.21 rows=5 width=28)
Recheck Cond: (campaign_id = 2063006)
Filter: (_timescaledb_internal.get_partition_for_key((campaign_id)::text) = 60668338)
-> Bitmap Index Scan on "119-campaign_x_content_unit" (cost=0.00..28.65 rows=1096 width=0)
Index Cond: (campaign_id = 2063006)
(plus 5 more Bitmap Heap scans over different chunks)
I fail to see why the queries with timeseriesdb are not any faster, to my understanding this should be exactly the type that see performance gains (according to the docs)
Also I find the number of chunks it scans way too high. Events for that campaign are between 2017-05-17 00:00:00+02 and 2017-07-31 00:00:00+02, so should it be at most 2 if the default time for a chunk if 1 month. I also tried to increase the chunk size (SELECT create_hypertable('times', 'date','campaign_id', 1, chunk_time_interval=>7777000000000);), so I get less sub-tables but still no visible performance gain
Any idea what I am doing wrong here?
Bests
Hi @wirtsi:
A key way you'll get better performance in Timescale is to build smaller chunks (and therefore indexes on those chunks), touch less data in queries, etc.
So three things:
I'd try the following instead to test:
SELECT create_hypertable('timeseries', 'date', chunk_time_interval=>604800000000);
Then add the appropriate indexes as you seen to have in your table above.
SELECT date, campaign_id, contentunit_id, avg(clicks), avg(r_views), avg(a_views)
FROM timeseries
WHERE campaign_id = 2063006 AND date <= '2017-07-31' AND date >= '2017-07-21'
GROUP BY date, campaign_id, contentunit_id
ORDER BY date ASC;
If you are continuing to see weirdness, it might be easier to have this type of conversation on Slack.
Hi @mfreed ... thank you very much for the detailed help. With my datasets I can now confirm this:
Very impressive!
Most helpful comment
Hi @mfreed ... thank you very much for the detailed help. With my datasets I can now confirm this:
Very impressive!