Clickhouse: How to query for the latest records in the table?

Created on 4 Dec 2017 · 6 Comments · Source: ClickHouse/ClickHouse

Hi~

ClickHouse is a good choice for analyzing log data in my case.
Our log data has a time column that is used as the primary key.
As you know, logs are stored sequentially in chronological order.

Is there a way to query for the latest records in the table without using order by?

Labels: question, st-need-info

All 6 comments

Hi @badvir !

You can filter for only the newest records by adding a time predicate to the WHERE clause (e.g. time >= now() - 3600 will keep only the last hour of records), and then use ORDER BY if you need strict chronological ordering.
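Putting that together (a sketch; the table name events and its time column are hypothetical, adjust to your schema):

```sql
-- Hypothetical table and column names.
SELECT *
FROM events
WHERE time >= now() - 3600   -- only the last hour of records
ORDER BY time DESC           -- strict chronological ordering, newest first
LIMIT 100
```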

Do you store data in the tables of MergeTree family? I assume yes, because you mention primary keys. Data in MergeTree tables is stored sorted according to the primary key. If your primary key starts with the time column, the select will be done efficiently. Unfortunately, ORDER BY does not take advantage of the index at the moment and will sort the data even if it is already stored in the needed order.

As you know, logs are stored sequentially in chronological order.

Not necessarily. Logs arrive in chronological order (well, almost), but you should store them so that your typical queries preserve data locality. E.g. if you often query the logs of a single service, you should put something like service_id in front of your primary key.
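A sketch of what such a schema might look like (table and column names are hypothetical, and this uses the current CREATE TABLE syntax):

```sql
-- Hypothetical schema. With ORDER BY (service_id, time), all rows for one
-- service are stored together on disk, sorted by time within that service.
CREATE TABLE logs
(
    event_date Date,
    service_id UInt32,
    time DateTime,
    message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (service_id, time)
```

If most queries instead scan all services for a recent time window, putting time first in the ORDER BY key would be the better trade-off.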

Since the most recently inserted data has most probably not been merged yet, selects that filter on the partitioning key go fast.

In my case, for realtime data from the console, I usually do something like:

SELECT *
FROM table_name
WHERE part_key = today()
  AND timestamp > now() - 600
ORDER BY timestamp DESC
LIMIT 10

It takes about 2 seconds in my case. (It does a full scan of the partitions holding today's data, filters them with WHERE, and then sorts.)

From scripts (I needed the latest record's data in checks) I do something similar, but with a subselect instead of today():

SELECT
    max(timestamp) AS timestamp,
    argMax(field1, timestamp) AS field1,
    argMax(field2, timestamp) AS field2
FROM table_name
WHERE part_key = (SELECT max(part_key) FROM table_name)

Takes about 2.5 seconds in my case (a full scan of the whole table to find max(part_key), plus a full scan of the partitions containing today's data).

And with small optimization (see #1540)

SELECT
    max(timestamp) AS timestamp,
    argMax(field1, timestamp) AS field1,
    argMax(field2, timestamp) AS field2
FROM table_name
WHERE part_key = (SELECT max(max_date) FROM system.parts WHERE table = 'table_name' AND database = currentDatabase())

It takes about 0.17 seconds in my case (one full scan of the partitions with the maximum date).

@badvir do you have any further questions?

Same question. And I think tuple + max works fine in this situation:

select max((ts, size)) as tsSize, filter_field from table group by filter_field
┌─max(tuple(ts, size))─────────┬─filter_field─┐
│ ('2018-09-18 08:04:45',1234) │ 105-project  │
└──────────────────────────────┴──────────────┘

Isn't it?

Using the same approach as @muravjov. I suppose it finds the max value of ts and returns the size column from the same row as the found ts.
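The same behavior can also be expressed directly with argMax (a sketch; the table and column names follow @muravjov's example above):

```sql
-- Hypothetical table/columns. argMax(size, ts) returns the value of `size`
-- from the row with the maximal `ts`, matching the tuple-max trick.
SELECT
    max(ts) AS latest_ts,
    argMax(size, ts) AS latest_size,
    filter_field
FROM table
GROUP BY filter_field
```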

You can combine the columns into a tuple and then split it apart at an outer query level:

select p, V.1 t, V.2 x, V.3 y
from
(
    select p, argMin((t, x, y), t) V
    from test
    group by p
)

https://gist.github.com/den-crane/fcc95468f37e40f9ff2a22cf3e250c0f
