Clickhouse: How to query for the latest records in the table?

Created on 4 Dec 2017 · 6 Comments · Source: ClickHouse/ClickHouse

Hi~

ClickHouse is a good choice for analyzing log data in my case.
Our log data has a time column that is used as the primary key.
As you know, logs are stored sequentially in chronological order.

Is there a way to query for the latest records in the table without using order by?

Labels: question, st-need-info

All 6 comments

Hi @badvir !

You can filter for only the newest records by adding a time predicate to the WHERE clause (e.g. time >= now() - 3600 will keep only the last hour of records), and then use ORDER BY if you need strict chronological ordering.
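Putting that together (a sketch; the table name events and its time column are hypothetical, adjust to your schema):

```sql
-- Hypothetical table and column names.
SELECT *
FROM events
WHERE time >= now() - 3600   -- only the last hour of records
ORDER BY time DESC           -- strict chronological ordering, newest first
LIMIT 100
```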

Do you store data in the tables of MergeTree family? I assume yes, because you mention primary keys. Data in MergeTree tables is stored sorted according to the primary key. If your primary key starts with the time column, the select will be done efficiently. Unfortunately, ORDER BY does not take advantage of the index at the moment and will sort the data even if it is already stored in the needed order.

As you know, logs are stored sequentially in chronological order.

Not necessarily. Logs arrive in chronological order (well, almost), but you should store them so that your typical queries preserve data locality. E.g. if you often query the logs of a single service, you should put something like service_id in front of your primary key.
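A sketch of what such a schema might look like (table and column names are hypothetical, and this uses the current CREATE TABLE syntax):

```sql
-- Hypothetical schema. With ORDER BY (service_id, time), all rows for one
-- service are stored together on disk, sorted by time within that service.
CREATE TABLE logs
(
    event_date Date,
    service_id UInt32,
    time DateTime,
    message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (service_id, time)
```

If most queries instead scan all services for a recent time window, putting time first in the ORDER BY key would be the better trade-off.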

Since the most recently inserted data has most probably not been merged yet, selects that filter on the partitioning key go fast.

In my case, for realtime data from the console, I usually do something like:

SELECT *
FROM table_name
WHERE part_key = today()
  AND timestamp > now() - 600
ORDER BY timestamp DESC
LIMIT 10

It takes about 2 seconds in my case. (It does a full scan of the partitions holding today's data, filters them with WHERE, and then sorts.)

From scripts (I needed the latest record's data in checks) I do something similar, but with a subselect instead of today():

SELECT
    max(timestamp) AS timestamp,
    argMax(field1, timestamp) AS field1,
    argMax(field2, timestamp) AS field2
FROM table_name
WHERE part_key = (SELECT max(part_key) FROM table_name)

Takes about 2.5 seconds in my case (a full scan of the whole table to find max(part_key), plus a full scan of the partitions containing today's data).

And with small optimization (see #1540)

SELECT
    max(timestamp) AS timestamp,
    argMax(field1, timestamp) AS field1,
    argMax(field2, timestamp) AS field2
FROM table_name
WHERE part_key = (SELECT max(max_date) FROM system.parts WHERE table = 'table_name' AND database = currentDatabase())

It takes about 0.17 seconds in my case (one full scan of the partitions with the maximum date).

@badvir do you have any further questions?

Same question. And I think tuple + max works fine in this situation:

select max((ts, size)) as tsSize, filter_field from table group by filter_field
┌─max(tuple(ts, size))─────────┬─filter_field─┐
│ ('2018-09-18 08:04:45',1234) │ 105-project  │
└──────────────────────────────┴──────────────┘

Isn't it?

Using the same approach as @muravjov. I suppose it finds the max value of ts and returns the size column from the same row as the found ts.
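The same behavior can also be expressed directly with argMax (a sketch; the table and column names follow @muravjov's example above):

```sql
-- Hypothetical table/columns. argMax(size, ts) returns the value of `size`
-- from the row with the maximal `ts`, matching the tuple-max trick.
SELECT
    max(ts) AS latest_ts,
    argMax(size, ts) AS latest_size,
    filter_field
FROM table
GROUP BY filter_field
```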

You can combine the columns into a tuple and then split it apart at an outer query level:

select p, V.1 t, V.2 x, V.3 y
from
(
    select p, argMin((t, x, y), t) V
    from test
    group by p
)

https://gist.github.com/den-crane/fcc95468f37e40f9ff2a22cf3e250c0f
