Clickhouse: Statistics improvements

Created on 29 Nov 2016  ·  16Comments  ·  Source: ClickHouse/ClickHouse

I think it'd be a little useful to send statistics in the HTTP header with the response (regardless of the FORMAT given).

Stats are currently sent only for JSON, JSONCompact and XML formats. It would be useful in some cases to get the stats while using format like TabSeparated or CSV.

Something like that:

$ curl -i "http://127.0.0.1:8123/?query=SELECT+*+FROM+system.numbers+LIMIT+10+FORMAT+Null"
HTTP/1.1 200 OK
Date: Tue, 29 Nov 2016 12:25:20 GMT
Connection: Keep-Alive
Content-Type: text/plain; charset=UTF-8
Transfer-Encoding: chunked
X-ClickHouse-rows: 10
X-ClickHouse-rows-before-limit-at-least: 10
X-ClickHouse-statistics-elapsed: 0.000107524
X-ClickHouse-statistics-rows-read: 10
X-ClickHouse-statistics-bytes-read: 80

Also, would it be OK to include query's memory usage to the stats?

comp-http enhancement

Most helpful comment

I had exactly the same question, but it makes sense that the stats are only available at the end of the response.

I would like to vote for the Third option to use a trailer but only enable it when you specifiy a optional query parameter. e.g. &withTrailerStats so that client that do support it can enable it. Make a great usecase for streaming clients to collect some stats about the queries

All 16 comments

About statistics: good idea in general, but I already had some considerations about it.

  1. Statistics are known only at end of query execution. But at end of query execution, we could have already sent some data to client. There are "streaming" queries, when data is output when it is read from a table. And even for queries with GROUP BY, in case of large cardinality, result will be prepared and sent in multiple pieces.

Several solutions possible.
First, we could write statistics in headers only if we had not sent any data yet. This solution is not suitable, because it will introduce unpredictable and difficult to understand behaviour.
Second, we could add explicit option to buffer results and write statisitcs in headers, if it was enabled. Probably good solution.
Third, we could use trailers instead of headers. I think, it is not viable, because not many client libraries support trailers.

  1. Consider we will add some CPU, IO stats (already planned), and later we will add stats for each participating server, each subquery and thread. Then statistics will become wide and nested. If we want nesting in HTTP Headers, we could use JSON. And if we will use JSON, it is more clear to use single header field. So, we are going to add single header, like X-ClickHouse-Statistics. The only discrepancy is that JSON will be used regardless to output format. But that is Ok.

Currently we have a task to add setting to control buffering of query result.

About memory stats. We already have a task to implement it in progress info of clickhouse-client. We need to collect memory usage on each participating server. And for client, probably we will output maximum memory usage across all servers (because max_memory_usage is per-server limit), and maybe also total memory usage.

I had exactly the same question, but it makes sense that the stats are only available at the end of the response.

I would like to vote for the Third option to use a trailer but only enable it when you specifiy a optional query parameter. e.g. &withTrailerStats so that client that do support it can enable it. Make a great usecase for streaming clients to collect some stats about the queries

Привет. Также интересует возможность получать информацию о ходе выполнения запроса через http.

@Dimon70007 Включите настройку send_progress_in_http_headers. Пока не началась передача результата, ClickHouse будет отсылать заголовок вида

X-ClickHouse-Progress: {"read_rows":"7471104","read_bytes":"74711040","total_rows":"455532544"}

@ztlpn Спасибо за помощь. Добавил <send_progress_in_http_headers>true</send_progress_in_http_headers>
в config.xml - присылает обычные заголовки до получения ответа.
А настройка <output_format_json_quote_64bit_integers>false</output_format_json_quote_64bit_integers>
может как то влиять на отправку headers?

Это пользовательская настройка (https://clickhouse.yandex/docs/ru/operations/settings/index.html). Их часто путают с параметрами, задаваемыми в config.xml (наверно, добавим проверку, что они не перепутаны).

В вашем случае достаточно добавить &send_progress_in_http_headers=1 в HTTP-запрос и должно заработать. Или настройку можно включить в конфиге users.xml (только осторожнее, не все клиенты хорошо воспринимают большое количество заголовков).

``` $ echo "

SELECT
    id,
    count() as count
FROM 
(
    SELECT 
        number AS id, 
        toDateTime('2017-10-30 00:02:03') + (id * 1000) AS begin, 
        toDateTime('2017-10-30 00:02:03') + (id * 2000) AS end
    FROM system.numbers 
    LIMIT 30000
) 
ARRAY JOIN timeSlots(begin, toUInt32(end - begin)) AS ts
GROUP BY id
ORDER BY count desc
LIMIT 50

" | curl -v -u default:qwerty 'http://127.0.0.1:8123?send_progress_in_http_headers=1&http_headers_progress_interval_ms=100' -d @-

  • Rebuilt URL to: http://127.0.0.1:8123/?send_progress_in_http_headers=1&http_headers_progress_interval_ms=100
  • Hostname was NOT found in DNS cache
  • Trying 127.0.0.1...
  • Connected to 127.0.0.1 (127.0.0.1) port 8123 (#0)
  • Server auth using Basic with user 'default'
    POST /?send_progress_in_http_headers=1&http_headers_progress_interval_ms=100 HTTP/1.1
    Authorization: Basic ZGVmYXVsdDpxd2VydHk=
    User-Agent: curl/7.35.0
    Host: 127.0.0.1:8123
    Accept: /
    Content-Length: 400
    Content-Type: application/x-www-form-urlencoded
  • upload completely sent off: 400 out of 400 bytes
    < HTTP/1.1 200 OK
    < Date: Tue, 31 Oct 2017 21:39:40 GMT
    < Connection: Keep-Alive
    < Content-Type: text/tab-separated-values; charset=UTF-8
    < Transfer-Encoding: chunked
    < Keep-Alive: timeout=3
    <
    29999 16667
    29998 16667
    ...
    `` ответ приходит как положено, headers тоже вроде получаю, а вотX-ClickHouse-Progressпочему-то не приходит. Может это из-за того что шлю post запрос и указываю query? П.С. проверил на дефолтной конфигурации (разве что добавилpasswordиsend_progress_in_http_headers` для профиля default). П.П.С. спасибо за помощь начинающим кликхаусятам))

С system.numbers прогресс не работает (попробуйте задать запрос в clickhouse-client, прогрессбара тоже не будет, только скорость выполнения). Должно работать с нормальной таблицей, для которой можно оценить число строк, которые нужно прочитать и обработать (в вашем запросе можно было бы оценить из значения LIMIT, но это не реализовано).

прописал
<send_progress_in_http_headers>1</send_progress_in_http_headers>
<http_headers_progress_interval_ms>500</http_headers_progress_interval_ms> в users.xml. обновился до версии 1.1.54343 b отправка информации о прогрессе заработала как надо (на версии 1.1.54236 не работало) может кому пригодится. Еще раз спасибо за подробные разъяснения!

I'm wondering - what if some special syntax can appear to allow to monitor running queries 'from outside'. Something like MONITOR in redis.

So you can just open second Clickhouse connection, send some command like monitor query where query_id = 123123 and start getting info like:
timestamp,status,read_rows,read_bytes,total_rows,memory_usage,...

That can be used really straight-ahead without hacks with reading mutiple values of same HTTP headers arriving in realtime, without need to process HTTP-trailers, without buffering data on server just because 'client wanted to see progress bar till the end, so we can't start send data yet' and other 'exotic' non-common usages of HTTP. Just good-old plain HTTP.

There is the system.processes table that contains this information. You won't be able to receive updates in the streaming fashion though, you would have to poll.

There is the system.processes table that contains this information.
You won't be able to receive updates in the streaming fashion though, you would have to poll.

I know that and actually i do that. :) But that doesn't contain final stats - like number of rows extracted, final memory used, etc. That data can be extracted from system.query_log, but it can be disabled and by default has a 7.5 sec flush delay.

So that monitor can also contain final row for each query containing overall stats.
like

2018-01-01 00:00:00, query_start, 1, 10, 100
2018-01-01 00:00:00, progress, 3, 10, 100
2018-01-01 00:00:01, progress, 6, 10, 100
2018-01-01 00:00:01, progress, 9, 12, 100
2018-01-01 00:00:02, progress, 12, 12, 100
2018-01-01 00:00:02, finish, 0, 12, 100

Anyway - that is just an another idea how to solve the problem described in that ticket. If you like that - you can take that, if you don't - you can make it some other way :) Currenlty there are no straight-ahead way to get final stats of the query in general case.

HTTP-headers are ok, but generally in most cases you need to hack HTTP-client to get that steaming mode of HTTP headers reading (as i remember - you also were hacking POCO to make that stream-write of headers), so it's hard to call universal and handy way of getting progress.

Also send_progress_in_http_headers blocks data-stream, server is forced to buffer output (and waste memory for that), and you need to wait till the end of request process, and you can't get partial result in a meanwhile.

Now when there is an ability to collect logs from server, that idea of "monitor"ing query from outside / with another query sounds even better.

I.e. plain HTTP doesn't support any multiplexing, so instead of creating special headers for everything, just create ability to see the details of that's going on on another channel / another HTTP connection.

One more possible approach: #6294

It seems like https://github.com/yandex/ClickHouse/pull/5116 released in 19.8.3.8 should fix original issue by introducing X-ClickHouse-Summary header and improving X-ClickHouse-Progress behaviour.

Response buffering should allow to solve the problem of sending headers after data https://clickhouse.yandex/docs/en/interfaces/http/#response-buffering

See
https://github.com/yandex/ClickHouse/pull/5116/files#diff-348825873ae9ebfe18501ebebe96e002

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Was this page helpful?
0 / 5 - 0 ratings