I'm testing ClickHouse to see if it's a fit for our data.
I found that a GROUP BY with LIMIT query results in a "Memory limit (for query) exceeded" error, even though max_bytes_before_external_group_by and max_bytes_before_external_sort are set.
set send_logs_level='debug'
set max_memory_usage=20000000000
set max_bytes_before_external_group_by=10000000000
set max_bytes_before_external_sort=10000000000
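As a sanity check (not part of the original report), the current session values can be confirmed from the system.settings table:
-- verify the spill-to-disk thresholds are active in the current session
select name, value
from system.settings
where name in ('max_memory_usage', 'max_bytes_before_external_group_by', 'max_bytes_before_external_sort')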
CREATE TABLE hdfs_table (... some columns.. , high_cardinality_key_1 Nullable(Int64), high_cardinality_key_2 Nullable(Int64)) ENGINE=HDFS('hdfs://host:9000/some_dir/*', 'Parquet')
1. select high_cardinality_key_1, count(high_cardinality_key_2) from hdfs_table group by high_cardinality_key_1 limit 1
2. select count(distinct hour) from hdfs_table
I expected the query to spill to disk and return results even though memory is limited.
In the second query, hour is a very low-cardinality column (only 24 distinct values). Is the whole dataset loaded into memory even though only some columns are needed?
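As a side check (assuming query logging is enabled; the query below is only an illustration), system.query_log shows how many rows and bytes each attempt actually read and how much memory it used:
select query, read_rows, read_bytes, memory_usage
from system.query_log
where type = 'QueryFinish'
order by event_time desc
limit 5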
The debug logs for the first query:
<Debug> MemoryTracker: Current memory usage: 6.03 GiB.
<Debug> MemoryTracker: Current memory usage: 10.00 GiB.
<Debug> Aggregator: Writing part of aggregation data into temporary file /var/lib/clickhouse/tmp/tmp68165zfaaaa.
...
<Debug> MemoryTracker: Current memory usage: 16.01 GiB.
...
<Debug> Aggregator: Writing part of aggregation data into temporary file /var/lib/clickhouse/tmp/tmp68165yhaaaa.
...
Progress: 733.14 million rows, 64.52 GB (19.19 million rows/s., 1.69 GB/s.)
<Error> executeQuery: Code: 241, e.displayText() = DB::Exception: Memory limit (for query) exceeded: would use 18.63 GiB (attempt to allocate chunk of 33554448 bytes), maximum: 18.63 GiB (version 19.16.2.2 (official build))
Any ideas?
Thanks
I just tried to query some parquet files from the local FS, and also a small file (about 150 MB, 29M rows) from a local HDFS installation.
SELECT countDistinct(BannerID)
FROM file('huge_event_log*.parq', 'Parquet', 'EventTime DateTime, BannerID UInt64, Cost UInt64, CounterType UInt8')
┌─uniqExact(BannerID)─┐
│             2788144 │
└─────────────────────┘
1 rows in set. Elapsed: 269.551 sec. Processed 5.06 billion rows, 106.26 GB (18.77 million rows/s., 394.20 MB/s.)
and in server logs:
2019.11.07 19:23:45.078137 [ 133 ] {01ab166d-77c9-465a-968c-1d6882e28217} <Debug> MemoryTracker: Peak memory usage (total): 1.43 GiB.
Memory consumption seems to be OK, but I cannot reproduce your case precisely. Maybe your files have some peculiar properties? How many parquet files do you have? Which compression format do you use?
In our code, we read data from HDFS in small batches: https://github.com/clickhouse/ClickHouse/blob/master/dbms/src/IO/ReadBufferFromHDFS.cpp#L61.
I think the main problem is the number of files in hdfs://host:9000/some_dir/*.
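If that's the cause, one hypothetical way to confirm it is to point a second table at a narrower glob (the file name pattern below is made up) and compare peak memory for the same query:
-- hypothetical test table: same keys, glob matches only a few files
CREATE TABLE hdfs_table_subset (high_cardinality_key_1 Nullable(Int64), high_cardinality_key_2 Nullable(Int64)) ENGINE=HDFS('hdfs://host:9000/some_dir/part-0000*', 'Parquet')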
Yes, I just created 200 files and got an OOM:
$ ls event_log* | wc -l
200
$ clickhouse-client
:) select countDistinct(BannerID) from file('event_log*.parq', 'Parquet', 'EventTime DateTime, BannerID UInt64, Cost UInt64, CounterType UInt8');
[7 19:52:13 2019] oom_kill_process+0x254/0x280
[7 19:52:13 2019] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7 19:52:13 2019] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=clickhouse,pid=1767,uid=40154
[7 19:52:14 2019] oom_reaper: reaped process 1767 (clickhouse), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
@stavrolia Look at narrowBlockInputStreams.
Thank you for looking into it.
Indeed, we have about 200 parquet files in the directory. Also, the compression used is Snappy.
Hi, I created a similar table reading HDFS files as ORC and the same problem occurs. What do I need to set to avoid this situation? My configuration is the same as above; machine configuration: 8 cores, 32 GB memory. Please give me a hand, thanks.
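Not a definitive fix, but on a 32 GB machine you could try lowering parallelism and setting the spill thresholds well below the per-query memory limit; lowering max_threads may also reduce how many input streams are open at once. The exact values below are only assumptions to tune:
-- assumed starting values for a 32 GB machine; tune for your workload
set max_threads=4
set max_memory_usage=25000000000
set max_bytes_before_external_group_by=12000000000
set max_bytes_before_external_sort=12000000000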
That's a trivial task, @stavrolia will fix.
@stavrolia - do you want to fix it before lunch or this evening?