I am trying to GROUP BY on a distributed table (~100 million records) and two interesting issues are happening:
1) It is throwing an "Attempt to read after eof" error.
2) The ClickHouse service on the distributed server is getting shut down.
I have really tried hard but still could not find a solution for this error.
Second:
What exactly do the max_bytes_before_external_group_by and max_bytes_before_external_order_by settings signify, and what impact does changing them have? Right now their value is 0, and the query that is causing the problem does have a GROUP BY clause.
How do I change the max_bytes_before_external_group_by and max_bytes_before_external_order_by values?
1) The OOM killer killed the ClickHouse process (check dmesg in the shell).
If those parameters are zero (the default), ClickHouse tries to do all GROUP BY & ORDER BY in memory (it will allocate up to max_memory_usage bytes; you probably have less than 10 GB of RAM).
The alternative option is to do huge GROUP BY / ORDER BY operations on disk. It is slower, but it allows those operations to complete when the data doesn't fit in memory.
How do I do this?
Set it in the session (set max_bytes_before_external_group_by = ....)
or in the users' profiles:
cat /etc/clickhouse-server/conf.d/memory_usage.xml
<?xml version="1.0" ?>
<yandex>
    <profiles>
        <default>
            <max_memory_usage_for_all_queries>101310968832</max_memory_usage_for_all_queries>
            <max_bytes_before_external_group_by>50655484416</max_bytes_before_external_group_by>
            <max_memory_usage>101310968832</max_memory_usage>
        </default>
    </profiles>
</yandex>
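For the session-level route mentioned above, a minimal sketch (the ~10 GB threshold is only an illustrative value, not one suggested in this thread):

set max_bytes_before_external_group_by = 10000000000;  -- illustrative ~10 GB threshold; GROUP BY state spills to disk above it
-- confirm the setting took effect for this session
select name, value from system.settings where name = 'max_bytes_before_external_group_by';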
Hi @den-crane, I just checked and memory_usage.xml is not present. Do we need to create this XML? If yes, can you please help me find where I can download the config file with default settings?
Even if I try to execute the statement
SET max_bytes_before_external_group_by = 100000000
it is throwing an error.
@I-akshay, yes, that file should be created.
@den-crane provided the full content of that file. I don't understand why you can't create it on your own and need to download it from somewhere, but here it is: memory_usage.zip
You need to:
1) unpack it into the /etc/clickhouse-server/conf.d/ directory on your ClickHouse server
2) edit it according to the available RAM on your server - max_memory_usage should be lower than the available amount of RAM (in bytes), for example try 80%; max_memory_usage_for_all_queries should be lower than max_memory_usage, try 70% (a worked example follows below).
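A hypothetical worked example of that arithmetic, assuming a server with 128 GiB of RAM (the 80% / 70% figures come from the comment above; the RAM size is made up):

select
    toUInt64(128 * 1024 * 1024 * 1024) as total_ram_bytes,                        -- assumed 128 GiB of RAM
    toUInt64(total_ram_bytes * 0.8)    as max_memory_usage_value,                 -- ~80% of RAM
    toUInt64(total_ram_bytes * 0.7)    as max_memory_usage_for_all_queries_value  -- ~70%, kept below max_memory_usage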
Hi @den-crane,
Basically, I have created a distributed table on a cluster with 3 servers. I have now done what you suggested, but the issue is still not resolved: the huge GROUP BY / ORDER BY operations are not spilling to disk. Can you please suggest why this is happening?
@filimonov
Can we add the distributed_aggregation_memory_efficient property to memory_usage.xml?
Yes, you can.
Check that the settings are applied to your session:
select * from system.settings where name = 'max_bytes_before_external_group_by'
Hi @den-crane, I was asking about distributed_aggregation_memory_efficient. I want to permanently set distributed_aggregation_memory_efficient to 1;
I don't want to set the property every time before the session.
cat /etc/clickhouse-server/conf.d/memory_usage.xml
<?xml version="1.0" ?>
<yandex>
    <profiles>
        <default>
            <max_memory_usage_for_all_queries>101310968832</max_memory_usage_for_all_queries>
            <max_bytes_before_external_group_by>50655484416</max_bytes_before_external_group_by>
            <max_memory_usage>101310968832</max_memory_usage>
            <distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>
        </default>
    </profiles>
</yandex>
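The settings dump below looks like the output of a query along these lines (a reconstruction; the exact statement isn't shown in the thread):

select name, value
from system.settings
where name in ('distributed_aggregation_memory_efficient',
               'max_bytes_before_external_group_by',
               'max_bytes_before_external_sort',
               'max_memory_usage',
               'max_memory_usage_for_all_queries')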
┌─name─────────────────────────────────────┬─value───────┐
│ distributed_aggregation_memory_efficient │           1 │
│ max_bytes_before_external_group_by       │  8000000000 │
│ max_bytes_before_external_sort           │  8000000000 │
│ max_memory_usage                          │ 16000000000 │
│ max_memory_usage_for_all_queries          │ 16000000000 │
└───────────────────────────────────────────┴─────────────┘
Progress: 2.00 billion rows, 356.47 GB (62.25 million rows/s., 11.08 GB/s.) 98%
Received exception from server (version 19.4.2):
Code: 241. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Memory limit (for query) exceeded: would use 14.91 GiB (attempt to allocate chunk of 16777216 bytes), maximum: 14.90 GiB.
Hi, with the above parameters set, shouldn't it use the disk? Why is it still reporting memory limits?
shouldn't it use the disk? Why is it still reporting memory limits?
Because the final aggregation result should fit into memory.
So for GROUP BY SOME_HIGH_CARDINALITY_SET_OF_COLUMNS, ClickHouse will do intermediate partial aggregations on disk, but in the end it will gather the result into one part in memory.
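To make that concrete, a hypothetical sketch (the table, columns, and thresholds are made up for illustration):

-- Intermediate aggregation states spill to disk once they exceed the threshold below,
-- but the merged final result for all distinct keys must still fit under max_memory_usage.
set max_bytes_before_external_group_by = 4000000000;  -- example: start spilling past ~4 GB
set max_memory_usage = 8000000000;                    -- example: hard per-query cap of ~8 GB
select user_id, url, count() as hits                  -- (user_id, url) stands in for a high-cardinality key set
from visits
group by user_id, url;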
Yes, you are right. I can try it again. Thanks.
@den-crane Could you please put that into the documentation? It will be easier to refer to than a single comment in a not very transparent issue report.
Is it normal that data (45 GB in CSV - 3x string, 5x int, 1x float - not sure how big in a MergeTree engine), "aggregated" by a high-cardinality set of columns and producing the same number of rows, runs out of memory on a 120 GB machine?
Memory limit (for query) exceeded: would use 120.03 GiB (attempt to allocate chunk of 134217728 bytes), maximum: 120.00 GiB.
Because the final aggregation result should fit into memory.
So for GROUP BY SOME_HIGH_CARDINALITY_SET_OF_COLUMNS, ClickHouse will do intermediate partial aggregations on disk, but in the end it will gather the result into one part in memory.
But that's just untrue.
ClickHouse can send the aggregation result to the client without allocating memory for the whole result.
It will do it in buckets of 1/256 of the total result size.
@den-crane
---- How do I fix this?
My SQL: insert into apps_count select appid, count(distinct userid) from table group by appid.
appid has about 1,600,000 distinct values.
It fails every time at around 98%.
I have set max_memory_usage / max_bytes_before_external_group_by, but it doesn't help.
@kainever7 You should set max_bytes_before_external_group_by to a value slightly less than the total amount of memory. For example, if you have 16 GB of memory, it's reasonable to set max_memory_usage to 15000000000 (because some memory is needed for the OS and for other programs in the system) and max_bytes_before_external_group_by to 12000000000 (because ClickHouse will need some extra memory to spill data to disk).
If you do distributed group by, also set distributed_aggregation_memory_efficient to 1.
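As a session-level sketch of those numbers (a server with 16 GB of RAM, per the comment above):

set max_memory_usage = 15000000000;                    -- slightly below total RAM, leaving room for the OS and other programs
set max_bytes_before_external_group_by = 12000000000;  -- leaves ClickHouse headroom for spilling to disk
set distributed_aggregation_memory_efficient = 1;      -- only relevant for distributed GROUP BY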
@I-akshay, do you have any further questions?
@blinkov is the information from this issue available somewhere in the documentation?
@jangorecki only partially, I think; there's always a lot of room for improvement.
@alexey-milovidov since the issue was closed, is there any chance to get the commit id that resolved it, or a link to the documentation directly? I would like to apply the information I am waiting for in this thread once it is documented.
We have to copy the relevant information from comment https://github.com/ClickHouse/ClickHouse/issues/5618#issuecomment-575831276 into docs.
@alexey-milovidov is there an issue tracker entry for that where I can follow the status?
I still see the same issue. I don't see ClickHouse writing to disk :(