ClickHouse: Using INSERT statements is much slower than using CSV

Created on 5 Aug 2017 · 14 comments · Source: ClickHouse/ClickHouse

I wanted to load some data to CH to play with it.
After spending a little time with CH, I found that loading from CSV files instead of INSERT statements is orders of magnitude faster.

Sure, I was expecting INSERT statements to be slower, but not 412x slower!

Two files, 100K rows, same data: the first is in CSV format, the other contains INSERT statements.

First file: csv.out
Sample data:

'2010-03-10', 28194901262, '8006-6129-3130-5580', 1169
'2015-10-17', 67128904894, '4681-5453-2740-1617', 8109
'2013-08-05', 79681799770, '6661-8986-3509-6991', 55
...

Second file: insert.out
Sample data:

INSERT INTO mytable VALUES('2010-03-10', 28194901262, '8006-6129-3130-5580', 1169);
INSERT INTO mytable VALUES('2015-10-17', 67128904894, '4681-5453-2740-1617', 8109);
INSERT INTO mytable VALUES('2013-08-05', 79681799770, '6661-8986-3509-6991', 55);
...

This is the table I wanted to load into:

CREATE TABLE mytable(
    some_date Date,
    some_int    UInt64,
    some_str String,
    some_float Float32
) ENGINE = MergeTree(some_date, (some_int, some_date), 8192);

Testing CSV load:

$ cat csv.out | time clickhouse-client --query="INSERT INTO mytable FORMAT CSV"
0.04user 0.01system 0:00.38elapsed 15%CPU (0avgtext+0avgdata 35312maxresident)k
416inputs+0outputs (4major+3572minor)pagefaults 0swaps

So it took 0.38 seconds.

Testing with INSERT statements:

$ cat insert.out | time clickhouse-client -n
15.62user 22.88system 2:36.86elapsed 24%CPU (0avgtext+0avgdata 46324maxresident)k
776inputs+0outputs (5major+106780minor)pagefaults 0swaps

And this took 2:36.86 (156.86 seconds).

That means using CSV is ~412x faster (156.86 s / 0.38 s).
Or, put another way, using INSERT statements is 412x slower!

I know that using INSERT statements involves SQL parsing and other overhead, and that this takes time, but isn't 412x a bit much for that?

Is this some kind of bug, or is it an expected result?

(I dropped and re-created the table between loads.)



All 14 comments

This is expected.

INSERT statements with single rows are slower not only because each statement needs SQL parsing, but also because each INSERT into a MergeTree table goes directly to the filesystem without any buffering and forms a new "data part" with a set of files per column. (These data parts are merged into larger parts in the background.)

If you transform

INSERT INTO mytable VALUES('2010-03-10', 28194901262, '8006-6129-3130-5580', 1169);
INSERT INTO mytable VALUES('2015-10-17', 67128904894, '4681-5453-2740-1617', 8109);
INSERT INTO mytable VALUES('2013-08-05', 79681799770, '6661-8986-3509-6991', 55);

into

INSERT INTO mytable VALUES
('2010-03-10', 28194901262, '8006-6129-3130-5580', 1169)
('2015-10-17', 67128904894, '4681-5453-2740-1617', 8109)
('2013-08-05', 79681799770, '6661-8986-3509-6991', 55);

then the query will run almost as fast as when using CSV.

Probably we could detect multiple INSERT statements with similar structure to optimize this automatically.

This goes for just about any database. Bulk inserts, as Alexey showed, are faster.

@alexey-milovidov thank you for the detailed answer.
Typically, clients connect to the DB to insert data into only a handful of tables. Most of the time it's just one table, and therefore one INSERT statement with different values. Think of it as 1000+ clients each sending single-row INSERT statements for the same table, but at high throughput, something like 10K inserts per second per client (in my rough benchmarks I saw 150K inserts per second using CSV with just one client). In this case, using INSERT statements will cripple performance, as you know.

Your suggestion to turn multiple inserts into one is a brilliant way to make it faster. But clients using JDBC won't benefit from this, since it's not possible to split one INSERT statement into smaller SQL chunks.

So in this case, should I develop an automated way to collect values and generate a CSV file to load, or is there another way to make inserts faster?

@ramazanpolat I guess if you have 1000+ clients doing a lot of single-row inserts, you can re-architect your app along these lines:

  • a front-facing API to receive messages from thousands of clients
  • a queue accumulating these messages (Kafka/RabbitMQ/whatever)
  • a batch writer that dumps batches of messages every couple of seconds (a minimal sketch follows the list).
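
A minimal sketch of such a batch writer, assuming the clickhouse-jdbc driver is on the classpath and the example table from above; the in-memory queue stands in for Kafka/RabbitMQ, and the connection URL, flush thresholds, and column names are placeholders to adjust (the official clickhouse-jdbc driver sends a batched prepared INSERT as one bulk insert):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Accumulates single-row messages and flushes them to ClickHouse as one
// multi-row INSERT every couple of seconds (or up to MAX_BATCH rows at a time).
public class BatchWriter implements Runnable {
    private static final int MAX_BATCH = 100_000;        // rows per flush
    private static final long FLUSH_INTERVAL_MS = 2_000; // flush period

    private final LinkedBlockingQueue<Object[]> queue = new LinkedBlockingQueue<>();

    // Called by the front-facing API for every incoming message.
    public void enqueue(Object[] row) {
        queue.add(row);
    }

    @Override
    public void run() {
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")) {
            List<Object[]> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                Thread.sleep(FLUSH_INTERVAL_MS);
                queue.drainTo(batch, MAX_BATCH);
                if (batch.isEmpty()) continue;
                // One prepared INSERT, many rows added via addBatch();
                // executeBatch() ships them together as a single bulk insert.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO mytable (some_date, some_int, some_str, some_float) VALUES (?, ?, ?, ?)")) {
                    for (Object[] row : batch) {
                        for (int i = 0; i < row.length; i++) {
                            ps.setObject(i + 1, row[i]);
                        }
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
                batch.clear();
            }
        } catch (Exception e) {
            e.printStackTrace(); // a real writer would retry or dead-letter failed batches
        }
    }
}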

I was thinking the same; I just wanted to be sure whether there is another way.
@alexey-milovidov's comment "Probably we could detect multiple INSERT statements with similar structure to optimize this automatically" should be considered for client inserts. This would save users a lot of time and optimization effort. Should I open an enhancement issue for this?

Yes, it's better to open an enhancement issue.

I'm already thinking about it. It is also relevant for bulk INSERTs where the VALUES clause contains expressions.

I'm thinking something like this:

Clients connect to CH with a hint in the JDBC URL:

jdbc:clickhouse://<host>:<port>[/<database>][?<format>=<table>]

This connection style hints to the CH server that the client will send text in a supported format, to be inserted directly into a table.

With a hinted JDBC connection, the client can only send formatted text (in the format specified in the JDBC URL) instead of SQL.

For example:

The client connects to CH using the following JDBC URL:

jdbc:clickhouse://localhost:8123/test?CSV=mytable

The CH server will then be ready to load data directly into the table named 'mytable'; no SQL parsing or any other overhead is needed.

The client then sends CSV (https://clickhouse.yandex/docs/en/formats/csv.html) instead of an SQL statement.

Another idea is to extend java.sql.Statement and java.sql.Connection in a way that hints CH about the format and the table:

class BulkLoad implements Statement {
    ...
    public void append(String values) { ... }
    public void load() { ... }
    ...
}

then

BulkLoad bulkLoad = con.createBulkLoad(Format.CSV, "mytable");
bulkLoad.append("value1, value2, value3,...");
bulkLoad.append("value4, value5, value6,...");
bulkLoad.append("(value7, value8, value9,...), (value10, value11, value12,...), ....");
bulkLoad.load();

Regarding insert performance: I observed that if you insert data into a ReplicatedMergeTree where EventDate is spread over 3-4 years, the speed is pretty good with batches of 10k records. But if the same records are spread over 20 years, it's at least 100 times slower.

Yes, this is expected, because each batch of data for an INSERT is split into smaller batches by partition, and partitions are months.

There are two possible workarounds:

  • use even larger batches (example: 100k instead of 10k);
  • group data by months before insertion (see the sketch after this list).
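
A minimal sketch of the second workaround, assuming rows shaped like the example table; Row and groupByMonth are hypothetical helpers (records need Java 16+), and each per-month list would then be sent as its own INSERT batch:

import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Groups rows by the month of their date column so that each INSERT batch
// touches as few monthly partitions as possible.
public class MonthGrouper {
    // Hypothetical row type matching the example table.
    public record Row(LocalDate someDate, long someInt, String someStr, float someFloat) {}

    public static Map<YearMonth, List<Row>> groupByMonth(List<Row> rows) {
        Map<YearMonth, List<Row>> grouped = new TreeMap<>();
        for (Row row : rows) {
            grouped.computeIfAbsent(YearMonth.from(row.someDate()), m -> new ArrayList<>()).add(row);
        }
        return grouped;
    }
}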

This is not an issue in typical use cases when you:

  • insert realtime data;
  • insert a dump, almost ordered by time.

@alexey-milovidov thanks for the explanation. Maybe this could be added to the documentation, since it might be a good hint for others as well.

@shuttie, are you aware of any existing batch writer like that for ClickHouse?

@zero-master https://github.com/nikepan/clickhouse-bulk , but I've never tried this thing in action.

There is also the Buffer table engine, which can be used to solve that problem (with certain limitations; check the documentation: https://clickhouse.yandex/docs/en/table_engines/buffer.html).
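
A Buffer table can be created once (here via JDBC) and then targeted by the ordinary single-row INSERTs; a minimal sketch, assuming the 'default' database and the threshold values from the Buffer documentation example, all of which should be tuned for the actual workload:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Creates a Buffer table in front of mytable: small INSERTs go to the buffer,
// and ClickHouse flushes them into mytable in larger batches in the background.
public class CreateBufferTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
             Statement stmt = conn.createStatement()) {
            // Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS mytable_buffer AS mytable " +
                "ENGINE = Buffer(default, mytable, 16, 10, 100, 10000, 1000000, 10000000, 100000000)");
            // Clients then INSERT INTO mytable_buffer instead of mytable.
        }
    }
}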

There is also an experimental Kafka table engine (which can be used as a source for a materialized view), and a Kafka->ClickHouse bridge.

@ramazanpolat do you have any further questions?

