When we do a CREATE DATABASE, we currently read from a bunch of system tables and write into others. This used to be done as individual operations, but we previously did some work to speed up create database, by batching _all_ of the writes together into 1. This however means the master will use a lot of memory and send a relatively sizeable request when doing this:
I0226 01:29:14.932030 244 sys_catalog.cc:779] Copied total 43622 rows
I0226 01:29:15.018285 244 sys_catalog.cc:780] Copied total 63285498 bytes
@iSignal found that in some low memory envs, this would already end up failing just because of the memory limit. We should just add some chunking to limit how big each batched write gets.
Results of one experiment on GCP with different values of copy_table_batch_size parameter as define in https://github.com/yugabyte/yugabyte-db/blob/7f8d931129055d53a6f19e53c2e1103b8ddb2804/src/yb/master/sys_catalog.cc#L110
These numbers were by running ysqlsh timing from my macbook to a GCP us-west universe. The numbers were quite similar when run from the nodes themselves.
| Batch size for copy tables | Peak mem usage | Latency of create db | Number of batches |
| --- | --- | --- | --- |
| -1 (unlimited) | 416.86M | 9.499s | 1 |
| 15K | 341.1M | 11.068 | 3 |
| 10K | 215.75M | 10.569 | 5 |
| 7K | 213.29M | 15.829 | 7 |
However, the numbers are somewhat variable. On another universe, I got the following results (this time all numbers are reported by ysqlsh timing from the leader nodes themselves on a different GCP us-west universe.)
| Batch size for copy tables | Peak mem usage | Latency of create db | Number of batches |
| --- | --- | --- | --- |
| -1 (unlimited) | 419.30M | 14.490 / 15.78 | 1 |
| 15K | 276.53M | 17.12 / 17.20 | 3 |
| 10K | 268.64M | 15.67 / 15.76 | 5 |
| 7K | 191.71M | 16.32 / 16.45| 7 |
Peak mem numbers also were different when a second create database was performed after the first.
Here are the results for a c5.xlarge cross-region on a v2.1.3 universe across us-east-1, us-west-2, us-west-1 with leader master in us-east-1
| Batch size for copy tables | Peak mem usage | Latency of create db | Number of batches |
| --- | --- | --- | --- |
| -1 (unlimited) | 399M | 45.1 | 1 |
| 10K | 250.45M | 46.7s | 5 |
Based on these results, I would recommend setting the default batch size to 10K which has roughly the same performance as current, ~65% peak mem usage as current and requires 5 round trips instead of 1 for create database. I will also retry the test for 15K to confirm the 341M number in the first case.
On a local rf3 cluster on my macbook, the results are much sharper between single batch and 10K batch size as reported on the diff esult of this change with different batching params on rf3 yb-ctl cluster, because network latency is not a major factor.
current: 410MB mem, 14.5secs
10K: 230MB mem, 8.4secs
7.5K: 220MB mem, 8.2secs
5K: 180MB mem, 8.3secs
cc: @m-iancu @ndeodhar @bmatican
The plan is to undo this change for now until we have a fix for issues like #4353 at the raft level. The issue is that large writes on the raft log that can take O(heartbeat interval) might cause a leader change so a createdatabase that involves multiple large writes might never get through fully.
One issue I discovered while revisiting this fix is that the size computation was using protobuf's SpaceUsedLong which seems to compute the bytes used in memory for the protobuf object. If we instead use protobuf's ByteSizeLong method instead to look at the serialized form of the protobuf, the size is around 10-20% of the in memory size.
Going by the ByteSizeLong method of the serialized form of the protobuf for the write request, we can complete the entire create database write in 10MB total serialized bytes spread over 3 batches with a max of 4MB serialized writes / batch (which is the current default for consensus).
Most helpful comment
Results of one experiment on GCP with different values of copy_table_batch_size parameter as define in https://github.com/yugabyte/yugabyte-db/blob/7f8d931129055d53a6f19e53c2e1103b8ddb2804/src/yb/master/sys_catalog.cc#L110
These numbers were by running ysqlsh timing from my macbook to a GCP us-west universe. The numbers were quite similar when run from the nodes themselves.
| Batch size for copy tables | Peak mem usage | Latency of create db | Number of batches |
| --- | --- | --- | --- |
| -1 (unlimited) | 416.86M | 9.499s | 1 |
| 15K | 341.1M | 11.068 | 3 |
| 10K | 215.75M | 10.569 | 5 |
| 7K | 213.29M | 15.829 | 7 |
However, the numbers are somewhat variable. On another universe, I got the following results (this time all numbers are reported by ysqlsh timing from the leader nodes themselves on a different GCP us-west universe.)
| Batch size for copy tables | Peak mem usage | Latency of create db | Number of batches |
| --- | --- | --- | --- |
| -1 (unlimited) | 419.30M | 14.490 / 15.78 | 1 |
| 15K | 276.53M | 17.12 / 17.20 | 3 |
| 10K | 268.64M | 15.67 / 15.76 | 5 |
| 7K | 191.71M | 16.32 / 16.45| 7 |
Peak mem numbers also were different when a second create database was performed after the first.
Here are the results for a c5.xlarge cross-region on a v2.1.3 universe across us-east-1, us-west-2, us-west-1 with leader master in us-east-1
| Batch size for copy tables | Peak mem usage | Latency of create db | Number of batches |
| --- | --- | --- | --- |
| -1 (unlimited) | 399M | 45.1 | 1 |
| 10K | 250.45M | 46.7s | 5 |
Based on these results, I would recommend setting the default batch size to 10K which has roughly the same performance as current, ~65% peak mem usage as current and requires 5 round trips instead of 1 for create database. I will also retry the test for 15K to confirm the 341M number in the first case.
On a local rf3 cluster on my macbook, the results are much sharper between single batch and 10K batch size as reported on the diff esult of this change with different batching params on rf3 yb-ctl cluster, because network latency is not a major factor.
cc: @m-iancu @ndeodhar @bmatican