Yugabyte-db: Create table is slow in YSQL and YCQL

Created on 27 Feb 2019 · 10Comments · Source: yugabyte/yugabyte-db

Creating a table takes about 3-4 s via both YSQL and YCQL APIs. For comparison, creating tables in postgres is around 10ms.
The performance of "create table" needs to be improved.

Repro steps:

create local universe
$ ./bin/yb-ctl --rf 1 create --enable_postgres

start psql client, turn on timing and create some tables
$ ./bin/psql -h localhost -p 5433 -U postgres
postgres=# timing
Timing is on.
postgres=# create table foo1(a int primary key, b text);
CREATE TABLE
Time: 4270.386 ms (00:04.270)
postgres=# create table foo2(a int primary key, b text, c int, d text);
CREATE TABLE
Time: 4252.385 ms (00:04.252)
postgres=# create table foo3(a int primary key, b text, c int, d text);
CREATE TABLE
Time: 3452.776 ms (00:03.453)
postgres=# create table foo4(a int primary key, b text, c int, d text);
CREATE TABLE
Time: 4263.499 ms (00:04.263)

aredocdb kinenhancement

Source

ravimurthy

Most helpful comment

Did some digging into our current raft implementation. Raft followers have a FailureDetector that is triggered at some wakeup time to check whether it knows about a leader, and if not, try to become a leader themselves. Turns out, there is already a mechanism for triggering a faster FailureDetector wakeup for followers on the create flow. However, later in the codepath, wakeup is being overwritten with default value of --leader_failure_max_missed_heartbeat_periods=6 * --raft_heartbeat_interval_ms=500. So p0 fix is to only set wakeup to default if it is not on create path. Results of this change are as follows:

Num Threads | Num Tables/Thread | Tablets/Table | Pre-Fix | Post-Fix
-- | -- | -- | -- | --
1 | 1 | 4 | 4.211s | 0.564s
1 | 1 | 24 | 4.238s | 1.197s
1 | 1 | 100 | 6.621s | 4.243s
1 | 10 | 4 | 39.945s | 5.730s
1 | 100 | 4 | 402.054s | 59.393s
10 | 10 | 4 | 43.977s | 16.415s
10 | 100 | 4 | 468.656s | 264.974s

Of note, there was no difference in timings in case of reducing the heartbeat interval to 100ms for the 1 table case, which indicates that the master processing of create table requests, master processing of tablet server responses, and tablet server leader elections are the three biggest bottlenecks.

In p1 fix, we could incorporate @mbautin idea of designating leaders from the master during the CreateTablet RPC, which would ensure the table is balanced at create time and would prevent Raft network roundtrip of electing the first leader. Additionally, we could augment the CreateTablet RPC response from the tablet server with the tablet peer status (FOLLOWER or LEADER) so master knows that every tablet has been properly created without waiting for heartbeat.

cc: @kmuthukk @bmatican

rahuldesirazu on 24 Apr 2019

🚀3

All 10 comments

Some notes:

1) @robertpang clarified that this is common to YCQL & YSQL tables.

2) Yes, DDL isn't a flow that we have optimized for yet-- but something to look into as a medium-pri task.

3) Possible reasons could be replication factor & number of shards. Specific areas to look into might be:

is the load-balancer creating shards at a throttled rate
does the very first leader election take time based on our default settings (is it single digit ms or higher),
how do we check for completion of table creation (is it immediate when all tablets are successfully created, or is there some polling once every so often to check for completion)

regards,
Kannan

kmuthukk on 27 Feb 2019

When creating new tablets, we could also set initial timeouts deterministically instead of randomizing them, in such a way that the initial leaders are likely to be elected according to the load balancer's policy. This will reduce unnecessary leader stepdowns caused by the load balancer.

Also eventually we'll be able create new "table pieces" in existing Raft groups instead of creating new Raft groups, when we have the corresponding functionality.

mbautin on 6 Mar 2019

As @mbautin mentioned, this seems largely related to the "initial" leader election also waiting for:

--leader_failure_max_missed_heartbeat_periods=6 * --raft_heartbeat_interval_ms=500.

So 3 seconds right there. Reducing those numbers made table creation go much faster, but of course, that's just to confirm the theory rather than being the right fix.

The right improvement I suppose would be to set the initial leader election be more aggressive. It should likely still be randomized so that even the initial distribution of leaders is spread out across nodes (leaving less work for subsequent leader balancing).

kmuthukk on 16 Apr 2019

Adding some more notes here:

When creating user tables, we decide on the master side how many tablets to create and where to put each replica for each tablet. During this selection, we can easily also pick where to put the leaders and fast track one of the replicas in each Raft group to trigger an election ASAP. This is most useful for speeding up runtime CREATE TABLE operations!
This issue however also affects the master side, as the sys_catalog table there also has its own Raft group which goes through the same logical path of waiting out the FailureDetector...However, this largely affects just the first time starting up and the much more uncommon master leader changes. This would be useful for speeding up cluster startup!

Note: Currently the logic for CreateTable is in the CatalogManager whereas most of the logic for where to put tablets (and leaders) is in the LoadBalancer. This split has been a long standing TODO to integrate for a while, but is largely outside of the scope of this task. However, this generates a bit of extra complexity on duplicating some logic between the two components. For example, we already have code to respect leader affinity in the LoadBalancer, which we'll have to probably ensure we apply on the CreateTable path as well...

cc @rahuldesirazu

bmatican on 23 Apr 2019

We might be able to "designate" term 0 leaders from the master according to the load balancing policy rather than doing a leader election. If the "designated" leader fails for some reason, the normal leader election mechanism will kick in.

mbautin on 23 Apr 2019

cc: @kmuthukk @bmatican

rahuldesirazu on 24 Apr 2019

🚀3

Good stuff @rahuldesirazu! 🥇

rkarthik007 on 24 Apr 2019

Discussed with @rahuldesirazu offline, let's use this task to speedup the short-term dev usability, for local / rf1 setups. For a mid-long term task, spun up #1258

bmatican on 25 Apr 2019

Did some more extensive mini cluster testing, have very good results for RF=1 case and RF=3 case with not too many tablets:

Tables | Tablets/Table | RF | Pre-Fix (s) | Post-Fix (s)
-- | -- | -- | -- | --
1 | 48 | 3 | 5.26 | 4.64
1 | 24 | 3 | 4.48 | 1.93
1 | 8 | 1 | 4.14 | 0.324
1 | 12 | 3 | 4.25 | 1.375
25 | 24 | 3 | 110 | 52
100 | 12 | 3 | 396 | 140

rahuldesirazu on 26 Apr 2019

👍1

Fixed in https://github.com/YugaByte/yugabyte-db/commit/6831eabf1be4caf5f167711de1bd2ffc894783ef

kmuthukk on 29 Apr 2019

Was this page helpful?

0 / 5 - 0 ratings