Creating a table takes about 3-4 s through both the YSQL and YCQL APIs. For comparison, creating a table in PostgreSQL takes around 10 ms.
The performance of "create table" needs to be improved.
Repro steps:
Some notes:
1) @robertpang clarified that this is common to YCQL & YSQL tables.
2) Yes, DDL isn't a flow that we have optimized for yet, but it is something to look into as a medium-priority task.
3) Possible reasons could be replication factor & number of shards. Specific areas to look into might be:
regards,
Kannan
When creating new tablets, we could also set initial timeouts deterministically instead of randomizing them, in such a way that the initial leaders are likely to be elected according to the load balancer's policy. This will reduce unnecessary leader stepdowns caused by the load balancer.
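A minimal sketch of the deterministic-timeout idea, assuming a simple round-robin choice of preferred leader as a stand-in for the load balancer's policy. The function name, the peer names, and the `short_ms` / `step_ms` values are all hypothetical and are not taken from the codebase.

```python
def assign_initial_timeouts(tablet_ids, peers, short_ms=50, step_ms=200):
    """Give each tablet's preferred peer the shortest initial election timeout.

    The preferred peer is chosen round-robin (an assumption standing in for
    the real load balancer policy). The other peers get progressively longer
    timeouts, so they only start an election if the preferred peer fails to.
    """
    plan = {}
    for i, tablet in enumerate(tablet_ids):
        preferred = i % len(peers)
        timeouts = {}
        for j, peer in enumerate(peers):
            rank = (j - preferred) % len(peers)  # 0 for the preferred peer
            timeouts[peer] = short_ms + rank * step_ms
        plan[tablet] = timeouts
    return plan


plan = assign_initial_timeouts(["t0", "t1", "t2", "t3"], ["ts1", "ts2", "ts3"])
for tablet, timeouts in plan.items():
    likely_leader = min(timeouts, key=timeouts.get)
    print(tablet, timeouts, "-> likely first leader:", likely_leader)
```

Because the peer with the shortest timeout is the first to call an election, the initial leaders should already be spread the way the policy wants, leaving fewer stepdowns for the load balancer to trigger later.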
Also, eventually we'll be able to create new "table pieces" in existing Raft groups instead of creating new Raft groups, once we have the corresponding functionality.
As @mbautin mentioned, this seems largely related to the "initial" leader election also waiting for:
--leader_failure_max_missed_heartbeat_periods=6 * --raft_heartbeat_interval_ms=500.
So 3 seconds right there. Reducing those numbers made table creation go much faster, but of course, that's just to confirm the theory rather than being the right fix.
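The arithmetic behind the "3 seconds" is just the product of the two flag values quoted above. A small sketch follows; the helper function name is made up for illustration.

```python
# Values of the two flags quoted above.
leader_failure_max_missed_heartbeat_periods = 6
raft_heartbeat_interval_ms = 500


def failure_detection_delay_ms(max_missed_periods, heartbeat_interval_ms):
    """Time a follower waits without hearing from a leader before it
    starts an election."""
    return max_missed_periods * heartbeat_interval_ms


default_delay_ms = failure_detection_delay_ms(
    leader_failure_max_missed_heartbeat_periods, raft_heartbeat_interval_ms
)
print(default_delay_ms)  # 3000 ms, i.e. the 3 s spent before the first election
```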
The right improvement, I suppose, would be to make the initial leader election more aggressive. It should likely still be randomized so that even the initial distribution of leaders is spread out across nodes (leaving less work for subsequent leader balancing).
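As a rough illustration of "aggressive but still randomized", the sketch below gives each peer of a new tablet a short base delay plus random jitter and treats the peer with the smallest delay as the first candidate. The `base_ms` and `jitter_ms` values and all names are assumptions made for the example, not existing flags.

```python
import random
from collections import Counter


def initial_election_delay_ms(rng, base_ms=50, jitter_ms=150):
    """Short, randomized delay before a brand-new peer starts an election."""
    return base_ms + rng.uniform(0, jitter_ms)


def pick_first_candidates(num_tablets, peers, seed=0):
    """For each tablet, the peer with the smallest delay campaigns first."""
    rng = random.Random(seed)
    winners = {}
    for tablet in range(num_tablets):
        delays = {peer: initial_election_delay_ms(rng) for peer in peers}
        winners[tablet] = min(delays, key=delays.get)
    return winners


winners = pick_first_candidates(300, ["ts1", "ts2", "ts3"])
print(Counter(winners.values()))  # leaders should be roughly evenly spread
```

Every delay stays far below the 3 s default, while the jitter keeps any single node from winning every election.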
Adding some more notes here:
The sys_catalog table also has its own Raft group, which goes through the same logical path of waiting out the FailureDetector. However, this largely affects only the first startup and the much less common master leader changes. Fixing it would be useful for speeding up cluster startup!

Note: Currently the logic for CreateTable is in the CatalogManager, whereas most of the logic for where to put tablets (and leaders) is in the LoadBalancer. Integrating the two has been a long-standing TODO, but is largely outside the scope of this task. The split does add some complexity, because some logic has to be duplicated between the two components. For example, we already have code to respect leader affinity in the LoadBalancer, which we'll probably have to apply on the CreateTable path as well.
cc @rahuldesirazu
We might be able to "designate" term 0 leaders from the master according to the load balancing policy rather than doing a leader election. If the "designated" leader fails for some reason, the normal leader election mechanism will kick in.
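A minimal sketch of the "designated leader with fallback" idea. The function, its parameters, and the toy election callback are hypothetical; the sketch only shows the control flow, not how the master or tablet servers would implement it.

```python
def first_leader(peers, designated, is_healthy, run_election):
    """Return the term-0 leader for a new tablet.

    Use the peer designated by the master when it is usable; otherwise fall
    back to a normal election among the healthy peers.
    """
    if designated in peers and is_healthy(designated):
        return designated
    return run_election([p for p in peers if is_healthy(p)])


peers = ["ts1", "ts2", "ts3"]


def toy_election(candidates):
    # Stand-in for a real Raft election.
    return candidates[0]


print(first_leader(peers, "ts2", lambda p: True, toy_election))
print(first_leader(peers, "ts2", lambda p: p != "ts2", toy_election))
```

The first call returns the designated peer directly, without an election round trip. The second simulates a failed designated peer and goes through the election fallback.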
Did some digging into our current Raft implementation. Raft followers have a FailureDetector that is triggered at some wakeup time to check whether the follower knows about a leader and, if not, to try to become leader itself. It turns out there is already a mechanism for triggering a faster FailureDetector wakeup for followers on the create flow. However, later in the code path, the wakeup is overwritten with the default value of --leader_failure_max_missed_heartbeat_periods=6 * --raft_heartbeat_interval_ms=500. So the p0 fix is to set the wakeup to the default only when we are not on the create path. Results of this change are as follows:
Num Threads | Num Tables/Thread | Tablets/Table | Pre-Fix | Post-Fix
-- | -- | -- | -- | --
1 | 1 | 4 | 4.211s | 0.564s
1 | 1 | 24 | 4.238s | 1.197s
1 | 1 | 100 | 6.621s | 4.243s
1 | 10 | 4 | 39.945s | 5.730s
1 | 100 | 4 | 402.054s | 59.393s
10 | 10 | 4 | 43.977s | 16.415s
10 | 100 | 4 | 468.656s | 264.974s
Of note, there was no difference in timings in case of reducing the heartbeat interval to 100ms for the 1 table case, which indicates that the master processing of create table requests, master processing of tablet server responses, and tablet server leader elections are the three biggest bottlenecks.
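The p0 fix described above can be sketched as follows. This is a simplified model, not the actual FailureDetector code: the class, the method names, and the 100 ms fast wakeup value are assumptions made for the example.

```python
DEFAULT_WAKEUP_MS = 6 * 500   # max missed heartbeat periods * heartbeat interval
FAST_CREATE_WAKEUP_MS = 100   # hypothetical fast wakeup used on the create flow


class FailureDetectorSketch:
    def __init__(self):
        self.wakeup_ms = None

    def schedule_on_create(self):
        # Existing mechanism: a faster wakeup for freshly created tablets.
        self.wakeup_ms = FAST_CREATE_WAKEUP_MS

    def apply_default_buggy(self):
        # Pre-fix behavior: the default always overwrites the fast wakeup.
        self.wakeup_ms = DEFAULT_WAKEUP_MS

    def apply_default_fixed(self, on_create_path):
        # Post-fix behavior: keep the fast wakeup on the create path.
        if not on_create_path:
            self.wakeup_ms = DEFAULT_WAKEUP_MS


before = FailureDetectorSketch()
before.schedule_on_create()
before.apply_default_buggy()

after = FailureDetectorSketch()
after.schedule_on_create()
after.apply_default_fixed(on_create_path=True)

print(before.wakeup_ms, after.wakeup_ms)
```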
In the p1 fix, we could incorporate @mbautin's idea of designating leaders from the master during the CreateTablet RPC, which would ensure the table is balanced at create time and would avoid the Raft network round trip of electing the first leader. Additionally, we could augment the CreateTablet RPC response from the tablet server with the tablet peer status (FOLLOWER or LEADER), so the master knows that every tablet has been properly created without waiting for a heartbeat.
cc: @kmuthukk @bmatican
Good stuff @rahuldesirazu!
Discussed with @rahuldesirazu offline: let's use this task to speed up short-term dev usability for local / RF=1 setups. For the mid-to-long-term work, spun up #1258.
Did some more extensive mini cluster testing. The results are very good for the RF=1 case and for the RF=3 case with not too many tablets:
Tables | Tablets/Table | RF | Pre-Fix (s) | Post-Fix (s)
-- | -- | -- | -- | --
1 | 48 | 3 | 5.26 | 4.64
1 | 24 | 3 | 4.48 | 1.93
1 | 8 | 1 | 4.14 | 0.324
1 | 12 | 3 | 4.25 | 1.375
25 | 24 | 3 | 110 | 52
100 | 12 | 3 | 396 | 140
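To make the table above easier to compare row by row, the short script below computes the pre-fix / post-fix ratio for each row. The numbers are copied from the table; the variable and function names are made up for illustration.

```python
# (tables, tablets_per_table, rf, pre_fix_s, post_fix_s), copied from the table above
rows = [
    (1, 48, 3, 5.26, 4.64),
    (1, 24, 3, 4.48, 1.93),
    (1, 8, 1, 4.14, 0.324),
    (1, 12, 3, 4.25, 1.375),
    (25, 24, 3, 110, 52),
    (100, 12, 3, 396, 140),
]


def speedup(pre_s, post_s):
    """How many times faster table creation is after the fix."""
    return pre_s / post_s


speedups = [speedup(pre, post) for (_, _, _, pre, post) in rows]
for (tables, tablets, rf, _, _), s in zip(rows, speedups):
    print(f"tables={tables} tablets/table={tablets} RF={rf}: {s:.2f}x")
```

The largest ratio belongs to the RF=1 row and the smallest to the 48-tablet RF=3 row, which matches the observation that the fix helps most for RF=1 and for modest tablet counts.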