Etcd: Proposal: Increase quota-backend-bytes default to 8GB

Created on 24 May 2018 · 16 comments · Source: etcd-io/etcd

Now that the bbolt freelist is no longer persisted, we should consider increasing etcd's default storage limit to 8GB. We could potentially go higher, but this keeps the snapshot/restore operations at roughly 1 minute each. We can always increase it further in future releases based on feedback.

Based on the data below:

  • etcd's size limit should never exceed the memory available to it. If running on a dedicated machine with, say, 16GB of memory, etcd's storage limit needs to be less than 16GB, with a healthy margin (4GB?)
  • Throughput and latency appear stable up to at least 16GB
  • Snapshot and restore latency increases linearly up to at least 16GB
  • At 8GB, snapshot and restore take about 1 minute each (TODO: do we hit any thresholds here? Anything we should update to support 1-minute snapshots/restores?)

The benchmark was constructed using the following flow:

  • Write 1KB values randomly to a fixed-size keyspace of 100,000,000 keys
  • Start with a 15s compaction interval; each time the compaction interval is exceeded, continue to write for 1 minute, then increase the compaction interval by 15s
  • At each GB of DB file size growth, perform a snapshot followed by a restore

[Charts: 99th-percentile write latency vs. DB size (GB); write throughput (writes/s) vs. DB size (GB); save and restore latency (ms) vs. DB size (GB)]

Debian/Linux 4.9.0-amd64
6 core - Intel(R) Xeon(R) @ 3.60GHz
64 GB memory (4x 16GiB DIMM DDR4 2400 MHz)
HDD

cc @gyuho @wenjiaswe




Interesting. How did we measure restore latencies?

What is your memory size? When DB size grows beyond the memory size, performance will decrease significantly.

A more interesting test is how etcd performs when the free list contains a lot of pages and the write size is small.

For restore latency, we measured how long bin/etcdctl snapshot restore snap.out took. This is for a single-member cluster; I'd like to do a 3-member cluster next. For that one I'm thinking of stopping a single member, nuking its DB, and starting it again. To get the timing for the 3-member case, I guess I'll need to poll either the endpoint status or the cluster health?
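
The single-member measurement above amounts to timing one command invocation. A minimal sketch of doing that programmatically, assuming the binary and snapshot paths from the comment (both illustrative; this is not the actual measurement harness):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// timeCommand runs a command to completion and returns how long it took.
func timeCommand(name string, args ...string) (time.Duration, error) {
	start := time.Now()
	err := exec.Command(name, args...).Run()
	return time.Since(start), err
}

func main() {
	// Paths taken from the comment above; adjust for a real setup.
	d, err := timeCommand("bin/etcdctl", "snapshot", "restore", "snap.out")
	fmt.Printf("restore took %v (err=%v)\n", d, err)
}
```

For the 3-member variant, the duration would instead span from restarting the wiped member until a health poll reports it caught up.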

@xiang90 Good idea. I'll try a test where we do a bunch of puts/deletes with small objects and see what happens at the bbolt layer when we're allocating against a large freelist with high fragmentation.

@jpbetz Also, for the restore test, we are now mostly testing how the IO layer performs. I would expect index rebuilding to dominate the time as the number of keys grows.

@jpbetz

What is your memory size? When DB size grows beyond the memory size, performance will decrease significantly.

I think the big jump around 52GB is because of this issue.

nuking its DB, starting it again

The main motivation to soft-limit database size was to limit mean time to recovery. So, I would also measure how long it takes to rebuild mvcc states on restart (using mvcc.New), as @xiang90 suggests.

Sounds good. I've added the machine stats to the description. I'll try with a range of object sizes incl. very small to get more data on worst case recovery times.

I've updated the testing based on the feedback here. The new flow creates a larger number of small objects, many of which get deleted over time and compacted, producing free list entries in bolt as well as putting pressure on the snapshot and restore operations.

Start with a 15s compaction interval; each time the compaction interval is exceeded, continue to write for 1 minute, then increase the compaction interval by 15s

If we randomly write to a 100M keyspace at a rate of around 15k/s, the probability of key overwriting is pretty low. So I am not sure the compaction actually does anything.

Regardless, I think 16GB is a reasonable goal that we can achieve with some effort in the short term.

Hm. Running time was about 1 day, so about 12 writes per key. Not a lot of compaction or history. I'll drop that down to 1M keys and see what happens the next time we run this.

@jpbetz I also want to see what happens when the majority of the 16GB DB's pages are free pages. That is the extreme case. If boltdb can still perform well, then great :P. Otherwise we might need to do some optimization there.

Throughput and latency appear stable up to at least 16GB

I'm seeing a similar pattern with small values (100 KB).

I also logged the growing freelist size (if we keep writing data, the freelist grows up to 2 GB for a 10 GB DB), and it doesn't seem to have much effect (writing large values slows down more quickly, with a much smaller freelist).

Tested with a 10 GB DB file restore, and saw most of the time spent rebuilding the MVCC storage here:

https://github.com/coreos/etcd/blob/25f4d809800542a2fa85568f5c5cd0c881f7e010/mvcc/kvstore.go#L363-L380

Will keep experimenting.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
