Etcd: How does reclaiming space work?

Created on 28 Sep 2020 · 14 comments · Source: etcd-io/etcd

Does compaction free up storage space immediately or does it take some time for the compaction to free up unused storage space?

Scenario:
I'm creating (let's assume) 1000 keys and then deleting them.

Problem:
Once I delete the 1000 keys, I see the sizeInUse is still pretty high.

Current solution:
I have to run compaction + defrag in order to actually reclaim the free space immediately.

Question:
Shouldn't the space be returned immediately? Or how long should I wait before the space is actually given back to the OS?

Using etcd 3.4.9. Thanks.

All 14 comments

  1. bbolt's default fill percent is 0.5, i.e. pages are kept roughly half full, so removing 1000 items might not free any (4096-byte) pages:
    https://github.com/etcd-io/bbolt/blob/f6be82302843a215152f5a1daf652c1ee5503f85/bucket.go#L26

  2. Above a certain size, allocations happen in 16 MB chunks:
    https://github.com/etcd-io/bbolt/blob/f6be82302843a215152f5a1daf652c1ee5503f85/db.go#L37

  3. I recommend using the bbolt command-line tool to see what sits in the etcd DB snapshot.

Thanks @ptabor.

Thanks for points 1 & 2.

I'll check point 3. Is there a way to check this from within an integration test?

Just to get a deeper understanding, here's what I'm doing:

  • Create a brand new etcd cluster with a storage quota of 72 Megabytes (75497472 Bytes).
  • Create a new KV client.
  • With the help of that KV client, I PUT 200 keys, each key consisting of 102400 random characters, with an empty value for each key. This fills the database up to 21331968 bytes.
  • With the help of that KV client, I RANGE_DELETE the 200 keys created in the step above. This increases the size of the DB to 42246144 bytes. I verify this by checking response.Deleted, and it does show that 200 keys were indeed deleted. My question is: why is there a size bump?
  • With the help of that KV client, I perform COMPACTION on the latest revision, 200.
  • Finally, I create a Maintenance client and perform DEFRAG, which reclaims all the unused space and brings the storage down to 16384 bytes, the smallest chunk (which is what you pointed out).

My question is: why does deletion cause a spike in DB usage (maybe due to revisions), and why do I need to compact and defragment in order to get the space back? Just trying to understand the inner workings.

Is there a way to get that empty space back without actually running compaction and defragmentation?

Really appreciate your input. Thanks.
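For reference, the sequence above looks roughly like this with clientv3 (a minimal sketch, assuming a local single-member cluster on localhost:2379 and the v3.4 client; the bench/ key prefix is made up, and running it requires a live etcd server):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// PUT 200 large keys (the bulk is in the ~100 KiB key, value empty),
	// then delete them all by prefix.
	for i := 0; i < 200; i++ {
		key := fmt.Sprintf("bench/%03d/%s", i, strings.Repeat("x", 102400))
		if _, err := cli.Put(ctx, key, ""); err != nil {
			log.Fatal(err)
		}
	}
	dresp, err := cli.Delete(ctx, "bench/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("deleted:", dresp.Deleted)

	// Compact away history up to the latest revision, then defrag to
	// return the freed pages to the OS.
	if _, err := cli.Compact(ctx, dresp.Header.Revision); err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Defragment(ctx, "localhost:2379"); err != nil {
		log.Fatal(err)
	}
}
```

Only after the final Defragment call does the file size on disk actually shrink.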

bbolt never edits pages in place.
When you delete or add an entry on a page, bbolt allocates a new page and copies the remaining content there.
The newly allocated page is either taken from the list of pages no longer in use (the freelist) or allocated at the end of the file.

A page becomes part of the freelist when two conditions hold:

  • There is no open transaction referencing the page (even transitively).
  • The page cannot be reached from the root of the current state of the B-tree.

Also, bbolt sometimes looks for a contiguous run of multiple pages (and allocates at the end of the file if none is found).

For your experiments, I recommend adding verbose logging to bbolt where it allocates pages and grow()s the storage, so you can see what triggers a given action.

Hi @vivekpatani, your observation is the expected behavior of compaction and defragmentation. Please refer to

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/maintenance.md#history-compaction-v3-api-key-value-database
and
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/maintenance.md#defragmentation

@ptabor @jingyih

Here's the discrepancy that I've come across. Not sure if it's a bug or if that's how it's supposed to be (or maybe I'm not understanding something here):

New Database initiated:
SizeInUse: 24576

Step 1:
Create 200 keys (of 102400 characters each with empty values)
SizeInUse: 21237760

Step 2:
Delete the 200 keys created above
SizeInUse: 42156032

Step 3:
Manually perform compaction using: etcdctl compaction 401
SizeInUse: 41918464
^ This size remains essentially constant until I perform the next step
^ At this step, I would expect SizeInUse to go down
^ At this step, I also assume all keys have been erased from existence (so why is SizeInUse still so high?)
^ I have made sure that the compaction is complete, and still SizeInUse does not decrease until I perform the next step.

Step 4:
Manually perform a PUT using etcdctl put foo bar
SizeInUse: 212992
^ At this step, I actually see SizeInUse go down to its real value.
^ I also verify that none of the keys is left behind (except foo).

./etcd-dump-db iterate-bucket ../../default.etcd key
key="\x00\x00\x00\x00\x00\x00\x01\x92_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n\x03foo\x10\x92\x03\x18\x92\x03 \x01*\x03bar"

My question is: why does the compaction operation not update the reported DB size?

@jpbetz can you comment on this?

I was reading the design document and at the end I saw the paragraph on Compaction. Was wondering if what I was seeing was related to that?

Here's the doc I'm referring to: https://docs.google.com/document/d/1V9UuL8BnpZF2xFpHhvE1lO2hWvtdiLml3Ukr-m4jcdI/edit#heading=h.m4svduimkxre

Thank you all for taking the time to help me understand this.

There might be a bug in the size reporting of etcd database in use. It'd be great if someone could verify this independently.

[HOWTO] build the etcd (client and server):

  1. Go to root directory of project and run ./build
  2. To run the server, ./bin/etcd --log-level debug
  3. To issue a client command ./bin/etcdctl xx xxxx

[HOWTO] reproduce the issue:

  1. Start a brand new etcd cluster.
    SizeInUse: 24576

  2. Create 200 keys of 102400 characters each.
    SizeInUse: 21237760

  3. Delete 200 keys created in the step above.
    SizeInUse: 42156032

  4. Manually perform compaction using: etcdctl compaction 401
    SizeInUse: 41918464

  5. Manually perform a PUT using etcdctl put foo bar
    SizeInUse: 212992

What is the problem?

The size reported after compaction (Step 3 in my earlier comment) seems incorrect: SizeInUse should not stay that high once compaction is complete; it should be much lower. We should not have to run defrag, or an extra PUT, just to refresh the size.

[HOWTO] check SizeInUse?

Either use delve or just add a simple go routine to print the SizeInUse periodically.

[HOWTO] create and delete 200 keys?

https://github.com/vivekpatani/etcd-create-delete

Etcd server version

3.4.9

@jpbetz @ptabor @jingyih any chance of looking at this?

How does the file size behave in your repro?
I would expect only defrag to change the actual file size.

I think the refresh of some metric is delayed until the next edit operation on the db.
Which exact metric do you mean by SizeInUse?

Thank you for your response @ptabor

My understanding of etcd is that compaction actually reduces the file size (db size). I create a few keys, delete them (which marks them with a tombstone record), then perform compaction. Compaction works in two phases: 1. it updates the indices, and 2. it deletes the unreachable revisions in the background. In phase 2 we actually see all the keys removed from the db (using the db inspection tool), so SizeInUse should drop at that point; instead, I only see the size change on an arbitrary PUT after the compaction step. Please feel free to correct me if I'm missing (or am wrong about) something.

The SizeInUse I'm referring to is:

@vivekpatani I'm still not entirely clear on how you measured the SizeInUse() value? Did you issue an etcdctl command? If so, what was the exact command?

@jpbetz @ptabor

This is how I keep a check on SizeInUse

https://github.com/vivekpatani/etcd/commit/421f541eff4ae8be4d338c193ea737447d855bdb

The only two changes I've made are to:

  • server/etcdserver/api/v3rpc/grpc.go
  • server/etcdserver/server.go

The exact steps to reproduce the problem:

  • From the root folder, ./build
  • Then execute, ./bin/etcd --log-level=info
  • Let the server continue to run in the background, open another terminal in the root of the project.
  • Add 200 Keys & Delete 200 Keys (Here's a script to do that, but feel free to use your own if that's preferred: https://github.com/vivekpatani/etcd-create-delete-requests-upstream)
  • Then from root directory run: ./bin/etcdctl compaction 401
  • While doing all this, please keep an eye on SizeInUse in logs.

If you're using the script:

  • From the root directory: go mod vendor && go build (Code tested on go1.15)
  • ./m
  • It will create 200 keys and delete them with a break of 5s.

The idea is to see that compaction doesn't actually reduce the size as expected. Please let me know if I can answer any more questions. Sorry for the late response.

@vivekpatani Your understanding sounds right to me. The strange behavior you observed is probably due to how the SizeInUse gets updated in the backend.

IIRC, SizeInUse in backend only gets updated when etcd opens a new transaction from bbolt:
https://github.com/etcd-io/etcd/blob/6e800b9b0161ef874784fc6c679325acd67e2452/server/mvcc/backend/backend.go#L528

In your case, any arbitrary PUT triggers the etcd backend to commit the already-open transaction from bbolt and open a new one (roughly speaking), which updates the db stats along the way. The part I don't understand is why the compaction itself doesn't also trigger the backend to update the stats.

I think SizeInUse (and other db stat) can be and probably should be updated more frequently.

Okay, that makes sense; I was worried I was doing something incorrectly. Let me research and see if I can fix this. It'd be great if one of y'all could verify this independently; I'll start working on it in the meantime.

Thank you. @jingyih @ptabor @jpbetz
