Realm-cocoa: Add in place compact method

Created on 18 Dec 2014 · 20 comments · Source: realm/realm-cocoa

This is probably a meta-ticket that will need to be broken up into several tickets, but I think we should look at optimizing a few scenarios regarding file sizes on device.

  • Letting people compact their files in place in Cocoa
  • Consider compacting the file automatically after a large number of deletes has taken place, although that may backfire in some scenarios
  • Not allocating in such big chunks as we do now, to reduce the odds of hitting memory limits

@kspangsege @astigsen thoughts?

Blocked Pipeline-Idea-Backlog T-Enhancement

Most helpful comment

If/when the compact method is made public, it would be great to have the reason why it didn't compact the realm. Like in a try/catch having an error like: "did not compact because process X holds an open connection on the Realm".

All 20 comments

@timanglade I don't think you fully understand why people keep hitting the file size issue. It isn't an issue of compaction or of allocation sizes. It's due to the fact that we oftentimes hold onto sparsely populated blocks of data unnecessarily. The real solution is to stop holding onto unneeded blocks, which is the actual problem here. None of the workarounds you suggest solves that, and each has issues of its own:

  • Users can already compact their db using writeCopyToPath:. The big problem with any solution of this type is that users don't know they need to call this method (because if everything else worked, they shouldn't have to), and there is a good possibility that calling a compaction method could induce noticeable lag in apps.
  • Auto-compacting after deletes doesn't make any sense. You don't need to delete any objects to trigger the space wastage - you only need to create objects in many write transactions while another read transaction exists. What might make sense is auto-compacting after a large number of write transactions resulting in unreferenced blocks. The only way I could see this working is if compaction, and writes in general, happened in the background to prevent blocking a user's app.
  • Changing the allocation size isn't going to fix anything. It could prevent a few users from hitting the limit, but at that point they still have the issue of wasting a huge amount of disk space. One thing we could change is the branching factor in core which will slow the rate that space is wasted, but that is a separate issue entirely.
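The "space wastage without deletes" mechanism in the second bullet can be illustrated with a toy copy-on-write model. This is a deliberate simplification for illustration only, not core's actual allocator: each commit writes fresh blocks, and a block can only be recycled once no live read transaction can still observe the version that wrote it.

```python
# Toy model of a copy-on-write store: every commit writes fresh blocks
# instead of mutating old ones, and a block can only be recycled once
# no live read transaction can still observe the version that wrote it.

class ToyStore:
    def __init__(self):
        self.blocks = {}          # block id -> version that wrote it
        self.next_block = 0
        self.version = 0
        self.pinned = set()       # versions held open by readers

    def begin_read(self):
        self.pinned.add(self.version)
        return self.version

    def end_read(self, version):
        self.pinned.discard(version)

    def commit_write(self):
        self.version += 1
        self.blocks[self.next_block] = self.version   # copy-on-write
        self.next_block += 1
        # Recycle blocks older than anything a reader can still see.
        oldest_pinned = min(self.pinned, default=self.version)
        stale = [b for b, v in self.blocks.items()
                 if v < oldest_pinned and v < self.version]
        for b in stale:
            del self.blocks[b]

store = ToyStore()
reader = store.begin_read()         # long-lived read transaction
for _ in range(100):
    store.commit_write()            # writes only, no deletes
blocks_while_pinned = len(store.blocks)   # 100: nothing recycled

store.end_read(reader)
store.commit_write()                # reader gone: stale blocks freed
blocks_after_release = len(store.blocks)  # back down to 1
print(blocks_while_pinned, blocks_after_release)
```

Note that no delete ever happens here: the file grows purely because the old read transaction pins the oldest version, so none of the intermediate blocks can be reused.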

First of all, I am positively not talking about the invalidate/file size issue thing, at all :D I’m just talking about how much disk space we can use under typical scenarios, and trying to slim that down.

All your points are terrific though! If you think the real silver bullet is to stop holding on to sparsely populated blocks, feel free to close this.

That said, if it looks like implementing it will take a while, I would kindly push for us to add an explicit compact method, so people can recover from existing file sizes without having to do the copy dance. (Or is the idea that you prefer forcing people to copy files around, so it’s less likely they will create issues in their app?)

I do feel the fact the allocator doubles at every step is a bit inconsiderate of common limits on mobile, and that should be looked into, as a separate issue.
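A toy sketch of that worst case, assuming a simple double-from-a-fixed-base growth policy (the numbers and the starting size are purely illustrative; the real allocator may differ):

```python
# Worst-case slack of a doubling allocator: capacity jumps to the next
# power-of-two multiple of the base, so right after a growth step
# almost half the file can be unused. Illustrative numbers only.

def doubled_capacity(payload, start=4096):
    cap = start
    while cap < payload:
        cap *= 2                  # grow by doubling
    return cap

payload = 65 * 1024 * 1024        # 65 MB of live data
cap = doubled_capacity(payload)   # doubles all the way up to 128 MB
slack = 1 - payload / cap         # ~49% of the file is empty
print(cap // (1024 * 1024), round(slack, 2))
```

This is also where the "at most 50%" figure mentioned below comes from: a doubling step can leave just under half the file empty, so a smarter growth policy can win at most about that much.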

It just isn't clear to me how users will know when to call the in-place compact method - if their db has grown to the size that they need it then they will have to use the writeCopyToPath: version anyway as their Realm will be inaccessible.

Changing the allocator could improve things by at most 50%. That may be better than nothing, but we still need to find a way to do better than that.

The point of compact isn’t to save users when they hit the ceiling, it’s to let them optimize their files during regular operations. (This came up while talking to a couple of different users recently) e.g. they use Realm as a cache that they flush regularly — with the current setup, they are never able to reclaim the space. Another scenario that could use a compact is doing a remote backup of your .realm file.
Admittedly, both scenarios _could_ work with writeCopyToPath:, but it’s just not clear at all from that method name that it will achieve what they want (reduce the file size). Maybe we just rename it to “writeCompactedCopyToPath:” to make it slightly more discoverable?

I’ll write something up for the allocator in core — I do think a 50% improvement is nothing to sneeze at :wink: You should explain your thoughts re: sparsely-populated blocks (if that’s not in a ticket somewhere already!)

By default Core Data sets SQLite's auto_vacuum to 1 (full), which means that compaction is done at every transaction commit. This is likely why you never hear about Core Data file sizes ballooning to unacceptable sizes, though it does come at the expense of (typically slightly) more expensive transactions.

@finnschiermer has a PR up for an in-place compaction (https://github.com/realm/realm-core/pull/591), but even once that's done, it requires that there be no other open transactions during the compaction. Which means that PR won't be sufficient to support an auto-compaction like SQLite's.

In the short term (once https://github.com/realm/realm-core/pull/591 is merged), we could propose to users whose apps create unusually large Realm files that they run this on app launch. We'd of course have to run benchmarks before being able to claim it as a viable option, but in my limited tests, writeCopyToPath: is actually extremely fast.

If we find that in-place compaction is fast enough, we could even have Realms auto-compact on app launch by default (though we'd have to allow users to disable that).

I think auto-compaction after some/all transactions (or at launch) would probably be taking this too far at this point. I get y’all’s points about inducing lag in apps, and about letting developers be in complete control of when they potentially induce that lag. For a while I thought running a compact after massive deletes would be palatable, but now I agree it’s as problematic as doing this on launch or after transactions.

That said, I still believe we need to make it more discoverable that realm files _are_ compactable, so what are your thoughts on renaming the method to writeCompactedCopyToPath:? Is that unnecessary or undesirable?

To recap the four potential actions mentioned during the course of this thread:

  1. Do we agree that we should not auto-compact for the foreseeable future, even once realm/realm-core#591 is in?
  2. Anybody strongly against renaming writeCopyToPath: to something like writeCompactedCopyToPath: to help discovery of the only current compaction mechanism? It seems like it may be some time before we can offer something better/safer to users.
  3. Allocator — I already wrote up a ticket about re-evaluating the allocation scheme in core
  4. Sparsely-populated blocks — a close reading of #1034 seems to indicate that that ticket tracks @alazier’s issue with sparsely-populated blocks

I think auto-compaction after some/all transactions (or at launch) would probably be taking this too far at this point

Why? In the worst-case scenario that I constructed (lots of tables, with many columns/rows/links), writeCopyToPath: never took longer than 0.07s (which is actually very impressive). When you consider that this is a worst-case scenario, and that a typical app's launch time is between 1 and 2 seconds, I think that's quite reasonable.

I'm in favor of renaming writeCopyToPath: to writeCompactedCopyToPath:

I'm not. Performing an in-place compaction correctly with it is fairly difficult, but it's pretty easy to do something that appears to work yet actually results in data loss and crashes. We really don't want people trying to write their own in-place compaction logic with writeCopyToPath:.

I completely agree, but there are _safer_ ways to do it (i.e. at app launch).

OK, so it remains that we should give users a way to compact files (either in place or at a new location), for the use-cases listed above, at least until we deal with #1034. Fundamentally, do we feel like a better solution is right around the corner with realm/realm-core#591? What is the ideal compaction API/experience we want to give users?

I do think a 50% improvement is nothing to sneeze at :wink:

Out of context this sounds good, but in the worst case we can use up to three orders of magnitude more space than is required to store the data. So in practice such an improvement will have a negligible effect.

You should explain your thoughts re: sparsely-populated blocks (if that’s not in a ticket somewhere already!)

This has been discussed a few times in the office and is due to how core writes data (it copies blocks and only changes the new data, and can hang on to outdated blocks that are no longer needed). This is not an issue per se, but the underlying reason/explanation for why our db sometimes grows very large unexpectedly.

I think auto-compaction after some/all transactions (or at launch) would probably be taking this too far at this point.

I'm not worried about users who know they need to compact - a simple name change or more docs, as you suggest, could address this. The bigger issue is the greater number of users whose files have grown large without them knowing - this can easily occur without having to delete any data (many small transactions with an old read-lock...), yet users will not anticipate that a compaction is needed to work around the issue. The only ways to truly address this are to fix the underlying problem in core or to do auto-compaction. We could definitely implement auto-compaction using lazy writes, but this would require significant effort in core. For large files I don't anticipate that we could ensure that compaction would take less than a few ms, which is pretty much the worst acceptable case if we want to do this on startup.

By default Core Data sets SQLite's auto_vacuum (http://www.sqlite.org/pragma.html#pragma_auto_vacuum) to 1 (full), which means that compaction is done at every transaction commit. This is likely why you never hear about Core Data file sizes ballooning to unacceptable sizes, though it does come at the expense of (typically slightly) more expensive transactions.

SQLite still has the same issue as us, as it is unrelated to compaction. When you compact, it cannot remove data that is currently held by blocked write transactions (which is held in the Write-Ahead Log, which is not compacted), so compact only has an effect when no transactions are reading. But their design with manual transactions discourages holding read transactions open for long, so it is rarely a problem.

The right solution is to either find ways to discourage blocking read transactions or to change core to reuse intermediate states. Improving core is likely the best solution in the long run, but it is both a big (and format-breaking) task, and may have both size and performance penalties.

I don't think that compaction helps much, as you can only do it after the fact, when there are no open read transactions, and at that point the space can be reused anyways.

@astigsen we’re actually not talking about the read/write thing, just about the normal allocation/release schemes that do cause some space waste, even if you don’t hit the thing we added invalidate for.

Just had a conversation about this with @astigsen and @bmunkholm. Takeaways:

  1. We believe that now that we have invalidate, there are very few legitimate cases in which you would need to compact files in-place. That said, there will be a few, so we do need to have an in-place compaction method somewhere in our API. If we expose one, we have to assume people will try to use it, and as @tgoyne highlights, it’s very easy to shoot yourself in the foot with the current (& proposed) compaction methods.
  2. @astigsen comments that realm/realm-core#591 may not be safe enough. There may be scenarios in which we could see shadow writes or even corruption if people try to read the file while it is being compacted. We need to design a live compaction mechanism that assumes the file will be read while it is compacted.
  3. Once it is live in core, we should expose this safe compaction as an in-place compact method on RLMRealm — we can debate hiding this method from our public API if we want.
  4. Optionally, we can debate running compact automatically after people call deleteAllObjects, since that’s the scenario people seem to hit most often.

So it sounds like there is one proposed change which has come out of this discussion - to add a compact method. Will change the name accordingly.

Just following up: PR 591 against core does guard against corruption. When calling Compact, the caller must be the only one with an open shared group. If this is not the case, it is detected and the call fails. Once compaction is in progress, any other thread trying to open a SharedGroup will wait in open() until compaction is complete.

/Finn
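The guard Finn describes could be sketched like this. The semantics are inferred from the description above (sole-opener check, failure otherwise, late openers blocking); the class and member names are hypothetical, not core's actual code:

```python
import threading

# Sketch of the compaction guard: compact() succeeds only when the
# caller holds the sole open "shared group"; otherwise it fails, and
# openers that arrive mid-compaction wait until it finishes.

class SharedGroupFile:
    def __init__(self):
        self._lock = threading.Lock()
        self._open_count = 0
        self._not_compacting = threading.Event()
        self._not_compacting.set()        # set == no compaction running

    def open(self):
        self._not_compacting.wait()       # wait while compaction runs
        with self._lock:
            self._open_count += 1

    def close(self):
        with self._lock:
            self._open_count -= 1

    def compact(self):
        with self._lock:
            if self._open_count != 1:
                return False              # not the sole opener: fail
            self._not_compacting.clear()
        # ... rewrite the file in place here ...
        self._not_compacting.set()
        return True

f = SharedGroupFile()
f.open()
sole_owner_result = f.compact()   # True: caller is the only opener
f.open()
shared_result = f.compact()       # False: detected and the call fails
print(sole_owner_result, shared_result)
```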

If/when the compact method is made public, it would be great to have the reason why it didn't compact the realm. Like in a try/catch having an error like: "did not compact because process X holds an open connection on the Realm".
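A minimal sketch of what that could look like, with every name (the error type, the parameter, the process identifier) hypothetical:

```python
# Sketch of the suggestion above: instead of a bare success/failure,
# compact() raises an error carrying the reason, so a caller's
# try/catch can surface why the Realm was not compacted.

class CompactError(Exception):
    def __init__(self, reason):
        super().__init__(f"did not compact because {reason}")
        self.reason = reason

def compact(open_connections):
    if open_connections:
        raise CompactError(
            f"process {open_connections[0]} holds an open "
            f"connection on the Realm")
    return True                   # compaction would proceed here

try:
    compact(open_connections=["com.example.widget"])
    message = "compacted"
except CompactError as e:
    message = str(e)
print(message)
```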
