Borg: Cache persistency

Created on 15 Jul 2016 · 20 Comments · Source: borgbackup/borg

Running a single daily "borg create" over all required directories is a lot faster (10-50x) than running many separate "borg create" commands, one per directory, against the same repo.
It looks like the cache gets purged of files that were not seen in the current borg run, even though they could be seen in a subsequent one.
I propose adding cache persistency, so that many borg runs can be done while still finding the files in the cache.
One approach would be to keep files cache entries until at least some time (maybe 1 day by default) has passed without seeing them. This way you could run many different borg create commands each day.
An improvement could also take into account that backups may be skipped for some periods (e.g. weekends, or during a network failure), and still not purge the cache when the first run of a batch is done.

documentation easy

FabioPedretti

All 20 comments

Entries not seen in the files cache are only removed after they were not seen ten times in a row.

Check the file status when create runs - only if it's A (for added) the file wasn't found in the files cache.
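
To illustrate the rule described above, here is a minimal sketch of age-based eviction. It is not borg's actual code: the function names, the dict layout and the use of plain directory names as keys are assumptions made for illustration. Only the rule itself (an entry is dropped after going unseen ten runs in a row) comes from this thread.

```python
MAX_AGE = 10  # from this thread: dropped after 10 unseen runs in a row


def run_create(cache, seen, max_age=MAX_AGE):
    """Simulate one create run; return (new_cache, hits)."""
    hits = sum(1 for key in seen if key in cache)
    new_cache = {key: 0 for key in seen}           # seen now: age resets to 0
    for key, age in cache.items():
        if key not in seen and age + 1 < max_age:  # unseen: one run older
            new_cache[key] = age + 1
    return new_cache, hits


def hits_in_second_round(num_sets, max_age=MAX_AGE):
    """Run num_sets single-directory backups twice; count hits in round two."""
    cache = {}
    hits = 0
    for _round in range(2):
        hits = 0
        for n in range(num_sets):
            cache, h = run_create(cache, {f"dir{n}"}, max_age)
            hits += h
    return hits


print(hits_in_second_round(10))  # 10: every directory is still cached
print(hits_in_second_round(11))  # 0: every entry expired just before reuse
```

With ten backup sets each entry survives until its own set runs again; with eleven, each entry reaches age ten one run too early, which would explain the slowdown reported above.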

-- How long does each borg-create take, approximately? a few minutes or longer?

enkore on 15 Jul 2016

It makes sense, I previously had more than 10 different borg create runs.
Each one took from a few minutes up to 5 hours.
After I moved from the many runs to a single borg create with all the directories, the first run took about 12:30 h.
The next day it took about 1:00 h.
I am backing up about 1.1TB, 1.5M files in total, from NFS-mounted shares to a repo on a remote server mounted with sshfs.

FabioPedretti on 15 Jul 2016

Ok, these numbers sound about right for the size of the data set, depending on churn / rate of change and considering that both "sides" (input and repository) are network-mounted. Using borg's built-in remoting capability over SSH instead of sshfs may improve performance a bit, but not drastically, at least not at this time.

If the machines involved in the SSH / sshfs connection are weak, you may also improve performance by choosing different ciphers. Recent SSH versions support chacha20-poly1305@openssh.com, which is pretty much the fastest cipher on all CPUs without AES-NI (or similar); in these cases it is much, much faster than AES. There is nothing to do: if supported, this is the default preferred cipher. Yay! :)

         The default is:

         chacha20-poly1305@openssh.com,
         aes128-ctr,aes192-ctr,aes256-ctr,
         aes128-gcm@openssh.com,aes256-gcm@openssh.com,
         aes128-cbc,aes192-cbc,aes256-cbc,3des-cbc

enkore on 15 Jul 2016

Do we need either max_generations to be configurable, or named backup sets with a separate files cache for each?

ThomasWaldmann on 15 Jul 2016

I would see this more as an implementation detail. Maybe we can avoid invalidating files cache entries that can't possibly have been seen in the current create run?

An entirely different question is whether this is a real problem. Is it typical to have 10s of different backup sets on the same client (same cache) that go in separate archives _and_ that run always at the same time in-sequence?

enkore on 15 Jul 2016

Hmm, we can't match the pathnames as we do not have them in the cache, just their hashes.

ThomasWaldmann on 15 Jul 2016

Thanks for looking into this.
About the reason to use more than 10 runs: I had some trouble passing directories with spaces, so I ended up using different backup instances*.
Maybe just adding to the FAQ 1) the note about the 10 runs and 2) the suggestion to use built-in SSH rather than sshfs would be enough for now.

*This is the code that I currently use, properly supporting paths with spaces passed through an env variable:

SHARES[0]='DIR A'
SHARES[1]='DIR B'
SHARES[2]='DIR C'
...
# Quote the array expansion so each element stays a single argument
borg create -v --stats -C zlib ::$ARCHIVE-$TODAY "${SHARES[@]}"

FabioPedretti on 15 Jul 2016

Most troubles with spaces in arguments can be solved 'with quoting' or with\ escaping.

If you use shell variables, try SHARE="'with spaces'" (outer double quotes, inner single quotes).

Borg does nothing special with spaces, you just need to hold your shell back from splitting a path with spaces into multiple arguments.
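
As a rough illustration of the splitting described above, the standard-library `shlex` module can tokenize a command line much like a POSIX shell does. This is only an approximation (no variable expansion or globbing), and the archive and directory names are made-up examples.

```python
import shlex

# A command line as typed at the prompt; the names are examples only.
line = r"borg create ::archive 'DIR A' DIR\ B DIR C"

args = shlex.split(line)
for arg in args:
    print(repr(arg))

# 'DIR A' and DIR\ B each arrive as one argument, while the unquoted
# DIR C is split into two separate arguments: 'DIR' and 'C'.

# shlex.quote() goes the other way: it makes a string safe to paste
# into a shell command line.
print(shlex.quote("DIR C"))
```

The program being called only ever receives the already-split argument list, which is why the quoting has to happen on the shell side.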

ThomasWaldmann on 15 Jul 2016

When in doubt about quoting, you can just use a small Python script like the one below (saved here as quot.py) and run it instead of the actual program:

import sys
print(sys.argv)

$ python quot.py this should be one name but isnt 'this is one name, though' also\ this 
['quot.py', 'this', 'should', 'be', 'one', 'name', 'but', 'isnt', 'this is one name, though', 'also this']

enkore on 15 Jul 2016

TODO:

make a separate FAQ entry about max_age == 10 for files cache entries.
maybe also add a FAQ entry about paths with spaces or other special chars

ThomasWaldmann on 15 Jul 2016

What about keeping the 10 sets, but also requiring some days (with 2 you can safely manage daily backups, with 8 also weekly) before purging caches?

FabioPedretti on 16 Jul 2016

Well, the count required depends on the number of backup sets and how frequently they run.

ThomasWaldmann on 16 Jul 2016

Could a command line flag maybe be added that specifies the number (default: 10), so that only entries that were "not seen" at least that many times are purged from the cache?
Or even an option similar to "--keep-within=2d" for keeping any cache entries that were seen within the last 2 days. Maybe this would require adding a "last seen date" field to the cache entries as well, so it might be overkill/too much overhead. But specifying at least the number would be super useful.

I've run into this problem as well: I run backups on my development folder and my VirtualBox folder every hour or so, but whole-machine backups maybe daily. If I do more than 10 of these small backups, the whole machine backup the next day takes much longer.
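
The time-based variant suggested above could be sketched roughly as follows. This is a hypothetical illustration, not something borg implements as shown: the function name `prune_by_last_seen`, the `keep_within` parameter and the "last seen" timestamp per entry are all assumptions, and readable file names stand in for the hashes the real cache stores.

```python
from datetime import datetime, timedelta


def prune_by_last_seen(cache, now, keep_within=timedelta(days=2)):
    """Keep only entries whose last-seen time lies within keep_within of now."""
    return {key: seen_at for key, seen_at in cache.items()
            if now - seen_at <= keep_within}


now = datetime(2016, 7, 17, 12, 0)
cache = {
    "dev/main.py": now - timedelta(hours=1),     # hourly backup set
    "home/photo.jpg": now - timedelta(days=1),   # daily whole-machine set
    "old/removed.txt": now - timedelta(days=5),  # not seen for days
}
print(sorted(prune_by_last_seen(cache, now)))
# ['dev/main.py', 'home/photo.jpg']
```

With such a rule the number of hourly runs in between would no longer matter, at the cost of the extra timestamp per entry mentioned above.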

jmiserez on 17 Jul 2016

Implementation: Add tweakable BORG_FILES_CACHE_TTL environment variable?

Another option could be to change the age field into "last seen at" (adding another field is imho not a good idea) and work from there. Note that age is evaluated when _storing_ the files cache, so after a backup run any file seen would always have that field set to within the last couple of minutes.

OTOH, KISS; making it tweakable is likely sufficient.
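
A minimal sketch of what such a tweakable could look like, using the BORG_FILES_CACHE_TTL name proposed above. The helper `files_cache_ttl`, its default handling and the integer parsing are assumptions for illustration, not borg's actual implementation.

```python
import os

DEFAULT_TTL = 10  # the value discussed at this point in the thread


def files_cache_ttl(environ=os.environ, default=DEFAULT_TTL):
    """Read the TTL from BORG_FILES_CACHE_TTL, falling back to the default."""
    return int(environ.get("BORG_FILES_CACHE_TTL", default))


print(files_cache_ttl({}))                              # 10
print(files_cache_ttl({"BORG_FILES_CACHE_TTL": "30"}))  # 30
```

Someone running many small backups between full ones could then raise the value for their setup without any change to the cache format.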

enkore on 17 Jul 2016

Yes, guess an env var would be nice. Should we also raise the default value?

ThomasWaldmann on 17 Jul 2016

20 maybe?

enkore on 17 Jul 2016

ok.

ThomasWaldmann on 18 Jul 2016

Also: add to FAQ about this:
Q: ...
A: The files cache intentionally does not contain entries with an "entry age" >= 20 (|project_name| has not seen this file within N backup runs).

ThomasWaldmann on 18 Jul 2016

Yes, guess a env var would be nice.

Would love to see that soon (1.0.7?). This way I could go back to a single archive per server (NFS share), rather than combining several servers into single archives.

FabioPedretti on 30 Jul 2016

Fixed by #1414.

ThomasWaldmann on 31 Jul 2016
