Running a single daily "borg create" that covers all required directories is a lot faster (10-50x) than running many separate "borg create" commands, one per directory, against the same repo.
It looks like the cache gets purged of files that were not seen in the current borg run, even though they could be seen in a subsequent one.
I propose adding some form of cache persistence, so that many borg runs can each still find their files in the cache.
One approach could be to keep files cache entries until some time (maybe 1 day by default) has passed without seeing them. This way you could do many different borg create runs per day.
An improvement could also account for periods in which no backups are done (e.g. weekend days, or during a network failure), so that the cache is still not purged when the first run of the batch is done.
Entries not seen in the files cache are only removed after they were not seen ten times in a row.
Check the file status when create runs - only if it's A (for added) the file wasn't found in the files cache.
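To make the rule above concrete, here is a minimal Python sketch of age-based eviction. It is not Borg's actual code: the names `FilesCache`, `lookup` and `finish_run` are hypothetical, and the sketch only assumes that each entry carries an age counter that is reset when the file is seen and incremented after every run in which it is not seen.

```python
class FilesCache:
    """Toy files cache: maps a path key to [age, metadata]."""

    def __init__(self, max_age=10):
        # Assumption: entries unseen for max_age runs in a row are dropped.
        self.max_age = max_age
        self.entries = {}
        self.seen = set()

    def lookup(self, key, metadata):
        """Return True on a cache hit (known and unchanged) and mark the entry as seen."""
        entry = self.entries.get(key)
        hit = entry is not None and entry[1] == metadata
        # Seen in this run: store (or refresh) the entry with age 0.
        self.entries[key] = [0, metadata]
        self.seen.add(key)
        return hit

    def finish_run(self):
        """Age every entry that was not seen in this run and drop the old ones."""
        for key in list(self.entries):
            if key in self.seen:
                continue
            self.entries[key][0] += 1
            if self.entries[key][0] >= self.max_age:
                del self.entries[key]
        self.seen.clear()
```

With this model, a file from one backup set survives nine runs of other backup sets and is evicted by the tenth, which would match the slowdown reported when more than ten separate runs share one cache.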
-- How long does each borg-create take, approximately? a few minutes or longer?
It makes sense; I previously had more than 10 different borg create runs.
Each one took from a few minutes up to 5 hours.
After I moved from the many runs to a single borg create with all the directories, the first run took about 12:30 h.
The next day it took about 1:00 h.
I am backing up about 1.1TB, a total of 1.5M files, from mounted NFS shares to a repo on a remote server mounted with sshfs.
Ok, these numbers sound about right for the size of the data set, depending on churn / rate of change and considering that both "sides" (input and repository) are network-mounted. Using borg's built-in remoting capability over SSH instead of sshfs may improve performance a bit, but not drastically, at least not at this time.
If the machines involved in the SSH / sshfs connection are weak, you may also improve performance by choosing different ciphers. Recent SSH versions support chacha20-poly1305@openssh.com, which is pretty much the fastest cipher on all CPUs with no AES-NI (or similar); in these cases it is much, much faster than AES. Nothing to do, though: if supported, this is the default preferred cipher. Yay! :)
The default is: chacha20-poly1305@openssh.com, aes128-ctr,aes192-ctr,aes256-ctr, aes128-gcm@openssh.com,aes256-gcm@openssh.com, aes128-cbc,aes192-cbc,aes256-cbc,3des-cbc
Do we need either a configurable max_generations, or named backup sets with a separate files cache for each?
I would see this more as an implementation detail. Maybe we can avoid invalidating files cache entries that can't possibly have been seen in the current create run?
An entirely different question is whether this is a real problem. Is it typical to have 10s of different backup sets on the same client (same cache) that go in separate archives _and_ that run always at the same time in-sequence?
Hmm, we can't match the pathnames as we do not have them in the cache, just their hashes.
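A small Python sketch of why hashed keys rule out path matching. It assumes, purely for illustration, that each cache key is a SHA-256 digest of the full path; the hash function, `path_key` and the sample paths are hypothetical and do not describe Borg's real key derivation.

```python
import hashlib


def path_key(path):
    """Hypothetical cache key: a fixed-size digest of the full path."""
    return hashlib.sha256(path.encode("utf-8")).digest()


# The cache only stores digests, never the path names themselves.
cache = {
    path_key(p): {"size": 0}
    for p in ["/srv/a/file1", "/srv/a/file2", "/srv/b/file3"]
}

# An exact lookup works, because the same path gives the same digest.
exact_hit = path_key("/srv/a/file1") in cache

# A prefix lookup does not: the digest of "/srv/a" says nothing about the
# digests of the paths below it, so "all entries under /srv/a" cannot be
# selected without the original path names.
prefix_hit = path_key("/srv/a") in cache
```

Here `exact_hit` is true and `prefix_hit` is false, so "entries that could have been seen in this run" cannot be derived from the directories passed on the command line alone.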
Thanks for looking into this.
About the reason to use more than 10 runs: I had some trouble passing directories with spaces, so I ended up using different backup instances*.
Maybe it would be enough for now to add to the FAQ 1) a note about the 10 runs and 2) a suggestion to use the built-in SSH support rather than sshfs.
*This is the code that I currently use, properly supporting path with spaces passed through an env variable:
SHARES[0]='DIR A'
SHARES[1]='DIR B'
SHARES[2]='DIR C'
...
borg create -v --stats -C zlib "::$ARCHIVE-$TODAY" "${SHARES[@]}"
Most troubles with spaces in arguments can be solved 'with quoting' or with\ escaping.
If you use shell variables, try SHARE="'with spaces'" (outer double quotes, inner single quotes).
Borg does nothing special with spaces, you just need to hold your shell back from splitting a path with spaces into multiple arguments.
When in doubt about quoting, you can write a small Python script like the one below and run it in place of the actual program:
import sys
print(sys.argv)
$ python quot.py this should be one name but isnt 'this is one name, though' also\ this
['quot.py', 'this', 'should', 'be', 'one', 'name', 'but', 'isnt', 'this is one name, though', 'also this']
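The same check can be done without a shell at all. The sketch below uses Python's standard `shlex` module, which follows POSIX shell splitting rules, to reproduce the output above and to show how a list of paths with spaces can be quoted safely; the `DIR A` / `DIR B` names are just the placeholders from the earlier comment.

```python
import shlex

# The command line from the example above, as one string.
command = "quot.py this should be one name but isnt 'this is one name, though' also\\ this"

# shlex.split applies POSIX-style quoting and escaping rules.
args = shlex.split(command)
print(args)

# Going the other way: quote each path so that a shell keeps it as one argument.
shares = ["DIR A", "DIR B"]
quoted = shlex.join(shares)
print(quoted)
```

`shlex.join` needs Python 3.8 or newer; on older versions, `" ".join(shlex.quote(s) for s in shares)` does the same job.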
What about keeping the limit of 10 runs, but also requiring some number of days to pass (with 2 you can safely manage daily backups, with 8 also weekly ones) before purging cache entries?
Well, the count required depends on backup sets and frequency of them.
Could a command line flag maybe be added that specifies the number (default: 10), so that only cache entries that were "not seen" at least that many times are purged?
Or even an option similar to "--keep-within=2d" that keeps any cache entry until it has not been seen for at least 2 days. Maybe this would require adding a "last seen date" field to the cache entries as well, so it might be overkill/too much overhead. But being able to specify at least the number would be super useful.
I've run into this problem as well: I run backups on my development folder and my VirtualBox folder every hour or so, but whole-machine backups maybe daily. If I do more than 10 of these small backups, the whole machine backup the next day takes much longer.
Implementation: Add tweakable BORG_FILES_CACHE_TTL environment variable?
Another option could be to change the age field into "last seen at" (adding another field is imho not a good idea) and work from there. Note that age is evaluated when _storing_ the files cache, so after a backup run any file seen would always have that field = the last couple minutes.
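The "last seen at" variant could be sketched roughly as follows. This is only an illustration of the idea: `prune_by_last_seen`, its parameters and the sample keys are hypothetical, and the entries are reduced to a plain mapping from key to a Unix timestamp.

```python
import time


def prune_by_last_seen(entries, keep_within_seconds, now=None):
    """Keep only entries that were last seen within the given time window.

    entries maps a cache key to the Unix timestamp at which it was last seen.
    """
    if now is None:
        now = time.time()
    return {
        key: last_seen
        for key, last_seen in entries.items()
        if now - last_seen <= keep_within_seconds
    }


# Example: keep everything seen within the last 2 days.
two_days = 2 * 24 * 60 * 60
now = 1_000_000
entries = {"hourly": now - 60, "daily": now - 86_400, "stale": now - 3 * 86_400}
kept = prune_by_last_seen(entries, two_days, now=now)
```

With a two-day window, `hourly` and `daily` are kept and `stale` is dropped, independent of how many runs happened in between.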
OTOH, KISS; making it tweakable is likely sufficient.
Yes, guess an env var would be nice. Should we also raise the default value?
20 maybe?
ok.
Also: add to FAQ about this:
Q: ...
A: The files cache does intentionally not contain entries that have >= 20 as "entry age" (|project_name| has not seen this file within N backup runs).
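Reading such a tweakable limit could look like the sketch below. The variable name BORG_FILES_CACHE_TTL and the default of 20 come from this discussion; the helper `files_cache_ttl` and its fallback behaviour for invalid values are assumptions made for illustration, not a description of the final implementation.

```python
import os


def files_cache_ttl(environ=None, default=20):
    """Return the number of runs an unseen files cache entry is kept.

    Falls back to the default when the variable is unset or not a
    positive integer (assumed behaviour for this sketch).
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("BORG_FILES_CACHE_TTL")
    if raw is None:
        return default
    try:
        ttl = int(raw)
    except ValueError:
        return default
    return ttl if ttl > 0 else default
```

Someone with many small hourly backups between whole-machine backups could then export, for example, BORG_FILES_CACHE_TTL=50 before running borg create.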
Would love to see that soon (1.0.7?). This way I could go back to a single archive per server (NFS share), rather than combining several servers into single archives.
Fixed by #1414.