Borg: Backup to multiple remote locations

Created on 6 Feb 2016 · 11 Comments · Source: borgbackup/borg

Currently I don't see how the backup can be synced to multiple remote locations without running the backup multiple times. It seems reasonable to first do a local backup, then sync the backup files to the remote locations.

Rsync can be used to sync the backup files, but due to how borg compacts the segment files this will result in a lot of duplicated data transfer if there are few or no changes to the backup set.

Each time borg creates a new backup, a new segment file is created. If there are few or no changes, this file is very small. However, borg combines this segment file with the existing segment file to minimize the total number of files. Using rsync to transfer the changes means copying this combined file, which adds on average about 2.5 MB (half the 5 MB maximum segment file size) of extra data transferred for every single backup.
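The estimate above can be checked with a toy model. This is only a sketch of the reasoning in this post, not of borg's actual behaviour: it assumes every backup adds the same small amount of data (`delta`, a made-up value), that the last segment file is re-sent in full each time, and that a new segment starts once the 5 MB limit would be exceeded.

```python
def resent_bytes_per_backup(max_segment: int, delta: int, backups: int) -> list[int]:
    """Return the size of the last segment file after each backup.

    Under the assumptions of this toy model, that size is what a
    whole-file copy has to transfer again.
    """
    sizes = []
    current = 0
    for _ in range(backups):
        current += delta
        if current > max_segment:
            # The previous segment was full, so a fresh one starts.
            current = delta
        sizes.append(current)
    return sizes


MAX_SEGMENT = 5_000_000   # 5 MB, the figure quoted in the post
DELTA = 50_000            # hypothetical amount of new data per backup

sizes = resent_bytes_per_backup(MAX_SEGMENT, DELTA, backups=1000)
average = sum(sizes) / len(sizes)
print(f"new data per backup:    {DELTA / 1e6:.2f} MB")
print(f"average re-sent amount: {average / 1e6:.2f} MB")
```

In this model the average re-sent amount comes out close to half the maximum segment size, while the genuinely new data per backup is far smaller.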

I have tried a simple hack: not compacting the segment files at all. This works quite well, though with very limited testing. There might, however, be good reasons for compacting the segments, such as being able to delete old backups.

My proposal is to only compact segments when there is enough data to fill a new segment file. This will drastically decrease the amount of duplicated data transfer.

Of course, any other solution for syncing the backup files will be appreciated.

It might be interesting to look at the paper describing the block-based storage model of Duplicati 2.0, specifically its assumption of a "dumb" storage backend.
https://duplicati.googlecode.com/files/Block-basedstorageformat.pdf

oysols

All 11 comments

This issue has been on my mind also, as I have similar concerns. I think it's generally good practice (particularly because of bit-rot, but not only because of it) to replicate backups in more than one location and on different media. The way borg currently aggressively (or you could say, optimally, in terms of storage space consumed) compacts the segment files perhaps doesn't lend itself well to minimising data transfers when replicating the borg repo.

To be fair, this also exposes a limitation in rsync. Rsync is not optimised for the case where a file is both modified and renamed (as is essentially the case when compacting the borg segments). If there was an rsync-like tool that built chunk-based (rather than file-based) hashes to minimise data transfers, that would scratch our itch.

Your suggestion seems similar to one that came to my mind recently. My thought was to make the segment file compaction/optimisation process skippable (and to provide a separate borg compact command to carry out that task at a time of the user's choosing).

Perhaps both your suggestion and mine could be combined for maximal effect?

In the meantime, some hacky ways around the problem may be:

If you've got plenty of disk space, create an uncompressed tarball of the borg repo at each end and then rsync the tarball. But of course you double your storage requirement on each end.
Use zbackup to sync your repos to your additional remote locations. The downside here is that if you need to restore files from one of these additional remote locations you will first need to extract the borg repo from the zbackup repo. The other hassle is that you'll need to periodically purge old/deleted segment data from your zbackup repo - it's not difficult, but just CPU/IO intensive and requires extra disk space. Also, if memory serves, zbackup is hard-coded to use heavy compression, which is an extra waste of CPU cycles.
Use some kind of underlying block device and rsync that. For example, you could have your borg repos in a filesystem on a loopback device (filesystem within a file). Then you can rsync the virtual filesystem file. The obvious downside is that you need to lock yourself in to a size for your virtual filesystem. In addition, prior to syncing to your remote locations you'll want to zero the deleted blocks on the filesystem.

These are the 'hacks' that come to mind. Very warty, but a means to an end.

level323 on 6 Feb 2016

I recently added this to FAQ:

http://borgbackup.readthedocs.org/en/latest/faq.html#can-i-copy-or-synchronize-my-repo-to-another-location

Close?

ThomasWaldmann on 6 Feb 2016

It might be that the only use case is if you want a backup of the backup. Having a backup of the backup would not help against problems during backup creation, but it would help for bit-rot or deleting/corrupting your backup due to a bug in borg or by accident (or malicious intent).

As far as I can see borg does not touch segment files after creation. If there is a change a new segment file is created. Simply copying new segment files to a remote location would be enough to guarantee a complete set of segment files, and would not require much bandwidth. This would also enable us to use "dumb" storage such as Amazon Glacier without much overhead.
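A minimal sketch of the "copy only new files" idea, assuming (as this comment does) that files are never modified after creation. The function name and directory layout are made up for illustration. It deliberately ignores any file that changes in place or gets deleted on the source side, so it is not a complete replication tool for a real repo.

```python
import shutil
from pathlib import Path


def copy_new_files(src: Path, dst: Path) -> list[Path]:
    """Copy every file under src that does not yet exist under dst.

    Existing files are never compared or overwritten, which is only safe
    if files are immutable once written (the assumption made above).
    Returns the relative paths that were copied.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        if target.exists():
            continue  # already replicated earlier
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel)
    return copied
```

Calling `copy_new_files(Path("/path/to/repo"), Path("/mnt/replica"))` (both paths hypothetical) after each backup would then transfer only files that have appeared since the previous run.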

It seems like the best option for me is a nested borg backup, i.e. doing a backup of the backup files. This will however require me to install borg on the remote server, instead of using "standard" rsync.

Regarding compacting segments: I still don't see the need for compacting segment files for each backup. The same data will be compacted multiple times before reaching a "static" state when the latest segment file has reached the maximum segment file size.

oysols on 6 Feb 2016

If you maintain an identical copy of repo1 at a different location (repo2), that does not help you against any issue located in repo1 (including bit rot, deleted archives or other corruption) because it will be also in the identical copy (with a little delay depending on how often you copy). So, if you do that and it breaks, you own the pieces.

Creating a borg backup of a borg repo makes even less sense IMHO. To do any restore in case the first repo fails, you'd first need to completely restore the first repo from the 2nd repo before you could extract even a single file.

About compacting separately: open a new ticket about that, this one is about backup to multiple repos.

I'm closing this now, this is answered in FAQ.

ThomasWaldmann on 6 Feb 2016

It is an old issue, but I wonder:
Why not make it possible to point to multiple destinations instead of repeating the command?
Repeating it requires reading and processing all the source data again, doesn't it?

Would it be possible to start multiple instances of the client <=> server process supplied by one backup process?
I imagine a command like:

borg create \
    --replica ssh://[email protected]:2022/~/backup/main \
    --replica ssh://[email protected]:2022/~/backup/main \
    /path/to/repo::Monday ~/src ~/Documents

I hope what I mean is clear. :blush:

Would this be better than running the command twice?

alexandrestein on 6 Sep 2019

The main problem is that the code is not prepared for this and adding this adds quite some complexity:

needing to deal with multiple repos
including needing to deal with multiple keys (encryption/MAC keys/chunker secrets)
error handling (currently, on major problems borg will just terminate)

OTOH, running the command twice is not as bad as it sounds, mostly because:

the usage of the files cache avoids heavy processing for all the unchanged files
some stuff has to be done multiple times anyway, e.g. for changed files: chunking, encrypting, authenticating (due to different keys in different repos)

Also, the goal of backing up to two locations is likely to have 2 independent backups. If you do them both at once, the processing is not really independent any more and a problem in one of them might impact the other one.
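Running the command once per repo can be scripted. The following is a rough sketch, not an official borg tool: the repo URLs and archive name are placeholders, the helper names are made up, and each repo is assumed to have been initialised separately (so each has its own keys). Every `borg create` is a separate process, so a failure in one run does not stop the others.

```python
import subprocess


def build_create_commands(repos: list[str], archive: str, paths: list[str]) -> list[list[str]]:
    """Build one `borg create REPO::ARCHIVE PATH...` command per repo."""
    return [["borg", "create", f"{repo}::{archive}", *paths] for repo in repos]


def run_independently(commands: list[list[str]]) -> dict[str, int]:
    """Run the commands one after another and collect their exit codes.

    A non-zero exit code for one repo is recorded but does not prevent
    the remaining backups from running.
    """
    results = {}
    for cmd in commands:
        completed = subprocess.run(cmd)
        results[cmd[2]] = completed.returncode
    return results


# Placeholder destinations; replace with real, already initialised repos.
commands = build_create_commands(
    repos=["/path/to/repo", "ssh://user@backup.example.org/~/backup/main"],
    archive="Monday",
    paths=["~/src", "~/Documents"],
)
for cmd in commands:
    print(" ".join(cmd))
# run_independently(commands) would actually start the backups.
```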

ThomasWaldmann on 6 Sep 2019

Well, I don't know the whole process going on under the hood.

But from my standpoint I don't see why one would use a different key and do the chunking, encrypting and authenticating multiple times.

It's just a way to have a clone of the same archive using the same parameters.
I was imagining that the sync process is some kind of pipe in which you send some streams.
Instead of having one pipe we have as many pipes as replications needed.
And we fill those pipes and wait for the slowest pipe to make things simpler in case of error.

Again I don't know how the process works, and you said:

"The main problem is that the code is not prepared for this"

I understand that it's a pretty big piece of work, and I don't have the time to try it myself.
I will try with multiple calls.

Thank you for this quick reply.

alexandrestein on 6 Sep 2019

"I don't see why use a different key and do the chunking, encrypting and authenticating multiple times."

Well, that's the way it currently works. Key material is generated randomly per repo, for security and confidentiality reasons:

AES counter mode requires that, for the same key, counter values are never re-used. And we also manage the counter per repo. Counter re-use breaks the encryption.
The chunker secret is also random and thus chunks the same input data differently per repo. This (in comparison to always starting from a specific chunker non-secret value) is for confidentiality and to counter fingerprinting attacks (so an attacker can't see from chunk lengths what content you have). Theoretically, this could be moved to a multiple-repo shared secret, but there currently is no code to do that.
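Why counter re-use breaks the encryption can be shown with a toy counter-mode cipher. This sketch is not borg's code and does not use AES: as a stand-in it derives the keystream from SHA-256 over key and counter, which is enough to show the general property of any counter-mode stream. If two messages are encrypted with the same key and the same counter, XORing the two ciphertexts cancels the keystream and yields the XOR of the plaintexts.

```python
import hashlib


def keystream(key: bytes, counter: int, length: int) -> bytes:
    """Toy keystream: hash(key || counter) blocks, counter incremented per block."""
    out = bytearray()
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(16, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key: bytes, counter: int, plaintext: bytes) -> bytes:
    """Counter-mode style encryption: plaintext XOR keystream."""
    return xor(plaintext, keystream(key, counter, len(plaintext)))


key = b"shared key used by two repos"   # hypothetical shared key
p1 = b"secret file contents, repo one"
p2 = b"another secret text, repo two!"

# Both "repos" start at counter 0 with the same key: the keystream repeats.
c1 = encrypt(key, 0, p1)
c2 = encrypt(key, 0, p2)
leak = xor(c1, c2)
print(leak == xor(p1, p2))  # the key has dropped out completely

# With non-overlapping counter values the keystream no longer cancels.
c3 = encrypt(key, 100, p2)
print(xor(c1, c3) == xor(p1, p2))
```

An attacker holding both ciphertexts would learn the XOR of the plaintexts without knowing the key, which is why a key shared between repos would also need a shared, never-repeating counter.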

The processing pipes you mention: the code is not very separated, so there are currently no workers connected by pipes. But there is the idea to restructure like that at some time in the future when we introduce multithreading (see milestones).

ThomasWaldmann on 6 Sep 2019

I understand all of this, but I thought that in this particular case (replication) it could be an option to use the same encrypted content for all locations, just to avoid doing the exact same thing multiple times.
The way I see it, we don't reuse the key but the ciphered content.

Great news about the multithreading.
Maybe in the future this will be more feasible.

alexandrestein on 6 Sep 2019

In theory, if one assumes that such "related repos" are always updated in parallel, yes, that would be an option. The problems might then start if users use one of these repos in a non-parallel way (or maybe when it otherwise gets "out of sync" due to some issue with one repo, but not the other), not sure.

Anyway, it is quite some added complexity and because we have the speedup due to the files cache, there is no pressing reason to do it that way (except if you really have huge changes between each backup).

ThomasWaldmann on 6 Sep 2019

Ok :smile:

I'll try running multiple borg commands in parallel with different destinations.

Thank you :v:

alexandrestein on 6 Sep 2019
