Borg: Resume backups

Created on 14 Feb 2016 · 13 Comments · Source: borgbackup/borg

An initial backup can take quite a bit of time. Would it be possible to let the user press Ctrl-C to stop it and then resume sometime later?

All 13 comments

That already works. The default checkpoint interval is 5 minutes, so you will not lose more than 5 minutes of work if you hit Ctrl-C. After the backup has finished successfully, you can remove all backupname.checkpoint archives.
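As a rough illustration of the cleanup step mentioned above, the sketch below picks out leftover checkpoint archives from a list of archive names and builds the matching `borg delete` command lines without running them. It assumes checkpoint archives are named `<name>.checkpoint` or `<name>.checkpoint.<N>`; the repository path and archive names are hypothetical, so check the names in your own repository before deleting anything.

```python
import re

# Assumption: checkpoint archives end in ".checkpoint" or ".checkpoint.<N>".
CHECKPOINT_RE = re.compile(r"\.checkpoint(\.\d+)?$")


def checkpoint_archives(archive_names):
    """Return the names that look like leftover checkpoint archives."""
    return [name for name in archive_names if CHECKPOINT_RE.search(name)]


def delete_commands(repo, archive_names):
    """Build (but do not run) one 'borg delete' argv per checkpoint archive."""
    return [["borg", "delete", f"{repo}::{name}"]
            for name in checkpoint_archives(archive_names)]


if __name__ == "__main__":
    # Hypothetical archive names, for illustration only.
    names = ["docs-2016-02-14",
             "docs-2016-02-14.checkpoint",
             "docs-2016-02-14.checkpoint.1"]
    for argv in delete_commands("/path/to/repo", names):
        print(" ".join(argv))
```

Printing the commands first, rather than executing them, leaves room to review the list before anything is removed.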

Hi @ThomasWaldmann,

Yes, I see that here:
http://borgbackup.readthedocs.org/en/stable/faq.html?highlight=checkpoint#if-a-backup-stops-mid-way-does-the-already-backed-up-data-stay-there

Is there a configurable option to make it less than 5 minutes or is 5 minutes hard coded?

See borg create --help (the option is --checkpoint-interval).
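A minimal sketch of how the interval might be set from a wrapper script: it only builds the `borg create` command line with `--checkpoint-interval` (in seconds) and prints it. The helper name, repository path, archive name and source path are all hypothetical.

```python
import shlex


def borg_create_argv(repo, archive, paths, checkpoint_interval=300):
    """Build (but do not run) a 'borg create' argv with a checkpoint interval in seconds."""
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint interval must be a positive number of seconds")
    return ["borg", "create",
            "--checkpoint-interval", str(checkpoint_interval),
            f"{repo}::{archive}", *paths]


# Hypothetical repository, archive name and source path.
print(shlex.join(borg_create_argv("/path/to/repo", "docs-2016-02-14",
                                  ["/home/user/docs"], checkpoint_interval=120)))
```

If the backup is interrupted, running the same command again should only need to redo the work done since the last checkpoint.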

Thanks.


Discussion below this point and subsequent responses are unrelated. See #664 for additional details.


Separately, it might be good to increase the logging levels. I tried to create an archive just now. After processing two 700 MiB files, borg simply stopped. CPU usage was essentially zero, memory usage was at 50%, and the system stayed about 98% idle. Even though I ran with --debug -p -s --show-rc, nothing useful was printed to stdout. I'm using lz4 compression.

I'll try to figure out what is going on, but this is going to devolve into me hacking the code, which should not be necessary (in the ideal situation). After Ctrl-C'ing, I waited > 10 minutes and the program did not quit. Ctrl-Z generated no response. After 30 minutes, I just kill -9'd it.

Ok, worse than I thought...I am simply unable to kill it. I even tried to kill -9 the parent process. Nothing seems to kill it. State is D+ (uninterruptible sleep). So I'm restarting now. We gotta fix this...
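A process in state D is blocked inside the kernel, usually waiting on I/O, and signals (including SIGKILL) are generally not acted on until that wait ends, which would explain why kill -9 had no effect. The sketch below is one way to confirm the state from another terminal on Linux by reading `/proc/<pid>/status`; the function names are hypothetical.

```python
from pathlib import Path


def parse_state(status_text):
    """Extract the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    raise ValueError("no State: line found")


def is_uninterruptible(pid):
    """Return True if the process is in state D (Linux only; reads /proc)."""
    return parse_state(Path(f"/proc/{pid}/status").read_text()) == "D"
```

For example, `is_uninterruptible(12345)` would report on a process with the (made-up) PID 12345.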

While the process is running, can you press Ctrl+T to check its usage? I know you reported some stats above.

What are the specs of your machine?

I've got a Retina MBP. Linux is a Parallels VM with 2 CPUs and 8 GiB RAM. The hard drive that contains the files I am copying is an external USB drive that is known to Mac OS and made available to Linux through Parallels.

(ctrl + t did not show any information, btw)

Whenever borg processes a decently sized file (unsure how large the file has to be), something happens that blocks borg from proceeding. Linux is 50% idle, 50% wa and borg has state D+. To give a concrete example, I had a 300 MiB file. Borg went through that file (121 chunks), uploading those chunks to the server in about 8 minutes. After that, in archive.py, the following lines are executed, which include two print statements that I added:

         print("{}: adding chunks to item".format(datetime.now()))    # added
         item[b'chunks'] = chunks
         item.update(self.stat_attrs(st, path))
         self.stats.nfiles += 1
         self.add_item(item)
         print("{}: done adding item".format(datetime.now()))         # added

I haven't tracked it down yet, but these lines do not finish in a reasonable amount of time. After the 8 minutes to process the chunks, the first print statement above appears. Even 25 minutes later, the second print statement does not appear. Memory usage is <50%. Top still shows 50% idle and 50% wa.
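For this kind of ad hoc timing, a small context manager can stand in for the paired print statements. This is only a sketch and not part of borg; the name `timed` and the `log` parameter are hypothetical.

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed(label, log=print):
    """Log a timestamp on entry and the elapsed time on exit, even if the block raises."""
    log("{}: start {}".format(datetime.now(), label))
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        log("{}: done {} after {:.3f}s".format(datetime.now(), label, elapsed))


with timed("adding item"):
    time.sleep(0.01)  # stand-in for the slow section being measured
```

If the process is stuck in uninterruptible sleep, the "done" line will not appear until the blocking call returns, which is itself a useful signal.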

In one instance, I went to Mac OS and did an ls on the external USB drive. This started borg right back up again. So maybe the drive went to "sleep". But in another instance, doing the same thing did nothing to unblock, and the machine was generally "unresponsive" (doing ls in another terminal stalls until I Ctrl-C it). How long should I expect those lines to take, in general? Does this give a hint as to what might be going on?

Have you tried borg break-lock on the stalled repository?

I let it run overnight. Eventually it became unblocked and sent a few more up this morning. It has stalled again now (for the last 2 hours). I don't have time to dig into this today, but I'll keep looking later.

I guess this should be approached a bit differently to make debugging and managing the issue easier:

First, the topic of THIS issue is "resume backups" which usually works and was therefore closed as non-issue. So, as you have some real issues described here later, the question is whether they are related to resuming or also happen without resuming.

Second, you need to simplify the setup to the minimum required to reproduce your issue:

  • you have 2 different OSes and a virtualization software potentially causing issues that do not happen without that mix
  • you likely deal with multiple different filesystems and the VM software showing some host files in the guest somehow
  • you do remote backups that might suffer from network issues (inside borg or unrelated to borg) that do not happen with a local backup
  • you have at least 3 disks involved (internal client disk, usb disk at client, server disk) that might have problems (or go to sleep or whatever), try to reproduce on 1 local fs / internal disk
  • you use compression (try if it happens without)
  • do NOT ctrl-c, but just let the backup work normally

So, please:

Either edit this issue and remove everything that is not relevant any more, and edit your first post so it describes the minimum necessary to reproduce the issue and is clean and focussed.

Or, open a new / separate issue for everything you find - do not mix multiple ideas / issues / TODOs into a single issue.

I'll open another issue. Apologies for piggybacking on this issue. However, I'm not going to edit this issue to remove my previous comments. The conversation happened and it involved multiple people. I think there is very limited benefit to cleaning up. Readers are surely capable enough to understand that the conversation has deviated from the original topic. I'll be sure to link back to this issue. Hope that is ok.

And yes, I realize that I need to find a minimum set of conditions to reproduce the issue. I've already tried most of what you suggested. Making progress...

The good thing about a low --checkpoint-interval (currently defaulting to 1800 seconds, I understand) would be that you never lose much time, but I suppose that there is a downside as well. What is it? I've seen excessive writes mentioned, but I don't know how they happen.

@JonasOlson It used to be 300s in borg < 1.1. The shorter you make this interval, the more overhead it creates (and the more inefficient it gets).

At each checkpoint, borg has to finish a transaction (commit, store caches/indexes to disk) and start a new one. The larger your indexes and caches are, the slower that process is. It is especially slow if you access the repo via sshfs over a slow connection, because it has to copy the repo index over that connection.

So, choose an interval that is appropriate for the stability of your hardware and network:
not too short, but also not too long.
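To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes every checkpoint costs a fixed commit time, which is a simplification, and the 20-second figure is purely hypothetical; measure your own repository before drawing conclusions.

```python
def checkpoint_overhead(interval_s, commit_s):
    """Fraction of wall-clock time spent committing, if each checkpoint costs commit_s seconds."""
    if interval_s <= 0 or commit_s < 0:
        raise ValueError("interval must be positive and commit time non-negative")
    return commit_s / (interval_s + commit_s)


# Hypothetical 20 s commit time; at most one interval of work is lost on interruption.
for interval in (60, 300, 1800):
    share = checkpoint_overhead(interval, 20)
    print("interval {:>4}s: about {:.1%} of time in commits, up to {}s of work lost".format(
        interval, share, interval))
```

Under these assumptions, a shorter interval reduces the work at risk but raises the share of time spent committing.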
