Borg: Multithreading

Created on 26 May 2015  Â·  52Comments  Â·  Source: borgbackup/borg

I started some experimental multithreading code there:

https://github.com/thomaswaldmann/borg/tree/multithreading

especially:

https://github.com/ThomasWaldmann/borg/commit/240a27a22774eb055c7bff5097b4d6557bbb81c5

From Python c-api docs:
"the standard zlib and hashlib modules release the GIL when compressing or hashing data."
So current compression and hashing should be ok for good multithreading, as well as python's I/O (it releases GIL before I/O ops).

Additionally, later changesets implement code to release the GIL there:

  • chunker
  • crypto
  • lz4 compression

:moneybag: there is a bounty for this

Bountysource enhancement help wanted

Most helpful comment

No.

All 52 comments

About the question in the commit comment about longer "user time" when running multithreaded compare to single-threaded:

The answer might be that this was a dual core cpu with hyperthreading. While this looks like 4 cores to the OS (and Python), when really using it like 4 cores, each of these cores might appear as a bit slower CPU than when just using a single core, which might explain the longer "user time" needs.

It still was quite faster on the wall clock, so we don't worry.

Some things are TODO there:

  • remote repo tests are broken (and thus currently disabled)
  • that delayer thread is strange, better solution?
  • likely AES counter uniqueness is broken
  • needs way more analysis / optimization
  • fix the tests, they are quite broken / hang right now

Also, the bounty should be higher, this is quite a lot of work.

Just an idea

Traditional multi-threading (shared data, locking, queuing etc.) is often neither simple to develop nor test. I have been using ZeroMQ to effectively multi-thread things in a few situations now, and feel that it makes things much easier with essentially no relevant overhead (for inproc://), especially when using zerocopy.

I understand that in borg files are essentially processed by a _pipeline_ of algorithms:

reading => chunking => hashing / encrypting => storage

This might lend itself well to the pipeline pattern in zmq, e.g. every one of these stages is handled by one or more nodes (=threads). Work distribution, queuing etc. are handled by zmq transparently.

This zmq approach to multithreading is not unlike microservices. It would probably mean extensive refactoring for Borg. OTOH it would lead to Borg scaling very well with available CPU resources.

Interesting idea. Can you say what the pros of this approach are compared to Python's Queue module (which is in stdlib and threadsafe)? The Queue module is what the multithreading code uses currently.

Ah I see how you approached it, it is conceptually already very close. I also see the "delayer problem" now.

On a technical level the zmq approach is language agnostic, so it's [it can be] straightforward to implement entire stages in C with no Python/GIL involvement _at all_ (and the ability to still do this zerocopy: separate, pure C-thread runs a node in the same process, so inproc:// can be used between Python and C parts, without ever copying the data).

Regarding compression:
We are using https://github.com/madler/pigz for parallel compressing of petabyte of archives. Maybe it could help here. We never had an issue - and the performance on multicore systems is awsome!

@schuft69 please see the ticket about compression about why this is not that simple in case of borg.

Each chunk (0 byte ... 8 MB) is compressed independently, so parallel compressors tend to not work well there; instead we'll run separate stages (hashing, encryption, compression) in parallel and maybe multiple operations in parallel as well (e.g. compressing multiple chunks independently using independent compressors in parallel).

We're working on drafts regarding this topic; we will put more work forward here once 1.1 reaches freeze / RC status.

The multithreaded branch hasn't been updated since March 2016, about 1.25 years, and is missing about half of the commits in the current HEAD. How much of that approach remains viable?

Lack of some MT support makes borgbackup basically unusable for me. Even basic support would make things much more bearable (for example, compressing multiple chunks simultaneously).

I'd like to make contributions to help, but I'm not sure where to start right now. What's the current status on MT draft/direction?

@sjuxax that was just an experiment and should not be used.

real, better working MT is planned for borg 1.2.

coding work on 1.2 has not started yet, we are still finishing 1.1. there is some planning, see milestones and tickets here.

The multithreading branch was never intended to be merged; it was one of now several tests. The Borg 1.2 entry in the (project management) wiki might be of interest. Some further tests and planning has been conducted, but partly not published yet (some of that stuff is not in English and has to be translated / rewritten first).

The current plan (this is from March) looks basically like this:

borg-mt1 2

Grey boxes are individual actors / threads. Arrows indicate channels: orange are queue-like, green is RPC, violet is queue-like for metadata and blue is the same for errors.

@sjuxax btw, why is it unusable without multithreading for you?

I'm mainly interested in using it to back up large, frequently-changing targets. Even if the job would normally complete without wrapping around on itself (i.e., the prior job wouldn't finish before the next interval), it's not worth the extra cost and risk to keep the backup going in the background for much longer than it would take other backup targets to complete the job.

If I was backing up around 10G of stuff, it probably wouldn't be a big deal. I'm interested in backing up things that are much larger, almost always >= 100G with many small files and some very large files (5-10G+). I have not tested lately, but back in April when I last tested, I definitely felt it was too slow for regular use.

As an example, I'm backing up a 5T filesystem right now (from disks that like to choke every few hours). While I recognize other limitations in borgbackup may make a 5T backup implausible, it would be convenient to be able to use it for more common tasks in the 100G-500G range, but without multithreading, it's just too slow for me (especially the compression stage; I'm not encrypting within borgbackup).

I've tried several things over the last little bit, including SquashFS, fsarchiver, and others. Right now I am using ZFS on loopback with compression and deduplication enabled and rsyncing the tree over. That seems to be working the best, and it makes it easy to resume the process when the disks hang.

5T works quite well; the largest backup set I know of that was "publicly admitted" to use Borg is >40T. What's clear of course is that Borg 1.0 and 1.1 have fundamental limitations on their processing speed... whether this is a nasty problem for a big backup set has more to do with it's workload and less with the sheer size of it. Fast-changing sets require the most processing, while other workloads are usually less problematic. Files that don't change are faster to process than files that do; bigger files are slower than smaller files (<=one chunk) etc.

But yeah, I'm aware of these limitations and I'll / we'll try to address them with 1.2.

Well, you need to consider the amount of change between backups. If it is too slow to backup the changes, yeah, then you have a problem.

The total amount (as long as it is within the limitations) doesn't matter much, borg skips unchanged files rather quickly (be careful whether inode numbers are stable for your fs and always mount at same mountpoint).

Assuming you have enough backup space (after considering how much you safe due to historical dedup), you can use lz4 compression (you won't need multiple threads then to get compression fast).

Encryption is also fast, IF you have AES-NI.

It's not perfect yet (thus our plans for 1.2), but quite ok - not only for small backups.

Here's my prototype from March which roughly corresponds to the current plan.

Easy refactorings like FSOProcessors and MetadataCollector seem prudent to me for a start.

hmm, guess it makes more sense if you do a PR with that and I review it. And you could finish reviewing the crypto-aead stuff, so we can merge it first, so I don't need to rebase it again / fix conflicts again.

btw, the branch has only few changesets with huge changes. i imagined a bit smaller steps / less risk approach.

Regarding large (in size and/or in number of files) backups. Here is my use case.

In the past, I only used BorgBackup for not-very-busy servers. But recently I decided to try to back up a CI server. With lots of node_modules/ (Node.js) and vendor/ (Go) directories, the server has tons of small files.

The initial backup (~42 GiB) took only 42 minutes. That's about 1 GiB per minute. Considering the large number of files and network transfer time (I'm backing up from a data center near San Jose, CA to AWS us-east-1 N. Virginia region over SSH), it's quite impressive.

Here is the most recent backup:

Time (start): Tue, 2018-03-20 01:46:24
Time (end):   Tue, 2018-03-20 02:06:34
Duration: 20 minutes 9.91 seconds
Number of files: 6134998
Utilization of max. archive size: 2%

Even with single-core, the speed is acceptable. However, the fact that only one of the many cores of the CI server reaches 100% usage during the backup makes think "what a waste of CPU cores" :-)

What is the current state of this issue?

With borg create --compression auto,zstd,3 on modern hardware (ssd) backup process speed is limited by speed of one cpu core. And it can be much faster on multiple cores.

See the milestones. Next release (1.2) will not yet have it.

It's quite a lot of work to restructure borg for this, better funding would be desirable also.

Any news on this?

No.

Do you need help on this issue?

work on this (besides planning, experiments) has not started yet and likely some crypto stuff needs to be solved first (e.g. avoiding AES counter mode issues).

OK sure.

Anyways here is my idea:
The chunker should fill multiple chunk queues, one for each thread we want to spawn.
Then we can process each queue with a separate thread.
This way we can parallelize the whole pipeline (hashing, cache lookup, compressing and maybe even encrypting).

This would require the cache to be thread save or read only and will probably result in RAM demands scaling with the number of threads.

There is quite a nice bounty on this! I'd like to cash in on it. anyone thats been active for awhile can summarize what needs to be done? I have some quarantine time on my hands i would like to monetize on :P

Added some more cash to the bounty. Hope we can get this feature soon!

I'm ready to start work on it. I've sent Borg an email and waiting for a confirmation, but have yet to receive that.

I'm TW (obviously) but my impression was that this issue ("multithreading") is mostly an epic "umbrella" issue. There are a lot of nuances involved (and probably some larger changes to borg's internal architecture).

I'm pretty sure the current bounty (USD 1500) is very very low compared to the regular rates of a professional software developer who needs to spend time on this (otherwise @ThomasWaldmann and others would have solved this a long time ago).

From what I can see #929 is a better place to start. The first step would be to define realistic intermediate goals (there are several operations which could benefit from using multiple CPU cores) and agree on ways to measure the impact.

I say if Jean wants to give it a go and has nothing better to do then give it their best. :)

Yeah, but i am an actual engineer. Would deff solve for the bounty, even if it takes me a whole week. So i mailed asking for detailed info on completion conditions etc to solve it properly. I will start once i have a reply and a go ahead on this platform :)

I'm TW (obviously) but my impression was that this issue ("multithreading") is mostly an epic "umbrella" issue. There are a lot of nuances involved (and probably some larger changes to borg's internal architecture).

I'm pretty sure the current bounty (USD 1500) is very very low compared to the regular rates of a professional software developer who needs to spend time on this (otherwise @ThomasWaldmann and others would have solved this a long time ago).

From what I can see #929 is a better place to start. The first step would be to define realistic intermediate goals (there are several operations which could benefit from using multiple CPU cores) and agree on ways to measure the impact.

Yeah that was my first impression as well. Would be easier to have clear and concise goals to claim to the bounty. While i am applying for jobs and interviews i could work on it. Seems right up my alley to be honest.

Would deff solve for the bounty, even if it takes me a whole week.

I won't claim much knowledge about borg's internals (I did contribute only tiny bits to borg) so maybe I'm wrong but I am pretty confident that a week is not enough. I guess with all the required architectural discussions, getting agreement on the preferred approach, coding, testing and benchmarking this is easily a month of work (at least). I'd happy to be proved wrong though :-)

That's why working on smaller steps towards "multithreading" would be more beneficial. IMHO the first approach is unlikely to fully implement the desired features but with multiple developers each contributing some pieces we might get a version of borgbackup which utilizes the power of multicore machines.

Anyway: Outsiders discussing this in this ticket is unlikely to get real progress. If @jean-phillipe88 wants to start working on this I propose you read the relevant tickets in this tracker and then come up with a high-level step-by-step plan of the changes (and probably some more in-depth description of the necessary changes of the first step).

Would deff solve for the bounty, even if it takes me a whole week.

I won't claim much knowledge about borg's internals (I did contribute only tiny bits to borg) so maybe I'm wrong but I am pretty confident that a week is not enough. I guess with all the required architectural discussions, getting agreement on the preferred approach, coding, testing and benchmarking this is easily a month of work (at least). I'd happy to be proved wrong though :-)

That's why working on smaller steps towards "multithreading" would be more beneficial. IMHO the first approach is unlikely to fully implement the desired features but with multiple developers each contributing some pieces we might get a version of borgbackup which utilizes the power of multicore machines.

Anyway: Outsiders discussing this in this ticket is unlikely to get real progress. If @jean-phillipe88 wants to start working on this I propose you read the relevant tickets in this tracker and then come up with a high-level step-by-step plan of the changes (and probably some more in-depth description of the necessary changes of the first step).

Yeah i can see why you think a month would be the minimum time duration using this approach. And also why the bounty is way to low to attract any serious interest. I would not really discuss any architectures or getting agreements, but just push to a fork for review to see if the results are worth it.

I'm used to being hired to upgrade someone else's code going in with zero knowledge beforehand. Never worked with large groups trying to contribute, and i see allot of problems getting even small changes would take forever using this approach.

But don't misunderstand me, i came here because of the bounty. If it is unlikely that i could earn it, i probably won't delve into it.

I mean, this issue has been open for 5 years now. Is it ever meant to be marked as SOLVED?

Implementing multithreading isn't actually that much work, my prototype which mostly worked fine from a few years ago should still be in my repo – just a few hundred lines of diff or so. But actually making it go faster, especially when processing small chunks, is way more involved (pure Python isn't going to cut it, especially these days). IIRC it was about 20-30 % faster while using about 100 % more CPU when processing large files, and 20 % or so slower when processing small files. That would have been on a quad-core Xenon.

The design discussed 2016-2017 should be solid and scalable to 8+ cores, but many threads will need to run their consume/produce-loops without the GIL, i.e. only in native code. It's basically just a lot of plumbing and testing work.

that is what i thought indeed.

the plan to work on multithreading is for helium milestone, after doing crypto work.

current master branch is still hydrogen and not released yet, but in alpha.

Does that mean the bounty is currently inactive?

There is no "inactive" state for bounties. Once they are set, they are active.

But that does not necessarily mean they can be done at any time (without investing a lot of time on also solving the prerequisites).

Also, there are better and worse defined / scoped bounties and for new contributors I would rather suggest working on some smaller scope bounty / project.

There are lots of open tickets and this one is definitely one of the harder to solve ones.

That is why I sent you an email requesting a clear and concise definition of the prerequisites for fulfilling the bounty requirements. I am a pro developer, and i came across this because of the bounty, and i am fairly confident in my ability to solve for it.

Extra work on the way to it is possible, and if the work has bounties on the way, great; but honestly, i am a bounty hunter :)

Did you get my email btw?

There is no "inactive" state for bounties. Once they are set, they are active.

But that does not necessarily mean they can be done at any time (without investing a lot of time on also solving the prerequisites).

Also, there are better and worse defined / scoped bounties and for new contributors I would rather suggest working on some smaller scope bounty / project.

There are lots of open tickets and this one is definitely one of the harder to solve ones.

If you are here just because of the bounty, I guess you better search for more attractive bounties elsewhere.

I reacted to this one because from all the ones i checked out, this one seemed the fastest to accomplish. Just curious though, this bounty has been active for 5 years, is it never meant to be solved?

If you are here just because of the bounty, I guess you better search for more attractive bounties elsewhere.

To all backers of the bounty on bountysource for this issue, you need to urgently get active or your backing may get lost to bountysource.

See #5230 for details (I posted a copy of my email there, which you can slightly modify and reuse for the email you need to write to them).

About my previous comment: no need to get active any more, this was resolved, see #5230.

As a general comment:

This is no easy issue and thus not suited for new contributors who are unfamiliar with the codebase and the required changes that need to implemented before even starting to work on this issue.

Ouch.
Now it seems like rewriting borg in haskell would be easier than implementing the multithreading. At least, it seems like a manageable task for $1500.

@l29ah guess you're way too optimistic, but of course it depends on how quick you are in haskell and what you expect per hour.

if it is 0 per hour, you'll have infinite time. :-)

The documentation of internals of borgbackup is just too good, i have to give writing a parallel borg-compatible backup tool a try :)

Do one better - learn from the issues documented in those docs and design something that avoids these issues ;)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

phdoerfler picture phdoerfler  Â·  6Comments

rugk picture rugk  Â·  5Comments

pierreozoux picture pierreozoux  Â·  4Comments

ThomasWaldmann picture ThomasWaldmann  Â·  6Comments

verygreen picture verygreen  Â·  4Comments