borg uses too much memory (with misc. operations)

Created on 30 May 2020 · 11Comments · Source: borgbackup/borg

Is this a BUG / ISSUE report or a QUESTION?

ISSUE

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

client: 1.1.11 (openSUSE TW system package)
server: 1.1.11 (built locally from debian package)

Operating system (distribution) and version.

client: openSUSE Tumbleweed (20200516)
server: osmc (2020.03-1). Debian-based distribution for mediacenters.

Hardware / network configuration, and filesystems used.

client: x86_64
server: aarch64 (a vero 4k mediacenter with external usb hard disk)

How much data is handled by borg?

Original size of all archives: 2.11 TB
Each archive original size is between 31 and 60 GB
There's a total of 43 archives.

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg recreate -e 'home/user/large-directory' -e 'home/user/large-file' ssh://borgbackup@server/home/borgbackup/repositories/hostname::hostname-2019-09-30T00:01:57

Describe the problem you're observing.

Recreating the archive consumes a lot of memory. I checked it kept growing during the execution of the recreate command peaking just at the end at a VmSize of 34155864 k (~34GB).

After some debugging, I could observe the problem lies in the self.responses dictionary in remote.py, which keeps all the messages received during the processing.

In the middle of a debugging session I got this:

(Pdb) p total_size(self.responses)
25272256788
(Pdb) p len(self.responses.keys())
89057

(total_size is the function from https://code.activestate.com/recipes/577504/ which returns the total size in memory of an object and its contents)

It seems the only way for an item to be removed from the responses dictionary is by first getting its msgid into the waiting_for list. I checked (in a different debugging session than the commands above) and waiting_for doesn't seem to change at all during the execution. It stayed with the same single content every time I checked:

(Pdb) p waiting_for               
[45112]
(Pdb) p max(self.responses.keys())
24628

Of course, the max msgid value of self.responses.keys() kept growing every time I checked, just as its length.

I don't know where waiting_for got that value from, but I'd say maybe waiting_for is waiting for the last msgid coming from the server, so every response will just be stored in self.responses and never discarded.

Note that all this happens inside the self.repository.commit() call in Archive.save . This is the backtrace that can be seen by stopping the execution at mostly any time during the recreate (after a few seconds after starting):

Traceback (most recent call last):
  File "/usr/lib64/python3.8/site-packages/borg/archiver.py", line 164, in wrapper
    return method(self, args, repository=repository, cache=cache_, **kwargs)
  File "/usr/lib64/python3.8/site-packages/borg/archiver.py", line 1727, in do_recreate
    if not recreater.recreate(name, args.comment, args.target):
  File "/usr/lib64/python3.8/site-packages/borg/archive.py", line 1718, in recreate
    self.save(archive, target, comment, replace_original=replace_original)
  File "/usr/lib64/python3.8/site-packages/borg/archive.py", line 1841, in save
    target.save(comment=comment, timestamp=self.timestamp,
  File "/usr/lib64/python3.8/site-packages/borg/archive.py", line 519, in save
    self.repository.commit()
  File "/usr/lib64/python3.8/site-packages/borg/remote.py", line 527, in do_rpc
    return self.call(f.__name__, named, **extra)
  File "/usr/lib64/python3.8/site-packages/borg/remote.py", line 768, in call
    for resp in self.call_many(cmd, [args], **kw):
  File "/usr/lib64/python3.8/site-packages/borg/remote.py", line 897, in call_many
    self.unpacker.feed(data)

Please note that the line numbers in archive.py and remote.py might be different than yours since I added some debug code to those files.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

I can reproduce it every time.

Include any warning/errors/backtraces from the system logs

The system logs don't show any problem in the client or the server. I also did a borg check which didin't find any problem.

bug

Source

antlarr

All 11 comments

Can you reproduce the issue again:

first using the command you gave in top post
then without any excludes
and again using --dry-run?

Would be interesting whether these also use a lot of memory.

ThomasWaldmann on 1 Jun 2020

Even better, let me copy what I wrote on irc after textshell and elho mentioned that this behaved like preload was being set to True and I checked it was being set to False:

<antlarr> ouch I think I found a problem
<antlarr>     def unpack_many(self, ids, filter=None, partial_extract=False, preload=False, hardlink_masters=None): 
<antlarr> that's a preload variable
<antlarr>             def preload(chunks):
<antlarr> that function is shadowing the variable
<antlarr> after defining that function, the code does:
<antlarr>             if preload:

so the problem is that that if is always True since the function is always defined instead of using the preload variable passed as a parameter of DownloadPipeline.unpack_many.

I tested that renaming the preload function fixed the problem (consuming less than 150 MB instead of ~34GB and taking less than half the time to finish) but elho was faster to create a commit, so I guess he will submit it soon: https://github.com/elho/borg/commit/be646106464f4226c8c6025bb96d386d433169a1

antlarr on 1 Jun 2020

👀1 👍1

Yes, I built a borg 1.2 from that branch plus a 1.1.11 with that patch on top, testsuite passed, I will submit the actual PR after some real world tests with them have completed successfully.

elho on 1 Jun 2020

👍1

Oh, oops. Thanks for finding this!

This affects pretty much everything using Archive.iter_items() or Archive.extract_item().

ThomasWaldmann on 1 Jun 2020

👍1

now fixed in master and 1.1-maint (will be in borg 1.1.12).

thanks to @elho and @antlarr for finding and fixing!

ThomasWaldmann on 2 Jun 2020

@LocutusOfBorg @FelixSchwarz as this affects many operations, likely including borg extract, it would be good to have 1.1.12 packaged soon.

Guess noone wants to have a "out of memory" surprise at full restore time...

ThomasWaldmann on 6 Jun 2020

crap, there seems to be a cython code generation issue, so at least the pip pkg is broken on py38, see #5214 .

ThomasWaldmann on 6 Jun 2020

Shouldn't the linter have caught this? I know PyCharm does, but I thought pylint would have as well...

enkore on 6 Jun 2020

@enkore the problem is that pycharm warns about quite a lot in borg source.

some is invalid. some is valid, but harmless. and some is valid and not harmless.

would be quite some effort to clean up the code to remove the warnings about the harmless stuff.

ThomasWaldmann on 6 Jun 2020

@antlarr Thank you for spotting the problem :-)

@ThomasWaldmann also thank you for making me aware of the problem. I did built 1.1.13 for Fedora 31/32/rawhide and EPEL 7+8.

FelixSchwarz on 7 Jun 2020

1.1.12 available in Debian and Ubuntu

LocutusOfBorg on 12 Oct 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Add bandwidth limit support

chebee7i · 5Comments

`borg list` or `borg prune` hangs

pierreozoux · 4Comments

size placeholders for borg list <repo>

rugk · 5Comments

borg info requires repo_or_archive param

ThomasWaldmann · 6Comments

Why did you choose mkdir for locking and not touch?

rugk · 3Comments