ISSUE
client: 1.1.11 (openSUSE TW system package)
server: 1.1.11 (built locally from debian package)
client: openSUSE Tumbleweed (20200516)
server: osmc (2020.03-1). Debian-based distribution for mediacenters.
client: x86_64
server: aarch64 (a vero 4k mediacenter with external usb hard disk)
Original size of all archives: 2.11 TB
Each archive original size is between 31 and 60 GB
There's a total of 43 archives.
borg recreate -e 'home/user/large-directory' -e 'home/user/large-file' ssh://borgbackup@server/home/borgbackup/repositories/hostname::hostname-2019-09-30T00:01:57
Recreating the archive consumes a lot of memory. I checked it kept growing during the execution of the recreate command peaking just at the end at a VmSize of 34155864 k (~34GB).
After some debugging, I could observe the problem lies in the self.responses dictionary in remote.py, which keeps all the messages received during the processing.
In the middle of a debugging session I got this:
(Pdb) p total_size(self.responses)
25272256788
(Pdb) p len(self.responses.keys())
89057
(total_size is the function from https://code.activestate.com/recipes/577504/ which returns the total size in memory of an object and its contents)
It seems the only way for an item to be removed from the responses dictionary is by first getting its msgid into the waiting_for list. I checked (in a different debugging session than the commands above) and waiting_for doesn't seem to change at all during the execution. It stayed with the same single content every time I checked:
(Pdb) p waiting_for
[45112]
(Pdb) p max(self.responses.keys())
24628
Of course, the max msgid value of self.responses.keys() kept growing every time I checked, just as its length.
I don't know where waiting_for got that value from, but I'd say maybe waiting_for is waiting for the last msgid coming from the server, so every response will just be stored in self.responses and never discarded.
Note that all this happens inside the self.repository.commit() call in Archive.save . This is the backtrace that can be seen by stopping the execution at mostly any time during the recreate (after a few seconds after starting):
Traceback (most recent call last):
File "/usr/lib64/python3.8/site-packages/borg/archiver.py", line 164, in wrapper
return method(self, args, repository=repository, cache=cache_, **kwargs)
File "/usr/lib64/python3.8/site-packages/borg/archiver.py", line 1727, in do_recreate
if not recreater.recreate(name, args.comment, args.target):
File "/usr/lib64/python3.8/site-packages/borg/archive.py", line 1718, in recreate
self.save(archive, target, comment, replace_original=replace_original)
File "/usr/lib64/python3.8/site-packages/borg/archive.py", line 1841, in save
target.save(comment=comment, timestamp=self.timestamp,
File "/usr/lib64/python3.8/site-packages/borg/archive.py", line 519, in save
self.repository.commit()
File "/usr/lib64/python3.8/site-packages/borg/remote.py", line 527, in do_rpc
return self.call(f.__name__, named, **extra)
File "/usr/lib64/python3.8/site-packages/borg/remote.py", line 768, in call
for resp in self.call_many(cmd, [args], **kw):
File "/usr/lib64/python3.8/site-packages/borg/remote.py", line 897, in call_many
self.unpacker.feed(data)
Please note that the line numbers in archive.py and remote.py might be different than yours since I added some debug code to those files.
I can reproduce it every time.
The system logs don't show any problem in the client or the server. I also did a borg check which didin't find any problem.
Can you reproduce the issue again:
--dry-run?Would be interesting whether these also use a lot of memory.
Even better, let me copy what I wrote on irc after textshell and elho mentioned that this behaved like preload was being set to True and I checked it was being set to False:
<antlarr> ouch I think I found a problem
<antlarr> def unpack_many(self, ids, filter=None, partial_extract=False, preload=False, hardlink_masters=None):
<antlarr> that's a preload variable
<antlarr> def preload(chunks):
<antlarr> that function is shadowing the variable
<antlarr> after defining that function, the code does:
<antlarr> if preload:
so the problem is that that if is always True since the function is always defined instead of using the preload variable passed as a parameter of DownloadPipeline.unpack_many.
I tested that renaming the preload function fixed the problem (consuming less than 150 MB instead of ~34GB and taking less than half the time to finish) but elho was faster to create a commit, so I guess he will submit it soon: https://github.com/elho/borg/commit/be646106464f4226c8c6025bb96d386d433169a1
Yes, I built a borg 1.2 from that branch plus a 1.1.11 with that patch on top, testsuite passed, I will submit the actual PR after some real world tests with them have completed successfully.
Oh, oops. Thanks for finding this!
This affects pretty much everything using Archive.iter_items() or Archive.extract_item().
now fixed in master and 1.1-maint (will be in borg 1.1.12).
thanks to @elho and @antlarr for finding and fixing!
@LocutusOfBorg @FelixSchwarz as this affects many operations, likely including borg extract, it would be good to have 1.1.12 packaged soon.
Guess noone wants to have a "out of memory" surprise at full restore time...
crap, there seems to be a cython code generation issue, so at least the pip pkg is broken on py38, see #5214 .
Shouldn't the linter have caught this? I know PyCharm does, but I thought pylint would have as well...
@enkore the problem is that pycharm warns about quite a lot in borg source.
some is invalid. some is valid, but harmless. and some is valid and not harmless.
would be quite some effort to clean up the code to remove the warnings about the harmless stuff.
@antlarr Thank you for spotting the problem :-)
@ThomasWaldmann also thank you for making me aware of the problem. I did built 1.1.13 for Fedora 31/32/rawhide and EPEL 7+8.
1.1.12 available in Debian and Ubuntu