PyInstaller: Error for binaries larger than 2 GB

Created on 20 Dec 2018 · 38 comments · Source: pyinstaller/pyinstaller

For final binaries larger than 2 GB, the following exception message is printed out:

struct.error: 'i' format requires -2147483648 <= number <= 2147483647

https://github.com/pyinstaller/pyinstaller/blob/develop/PyInstaller/archive/writers.py#L264

class CTOC(object):
    ENTRYSTRUCT = '!iiiiBB'  # (structlen, dpos, dlen, ulen, flag, typcd) followed by name
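
For reference, a minimal sketch (independent of PyInstaller) reproducing the limit: each '!i' field in ENTRYSTRUCT is a signed 32-bit integer, so any offset or length beyond 2**31 - 1 bytes cannot be packed.

import struct

ENTRYSTRUCT = '!iiiiBB'  # same format string as in writers.py

# Just under 2 GiB packs fine...
struct.pack(ENTRYSTRUCT, 24, 2**31 - 1, 100, 100, 1, ord('b'))

# ...one byte further reproduces the reported error.
try:
    struct.pack(ENTRYSTRUCT, 24, 2**31, 100, 100, 1, ord('b'))
except struct.error as exc:
    print(exc)  # 'i' format requires -2147483648 <= number <= 2147483647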

Labels: bootloader, help wanted, pull-request wanted

All 38 comments

A possible solution would be to change this from signed to unsigned values (this needs to be done in the bootloader, too).

Can you please provide a test case we can include in the test suite? Thanks.

A possible solution would be to change this from signed to unsigned values (this needs to be done in the bootloader, too).

Can you please provide a test case we can include in the test suite? Thanks.

The switch from signed to unsigned moves the limit from 2 GB to 4 GB, right? Is there a way to go over the 4 GB limit? Basically, could we go with an 8-byte specifier?

As for a test case, do you need a specific format or just a set of steps?

The switch from signed to unsigned moves the limit from 2 GB to 4 GB, right?

Right.

Is there a way to go over the 4 GB limit? Basically, could we go with an 8-byte specifier?

Basically we could, but this requires more changes. Also, we should keep in mind 32-bit platforms, which might have trouble with 8-byte specifiers.
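
To make the trade-off concrete, here is an illustrative comparison (not PyInstaller code) of the header size and per-field limit for the current signed format, the unsigned variant, and an 8-byte variant:

import struct

for fmt, limit in (('!iiiiBB', 2**31 - 1),   # current: signed 32-bit, caps at 2 GiB - 1
                   ('!IIIIBB', 2**32 - 1),   # unsigned 32-bit, caps at 4 GiB - 1
                   ('!QQQQBB', 2**64 - 1)):  # unsigned 64-bit, effectively unlimited
    print(fmt, struct.calcsize(fmt), 'bytes per entry header, max field value', limit)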

As for a test case, do you need a specific format or just a set of steps?

Well, this should fit into our test suite, which uses pytest. The most problematic point is generating data which surely ends up being > 4 GB when creating the CTOC.

I suggest several test cases, which (I assume) will also ease developing them:

  • a unit test for CArchiveWriter and CArchiveReader (testing CTOC and CTOCReader is not of much use IMHO). This should go into a new file in tests/unit/.
  • an end-to-end test to ensure the created executable actually works.
    This test could create a huge file containing some known data at some relevant positions (e.g. start, just before 2 GB, just after 2 GB, just before 4 GB, just after 4 GB), pass it using pyi_args=["--add-data", "xxx:yyy"], and the code should verify that the expected data can be read (a rough sketch follows after this list).
    This should go into tests/functional/test_basic.py.
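
A rough sketch of such an end-to-end test, assuming the pyi_builder fixture and its test_source() helper already used in PyInstaller's functional tests (the file size, the size-only check, and the --add-data destination are illustrative, not the final test):

import os

def test_onefile_large_data_file(pyi_builder, tmp_path):
    # Generate a data file large enough that the resulting CArchive exceeds
    # the limit; random data keeps zlib from compressing it away.
    data_file = tmp_path / "huge.dat"
    with open(data_file, "wb") as fh:
        for _ in range(3 * 1024):          # ~3 GiB in 1 MiB chunks (slow!)
            fh.write(os.urandom(1024 * 1024))

    # The onefile variant of the pyi_builder fixture is the interesting one here.
    pyi_builder.test_source(
        """
        import os, sys
        path = os.path.join(sys._MEIPASS, 'huge.dat')
        # Verify the unpacked file has the expected size.
        assert os.path.getsize(path) == 3 * 1024 * 1024 * 1024
        """,
        pyi_args=["--add-data", str(data_file) + os.pathsep + "."],
    )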

The switch from signed to unsigned moves the limit from 2 GB to 4 GB, right?

Right.

Sorry if this is a dumb question, but how do I do it? Change '!iiiiBB' to '!IIIIBB'?
And how do I do this for the bootloader, given that the bootloaders ship as prebuilt binaries?
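
For reference, a sketch of what the Python-side change amounts to (field names follow the comment in writers.py; the values are made up). As I understand it, the matching TOC structure parsed by the bootloader (under bootloader/src/) has to be changed in lockstep and the bootloader rebuilt, because the prebuilt bootloaders shipped with releases still expect the signed 32-bit layout.

import struct

# Unsigned 32-bit fields raise each field's cap from 2**31 - 1 to 2**32 - 1.
ENTRYSTRUCT = '!IIIIBB'  # (structlen, dpos, dlen, ulen, flag, typcd)

# An offset beyond 2 GiB now round-trips without struct.error.
packed = struct.pack(ENTRYSTRUCT, 24, 3 * 2**30, 2**31 + 5, 2**31 + 5, 1, ord('b'))
print(struct.unpack(ENTRYSTRUCT, packed))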

Pinging

Any news?
I tried to change the data structure but had no luck.

@Red-Eyed I've added this to my todo-list, and I'll get this in the 4.1 release, maybe even the 4.0 release depending on when that actually gets released.

@Legorooj is this fixed now?

@Legorooj sorry for the annoying comments, but this issue is important to me: currently I have to ship my installer with additional tarballs rather than having just a single large executable.

So, if you don't mind, I just want to ask a few questions in order to better understand the status of this issue:

  1. Is any kind of work in progress?
  2. Are there difficulties that prevent putting > 2 GB into the CArchive?
  3. When do you think this will be resolved (if ever)?

Thanks!

Ah right. Let me take a look at this right now - my apologies, I completely forgot about this.

  1. No, but there hopefully will be within a week.
  2. Moving the limit from 2 GB to 4 GB should be easy enough, I think; beyond that, I don't know.
  3. Very soon.

I question the usefulness of this. It'd take a good 30 minutes to pack a 4 GB application, then a good minute or so to unpack it again, followed by however long it takes your antivirus to give it a good sniff - which would have to happen every time the user opens the application. I already find ~200 MB applications annoyingly slow to start up. I think you'd be better off turning your onedir application into an installable bundle using NSIS or something equivalent. That way your user only has to unpack it once.

@bwoodsend you described only ONE use case, and yes, for that one it's useless. Also, you do not take into account operating systems other than Windows.

My use case is actually a cross-platform installer. And I do not want or need any third-party installer, as my self-written install code in Python does the job.

Linux distributions do not have problems with the filesystem and antivirus software, so it's fast to unpack zstd archives in multithreaded mode.

Also, you said that you've done it - could you please share that branch with me?

@bwoodsend

It'd take a good 30 minutes to pack a 4 GB application

Idk, it takes me about 2 min to pack ~2 GB. But note, I just pack "data".

I don't have a branch for it. I was just messing about with zlib which is what PyInstaller uses to pack and unpack.

If your large size is coming from data files, is it possible to just put those into a zip, have your code read directly from said zip, then include the zip in your onefile app?
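
A minimal sketch of that approach, assuming a hypothetical assets.zip added via --add-data and a hypothetical member name; the zip is located via sys._MEIPASS when frozen and read without unpacking everything to disk:

import os
import sys
import zipfile

def bundle_dir():
    # sys._MEIPASS exists only inside a frozen (PyInstaller) process.
    return getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(__file__)))

with zipfile.ZipFile(os.path.join(bundle_dir(), "assets.zip")) as zf:
    with zf.open("models/weights.bin") as member:  # hypothetical member name
        header = member.read(16)                   # read only what is needed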

I just include a tar.xz in my PyInstaller onefile build and then unpack it.

My tar.xz is about 1.6 GB (the source size is 6 GB).
I want to use zstd instead, because unpacking is done in multithreaded mode (which is much faster), but the compression ratio is slightly lower compared to xz, and its size is about 2.5 GB, which doesn't fit into the current PyInstaller implementation.
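
For what it's worth, a sketch of that unpacking step using the third-party zstandard package (the archive and destination names are made up); the decompression is streamed, so the whole archive never has to sit in memory:

import tarfile
import zstandard  # pip install zstandard

def unpack_tar_zst(archive_path, dest_dir):
    dctx = zstandard.ZstdDecompressor()
    with open(archive_path, "rb") as fh, dctx.stream_reader(fh) as reader:
        # 'r|' reads the tar as a non-seekable stream, member by member.
        with tarfile.open(fileobj=reader, mode="r|") as tar:
            tar.extractall(dest_dir)

unpack_tar_zst("payload.tar.zst", "install_dir")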

I don't have a branch for it. I was just messing about with zlib which is what PyInstaller uses to pack and unpack.

I played around with PyInstaller and CArchive, but I couldn't make it work.

So, if you're not going to create a PR for this issue, I would like to see any kind of investigative work, even if it doesn't meet PR requirements, just to see what you have done.

Or did you not change the CArchive logic?

Thanks

@bwoodsend if you pack the CUDA and cuDNN libraries for deep learning, the package will grow beyond 2 GB. Please move the limit from 2 GB to 4 GB or larger. We really need this feature. Thanks bro.

@bwoodsend if you pack the CUDA and cuDNN libraries for deep learning, the package will grow beyond 2 GB. Please move the limit from 2 GB to 4 GB or larger. We really need this feature. Thanks bro.

Yes, same use case. I ended up putting all model data into a password-protected zip. It would be great if we could go beyond the 2 GB limitation.

Uf, this is going to be all sorts of fun...

Here's an experimental branch that raises the limit from 2 GB to 4 GB by switching from signed integers to unsigned ones: https://github.com/rokm/pyinstaller/tree/large-file-support

I think before even considering the move to 64-bit integers for raising the limit further, we'll need to rework the archive extraction in the bootloader. Because currently, it extracts the whole TOC entry into an allocated buffer and, if compression is enabled, decompresses it in one go. This is done both for internal use and for extraction onto the filesystem during unpacking... The decompression should definitely be done in a streaming manner (i.e., using smaller chunks, so that we can avoid having the whole compressed data in memory at once). And when extraction is performed as part of pyi_arch_extract2fs(), the input file reading should be done in a streaming manner as well, even if there's no compression (so we can avoid reading the whole file into memory).
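
To illustrate the idea (the bootloader itself is C, so this Python sketch is only an analogue of the streaming approach): decompress and write in fixed-size chunks instead of materialising the whole entry in memory.

import zlib

CHUNK = 64 * 1024

def extract_entry_streaming(archive, out, offset, compressed_len):
    """Copy one zlib-compressed entry from `archive` to `out`, chunk by chunk."""
    archive.seek(offset)
    decomp = zlib.decompressobj()
    remaining = compressed_len
    while remaining > 0:
        chunk = archive.read(min(CHUNK, remaining))
        if not chunk:
            raise EOFError("archive truncated")
        remaining -= len(chunk)
        out.write(decomp.decompress(chunk))
    out.write(decomp.flush())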

@rokm You mean it'll currently have all 2-4 GB in RAM during decompression? That'd be horrific on a 4 GB machine.

I'm not sure if the big data entries are actually compressed...

But even for uncompressed files, the current implementation of pyi_arch_extract2fs() uses pyi_arch_extract() to obtain the entry's data blob, and then writes it to the file in the _MEIPASS dir... So unless I'm mistaken, if we add a 2 GB data file to the program, it will end up whole in RAM during the unpacking...

https://github.com/pyinstaller/pyinstaller/blob/e67a589c4240fe0d422efeb09d5a91f26f8e624a/bootloader/src/pyi_archive.c#L145-L223

(And it's even worse if it is compressed, because then we keep both whole compressed and uncompressed data blobs in memory during decompression).

@rokm
Thanks for the effort!

I just tried your branch large-file-support (I didn't build anything, should I?)
Unfortunately, it doesn't work on Ubuntu 20.04
So it packs into one file, but it throws an error during unpacking.

I ran into the same error, I guess in the bootloader, when I tried to work on this issue:

-> ~/my_cool_installer
[568688] Cannot open self /home/redeyed/my_cool_installer or archive /home/redeyed/my_cool_installer.pkg

(I didn't build anything, should I?)

Yes, you need to rebuild the bootloader yourself.

Okay, will do that tomorrow.
Until then, I would like to ask you:
could you please confirm that your branch actually unpacks a package (built in onefile mode) larger than 2 GB?

That would be awesome. Thanks!

Okay, will do that tomorrow.
Until then, I would like to ask you:
could you please confirm that your branch actually unpacks a package (built in onefile mode) larger than 2 GB?

That would be awesome. Thanks!

One of the commits adds a test that creates a 3 GB data file with random contents, computes its md5 hash, and then adds this file to a onefile build of a program, which in turn reads the unpacked file from its _MEIPASS dir, computes the md5 hash, and compares it to the one that was computed previously.

This test is now passing on my Fedora 33 box, and in a Windows 10 VM. (But you need to rebuild the bootloader, because that's where the unpacking actually takes place).

Okay, will do that tomorrow.
Until then, I would like to ask you:
could you please confirm that your branch actually unpacks a package (built in onefile mode) larger than 2 GB?

That would be awesome. Thanks!

Yes! In my case, libtorch_cuda.so alone is 900+ MB. The package will be larger than 2 GB if you add the CUDA, cuDNN and TensorRT libraries.

Okay, will do that tomorrow.
Until then, I would like to ask you:
could you please confirm that your branch actually unpacks a package (built in onefile mode) larger than 2 GB?
That would be awesome. Thanks!

Yes! In my case, libtorch_cuda.so alone is 900+ MB. The package will be larger than 2 GB if you add the CUDA, cuDNN and TensorRT libraries.

Currently I have a 6 GB Python environment (PyTorch, TensorFlow, SciPy, etc.) and pack it into a 1.6 GB tar.xz archive to get under the 2 GB limit.

I want to use zstandard compression (to speed up decompression), but zstandard compresses to 2.6 GB, which is above the 2 GB limit.

Note: I just built the bootloader and it works, thank you @rokm!

Here's a further branch, in which 64-bit integers are used: https://github.com/rokm/pyinstaller/tree/large-file-support-v2

So now, in theory, the sky's the limit - you can chuck in all your deep learning frameworks, CUDA libraries, pretrained models, ...

In practice, however, the 5 GB onefile test passes only on Linux (tested only 64-bit for now). Windows (even 64-bit) does not seem to support executables larger than 4 GB. On (64-bit) macOS, the macholib used in the signing preparation step of the assembly pipeline seems to assume that the file size can be represented with a 32-bit unsigned integer, so 4 GB max as well (and this is consistent with the macOS signature-searching code in the bootloader, which uses 32-bit unsigned integers).

So really huge onefile executables (> 4 GB) work only on Linux. But on all three OSes, this limitation can be worked around by using a .spec file and adding append_pkg=False as an extra argument to EXE(). This will give you a small executable (e.g., program) and a single large archive file to go with it (e.g., program.pkg).
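
A sketch of the relevant part of such a .spec file (only the EXE() call is shown; the Analysis/PYZ sections are the usual ones produced by pyi-makespec):

exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.zipfiles,
    a.datas,
    name='program',
    append_pkg=False,  # keep the CArchive as a separate 'program.pkg' next to the executable
    console=True,
)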

@Legorooj any progress?

I stopped work on this because, by the time I woke up the next morning, @rokm had already written everything that needed to be written 😅. Maybe he could submit a PR?

I stopped work on this because, by the time I woke up the next morning, @rokm had already written everything that needed to be written 😅. Maybe he could submit a PR?

Nice job!

@sshuair the current plan is to have the changes from those experimental branches submitted and merged gradually - first the endian-handling cleanup, then the file extraction cleanup, then switch to unsigned 32-bit ints, and finally to 64-bit ones.

In the meantime, if you require this functionality, you can use either of the experimental branches linked above.

@sshuair
If you need > 2 GB, just use the branch that @rokm published. I'm using it and it works on Ubuntu 20.04 and Windows (but before using it, you need to rebuild the bootloader; it's easy, just read the documentation).
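
For anyone else landing here, the rebuild boils down to roughly the following, run from a git checkout of the branch (see the bootloader build section of the PyInstaller documentation for the compiler prerequisites on your platform):

cd pyinstaller/bootloader
python ./waf distclean all
cd ..
pip install .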

@Red-Eyed the branch that @rokm published doesn't seem to work for me. When the file PKG-00.pkg reached 2.7 GB, the error appeared again. My OS is Ubuntu 16.04.

@Red-Eyed Sorry, my fault - I installed the package from the master branch instead of large-file-support-v2. Now it works for me.
