Etcher: .gz doesn't return correct file size for content above 2^32 B = 4 GB

Created on 19 Aug 2016 · 42 comments · Source: balena-io/etcher

  • 1.0.0-beta13
  • Linux 64bit

Splitting this out from #629: apparently the gzip file format cannot accurately report the size of files above 4GB (2^32 bytes), and instead returns the size modulo 2^32.

On the command line, people recommend something like zcat file.gz | wc -c or gzip -dc file | wc -c, which give the correct value, though that means the file ends up being decompressed twice (once to count, once to flash). Might have to do that for gzip in the end, though, since >4GB files are likely common for Etcher's use case.
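
(For reference, that wrapped value is just the ISIZE field from the gzip trailer: per RFC 1952, the last four bytes of a .gz file store the uncompressed size modulo 2^32, little-endian. A minimal Node sketch of reading it, with an example filename:)

var fs = require( 'fs' )

// Per RFC 1952 the gzip trailer ends with ISIZE: the uncompressed size
// modulo 2^32, stored little-endian in the file's last four bytes.
// For images larger than 4GiB this value wraps, which is the bug here.
function gzipReportedSize( filename ) {
  var fd = fs.openSync( filename, 'r' )
  var stats = fs.fstatSync( fd )
  var trailer = Buffer.alloc( 4 )
  fs.readSync( fd, trailer, 0, 4, stats.size - 4 )
  fs.closeSync( fd )
  return trailer.readUInt32LE( 0 )
}

console.log( gzipReportedSize( 'file.gz' ) )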

In the worst case, this could let an image start being burned onto a card that is too small; it also affects the progress bar.

From testing with a 4100MiB > 4096MiB image: indeed, the .gz version lets you select a 512MB SD card, while the same file's .xz archive does not.
For the progress bar, the MB/s reading seems to be affected (it shows a very low speed, e.g. 0.01MB/s), but the progress percentage does not (it shows correctly during the burning process), so it's not too bad.

sdk all bug

All 42 comments

So as far as I understand, the user will eventually hit ENOSPC, without any further weird behaviour, right?

I would love to be able to reliably get the drive size (or at least closely estimate it), however decompressing the whole thing twice sounds like a very bad solution.

I'll research this and see if there is a way we can do it.

@jviotti I think your description is still mixing up the two uncovered issues, which were broken out into two parts: this issue and #629. In that other issue, the reproducible ENOSPC (as described there) happens regardless of compression.

In this issue:

  • ENOSPC would happen if there's a gz image with SIZE > 2^32 bytes, and the user tries to burn it onto a card with CAPACITY < SIZE but CAPACITY > SIZE mod 2^32. That would cause a problem, because the initial capacity check in Etcher can't figure out the correct size.
  • If using a card with CAPACITY > SIZE, then the only effect is that the "speed" reading is wrong, but everything else works properly (including the progress bar), and the user won't run into ENOSPC.

Decompressing things twice might be a bad solution, but judging by the comments, gz simply wasn't designed for files this big (nor to return a correct size estimate), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially if there's some UI indication such as "checking archive contents", so people know it hasn't just hung. I think doing things correctly for gz is more important than taking a bit longer. Decompression with no data storage, just byte counting, seems to be pretty fast.

  • If using a card with CAPACITY > SIZE, then the only effect is that the "speed" reading is wrong, but everything else works properly (including the progress bar), and the user won't run into ENOSPC.

I see. I wonder why the speed is wrong. I can't think of a way this issue could affect the speed (unless GZ files >4GB are very slow to decompress?).

Decompressing things twice might be a bad solution, but judging by the comments, gz simply wasn't designed for files this big (nor to return a correct size estimate), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially if there's some UI indication such as "checking archive contents", so people know it hasn't just hung. I think doing things correctly for gz is more important than taking a bit longer. Decompression with no data storage, just byte counting, seems to be pretty fast.

Yeah, could be, but it needs more thought. I'd love to push a bit more to see if we can find an alternative solution. We're putting an enormous amount of effort into reducing the time from getting an image to the drive, and it sounds counter-intuitive to be willing to spend time decompressing things twice.

As has already been mentioned, if the uncompressed file is over 4GB, gzip returns its size modulo 4GB (because it only has a 32-bit size field): gzip would report a 4.5GB file as 0.5GB, a 9.3GB file as 1.3GB, and a 15.9GB file as 3.9GB. Therefore the _only_ way to correctly get the size is to uncompress the whole file first.

However I wonder if you could use some kind of heuristic to 'guess' when gzip has wrapped the size of the uncompressed file, based on the size of the _compressed_ file? (I'm guessing the disk images used with Etcher probably have vaguely similar compression ratios.)
E.g. if the compressed .gz file is bigger than N GB, and gzip reports the uncompressed file as being M GB, perhaps you could 'deduce' that the size of the uncompressed file is _actually_ M+4 GB?
Obviously you'd need to do some experimentation with different disk images to work out a reliable value for N.
(And similarly, if the compressed .gz file is bigger than N*2 GB, you could deduce that the uncompressed size is actually M+8 GB.)
Note that this heuristic is still likely to fail if you have a compressed disk image with a lot of blank unpartitioned space (because blank space compresses _really_ well).

I see. I wonder why the speed is wrong.

Well, I guess if Etcher thinks the uncompressed file is only 0.5GB, but it's actually 4.5GB, perhaps the speed is calculated based on the 0.5GB figure? (4.5GB obviously takes a lot longer to write than 0.5GB would!)

P.S. Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'.

Revisiting this issue, I am with @lurch on this. He is probably right about the heuristic approach and about the speed explanation. @jviotti?

Sounds good, let's experiment with it.

Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'.

I'd suggest that you guys watch your units more closely, especially when building the UI or the write logic. 2^32 bytes is actually 4GiB (gibibytes), not 4GB (gigabytes).

I can never remember which way around they are :-/

And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB instead of just GB ?

I can never remember which way around they are :-/

The simple way to remember it is to compare it to metric measurements. If it has an SI prefix like metric units do, you know it's a power of ten (as with metric).

And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB instead of just GB?

As I see it, if a novice doesn't know what gibibytes are, they'll probably just assume that GiB is "Gigabytes". That's a close enough approximation for said novice's uses, I'd imagine.
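
To put numbers on the difference (a quick arithmetic sketch, nothing Etcher-specific):

var bytes = Math.pow( 2, 32 )

console.log( bytes )                        // 4294967296 bytes
console.log( bytes / 1e9 )                  // ~4.29 -> GB (gigabytes, powers of ten)
console.log( bytes / Math.pow( 2, 30 ) )    // 4     -> GiB (gibibytes, powers of two)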

@jviotti did anyone end up doing anything with this? It really should be a fairly simple fix for 99% of cases

(this is our oldest bug still open)

Is there _actually_ anyone using gzip to compress images over 4GB in size? I.e. does anyone have some example images that can be used for testing, or is this just an edge-case that we'll (probably) never hit in practice?

I've seen some out there, although it's definitely rare. It shouldn't be hard to fix the speed + percentage issues though. I'm happy if we treat gzip sizes with a grain of salt and eventually throw ENOSPC if there is no remaining space on the drive, like we do with bzip2.

@jhermsmeier Can you help me out with this one? Try flashing a gz image directly with the Etcher CLI. The flash progress quickly reaches 99%, and remains there for quite some time while there's more data coming through the stream chain. Maybe we're not calculating the size correctly somewhere?

Oh, the sizes seem to be fine. The issue is that for compressed images, the stream chain looks like this:

  • Input compressed file
  • Calculate progress based on compressed size
  • Apply decompression transform
  • Write to drive

It looks like the decompression is very fast compared to the drive writing, and therefore the progress reaches 99% much sooner.

I'm not sure what would be the solution here. In some cases we only have the uncompressed size, so we should make use of it to show the progress. Maybe there is a way to make the initial readable stream wait for the drive writes before sending more data?

Looks like pausing/resuming the readable stream should do it.
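
For illustration, a manual version of that backpressure (a rough sketch with placeholder stream names and paths, not Etcher's actual pipeline; .pipe() does essentially the same thing internally):

var fs = require( 'fs' )
var zlib = require( 'zlib' )

var input = fs.createReadStream( 'image.img.gz' )   // example source
var gunzip = zlib.createGunzip()
var drive = fs.createWriteStream( '/dev/sdX' )      // placeholder target

input.pipe( gunzip )

// Pause decompression whenever the drive's write buffer is full,
// and resume once it drains, so the reads can't run far ahead of the writes.
gunzip.on( 'data', function( chunk ) {
  if( !drive.write( chunk ) ) {
    gunzip.pause()
    drive.once( 'drain', function() {
      gunzip.resume()
    })
  }
})

gunzip.on( 'end', function() {
  drive.end()
})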

I investigated the slow speed issue in more detail, and it looks to be resolved in master. There is a speed penalty (~2.0 MB/s) when decompressing large files, though, but it's nowhere near @imrehg's initial report (~0.01MB/s).

The slow speed seems to happen on larger compressed images, and also depends on the compression level. I ran various experiments with images of several sizes (from ~1 GB to ~4 GB), using compression levels 1 to 9 inclusive, and the decompression time drastically increases on larger images, which can be reproduced with the gzip tool as well.

In summary:

  • Images that don't fit onto the drive will eventually cause ENOSPC, leading to a friendly message being presented to the user
  • The absurd speed-reading issue was fixed
  • We're hitting a UX issue where decompression is much faster than flashing (for certain images)

I think we can close this issue once the last one is fixed.

@jhermsmeier Check http://linorg.usp.br/OpenELEC/OpenELEC-Generic.x86_64-6.0.3.img.gz for an image that showcases the 99% UX issue.

the decompression time drastically increases on larger images

Isn't that expected?!? I'd expect a streaming gzip decompressor to have a constant(ish) decompression speed, so of course a larger image will take longer to decompress.
Or are you saying that larger images decompress at a slower speed (bitrate) than smaller images, in which case that sounds like it might be a resource-leak somewhere?
I'm sure @jhermsmeier , our streaming expert, will have some better ideas ;-)

In some cases we only have the uncompressed size, so we should make use of it to show the progress.

I suggested elsewhere that, since we'll only be supporting streaming images from our online catalog, the online catalog could store the size of the uncompressed image, which would solve this problem (and of course that uncompressed-image-size figure should be automatically updated, so that it never gets out of sync with the actual compressed image).

I mean that decompression time and image size are not directly proportional. As the image gets larger (mainly if you compress with the highest compression level), decompression starts getting ridiculously slow.

I ran all my experiments with the gzip CLI tool, I didn't test on Etcher at all.

Interesting - I'd expect the streaming block-based nature of gzip (based on top of zlib) would mean there'd be a direct proportion between decompression time and image size? Perhaps it depends on the dictionary size, or perhaps I'm completely misunderstanding this ;-)

Yeah, I don't know. Maybe it also depends on the image itself? I'm a compression noob, so I have no clue apart from what I saw on my experiments.

Beyond all this though, we should have a heuristic that basically says this:

  • gzip compresses images within a certain range (e.g. 1.5x to 3x)
  • if an image claims to be well outside that range (e.g. it says it's 300MB but the archive is 2.5GB) we should assume it's actually 4.3GB instead. Essentially, we should add 2^32 bytes to the estimated size again and again, until the compression ratio gets within a realistic range.

I think an algorithm like this, used only for gzip files (maybe bzip too?), should fix the vast majority of cases. We should still fail well when we're wrong, but we should try hard to be right :)

Looking at #1171 I've just had another idea about an alternative approach to this...

The only "proper" way to get the full size of a .gz or .bz2 compressed image, is to extract the whole thing, and count how many bytes get spat out (i.e. something like gunzip -c someimage.img.gz | wc -c on the command-line). However this can be a slow process for larger images, so we don't want to wait to extract the whole image to count how big it is, before we start flashing it to the user's SD card.
However the time taken to extract the image in the background is obviously going to be less than the time taken to actually write the image to the SD card, so I guess one approach could be to start decompressing and writing the image to the SD card as normal, but then in _another_ background thread, decompress the image in its entirety to see how big it is. Once that size-counting is finished, the 'foreground' thread that's actually writing the image to the SD card could be updated with how big the image _actually_ is, and the writing process could then update the percentage-bar to display the _true_ progress, based on the image's actual size (or it could early-abort the writing process if it discovers the image is actually bigger than the card).
It'd look a bit visually-jarring with the percentage bar suddenly taking a jump backwards (if the true size is discovered to be bigger than the estimated size), but IMHO that'd still be much better than the "write speed slows to a crawl" behaviour that we currently have.

Thoughts? (have I explained that clearly enough?)

I guess it depends on how much more reliable it will be than the heuristic method. I think it won't be much better, so I'd rather do the heuristic and see how far we get before we resort to extreme measures...

As I explain in https://github.com/resin-io/etcher/issues/1171#issuecomment-285402214 (and see also the table in a comment further down that page) we already have cases where even the best-tuned heuristic will still give incorrect results. I can't see any scenario (unless I'm missing something obvious?) where what I describe above would be "unreliable".
Is spawning a background thread to do an extra decompress of the image really an "extreme measure"? ;-) (although I agree that the cross-thread communication may be tricky - that's something I've never done in NodeJS)

And what I describe above could also work in addition to the heuristic, it's not an either-or scenario. i.e.

  1. user selects gzip-compressed image
  2. get uncompressed size from gzip
  3. apply heuristic to try to get a 'better' uncompressed size
  4. start writing image to disk
  5. start background decompression thread
  6. (...some time later...) background-decompression-thread gives us the _true_ uncompressed size (which may or may not match the heuristic size)
  7. Adjust the progress-bar if necessary, and continue writing the image

I think you're mixing up two different problems. The heuristic IS NOT used at all for the writing, but only to prevent a possibly too small drive from being selected. When the writer kicks off, it uses the compressed size only. The reason for the "write speed slows to a crawl" is not related to the uncompressed heuristic.

Sorry for getting confused. But surely it _would_ make sense for the writer to report write-progress as the percentage of data written to disk out of the total size of the (uncompressed) image? Isn't that what most people would expect? :confused:

No such thing as threads in Node userland ;)
The equivalent of gunzip file > /dev/null while counting bytes and keeping time in Node:

var fs = require( 'fs' )
var path = require( 'path' )
var zlib = require( 'zlib' )

// Stream the .gz file through a gunzip transform and count the bytes that
// come out the other end, without storing any of the decompressed data.
function gunzipSize( filename, callback ) {

  var size = 0

  fs.createReadStream( filename )
    .on( 'error', callback )
    .pipe( zlib.createUnzip() )
    .on( 'readable', function() {
      var chunk = null
      while( chunk = this.read() ) {
        size = size + chunk.length
      }
      chunk = null
    })
    .on( 'error', callback )
    .on( 'end', function() {
      callback( null, size )
    })

}

var filename = path.join( process.env['HOME'], 'Downloads', 'android-ver6.0-20170112-pine64-32GB.img.gz' )
var stats = fs.statSync( filename )
var time = process.hrtime()

gunzipSize( filename, ( error, size ) => {

  time = process.hrtime( time )

  var ms = ( time[0] * 1e3 ) + ( time[1] / 1e6 )
  var minutes = ( ms / 1000 / 60 ) | 0
  var seconds = ( ms / 1000 ) - ( minutes * 60 )

  console.log( 'Finished in %s min %s s', minutes, seconds.toFixed(1) )

  if( error ) {
    return console.log( error )
  }

  // Note: dividing by 1024^3 actually yields GiB, printed here with a 'GB' label
  console.log( 'Uncompressed size: %s GB (%s bytes)', (size / 1024 / 1024 / 1024).toFixed(1), size )
  console.log( 'Compressed size: %s GB (%s bytes)', (stats.size / 1024 / 1024 / 1024).toFixed(1), stats.size )

})

Ends up chewing through at 100% CPU (1 core, obv.), and about 115 MB RAM during a run of it on my Macbook Air from 2012:

Finished in 3 min 21.7 s
Uncompressed size: 28.8 GB (30908350464 bytes)
Compressed size: 0.8 GB (815284283 bytes)

Which means, if this were to run in parallel, we'd be bottlenecking the CPU (or I/O if the source is slow). I think heuristic + counting bytes while writing is probably the less resource-intensive path.
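
For what it's worth, counting bytes while writing can be as simple as a pass-through transform in the pipeline (an illustrative sketch, not Etcher's actual writer code):

var stream = require( 'stream' )

// A pass-through stream that counts every byte flowing to the drive writer,
// so the exact uncompressed size is known by the time the write finishes.
function byteCounter() {
  var counter = new stream.Transform()
  counter.bytesSeen = 0
  counter._transform = function( chunk, encoding, callback ) {
    this.bytesSeen += chunk.length
    callback( null, chunk )
  }
  return counter
}

// usage (names illustrative): input.pipe( gunzip ).pipe( byteCounter() ).pipe( driveStream )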

The current implementation calculates the percentage based on how much was decompressed from the file. For most decompression algorithms we're using, writing to the drive is faster than decompressing, so the progress bar displays fine; however, gzip seems to be a special case, because it's quite fast.

I avoided heuristics and relied on that approach given that for some compression methods (e.g. bzip2) it's impossible to get even an estimate, and we'd be really guessing in the dark.

I believe that we can pause decompression if we're getting slow on writes, and that should work fine.

No such thing as threads in Node userland ;)

From what I remember reading about Electron when I first started working on Etcher, I thought it was designed around having separate co-operating processes?

if this were to run in parallel, we'd be bottlenecking the CPU

I wonder if we can assign threads/processes to have a lower priority than the main thread? Given that we're going to have lots of I/O waits waiting for the SD card to write, there should be plenty of 'spare' CPU time-slots.

or I/O if the source is slow

Well, I'd still expect even reading from a 'slow' disk to be faster than writing to a SD card?

For most decompression algorithms we're using, writing to the drive is faster than decompressing

Yikes! :-( Is that because they're implemented in pure-javascript, and JS isn't designed for manipulating binary data? How much is it slowing things down by - would it be worth trying to find native-binding equivalents? (if I'm using the right terminology)
A benchmark from 12 years ago (wow!) shows even the most highly-compressed bzip2 file decompressing at over 5MB/s. (A more recent benchmark shows bzip2 decompressing at over 20MB/s)

however, gzip seems to be a special case, because it's quite fast.

Maybe that points to gzip being implemented in C rather than JS?

given that for some compression methods (e.g. bzip2) it's impossible to get even an estimate

Yeah, the background-decompression-thread approach I suggested above would also work for .bz2 files (although I guess it means the progress-indicator would have to switch part-way through from being a "rough estimate, based on how much of the input file has been decompressed", to "an exact figure, based on how much of the output file has been written to disk so far").

I think heuristic + counting bytes while writing is probably the less resource-intensive path.

I believe that we can pause decompression if we're getting slow on writes, and that should work fine.

Fair enough, looks like I've been out-voted ;-)

Andrew -- just to clarify, I said that the heuristic approach will be less reliable but much, much simpler to write and execute, not the other way around.

The threads of Electron are exactly two: one for UI/browser stuff, and one for "server"/node stuff.

Ahhhh-hhaaaa!!! I've just realised something significant :-D

As seen in https://github.com/resin-io/etcher/issues/1171#issuecomment-285445931 the 29476.5MiB of the android-ver6.0-20170112-pine64-32GB.img gets compressed to just 778MiB in android-ver6.0-20170112-pine64-32GB.img.gz. However (here's the important part) the data isn't distributed evenly throughout the .gz file - there's maybe 2 or 3 GB of _actual_ disk data, and then 20+ GB of binary zeroes (which is why the image compresses so well). However it only takes a few blocks to store those 20+ GB of binary zeroes in the .gz file. So by the time all the 'real data' has been decompressed and written to the SD card, it's totally believable that we _are_ now 95% of the way through reading the gzip file (as @jviotti explains above, the progress-indicator is based on how much of the gzip file has been read so far).
So even if the stream-backpressure stuff is working correctly, the progress indicator will display 95% progress, even though we've actually only written maybe 3GB out of the actual 25GB that we need to write to disk. And of course if the progress bar thinks we're already 95% done, but we've still got another 20+ GB of zeroes to write, that'll lead to the "apparent write speed" dropping so dramatically.

So it's not necessarily just the "gzip being quite fast" that is causing this problem, but the fact that disk images tend to have lots of empty-space at the end, which compresses super-efficiently, and takes up only a very small fraction of the compressed file, but still takes a very long time (much longer than the "real data" in the example in #1171 ) to actually write out to disk.

Does that make sense?
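
To illustrate how little of the .gz file those zero runs occupy, here's a standalone sketch (not tied to any particular image):

var zlib = require( 'zlib' )

// 64 MiB of binary zeroes, like the unpartitioned tail of a disk image
var zeroes = Buffer.alloc( 64 * 1024 * 1024 )
var compressed = zlib.gzipSync( zeroes )

console.log( 'uncompressed: %d bytes', zeroes.length )
console.log( 'compressed:   %d bytes', compressed.length )
// The compressed output is only a few tens of KB, so gigabytes of trailing
// zeroes account for a tiny slice of the .gz file yet still take a long time
// to write out to the SD card.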

...and on the subject of background processes in Electron, I just found this - @jviotti that's a similar approach to your child-writer stuff isn't it?

Yeah, we can always create child processes and communicate through IPC, which can provide us with some sort of multi-threading.
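
A minimal sketch of that pattern (file names and message shapes below are made up for illustration, not Etcher's actual child-writer code):

// parent side: fork a helper process and ask it for the true uncompressed size
var child_process = require( 'child_process' )

var counter = child_process.fork( './count-size.js' )   // hypothetical helper script
counter.send( { filename: 'image.img.gz' } )
counter.on( 'message', function( message ) {
  console.log( 'true uncompressed size:', message.size )
  counter.kill()
})

// count-size.js: decompress the whole file, counting bytes, then report back
var fs = require( 'fs' )
var zlib = require( 'zlib' )

process.on( 'message', function( message ) {
  var size = 0
  fs.createReadStream( message.filename )
    .pipe( zlib.createGunzip() )
    .on( 'data', function( chunk ) { size += chunk.length } )
    .on( 'end', function() { process.send( { size: size } ) } )
})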

How about displaying something like a wild guess, and saying so, if the compressed size is bigger than the size gzip claims the output will be? Although it is possible that a compressed file is truly bigger than its output, that would be much better than reporting a really bogus number, like a 5GB input reporting 866MB of output!

Here's an off the wall idea: Ask the user if they know what the output size is supposed to be! I would expect in most cases they do.

Interesting side note: 2 years ago lurch commented "Is there actually anyone using gzip to compress images over 4GB in size". Today, why would anyone buy an SD card smaller than 16GB!

Dear all,
if anyone is still following this bug ... in bioinformatics we have thousands of ASCII files that are (when compressed) well over 4GiB. This is a very common occurrence, and therefore gzip's inability to list the correct size of the uncompressed file is a routine nuisance.

As one of many potential examples: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR277/ERR277077/

Thank you for supporting this amazing utility!

Recently also faced this issue with gzip 1.9:

$ gzip -l filename.gz
  compressed  uncompressed    ratio  uncompressed_name
  5627740354    2198646035  -156.0%  filename

@tipuraneo It's a limitation of the .gz file-format itself, rather than a 'bug' in any particular implementation of gzip.

I know. So you could say gzip is not the best choice for files > 4 GB? I prefer xz over gzip.
