Node: ZLib support for decoding Zip files

Created on 27 Feb 2019  Â·  7Comments  Â·  Source: nodejs/node

I'd like to write a zip file decompressor in node. To do that, I'd need to be able to inflate a deflated stream without knowing it's length in advance (some zip files do not include the deflated stream's length until after the stream).

I'm not entirely sure, but I don't think node currently allows this. The popular npm unzip modules all search for a magic signature to find the end of the deflate stream. This is incorrect (per Mark Adler) as the magic signature might appear as compressed data in the deflate stream itself. I've linked to some of those examples in the SO discussion with Mark Adler on SO here.

I think node strives to expose the deflate logic as a stream transform. A stream transform expects to to consume each of the data chunks passed to it. So there's no way, via that abstraction, to have the stream self report it's own termination mid chunk.

I believe this is the reference implementation: https://github.com/madler/zlib/blob/master/examples/gzappend.c

question zlib

All 7 comments

I don't see why that wouldn't work. Node.js checks for Z_STREAM_END and when it sees that status code, it resets the stream and starts inflating the remaining bytes in the current chunk.

If you're saying that's not working for you, can you post a minimal standalone test case? Or do you mean you want Node.js to stop after the Z_STREAM_END without consuming the remaining bytes?

Does checking stream.bytesWritten after a chunk has been written help? That value provides you with information about how much data the zlib library actually consumed, and will stop increasing once the end of the stream has been reached.

It doesn’t quite work this way for gunzip, because those streams try to continue reading after EOF has been reached, in case another member of the gzip file follows. But I don’t think that’s what you’re looking for anyway?

I was looking for a way to know when the stream terminated without being otherwise told by one of the zip headers and I couldn't figure it out. :( So maybe stream.bytesWritten is the answer. I'll look more deeply at that. It's documentation seems to indicate it's a measure of bytes pushed and not back pressure from zlib, no?

The zlib.bytesWritten property specifies the number of bytes written to the engine, before the bytes are processed (compressed or decompressed, as appropriate for the derived class).

Is there an example that determines when the stream terminates using Z_STREAM_END or stream.bytesWritten? I couldn't find one in the docs. And the unzip NPM libs I scanned either searched for the magic sig (here, here), claimed a zip couldn't be scanned (here) which I'm pretty sure is not true (Mark Adler verified this back on SO), or claimed to stream but actually read the file backwards to get the directory (here). I think the directory is for random access whereas I want to truly stream; I want to inflate the file while its coming down over the wire.

Also, could you point me to where Node.js checks for Z_STREAM_END? I see it defined here but not used anywhere else in any JS code. Is the implementation done in C?

Thanks in advance for any pointers!

Ah! Here is where it node.js checks for Z_STREAM_END!

      while (strm_.avail_in > 0 &&
             mode_ == GUNZIP &&
             err_ == Z_STREAM_END &&
             strm_.next_in[0] != 0x00) {
        // Bytes remain in input buffer. Perhaps this is another compressed
        // member in the same archive, or just trailing garbage.
        // Trailing zero bytes are okay, though, since they are frequently
        // used for padding.

        ResetStream();
        err_ = inflate(&strm_, flush_);
      }

And it looks like it's only applied in GUNZIP mode. gzip is a format for single compressed files (per Mark Adler here).

Eh. But here's UNZIP

Decompress either a Gzip- or Deflate-compressed stream by auto-detecting the header.

That's looking good... except, like the docs say, that's for GZIP format not ZIP format. And the code scans for GZIP magic headers not ZIP magic headers.

#define GZIP_HEADER_ID1 0x1f
#define GZIP_HEADER_ID2 0x8b

But let's see how this stream.bytesWritten is computed...


Mr. Adler points us to this as the reference implementation of inflate and here he supplies a heavily annotated version of this file.

var zlib = require('zlib');
var assert = require('assert');

const input = Buffer.from('01234567890123456789');

zlib.deflate(input, (err, deflatedBuffer) => {
  assert(!err);

  deflatedBuffer = Buffer.concat([deflatedBuffer, Buffer.from([1,2,3,4,5,6])]);

  var numberRead = 0;
  var buffers = [];

  var stream = zlib.createInflate()
    .on('data', function(chunk) {
      buffers.push(chunk);
      numberRead += chunk.length;
    })
    .on('end', function() {
      var result = Buffer.concat(buffers, numberRead);
      this.close();
      console.log(result.toString());
      console.log(stream.bytesWritten);
      })
    .end(deflatedBuffer);
});

Returns

01234567890123456789
20

So bytesWritten is 20 and not 26 so I know those extra 6 bytes I added were not part of the deflate stream.

@kingces95 Glad you got something that works for you!

That being said, I think it might be a bug that the zlib stream doesn’t emit the 'end' event if you change .end() to .write(). It essentially forces you to see whether you’ve reached the end of the data stream while still writing data to it, I think – I’ll try to look into that.

(That’s something you’d have to work around when supporting all existing Node.js versions, though.)

(See #26363 for a patch that should help here.)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

willnwhite picture willnwhite  Â·  3Comments

akdor1154 picture akdor1154  Â·  3Comments

Brekmister picture Brekmister  Â·  3Comments

danielstaleiny picture danielstaleiny  Â·  3Comments

cong88 picture cong88  Â·  3Comments