Go: compress/flate: decompression performance

Created on 16 Dec 2017 · 12 comments · Source: golang/go

What version of Go are you using (go version)?

go version go1.9.2 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/fabecassis/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build767750319=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

I noticed that "compress/gzip" decompression is close to 2x slower than the gunzip(1) command-line tool. I've tested multiple methods of performing the decompression in this simple repository:
https://github.com/flx42/go-gunzip-bench
In summary, on a 333 MB gzipped file, for the read-decompress-write process:

  • The idiomatic Go implementation takes 7.8s.
  • Piping to gunzip(1) takes 4.4s.
  • Using cgzip, a cgo wrapper on top of zlib, takes 3.4s.

Is there any room for improvement in the standard implementation? The use case is decompressing downloaded Docker layers, in Go.
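
For context, here's a minimal sketch of the idiomatic variant being benchmarked (the file names are illustrative; the actual harness is in the repository linked above):

```go
package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"
)

func main() {
	in, err := os.Open("layer.tar.gz") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	zr, err := gzip.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	out, err := os.Create("layer.tar") // hypothetical output file
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Read, decompress, and write in a single streaming pass.
	if _, err := io.Copy(out, zr); err != nil {
		log.Fatal(err)
	}
}
```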

Labels: Performance, help wanted

All 12 comments

@klauspost FYI

/cc @dsnet

I see some complaints about it elsewhere.

There's a lot of room for improvement. I re-implemented the decompressor in github.com/dsnet/compress/flate. I have yet to merge my changes into the standard library, but it's about 1.6x to 2.3x faster. You can use that package if you want.

The biggest bang for the buck is to get around the io.ByteReader interface that flate.Reader promises to uphold. This is problematic since it limits the input bandwidth to 200 MB/s in the best case.
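
To illustrate the bottleneck, here is a rough sketch (not the actual compress/flate code) of a DEFLATE bit buffer constrained to io.ByteReader: it must refill one ReadByte call at a time, so every input byte pays for an interface method call and an error check:

```go
package flatesketch

import "io"

// bitReader is a hypothetical DEFLATE bit buffer fed via io.ByteReader.
type bitReader struct {
	src  io.ByteReader
	bits uint64 // buffered, not-yet-consumed bits
	n    uint   // number of valid bits in bits
}

// fill tops up the buffer until at least want bits are available.
func (br *bitReader) fill(want uint) error {
	for br.n < want {
		c, err := br.src.ReadByte() // one interface call per input byte
		if err != nil {
			return err
		}
		br.bits |= uint64(c) << br.n
		br.n += 8
	}
	return nil
}
```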

The approach I took was to special-case the fact that most io.Readers don't satisfy io.ByteReader and so are automatically wrapped in a bufio.Reader. Thus, I type-assert the input reader to check whether it has the Peek and Discard methods, and use those to process as much of the known input as possible. This lets you take advantage of the new math/bits package to parallelize certain operations across multiple bytes. My implementation was written before math/bits, so I think there are still more speed gains to be had beyond its current state.
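
A rough sketch of that fast-path detection (the names here are mine, not the actual API of github.com/dsnet/compress):

```go
package flatesketch

import (
	"bufio"
	"io"
)

// peekDiscarder is satisfied by *bufio.Reader; Peek and Discard let the
// decoder look at whole chunks of buffered input instead of single bytes.
type peekDiscarder interface {
	io.Reader
	Peek(n int) ([]byte, error)
	Discard(n int) (int, error)
}

// asPeekDiscarder returns r unchanged if it already exposes Peek/Discard,
// and otherwise wraps it in a bufio.Reader, which does.
func asPeekDiscarder(r io.Reader) peekDiscarder {
	if pd, ok := r.(peekDiscarder); ok {
		return pd
	}
	return bufio.NewReader(r)
}
```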

Unfortunately, bytes.Buffer, bytes.Reader, and strings.Reader also implement io.ByteReader, so they will also need some wrapping to be performant.

Some of the relevant performance changes:

Thanks @dsnet, do you have an ETA for when you will start integrating your optimizations into the standard library?

I know that more optimized packages exist, but projects might be reluctant to add an external dependency for this task.

The biggest blocker is time, for both myself and a reviewer who's also familiar with RFC 1951. It's possible to get a new implementation in for Go 1.11, but who knows. Compression needs to be reviewed carefully.

/cc @mdempsky @nigeltao @klauspost. Any takers as reviewers?

I have added a few minor comments to the commits listed above.

You could consider supporting the io.WriterTo interface. It avoids a copy in the Read function.
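
To illustrate what that buys (a hedged sketch; the decoder type below is hypothetical, not dsnet's API): io.Copy checks whether its source implements io.WriterTo and, if so, hands the destination straight to the reader, skipping its own intermediate copy buffer:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// decoder stands in for a decompressor; out holds already-decoded data.
type decoder struct {
	out *strings.Reader
}

func (d *decoder) Read(p []byte) (int, error) { return d.out.Read(p) }

// WriteTo streams decoded bytes straight into w, so io.Copy never has to
// allocate a scratch buffer and do Read-then-Write round trips.
func (d *decoder) WriteTo(w io.Writer) (int64, error) {
	return d.out.WriteTo(w) // stands in for the real decode loop
}

func main() {
	var dst bytes.Buffer
	d := &decoder{out: strings.NewReader("decoded data")}
	n, err := io.Copy(&dst, d) // dispatches to d.WriteTo
	fmt.Println(n, dst.String(), err)
}
```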

I assume you have already fuzzed it :)

It's not production-ready yet, or any time soon, but I have been working on https://github.com/google/wuffs, a new, memory-safe programming language that currently generates C code and could generate Go code (using package unsafe) in the future. The kicker is that despite being memory-safe (e.g. all array accesses are bounds-checked), its DEFLATE implementation benchmarks more or less as fast as the C zlib library (see https://github.com/google/wuffs/blob/master/doc/benchmarks.md).

A rough estimate (different measurement frameworks, different test data) is that it is therefore about 4x faster at decoding than Go's compress/flate. On my desktop, Wuffs is 300 MB/s compared to "go test -test.bench=. compress/flate"'s 75 MB/s. There are a couple of further optimization techniques (such as https://github.com/madler/zlib/pull/292, which I proposed for C zlib but had to abandon because of API backwards-compatibility constraints) that I'd expect to give an extra 1.5x improvement to Wuffs' throughput.

It's still a work in progress, and won't help Go programs any time soon. But that's where I'm spending my limited time these days.

As for APIs (io.ByteReader, io.WriterTo, etc), Wuffs' API is similar to Go's golang.org/x/text/transform.Transformer. You wouldn't be able to just drop that interface (and its specific errors) into the stdlib per se, as the stdlib can't depend on x, but there may be some ideas in there worth considering.
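
For reference, the core of that interface (from golang.org/x/text/transform) looks like this; data is pushed through caller-provided dst/src slices rather than pulled through an io.Reader:

```go
package transform

// Transformer is the interface from golang.org/x/text/transform.
type Transformer interface {
	// Transform writes to dst the transformed bytes read from src, and
	// reports how many bytes of each it consumed and produced. atEOF
	// signals that src ends the input.
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)

	// Reset resets the state and allows a Transformer to be reused.
	Reset()
}
```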

@nigeltao This project looks amazing. Even if it probably won't happen soon, having a new implementation under golang.org/x/ would already be a big step toward convincing projects to adopt it, compared to non-official projects on GitHub.

FWIW, I'm seeing a nice performance improvement with Go 1.11 beta2. Down to 6.8s, compared to 8s originally with 1.9.2.

Not sure if this is an optimization in the deflate code, or general compiler optimizations.
