Aws-sdk-java: [S3] Clarification on (de)compression (i.e. "Content-Encoding: gzip")

Created on 4 May 2017 · 5 Comments · Source: aws/aws-sdk-java

I've tried to find information on how to handle the upload / download of compressed S3 objects via the SDK API, but I couldn't find any clear answer.

First, I've seen that ClientConfiguration contains the following on that matter:

    /**
     * The default on whether to use gzip compression.
     */
    public static final boolean DEFAULT_USE_GZIP = false;

    /**
     * Checks if gzip compression is used
     *
     * @return if gzip compression is used
     */
    public boolean useGzip() {
        return useGzip;
    }

    /**
     * Sets whether gzip compression should be used
     *
     * @param use
     *            whether gzip compression should be used
     */
    public void setUseGzip(boolean use) {
        this.useGzip = use;
    }

    /**
     * Sets whether gzip compression should be used
     *
     * @param use
     *            whether gzip compression should be used
     * @return The updated ClientConfiguration object.
     */
    public ClientConfiguration withGzip(boolean use) {
        setUseGzip(use);
        return this;
    }

And that this attribute has an effect on ApacheHttpClientFactory:

    public ConnectionManagerAwareHttpClient create(HttpClientSettings settings) {
        ...

        // By default http client enables Gzip compression. So we disable it
        // here.
        // Apache HTTP client removes Content-Length, Content-Encoding and
        // Content-MD5 headers when Gzip compression is enabled. Currently
        // this doesn't affect S3 or Glacier which exposes these headers.
        //
        if (!(settings.useGzip())) {
            builder.disableContentCompression();
        }

Which ultimately affects HttpClientBuilder#build:

    public CloseableHttpClient build() {
        ...

            if (!contentCompressionDisabled) {
                if (contentDecoderMap != null) {
                    final List<String> encodings = new ArrayList<String>(contentDecoderMap.keySet());
                    Collections.sort(encodings);
                    b.add(new RequestAcceptEncoding(encodings));
                } else {
                    b.add(new RequestAcceptEncoding());
                }
            }

So, if I understand correctly the comment in ApacheHttpClientFactory, "Gzip compression" actually means the _decoding_ (decompression) of HTTP resources returned with "Content-Encoding: gzip", induced by setting Accept-Encoding: gzip on the HTTP request?

Given that ClientConfiguration#DEFAULT_USE_GZIP is false, does that mean that, out of the box, the SDK always retrieves non-compressed HTTP resources, i.e. data travelling decompressed on the wire?
And is it only a matter of calling ClientConfiguration.setUseGzip(true) to change that behavior? Won't that affect any functionality related to S3, as the comment above implies?

The next part of the puzzle is how to _upload_ compressed, gzip-encoded data to S3 via the SDK.
What is the most effective way to do that?
Should on-the-fly encoding be avoided, in order to know the Content-Length value to assign to the request?
Is there some Apache HTTP client-related issue I should be aware of / pay attention to?
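
One way to sidestep the Content-Length question, assuming the payload fits in memory, is to compress it completely before the upload so that the final size is known up front. The sketch below only uses java.util.zip; the class name GzipBeforeUpload is hypothetical and the SDK calls are only mentioned in comments, so it is an illustration under those assumptions, not a recommendation from the SDK maintainers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of the AWS SDK. */
class GzipBeforeUpload {

    /** Compresses the whole payload in memory so its final size is known before sending. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } // closing the gzip stream writes the gzip trailer
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "<doc>hello hello hello hello</doc>".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(raw);

        // compressed.length is the value a Content-Length header would need.
        // Assumption: with the v1 SDK this would go into an ObjectMetadata via
        // setContentLength(compressed.length) and setContentEncoding("gzip")
        // before the put call. That part is left out so the sketch runs
        // without the SDK on the classpath.
        System.out.println("raw=" + raw.length + " bytes, gzip=" + compressed.length + " bytes");
    }
}
```

For payloads too large to buffer, the same idea would presumably work with a temporary file instead of a byte array.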

Thanks for your attention after this long message!

guidance

dalbani

All 5 comments

Hmm, after having done some testing, my understanding is that _retrieving_ gzip-encoded objects is more or less broken in the Java SDK.

Case 1: `ClientConfiguration.setUseGzip(false)`

Having ClientConfiguration.setUseGzip(false) (i.e. the default behavior) doesn't generate any error: the data is simply passed to the API caller still compressed, as originally stored.
It was actually a surprise to me that the S3 server doesn't support on-the-fly transcoding of gzip-encoded resources for HTTP clients which don't support gzip.
Compare to Google Cloud Storage which reportedly supports it: https://cloud.google.com/storage/docs/transcoding

Case 2: `ClientConfiguration.setUseGzip(true)`

Having ClientConfiguration.setUseGzip(true) sets up transparent gzip decompression in the Apache HTTP client (i.e. Accept-Encoding: deflate,gzip header).
Unfortunately, this breaks the MD5 checksum verification step in AmazonS3Client, due to the mismatch between the checksum originally calculated on the compressed data (as stored in ETag with PUT) and the checksum calculated on the decompressed data fetched with GET.
Turning off this validation with the com.amazonaws.services.s3.disableGetObjectMD5Validation property (see SkipMd5CheckStrategy class) leads to another failure, however.
This time, it's the verification of the content length which fails, because the lengths of the compressed and decompressed data differ, of course.

Those 2 validation steps are implemented at https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3Client.java#L1389.
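
The mismatch described in Case 2 can be reproduced without S3 at all. The following sketch (hypothetical class name GzipChecksumMismatch, stdlib only) compresses some bytes, decompresses them again, and compares MD5 digest and length of the compressed form against the decompressed form. It only illustrates why a check computed over the stored bytes cannot match transparently decoded bytes; it does not reproduce the SDK's actual validation code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo class, not part of the AWS SDK. */
class GzipChecksumMismatch {

    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("<row>value</row>");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] stored = gzip(original);    // stands in for the bytes uploaded with PUT
        byte[] received = gunzip(stored);  // stands in for what a transparently decoding client returns

        System.out.println("MD5 equal:    " + Arrays.equals(md5(stored), md5(received)));
        System.out.println("length equal: " + (stored.length == received.length));
    }
}
```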

Conclusion: from what I can see, there's no way to _transparently_ retrieve and use data from gzip-encoded objects in S3.
Or have I missed something obvious?

dalbani on 9 May 2017

👍1

Hi @dalbani, sorry for the late reply. Yes, your findings are correct: because of the checksum and length checks we do at the S3 client level, we don't currently support stream decompression of compressed object streams. I'm going to go ahead and close this. Please feel free to reopen it if you'd like to submit it as a feature request.

dagnir on 2 Jun 2017

Hi - is this still the case?

As far as I can tell there's still the issue with length checks on a decompressed stream, and hence any user has to manually decompress.

Are there any plans to fix this and allow transparent decompression, or at least make it more evident that automatic decompression is not supported by the SDK?

wjnicholson on 25 Feb 2019

Also, is this going to be fixed in v2 of the SDK? I found this issue on the v2 repo https://github.com/aws/aws-sdk-java-v2/issues/131 but I'm not sure if it's related. Thanks!

varxy20k on 13 May 2019

👍2

This still seems to be an issue, at least with 1.11.556. We saw the following behaviour with an object stored with
content-type: text/xml and content-encoding: gzip

Download via the AWS console using Chrome works like a charm: a 25 MB gzip is transparently decompressed to a 400 MB XML file.

Download via the aws-sdk gave us a transparently gunzipped input stream that was truncated to the gzip's file size of 25 MB, missing the rest of the ~400 MB, without any exception or warning.

When the content-type is changed from text/xml to binary/octet-stream (as mentioned here: https://github.com/aws/aws-sdk-java/issues/472), the aws-sdk does not do any transparent decompression but instead gives you the gzip binary, which you can then feed into a GZIPInputStream, and that works flawlessly.

TL;DR: to avoid bogus transparent decompression, set the content-type to binary and gunzip yourself.
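
A minimal sketch of the "gunzip yourself" step, assuming the SDK hands back the raw gzip bytes as a plain InputStream (in the v1 SDK that would presumably be the stream from S3Object.getObjectContent()). The class and method names here are hypothetical, and an in-memory stream stands in for the S3 object so the example runs without the SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of the AWS SDK. */
class ManualGunzip {

    /**
     * Streams gzip-compressed input into the sink and returns the number of
     * decompressed bytes written. Closing the GZIPInputStream also closes
     * the underlying compressed stream; the sink is left open.
     */
    static long gunzipTo(InputStream compressed, OutputStream sink) throws IOException {
        long total = 0;
        try (GZIPInputStream gz = new GZIPInputStream(compressed)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the object content: gzip some XML in memory.
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(packed)) {
            gz.write("<doc>payload</doc>".getBytes(StandardCharsets.UTF_8));
        }

        ByteArrayOutputStream xml = new ByteArrayOutputStream();
        long written = gunzipTo(new ByteArrayInputStream(packed.toByteArray()), xml);
        System.out.println(written + " bytes: " + new String(xml.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Copying in chunks to an OutputStream, rather than collecting everything in a byte array, keeps memory use flat for objects of the size mentioned above.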