Gsutil: how to `gsutil cp` gzip files without decompressing them?

Created on 11 Apr 2018 · 5 Comments · Source: GoogleCloudPlatform/gsutil

From reading the GCS decompressive transcoding documentation, I understand that the only way to retrieve a gzipped file stored on GCS with Content-Encoding: gzip in its compressed state is to pass an "Accept-Encoding: gzip" header when requesting it.

When I try to do so using gsutil, I get an error:

$ gsutil ls -L gs://xxx/0.json.gz | grep 'Content-'
    Content-Encoding:       gzip
    Content-Length:         129793
    Content-Type:           text/plain
$ gsutil -h "Accept-Encoding: gzip" cp gs://xxx/0.json.gz .
ArgumentException: Invalid header specified: accept-encoding: gzip

(I know this example is probably bad; the extension shouldn't be explicitly set to .gz, but I have to work with this right now)

My guess is that gsutil performs client-side decompression (as suggested by the gsutil cp documentation) and therefore does not allow the Accept-Encoding header to be passed.

My question, then, is: how can I use gsutil to download a gzip file that has its metadata set to Content-Encoding: gzip without decompressing it (and without having to set other metadata like Cache-Control: no-transform, if that would be a workaround)?

Feature Request

Most helpful comment

It doesn't look like there's a way to disable the auto-decompression behavior for gsutil cp. For one-off use cases, gsutil cat will skip the decompression:

$ gsutil cat gs://bucket/obj.gz > /destination/path/obj.gz

But I realize it's very slow and painful to run a separate invocation of gsutil for every object like this. We should provide some sort of behavior to prevent auto-decompression when downloading objects via cp/mv/rsync.
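For a handful of objects, the one-invocation-per-object approach above can be scripted. The following is only a minimal sketch: it assumes gsutil is on the PATH and that the caller already has the list of object URLs; the helper names (`cat_command`, `target_name`, `download_compressed`) are made up for illustration.

```python
import subprocess
from pathlib import Path


def cat_command(url):
    """Build the gsutil invocation that streams one object to stdout."""
    return ["gsutil", "cat", url]


def target_name(url):
    """Use the last path component of the object URL as the local file name."""
    return url.rsplit("/", 1)[-1]


def download_compressed(urls, dest_dir):
    """Run one `gsutil cat` per object and write its output to dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        # stdout is written straight to the file, so the bytes are stored
        # exactly as gsutil cat emits them.
        with open(dest / target_name(url), "wb") as out:
            subprocess.run(cat_command(url), stdout=out, check=True)
```

Calling `download_compressed(["gs://bucket/obj.gz"], "/destination/path")` would mirror the single command shown above. It still starts one gsutil process per object, so it stays slow for large numbers of files.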

All 5 comments

Thanks for your answer. I think the documentation could also mention the fact that gsutil will enforce the Accept-Encoding header (if I'm correct) because the error just looks a bit strange at first sight.

Using Cache-Control: no-transform doesn't help, since downloading (cp or rsync from cloud to local) auto-decompresses based on Content-Encoding: gzip only (metadata_util.py, ObjectIsGzipEncoded and usages).

I guess auto-decompression is active to keep cp -z working "as expected" for upload and download (e.g. #42), even though it could be argued that cp -z is essentially a gzip operation, and a bare cp should _always_ copy, not gunzip automatically (cf. the tar -cz and tar -xz symmetry).

For rsync, automatic decompression is especially problematic. Consider:

1) create a directory with a single, precompressed file
$ mkdir local
$ echo "compress me" | gzip -c > local/precompressed
2) sync local to remote
$ gsutil rsync -r local/ gs://somewhere/remote/
3) set the appropriate encoding on precompressed
$ gsutil setmeta -h 'content-encoding:gzip' gs://somewhere/remote/precompressed
4) sync remote back to local
$ gsutil rsync -r gs://somewhere/remote/ local/
5) observe that nothing is copied
Building synchronization state...
Starting synchronization...
6) now change the local precompressed file
$ echo "compress me differently" | gzip -c > local/precompressed
7) and sync remote back to local again
$ gsutil rsync -r gs://somewhere/remote/ local/
8) observe that precompressed is synced, and automatically decompressed
Building synchronization state...
Starting synchronization...
gs://somewhere/remote/precompressed has a...
Copying gs://somewhere/remote/precompressed...
Downloading to temp gzip filename...
...
9) and sync remote back to local once more
$ gsutil rsync -r gs://somewhere/remote/ local/
10) observe the same messages as in 8), since local/precompressed is now _uncompressed_ while remote/precompressed is not, so they differ
11) repeat 9) and 10) ad infinitum
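The loop in steps 8) to 11) can be illustrated without touching GCS at all. This is a rough sketch only: it assumes the comparison boils down to "do the local bytes match the stored (compressed) remote bytes?", and it does not reproduce gsutil's actual rsync comparison logic.

```python
import gzip
import hashlib


def md5_hex(data):
    """Hex MD5 of a byte string, used here as a stand-in for a checksum comparison."""
    return hashlib.md5(data).hexdigest()


# The object as stored remotely: gzip-compressed bytes (steps 1 to 3).
remote_stored = gzip.compress(b"compress me\n")

# What ends up on disk after a download with auto-decompression (step 8).
local_after_sync = gzip.decompress(remote_stored)

# The local copy is plain text and the remote copy is compressed, so a
# byte-level comparison reports a difference on every run (steps 9 to 11).
needs_sync = md5_hex(local_after_sync) != md5_hex(remote_stored)
print("local differs from remote:", needs_sync)
```

Without the decompression step the local and remote bytes would be identical, and the second sync would have nothing to do.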

Wrapping it up: an option to switch off auto decompression would be very helpful.

Hey :)

Any update on this one?

This is quite inconvenient when working with lots of gzip files in a data science environment (several TB in gzip format).
I have not found a good workaround so far (I will try the cat approach, but the lack of md5sum checks might lead to other problems :/)
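One way to get an MD5 check back after a cat-style download is to hash the local file and compare the result by hand with the "Hash (md5)" value printed by gsutil ls -L, which is base64-encoded. A minimal sketch follows; the function name `md5_base64` is made up here, and it assumes (as far as I understand) that the hash GCS reports for a Content-Encoding: gzip object is the hash of the stored, compressed bytes.

```python
import base64
import hashlib


def md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")
```

If the value for the downloaded file matches the one listed for the object, the compressed bytes arrived intact.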

Do you know of a script using the Python API (or any other language) that does this?

Thanks.

PS (edit): After looking through some code I found https://github.com/hackersandslackers/googlecloud-storage-tutorial/blob/master/main.py . I used it with raw_download=True on download_to_filename and it works, but it appears to be slower than gsutil, which is a problem given the large amount of data I need to transfer.
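For reference, a raw download with the google-cloud-storage Python client might look roughly like the sketch below. It is not the linked script: the function names, the `workers` parameter and the thread pool are assumptions added here, on the guess that downloading several objects in parallel may help with throughput. It needs the google-cloud-storage package and working credentials.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def local_path_for(blob_name, dest_dir):
    """Map an object name to a local path below dest_dir."""
    return Path(dest_dir) / blob_name


def download_raw(bucket_name, prefix, dest_dir, workers=8):
    """Download all objects under prefix without decompressing them."""
    # Imported here so the path helper above can be used without the library.
    from google.cloud import storage

    client = storage.Client()
    # Skip zero-byte "directory" placeholder objects whose names end in "/".
    blobs = [
        blob
        for blob in client.list_blobs(bucket_name, prefix=prefix)
        if not blob.name.endswith("/")
    ]

    def fetch(blob):
        target = local_path_for(blob.name, dest_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        # raw_download=True asks for the stored bytes as they are, so a
        # Content-Encoding: gzip object stays compressed on disk.
        blob.download_to_filename(str(target), raw_download=True)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, blobs))
```

Something like `download_raw("xxx", "", "./out")` would then fetch every object in the bucket; whether this gets close to gsutil's speed is untested.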

What if you have *.gz files with Content-Encoding: gzip metadata and you want to download all of them?

Horrible "Solution":

Use http://rclone.org/ instead: run rclone sync or rclone copy and pass --header-download "Accept-Encoding: gzip".
It will download the original .gz file without decompressing it.
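Whichever tool is used, a quick way to confirm that a downloaded file is still compressed is to look at its first two bytes, which are 0x1f 0x8b for gzip data. A small sketch, with a made-up helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

This only inspects the header, so it tells you the file was not decompressed, not that its contents are intact.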
