Reading the GCS decompressive transcoding documentation, I understand that the only way to retrieve a gzipped file stored on GCS with Content-Encoding: gzip in its compressed state is to pass an "Accept-Encoding: gzip" header when requesting it.
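For reference, outside gsutil that looks roughly like this against the JSON API (a minimal sketch; the token source is just an example and the object name is the one from my test below):

import shutil
import requests

# Fetch the object via the JSON API and keep the stored, compressed bytes by
# sending Accept-Encoding: gzip so GCS skips decompressive transcoding.
token = "..."  # e.g. from: gcloud auth print-access-token
url = "https://storage.googleapis.com/storage/v1/b/xxx/o/0.json.gz?alt=media"
# (object names containing "/" must be URL-encoded in the path)

resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}", "Accept-Encoding": "gzip"},
    stream=True,
)
resp.raise_for_status()

# Read from resp.raw so requests does not transparently gunzip the body.
with open("0.json.gz", "wb") as out:
    shutil.copyfileobj(resp.raw, out)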
When trying to do so using gsutil, I get an error:
$ gsutil ls -L gs://xxx/0.json.gz | grep 'Content-'
Content-Encoding: gzip
Content-Length: 129793
Content-Type: text/plain
$ gsutil -h "Accept-Encoding: gzip" cp gs://xxx/0.json.gz .
ArgumentException: Invalid header specified: accept-encoding: gzip
(I know this example is probably bad; the extension shouldn't be explicitly set to .gz, but I have to work with this right now.)
My guess is that gsutil performs client-side decompression (as suggested by the gsutil cp documentation) and so prevents passing the Accept-Encoding header.
My question, then, is: how can I use gsutil to download a gzip file whose metadata is set to Content-Encoding: gzip without decompressing it (and without having to set other metadata like Cache-Control: no-transform, if that would even be a workaround)?
It doesn't look like there's a way to disable the auto-decompression behavior for gsutil cp. For one-off use cases, gsutil cat will skip the decompression:
$ gsutil cat gs://bucket/obj.gz > /destination/path/obj.gz
But I realize it's very slow and painful to run a separate invocation of gsutil for every object like this. We should provide some sort of behavior to prevent auto-decompression when downloading objects via cp/mv/rsync.
Thanks for your answer. I think the documentation could also mention that gsutil enforces the Accept-Encoding header itself (if I'm correct), because the error looks a bit strange at first sight.
Using Cache-Control: no-transform doesn't help, since downloading (cp or rsync from cloud to local) auto-decompresses based on Content-Encoding: gzip alone (see metadata_util.py, ObjectIsGzipEncoded and its usages).
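For context, that check boils down to something roughly like this (my own paraphrase of the behaviour described above, not the actual gsutil source):

# Rough paraphrase, not the real metadata_util.py code: decompression is keyed
# off Content-Encoding alone, which is why Cache-Control: no-transform makes no
# difference for cp/rsync downloads.
def object_is_gzip_encoded(obj_metadata):
    content_encoding = getattr(obj_metadata, "contentEncoding", None)
    return bool(content_encoding) and "gzip" in content_encoding.lower()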
I guess auto-decompression is active to keep cp -z working "as expected" for both upload and download (e.g. #42), even though it could be argued that cp -z is essentially a gzip operation, and a bare cp should _always_ copy, not gunzip automatically (cf. the tar -cz and tar -xz symmetry).
For rsync, automatic decompression is especially problematic. Consider:
1) create a directory with a single, precompressed file
mkdir local
echo "compress me" | gzip -c > local/precompressed
2) sync local to remote
gsutil rsync -r local/ gs://somewhere/remote/
3) set appropriate encoding on precompressed
gsutil setmeta -h 'content-encoding:gzip' gs://somewhere/remote/precompressed
4) sync back remote
gsutil rsync -r gs://somewhere/remote/ local/
5) observe that nothing is copied
Building synchronization state...
Starting synchronization...
6) now change the local precompressed file
echo "compress me differently" | gzip -c > local/precompressed
7) and sync back remote again
gsutil rsync -r gs://somewhere/remote/ local/
8) observe that precompressed is synced, and automatically decompressed
Building synchronization state...
Starting synchronization...
gs://somewhere/remote/precompressed has a...
Copying gs://somewhere/remote/precompressed...
Downloading to temp gzip filename...
...
9) and sync back remote once more
gsutil rsync -r gs://somewhere/remote/ local/
10) observe the same messages as above in 8), since local/precompressed is _uncompressed_ and remote/precompressed isn't - they differ
11) repeat 9) and 10) ad infinitum
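For illustration, a hedged sketch (google-cloud-storage client, bucket/paths as above) of why 9) and 10) never converge: after the auto-decompressing sync, local/precompressed holds uncompressed bytes while the remote object still stores compressed bytes, so the size/checksum comparison always differs even though the contents match.

import gzip
from google.cloud import storage

client = storage.Client()
blob = client.bucket("somewhere").blob("remote/precompressed")
blob.reload()  # fetch metadata, including the stored (compressed) size

with open("local/precompressed", "rb") as f:
    local_bytes = f.read()  # uncompressed, thanks to rsync's auto-decompression

print("remote stored size:", blob.size)
print("local file size:   ", len(local_bytes))

# The payloads are logically the same, yet rsync keeps re-copying because the
# stored object and the local file never compare equal.
raw = blob.download_as_bytes(raw_download=True)
print("same content after gunzip:", gzip.decompress(raw) == local_bytes)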
Wrapping it up: an option to switch off auto decompression would be very helpful.
Hey :)
Any update on this one?
It is quite inconvenient when working with lots of gzip files in a data science environment (several TBs in gzip format).
I have not found a good workaround for the moment (I will try the cat version, but the lack of md5sum checks might lead to other problems :/).
Maybe you know of a script using the Python API (or any other language) that does this?
Thanks.
PS (edit): After looking through some code I found https://github.com/hackersandslackers/googlecloud-storage-tutorial/blob/master/main.py. I used it with raw_download=True on download_to_filename and it works, but it seems slower than gsutil (which is a problem given the large amount of data I need to transfer).
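In case it helps, roughly what that looks like (bucket and prefix here are placeholders; assumes the google-cloud-storage client library):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("xxx")

for blob in client.list_blobs(bucket, prefix="some/prefix/"):
    # raw_download=True keeps the stored, compressed bytes instead of letting
    # the library gunzip objects that carry Content-Encoding: gzip.
    # Newer library versions also accept a checksum argument (e.g. checksum="md5").
    blob.download_to_filename(blob.name.rsplit("/", 1)[-1], raw_download=True)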
If you have *.gz files with Content-Encoding: gzip metadata and you want to download all of them:
Horrible "solution": use http://rclone.org/ instead, with rclone sync or rclone copy, passing --header-download "Accept-Encoding: gzip".
It will download the original .gz files and will not decompress them.