Google-cloud-ruby: Issues with downloading GCS objects previously transferred with GCS transfer

Created on 24 Mar 2017 · 21 Comments · Source: googleapis/google-cloud-ruby

As part of a bigger application, I narrowed this problem down to the MD5 verification that runs when downloading a file with the "google/cloud/storage" gem. Using my service account, I can download files that either I or the service account uploaded to the bucket with the following test script:

require "google/cloud/storage"
storage = Google::Cloud::Storage.new(
  project: ENV['GCLOUD_PROJECT'],
  keyfile: ENV['GCLOUD_KEYFILE']
)

bucket = storage.bucket "bucketname"
file = bucket.file "filename"
file.download './foo', verify: :all

but I can't download any files that were transferred via GCS transfer service (I used this to transfer files from AWS into GCS). I get the following error:

/usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file/verifier.rb:34:in `verify_md5!': The downloaded file failed MD5 verification. (Google::Cloud::Storage::FileVerificationError)
    from /usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file.rb:809:in `verify_file!'
    from /usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file.rb:407:in `download'
    from sample.rb:9:in `<main>'
storage p2 acknowledged question

All 21 comments

What types of files are you unable to download? Are they all of the same type? Images? Videos? Text? CSV?

Can you download the file using file.download './foo', verify: :none, compute the MD5 locally, and compare it to the value in file.md5?
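
For reference, the local comparison suggested above could look roughly like the sketch below. It assumes the file was already fetched with verify: :none and that the expected value is the base64-encoded MD5 that file.md5 reports. The helper name md5_matches? is made up for illustration and is not part of the gem; a temporary file stands in for the real download.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper (not part of google-cloud-storage): returns true when
# the base64-encoded MD5 of the file at +path+ equals +expected_base64+.
def md5_matches?(path, expected_base64)
  Digest::MD5.file(path).base64digest == expected_base64
end

# A temporary file stands in for a download made with `verify: :none`.
Tempfile.create("download") do |tmp|
  tmp.binmode
  tmp.write("hello world\n")
  tmp.flush

  # In real use this value would come from file.md5 on the remote object.
  expected = Digest::MD5.base64digest("hello world\n")

  puts md5_matches?(tmp.path, expected)
  puts md5_matches?(tmp.path, "not-the-hash")
end
```

If the comparison fails for a file that downloads without errors, the bytes on disk are not the bytes that the stored hash describes.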

I am working with gzipped csv and json files.

I took a look at the filename.md5 and filename.crc32c using

gsutil ls -L gs://bucket/filename.json.gz

and after downloading a local copy of the file, I used the same method that the download verifier was using to calculate the hashes

require "digest/md5"
require "digest/crc32c" # provided by the digest-crc gem

# Compute base64-encoded hashes of the local copy.
path = "filename.json.gz"
puts Digest::CRC32c.file(path).base64digest
puts Digest::MD5.file(path).base64digest

and I got different values. Yes, I tried file.download './foo', verify: :none and it succeeded, but that is not how I want to handle these imported files.

I also downloaded the file from AWS and ran the above script on it. The MD5 and CRC32C hashes matched the values GCS reports for the object (filename.md5 and filename.crc32c). However, when I download the same file directly from GCS and run the same script, I get different MD5 and CRC32C hashes.

@swcloud I'm unsure how to address this, so I've assigned it to you. It looks to me like a possible bug in the GCS transfer service, but I don't know for sure. What I can say is that I'm pretty confident that google-cloud-storage is downloading the file properly, since downloading the file using gsutil produces the same result (according to this Stack Overflow comment).

@ioverzero @blowmage I will file an internal bug.

@blowmage @swcloud Thank you, how can I get notified of the progress?

Hi @ioverzero, I'd like to try and reproduce this problem. Could you either give me reproduction instructions for some test data or else send me the name of the bucket and object? If you'd rather not share it in public, you could email it to [email protected].

@BrandonY I sent an email with the specific configuration that gave me the error, with Case #12425537 in the subject line. Thanks for all the help.

Thanks @ioverzero. I examined the first object you cited. GCS and the client correctly report the MD5 of the .gz file.

The second MD5, the one that you are calculating by hand, is the MD5 of the unzipped object.

The file's metadata notes a "contentEncoding" of "gzip" and a "contentType" of "application/json". GCS interprets this as meaning that the object is JSON that happens to be stored gzip-encoded. When a client asks to download the resource, GCS may unzip the object if the client does not specify that a gzip encoding is okay.

If you literally want to store a ".json.gz" object and download it as such, you'll want to set the contentType to "application/x-gzip" and clear the contentEncoding.

I'm guessing that the google-cloud ruby library perhaps does not send an "Accept-Encoding: gzip" header, or alternatively it does send that header and then decompresses the resource before it gets to your code.
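
To illustrate why the two MD5 values differ, here is a small standalone sketch using only Ruby's standard library. It makes no calls to GCS; the sample JSON string and the variable names are made up for the example. One hash is computed over the gzipped bytes (analogous to what the object metadata describes) and one over the decoded bytes (analogous to what ends up on disk after decompression).

```ruby
require "digest"
require "zlib"

original   = %({"hello":"world"}\n) # decoded JSON content (made-up sample)
compressed = Zlib.gzip(original)    # bytes as they would be stored

# Hash of the stored (compressed) bytes, analogous to the object's MD5.
stored_md5 = Digest::MD5.base64digest(compressed)

# Hash of the bytes after content decoding, analogous to the local file.
downloaded     = Zlib.gunzip(compressed)
downloaded_md5 = Digest::MD5.base64digest(downloaded)

puts "stored:     #{stored_md5}"
puts "downloaded: #{downloaded_md5}"
puts "content intact: #{downloaded == original}"
```

The content survives the round trip, but the hashes cannot match because they are taken over different byte sequences.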

@ioverzero @BrandonY I was able to reproduce the issue.

  1. Upload a gzipped file (using the client library).
  2. In the GCS GUI, set the contentType to 'application/json' and the contentEncoding to 'gzip'.
  3. Download the file (using the client library).
  4. Get the same error message and verify the downloaded file is indeed decompressed.
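
One way to check step 4 is to look at the first two bytes of the downloaded file: gzip streams start with the magic bytes 0x1F 0x8B. The sketch below assumes that check is sufficient for this purpose; gzip_file? is a hypothetical helper name, and temporary files stand in for real downloads.

```ruby
require "tempfile"
require "zlib"

GZIP_MAGIC = "\x1F\x8B".b # first two bytes of every gzip stream

# Hypothetical helper: true when the file at +path+ starts with the gzip
# magic bytes, i.e. it is still compressed on disk.
def gzip_file?(path)
  File.binread(path, 2) == GZIP_MAGIC
end

# Temporary files stand in for a compressed and a decompressed download.
Tempfile.create("still-gzipped") do |tmp|
  tmp.binmode
  tmp.write(Zlib.gzip("{}"))
  tmp.flush
  puts "still gzipped: #{gzip_file?(tmp.path)}"
end

Tempfile.create("decompressed") do |tmp|
  tmp.binmode
  tmp.write("{}")
  tmp.flush
  puts "still gzipped: #{gzip_file?(tmp.path)}"
end
```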

@blowmage Do we decompress in the client library?

@swcloud The google-cloud-storage library does not explicitly decompress, but by setting that Content-Encoding HTTP header you are indicating to the HTTP client that the contents are gzipped and most HTTP clients will decompress. A common use-case for this is to store compressed website assets like stylesheets and javascript in a bucket.

@sqrrrl Do you have any input on whether files should be automatically decompressed when setting the ContentEncoding header? The decompressing is happening in the Hurley HTTP client used by the Google API Client.

FWIW, I agree with @BrandonY on this statement:

If you literally want to store a ".json.gz" object and download it as such, you'll want to set the contentType to "application/x-gzip" and clear the contentEncoding.

Hello @blowmage @swcloud @BrandonY, the last comment by @BrandonY made this work: I was able to download using the GCS gem.

If you literally want to store a ".json.gz" object and download it as such, you'll want to set the contentType to "application/x-gzip" and clear the contentEncoding.

The next question is how to set this as a default for the GCS transfer service, since all my files are coming in with content-type = application/json and content-encoding = gzip.

@blowmage - yes, the client is decompressing. Given the client mostly works with JSON responses that it needs to parse, it has to check the content encoding and decompress if necessary. Basically we're always interested in the content itself, and content-encoding tells us how to interpret it. If the content-encoding isn't set or is set to identity we'll just treat the content as-is.
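
As a rough illustration of the behavior described above (not the actual Hurley or Google API Client code), the decision could be sketched like this. decode_body is a hypothetical helper, and handling only gzip and identity is an assumption made to keep the example short.

```ruby
require "zlib"

# Hypothetical helper, for illustration only: returns the response body
# interpreted according to its Content-Encoding value.
def decode_body(body, content_encoding)
  case content_encoding.to_s.downcase
  when "", "identity" # header absent or identity: treat content as-is
    body
  when "gzip"         # gzip: decompress before handing the content on
    Zlib.gunzip(body)
  else
    raise ArgumentError, "unsupported content-encoding: #{content_encoding}"
  end
end

payload = %({"ok":true})
puts decode_body(Zlib.gzip(payload), "gzip")
puts decode_body(payload, nil)
```

Under this model the caller only ever sees decoded content, which is why a hash of the stored gzip bytes cannot match what the caller receives.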

@blowmage @sqrrrl @swcloud @BrandonY I guess a better question would be how to use the mentioned components (GCS transfer service, and the GCS gem) to get objects from AWS S3 into GCS and be able to download them using the GCS gem, then be able to operate on the unzipped object's contents.

The scenario will have a large number of files syncing hourly from AWS S3 to GCS (the S3 bucket grows at around 17 GB per day) using GCS transfer service, and a Google Container Engine worker downloads the files using the GCS gem and operates on the contents.

@blowmage is there a way to specify the MD5 hash check against the gzipped version of the file being downloaded?

@ioverzero Maybe I don't understand what you're asking for, but did you try the encryption_key option?

@quartzmo I need to download gzipped files that have been transferred over with GCS transfer service. I have considered some of my options but none of these will work for me:

  1. Have a script that changes the metadata of the incoming files to be content-type = application/x-gzip
    Cons: Resource intensive, not guaranteed to work since ruby script will try to download files as soon as they hit GCS
  2. Change the file attributes before download with the gem using something like:

    bucket = storage.bucket 'some-bucket'
    file = bucket.file "file.json.gz"

    file.update do |f| # update file attributes for successful download
      f.content_type = "application/x-gzip"
      f.content_encoding = ""
    end

    file.download './foo', verify: :all # will download successfully

    file.update do |f| # revert metadata changes
      f.content_type = "application/json"
      f.content_encoding = "gzip"
    end

    Cons: Since my script is running based on GCS event notifications, this will cause multiple download attempts, some of which fail.
  3. Do not verify MD5

    file.download './foo', verify: :none

    Cons: No measure of file integrity.

@ioverzero Thanks for the clarification. (I thought I misunderstood your question to @blowmage, and indeed I did.)

is there a way to specify the MD5 hash check against the gzipped version of the file being downloaded?

No, the decompressing happens during transport. The only way to control this behavior is to correct the content-type and content-encoding values.

Thanks @blowmage, I can work around this.
