Dvc: azure/gs/s3: re-design checksum computation for external dependencies/outputs

Created on 4 Dec 2018 · 7 Comments · Source: iterative/dvc

Unlike S3, Azure does provide Content-MD5 (we need to investigate whether it still works, since S3 supported it some time ago and no longer does). This is related to the external outs/deps feature, which is not implemented yet. The usual push/pull are not going to be affected.

c13-half-a-week enhancement p2-medium

All 7 comments

OK, it turns out the user has to set Content-MD5 themselves for this to work. Also, the ETag does indeed change when you simply copy a blob from one location to another. So we need to either require Content-MD5 to be present or compute and set it ourselves. We can use the ETag for dependencies, though.
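For the "compute and set it ourselves" option, here is a minimal sketch of producing the value. It assumes the usual Content-MD5 encoding (the base64 form of the raw 16-byte MD5 digest); the function name `content_md5` and the chunk size are hypothetical, and actually attaching the value to a blob is left to whichever storage client is in use.

```python
import base64
import hashlib
import io


def content_md5(fobj, chunk_size=1024 * 1024):
    """Return a base64-encoded MD5 digest of a binary file-like object.

    Hypothetical helper: reads the stream in chunks so large files
    do not have to fit in memory.
    """
    md5 = hashlib.md5()
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        md5.update(chunk)
    # Content-MD5 carries the raw digest in base64, not the hex string.
    return base64.b64encode(md5.digest()).decode("ascii")


# Usage with an in-memory stream; a real caller would pass open(path, "rb").
example_value = content_md5(io.BytesIO(b"some blob contents"))
```

The hex digest that DVC-style checksums normally use can be recovered from this value with `base64.b64decode(value).hex()`.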

Also, MD5 on Google Cloud Storage is only supported for non-composite objects, so it looks like we need to do something about that as well.

OK, it turns out that when you move multipart objects on S3, they don't retain their ETag. See https://discordapp.com/channels/485586884165107732/485596304961962003/563864123587297280 . So we might need to come up with a general approach that handles all of these clouds :(

Another approach to handle all of these in a unified way is https://discordapp.com/channels/485586884165107732/485596304961962003/563899005151608843

I'm taking a shot at the S3 workaround. As far as I can see, we need to check whether the source object of the copy here was uploaded as multipart, by checking whether its ETag has a suffix. If it does, we need to work out, from the object's size and the number of parts it was uploaded in (taken from the ETag suffix), the chunk size to set in the TransferConfig passed to the copy, so that the copy ends up with the same number of parts.

Actually, head_object returns the part count and the size; using that is less hacky.
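A rough sketch of the arithmetic involved, under stated assumptions: the helper names `etag_part_count` and `candidate_chunk_sizes` are hypothetical, the ETag is assumed to look like `"<hex>-<N>"` for multipart uploads, and the original uploader is assumed to have used a chunk size that is a whole number of MiB. Matching the part count alone may not be enough to reproduce the ETag, since several chunk sizes can yield the same count, so the sketch returns every candidate rather than a single answer.

```python
MIB = 1024 * 1024


def etag_part_count(etag):
    """Return the part count from a multipart-style ETag, or None.

    Assumes multipart ETags end in "-<N>"; plain ETags have no suffix.
    """
    value = etag.strip('"')
    _, sep, suffix = value.rpartition("-")
    if sep and suffix.isdigit():
        return int(suffix)
    return None


def candidate_chunk_sizes(size, parts, step=MIB):
    """List step-aligned chunk sizes that split `size` bytes into `parts` parts."""
    if parts < 1 or size < parts:
        return []
    # Smallest chunk size that still fits the object in `parts` parts,
    # rounded up to the alignment step (integer ceiling division).
    lowest = -(-size // parts)
    first = -(-lowest // step) * step
    if parts == 1:
        return [first]
    # Largest chunk size that still needs more than `parts - 1` parts.
    highest = -(-size // (parts - 1)) - 1
    return list(range(first, highest + 1, step))


# A 20 MiB object uploaded in 3 parts could have used 7, 8 or 9 MiB chunks.
sizes = candidate_chunk_sizes(20 * MIB, etag_part_count('"0123abcd-3"'))
```

A chosen candidate would then be passed as `multipart_chunksize` to boto3's `TransferConfig` for the copy, with the part count taken either from the ETag suffix or from head_object as discussed here.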

See https://github.com/aws/aws-sdk-java/issues/1303 : that behavior has generally been considered undocumented and, per aws-sdk-java, shouldn't be relied on.
