Dvc: azure: support external dependencies and outputs

Created on 26 Mar 2020 · 1 Comment · Source: iterative/dvc

As the title says, we currently do not support external dependencies and outputs on Azure, so
dvc import-url and friends don't work.

Implementation

It should be trivial to implement. dvc uses the ETag to ensure that it's the same file, so we just need to implement the following methods in dvc/remote/azure.py, similar to the ones in dvc/remote/s3.py:

https://github.com/iterative/dvc/blob/a1fe6c6f44777463876ad24ee0d162173999f9d3/dvc/remote/s3.py#L78-L79

From Ruslan's reply on Discord: https://discordapp.com/channels/485586884165107732/485596304961962003/691327678271193160

Then:

  • Implement DependencyAzure and OutputAzure classes. Refer to DependencyS3 and OutputS3 for examples.
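
To illustrate the ETag idea above, here is a minimal sketch, not the actual dvc implementation. FakeBlobService, get_etag and has_changed are hypothetical names invented for this example. The get_blob_properties call and the .properties.etag attribute are modeled on my understanding of the legacy Azure BlockBlobService client, which is an assumption to verify against the SDK version dvc actually uses.

```python
from collections import namedtuple

# Hypothetical stand-ins for the objects an Azure client returns.
# They are not part of dvc or of the Azure SDK.
BlobProperties = namedtuple("BlobProperties", ["etag"])
Blob = namedtuple("Blob", ["name", "properties"])


class FakeBlobService:
    """In-memory substitute for a blob service client (illustration only)."""

    def __init__(self, blobs):
        self._blobs = blobs  # maps (container, blob name) -> ETag string

    def get_blob_properties(self, container_name, blob_name):
        etag = self._blobs[(container_name, blob_name)]
        return Blob(blob_name, BlobProperties(etag))


def get_etag(service, container, path):
    """Return the blob's ETag with any surrounding quotes stripped."""
    blob = service.get_blob_properties(container, path)
    return blob.properties.etag.strip('"')


def has_changed(service, container, path, recorded_etag):
    """True if the blob's current ETag differs from the one recorded earlier."""
    return get_etag(service, container, path) != recorded_etag
```

The point of the sketch is only that an external dependency can be tracked by recording the ETag once and comparing it later, without downloading the blob.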

P.S. I could be missing a few things though. :(

Discord context: https://discordapp.com/channels/485586884165107732/485596304961962003/691283402413834290

Labels: feature request, good first issue, help wanted, p3-nice-to-have

Most helpful comment

For the purposes of completeness and easy reference (this was also discussed on Discord and in the relevant PR):

From the official API docs:

In version 2012-02-12 and newer, Put Blob sets a block blob's MD5 hash value even when the Put Blob request doesn't include an MD5 header.

If I am reading that correctly, every blob should always have the Content-MD5 property set on upload. Additionally, the PUT request fails if the MD5 specified in the request doesn't match the computed one, so no blob can ever have an incorrect MD5. If the header is omitted, the MD5 is calculated anyway.

Nevertheless, when I tested uploading a 4GB file through the Azure web UI, the blob was not assigned a Content-MD5 property, so apparently it does not actually work as documented. I have submitted a request to Azure support asking for clarification and explaining that ideally a correct Content-MD5 property would always be set. I'll provide an update when I get a response.
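
Because Content-MD5 may be missing, any code that relies on it has to treat "no value" as a separate case from "mismatch". A minimal sketch of that check follows. The helper names content_md5 and md5_matches are hypothetical, and the only assumption about the service is that Content-MD5 is the base64 encoding of the raw MD5 digest, as the standard HTTP header defines it.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Base64-encode the raw MD5 digest, the form used by Content-MD5."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def md5_matches(data: bytes, blob_content_md5):
    """Compare local data against a blob's Content-MD5 property.

    Returns True or False when the property is present, and None when the
    blob has no Content-MD5 at all (the 4GB upload case described above).
    """
    if not blob_content_md5:
        return None
    return content_md5(data) == blob_content_md5
```

A caller that gets None back cannot conclude anything about the blob's contents and would need another identifier, such as the ETag.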
