Dvc.org: docs: differentiate types of file hashes

Created on 21 Jun 2020  ·  18 Comments  ·  Source: iterative/dvc.org

Extracted from https://github.com/iterative/dvc.org/pull/1448#discussion_r443124777 and https://github.com/iterative/dvc.org/pull/1494#pullrequestreview-441361208:

For HTTP(s)/S3/azure, we have etag, for ssh/local/GS we have md5 and for hdfs, we use checksum key.

("key" referring to the actual name with which the values are saved in dvc.yaml)
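
As a rough illustration of the mapping described above, the sketch below picks the expected key name from a location's URL scheme. The names `HASH_FIELD_BY_SCHEME` and `hash_field_for` are hypothetical and are not part of DVC's API; the scheme strings (for example `azure` and `gs`) are assumptions, and the mapping only mirrors what is stated in this issue.

```python
from urllib.parse import urlparse

# Mapping as stated in this issue; illustrative only, not DVC's own code.
HASH_FIELD_BY_SCHEME = {
    "http": "etag",
    "https": "etag",
    "s3": "etag",
    "azure": "etag",
    "ssh": "md5",
    "gs": "md5",
    "": "md5",      # plain local path without a scheme
    "file": "md5",  # local path written as a file:// URL
    "hdfs": "checksum",
}


def hash_field_for(url: str) -> str:
    """Return the key name expected for the given location (hypothetical helper)."""
    scheme = urlparse(url).scheme.lower()
    try:
        return HASH_FIELD_BY_SCHEME[scheme]
    except KeyError:
        raise ValueError(f"no hash field known for scheme {scheme!r}") from None
```

For instance, `hash_field_for("s3://bucket/data.csv")` would give `etag`, while a relative local path such as `data/file.csv` would give `md5`. Windows drive-letter paths are not handled by this sketch.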

Related: #68


I believe this affects the following docs:

  • [x] dvc.lock file description
  • [ ] run -d (external dep)
  • [ ] add? (external outs)
  • [ ] import-url or import?
  • [ ] External Dependencies guide
  • [ ] External Data (Outputs) guide

some of which may need more examples that feature etag and checksum fields in dvc.lock.

doc-content enhancement

All 18 comments

I would suggest generalizing it somehow instead of trying to mention every possible field everywhere. E.g. outs can have a hash (e.g. md5 or etag).

Agree but we still need to update all those docs with the more general concept of "file hash" or "hash value" — which fortunately we've already been doing in the past 🙂

I think this issue is linked to #1448 more than it looks.
I didn't find any page that lists all the file hashes together, and I think we might need one every time we try to generalize.
I'm not sure where exactly we would need to update the contents in import, import-url. Everything looks quite specific to me there.
I believe once #1448 is merged, we can add the link to dvc-file-and-directories in #1494 too.
Thoughts on this?

I'm not sure where exactly we would need to update the contents in import, import-url

Which kind of file hash is used on each type of external dependency. Same as in add and run. Or maybe in the External Data guide and link from all those refs into there.

once #1448 is merged, we can add the link to

Yes, maybe.

For HTTP(s)/S3/azure, we have etag, for ssh/local/GS we have md5 and for hdfs, we use checksum key.

What about OSS and Google Drive @skshetry ? Please lmk and I'll update in #1527

Also @skshetry this all applies to .dvc files as well, right? Not just dvc.lock

Oh and what about external outputs? Or do outs entries always use md5 no matter what? CC @efiop you probably also know 🙂

@jorgeorpinel They might be md5(local, ssh, gs)/etag(s3) or checksum(hdfs)

what about external outputs?

@jorgeorpinel, this is about external dependencies and outputs. So, it applies to dvc.lock and .dvc files too.

What about OSS and Google Drive?

OSS uses etag. Google Drive does not support external deps/outs, so it does not have any.

oss also doesn't support external deps/outs.

They might be md5(local, ssh, gs)/etag(s3) or checksum(hdfs)

@efiop, Saugat indicated recently (https://github.com/iterative/dvc.org/pull/1494#pullrequestreview-441361208) that eTag is also used for GS. Can you guys confirm please? Thanks

On OSS, please clarify:

OSS uses etag

vs.

oss also doesn't support external deps/outs.

Thanks

Wait no, sorry. Saugat said MD5 for GS, never mind. eTag is for HTTP(s), S3, and Azure.

p.s. you can just review #1527 instead.

Hi all! I'm working on a tool to upload data to a DVC remote without actually using DVC, and I noticed that the md5 of a file is not calculated correctly if the file is in Windows format. Looks like the CRLF are replaced by LF before calculation (see https://github.com/iterative/dvc/pull/775). This is a minor issue, but it means that files differing only in their line endings are treated as equal.

The point is, I first thought that for files added with dvc add it would always be the md5 checksum of the content, but it looks like it is not that simple.
Is there any clear documentation on how the file hash is calculated, to help me implement a compatible uploader?

I noticed that the md5 of a file is not calculated correctly if the file is in Windows format. Looks like the CRLF are replaced by LF before calculation.

Could it be Git doing that though (depending on the repo's config)? Cc @efiop on this Q anyway.

Is there any clear documentation on how the file hash is calculated

Not much @MetalBlueberry, which is why this issue and #68 exist.

The one place where we've already put some of this info is in our DVC Metafiles guide: https://dvc.org/doc/user-guide/dvc-files-and-directories (please find the md5, etag, checksum, and hash terms).

Feel free to ask any questions about this or anything else directly in our http://dvc.org/chat !

@MetalBlueberry if we're talking about a regular dvc add of some local artifact (not dvc add --external), you got it right: it applies dos2unix, and that's pretty much it. Curious about your use case though. Why don't you want to use DVC for this? You could reuse some of the API at least, e.g. to calculate the hash.
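
For anyone writing a compatible uploader, the behavior described in the comment above can be sketched roughly as follows. This is a minimal sketch, assuming that "applies dos2unix" means replacing every CRLF with LF before hashing; the helper name `md5_with_dos2unix` is hypothetical, this is not DVC's actual implementation, and the thread does not say whether the conversion applies to every file or only to files detected as text.

```python
import hashlib


def md5_with_dos2unix(data: bytes) -> str:
    """MD5 of the content after converting CRLF line endings to LF.

    Hypothetical helper: an approximation of the behavior described in
    this thread, not a copy of DVC's code.
    """
    normalized = data.replace(b"\r\n", b"\n")
    return hashlib.md5(normalized).hexdigest()


def plain_md5(data: bytes) -> str:
    """MD5 of the raw bytes, with no line-ending conversion."""
    return hashlib.md5(data).hexdigest()
```

With these helpers, a file saved with Windows line endings and the same file saved with Unix line endings get the same `md5_with_dos2unix` value but different `plain_md5` values, which matches the mismatch reported above. Comparing the result against the hash that DVC itself records for a known file is the safest way to check the assumption.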

I've created an issue for the CRLF problem https://github.com/iterative/dvc/issues/4658.
