Dvc.org: decide best term for "md5" "checksum" or "hash" strings used in DVC-files, and Git commit (SHA) "hash"

Created on 12 Aug 2019 · 8 Comments · Source: iterative/dvc.org

and implement throughout the docs.

And consider changing the md5 field name (also in DVC-files)?

Notes:

  • Avoid MD5. We don't actually always use pure MD5. See #68
    Oh and let's try to use upper case "MD5", not "md5" when possible.
  • hash: May be too general a term? (Also applies to Git commit SHA-1 hashes)
  • checksum: according to Ivan and Ruslan this term isn't correct though...
    UPDATE: See https://github.com/iterative/dvc.org/issues/552#issuecomment-585041488 + below.
doc-content enhancement

All 8 comments

If it is just dos2unix <datafile> && md5sum <datafile> (as described here: https://github.com/iterative/dvc.org/issues/68#issuecomment-520301930),
then it actually is pure MD5, but the datafile is normalized first, to avoid any mismatching in different operating systems.
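To illustrate the point above, here is a minimal Python sketch of "normalize line endings, then take the MD5". It only mirrors the `dos2unix` + `md5sum` description in this comment and is not taken from DVC's code; the function name `normalized_md5` and the sample byte strings are made up for the example, and the sketch assumes that normalization means nothing more than replacing CRLF with LF.

```python
import hashlib


def normalized_md5(data: bytes) -> str:
    """Return the MD5 hex digest of data after converting CRLF to LF.

    Hypothetical helper: it approximates `dos2unix <file> && md5sum <file>`
    and is an assumption about the behavior, not DVC's implementation.
    """
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


# The same text saved with Unix and with Windows line endings.
unix_text = b"a,b\n1,2\n"
dos_text = b"a,b\r\n1,2\r\n"

# With normalization both variants produce the same hash value.
print(normalized_md5(unix_text) == normalized_md5(dos_text))  # True

# A plain MD5 over the raw bytes differs between the two variants.
print(hashlib.md5(unix_text).hexdigest() == hashlib.md5(dos_text).hexdigest())  # False
```

If this is indeed how the value is computed, the result is still an ordinary MD5 digest, only over the normalized bytes rather than the raw file.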

I think that when the explanation is meant for the users (who don't need to know the internal details) the terms hash/checksum are ok.
When explaining internal details, then MD5 is better, and also accurate. However, a note on end-of-line normalization may also be useful.

I agree with @dashohoxha .

Checksum vs. hash: "hash" is the better term because of the primary goal we are using these MD5s for, which is to store and find binary files in our storage (though we check the consistency as well).

I'm fine to use checksum everywhere, especially considering that it won't be hard to switch to a new version.

MD5, etag, other checksums make sense when we go into details.
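The two roles mentioned in this comment can be sketched in a few lines of Python. This is only a toy illustration, not how DVC's storage works: `store`, `put`, and `get` are hypothetical names, and an in-memory dict stands in for real storage. The same MD5 value acts as a "hash" when it is the key used to store and find content, and as a "checksum" when it is used to verify that content.

```python
import hashlib

# Hypothetical in-memory stand-in for content-addressed storage.
store = {}


def put(data: bytes) -> str:
    """Store data under its MD5 hex digest and return that key ("hash" role)."""
    key = hashlib.md5(data).hexdigest()
    store[key] = data
    return key


def get(key: str) -> bytes:
    """Look data up by key, then verify it against the key ("checksum" role)."""
    data = store[key]
    if hashlib.md5(data).hexdigest() != key:
        raise ValueError("stored content does not match its key")
    return data


key = put(b"some binary content")
print(get(key) == b"some binary content")  # True
```

Seen this way, the choice between the two terms is mostly about which use the docs want to emphasize.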

I also vote for "checksum". In fact I don't see how it's the wrong concept at all, other than the purpose we use it for. I think "hash" is too general, TBH.

  • [x] I also just found the term "pointer" being used in some of our docs to refer to these fields.

UPDATE: I made sure every instance of MD5 is upper case unless it's a file or command code block as part of #669.

Hi! Notice that per some relatively recent conversation with @shcheklein, we stopped using "checksum" in docs, in favor of terms like (MD5/md5) "file hash" and "hash values" (in DVC-files). See #962

Should we apply the same in docs?

Note that there is somewhat of a collision between that term and (Git) "commit hash" for which we're using "commit SHA hash" in the docs PR right now (may review that one). Cc @efiop

@jorgeorpinel I would reconsider using MD5. As https://github.com/iterative/dvc/issues/1676 and https://github.com/iterative/dvc/issues/3069 are still open, and we do not intend to drop them, there is a possibility that at some point "md5" will become an obsolete term. That would trigger a lot of changes on the docs side.

Agree. Right now we are reducing the usage of MD5 in docs because, AFAIK, our file hashes are not always direct MD5 (and this is still not properly explained, see #68).
This issue is to decide what to do with all these terms so if you guys tell me we should remove MD5 as much as possible then we'll do that.
