Dvc: Support hashings other than MD5

Created on 6 Jan 2020  路  8Comments  路  Source: iterative/dvc

The docs seem to imply that md5 is the only hash users can use. The vulnerabilities of md5 have made it not usable in many organizations which require a certain level of security.

Is the a way to use SHA1 or SHA256 or other hashes instead?

I saw some PRs (https://github.com/iterative/dvc/pull/2022 for instance), but they're closed w/o being merged. What's the progress on that front?

feature request p3-nice-to-have question

Most helpful comment

I agree with @adrinjalali, using MD5 to check for integrity is far from ideal.

As SHA-1 is starting to show collisions as well, DVC could join git and replace the hashing function to SHA-256.

All 8 comments

Hi @adrinjalali !

In our case md5 is not used for cryptography, it is simply used for file hashing, where it is still suitable (at least I'm not aware of any issues with it in such scenarios), so it is safe to use for htat purpose. The contents of the cache files are not encrypted, you could see it for yourself by opening any file under .dvc/cache. If you need to encrypt your data, you could do that pretty easily on the remote itself (e.g. sse option for s3 remote) or encrypt in your pipeline along the way and only track those encrypted files. We also have https://github.com/iterative/dvc/issues/1317 for in-house encryption, please feel free to leave a comment there, so it is easier for us to estimate the priority. Would the cloud-side encryption be suitable for your scenario?

I understand these hashes shouldn't be the only defense layer against a malicious player, but it is still a very useful thing. Imagine a malicious player who has write access to the storage (s3 bucket in this case). They can take a model file, modify it in a malicious way in a way the md5 is kept unchanged, and upload the new one. The rest of the pipeline now won't detect a change in the file, and will serve the new wrong model.

Again, I understand this should not be seen as the only way to secure the models, but in places where software goes through security audit before being used, md5 is one of the first things they're gonna use to reject it.

Another point is that having a SHA is not expensive or complicated at all, so why not include it?

I agree with @adrinjalali, using MD5 to check for integrity is far from ideal.

As SHA-1 is starting to show collisions as well, DVC could join git and replace the hashing function to SHA-256.

@adrinjalali @JavierLuna Thanks for the info, guys! We will support other hash functions eventually, it just wasn't the focus for us, but we will reconsider it. We've had an attempt from an outside contributor https://github.com/iterative/dvc/pull/1920 , but he wasn't able to finalise it :slightly_frowning_face: If anyone is interested in contributing this feature(similar to #1920, where sha is a opt-in config option for now), we will be happy to help with everything we can.

I'll leave this issue opened in addition to https://github.com/iterative/dvc/issues/1676 , as this current one is about simply replacing md5 with something else (possibly as an option), but #1676 is originally about some custom hash functions.

I'd like to help with both issues!
Maybe I'll try working on #1676 and set the default hashing function to sha-256?

@JavierLuna That would be great! :pray:

Maybe I'll try working on #1676

Let's start with the current one, similar to how #1920 does it. Custom hashes could be implemented later, esp since we don't really know what type of hash creator of #1676 wants.

set the default hashing function to sha-256

Please don't set it as a default one, it should be configurable. Md5 should stay default for now. If the feature is implemented correctly, we will be able to switch to sha-256 later using 1-line PR :slightly_smiling_face:

Please let us know if you have any questions :) FYI: we also have a #dev-general channel in our discord, feel free to join :slightly_smiling_face:

Also, we need to remember that changing default checksum will result in whole cache recalculation for users upgrading their DVC version, and this could bring a lot of issues. If having md5 is severe, the change of default checksum should probably be left for major version release.

It might be interesting to use xxhash: https://cyan4973.github.io/xxHash/ https://pypi.org/project/xxhash/

which is quite a bit faster than md5 (although like md5 it is not cryptographic, but that's fine for the dvc use case).

As a reference, I have a file util_hash in my ubelt library that supports generalized hashing. I may take a stab at integrating xxhash in a future PR.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dmpetrov picture dmpetrov  路  35Comments

mdekstrand picture mdekstrand  路  43Comments

kskyten picture kskyten  路  44Comments

Suor picture Suor  路  39Comments

luchoPipe87 picture luchoPipe87  路  69Comments