It is a great idea. I have not done any perf tests but based on my understanding the bottleneck will be I\O, not CPU\algorithm.
It might help actually. ~3x difference in my laptop SSD and 1Gb file comparing to simple algorithms such as crc32 and line count.
$ du -sh stackoverflow_small_xml/Votes.xml
1.0G stackoverflow_small_xml/Votes.xml
$ time wc -l stackoverflow_small_xml/Votes.xml
11672020 stackoverflow_small_xml/Votes.xml
real 0m0.865s
user 0m0.690s
sys 0m0.173s
$ time crc32 stackoverflow_small_xml/Votes.xml
829c3e66
real 0m0.708s
user 0m0.375s
sys 0m0.323s
$ time md5 stackoverflow_small_xml/Votes.xml
MD5 (stackoverflow_small_xml/Votes.xml) = 3c62472bd006701211c4453c26e87432
real 0m2.564s
user 0m2.536s
sys 0m0.273s
$ time shasum stackoverflow_small_xml/Votes.xml
shasum: stackoverflow_small_xml/Votes.xml:
real 0m2.897s
user 0m2.557s
sys 0m0.335s
PS: I listed median numbers after 3 runs for each of the commands.
UPDATED: sha
added
Specs:
CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz (8 processors - 4 cores)
Memory: 8G
Disk: HD
MD5:
❯ for x in {1..5}; do time python -c 'import hashlib; hashlib.md5(open("sample.txt", "rb").read()).hexdigest()'; done
python -c 1.59s user 0.48s system 99% cpu 2.079 total
python -c 1.62s user 0.49s system 99% cpu 2.113 total
python -c 1.59s user 0.50s system 99% cpu 2.100 total
python -c 1.59s user 0.48s system 99% cpu 2.075 total
python -c 1.64s user 0.48s system 99% cpu 2.127 total
SHA-1:
❯ for x in {1..5}; do time python -c 'import hashlib; hashlib.sha1(open("sample.txt", "rb").read()).hexdigest()'; done
python -c 1.15s user 0.48s system 99% cpu 1.636 total
python -c 1.16s user 0.49s system 99% cpu 1.655 total
python -c 1.13s user 0.49s system 99% cpu 1.622 total
python -c 1.15s user 0.49s system 99% cpu 1.646 total
python -c 1.21s user 0.55s system 99% cpu 1.762 total
xxHash:
❯ for x in {1..5}; do time python -c 'import xxhash; xxhash.xxh64(open("sample.txt", "rb").read()).hexdigest()'; done
python -c 0.15s user 0.48s system 99% cpu 0.631 total
python -c 0.14s user 0.51s system 99% cpu 0.648 total
python -c 0.14s user 0.48s system 99% cpu 0.621 total
python -c 0.13s user 0.48s system 99% cpu 0.620 total
python -c 0.14s user 0.48s system 99% cpu 0.623 total
:zap: It is indeed fast, I couldn't find any information about the risk of collision, tho
Unintended collisions would be really unfortunate, but I don't know if we can raise the probabilities in favor of speed
_Btw. MD5 is 128 bits, xxHash is 64 bits_
For reference: https://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/
@dmpetrov, @ivan, @efiop, @pared, what are your thoughts on this one? I'm interested in how this could benefit dvc
:thinking:
FYI: I added sha
perf test to my previous comment.
I don't see strong reasons for a cryptographic algorithm. A general hashing algo with a reasonable collision rate should be good. I would seriously consider xxHash or sha (since Git uses that) to make DVC faster.
PS: And it is not high priority task to my mind.
@mroutis what was sample.txt
file size in your tests?
@dmpetrov , I forgot to add it:
dd if=/dev/urandom of=sample.txt count=1 bs=1G iflag=fullblock
It may not be the best scenario, tho... as a lot of data is structured.
I've done another perf comparison md5 vs sha1 with 10Gb file:
hashlib
lib sha1
implementation is almost 10% faster than md5
.shasum
and md5
) the result is opposite - md5 is ~13% faster.The python implementation (1) is more relevant for DVC. We can say that DVC will get 10% improvement by moving to sha1
.
Another thing that @mroutis brought - cryptographic algorithm vs not crypto. It seems like cryptographic algorithm is important for tools like Git and DVC because a single collision can break the entire repository. And you might heart that today md5 algo is not considered as as cryptographic algo due to some issues - https://security.stackexchange.com/questions/11839/what-is-the-difference-between-a-hash-function-and-a-cryptographic-hash-function
My takeaways: if we decide to change the algo I'd consider only cryptographic algos. sha1 might be a perfect option since the python implementation is even faster that the md5 implementation that we curently use.
Interesting fact: SHA-1 is also broken
@mhham , it looks like sha256
is quite slow :disappointed:
Tested with the same file as below:
dd if=/dev/urandom of=sample.txt count=1 bs=1G iflag=fullblock
❯ for x in {1..5}; do time python -c 'import hashlib; hashlib.sha256(open("sample.txt", "rb").read()).hexdigest()'; done
python -c 4.09s user 0.60s system 99% cpu 4.705 total
python -c 4.06s user 0.77s system 99% cpu 4.843 total
python -c 3.52s user 0.63s system 99% cpu 4.159 total
python -c 4.64s user 0.69s system 99% cpu 5.347 total
python -c 4.37s user 0.53s system 99% cpu 4.923 total
Also, another alternative: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
Hi all,
I just discovered dvc as I have started hacking on something similar myself and realized I could not be the first person to think of this. dvc is a very cool project - congratulations!
How difficult would it be to add a config option to switch the hash algorithm? For use cases with moderate data files sizes and/or concerns for attacks, it can definitely be worth some extra CPU time to have a cryptographic hash function. The tool I started hacking on uses sha256 by default.
Thank you for your time!
Hi @rasmuse !
Should not be too hard, to be honest. You'll need to mostly go through dvc/remote/local.py
to support another hash in addition to md5 and, of course, you'll need to add a config option for it, see dvc/config.py
. Let us know if you'd like to tackle this, we will be happy to help! :slightly_smiling_face:
Thanks,
Ruslan
Thanks for the prompt reply @efiop !
I will look into it and get back if/when I encounter any problems.
@rasmuse Sure! Btw, we also have our chat at https://dvc.org/chat
. Please feel free to join and use it too, if you feel like it :slightly_smiling_face: There are both developers and users there.
Closing in favor of https://github.com/iterative/dvc/issues/1676
Most helpful comment
FYI: I added
sha
perf test to my previous comment.I don't see strong reasons for a cryptographic algorithm. A general hashing algo with a reasonable collision rate should be good. I would seriously consider xxHash or sha (since Git uses that) to make DVC faster.
PS: And it is not high priority task to my mind.