Dvc: Faster checksum

Created on 2 Feb 2019  ·  20Comments  ·  Source: iterative/dvc

Checksum computation for very big files could probably be made faster:

  1. By using a faster hashing algorithm (xxHash for instance)
  2. By compiling the function using cython
enhancement

Most helpful comment

FYI: I added sha perf test to my previous comment.

I don't see strong reasons for a cryptographic algorithm. A general hashing algo with a reasonable collision rate should be good. I would seriously consider xxHash or sha (since Git uses that) to make DVC faster.

PS: And it is not high priority task to my mind.

All 20 comments

It is a great idea. I have not done any perf tests but based on my understanding the bottleneck will be I\O, not CPU\algorithm.

It might help actually. ~3x difference in my laptop SSD and 1Gb file comparing to simple algorithms such as crc32 and line count.

$ du -sh stackoverflow_small_xml/Votes.xml
1.0G    stackoverflow_small_xml/Votes.xml

$ time wc -l stackoverflow_small_xml/Votes.xml
 11672020 stackoverflow_small_xml/Votes.xml

real    0m0.865s
user    0m0.690s
sys 0m0.173s

$ time crc32 stackoverflow_small_xml/Votes.xml
829c3e66

real    0m0.708s
user    0m0.375s
sys 0m0.323s

$ time md5 stackoverflow_small_xml/Votes.xml
MD5 (stackoverflow_small_xml/Votes.xml) = 3c62472bd006701211c4453c26e87432

real    0m2.564s
user    0m2.536s
sys 0m0.273s

$ time shasum stackoverflow_small_xml/Votes.xml
shasum: stackoverflow_small_xml/Votes.xml:

real    0m2.897s
user    0m2.557s
sys 0m0.335s

PS: I listed median numbers after 3 runs for each of the commands.

UPDATED: sha added

Specs:

CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz (8 processors - 4 cores)
Memory: 8G
Disk: HD

MD5:

❯ for x in {1..5}; do time python -c 'import hashlib; hashlib.md5(open("sample.txt", "rb").read()).hexdigest()'; done
python -c   1.59s user 0.48s system 99% cpu 2.079 total
python -c   1.62s user 0.49s system 99% cpu 2.113 total
python -c   1.59s user 0.50s system 99% cpu 2.100 total
python -c   1.59s user 0.48s system 99% cpu 2.075 total
python -c   1.64s user 0.48s system 99% cpu 2.127 total

SHA-1:

❯ for x in {1..5}; do time python -c 'import hashlib; hashlib.sha1(open("sample.txt", "rb").read()).hexdigest()'; done
python -c   1.15s user 0.48s system 99% cpu 1.636 total
python -c   1.16s user 0.49s system 99% cpu 1.655 total
python -c   1.13s user 0.49s system 99% cpu 1.622 total
python -c   1.15s user 0.49s system 99% cpu 1.646 total
python -c   1.21s user 0.55s system 99% cpu 1.762 total

xxHash:

❯ for x in {1..5}; do time python -c 'import xxhash; xxhash.xxh64(open("sample.txt", "rb").read()).hexdigest()'; done
python -c   0.15s user 0.48s system 99% cpu 0.631 total
python -c   0.14s user 0.51s system 99% cpu 0.648 total
python -c   0.14s user 0.48s system 99% cpu 0.621 total
python -c   0.13s user 0.48s system 99% cpu 0.620 total
python -c   0.14s user 0.48s system 99% cpu 0.623 total

:zap: It is indeed fast, I couldn't find any information about the risk of collision, tho

  • What would be the downsides of using a non-cryptographic algorithm?
  • Is there a strong reason we need a cryptographic sound algorithm?

Unintended collisions would be really unfortunate, but I don't know if we can raise the probabilities in favor of speed

_Btw. MD5 is 128 bits, xxHash is 64 bits_

For reference: https://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/

@dmpetrov, @ivan, @efiop, @pared, what are your thoughts on this one? I'm interested in how this could benefit dvc :thinking:

FYI: I added sha perf test to my previous comment.

I don't see strong reasons for a cryptographic algorithm. A general hashing algo with a reasonable collision rate should be good. I would seriously consider xxHash or sha (since Git uses that) to make DVC faster.

PS: And it is not high priority task to my mind.

@mroutis what was sample.txt file size in your tests?

@dmpetrov , I forgot to add it:

dd if=/dev/urandom of=sample.txt count=1 bs=1G iflag=fullblock

It may not be the best scenario, tho... as a lot of data is structured.

I've done another perf comparison md5 vs sha1 with 10Gb file:

  1. In python hashlib lib sha1 implementation is almost 10% faster than md5.
  2. In default command line tool (shasum and md5) the result is opposite - md5 is ~13% faster.

The python implementation (1) is more relevant for DVC. We can say that DVC will get 10% improvement by moving to sha1.

Another thing that @mroutis brought - cryptographic algorithm vs not crypto. It seems like cryptographic algorithm is important for tools like Git and DVC because a single collision can break the entire repository. And you might heart that today md5 algo is not considered as as cryptographic algo due to some issues - https://security.stackexchange.com/questions/11839/what-is-the-difference-between-a-hash-function-and-a-cryptographic-hash-function

My takeaways: if we decide to change the algo I'd consider only cryptographic algos. sha1 might be a perfect option since the python implementation is even faster that the md5 implementation that we curently use.

Interesting fact: SHA-1 is also broken

@mhham , it looks like sha256 is quite slow :disappointed:

Tested with the same file as below:

dd if=/dev/urandom of=sample.txt count=1 bs=1G iflag=fullblock
❯ for x in {1..5}; do time python -c 'import hashlib; hashlib.sha256(open("sample.txt", "rb").read()).hexdigest()'; done
python -c   4.09s user 0.60s system 99% cpu 4.705 total
python -c   4.06s user 0.77s system 99% cpu 4.843 total
python -c   3.52s user 0.63s system 99% cpu 4.159 total
python -c   4.64s user 0.69s system 99% cpu 5.347 total
python -c   4.37s user 0.53s system 99% cpu 4.923 total

Hi all,

I just discovered dvc as I have started hacking on something similar myself and realized I could not be the first person to think of this. dvc is a very cool project - congratulations!

How difficult would it be to add a config option to switch the hash algorithm? For use cases with moderate data files sizes and/or concerns for attacks, it can definitely be worth some extra CPU time to have a cryptographic hash function. The tool I started hacking on uses sha256 by default.

Thank you for your time!

Hi @rasmuse !

Should not be too hard, to be honest. You'll need to mostly go through dvc/remote/local.py to support another hash in addition to md5 and, of course, you'll need to add a config option for it, see dvc/config.py. Let us know if you'd like to tackle this, we will be happy to help! :slightly_smiling_face:

Thanks,
Ruslan

Thanks for the prompt reply @efiop !

I will look into it and get back if/when I encounter any problems.

@rasmuse Sure! Btw, we also have our chat at https://dvc.org/chat. Please feel free to join and use it too, if you feel like it :slightly_smiling_face: There are both developers and users there.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

analystanand picture analystanand  ·  3Comments

anotherbugmaster picture anotherbugmaster  ·  3Comments

dnabanita7 picture dnabanita7  ·  3Comments

shcheklein picture shcheklein  ·  3Comments

TezRomacH picture TezRomacH  ·  3Comments