DVC: Add parallelism to checksum calculation for pulling multiple large files

Created on 27 Feb 2020 · 22 Comments · Source: iterative/dvc

When checking out, we do it sequentially, stage after stage; that is why checksum calculations are sequential.

It would be better if multiple stages (which might correspond to large files) could perform checksum calculation at the same time.
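For illustration, a minimal sketch of the behaviour being asked for, not DVC's implementation: hash several large files in a thread pool, reading each one in chunks. The file names and worker count below are placeholders.

# Illustrative sketch only -- not DVC's code. File names and the worker
# count are placeholders.
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_md5(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large files do not need to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def parallel_md5(paths, jobs=4):
    """Hash several files concurrently instead of one stage at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(file_md5, paths)))


print(parallel_md5(["A", "B", "C"]))  # hypothetical big files A, B, C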

bug p1-important research

All 22 comments

@pared

@pared Do you remember the context? Aren't we trusting pulled checksums these days? Or is it about unpacked dir case?

@efiop @Ykid Sorry for the delay,
Here is the start of our conversation:
https://discordapp.com/channels/485586884165107732/563406153334128681/682124179885260804

@Ykid has a repository with a few big files (A, B, C), each having its own stage (A.dvc, B.dvc, C.dvc). @Ykid noticed that checksum calculations on dvc pull take a lot of time.

So, I see 2 potential problems:

  1. Files were on NFS, so reading the files and feeding them to the hash function might be taking a lot of time.
  2. We do parallelize checksum computations, but at the out level. What I mean by that is that when checking out we do it stage by stage, so we will have:
     - checkout A.dvc
     - checkout B.dvc
     - and so on.

And as checksum calculation happens down the road of stage.checkout, our checksum calculations will be sequential (because stage checkouts are sequential).

So I would vote for changing the name of the issue to "Parallelize stage checkouts".

@Ykid
What version of dvc are you using?

@pared

DVC version: 0.86.5
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/nvme0n1p2')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

@pared Not sure I follow: what hashes are calculated on checkout? For the files we've downloaded? But we should've trusted those, and then when checking out we are trusting the cache hashes, right?

@efiop That's right, when pulling we should trust the checksums by default.
@Ykid did your configuration change in any way since we talked on Discord?
It used to be like that:

[cache]
    type = symlink
    protected = true
[core]
    remote = default
    checksum_jobs = 24
['remote "default"']
    url = /path/to/nfs/mount

@pared I didn't change my config. What's the discrepancy you spotted?

@Ykid with the version that you are using, dvc should not even calculate checksums when pulling. I need to try to reproduce your problem.

@Ykid @efiop seems like a bug,
I can confirm that when pulling, md5 is calculated. Reproduction script:

#!/bin/bash

rm -rf repo storage git_repo clone
mkdir repo storage git_repo

main=$(pwd)

pushd git_repo
git init --quiet --bare
popd

pushd repo

git init --quiet
git remote add origin $main/git_repo
dvc init -q

dvc remote add -q -d str $main/storage 

git add -A
git commit -am "init"

dd if=/dev/urandom of=data1 bs=1M count=1 &> /dev/null
dvc add -q data1

dd if=/dev/urandom of=data2 bs=1M count=1 &> /dev/null
dvc add -q data2

git add -A
git commit -am "add data"
dvc push -q
git push origin master
popd

git clone git_repo clone
pushd clone
dvc pull

Interestingly, I am unable to write a test reproducing this behaviour:

def test_trust_remote_checksums(tmp_dir, tmp_path_factory, erepo_dir, mocker):
    from os import fspath

    from dvc.repo import Repo
    from dvc.scm import Git

    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="add file")

    Git.clone(fspath(erepo_dir), fspath(tmp_dir))

    dvc = Repo(fspath(tmp_dir))

    # pull should trust remote checksums, so no md5 should be recomputed
    md5_spy = mocker.spy(dvc.cache.local, "get_file_checksum")

    dvc.pull()

    assert md5_spy.call_count == 0

This one passes; I think we should include this task in the next sprint.

Well, my test has not been working due to #2888; I will fix it during this issue.
The reason why my bash script results in md5 calculation while the test does not is that erepo_dir has been using @efiop's optimization from #3472: the cache file was protected. Thanks to that, when gathering dvc/remote/local._status, cache_exists did not have to calculate the checksum (because the file in the cache was read-only). In my bash script, on the other hand, I was using a remote storage (a local one), which does not protect its content by default. Then, when I clone the git repo and pull the data, DVC checks cache_exists -> changed_cache_file, which recalculates the md5, because files on a local remote are not protected.
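To make the "protected" optimization concrete, here is a rough sketch of the idea described above, assuming the read-only convention: a cache file with no write bits set can be trusted without rehashing, while a writable one has to be verified. The functions below are illustrative stand-ins, not DVC's actual code (changed_cache_file is just the name used in the comment above).

# Rough sketch of the protection-based trust described above; not DVC code.
import hashlib
import os
import stat


def is_protected(path):
    # A file with no write bits set is treated as "protected" (read-only).
    mode = os.stat(path).st_mode
    return not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def changed_cache_file(path, expected_md5):
    # Trust a protected cache file without rehashing; otherwise verify it.
    if is_protected(path):
        return False
    with open(path, "rb") as fobj:  # simplified: reads whole file at once
        actual = hashlib.md5(fobj.read()).hexdigest()
    return actual != expected_md5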

So I would say that the issue we have here is how we treat LocalRemote: why do we calculate the md5 of obtained files?
When pulling we check file status, which in turn calls cache_exists. For LocalRemote, we iterate over the desired checksums and call changed_cache_file in a sequential manner.

So, to help with slow checksum calculation, we could introduce parallelization here.
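A minimal sketch of that suggestion, assuming a per-checksum check like the changed_cache_file above; cache_exists_parallel, checksum_to_path, and the job count are illustrative names, not DVC's API.

# Illustrative sketch of parallelizing the per-checksum cache check.
from concurrent.futures import ThreadPoolExecutor


def cache_exists_parallel(checksums, checksum_to_path, changed_cache_file, jobs=8):
    # checksum_to_path and changed_cache_file are placeholders for the
    # LocalRemote internals discussed above.
    def exists(md5):
        path = checksum_to_path(md5)
        return md5 if not changed_cache_file(path, md5) else None

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return [md5 for md5 in pool.map(exists, checksums) if md5]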

Another thing: in the case of other remotes (besides ssh) we do not trigger checksum calculation at all; it is assumed that the checksum on the remote is the right one. I guess we don't do the same for the local remote in order to verify that no cache corruption has occurred, and this is why we are calculating the checksum.

Problem: changed_cache_file for local uses state, and SQLite is therefore preventing us from parallelizing cache_exists for LocalRemote.

@pared SQLite should be thread-safe and can be used from multiple threads, no?

@pared Does local remote preserve protected mode when uploading? If it doesn't, maybe we should do that and then trust on pull if cache files are still protected?

@shcheklein Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing DB from multiple threads.

@efiop Local remote does not preserve the mode. I think that could be a proper way to solve it: utilize the cache optimization and trust a protected local remote. I will prepare a PR for that.
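For illustration only (a sketch, not the actual PR): the approach proposed here might look roughly like making the uploaded copy read-only, so the protection-based trust described earlier applies when the file is later pulled. upload_protected is a hypothetical helper name.

# Sketch of the proposed approach: protect files when uploading to a
# local remote so they can be trusted on pull. Not the actual PR.
import os
import shutil
import stat


def upload_protected(src, dst):
    shutil.copyfile(src, dst)
    # Drop all write bits so the remote copy counts as "protected".
    os.chmod(dst, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)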

> Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing DB from multiple threads.

To be honest, I don't know. We need to check the docs and do some research. Of course, if it's a bottleneck for solving this particular issue.
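For reference, Python's sqlite3 module raises an error by default when a connection created in one thread is used from another; sharing a connection across threads requires opting out of that check and serializing access yourself. A minimal sketch of the stdlib behaviour being discussed, not DVC's State code (the table and helper below are made up):

# Minimal sketch of sharing one SQLite connection across threads.
import sqlite3
import threading

# With the default check_same_thread=True, using the connection from
# another thread raises ProgrammingError. Opting out shifts the burden
# of serializing access onto the caller.
conn = sqlite3.connect("state.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS state (path TEXT PRIMARY KEY, md5 TEXT)")
lock = threading.Lock()


def record_checksum(path, md5):
    with lock:  # serialize writes ourselves
        conn.execute(
            "INSERT OR REPLACE INTO state (path, md5) VALUES (?, ?)", (path, md5)
        )
        conn.commit()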

@Ykid sorry it took so long; you should not experience file checksum recalculation for a local remote anymore.
As for parallelization, this issue needs to be researched further.

State not working in parallel mode is a well-known thing; it is just because of the way it is currently set up. We will be looking at hash optimizations separately. Closing this issue for now, since it is mitigated by #3858.

For the record: this was probably caused by https://github.com/iterative/dvc/issues/4485 , preparing a fix. CC @pared

@efiop I am not sure if they are related. This one was about cloning and pulling the project, not importing.

