DVC: Add parallelism to checksum calculation for pulling multiple large files

Created on 27 Feb 2020 · 22 Comments · Source: iterative/dvc

When checking out, we do it sequentially, stage after stage; that is why checksum calculations are sequential.

It would be better if multiple stages (which might correspond to large files) could perform checksum calculation at the same time.
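For illustration, a minimal sketch of the behaviour being asked for, not DVC's implementation: hash several large files in a thread pool, reading each one in chunks. The file names and worker count below are placeholders.

# Illustrative sketch only -- not DVC's code. File names and the worker
# count are placeholders.
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_md5(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large files do not need to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def parallel_md5(paths, jobs=4):
    """Hash several files concurrently instead of one stage at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(file_md5, paths)))


print(parallel_md5(["A", "B", "C"]))  # hypothetical big files A, B, C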

bug p1-important research

All 22 comments

@pared

@pared Do you remember the context? Aren't we trusting pulled checksums these days? Or is it about unpacked dir case?

@efiop @Ykid Sorry for the delay,
Here is the start of our conversation:
https://discordapp.com/channels/485586884165107732/563406153334128681/682124179885260804

@Ykid has a repository with a few big files (A, B, C), each having its own stage (A.dvc, B.dvc, C.dvc). @Ykid noticed that checksum calculations on dvc pull take a lot of time.

So, I see 2 potential problems:

  1. Files were on NFS, so reading the files and feeding them to the hash function might be taking a lot of time.
  2. We do parallelize checksum computations, but at the out level. What I mean by that is that when checking out we do it stage by stage, so we will have:
     - checkout A.dvc
     - checkout B.dvc
     - and so on.

And as checksum calculation happens down the road of stage.checkout, our checksum calculations will be sequential (because stage checkouts are sequential).

So I would vote for changing the name of the issue to "Parallelize stage checkouts".

@Ykid
What version of dvc are you using?

@pared

DVC version: 0.86.5
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/nvme0n1p2')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

@pared Not sure I follow: what hashes are calculated on checkout? For the files we've downloaded? But we should've trusted those, and then when checking out we are trusting the cache hashes, right?

@efiop That's right, when pulling we should trust the checksums by default.
@Ykid did your configuration change in any way since we talked on Discord?
It used to be like that:

[cache]
    type = symlink
    protected = true
[core]
    remote = default
    checksum_jobs = 24
['remote "default"']
    url = /path/to/nfs/mount

@pared I didn't change my config. What's the discrepancy you spotted?

@Ykid with the version that you are using, dvc should not even calculate checksums when pulling. I need to try to reproduce your problem.

@Ykid @efiop seems like a bug,
I can confirm that when pulling, md5 is calculated. Reproduction script:

#!/bin/bash

rm -rf repo storage git_repo clone
mkdir repo storage git_repo

main=$(pwd)

pushd git_repo
git init --quiet --bare
popd

pushd repo

git init --quiet
git remote add origin $main/git_repo
dvc init -q

dvc remote add -q -d str $main/storage 

git add -A
git commit -am "init"

dd if=/dev/urandom of=data1 bs=1M count=1 &> /dev/null
dvc add -q data1

dd if=/dev/urandom of=data2 bs=1M count=1 &> /dev/null
dvc add -q data2

git add -A
git commit -am "add data"
dvc push -q
git push origin master
popd

git clone git_repo clone
pushd clone
dvc pull

Interestingly, I am unable to write a test reproducing this behaviour:

def test_trust_remote_checksums(tmp_dir, tmp_path_factory, erepo_dir, mocker):
    from os import fspath

    from dvc.repo import Repo
    from dvc.scm import Git

    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="add file")

    Git.clone(fspath(erepo_dir), fspath(tmp_dir))

    dvc = Repo(fspath(tmp_dir))

    # pull should trust remote checksums, so no md5 should be recomputed
    md5_spy = mocker.spy(dvc.cache.local, "get_file_checksum")

    dvc.pull()

    assert md5_spy.call_count == 0

This one passes; I think we should include this task in the next sprint.

Well, my test has not been working due to #2888; I will fix it during this issue.
The reason why my bash script results in md5 calculation while the test does not is that erepo_dir has been using @efiop's optimization from #3472: the cache file was protected. Thanks to that, when gathering dvc/remote/local._status, cache_exists did not have to calculate the checksum (because the file in the cache was read-only). In my bash script, on the other hand, I was using a remote storage (a local one), which does not protect its content by default. Then, when I clone the git repo and pull the data, DVC checks cache_exists -> changed_cache_file, which recalculates the md5, because files on a local remote are not protected.
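To make the "protected" optimization concrete, here is a rough sketch of the idea described above, assuming the read-only convention: a cache file with no write bits set can be trusted without rehashing, while a writable one has to be verified. The functions below are illustrative stand-ins, not DVC's actual code (changed_cache_file is just the name used in the comment above).

# Rough sketch of the protection-based trust described above; not DVC code.
import hashlib
import os
import stat


def is_protected(path):
    # A file with no write bits set is treated as "protected" (read-only).
    mode = os.stat(path).st_mode
    return not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def changed_cache_file(path, expected_md5):
    # Trust a protected cache file without rehashing; otherwise verify it.
    if is_protected(path):
        return False
    with open(path, "rb") as fobj:  # simplified: reads whole file at once
        actual = hashlib.md5(fobj.read()).hexdigest()
    return actual != expected_md5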

So I would say that the issue we have here is how we treat LocalRemote: why do we calculate the md5 of obtained files?
When pulling we check file status, which in turn calls cache_exists. For LocalRemote, we iterate over the desired checksums and call changed_cache_file in a sequential manner.

So, to help with slow checksum calculation, we could introduce parallelization here.
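A minimal sketch of that suggestion, assuming a per-checksum check like the changed_cache_file above; cache_exists_parallel, checksum_to_path, and the job count are illustrative names, not DVC's API.

# Illustrative sketch of parallelizing the per-checksum cache check.
from concurrent.futures import ThreadPoolExecutor


def cache_exists_parallel(checksums, checksum_to_path, changed_cache_file, jobs=8):
    # checksum_to_path and changed_cache_file are placeholders for the
    # LocalRemote internals discussed above.
    def exists(md5):
        path = checksum_to_path(md5)
        return md5 if not changed_cache_file(path, md5) else None

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return [md5 for md5 in pool.map(exists, checksums) if md5]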

Another thing: in the case of other remotes (besides ssh) we do not trigger checksum calculation at all; it is assumed that the checksum on the remote is the right one. I guess we don't do the same for the local remote in order to verify that no cache corruption has occurred, and this is why we are calculating the checksum.

Problem: changed_cache_file for local uses state, and SQLite is therefore preventing us from parallelizing cache_exists for LocalRemote.

@pared SQLite should be thread-safe and can be used from multiple threads, no?

@pared Does local remote preserve protected mode when uploading? If it doesn't, maybe we should do that and then trust on pull if cache files are still protected?

@shcheklein Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing DB from multiple threads.

@efiop Local remote does not preserve the mode. I think that could be a proper way to solve it: utilize the cache optimization and trust a protected local remote. I will prepare a PR for that.
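For illustration only (a sketch, not the actual PR): the approach proposed here might look roughly like making the uploaded copy read-only, so the protection-based trust described earlier applies when the file is later pulled. upload_protected is a hypothetical helper name.

# Sketch of the proposed approach: protect files when uploading to a
# local remote so they can be trusted on pull. Not the actual PR.
import os
import shutil
import stat


def upload_protected(src, dst):
    shutil.copyfile(src, dst)
    # Drop all write bits so the remote copy counts as "protected".
    os.chmod(dst, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)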

> Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing DB from multiple threads.

To be honest, I don't know. We need to check the docs and do some research. Of course, if it's a bottleneck for solving this particular issue.
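For reference, Python's sqlite3 module raises an error by default when a connection created in one thread is used from another; sharing a connection across threads requires opting out of that check and serializing access yourself. A minimal sketch of the stdlib behaviour being discussed, not DVC's State code (the table and helper below are made up):

# Minimal sketch of sharing one SQLite connection across threads.
import sqlite3
import threading

# With the default check_same_thread=True, using the connection from
# another thread raises ProgrammingError. Opting out shifts the burden
# of serializing access onto the caller.
conn = sqlite3.connect("state.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS state (path TEXT PRIMARY KEY, md5 TEXT)")
lock = threading.Lock()


def record_checksum(path, md5):
    with lock:  # serialize writes ourselves
        conn.execute(
            "INSERT OR REPLACE INTO state (path, md5) VALUES (?, ?)", (path, md5)
        )
        conn.commit()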

@Ykid sorry it took so long; you should not experience file checksum recalculation for a local remote anymore.
As for parallelization, this issue needs to be researched further.

State not working in parallel mode is a well-known thing; it is just because of the way it is currently set up. We will be looking at hash optimizations separately. Closing this issue for now, since it is mitigated by #3858.

For the record: this was probably caused by https://github.com/iterative/dvc/issues/4485 , preparing a fix. CC @pared

@efiop I am not sure if they are related. This one was about cloning and pulling the project, not importing.

