When checking out, we do it sequentially, stage after stage, which is why checksum calculations are sequential.
It would be better if multiple stages (which might correspond to large files) could perform checksum calculation at the same time.
@pared
@pared Do you remember the context? Aren't we trusting pulled checksums these days? Or is it about the unpacked dir case?
@efiop @Ykid Sorry for the delay.
Here is the start of our conversation:
https://discordapp.com/channels/485586884165107732/563406153334128681/682124179885260804
@Ykid has a repository with a few big files (A, B, C), each having its own stage (A.dvc, B.dvc, C.dvc). @Ykid noticed that checksum calculations on dvc pull take a lot of time.
So, I see 2 potential problems. One of them is at the out level: what I mean by that is that when checking out here, we do it stage by stage, so we will have A.dvc, then B.dvc, and so on. And as checksum calculation happens down the road of stage.checkout, our checksum calculations will be sequential (because stage checkouts are sequential).
So I would vote for changing the name of the issue to "Parallelize stage checkouts".
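To illustrate what I mean, a rough sketch (simplified, not the actual DVC code; the stages list and jobs parameter are just placeholders) of running stage checkouts in a thread pool instead of one after another:
from concurrent.futures import ThreadPoolExecutor

def checkout_stages(stages, jobs=4):
    # Sketch only: run each stage's checkout in a worker thread so that
    # checksum work for several big outputs can overlap.
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        futures = [executor.submit(stage.checkout) for stage in stages]
        for future in futures:
            future.result()  # re-raise any checkout error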
@Ykid What version of dvc are you using?
@pared
DVC version: 0.86.5
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/nvme0n1p2')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')
@pared Not sure I follow, what hashes are calculated on checkout? For the files we've downloaded? But we should've trusted those and then when checking out we are trusting the cache hashes, right?
@efiop That's right, when pulling we should trust the checksums by default.
@Ykid did your configuration change in any way since we talked on Discord?
It used to be like that:
[cache]
type = symlink
protected = true
[core]
remote = default
checksum_jobs = 24
['remote "default"']
url = /path/to/nfs/mount
@pared I didn't change my config. What's the discrepancy you spotted?
@Ykid with the version that you are using, dvc should not even calculate checksum when pulling. I need to try to reproduce your problem.
@Ykid @efiop Seems like a bug; I can confirm that md5 is calculated when pulling. Reproduction script:
#!/bin/bash
rm -rf repo storage git_repo clone
mkdir repo storage git_repo
main=$(pwd)
pushd git_repo
git init --quiet --bare
popd
pushd repo
git init --quiet
git remote add origin $main/git_repo
dvc init -q
dvc remote add -q -d str $main/storage
git add -A
git commit -am "init"
dd if=/dev/urandom of=data1 bs=1M count=1 &> /dev/null
dvc add -q data1
dd if=/dev/urandom of=data2 bs=1M count=1 &> /dev/null
dvc add -q data2
git add -A
git commit -am "add data"
dvc push -q
git push origin master
popd
git clone git_repo clone
pushd clone
dvc pull
Interestingly, I am unable to write a test reproducing this behaviour:
def test_trust_remote_checksums(tmp_dir, tmp_path_factory, erepo_dir, mocker):
    from dvc.compat import fspath  # assumed import for the fspath helper

    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="add dir")

    from dvc.scm import Git
    Git.clone(fspath(erepo_dir), fspath(tmp_dir))

    from dvc.repo import Repo
    dvc = Repo(fspath(tmp_dir))

    md5_spy = mocker.spy(dvc.cache.local, "get_file_checksum")
    dvc.pull()
    assert md5_spy.call_count == 0
This one passes; I think we should include this task in the next sprint.
Well, my test has not been working due to #2888; I will fix it while working on this issue.
The reason why my bash script results in md5 calculation and the test does not is that erepo_dir has been using @efiop's optimization from #3472: the cache file was protected. Thanks to that, when we were gathering dvc/remote/local._status, cache_exists did not have to calculate the checksum (because the file in cache was read-only). In my bash script, on the other hand, I was using a local remote storage, which does not protect its content by default. Then, when I clone the git repo and pull the data, DVC checks cache_exists -> changed_cache_file, which recalculates md5, because files on the local remote are not protected.
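To make the optimization concrete, here is a hedged sketch (the helper names are made up, this is not the actual DVC implementation) of what "protected" means: a cache file that was made read-only is assumed to be intact, so its md5 does not have to be recomputed.
import os
import stat

def protect(path):
    # Sketch of the idea behind #3472: make the cache file read-only once saved.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def is_protected(path):
    # If no write bits are set, assume the file was not modified and skip
    # recomputing its md5; otherwise the checksum has to be recalculated.
    mode = os.stat(path).st_mode
    return not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))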
So I would say that the issue we have here is how we treat LocalRemote: why do we calculate md5 of obtained files?
When pulling, we check file status, which in turn calls cache_exists. For LocalRemote, we iterate over the desired checksums and call changed_cache_file sequentially.
So, to help with slow checksum calculation, we could introduce parallelization here.
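As a sketch only (the real changed_cache_file has a different signature; this is just the shape of the idea), something along these lines:
from concurrent.futures import ThreadPoolExecutor

def changed_cache_files(remote, checksums, jobs=4):
    # Sketch: verify many cache entries concurrently instead of one by one.
    # remote.changed_cache_file is assumed to return True for a bad/missing entry.
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        changed = list(executor.map(remote.changed_cache_file, checksums))
    return [c for c, is_changed in zip(checksums, changed) if is_changed]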
Another thing is that in the case of other remotes (besides ssh) we do not trigger checksum calculation; it is assumed that the checksum on the remote is the right one. I guess we are not doing that locally so that we can verify no cache corruption has occurred, and that is the reason why we are calculating the checksum.
Problem: changed_cache_file for local uses state; therefore, SQLite is preventing us from parallelizing cache_exists for LocalRemote.
@pared SQLite should be thread-safe and can be used from multiple threads, right?
@pared Does local remote preserve protected mode when uploading? If it doesn't, maybe we should do that and then trust on pull if cache files are still protected?
@shcheklein Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing DB from multiple threads.
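For reference, a standalone snippet (plain Python sqlite3, not DVC's state code) showing the kind of error I mean: by default a connection refuses to be used from a thread other than the one that created it, unless it is opened with check_same_thread=False.
import sqlite3
import threading

conn = sqlite3.connect("state.db")  # check_same_thread=True by default

def query():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        # "SQLite objects created in a thread can only be used in that same thread ..."
        print(exc)

thread = threading.Thread(target=query)
thread.start()
thread.join()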
@efiop Local remote does not preserve the mode. I think that could be the proper way to solve it: utilize the cache optimization and trust a protected local remote. I will prepare a PR for that.
Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing DB from multiple threads.
To be honest, I don't know. We need to check the docs and do some research. Of course, if it's a bottleneck for solving this particular issue.
@Ykid Sorry it took so long; you should no longer experience file checksum recalculation for the local remote.
As to parallelization, this issue needs to be researched further.
State not working in parallel mode is a well-known thing; it is just because of the way it is currently set up. We will be looking at hash optimizations separately. Closing this issue for now, since it is mitigated by #3858.
For the record: this was probably caused by https://github.com/iterative/dvc/issues/4485 , preparing a fix. CC @pared
@efiop I am not sure if they are related. This one was related to cloning and pulling the project, not import.