tdeboissiere on Discord reports that dvc pull (which didn't download anything, so dvc checkout is the culprit) takes 111s on 0.20.0 but 160s on 0.20.3. We need to investigate whether we have a regression in checkout performance.
@tdeboissiere I can't confirm this so far; my tests show about the same performance in 0.20.0 and 0.20.3. Could you please try dvc pull once again with 0.20.0 and the current version, but under cProfile? I.e. could you please run:
python -m cProfile -s cumulative $(which dvc) checkout &> checkout1.log
python -m cProfile -s cumulative $(which dvc) checkout &> checkout2.log
python -m cProfile -s cumulative $(which dvc) checkout &> checkout3.log
for both versions and provide all the logs (6 in total)? We need 3 logs instead of 1 because we use some lazy calculations, so timings can differ between runs.
Thanks,
Ruslan
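For anyone who wants the same cumulative-time view from inside Python rather than from the command line, here is a minimal sketch using the standard cProfile and pstats modules. It profiles a function three times in a row, mirroring the three-log request above. Note that fake_checkout and profile_call are hypothetical stand-ins made up for illustration, not DVC code; the cache in fake_checkout only imitates the kind of lazy calculation that can make the first run slower than later ones.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (total_seconds, report_text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Same ordering as `python -m cProfile -s cumulative`.
    stats.sort_stats("cumulative").print_stats(top)
    return stats.total_tt, buf.getvalue()


_cache = {}


def fake_checkout(n):
    # Hypothetical stand-in for real work: the first call computes the
    # result, later calls reuse the cached value (a "lazy" calculation).
    if n not in _cache:
        _cache[n] = sum(i * i for i in range(n))
    return _cache[n]


if __name__ == "__main__":
    for run in range(1, 4):
        total, report = profile_call(fake_checkout, 200_000)
        print(f"run {run}: {total:.4f}s")
```

Comparing the per-run totals, and the top entries of each report, between two versions is usually enough to see where the extra time goes.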
I just ran the commands on one of my datasets and this time the timings are very similar.
The only difference is that I've used dvc checkout dvcfile today. When I reported the possible bug, I had a more complex pipeline where I wrapped several dvc commands with a python script.
I will investigate some more and keep you posted.
@efiop Here are the logs that reproduce the issue.
dvc 0.20.0 is consistently faster than dvc 0.20.3 (note that the command I launched was dvc pull --with-deps, not dvc checkout --with-deps since the latter is not available in dvc 0.20.0). I did make sure that the data was already in my local cache before launching the commands though.
This is a stage of my pipeline where I use a dummy master dvc file to tie in several dvc commands
@tdeboissiere Thanks! I'll take a look shortly.
@efiop I have a short deadline to deliver a model and need to free space on my SSD. Since dvc gc is taking too long (after more than 12 hours it still hasn't completed), would it be okay to manually remove the .dvc/cache/ folder? I know this means I would have to re-execute the entire pipeline, but that would probably take only a few hours.
I've already tried to downgrade to 0.20.0 but it still takes hours for the garbage collector to complete its work.
@IamGianluca Whoa, that is a lot of time :slightly_frowning_face: Note that if you remove .dvc/cache, you will lose the cache for your pipeline inputs as well, so you won't be able to reproduce the pipeline as is; you would have to manually place the input data back into your workspace. If you know where to find that data, you could indeed rm -rf .dvc/cache for now. Otherwise you need to at least back up that input data somewhere, either by manually copying it, or by creating a directory, making it a dvc remote and pushing data there (either dvc push to back up all currently used cache, or dvc push data.dvc for all input data).
I'm investigating right now. Thank you for your patience guys.
Thank you guys for your patience. For add/checkout, it turned out that we were not short-circuiting log messages early enough in non-verbose modes, which caused these delays (fixing it gives up to a 50% speedup on a test with a directory of 100K files). This was the cause of the regression after 0.20.0, since we've added more debug messages since then. Will release a new version ASAP. #1331 is still relevant, since we could improve the performance much more. The gc issue described by @IamGianluca is a separate one, so I created https://github.com/iterative/dvc/issues/1429 to track it. Actively working on optimizations right now.
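To illustrate what short-circuiting log messages means in general, here is a minimal sketch using Python's standard logging module. It is not the actual DVC patch, and the names (describe, checkout_unguarded, checkout_guarded, the "checkout_sketch" logger) are hypothetical. The unguarded version builds every debug message even when debug output is disabled. The guarded version checks the level once and skips the work entirely.

```python
import logging

logger = logging.getLogger("checkout_sketch")
logger.addHandler(logging.NullHandler())  # keep the sketch silent

# Counts how often the (hypothetically expensive) message is built.
calls = {"describe": 0}


def describe(path):
    # Stand-in for costly message formatting, e.g. stat-ing a file.
    calls["describe"] += 1
    return f"{path} ({len(path)} chars)"


def checkout_unguarded(paths):
    for path in paths:
        # The message is built on every iteration, even if DEBUG is off
        # and the logger then throws it away.
        logger.debug("checking out " + describe(path))


def checkout_guarded(paths):
    # Check the level once, before the loop, and skip message building
    # entirely in non-verbose modes.
    debug_on = logger.isEnabledFor(logging.DEBUG)
    for path in paths:
        if debug_on:
            logger.debug("checking out %s", describe(path))
```

With 100K files, the cost of the unguarded variant is paid 100K times per command, which is why adding a few more debug messages can show up as a measurable slowdown.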
@IamGianluca can you say more about your situation, the one where dvc gc ran for hours? What are your files, and are they added as a directory or separately? Do you use local data and a local cache, or are some remotes involved? What is the exact command you are running that is slow?