DVC: Pull extremely slow on ~400GB of data with hot DVC cache

Created on 30 Jan 2020 · 37 comments · Source: iterative/dvc

OS: Docker image based off of tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.

Initial setup:

# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6

DVC cache configuration:

# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"

Please note that the DVC cache is hot. In other words, most, if not all, of the files dvc fetch needs are already present and up to date at /ssd/....

Make a fresh clone and profile dvc pull:

# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...

This dvc pull, uninstrumented, usually takes 40+ minutes with a hot DVC cache.

Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:

# find . -type l | wc -l
29003
# du -shL .
403G    .

Looking at the dvc_pull.prof (in KCachegrind) suggests that the bottleneck is the checksum process. The file_md5 and dos2unix functions in utils/__init__.py appear particularly costly.

Is this a known issue? Would the primary authors of DVC entertain a more performant version of file_md5 (perhaps written in C/C++ and without TQDM integration)?
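
For reference, a plain chunked MD5 (roughly what md5sum does) looks like the sketch below. This is not DVC's file_md5; the function name and the 1 MB chunk size are just illustrative, but it gives a baseline to compare the profiled code against.

import hashlib

def plain_file_md5(path, chunk_size=1024 * 1024):
    # Baseline streaming MD5: one pass over the file, no text/binary
    # detection and no per-chunk line-ending rewriting.
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()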

[Attachment: dvc_pull_prof.zip]

[Attachment: Screenshot_2020-01-30_14-02-20 (KCachegrind view of the profile)]

Labels: bug, p1-important, research

All 37 comments

Looking into this further, it appears DVC is wrongly treating binary files as text files (via istextfile.py). This leads to the huge number of dos2unix calls mentioned above. In my case, the files have large text headers followed by binary data.

Is this a known issue?

It was a known issue, but I didn't follow up on it :disappointed:

Take a look at https://github.com/iterative/dvc/issues/1970#issuecomment-491154993

In my case, the files have large text headers followed by binary data.

Ouch, our heuristic is similar to git's and only checks the first N bytes to determine whether a file is binary or text. Related: https://github.com/iterative/dvc/issues/992 .
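
For context, that heuristic is roughly of the following shape. This is a simplified sketch, not the actual istextfile.py code; the 8 KB sample size, the printable-byte set, and the 30% threshold here are illustrative assumptions.

def looks_like_text(path, sample_size=8192):
    # Git-style check: sample only the start of the file and look for NUL
    # bytes or a high ratio of non-printable characters. A binary file with
    # a long text header passes this check and gets (mis)classified as text.
    with open(path, "rb") as fobj:
        sample = fobj.read(sample_size)
    if not sample:
        return True
    if b"\x00" in sample:
        return False
    printable = bytes(range(0x20, 0x7F)) + b"\n\r\t\f\b"
    nonprintable = sum(byte not in printable for byte in sample)
    return nonprintable / len(sample) < 0.30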

That being said, the current dvc pull should trust hashes downloaded from the remote, so I wonder if there is a bug here somewhere. Need to check; it might be related to the unpacked dir or something. @pared, maybe you have some thoughts about it?

@efiop seems like a bug to me, dos2unix should not be called at all as it's invoked only by file_md5, which supposedly should not happen after https://github.com/iterative/dvc/pull/3200. Looking into that.

Ok, got it, I made a mistake in the implementation:

  • trusting the remote cache occurs only when we are actually pulling (downloading) from it
  • when we already have the cache (@kevlar1818 has an up-to-date cache) we do not trigger any file downloads, and it was the download step that inserted the downloaded files' checksums into the database; I need to fix #3200 to make sure the proper checksums are also inserted when we already have the cache

[EDIT]
Actually, that is not so easy: calculating checksums of cache files is part of verifying that the cache is not corrupted, so we need to take that into account.

@kevlar1818 do you set up a fresh docker container each time you use this repo, or is it a development environment that persists?
If the latter, the long checkout should not occur again; if the former, it will be a recurring issue.

I think we should consider trusting the cache, or providing an option that allows the user to do that.

@kevlar1818 please rerun pull and tell us whether it also takes this long.

@pared Rerunning pull on the same repo (not a new clone) seems to produce the same runtime as the first pull.

The hope is that the repo can be cloned in a new docker container whenever a developer wants to work on the project, so I'd put our use case in the first category you describe above. That being said, all developers share the /ssd cache.

@kevlar1818 And could you show prof results for that one too, please?

@efiop I'll have to re-run with profiling enabled on Monday.

@efiop @pared @mroutis Please see above for the requested profiling of a second dvc pull

@kevlar1818 One more thing, could you please show dvc version output?

@efiop

# dvc version
DVC version: 0.82.6
Python version: 3.7.5
Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip

@kevlar1818 Oh, sorry, I forgot to mention that it would be great to pip install psutil first and then run dvc version inside the repository; that way it will include filesystem info that might be useful here. Could you please do that and show the output once again? Sorry for the inconvenience.

@efiop

# pip install psutil
...
Installing collected packages: psutil
Successfully installed psutil-5.6.7
# dvc version
DVC version: 0.82.6
Python version: 3.7.5
Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip
Filesystem type (workspace): ('nfs', '...')

Thanks @kevlar1818 !

Btw, looks like one more line is missing from that output. The one for cache.

But now that we see it is nfs, I suspect that is probably the cause here. Need to check, maybe inode handling is broken on nfs or something like that. Could you please show the full log of dvc pull -v?

@kevlar1818 Also, what is the remote that you are dvc pull-ing from? I don't see it in the config.

@kevlar1818 Btw, what we could also do here is have a video call together and debug on the spot. It should be much faster for everyone. :slightly_smiling_face: Would that work for you? If it would, please ping me through the email (see it in my github profile) and we'll arrange something ASAP.

Btw, looks like one more line is missing from that output. The one for cache.

@efiop That was the entire output of dvc version. As I noted in my initial bug report, my cache is located at /ssd/... which is a locally mounted ext4 FS.

Also, what is the remote that you are dvc pull-ing from? I don't see it in the config.

The remote is a MinIO (S3) instance on the same LAN.

@efiop As for a video call, let me get back to you on that. I'd have to carve out some time at work.

I have a similar issue, even on recent DVC (after https://github.com/iterative/dvc/pull/3200 was merged).

  • fresh git pull
  • dvc cache is a shared location
  • dvc pull takes ages doing md5 checksums, even though all the data _is_ in the cache and, as far as I can tell, it doesn't need to get anything from the actual remote
  • subsequent dvc pull calls don't take long at all (seconds rather than several minutes)

In my use case, I'd happily make the cache read-only if that offers additional guarantees against corruption, as this is a CI environment and nothing is expected to alter the data.

btw from the OP

perhaps written in C/C++ and without TQDM integration

from what I can see in the profiler results, tqdm is relatively negligible but let me know if it's still considered unacceptably high.

(related: #3060)

from what I can see in the profiler results, tqdm is relatively negligible but let me know if it's still considered unacceptably high.

@casperdcl You're right. At first I was worried about Python's overhead given the sheer number of function calls apparent in the profiling. I didn't do a great job summarizing my subsequent findings, so I'll share them here for the benefit of others (e.g. @efiop, @mroutis, @pared).

I did some basic benchmarking of md5sum and DVC's file_md5, and the two implementations were actually quite comparable in runtime. On a 5GB binary file, file_md5 was only about 10% slower than md5sum. The story changes when comparing md5sum and file_md5 on a text file: file_md5 unfailingly takes twice as long to complete. The reason is fairly obvious: the dos2unix function in dvc/utils/__init__.py calls .replace on every chunk of the text file, so file_md5 is effectively reading every byte in the file twice. This is why I opened #3264, which I'm still hoping will be merged (with either a 1MB or 8KB chunk size), regardless of whether this apparent bug gets fixed.
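
To make that concrete, the pattern described above looks roughly like this (a simplified rendering, not DVC's exact code): every chunk fed to the hash is first rewritten by dos2unix, so .replace walks over each byte once and md5.update walks over it again.

import hashlib

def dos2unix(data):
    # Scans and copies the whole chunk just to normalize line endings.
    return data.replace(b"\r\n", b"\n")

def file_md5_textmode(path, chunk_size=8192):
    # For files classified as text, every chunk passes through dos2unix
    # before hashing, so every byte is touched twice instead of once.
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            md5.update(dos2unix(chunk))
    return md5.hexdigest()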

@efiop and @pared any updates on this? I'm a bit confused as to the status of this alleged bug and its fix. Is this still considered a bug?

Hi @kevlar1818 !

Sorry for the silence, I was waiting for your response on

@efiop As for a video call, let me get back to you on that. I'd have to carve out some time at work.

Yes, I think it is a bug, but we were not able to reproduce it, which is why I was asking for a meeting to figure it out on the spot. If you can come up with a reproduction script, that would be an alternative to a meeting; otherwise we will need to take a closer look ourselves. Adding this to the sprint to keep it in focus.

@kevlar1818 is the remote MinIO instance running in distributed mode? If so, how many nodes?

@dvaldivia The MinIO instance is not running in distributed mode.

@efiop I'll try to reproduce the issue with a minimal example first, and then we can revisit having a debugging session. Should I open a separate GitHub issue concerning the misclassification of binary files as text files? I still think that's another facet of this that could/should be addressed regardless of this bug.

@kevlar1818 Sounds good. Please feel free to open an issue for that. :+1:

@pared mentioned that what is happening here could be reproduced with a simple dvc checkout: because the cache is already there, there is no need to pull anything. The issue is that .dvc/state is empty, so DVC needs to verify all the new cache files coming from the external cache dir. We do trust things we download during pull, but we don't trust local caches, as they can be corrupted much more easily. Need to verify that this is what is going on here and, if so, consider adding a config option to trust the cache dir (but there might be a better solution too; need to think about it).
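
In other words, the state database acts as a memo from (inode, mtime, size) to md5, and a fresh clone or container starts with that memo empty, so every cache file gets rehashed once. A conceptual sketch (not DVC's actual state.py, which keeps this in SQLite under .dvc/state):

import hashlib
import os

checksum_memo = {}  # stand-in for the on-disk state DB

def cached_md5(path):
    # Recompute the MD5 only if the file's inode/mtime/size changed since we
    # last saw it. With an empty memo (fresh clone, fresh container), every
    # file is hashed even though the shared cache already holds the data.
    st = os.stat(path)
    key = (st.st_ino, st.st_mtime_ns, st.st_size)
    if key not in checksum_memo:
        md5 = hashlib.md5()
        with open(path, "rb") as fobj:
            for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
                md5.update(chunk)
        checksum_memo[key] = md5.hexdigest()
    return checksum_memo[key]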

@efiop
This test replicates @kevlar1818's situation:

# Assumed setup (not shown in the original comment): this is a pytest test in
# DVC's own test suite, where tmp_dir and dvc are DVC test fixtures, mocker
# comes from pytest-mock, and dvc_module refers to the dvc package itself
# (e.g. import dvc as dvc_module).
def test_checkout_calculation_on_no_state(tmp_dir, dvc, mocker):
    tmp_dir.dvc_gen("file", "file content")
    # Simulate a fresh clone/container: drop the state DB and the workspace file.
    (tmp_dir / ".dvc" / "state").unlink()
    (tmp_dir / "file").unlink()

    file_md5_counter = mocker.spy(dvc_module.remote.local, "file_md5")

    dvc.checkout()
    # The checkout re-hashes the cache file even though it was never modified.
    assert file_md5_counter.call_count == 1

So it seems to me that with each new instance of the docker image, md5s are unnecessarily calculated.
The "trust" flag seems like the way to go.

@pared What would that look like? Could you describe it? Not trying to push him or anything, but since @kevlar1818 has submitted a PR already, maybe he would also consider submitting a PR for this :wink:, so we'd better describe the solution well and hopefully agree on it.

@efiop
I believe we should have a --no-verify option for the checkout command; intentionally, it should mirror --verify for fetch. In the case of checkout, I think --no-verify as the option (and therefore verifying by default) is better, because this use case (a shared cache) is prone to the user corrupting the cache if, for example, two repos use the same cache with a symlink or hardlink cache type and we forget to protect one and then modify it.

The check for this flag should probably happen somewhere around here.
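
A rough sketch of how such a gate might look (all names here are hypothetical, not DVC's actual checkout internals):

import os

def cache_entry_changed(cache_path, expected_md5, compute_md5, no_verify=False):
    # Hypothetical gate for a --no-verify checkout option: when set, an
    # existing cache file is trusted as-is; otherwise it is rehashed and
    # compared against the expected checksum (the current behaviour).
    if not os.path.exists(cache_path):
        return True  # not in the cache at all
    if no_verify:
        return False  # trust the shared cache without rehashing
    return compute_md5(cache_path) != expected_md5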

Really, really dislike new config options. Even if we introduce an option, would it be possible for us to set sensible defaults depending on the configuration? E.g. cache is not shared: do not verify; cache is shared: verify; etc.

There is no way for us to tell if the cache is shared or not.

I've also mentioned a protection-based approach, where we mark every cache file we add as read-only; if it is still read-only later, we assume the cache file is not corrupted and trust its hash. That way we won't need special options.
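
A minimal sketch of that idea (illustrative only, not the actual DVC implementation): cache files are made read-only when added, and a file that is still read-only is assumed to be untouched, so its recorded hash can be trusted without rehashing.

import os
import stat

def protect(path):
    # Mark a freshly added cache file read-only (0o444).
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def is_trusted(path):
    # If no write bit is set, assume the cache file was never modified and
    # its stored checksum can be trusted without rehashing.
    mode = os.stat(path).st_mode
    return not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))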

@kevlar1818 Really sorry this is taking so long; we have a lot on our plate these days. Thanks to the amazing research by @pared, I've created https://github.com/iterative/dvc/pull/3472, which does some tricks with read-only permissions so that DVC can trust any cache. Please consider trying it out with

pip install git+https://github.com/efiop/dvc@3261

and let us know if that helps with the issue or not. Thanks for the feedback!

@kevlar1818 The fix was released in 0.90; please give it a try and let us know if it fixed the issue. Thanks for the feedback and your initial PR! :pray:
