OS: Docker image based off of tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.
Initial setup:
# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6
DVC cache configuration:
# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"
Please note that the DVC cache is hot. In other words, most if not all files for dvc fetch are present and up-to-date at /ssd/...
Make a fresh clone and profile dvc pull:
# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...
This dvc pull, uninstrumented, usually takes 40+ minutes with a hot DVC cache.
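For reference, a roughly equivalent programmatic way to capture the same profile (a sketch assuming yappi's documented start/stop/get_func_stats API and DVC's dvc.main.main entry point; adjust if your setup differs):

import yappi
from dvc.main import main  # assumed CLI entry point; adjust for your DVC version

# Profile a single `dvc pull` and dump the result in callgrind format
# so it can be opened in KCachegrind, same as the command above.
yappi.start()
try:
    main(["pull"])
finally:
    yappi.stop()
    yappi.get_func_stats().save("dvc_pull.prof", type="callgrind")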
Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:
# find . -type l | wc -l
29003
# du -shL .
403G .
Looking at dvc_pull.prof (in KCachegrind) suggests that the bottleneck is the checksum process. The file_md5 and dos2unix functions in utils/__init__.py appear particularly costly.
Is this a known issue? Would the primary authors of DVC entertain a more performant version of file_md5 (perhaps written in C/C++ and without TQDM integration)?
Looking into this further, it appears DVC is wrongly treating binary files as text files (via istextfile.py). This leads to the huge number of dos2unix calls mentioned above. In my case, the files have large text headers followed by binary data.
Is this a known issue?
It was a known issue, but I didn't follow up on it :disappointed:
Take a look at https://github.com/iterative/dvc/issues/1970#issuecomment-491154993
In my case, the files have large text headers followed by binary data.
Ouch, our heuristic is similar to git's and only checks the first N bytes to determine whether a file is binary or text. Related: https://github.com/iterative/dvc/issues/992 .
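For context, a rough sketch of that kind of heuristic (illustrative only, not DVC's exact istextfile.py): only the first block of the file is inspected, so a large text header followed by binary data gets classified as text.

# Sketch of a git-like text/binary check: read only the first block
# and look at how "texty" it is. Files with long text headers followed
# by binary data pass this check and are then treated as text.
TEXT_CHARS = bytes(range(32, 127)) + b"\n\r\t\f\b"

def looks_like_text(path, blocksize=512):
    with open(path, "rb") as fobj:
        block = fobj.read(blocksize)
    if not block:
        return True                  # empty file -> treat as text
    if b"\x00" in block:
        return False                 # NUL byte -> binary
    nontext = block.translate(None, TEXT_CHARS)
    return len(nontext) / len(block) <= 0.30   # mostly printable -> text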
That being said, the current dvc pull should trust hashes downloaded from the remote, so I wonder if there is a bug here somewhere. Need to check; might be related to the unpacked dir or something. @pared maybe you have any thoughts about it?
@efiop seems like a bug to me; dos2unix should not be called at all, as it's invoked only by file_md5, which supposedly should not happen after https://github.com/iterative/dvc/pull/3200. Looking into that.
Ok, got it, I made a mistake in the implementation:
[EDIT]
actually that is not that easy: calculating the checksum of cache files is part of verifying that the cache is not corrupted, so we need to consider that.
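To spell out what that verification amounts to (a simplified sketch, not DVC's code): in a local cache the expected md5 is encoded in the file's path, so checking for corruption means re-hashing the whole file and comparing, which is exactly the expensive part.

import hashlib
import os

def cache_file_is_intact(cache_path, chunk_size=1024 * 1024):
    # Local cache layout is <cache_dir>/<md5[:2]>/<md5[2:]>, so the
    # expected hash can be reconstructed from the path itself.
    parent, name = os.path.split(cache_path)
    expected_md5 = os.path.basename(parent) + name

    md5 = hashlib.md5()
    with open(cache_path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5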
@kevlar1818 do you set up a fresh Docker container each time you use this repo? Or is it more of a persistent development environment?
If the latter, the long checkout should not occur again; if your use case is the former, it will be a recurring issue.
I think we should consider trusting the cache, or providing an option allowing the user to do that.
@kevlar1818 please rerun pull and tell us if it's also taking that much time.
@pared Rerunning pull on the same repo (not a new clone) seems to produce the same runtime as the first pull.
The hope is that the repo can be cloned in a new Docker container whenever a developer wants to work on the project, so I'd put our use case in the first category you describe above. That being said, all developers share the /ssd cache.
@kevlar1818 And could you show prof results for that one too, please?
@efiop I'll have to re-run with profiling enabled on Monday.
@efiop @pared @mroutis Please see above for the requested profiling of a second dvc pull.
@kevlar1818 One more thing, could you please show dvc version output?
@efiop
# dvc version
DVC version: 0.82.6
Python version: 3.7.5
Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip
@kevlar1818 Oh, sorry, forgot to mention that it would be great to pip install psutil before that and then run dvc version inside the repository; that way it will provide FS info that might be useful here. Could you please do that and show the output once again? Sorry for the inconvenience.
@efiop
# pip install psutil
...
Installing collected packages: psutil
Successfully installed psutil-5.6.7
# dvc version
DVC version: 0.82.6
Python version: 3.7.5
Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip
Filesystem type (workspace): ('nfs', '...')
Thanks @kevlar1818 !
Btw, looks like one more line is missing from that output: the one for cache.
But now that we see that it is nfs, I suspect that is probably the cause here. Need to check; maybe the inode handling is broken or something like that. Could you please show the full log for dvc pull -v?
@kevlar1818 Also, what is the remote that you are dvc pull-ing from? I don't see it in the config.
@kevlar1818 Btw, what we could also do here is have a video call together and debug on the spot. It should be much faster for everyone. :slightly_smiling_face: Would that work for you? If it would, please ping me through the email (see it in my github profile) and we'll arrange something ASAP.
Btw, looks like one more line is missing from that output: the one for cache.
@efiop That was the entire output of dvc version. As I noted in my initial bug report, my cache is located at /ssd/..., which is a locally mounted ext4 FS.
Also, what is the remote that you are dvc pull-ing from? I don't see it in the config.
The remote is a MinIO (S3) instance on the same LAN.
@efiop As for a video call, let me get back to you on that. I'd have to carve out some time at work.
I have a similar issue, even on recent DVC (after https://github.com/iterative/dvc/pull/3200 was merged).
In my use case, I'd happily make the cache read-only if that offers additional guarantees against corruption, as this is a CI environment and nothing is expected to alter the data.
Btw, from the OP:
perhaps written in C/C++ and without TQDM integration
From what I can see in the profiler results, tqdm is relatively negligible, but let me know if it's still considered unacceptably high.
(related: #3060)
From what I can see in the profiler results, tqdm is relatively negligible, but let me know if it's still considered unacceptably high.
@casperdcl You're right. At first I was worried about Python's overhead given the sheer number of function calls apparent in the profiling. I didn't do a great job summarizing my subsequent findings, so I'll share them here for the benefit of others (e.g. @efiop, @mroutis, @pared).
I did some basic benchmarking of md5sum and DVC's file_md5, and the two implementations were actually quite comparable in runtime. On a 5 GB binary file, file_md5 was only perhaps 10% slower than md5sum. The story changes when comparing md5sum and file_md5 on a text file: file_md5 unfailingly takes twice as long to complete. The reason is fairly obvious: the dos2unix function in dvc/utils/__init__.py calls .replace on every chunk of the text file, so file_md5 effectively processes every byte in the file twice. This is why I opened #3264, which I'm still hoping will be merged (with either a 1 MB or 8 KB chunk size) regardless of whether this apparent bug gets fixed.
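To make the cost concrete, here is a stripped-down version of that pattern (simplified, not DVC's exact code): when a file is detected as text, every chunk gets one full pass in .replace and a second in the md5 update, which matches the roughly 2x slowdown versus md5sum.

import hashlib

def dos2unix(data):
    # One full scan-and-copy of the chunk just to normalize line endings.
    return data.replace(b"\r\n", b"\n")

def chunked_md5(path, binary, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            if not binary:
                chunk = dos2unix(chunk)   # second pass over every byte
            md5.update(chunk)
    return md5.hexdigest()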
@efiop and @pared any updates on this? I'm a bit confused as to the status of this alleged bug and its fix. Is this still considered a bug?
Hi @kevlar1818 !
Sorry for the silence, I was waiting for your response on:
@efiop As for a video call, let me get back to you on that. I'd have to carve out some time at work.
Yes, I think it is a bug, but we were not able to reproduce it, which is why I was asking for a meeting to figure it out on the spot. If you can come up with a reproduction script, that would be an alternative to a meeting. Otherwise we will need to take a closer look into it ourselves. Adding to the sprint to keep this in focus.
@kevlar1818 is the remote MinIO instance running in distributed mode? If so, how many nodes?
@dvaldivia The MinIO instance is not running in distributed mode.
@efiop I'll try to reproduce the issue with a minimal example first, and then we can revisit having a debugging session. Should I open a separate Github issue concerning the misclassification of binary files as text files? I still think that's another facet to this that could/should be addressed regardless of this bug.
@kevlar1818 Sounds good. Please feel free to open an issue for that. :+1:
@pared mentioned that what is happening here could be reproduced with a simple dvc checkout, because the cache is already there and there is no need to pull it. The issue with that is that .dvc/state is empty, so it needs to verify all the new cache files coming from an external cache dir. We do trust things we download during pull, but we don't trust local caches, as they might be corrupted much more easily. Need to verify that this is what is going on here and, if so, consider adding a config option to trust the cache dir (but there might be a better solution too; need to think about it).
@efiop
This test replicates @kevlar1818's situation:
import dvc as dvc_module  # assumed import; exposes dvc.remote.local for the spy target
import dvc.remote.local   # noqa: F401  (make sure the submodule is loaded)


def test_checkout_calculation_on_no_state(tmp_dir, dvc, mocker):
    tmp_dir.dvc_gen("file", "file content")
    # Simulate a fresh container: no state DB, no workspace file.
    (tmp_dir / ".dvc" / "state").unlink()
    (tmp_dir / "file").unlink()
    file_md5_counter = mocker.spy(dvc_module.remote.local, "file_md5")
    dvc.checkout()
    # The cache is intact and untouched, yet checkout still re-hashes the file.
    assert file_md5_counter.call_count == 1
So to me it seems that with each new instance of the Docker image, md5s are unnecessarily calculated.
The "trust" flag seems like the way to go.
@pared What would that look like? Could you describe it? Not trying to push him or anything, but since @kevlar1818 has submitted a PR already, maybe he would also consider submitting a PR for this :wink:, so we'd better describe the solution well and hopefully agree on it.
@efiop
I believe we should have a --no-verify option for the checkout command. Intentionally, it should be similar to --verify for fetch. I think that in the case of checkout, --no-verify is the better option (and therefore verify by default), because this use case (a shared cache) is prone to the user corrupting the cache if, for example, 2 repos use the same cache with the symlink or hardlink cache type and we forget to protect one and modify it.
The check for this flag should probably happen somewhere around here.
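Purely to illustrate the shape of the proposal (hypothetical names, not DVC's actual argument parser or checkout code):

import argparse

parser = argparse.ArgumentParser(prog="dvc checkout")
parser.add_argument(
    "--no-verify",
    action="store_true",
    help="Trust cache files during checkout and skip re-computing their md5.",
)

args = parser.parse_args(["--no-verify"])
if args.no_verify:
    print("checkout would link cache files without calling file_md5")
else:
    print("checkout would re-hash cache files before linking them")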
Really, really dislike new config options. Even if we introduce an option - would it be possible for us to set certain defaults depending on the configuration? E.g. cache is not shared - do not verify; cache is shared - verify; etc.
There is no way for us to tell if the cache is shared or not.
I've also mentioned a protected-based approach, where we set every cache file that we add to read-only, and if it is still read-only later then we assume the cache file is not corrupted and so trust its hash. That way we won't need special options.
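Roughly, the idea (a sketch of the approach as described, not the eventual implementation): drop the write bits when a file enters the cache, and at checkout time trust any cache file that is still read-only instead of re-hashing it.

import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def protect_cache_file(path):
    # Called when a file is added to the cache: make it read-only.
    os.chmod(path, os.stat(path).st_mode & ~WRITE_BITS)

def can_trust_without_rehash(path):
    # If nobody was able to write to the file, assume its content still
    # matches its hash and skip the expensive file_md5 verification.
    return not (os.stat(path).st_mode & WRITE_BITS)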
@kevlar1818 Really sorry for this taking so long; we have a lot on our plate these days. Thanks to the amazing research by @pared, I've created https://github.com/iterative/dvc/pull/3472, which does some tricks with read-only permissions so that it can trust any cache. Please consider trying it out with
pip install git+https://github.com/efiop/dvc@3261
and let us know if that helps with the issue or not. Thanks for the feedback!
@kevlar1818 The fix was released in 0.90; please give it a try and let us know if it fixed the issue. Thanks for the feedback and your initial PR! :pray: