This https://github.com/iterative/dvc/blob/0.93.0/dvc/remote/base.py#L290 takes ~50 seconds compared to ~0.4 for simple replace() that we used to have on our imagenet dataset.
Actually, same with dumping. PathInfo is extremely slow when we need lots of it.
Another confirmation https://github.com/iterative/dvc/pull/3634#issuecomment-616335337
The issue is that in OutputBase._collect_used_dir_cache(), we add str(path_info) to our used_cache dict, rather than just adding path_info.
PathInfo.__str__() is overridden to always compute the relative path from our current working dir, rather than than returning a string fspath.
https://github.com/iterative/dvc/blob/52369bdf47d2fcafed8ca3d1f9b0deaecc99a236/dvc/path_info.py#L45-L47
For *nix systems, relpath() just wraps os.path.relpath, but in the event that we have a directory with millions of files, we end up making millions of os.path.relpath/os.path.abspath/os.getcwd calls here, which is very slow (see cpython posixpath.py).
Since we don't actually need the relative path until we are printing/logging status for modified/pushed/pulled files, one easy optimization is that we can just put path_info directly into used_cache, and delay calling str(path_info) until we actually use the relative path string. For the case where we only have a few modified files, we will only compute relpath for those modified files rather than for the entire directory.
But this does not cover the case where we eventually have to print/log (and compute relpath for) millions of modified files. To get better behavior in that case, we could implement our own versions of posixpath.abspath/posixpath.relpath that would take advantage of the fact that things like os.getcwd() should not change between abspath and relpath calls for our use case.
for S3 remote with 2M files, deferring the relpath call drops the Repo._cloud_status runtime from 938s to 374s with cProfile
@efiop thoughts?
@pmrowla Very interesting idea about relpath/abspath re-implementation! But I'm a bit unsure if it is really worth creating such a general hack for it, when we could do a local hack something like so:
path = str(self.path_info)
filter_path = str(filter_info) if filter_info else None
for entry in self.dir_cache:
# NOTE: not using PathInfo's here, as they become really slow when you need a lot of them
checksum = entry[self.remote.PARAM_CHECKSUM]
entry_path = os.path.join(path, entry[self.remote.PARAM_RELPATH])
if not filter_path or entry_path == filter_path or entry_path.startswith(filter_path + os.sep):
cache.add(self.scheme, checksum, entry_path)
At least it won't lose the hack context this way. Plus, I feel like path_info might bite us not only in relpath department but in others too, where we won't be able to provide a good general hack in the future, so it is better to keep it local for now.