Dvc: remote: loading dirs is exteremely slow because of PathInfo

Created on 15 Apr 2020 · 5Comments · Source: iterative/dvc

This https://github.com/iterative/dvc/blob/0.93.0/dvc/remote/base.py#L290 takes ~50 seconds compared to ~0.4 for simple replace() that we used to have on our imagenet dataset.

bug p1-important performance

Source

efiop

👀1

All 5 comments

Actually, same with dumping. PathInfo is extremely slow when we need lots of it.

efiop on 15 Apr 2020

👍1

Another confirmation https://github.com/iterative/dvc/pull/3634#issuecomment-616335337

efiop on 21 Apr 2020

The issue is that in OutputBase._collect_used_dir_cache(), we add str(path_info) to our used_cache dict, rather than just adding path_info.

https://github.com/iterative/dvc/blob/52369bdf47d2fcafed8ca3d1f9b0deaecc99a236/dvc/output/base.py#L393-L397

PathInfo.__str__() is overridden to always compute the relative path from our current working dir, rather than than returning a string fspath.

https://github.com/iterative/dvc/blob/52369bdf47d2fcafed8ca3d1f9b0deaecc99a236/dvc/path_info.py#L45-L47

For *nix systems, relpath() just wraps os.path.relpath, but in the event that we have a directory with millions of files, we end up making millions of os.path.relpath/os.path.abspath/os.getcwd calls here, which is very slow (see cpython posixpath.py).

pmrowla on 22 Apr 2020

Since we don't actually need the relative path until we are printing/logging status for modified/pushed/pulled files, one easy optimization is that we can just put path_info directly into used_cache, and delay calling str(path_info) until we actually use the relative path string. For the case where we only have a few modified files, we will only compute relpath for those modified files rather than for the entire directory.

But this does not cover the case where we eventually have to print/log (and compute relpath for) millions of modified files. To get better behavior in that case, we could implement our own versions of posixpath.abspath/posixpath.relpath that would take advantage of the fact that things like os.getcwd() should not change between abspath and relpath calls for our use case.

for S3 remote with 2M files, deferring the relpath call drops the Repo._cloud_status runtime from 938s to 374s with cProfile

@efiop thoughts?

pmrowla on 22 Apr 2020

@pmrowla Very interesting idea about relpath/abspath re-implementation! But I'm a bit unsure if it is really worth creating such a general hack for it, when we could do a local hack something like so:

        path = str(self.path_info)
        filter_path = str(filter_info) if filter_info else None
        for entry in self.dir_cache:
            # NOTE: not using PathInfo's here, as they become really slow when you need a lot of them
            checksum = entry[self.remote.PARAM_CHECKSUM]
            entry_path = os.path.join(path, entry[self.remote.PARAM_RELPATH])
            if not filter_path or entry_path == filter_path or entry_path.startswith(filter_path + os.sep):
                cache.add(self.scheme, checksum, entry_path)

At least it won't lose the hack context this way. Plus, I feel like path_info might bite us not only in relpath department but in others too, where we won't be able to provide a good general hack in the future, so it is better to keep it local for now.

efiop on 22 Apr 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings