Dvc: output: cache metadata

Created on 29 Oct 2019  ·  9 Comments  ·  Source: iterative/dvc

DVC checks several times whether a path is a file or a directory, what its checksum is, etc.
When working with remotes, this often means making a network request each time that information is needed.

It would be cool to memoize that information in the output/dependency or PathInfo object.
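A minimal sketch of what such memoization could look like. `FakeRemote` and `CachedPathMeta` are hypothetical names made up for illustration, not DVC classes; the fake remote only counts simulated requests. Each property asks the remote at most once per object.

```python
from functools import cached_property  # Python 3.8+


class FakeRemote:
    """Hypothetical stand-in for a remote; counts simulated network requests."""

    def __init__(self, files, dirs):
        self.files = set(files)
        self.dirs = set(dirs)
        self.requests = 0

    def exists(self, path):
        self.requests += 1
        return path in self.files or path in self.dirs

    def isdir(self, path):
        self.requests += 1
        return path in self.dirs


class CachedPathMeta:
    """Hypothetical per-path holder that memoizes metadata lookups."""

    def __init__(self, remote, path):
        self.remote = remote
        self.path = path

    @cached_property
    def exists(self):
        return self.remote.exists(self.path)

    @cached_property
    def isdir(self):
        return self.remote.isdir(self.path)


remote = FakeRemote(files={"data/file.csv"}, dirs={"data"})
meta = CachedPathMeta(remote, "data/file.csv")

# Repeated reads hit the remote only on the first access of each property.
for _ in range(5):
    assert meta.exists
    assert not meta.isdir

print(remote.requests)  # 2: one exists() request and one isdir() request
```

The catch is staleness: a holder like this has to be dropped or reset whenever the path is created, removed or replaced.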

enhancement performance

All 9 comments

@mroutis could you elaborate a bit, please? What kind of requests? Can you point to the code?

@shcheklein I'm pretty sure @mroutis is talking about things like exists(), that might take quite a long time, as you have to do a request over the network to check if something exists or not. We do abuse those kinds of checks. Ideally we would either cache them or collect stuff first and then process it without the need to do sequential calls, similar to what @Suor has been talking about for a long time now.

@mroutis @efiop it would still be good to have a code example in place and a somewhat more structured list of the things you have in mind to cache.

So here is a partial call tree for file checkout:

    remote._checkout_file(path_info)
        remote.changed(path_info)
E1          remote.exists(path_info)
            remote.save_info(path_info)
                remote.get_checksum(path_info)
E2                  remote.exists(path_info)
D1                  remote.isdir(path_info)
        remote.safe_remove(path_info)
E3          remote.exists(path_info)
            remote.already_cached(path_info)
                remote.get_checksum(path_info)
E4                  remote.exists(path_info)
D2                  remote.isdir(path_info)
        remote.link(cache_info, path_info)
            remote._link(...)
                remote.makedirs(path_info.parent)
                remote._do_link(..., path_info)
E5                  remote.exists(path_info)

So we call remote.exists() 5 times and remote.isdir() 2 times for the same path_info. We also do that for cache_info, which is remote.checksum_to_path_info(checksum).
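To put numbers on this, here is a rough sketch that replays only the exists()/isdir() calls from the tree above against a toy remote, once directly and once through a memoizing wrapper. `CountingRemote`, `MemoizedRemote` and `checkout_checks` are hypothetical names, not DVC code, and the points where the file is removed and re-created are an assumption about where the mutations happen. Since the path changes mid-checkout, the wrapper has to invalidate its cache on writes.

```python
class CountingRemote:
    """Hypothetical remote that counts exists()/isdir() requests."""

    def __init__(self, files):
        self.files = set(files)
        self.calls = {"exists": 0, "isdir": 0}

    def exists(self, path):
        self.calls["exists"] += 1
        return path in self.files

    def isdir(self, path):
        self.calls["isdir"] += 1
        return False  # this toy remote only holds files

    def remove(self, path):
        self.files.discard(path)

    def create(self, path):
        self.files.add(path)


class MemoizedRemote:
    """Wraps a remote, caches read-only checks, invalidates on writes."""

    def __init__(self, remote):
        self.remote = remote
        self._cache = {}

    def _memo(self, name, path):
        key = (name, path)
        if key not in self._cache:
            self._cache[key] = getattr(self.remote, name)(path)
        return self._cache[key]

    def exists(self, path):
        return self._memo("exists", path)

    def isdir(self, path):
        return self._memo("isdir", path)

    def _invalidate(self, path):
        for name in ("exists", "isdir"):
            self._cache.pop((name, path), None)

    def remove(self, path):
        self.remote.remove(path)
        self._invalidate(path)

    def create(self, path):
        self.remote.create(path)
        self._invalidate(path)


def checkout_checks(remote, path):
    """Replays the exists()/isdir() calls in the order shown in the tree."""
    remote.exists(path)  # E1, in changed()
    remote.exists(path)  # E2, in get_checksum()
    remote.isdir(path)   # D1
    remote.exists(path)  # E3, in safe_remove()
    remote.exists(path)  # E4, in already_cached() -> get_checksum()
    remote.isdir(path)   # D2
    remote.remove(path)  # assumed: safe_remove() deletes the file here
    remote.exists(path)  # E5, in _do_link()
    remote.create(path)  # assumed: the link is created here


plain = CountingRemote({"data.csv"})
checkout_checks(plain, "data.csv")
print(plain.calls)  # {'exists': 5, 'isdir': 2}

backend = CountingRemote({"data.csv"})
checkout_checks(MemoizedRemote(backend), "data.csv")
print(backend.calls)  # {'exists': 2, 'isdir': 1}
```

In this toy run the seven requests drop to three; the second exists() request is the one after the removal, which the cache must not answer from stale data.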

@Suor @mroutis thanks, I understand it better now. Just to confirm - this mostly affects external data management cases, right? Like when we have SSH with a cache set up on it to version stuff that is located on it?

It also affects the local case. With many files it might take time, I guess. Also, local sometimes means NFS, which has network lag.

The push/pull to cache is not affected, since it uses status to actually collect data in bulk.

@Suor, just curious, how did you come up with that call tree?
https://github.com/iterative/dvc/issues/2689#issuecomment-549819969

@mroutis built it manually

BTW, from some discussions with @efiop on this, the proper solution will be flattening the checkout logic, i.e. make a plan of things to do, then execute it.
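A rough sketch of the "plan, then execute" idea, using plain dicts in place of a real workspace and cache. `Action`, `make_plan` and `execute` are hypothetical names for illustration only, and the current state is assumed to have been collected in one bulk pass beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "remove" or "link"
    path: str


def make_plan(wanted, present):
    """Builds a flat list of actions without touching the workspace.

    wanted:  dict mapping path -> desired checksum
    present: dict mapping path -> current checksum, gathered in one bulk pass
    """
    plan = []
    for path, checksum in sorted(wanted.items()):
        current = present.get(path)
        if current == checksum:
            continue  # already up to date, nothing to do
        if current is not None:
            plan.append(Action("remove", path))
        plan.append(Action("link", path))
    return plan


def execute(plan, workspace, wanted):
    """Applies the plan; no existence or checksum checks happen here."""
    for action in plan:
        if action.kind == "remove":
            del workspace[action.path]
        elif action.kind == "link":
            workspace[action.path] = wanted[action.path]


wanted = {"a.csv": "111", "b.csv": "222", "c.csv": "333"}
workspace = {"a.csv": "111", "b.csv": "old"}  # c.csv is missing

plan = make_plan(wanted, dict(workspace))
execute(plan, workspace, wanted)
```

All decisions are made in `make_plan` from a single snapshot, so the execution step never needs to ask the remote whether something exists or is a directory.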
