Dvc: Do not check for directory when looking for cache files

Created on 1 Dec 2019  路  2Comments  路  Source: iterative/dvc

Recent refactoring in #2873 and #2853 implemented isdir(), isfile() and exists(). exists() checks if the given path exists using isdir as well as isfile. isfile simply checks for existence of the blob (in case of s3 and gs), whereas isdir checks if any files with the prefix exists.

.cache_exists() uses .exists(), but does not care about directory paths. So, to reduce number of ._list_paths() remote calls, it's better to refactor .cache_exists() to use isfile.

As @Suor notes here https://github.com/iterative/dvc/pull/2873#discussion_r352328162, isfile might not have been implemented in all remotes, or might be slower in some remotes.

Further infomation: https://github.com/iterative/dvc/pull/2873#discussion_r352332637

enhancement p1-important performance refactoring

Most helpful comment

BTW, this is a bit more complicated.

.exists() being more complex and slower is true for s3 and gs, but might be not true for other remotes likes local and ssh. So we should bench things and might even require using different calls for different remotes to get top performance everywhere.

All 2 comments

BTW, this is a bit more complicated.

.exists() being more complex and slower is true for s3 and gs, but might be not true for other remotes likes local and ssh. So we should bench things and might even require using different calls for different remotes to get top performance everywhere.

@skshetry, I was thinking about leveraging S3's CommonPrefixes to check if a file exist. That should handle it in one call, instead of the isfile or isdir check.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dnabanita7 picture dnabanita7  路  3Comments

mfrata picture mfrata  路  3Comments

TezRomacH picture TezRomacH  路  3Comments

shcheklein picture shcheklein  路  3Comments

anotherbugmaster picture anotherbugmaster  路  3Comments