Dvc: `dvc push --run-cache` is incredibly slow on "ssh" remote

Created on 16 Oct 2020 · 1 Comment · Source: iterative/dvc

Bug Report

The command `dvc push --run-cache` takes a very long time (more than an hour in my case) if there are a lot of files in dvc_cache/runs. The reason is that a new ssh connection is made for each file to check whether it exists on the remote.

Please provide information about your setup

Output of dvc version:

$ dvc version
1.8.4+21a6be

Additional Information (if any):

$ dvc push --run-cache --verbose
2020-10-16 14:36:14,695 DEBUG: Check for update is enabled.
2020-10-16 14:36:14,697 DEBUG: fetched: [(3,)]                        
2020-10-16 14:36:14,818 DEBUG: Establishing ssh connection with 'xxx' through port '22' as user 'dvc'
2020-10-16 14:36:15,457 DEBUG: Establishing ssh connection with 'xxx' through port '22' as user 'dvc'
2020-10-16 14:36:16,070 DEBUG: Establishing ssh connection with 'xxx' through port '22' as user 'dvc'

and so on, hundreds of times.
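To put a number on the overhead, the connection lines in the verbose output can be counted and timed. The following is a minimal sketch that uses only the three log lines quoted above as sample data; the function names `connection_times` and `mean_gap_seconds` are made up for illustration and are not part of DVC.

```python
import re
from datetime import datetime

# Sample data: the three connection lines from the verbose output above.
LOG = """\
2020-10-16 14:36:14,818 DEBUG: Establishing ssh connection with 'xxx' through port '22' as user 'dvc'
2020-10-16 14:36:15,457 DEBUG: Establishing ssh connection with 'xxx' through port '22' as user 'dvc'
2020-10-16 14:36:16,070 DEBUG: Establishing ssh connection with 'xxx' through port '22' as user 'dvc'
"""

# Capture the "date time,millis" prefix of each connection line.
PATTERN = re.compile(r"^(\S+ \S+) DEBUG: Establishing ssh connection")


def connection_times(log_text):
    """Return the timestamp of every 'Establishing ssh connection' line."""
    times = []
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def mean_gap_seconds(times):
    """Average delay between consecutive connections, in seconds."""
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


if __name__ == "__main__":
    times = connection_times(LOG)
    print(len(times), "connections,", round(mean_gap_seconds(times), 3), "s apart on average")
```

In this sample the connections are a little over 0.6 seconds apart, so the total cost grows with the number of files in the run cache.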

Most helpful comment

The slowdown is caused by the ssh connection being dropped from the pool in get_connection in dvc/tree/pool.py. When the generator SSHTree.walk_files is closed prematurely, a GeneratorExit exception is raised inside it, and the connection is dropped from the pool as a result. This is the case in dvc/stage/cache.py:210, in the statement first(to_remote.walk_files(key)).

See my PR.
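The mechanism described in this comment can be reproduced in isolation. The sketch below is not DVC's code: `FakeConnection`, `Pool`, `walk_files`, and `first` are hypothetical stand-ins, and the `keep_on_generator_exit` switch is an assumption used to contrast the two behaviours. It shows a pool that discards a connection whenever an exception passes through `get_connection`, and how closing a generator early makes every lookup open a new connection. Returning the connection on GeneratorExit is one possible remedy, not necessarily the change made in the PR.

```python
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for an ssh connection; counts how many were opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1


class Pool:
    """Hypothetical connection pool, not DVC's actual implementation."""

    def __init__(self, keep_on_generator_exit):
        self._idle = []
        self.keep_on_generator_exit = keep_on_generator_exit

    @contextmanager
    def get_connection(self):
        conn = self._idle.pop() if self._idle else FakeConnection()
        try:
            yield conn
        except GeneratorExit:
            # The caller's generator was closed early; the connection is fine.
            if self.keep_on_generator_exit:
                self._idle.append(conn)
            raise
        except BaseException:
            # Any other error: treat the connection as broken and drop it.
            raise
        else:
            self._idle.append(conn)


def walk_files(pool, names):
    """Generator that holds a pooled connection while it yields."""
    with pool.get_connection():
        for name in names:
            yield name


def first(gen):
    """Take one item and close the generator, like first(walk_files(key))."""
    try:
        return next(gen, None)
    finally:
        # Closing raises GeneratorExit at the suspended yield in walk_files.
        gen.close()


def connections_opened(keep_on_generator_exit, lookups=100):
    """Run `lookups` early-exit lookups and report how many connections opened."""
    FakeConnection.opened = 0
    pool = Pool(keep_on_generator_exit)
    for _ in range(lookups):
        first(walk_files(pool, ["a", "b"]))
    return FakeConnection.opened


if __name__ == "__main__":
    print("dropping on GeneratorExit:", connections_opened(False))
    print("keeping on GeneratorExit: ", connections_opened(True))
```

When the connection is dropped on GeneratorExit, each of the 100 lookups opens its own connection. When it is returned to the pool, a single connection serves all of them.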

