Hi @efiop and @mroutis,
The problem:
As discussed on Discord (see from this message and the next 14 messages), I have tried out the new feature (#1654) of using remote folders as DVC stage dependencies and outputs; however, I found the runtime extremely slow.
My test example:
I have created the following example to test out the reason for the slow runtime. It contains:

- rsync_dataset.sh: a local bash script which performs an rsync of a remote folder over SSH.
- rsync_dataset.dvc: the DVC stage for executing rsync_dataset.sh, which has the remote folder as a stage output. Note that the remote output folder address "remote://ahsoka_project_data/PS_141_test_output_data_folder" is expanded by my DVC configuration to the SSH remote "ssh://[email protected]:22/scratch/dvc_project_cache/PS/"; this URI is also seen in the log file. A sketch of such a config follows below.
- rsync_dataset.log: the verbose log from running dvc repro rsync_dataset.dvc -vf > rsync_dataset.log.
(EDIT - I totally forgot to attach the .sh, .dvc and .log files: rsync_dataset.log)
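For reference, here is a minimal sketch of the kind of DVC config that produces this expansion (the remote names match mine above; the exact URLs are illustrative only - see the README linked further down for my real setup):

```
['remote "ahsoka"']
url = ssh://[email protected]:22/scratch/dvc_project_cache/PS
['remote "ahsoka_project_data"']
url = remote://ahsoka/
```

Here remote://ahsoka_project_data/<path> resolves against the "ahsoka" remote, yielding the ssh:// URI seen in the log.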
rsync_dataset.sh
```bash
#!/bin/bash
#
# A script for copying the BDICG data in the
# "raster_data_cloudless_2018_12_06_10_15_29" folder containing the netCDF
# dataset per field to the PS data directory.
#
# Run the script locally. It connects to Ahsoka and initiates the copy.
echo '"rsync_data_set.sh" began...'
AHSOKA=ahsoka.vfltest.dk
BDICG_DATADIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/
PS_DIR=/scratch/dvc_users/fogh/PS/PS_141_test_output_data_folder/
ssh fogh@${AHSOKA} rsync -avh --stats --info=progress2 ${BDICG_DATADIR} ${PS_DIR}
echo '"rsync_data_set.sh" finished.'
```
rsync_dataset.dvc
```yaml
cmd: bash rsync_dataset.sh
deps:
- md5: 6d1ed0801f6d6d065b99195771b1cb92
  path: rsync_dataset.sh
outs:
- cache: true
  md5: c0049518603c8f0154509c54b4630238.dir
  metric: false
  path: remote://ahsoka_project_data/PS_141_test_output_data_folder
  persist: false
wdir: .
md5: 1cfcba0bc0eed0e8e5b6fd1122cb5cc8
```
To answer @mroutis (as you asked for in this message): my DVC cache is remote and configured as described here: https://github.com/PeterFogh/dvc_dask_use_case/blob/master/README.md. Regarding file size: on the remote server, the data folder "/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/" contains 10 netCDF files:
```
$ ls -lh
total 71M
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 0.nc
-rw-r--r-- 1 fogh hpcusers 6.5M May 15 13:03 1.nc
-rw-r--r-- 1 fogh hpcusers 4.7M May 15 13:03 2.nc
-rw-r--r-- 1 fogh hpcusers 1.4M May 15 13:03 3.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 4.nc
-rw-r--r-- 1 fogh hpcusers 3.1M May 15 13:03 5.nc
-rw-r--r-- 1 fogh hpcusers 3.5M May 15 13:03 6.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 7.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 8.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 9.nc
```
However, this is only a toy example; our actual pipeline has a folder containing approx. 300 netCDF files, each of a similar size (approx. 5 MB).
My DVC is installed using pip and my DVC version is:
```
> conda deactivate && conda activate py37_v3 && dvc version
DVC version: 0.40.2
Python version: 3.7.3
Platform: Linux-4.4.0-43-Microsoft-x86_64-with-debian-stretch-sid
```
Example runtime:
The runtime of dvc repro rsync_dataset.dvc -vf is:

```
$ time dvc repro rsync_dataset.dvc -v > rsync_dataset.log
real 2m30.600s
user 0m5.750s
sys 0m4.891s
```
and, as seen in rsync_dataset.log, the runtime of rsync_dataset.sh itself is less than 1 second. But as I mentioned on Discord, the runtime of our actual pipeline is 39 minutes for a folder of 1.8 GB with 287 files.
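As a rough sanity check of my own (assuming, as I suspect below, that each per-file checksum check opens a fresh SSH connection): 287 files x 3 checksum checks is about 861 SSH round trips, and 39 minutes / 861 is roughly 2.7 seconds per round trip, which is plausible if most of that time is SSH connection setup rather than the hashing itself.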
I have also computed the MD5 checksum of all the files on the remote - it takes less than 1 second:
```
$ time find PS_141_test_dependency_data_folder/ -type f -exec md5sum {} \;
74ff8e3c0bd44f6487840df0965ed5c3 PS_141_test_dependency_data_folder/7.nc
608572e058c9753026392bbfead38a95 PS_141_test_dependency_data_folder/0.nc
1f412bd9a4be8b5886aa2ec24b53ef48 PS_141_test_dependency_data_folder/9.nc
dd336099ddbd0fab05301719008a210b PS_141_test_dependency_data_folder/3.nc
5fa4a3f6665a918a4af5d8699d151956 PS_141_test_dependency_data_folder/8.nc
f9131d06bb09240e4dd2437735de506c PS_141_test_dependency_data_folder/5.nc
2deaef6a7c55f2ce10d94f3593373001 PS_141_test_dependency_data_folder/6.nc
6be099278fde604c716eae23e7f3b70a PS_141_test_dependency_data_folder/4.nc
51eac88b1630568765bd6e33b6e72ab7 PS_141_test_dependency_data_folder/2.nc
real 0m0.180s
user 0m0.163s
sys 0m0.017s
```
Suspicions for the slow runtime:
I suspect the slow runtime is partly because DVC performs MD5 checksum checks 3 times for each file in the remote folder:
```
DEBUG: cache 'ssh://[email protected]:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
DEBUG: cache 'ssh://[email protected]:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
DEBUG: checking if 'remote://ahsoka_project_data/PS_141_test_output_data_folder/0.nc'('{'md5': '608572e058c9753026392bbfead38a95'}') has changed.
```

- note that the logged cache line is the same both before and after executing the stage script.
I also suspect the slow runtime is due to the many SSH connections created - one for each file checksum.
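To illustrate the connection overhead, here is a sketch of my own (not what DVC does internally; host and paths are from the toy example above):

```bash
HOST=fogh@ahsoka.vfltest.dk
DIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder

# One SSH connection per file: pays the connection setup cost 10 times.
time for i in $(seq 0 9); do
  ssh "$HOST" md5sum "$DIR/$i.nc" > /dev/null
done

# One SSH connection for all files: pays the setup cost once.
time ssh "$HOST" "find '$DIR' -type f -exec md5sum {} \;" > /dev/null
```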
Question:
Is it possible to optimize the checksum checks of remote folders, to improve the runtime?
Discord context and some explanation of how we could help with this by implementing a 'state' DB for remote SSH files: https://discordapp.com/channels/485586884165107732/563406153334128681/577814546568314891
From private discussion:
Worth noting that the state database has a uniqueness constraint on inode, so there is a possible case where we have a state database with mixed local/SSH entries and we start overriding one with another. It would be desirable to handle this potential problem in this task.
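To make the collision concrete, a minimal sketch (hypothetical schema and values, not DVC's actual state table):

```bash
sqlite3 /tmp/state_sketch.db <<'SQL'
-- Hypothetical state table with a uniqueness constraint on inode.
CREATE TABLE state (
    inode INTEGER PRIMARY KEY,
    mtime TEXT NOT NULL,
    size  TEXT NOT NULL,
    md5   TEXT NOT NULL
);
-- A local file with inode 42...
INSERT OR REPLACE INTO state VALUES (42, '1557921780', '2949120',
    '608572e058c9753026392bbfead38a95');
-- ...and an SSH file that happens to report the same inode number:
-- it silently overrides the local entry.
INSERT OR REPLACE INTO state VALUES (42, '1557922000', '1048576',
    'ffffffffffffffffffffffffffffffff');
SQL
sqlite3 /tmp/state_sketch.db 'SELECT * FROM state;'  # only the second row survives
```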
I would look at how we can make it faster before using the state DB. I suspect there are some inefficiencies, like calling .exists() multiple times for each file or the like.
This is in a bad state now. A number of things need to be fixed:
1) make dvc repro -f not check outs beforehand
2) use the .dir ending for an SSH dir
...
Items 1 and 5 are for the first iteration, item 3 for the next step.
Explored point 4. It looks like the current approach works better for bigger files, while using a single command works better for smaller ones (~1 MB), especially lots of them. So the universal approach could be batching after collecting the way we do now.
Ran a few more SSH collect-dir scenarios; it looks like simply raising the number of connections (#2278) will help a lot. Batching might help even more; however, we don't want to batch bigger files, only small ones, so this requires some more thinking.
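A sketch of what such size-aware batching could look like (my own shell illustration, not a proposed implementation; the ~1 MB threshold comes from the observation above):

```bash
HOST=fogh@ahsoka.vfltest.dk
DIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder
: > /tmp/small.txt; : > /tmp/big.txt

# List remote files with their sizes over a single connection,
# then split them at the ~1 MB threshold.
ssh "$HOST" "find '$DIR' -type f -printf '%s %p\n'" |
while read -r size path; do
  if [ "$size" -lt 1048576 ]; then
    echo "$path" >> /tmp/small.txt
  else
    echo "$path" >> /tmp/big.txt
  fi
done

# Hash all small files in one batched SSH command...
[ -s /tmp/small.txt ] && xargs -a /tmp/small.txt ssh "$HOST" md5sum

# ...and each big file in its own command (these could run in
# parallel over several connections, cf. #2278).
while read -r path; do
  ssh "$HOST" md5sum "$path"
done < /tmp/big.txt
```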
@efiop After trying the newer versions of DVC, I'm happy with the speed. It can cache 5 GB in seconds.
Now, I just need the cache to work in my pipeline, see https://github.com/iterative/dvc/issues/2542.
Thus, I propose we close this issue :)