Hi @efiop and @mroutis,
The problem:
As discussed on Discord (see from this message and the next 14 messages), I have tried out the new feature (#1654) of using remote folders as DVC stage dependencies and outputs; however, I found the runtime extremely slow.
My test example:
I have created the following example to test out the reason for the slow runtime. It contains:

- rsync_dataset.sh: a local bash script which performs an rsync of a remote folder over SSH.
- rsync_dataset.dvc: the DVC stage for executing rsync_dataset.sh, which has the remote folder as a stage output. Note that the remote output folder address "remote://ahsoka_project_data/PS_141_test_output_data_folder" is expanded by my DVC configuration to the SSH remote "ssh://[email protected]:22/scratch/dvc_project_cache/PS/"; this URI is also seen in the log file. A sketch of such a config follows below.
- rsync_dataset.log: the verbose log from running dvc repro rsync_dataset.dvc -vf > rsync_dataset.log.
(EDIT - I totally forgot to attach the .sh, .dvc and .log files: rsync_dataset.log)
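For reference, here is a minimal sketch of the kind of DVC config that produces this expansion (the remote names match mine above; the exact URLs are illustrative only - see the README linked further down for my real setup):

```
['remote "ahsoka"']
url = ssh://[email protected]:22/scratch/dvc_project_cache/PS
['remote "ahsoka_project_data"']
url = remote://ahsoka/
```

Here remote://ahsoka_project_data/<path> resolves against the "ahsoka" remote, yielding the ssh:// URI seen in the log.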
rsync_dataset.sh
```bash
#!/bin/bash
#
# A script for copying the BDICG data in the
# "raster_data_cloudless_2018_12_06_10_15_29" folder containing the netCDF
# dataset per field to the PS data directory.
#
# Run the script locally. It connects to Ahsoka and initiates the copy.
echo '"rsync_data_set.sh" began...'
AHSOKA=ahsoka.vfltest.dk
BDICG_DATADIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/
PS_DIR=/scratch/dvc_users/fogh/PS/PS_141_test_output_data_folder/
ssh fogh@${AHSOKA} rsync -avh --stats --info=progress2 ${BDICG_DATADIR} ${PS_DIR}
echo '"rsync_data_set.sh" finished.'
```
rsync_dataset.dvc
```yaml
cmd: bash rsync_dataset.sh
deps:
- md5: 6d1ed0801f6d6d065b99195771b1cb92
  path: rsync_dataset.sh
outs:
- cache: true
  md5: c0049518603c8f0154509c54b4630238.dir
  metric: false
  path: remote://ahsoka_project_data/PS_141_test_output_data_folder
  persist: false
wdir: .
md5: 1cfcba0bc0eed0e8e5b6fd1122cb5cc8
```
To answer @mroutis (as you asked for in this message): my DVC cache is remote and configured as described here: https://github.com/PeterFogh/dvc_dask_use_case/blob/master/README.md. Regarding file size: on the remote server, the data folder "/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder/" contains 10 netCDF files:
```
$ ls -lh
total 71M
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 0.nc
-rw-r--r-- 1 fogh hpcusers 6.5M May 15 13:03 1.nc
-rw-r--r-- 1 fogh hpcusers 4.7M May 15 13:03 2.nc
-rw-r--r-- 1 fogh hpcusers 1.4M May 15 13:03 3.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 4.nc
-rw-r--r-- 1 fogh hpcusers 3.1M May 15 13:03 5.nc
-rw-r--r-- 1 fogh hpcusers 3.5M May 15 13:03 6.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 7.nc
-rw-r--r-- 1 fogh hpcusers 2.8M May 15 13:03 8.nc
-rw-r--r-- 1 fogh hpcusers 3.0M May 15 13:03 9.nc
```
However, this is only a toy example; our actual pipeline has a folder containing approx. 300 netCDF files, each of a similar size (approx. 5 MB).
My DVC is installed using pip and my DVC version is:
```
> conda deactivate && conda activate py37_v3 && dvc version
DVC version: 0.40.2
Python version: 3.7.3
Platform: Linux-4.4.0-43-Microsoft-x86_64-with-debian-stretch-sid
```
Example runtime:
The runtime of dvc repro rsync_dataset.dvc -vf is:

```
$ time dvc repro rsync_dataset.dvc -v > rsync_dataset.log
real 2m30.600s
user 0m5.750s
sys 0m4.891s
```
and, as seen in rsync_dataset.log, the runtime of rsync_dataset.sh itself is less than 1 second. But as I mentioned on Discord, the runtime of our actual pipeline is 39 minutes for a folder of 1.8 GB with 287 files.
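As a rough sanity check of my own (assuming, as I suspect below, that each per-file checksum check opens a fresh SSH connection): 287 files x 3 checksum checks is about 861 SSH round trips, and 39 minutes / 861 is roughly 2.7 seconds per round trip, which is plausible if most of that time is SSH connection setup rather than the hashing itself.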
I have also computed the MD5 checksum of all the files on the remote - it takes less than 1 second:
```
$ time find PS_141_test_dependency_data_folder/ -type f -exec md5sum {} \;
74ff8e3c0bd44f6487840df0965ed5c3 PS_141_test_dependency_data_folder/7.nc
608572e058c9753026392bbfead38a95 PS_141_test_dependency_data_folder/0.nc
1f412bd9a4be8b5886aa2ec24b53ef48 PS_141_test_dependency_data_folder/9.nc
dd336099ddbd0fab05301719008a210b PS_141_test_dependency_data_folder/3.nc
5fa4a3f6665a918a4af5d8699d151956 PS_141_test_dependency_data_folder/8.nc
f9131d06bb09240e4dd2437735de506c PS_141_test_dependency_data_folder/5.nc
2deaef6a7c55f2ce10d94f3593373001 PS_141_test_dependency_data_folder/6.nc
6be099278fde604c716eae23e7f3b70a PS_141_test_dependency_data_folder/4.nc
51eac88b1630568765bd6e33b6e72ab7 PS_141_test_dependency_data_folder/2.nc
real 0m0.180s
user 0m0.163s
sys 0m0.017s
```
Suspicions for the slow runtime:
I suspect the slow runtime is partly because DVC performs MD5 checksum checks 3 times for each file in the remote folder:
```
DEBUG: cache 'ssh://[email protected]:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
DEBUG: cache 'ssh://[email protected]:22/scratch/dvc_project_cache/PS/60/8572e058c9753026392bbfead38a95' expected '608572e058c9753026392bbfead38a95' actual '608572e058c9753026392bbfead38a95'
DEBUG: checking if 'remote://ahsoka_project_data/PS_141_test_output_data_folder/0.nc'('{'md5': '608572e058c9753026392bbfead38a95'}') has changed.
```

- note that the logged cache line is the same both before and after executing the stage script.
I also suspect the slow runtime is due to the many SSH connections created - one for each file checksum.
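To illustrate the connection overhead, here is a sketch of my own (not what DVC does internally; host and paths are from the toy example above):

```bash
HOST=fogh@ahsoka.vfltest.dk
DIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder

# One SSH connection per file: pays the connection setup cost 10 times.
time for i in $(seq 0 9); do
  ssh "$HOST" md5sum "$DIR/$i.nc" > /dev/null
done

# One SSH connection for all files: pays the setup cost once.
time ssh "$HOST" "find '$DIR' -type f -exec md5sum {} \;" > /dev/null
```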
Question:
Is it possible to optimize the checksum checks of remote folders, to improve the runtime?
Discord context and some explanation of how we could help with this by implementing a 'state' DB for remote SSH files: https://discordapp.com/channels/485586884165107732/563406153334128681/577814546568314891
From private discussion:
Worth noting that the state database has a uniqueness constraint on inode, so there is a possible case where we have a state database with mixed local/SSH entries and we start overriding one with another. It would be desirable to handle this potential problem in this task.
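To make the collision concrete, a minimal sketch (hypothetical schema and values, not DVC's actual state table):

```bash
sqlite3 /tmp/state_sketch.db <<'SQL'
-- Hypothetical state table with a uniqueness constraint on inode.
CREATE TABLE state (
    inode INTEGER PRIMARY KEY,
    mtime TEXT NOT NULL,
    size  TEXT NOT NULL,
    md5   TEXT NOT NULL
);
-- A local file with inode 42...
INSERT OR REPLACE INTO state VALUES (42, '1557921780', '2949120',
    '608572e058c9753026392bbfead38a95');
-- ...and an SSH file that happens to report the same inode number:
-- it silently overrides the local entry.
INSERT OR REPLACE INTO state VALUES (42, '1557922000', '1048576',
    'ffffffffffffffffffffffffffffffff');
SQL
sqlite3 /tmp/state_sketch.db 'SELECT * FROM state;'  # only the second row survives
```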
I would look at how we can make it faster before using the state DB. I suspect there are some inefficiencies, like calling .exists() multiple times for each file or the like.
This is in a bad state now. A number of things need to be fixed:
1) make dvc repro -f not check outs beforehand
2) use the .dir ending for an SSH dir
...
Items 1 and 5 are for the first iteration, item 3 for the next step.
Explored point 4. It looks like the current approach works better for bigger files, while using a single command works better for smaller ones (~1 MB), especially lots of them. So the universal approach could be batching after collecting the way we do now.
Ran a few more SSH collect-dir scenarios; it looks like simply raising the number of connections (#2278) will help a lot. Batching might help even more; however, we don't want to batch bigger files, only small ones, so this requires some more thinking.
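A sketch of what such size-aware batching could look like (my own shell illustration, not a proposed implementation; the ~1 MB threshold comes from the observation above):

```bash
HOST=fogh@ahsoka.vfltest.dk
DIR=/scratch/dvc_users/fogh/PS/PS_141_test_dependency_data_folder
: > /tmp/small.txt; : > /tmp/big.txt

# List remote files with their sizes over a single connection,
# then split them at the ~1 MB threshold.
ssh "$HOST" "find '$DIR' -type f -printf '%s %p\n'" |
while read -r size path; do
  if [ "$size" -lt 1048576 ]; then
    echo "$path" >> /tmp/small.txt
  else
    echo "$path" >> /tmp/big.txt
  fi
done

# Hash all small files in one batched SSH command...
[ -s /tmp/small.txt ] && xargs -a /tmp/small.txt ssh "$HOST" md5sum

# ...and each big file in its own command (these could run in
# parallel over several connections, cf. #2278).
while read -r path; do
  ssh "$HOST" md5sum "$path"
done < /tmp/big.txt
```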
@efiop After trying the newer versions of DVC, I'm happy with the speed. It can cache 5 GB in seconds.
Now, I just need the cache to work in my pipeline, see https://github.com/iterative/dvc/issues/2542.
Thus, I propose we close this issue :)