Dvc: Dependent directory not in dvc

Created on 30 Nov 2018 · 6 Comments · Source: iterative/dvc

dvc-version: 0.21.2
linux/ubuntu
python: 3.6.5

I have the following folder-structure (for image-classification):

data
|___ raw
|      |___ apple
|      |___ orange
|___ processed
|      |___ train
|      |     |___ apple
|      |     |___ orange
|      |___ test
|      |     |___ apple
|      |     |___ orange

The directories raw/apple and raw/orange contain image files.
split_dataset.py copies these files to the processed directory and randomly splits them into training and test images.
So my data preparation step is:

dvc run -d data/raw -d split_dataset.py -o data/processed -f data.dvc python split_dataset.py
The output is:

Adding 'data/processed' to 'data/.gitignore'.
Saving 'data/processed' to cache '.dvc/cache'.
Linking directory 'data/processed'.
Saving information to 'data.dvc'.

The content of data.dvc is

cmd: python3 split_dataset.py
deps:
- md5: 749f2c46188b40a30ac18106e466e543.dir
  path: data/raw
- md5: c46da892a15ff5c9425bd3fddfee1a14
  path: split_dataset.py
md5: f5144729533bc4ec9dd36ae3b5218fbd
outs:
- cache: true
  md5: a8c5831d07f8c882460d9567dfcf582b.dir
  path: data/processed

After doing dvc remote and dvc push ... output:

Preparing to push data to s3://...
[##############################] 100% Collecting information
(1/28): [##############################] 100% data/processed
(2/28): [##############################] 100% data/processed/train/apple/d0.jpg
(3/28): [##############################] 100% data/processed/train/apple/d9.jpg
(4/28): [##############################] 100% data/processed/train/orange/i05.jpg
(5/28): [##############################] 100% data/processed/test/apple/d4.jpg
(6/28): [##############################] 100% data/processed/train/apple/d1.jpg
(7/28): [##############################] 100% data/processed/train/apple/db.jpg
(8/28): [##############################] 100% data/processed/test/apple/d5.jpg
(9/28): [##############################] 100% data/processed/train/apple/d2.jpg
(10/28): [##############################] 100% data/processed/train/orange/i07.jpg
(11/28): [##############################] 100% data/processed/train/apple/dc.jpg
(12/28): [##############################] 100% data/processed/test/apple/da.jpg
(13/28): [##############################] 100% data/processed/train/apple/d3.jpg
(14/28): [##############################] 100% data/processed/train/orange/i08.jpg
(15/28): [##############################] 100% data/processed/train/orange/i00.jpg
(16/28): [##############################] 100% data/processed/train/orange/i09.jpg
(17/28): [##############################] 100% data/processed/train/apple/d6.jpg
(18/28): [##############################] 100% data/processed/train/orange/i01.jpg
(19/28): [##############################] 100% data/processed/test/orange/i03.jpg
(20/28): [##############################] 100% data/processed/train/orange/i10.jpg
(21/28): [##############################] 100% data/processed/train/apple/d7.jpg
(22/28): [##############################] 100% data/processed/train/orange/i02.jpg
(23/28): [##############################] 100% data/processed/test/orange/i06.jpg
(24/28): [##############################] 100% data/processed/train/orange/i04.jpg
(25/28): [##############################] 100% data/processed/train/apple/d8.jpg
(26/28): [##############################] 100% data/processed/train/orange/s2.jpg
(27/28): [##############################] 100% data/processed/train/orange/s3.jpg
(28/28): [##############################] 100% data/processed/test/orange/s1.jpg

I check out the project on a different machine with git clone and dvc pull.
The directory data/raw is not created.

The directory is given in the dependencies (-d) and is listed in the .dvc file, but it is somehow ignored.

question

All 6 comments

Hi, @stvogel !

I wrote this little script to reproduce your issue:

#!/usr/bin/env bash

rm -rf repo
mkdir repo
cd repo

git init
dvc init

mkdir -p data/raw/{apple,orange}
mkdir -p data/processed/train/{apple,orange}

echo "apple" > data/raw/apple/image.jpg
echo "orange" > data/raw/orange/image.jpg

dvc run -f process.dvc \
        -d data/raw \
        -o data/processed \
        'cp data/raw/apple/image.jpg data/processed/train/apple/'


rm -rf /tmp/dvc-cache
dvc remote add -d myremote /tmp/dvc-cache
dvc push

rm -rf data/*

dvc pull

ls -lah data

# Permissions Size Name
# .rw-r--r--   10  .gitignore
# drwxr-xr-x    -  processed

Indeed, it only checks out the processed directory, even though raw is specified as a dependency.
I'm not sure whether this is a bug or intended behavior; maybe @efiop can confirm.

The code only takes into account the "outputs" during the checkout:
https://github.com/iterative/dvc/blob/ff577d7a8751a91d2a0fee06f6976ff781d871b7/dvc/stage.py#L500-L502
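
Since only outputs are checked out, it can help to see which dependencies are not an output of any stage. The following is a minimal sketch, not part of dvc: list_uncached_deps is a hypothetical helper that reads stage files in the simple layout shown in the question and prints every deps path that none of the given files lists under outs. It assumes paths are relative to the directory containing the stage file and ignores any other fields.

```shell
#!/usr/bin/env bash
# list_uncached_deps STAGE_FILE...
# Hypothetical helper (not a dvc command): prints every path found under
# "deps:" that none of the given stage files lists under "outs:".
list_uncached_deps() {
    awk '
        # Assumption: paths are relative to the directory of their stage file.
        function resolve(file, path,    dir) {
            dir = file
            sub(/[^\/]*$/, "", dir)   # drop the file name, keep "some/dir/" or ""
            sub(/^\.\//, "", dir)     # normalise a leading "./"
            return dir path
        }
        FNR == 1 { section = "" }             # new file: forget the previous section
        /^deps:/ { section = "deps"; next }
        /^outs:/ { section = "outs"; next }
        /^[^ -]/ { section = ""; next }       # any other top-level key ends the section
        /^[ -]*path:/ {
            sub(/^[ -]*path: */, "")
            if (section == "deps") deps[resolve(FILENAME, $0)] = 1
            if (section == "outs") outs[resolve(FILENAME, $0)] = 1
        }
        END {
            for (p in deps) if (!(p in outs)) print p
        }
    ' "$@" | sort
}
```

Called as list_uncached_deps data.dvc on the stage file from the question, this would report data/raw and split_dataset.py. The script is expected there because it lives in Git; data/raw is the entry that neither Git nor dvc stores.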

Hi @stvogel !

@mroutis is right, it is intended behavior. Dependencies are tracked (e.g. changes to their checksums are detected) but not automatically cached by dvc. You first need to add them to dvc, i.e. specify them as an output of some stage. For example, dvc add data creates a stage that has no dependencies and no command but has data specified as an output, which makes it tracked and cached. So if you were to run

dvc add data/raw

right now (no need to re-run anything), it would make everything work 🙂

Thanks,
Ruslan
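
For reference, the whole fix could be sketched roughly as follows. This is a hedged outline rather than an official recipe: track_input and run are hypothetical helper names, and the git add step assumes that the new stage file and the updated .gitignore end up in the same directory as the target (as the 'data/.gitignore' line in the output above suggests). Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: put a raw input directory under dvc control and publish it.

# run CMD ARGS...: execute the command, or only print it when DRY_RUN=1.
# The printed form is for display only (argument quoting is not preserved).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# track_input TARGET: add TARGET to dvc, commit the metadata, push the data.
track_input() {
    local target="$1"
    run dvc add "$target"
    # Assumption: the stage file and .gitignore are written next to the target.
    # For a top-level target this becomes "git add .", which stages everything.
    run git add "$(dirname "$target")"
    run git commit -m "Track $target with DVC"
    run dvc push
}
```

A dry run with DRY_RUN=1 track_input data/raw lists the four commands without touching the repository. After a real run, git clone followed by dvc pull on another machine should be able to restore data/raw as well, because it is now an output of a stage.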

Oh indeed, I thought I had tried that and got a message that "data/raw" was already under dvc control.
But you're right. It works.
So the "first" data in a pipeline (data that is not the output of any stage) should in general be added manually.
OK, I should keep that in mind.

Thanks a lot for the quick answer.
Did I say that DVC really helps a lot and solved one of my real PITA?

@stvogel Glad it works for you :slightly_smiling_face:

Did I say that DVC really helps a lot and solved one of my real PITA?

We would love to hear more about the way dvc helps in your scenarios. Would you mind elaborating? :slightly_smiling_face:

Thanks,
Ruslan

Hi Ruslan,

my focus is on deep learning and image recognition. Along with that comes a large number of training images and models that are really large (80MB up to 500MB).
My large models and all the image files don't fit into the corporate Git, so I'm now using DVC to version my model and the corresponding image files, storing the large data in S3 and the code in Git.
The combination of code and data, or Git and DVC respectively, was something I had been searching for for a long time.
And DVC does this in such an easy and flexible way that I can hardly imagine the hassle I had before.

@stvogel Glad to hear that dvc is so useful in your scenario :slightly_smiling_face: Thanks for all the feedback!
