Dvc: gs: support directories as external dependencies/outputs

Created on 19 Nov 2019  ·  6 comments  ·  Source: iterative/dvc

I have a similar issue to #2678 but for GS.

I have a bucket with the following structure

my_bucket
├── data
│   ├── img1.png
│   ├── img2.png
│   └── ...
└── cache

I then created a clean project:

$ git init
$ dvc init
$ dvc remote add gscache gs://my_bucket/cache
$ dvc config cache.gs gscache
$ dvc add gs://my_bucket/data

The output is as follows:

100%|██████████|Add 1/1 [00:00<00:00,  1.21file/s]
ERROR: output 'gs://my_bucket/data' does not exist

Adding a single file works (dvc add gs://my_bucket/data/img1.png).

A more verbose version:

$ dvc add gs://my_bucket/data -v 
DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;
100%|██████████|Add 1/1 [00:01<00:00,  1.63s/file]
DEBUG: SELECT count from state_info WHERE rowid=?
DEBUG: fetched: [(0,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: output 'gs://my_bucket/data' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/command/add.py", line 25, in run
    fname=self.args.file,
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/__init__.py", line 35, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/add.py", line 53, in add
    stage.save()
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage.py", line 716, in save
    out.save()
  File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 219, in save
    raise self.DoesNotExistError(self)
dvc.output.base.OutputDoesNotExistError: output 'gs://my_bucket/data' does not exist
------------------------------------------------------------

dvc --version is 0.68.1, installed via conda on Ubuntu, with Python 3.7.5.

Labels: awaiting response, feature request, good first issue, help wanted, p2-medium

All 6 comments

Hi @willypicard !

#2678 is about a specific bug we have in directory support for S3. The issue you are reporting is related to https://github.com/iterative/dvc/issues/1654, as we don't currently support GS directories as external outputs or dependencies. Maybe you could elaborate on your scenario, so we can better understand whether support for GS directories is what you really need? 🙂

I am using Kubeflow to preprocess datasets, and I would like to use DVC to manage my datasets and models. I have large datasets stored on GCS and I would like them to be versioned and to stay on GCS. It would be convenient to provide the directory containing the dataset instead of each file in it (some of my datasets contain hundreds of thousands of files).

How big are those datasets? Just checking if you are also aware of the possibility to mount that bucket through s3fuse and work with it as with any local files. 🙂
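For a GCS bucket the FUSE tool would be gcsfuse rather than s3fuse; a minimal sketch of that mount-based workflow, with an illustrative mount point:

$ mkdir -p /mnt/my_bucket
$ gcsfuse my_bucket /mnt/my_bucket
$ dvc add /mnt/my_bucket/data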

I have a dataset that is 250 GB. So rather large...
S3fuse might be an option. However, it would seem "natural" to be able to use dvc add gs://my_bucket/mydataset, just as we can locally. And it would be a great feature for cloud-based tools such as Kubeflow.

And obviously it is also possible to use gsutil ls -r gs://my_bucket/data to retrieve the list of files and run dvc add on each of them, but that is utterly inelegant.
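Spelled out, that workaround would be a shell loop along these lines (gsutil ls -r also prints subdirectory headers ending in ':' and blank lines, which the greps filter away):

$ gsutil ls -r gs://my_bucket/data | grep 'gs://' | grep -v ':$' \
      | while read -r f; do dvc add "$f"; done

One dvc add invocation per object is exactly why this does not scale to hundreds of thousands of files.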

@willypicard Got it. Makes sense, let's implement it. 🤝 Unfortunately, we don't have enough space in the current sprint, so if you are willing to give it a shot, we'll try to help with everything we can. It is really not complex, as we already have all the generalized logic in place, so the only things one would need to implement are:

- Make RemoteGS.exists() support directories
- Implement RemoteGS.walk_files()
- Implement RemoteGS.isdir()

One could look at s3.py in https://github.com/iterative/dvc/pull/2619/files as an example. 🙂 Let us know what you think. Thanks for the feedback!
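For reference, a rough sketch of what those three primitives could look like on top of the google-cloud-storage client. The class name, constructor, and bucket/path arguments below are simplified stand-ins for DVC's internal RemoteGS plumbing, not the eventual implementation:

from google.cloud import storage


class GSDirectorySketch:
    """Illustrative stand-in for directory support in DVC's RemoteGS."""

    def __init__(self, client=None):
        # An authenticated google-cloud-storage client.
        self.gs = client or storage.Client()

    def _list_paths(self, bucket, prefix):
        # GS has no real directories: listing objects by prefix is
        # the only way to discover a "directory's" contents.
        for blob in self.gs.bucket(bucket).list_blobs(prefix=prefix):
            yield blob.name

    def isdir(self, bucket, path):
        # A path is a "directory" if at least one object lives
        # under "<path>/".
        return any(self._list_paths(bucket, path.rstrip("/") + "/"))

    def exists(self, bucket, path):
        # True for an exact object match (a file) or for any object
        # under the "<path>/" prefix (a directory).
        prefix = path.rstrip("/") + "/"
        for name in self._list_paths(bucket, path):
            if name == path or name.startswith(prefix):
                return True
        return False

    def walk_files(self, bucket, path):
        # Yield every object key under the directory, skipping the
        # zero-byte "folder marker" objects some tools create.
        for name in self._list_paths(bucket, path.rstrip("/") + "/"):
            if not name.endswith("/"):
                yield name

With those primitives in place, the generalized directory logic that DVC already uses for local and S3 outputs should be able to enumerate and checksum gs://my_bucket/data as a whole.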
