I have a similar issue to #2678 but for GS.
I have a bucket with the following structure:
my_bucket
├── data
│   ├── img1.png
│   ├── img2.png
│   └── ...
└── cache
I have then created a clean project
$ git init
$ dvc init
$ dvc remote add gscache gs://my_bucket/cache
$ dvc config cache.gs gscache
$ dvc add gs://my_bucket/data
The output is as follows:
100%|██████████|Add 1/1 [00:00<00:00, 1.21file/s]
ERROR: output 'gs://my_bucket/data' does not exist
Adding a single file works (dvc add gs://my_bucket/data/img1.png).
A more verbose version:
$ dvc add gs://my_bucket/data -v
DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;
100%|██████████|Add 1/1 [00:01<00:00, 1.63s/file]
DEBUG: SELECT count from state_info WHERE rowid=?
DEBUG: fetched: [(0,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: output 'gs://my_bucket/data' does not exist
------------------------------------------------------------
Traceback (most recent call last):
File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/command/add.py", line 25, in run
fname=self.args.file,
File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/__init__.py", line 35, in wrapper
ret = f(repo, *args, **kwargs)
File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
result = method(repo, *args, **kw)
File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/add.py", line 53, in add
stage.save()
File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage.py", line 716, in save
out.save()
File "/home/egnyte/anaconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 219, in save
raise self.DoesNotExistError(self)
dvc.output.base.OutputDoesNotExistError: output 'gs://my_bucket/data' does not exist
------------------------------------------------------------
dvc --version = 0.68.1. I am on Ubuntu, installed via conda, with Python 3.7.5.
Hi @willypicard!
I am using Kubeflow to preprocess datasets, and I would like to use DVC as a tool to manage my datasets and models. I have large datasets stored on GCS, and I would like them to be versioned and to stay on GCS. So it would be convenient to provide the directory containing the dataset instead of each file in it (some of my datasets contain hundreds of thousands of files).
How big are those datasets? Just checking whether you are also aware of the possibility of mounting that bucket through s3fuse and working with it as with any local files.
I have a dataset that is 250 GB. So rather large...
S3fuse might be an option. However, it would seem more natural to be able to use dvc add gs://my_bucket/mydataset, just as we can locally. It would also be a great feature for cloud-based tools such as Kubeflow.
And obviously it is also possible to use gsutil ls -r gs://my_bucket/data to retrieve the list of files and run dvc add on each of them, but that is far from elegant.
@willypicard Got it. Makes sense, let's implement it. Unfortunately, we don't have enough space in the current sprint, so if you are willing to give it a shot, we'll try to help with everything we can. It is really not complex, as we already have all the generalized logic in place, so the only things one would need to implement are:
- Make RemoteGS.exists() support directories
- Implement RemoteGS.walk_files()
- Implement RemoteGS.isdir()
One could look at s3.py in https://github.com/iterative/dvc/pull/2619/files as an example. Let us know what you think. Thanks for the feedback!
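The three methods above all hinge on the same trick, since GCS (like S3) has no real directories: a "directory" is just a shared key prefix. Here is a rough, hypothetical sketch of that semantics. Real code would call google.cloud.storage's Bucket.list_blobs(prefix=...) inside the actual RemoteGS class; a plain set of object keys stands in for the bucket below so the logic is runnable, and all names are illustrative rather than DVC's actual internals.

```python
# Directory semantics over a flat object store (sketch, not DVC code).
# `keys` simulates the object names in a bucket; a real implementation
# would list them via google.cloud.storage.Bucket.list_blobs(prefix=...).

def isdir(keys, path):
    # A "directory" exists if any object key lives under "<path>/".
    prefix = path.rstrip("/") + "/"
    return any(k.startswith(prefix) for k in keys)

def exists(keys, path):
    # True for either a concrete object or a directory-like prefix.
    return path in keys or isdir(keys, path)

def walk_files(keys, path):
    # Recursively yield every object key under the prefix.
    prefix = path.rstrip("/") + "/"
    return sorted(k for k in keys if k.startswith(prefix))

keys = {"data/img1.png", "data/img2.png", "data/sub/img3.png"}
print(exists(keys, "data"))            # True: prefix with children
print(isdir(keys, "data/img1.png"))    # False: plain object, not a prefix
print(walk_files(keys, "data"))        # all three image keys
```

This is also why dvc add gs://my_bucket/data fails in the report above while adding a single file works: data is not an object, only a prefix, so an exists() check that queries the key directly returns False.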