There are two files in the bucket: folder/data/data.csv and folder/datasets.md.

dvc remote add -f s3 s3://dvc-temp/folder
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
dvc run -d remote://s3/data 'echo hello world'
Running command:
echo hello world
hello world
ERROR: unexpected error - '/folder/datasets.md' does not start with '/folder/data'
$ dvc version
Python version: 3.6.6
Platform: Linux-5.3.12-arch1-1-x86_64-with-arch
Binary: False
Package: None
Filesystem type (cache directory): ('ext4', '/dev/sda9')
Filesystem type (workspace): ('ext4', '/dev/sda9')
#! /usr/bin/env bash
export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'
moto_server s3 &> /dev/null &
python -c '
import boto3
session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="folder/data/data.csv")
s3.put_object(Bucket="dvc-temp", Key="folder/datasets.md", Body="### Datasets")
'
temp=$(mktemp -d)
cd $temp
dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
dvc run -d remote://s3/data 'echo hello world'
The walk_files implementation in RemoteS3 lists keys via the raw prefix instead of <prefix>/ when walking files. Either walk_files should be given a directory path, or it should append the trailing slash itself:
https://github.com/iterative/dvc/blob/0404a2324e497667a8b7d0ab0bd2b37db8c97e4c/dvc/remote/s3.py#L282
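To illustrate (a minimal standalone sketch of S3 prefix semantics, not DVC code): S3 has no real directories, so a prefix listing is pure string matching, and the raw prefix folder/data also matches the sibling key folder/datasets.md — which is exactly what triggers the "does not start with" error above. A trailing slash restricts the listing to the directory's contents.

```python
# Sketch of list_objects-style prefix matching with plain strings (no boto3).
keys = ["folder/data/data.csv", "folder/datasets.md"]

def list_by_prefix(keys, prefix):
    """Mimic an S3 prefix listing: plain startswith over the keys."""
    return [k for k in keys if k.startswith(prefix)]

# Raw prefix pulls in the sibling file, which later fails relpath checks:
print(list_by_prefix(keys, "folder/data"))   # both keys match
# With the trailing slash, only the directory's contents match:
print(list_by_prefix(keys, "folder/data/"))  # only folder/data/data.csv
```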
Or, I'd prefer it to be handled when collecting the directory.
https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/base.py#L196
exists also looks flawed. Say you have data/subdir-file.txt and data/subdir/1 files. When adding data/subdir, the first listed result could be data/subdir-file.txt, which fails both the exact-match and the startswith check; therefore exists() will return False, even though data/subdir does exist. One way to fix this would be to loop through _list_paths() till it finds the exact match (not sure how expensive this will be):
https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/s3.py#L208-L211
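A small sketch of why the sibling file comes back first (assumed S3 behavior, not DVC code): S3 returns keys in lexicographic UTF-8 byte order, and "-" (0x2D) sorts before "/" (0x2F), so data/subdir-file.txt precedes everything under data/subdir/ and a single-item listing inspects the wrong key.

```python
# S3 lists keys in lexicographic (UTF-8 byte) order; "-" sorts before "/",
# so the sibling file precedes the directory contents in the listing.
keys = sorted(["data/subdir/1", "data/subdir-file.txt"])

# The first key matching prefix "data/subdir" is the *sibling*, not the dir:
first = next(k for k in keys if k.startswith("data/subdir"))
print(first)  # data/subdir-file.txt

# The single-item check then fails both tests, although data/subdir/
# does exist (data/subdir/1 is a real key):
print(first == "data/subdir" or first.startswith("data/subdir/"))  # False
```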
The WIP implementation for Google Cloud Storage (#2853, more specifically https://github.com/iterative/dvc/pull/2853#discussion_r351796322) also suffers from the same issue.
~For now, I have reverted any changes related to this in the PR, as it has already ballooned in size. I'd opt for fixing this in a follow-up PR.~
The following patch is working for me (might need additional improvement):
diff --git a/dvc/remote/s3.py b/dvc/remote/s3.py
index 8013f9bb..7494c96c 100644
--- a/dvc/remote/s3.py
+++ b/dvc/remote/s3.py
@@ -209,8 +209,13 @@ class RemoteS3(RemoteBASE):
     def exists(self, path_info):
         dir_path = path_info / ""
-        fname = next(self._list_paths(path_info, max_items=1), "")
-        return path_info.path == fname or fname.startswith(dir_path.path)
+
+        if self.isdir(dir_path):
+            return True
+
+        for fname in self._list_paths(path_info):
+            if path_info.path == fname:
+                return True
+
+        return False
+
     def makedirs(self, path_info):
         # We need to support creating empty directories, which means
@@ -279,7 +284,7 @@ class RemoteS3(RemoteBASE):
         )
     def walk_files(self, path_info, max_items=None):
-        for fname in self._list_paths(path_info, max_items):
+        for fname in self._list_paths(path_info / "", max_items):
             if fname.endswith("/"):
                 continue
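As a rough model of what the patch changes (plain-Python stand-ins for _list_paths/isdir over sorted keys, not the real RemoteS3):

```python
# Toy model of the patched exists() logic; list_paths() is a hypothetical
# stand-in for an S3 prefix listing over lexicographically sorted keys.
def list_paths(keys, prefix):
    return (k for k in sorted(keys) if k.startswith(prefix))

def old_exists(keys, path):
    """Pre-patch logic: inspect only the first listed key."""
    fname = next(list_paths(keys, path), "")
    return fname == path or fname.startswith(path + "/")

def new_exists(keys, path):
    """Patched logic: isdir check first, then scan for an exact key match."""
    if any(list_paths(keys, path + "/")):  # isdir(path_info / "")
        return True
    return any(fname == path for fname in list_paths(keys, path))

keys = ["data/subdir-file.txt", "data/subdir/1"]
print(old_exists(keys, "data/subdir"))  # False: sibling file listed first
print(new_exists(keys, "data/subdir"))  # True: data/subdir/1 lives under it
```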
@skshetry Great catch! Mind creating a PR with that? 🙂 We could then do the same in gs. CC @mroutis
Script to reproduce for 2nd issue:
#! /usr/bin/env bash
export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'
moto_server s3 &> /dev/null &
python -c '
import boto3
session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir-file.txt", Body="### Subdir")
s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir/1", Body="")
'
temp=$(mktemp -d)
cd $temp
dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
dvc add remote://s3/data/subdir