Dvc: Cannot add an S3 directory when a sibling file's name starts with the directory name

Created on 30 Nov 2019 · 4 comments · Source: iterative/dvc

Steps to reproduce

  1. Upload two files to S3: folder/data/data.csv and folder/datasets.md.
  2. Setup remotes and caches.
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
  3. dvc run -d remote://s3/data 'echo hello world'

Outcome

Running command:
        echo hello world
hello world
ERROR: unexpected error - '/folder/datasets.md' does not start with '/folder/data'
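The "does not start with" message is what `pathlib`'s `relative_to` raises when one path is not under another; a minimal sketch of the symptom (illustration only — DVC uses its own `PathInfo` classes, and the exact `ValueError` wording differs across Python versions):

```python
from pathlib import PurePosixPath

# Keys are listed by the bare prefix "folder/data", so the sibling key
# "folder/datasets.md" comes back too; making it relative to the
# dependency path then fails, mirroring the error above.
error = None
try:
    PurePosixPath("/folder/datasets.md").relative_to("/folder/data")
except ValueError as exc:
    error = str(exc)
print(error)
```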

Version

$ dvc version
Python version: 3.6.6
Platform: Linux-5.3.12-arch1-1-x86_64-with-arch
Binary: False
Package: None
Filesystem type (cache directory): ('ext4', '/dev/sda9')
Filesystem type (workspace): ('ext4', '/dev/sda9')

Script to reproduce

#! /usr/bin/env bash

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

moto_server s3 &> /dev/null &

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")

s3.put_object(Bucket="dvc-temp", Key="folder/data/data.csv")
s3.put_object(Bucket="dvc-temp", Key="folder/datasets.md", Body="### Datasets")
'

temp=$(mktemp -d)
cd $temp

dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache

dvc run -d remote://s3/data 'echo hello world'

Analysis:

  1. This is due to the walk_files implementation in RemoteS3 listing by the bare prefix instead of <prefix>/ when walking files. Either walk_files should be given a directory path, or it should append the trailing slash itself.

https://github.com/iterative/dvc/blob/0404a2324e497667a8b7d0ab0bd2b37db8c97e4c/dvc/remote/s3.py#L282

Or, I'd prefer it to be handled when collecting the directory.
https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/base.py#L196
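The prefix semantics can be seen without touching S3 at all — ListObjects filters keys by a plain string prefix, so a bare prefix also matches sibling keys (a toy illustration, not DVC code):

```python
# Toy model of S3 listing: keys are filtered by plain string prefix,
# so the bare prefix "folder/data" also matches a sibling key.
keys = ["folder/data/data.csv", "folder/datasets.md"]

bare = [k for k in keys if k.startswith("folder/data")]
slashed = [k for k in keys if k.startswith("folder/data/")]

print(bare)     # both keys match the bare prefix
print(slashed)  # only the real directory contents match "folder/data/"
```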

  2. Again, the logic of exists looks flawed. Say you have data/subdir-file.txt and data/subdir/1. When adding data/subdir, the first listed result is data/subdir-file.txt (S3 lists keys lexicographically, and "-" sorts before "/"), which is neither an exact match nor under data/subdir/, so the single-item check gives the wrong answer for a directory that does exist.
    So, the function should check whether it's a directory, and otherwise loop through all results of _list_paths() until it finds an exact match (not sure how expensive this will be).

https://github.com/iterative/dvc/blob/caa67c725e1e351ed122bdad17db0f29a8e73c39/dvc/remote/s3.py#L208-L211
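A toy version of that single-item check, with plain strings standing in for `PathInfo` (a hypothetical simplification of the snippet linked above, not the actual code):

```python
# S3 lists keys in lexicographic (UTF-8 byte) order, and "-" sorts
# before "/", so subdir-file.txt is the first key under the prefix.
keys = sorted(["data/subdir-file.txt", "data/subdir/1"])

path = "data/subdir"
fname = next((k for k in keys if k.startswith(path)), "")

# The single-item check: exact match, or first key under "data/subdir/".
found = path == fname or fname.startswith(path + "/")
print(fname, found)  # the directory data/subdir is real, yet found is False
```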

bug p0-critical

All 4 comments

The WIP implementation for Google Cloud Storage (#2853, more specifically https://github.com/iterative/dvc/pull/2853#discussion_r351796322) also suffers from the same issue.
~~For now, I have reverted any changes related to this in the PR, as it has already ballooned in size. I'd opt for fixing this in a follow-up PR~~.

The following patch is working for me (might need additional improvement):

diff --git a/dvc/remote/s3.py b/dvc/remote/s3.py
index 8013f9bb..7494c96c 100644
--- a/dvc/remote/s3.py
+++ b/dvc/remote/s3.py
@@ -209,8 +209,13 @@ class RemoteS3(RemoteBASE):
 
     def exists(self, path_info):
         dir_path = path_info / ""
-        fname = next(self._list_paths(path_info, max_items=1), "")
-        return path_info.path == fname or fname.startswith(dir_path.path)
+        if self.isdir(dir_path):
+            return True
+
+        for fname in self._list_paths(path_info):
+            if path_info.path == fname:
+                return True
+        return False
 
     def makedirs(self, path_info):
         # We need to support creating empty directories, which means
@@ -279,7 +284,7 @@ class RemoteS3(RemoteBASE):
         )

     def walk_files(self, path_info, max_items=None):
-        for fname in self._list_paths(path_info, max_items):
+        for fname in self._list_paths(path_info / "", max_items):
             if fname.endswith("/"):
                 continue

@skshetry Great catch! Mind creating a PR with that? 🙂 We could then do the same in gs. CC @mroutis

Script to reproduce for 2nd issue:

#! /usr/bin/env bash

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

moto_server s3 &> /dev/null &

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")

s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir-file.txt", Body="### Subdir")
s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir/1", Body="")
'

temp=$(mktemp -d)
cd $temp

dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache

dvc add remote://s3/data/subdir
