DVC: Failure when using S3 remote cache

Created on 23 Mar 2020 · 10 comments · Source: iterative/dvc

Trying to run the command

dvc run -d s3://mybucket/data.txt -o s3://mybucket/data_copy.txt aws s3 cp s3://mybucket/data.txt s3://mybucket/data_copy.txt

Using AWS S3 remote cache.

It fails with the error message ERROR: unexpected error - Parameter validation failed: Invalid length for parameter Key, value: 0, valid range: 1-inf

Attached is the log with -v turned on.

log.txt

and the output of pip freeze

pip_freeze.txt

Labels: bug, p1-important


All 10 comments

@helger Could you also show us your $ dvc version output, please?

$ dvc version
DVC version: 0.90.2
Python version: 3.7.4
Platform: Linux-4.4.0-1104-aws-x86_64-with-debian-stretch-sid
Binary: False
Package: pip
Supported remotes: azure, gs, hdfs, http, https, s3, ssh, oss
Filesystem type (workspace): ('ext4', '/dev/xvda1')


Output of pip freeze

pip_freeze.txt

This happens because we try to create the parent directory when linking from cache to workspace (and here, the parent of s3://<bucket>/<name> is "", i.e. empty).
https://github.com/iterative/dvc/blob/ee26afe65e8f9ebf11d7236fd3ee16a47f9b4fc6/dvc/remote/base.py#L380-L385

If the file is in s3://<bucket>/<namespaced>/<name>, this won't fail.

I don't think we need to create parent directories for object-based storages, except when checking out a directory.

Google Cloud Storage should be equally affected, as it uses the same logic.
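The diagnosis above can be illustrated with a small Python sketch (illustrative only, not DVC's actual code): computing the parent "directory" of an S3 key that sits at the bucket root yields an empty key, and passing an empty `Key` to boto3 is exactly what produces the "Invalid length for parameter Key, value: 0, valid range: 1-inf" validation error from the report.

```python
from urllib.parse import urlparse
import posixpath

def parent_key(s3_url):
    # Split an s3:// URL into bucket and key, then take the key's
    # parent "directory" the way a path-based cache link would.
    parsed = urlparse(s3_url)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    return bucket, posixpath.dirname(key)

# A key at the bucket root has an empty parent key ...
print(parent_key("s3://mybucket/data.txt"))        # ('mybucket', '')
# ... while a namespaced key has a non-empty one.
print(parent_key("s3://mybucket/data/data.txt"))   # ('mybucket', 'data')
```

This matches the observation above: adding one more path level gives the key a non-empty parent, so the directory-creation step no longer calls S3 with a zero-length key.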

@skshetry Indeed, adding another level fixes the issue.

@skshetry Great catch! :pray: In general we do need to create those, because S3 and some S3 tools sometimes create directory-like objects; this was discussed in https://github.com/iterative/dvc/pull/2683 . It is not strictly necessary, but it is a compromise to handle both approaches. We definitely need to handle this corner case better, though.

@efiop, it depends on how you see it. We can handle this corner case, but there's no need for us to create a directory in most cases (except for the #2678 cases?).

i.e. dvc add s3://dvc-bucket/data/file should not even try to create s3://dvc-bucket/data/ as a directory.

@skshetry It does that as part of linking, so that is fine. What is actually wrong here is the corner-case handling, indeed.
