This is a somewhat obscure issue. MinIO is a file storage server with an S3-compatible API, so we support it.
If you set up a bucket on a directory containing files that were not created through the API, for example:
mkdir /tmp/bucket
minio server /tmp/bucket
echo "foo" > /tmp/bucket/foo
MinIO will report a default ETag for those files:
https://github.com/minio/minio/blob/d7060c4c32b0fd0672d8a58abcfe54473b49fa39/cmd/fs-v1.go#L45-L46
This will not work with DVC, since two objects with different content could have the same ETag.
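To illustrate why a shared ETag breaks change detection, here is a minimal sketch. It assumes a checker that treats the ETag as a content fingerprint; `md5_etag` and `has_changed` are hypothetical helpers for illustration, not DVC's actual code, and the default value is the one quoted further down in this thread.

```python
import hashlib

# Placeholder ETag MinIO reports for files that were not written through the API.
DEFAULT_ETAG = "00000000000000000000000000000000-1"


def md5_etag(content: bytes) -> str:
    """ETag as a plain MD5 hex digest, which a simple single-part upload typically yields."""
    return hashlib.md5(content).hexdigest()


def has_changed(saved_etag: str, current_etag: str) -> bool:
    """Hypothetical change check: the object changed if its ETag differs."""
    return saved_etag != current_etag


old_content, new_content = b"foo\n", b"bar\n"

# Uploaded through the API: different content gives different ETags.
print(has_changed(md5_etag(old_content), md5_etag(new_content)))  # True

# Written straight into the backing directory: both versions get the default
# ETag, so the change goes unnoticed.
print(has_changed(DEFAULT_ETAG, DEFAULT_ETAG))  # False
```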
Interesting find @mroutis
Not sure it's worth noting in the docs. What DVC commands does it affect?
Maybe just leaving a workaround here (can you think of one?) and resolving the issue will be good enough for any user who comes across this problem to find this conversation.
@jorgeorpinel it affects almost every command that deals with cache :sweat_smile:
A workaround is to download everything and use mc (the MinIO CLI) or aws (the AWS CLI) to sync the directory again. At least that's what I did.
Let's close it, then :)
Also, another point for iterative/dvc.org#499: we could add notes like this one (about the idiosyncrasies of each remote interacting with DVC).
The problem with iterative/dvc.org#499 is that the command reference lists commands, not other kinds of things (e.g. remote types), so the refs for both remote add and remote modify have expandable examples for all the remote types. We would have to add a 3rd level in each of those, with repetitive docs per remote type.
But if you can come up with a good way to reorganize all this that makes sense and addresses your concerns, please reopen either issue or open a new one. Thanks!
Idea: a new use case or user guide to explain these remote type-specific caveats (linked to from cmd ref examples).
two objects with different content could have the same ETag.
@mroutis @efiop it might be a bigger problem than just docs.
it affects almost every command that deals with cache
I think it affects only external dependencies/outputs, regular stuff like push/pull should work anyway.
@shcheklein Yes, it only affects external deps/outs with that particular minio misuse. Two different objects having the same ETag is most likely a violation of the S3 spec, so it is minio's problem. We could add a warning, but I wouldn't bother until someone actually runs into it, as adding files by-hand into the minio is a pretty obscure hack by itself. Having this issue as a record of us discovering the problem is good enough; it doesn't seem worth noting in our docs, especially since it is something that should be documented by the minio guys in the first place.
@efiop kk, feel free to close this then. I'm a bit worried about one thing, though:
so it is minio's problem
I'm not 100% sure the public S3 spec guarantees ETags to be md5s and/or unique across multiple files. It might only have to guarantee the semantics you usually need from ETags: if you change a file, its ETag must change. It's important for us to understand this, so we know whether we can guarantee full support for S3-compatible storages or should mention that external stuff might not be supported depending on the implementation.
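As one data point on the "ETags are md5s" question, here is a rough sketch of the ETag shape commonly observed for S3 multipart uploads: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. This is observed behaviour rather than something the spec is known to promise, and `multipart_etag` and the tiny part size are illustrative assumptions.

```python
import hashlib


def multipart_etag(content: bytes, part_size: int) -> str:
    """Illustrative multipart-style ETag: MD5 over the per-part MD5 digests, plus '-<parts>'."""
    parts = [content[i:i + part_size] for i in range(0, len(content), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


data = b"a" * 10
plain = hashlib.md5(data).hexdigest()       # ETag shape of a single-part upload
multi = multipart_etag(data, part_size=5)   # same bytes, uploaded as two parts

# Same content, two different ETags, and the second is not an MD5 of the data.
print(plain)
print(multi)
```

So even on real S3 the ETag is better treated as an opaque change indicator than as a content hash.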
as adding files by-hand into the minio is a pretty obscure hack by itself
what about the case when you are just initially creating the bucket? It might be a pretty legit case when you put files "manually".
Had a discussion with @shcheklein. I'm thinking we could simply add
    if etag == "00000000000000000000000000000000-1":
        raise ...
to the RemoteS3.save() method, to catch this only for the external output scenario, as for external deps this works fine as an indicator of content changes.
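A standalone sketch of what such a guard might look like. `UnreliableETagError` and `check_etag` are hypothetical names for illustration; this is not the actual RemoteS3.save() implementation, and stripping the surrounding quotes assumes the ETag is passed as it comes back from the S3 API.

```python
# Placeholder ETag MinIO reports for files that were not written through the API.
DEFAULT_ETAG = "00000000000000000000000000000000-1"


class UnreliableETagError(Exception):
    """Raised when an ETag cannot be trusted to identify the object's content."""


def check_etag(etag: str, path: str) -> str:
    """Return the normalized ETag, or raise if it is MinIO's default placeholder."""
    normalized = etag.strip('"')  # S3 APIs usually return ETags wrapped in quotes
    if normalized == DEFAULT_ETAG:
        raise UnreliableETagError(
            f"'{path}' has a default ETag; it was probably not uploaded through the S3 API"
        )
    return normalized


try:
    check_etag('"00000000000000000000000000000000-1"', "s3://bucket/foo")
except UnreliableETagError as exc:
    print(f"refusing to save: {exc}")
```

Calling this only when saving external outputs would leave external deps untouched, as suggested above.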
Looks like Minio has several problems with ETags: another bug, discovered on Friday, is that copying a file to a different path changes the ETag.
It is better to work with a real S3 instance and test it with moto (as we currently do).
To test things quickly, you can set up a standalone server with moto_server s3.
@mroutis what is the case for running the MinIO instance locally and putting files into its underlying storage directly rather than through the standard s3 api? I think if all the files are placed via an s3 operation, you shouldn't run into the default ETag issue you are describing. (Sorry, I'm not familiar with the DVC cache.)
Alternatively, rather than using mc, you could use minio-go to put files into it; it works with any S3-compatible storage engine.