Assuming we have a MinIO instance set up with two buckets (dvc-cache, data) on localhost:9000, and we try to add data from the data bucket as an external dependency, we will get an ETag mismatch error.
Example:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet
dvc init -q
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
dvc remote add s3cache s3://dvc-cache/cache
dvc config cache.s3 s3cache
dvc remote modify s3cache endpointurl http://localhost:9000
dvc remote modify s3cache use_ssl False
dvc remote add miniodata s3://data
dvc remote modify miniodata endpointurl http://localhost:9000
dvc remote modify miniodata use_ssl False
dvc add remote://miniodata/file
This will result in:
ERROR: ETag mismatch detected when copying file to cache! (expected: '4e102ec8d6ab714aae04d9e3c7b4c190-1', actual: 'ca9e5ed43f3fbee6edecbb5ac6fba77e-1')
Related: #2629, #3441
Seems to run fine for me locally for now:
~ /tmp/test.sh
~/repo ~
100% Add|████████████████████████████████████████|1/1 [00:00, 1.11file/s]
To track the changes with git, run:
git add xfs_metadump.dvc
Do you have an example of a dataset with which we can reproduce this issue?
@pared please try with minio server --compat. See https://github.com/minio/minio#caveats to understand more about why --compat might be needed here.
Here's what I tried:
docker run -v $PWD:/data -p 9000:9000 minio/minio:RELEASE.2020-03-06T22-23-56Z server --compat /data
I still got a similar error:
dvc add remote://miniodata/shapes_128_128_20180914T1515
Adding...
ERROR: ETag mismatch detected when copying file to cache! (expected: '7461e0bd8ff44f448d2659cc04f0f86a-1', actual: '098a1335754cabb3ec1ab403e4597be1-1')
Note that you need to purge the existing objects; dvc caches data.
@harshavardhana @pared
Ok, so here's what I did:
- Purged .minio.sys and upgraded to the latest MinIO via docker pull
- Purged the .dvc-cache folder and created a new one
- git cloned a fresh repository
Now I get a slightly different error, but again related to ETags:
(base) benjamintan@argus:~/workspace/dev/darkrai_2$ dvc add remote://miniodata/shapes_128_128_20180914T1515 -v
2020-03-10 10:53:10,292 DEBUG: PRAGMA user_version;
2020-03-10 10:53:10,293 DEBUG: fetched: [(3,)]
2020-03-10 10:53:10,293 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-03-10 10:53:10,293 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-03-10 10:53:10,293 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-03-10 10:53:10,293 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-03-10 10:53:10,293 DEBUG: PRAGMA user_version = 3;
2020-03-10 10:53:10,660 DEBUG: Uploading '../../../../../tmp/tmph3zuly_y' to 's3://dvc-cache/cache/.dSv6zZRrDrF8rrZG7EG7vQ.tmp'
2020-03-10 10:53:10,801 DEBUG: cache 's3://dvc-cache/cache/33/d27895814e19a630b0dcfff3ca1a32.dir' expected '33d27895814e19a630b0dcfff3ca1a32.dir' actual 'None'
2020-03-10 10:53:10,867 DEBUG: Removing s3://dvc-cache/cache/.dSv6zZRrDrF8rrZG7EG7vQ.tmp
2020-03-10 10:53:10,868 DEBUG: {'remote://miniodata/shapes_128_128_20180914T1515': 'modified'}
2020-03-10 10:53:10,921 DEBUG: Uploading '../../../../../tmp/tmpbqom4nbg' to 's3://dvc-cache/cache/.gFLeM9XfsEAdr9fAUq5CpF.tmp'
2020-03-10 10:53:10,972 DEBUG: cache 's3://dvc-cache/cache/33/d27895814e19a630b0dcfff3ca1a32.dir' expected '33d27895814e19a630b0dcfff3ca1a32.dir' actual '33d27895814e19a630b0dcfff3ca1a32'
2020-03-10 10:53:10,973 DEBUG: Computed stage 'shapes_128_128_20180914T1515.dvc' md5: 'e808412c5ebc930850a133d48bdde83f'
2020-03-10 10:53:11,022 DEBUG: Saving 's3://data/shapes_128_128_20180914T1515' to 's3://dvc-cache/cache/33/d27895814e19a630b0dcfff3ca1a32.dir'.
2020-03-10 10:53:11,035 DEBUG: cache 's3://dvc-cache/cache/00/000000000000000000000000000000-1' expected '00000000000000000000000000000000-1' actual 'None'
Adding...
2020-03-10 10:53:11,158 DEBUG: SELECT count from state_info WHERE rowid=?
2020-03-10 10:53:11,159 DEBUG: fetched: [(0,)]
2020-03-10 10:53:11,159 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
2020-03-10 10:53:11,223 ERROR: ETag mismatch detected when copying file to cache! (expected: '00000000000000000000000000000000-1', actual: '098a1335754cabb3ec1ab403e4597be1-1')
------------------------------------------------------------
Traceback (most recent call last):
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/command/add.py", line 20, in run
fname=self.args.file,
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/repo/__init__.py", line 28, in wrapper
ret = f(repo, *args, **kwargs)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
result = method(repo, *args, **kw)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/repo/add.py", line 86, in add
stage.commit()
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 161, in rwlocked
return call()
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
return self._func(*self._args, **self._kwargs)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 823, in commit
out.commit()
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/output/base.py", line 241, in commit
self.cache.save(self.path_info, self.info)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 520, in save
self._save(path_info, checksum)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 526, in _save
self._save_dir(path_info, checksum)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 479, in _save_dir
self._save_file(entry_info, entry_checksum, save_link=False)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 427, in _save_file
self.move(path_info, cache_info)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 659, in move
self.copy(from_info, to_info)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/s3.py", line 188, in copy
self._copy(self.s3, from_info, to_info, self.extra_args)
File "/home/benjamintan/miniconda3/lib/python3.7/site-packages/dvc/remote/s3.py", line 184, in _copy
raise ETagMismatchError(etag, cached_etag)
dvc.exceptions.ETagMismatchError: ETag mismatch detected when copying file to cache! (expected: '00000000000000000000000000000000-1', actual: '098a1335754cabb3ec1ab403e4597be1-1')
What's interesting is this particular line:
2020-03-10 10:53:11,035 DEBUG: cache 's3://dvc-cache/cache/00/000000000000000000000000000000-1' expected '00000000000000000000000000000000-1' actual 'None'
While on the other hand it seems like for other files the MD5/ETag was computed correctly (albeit with the --compat
flag turned on).
Yes, because now that you purged .minio.sys you lost all metadata, including the ETag, for all your objects. Since dvc remembers the ETag, it is not possible to start the server without --compat and then start it again with --compat.
The server should be started with --compat from the get-go, so when I said fresh, I meant everything fresh, including the content on MinIO, not just the metadata.
Unless, of course, a refresh of sorts can be requested from dvc.
The original issue here is a bug in MinIO CopyObject, which is not preserving the previous ETag; this is not intentional.
It happens only without the --compat flag, so it needs to be fixed in MinIO. Once that is fixed, I expect that dvc will work without the --compat flag too.
Q: Does dvc require the ETag to be a checksum of the content, or does it compute that separately? What is the purpose of the ETag - is it some opaque value used to reference a blob?
@harshavardhana To be clear, I actually purged both MinIO and DVC metadata at the same time, then restarted the MinIO server. But you seem to be saying that I need to ensure that the content is new? For example, by copying it to another folder?
Yes, because you already have content; we don't compute new ETags for existing data. Just purging .minio.sys simply loses all object metadata. It won't regenerate it magically.
Also let's avoid lengthy discussion here, as a courtesy towards dvc maintainers.
You are not doing what I demonstrated anyway. Let's discuss on our Slack instead @benjamintanweihao
@harshavardhana Thanks a lot for the explanation! Looks like there is not much we can do on dvc side, so closing this issue for now.
@efiop does it make sense to report it back to minio?
@shcheklein, there's an issue that was closed because it is supposed to work this way on minio. See: https://github.com/minio/minio/issues/8012#issuecomment-519757286
We could, however, suggest that users use --compat, or find a better way than the ETag.
The solution provided by @harshavardhana seems to be working, though I encounter another problem.
It seems that running the MinIO server with the --compat option helps:
docker run -p 9000:9000 minio/minio --compat server /data
The next error I encounter is:
Invalid length for parameter Key, value: 0, valid range: 1-inf
This problem does not happen when using s3 as the remote.
@pared that exception looks client-side - perhaps a new issue should be reported. Feel free to CC me if you think it's MinIO.
I think I found the behavior here @skshetry @efiop - I see that dvc is looking for a pattern in the ETag such as -1 and trying to perform a multipart copy - are you trying to ensure the preservation of the ETag in some form? What is the ETag used for?
@harshavardhana Yes, we try to ensure that the ETag is the same after we copy an object from one place to another. It is needed to be able to identify that an object is unchanged no matter where it is located.
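For illustration, a minimal sketch (not dvc's actual code) of the check dvc effectively performs: compare the ETag reported before and after a server-side copy. The endpoint, credentials, and bucket names reuse the reproduction script above; the destination key is an assumption.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# ETag of the source object in the data bucket.
etag_before = s3.head_object(Bucket="data", Key="file")["ETag"]

# Server-side copy into the cache bucket (hypothetical destination key).
s3.copy_object(
    Bucket="dvc-cache",
    Key="cache/file",
    CopySource={"Bucket": "data", "Key": "file"},
)
etag_after = s3.head_object(Bucket="dvc-cache", Key="cache/file")["ETag"]

# dvc raises ETagMismatchError when these differ.
assert etag_before == etag_after, (etag_before, etag_after)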
Variadic parts don't seem to be handled; even with the number of parts calculated from the ETag, you cannot know for certain what part size was used. This is tricky - has there been any thought on fixing this by using something more appropriate for all use cases?
https://github.com/s3git/s3git might give some ideas
Yes, the assumption is that the part size stays the same. An important note is that this is only used for some very advanced functionality, so maybe we haven't had enough runs in the wild, but so far it has worked fine.
Thanks for sharing the s3git link! Looks like they don't rely on ETags, but instead manually compute a hash for objects on S3, right?
@harshavardhana thanks for sharing your insight!
I found an issue about this problem and it seems to point to the AWS client.
https://github.com/aws/aws-cli/issues/2929
I have not had time to dive deeper into that issue yet.
Yes, the assumption is that the part size stays the same. An important note is that this is only used for some very advanced functionality, so maybe we haven't had enough runs in the wild, but so far it has worked fine.
Thanks for sharing the s3git link! Looks like they don't rely on ETags, but instead manually compute a hash for objects on S3, right?
Yes, that is correct. I shall cc @fwessels for more clarity.
@harshavardhana As far as I can tell, BLAKE2 is used, but even though it is ~2x faster than md5, it still takes time to compute. That is something we avoid by leveraging ETags as a free way to ensure that objects are unchanged. We have thought about computing some hash instead, but so far there hasn't been enough need for that. We might come to that in the future, at which point we will probably take a look at BLAKE3 or something like that. Thanks for the question! :pray:
For consistent hashing you should look into the verified streaming behaviour that BLAKE3 offers. It fixes the size of the individual parts for which hashes are computed at the lowest level, which then ultimately results in the final hash at the top.
The "only" caveat here is that, when working with multipart objects, you need to know where the boundaries are for the parts to hash (and some hashes will need to be computed across two consecutive parts).
So it is definitely doable, but not completely trivial to implement (and dropping parts upon finalization of the multipart upload would not be allowed, at least not without recomputing the hashes from the point onwards where a part is left out).
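A simplified illustration of that fixed-boundary idea (this is not real BLAKE3 verified streaming; it uses BLAKE2b from hashlib and an assumed 1 MiB chunk size):
import hashlib

CHUNK = 1024 * 1024  # assumed fixed boundary, for illustration only

def chunked_hash(data):
    # Hash fixed-size chunks independently, then hash the concatenation of
    # the chunk digests. With fixed boundaries the final hash does not depend
    # on how the object happened to be split into upload parts.
    digests = b"".join(
        hashlib.blake2b(data[i:i + CHUNK]).digest()
        for i in range(0, len(data), CHUNK)
    )
    return hashlib.blake2b(digests).hexdigest()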
@harshavardhana
Variadic parts don't seem to be handled
it should be handled as far as I remember:
obj = cls.get_head_object(
    s3, from_info.bucket, from_info.path, PartNumber=i
)
part_size = obj["ContentLength"]
So, we try to preserve both the same number of parts and the same length for each of them.
It is still a gray area, since as far as I know it's not officially documented how the ETag is calculated for multipart objects, but I think it is reasonable for our users to rely on that optimization for now. As @efiop mentioned, we can introduce BLAKE2 or something similar if Amazon at some point decides to change the logic behind it.
Btw, I wonder if s3.upload_part_copy moves bytes to the local machine first before sending them back to S3, or whether S3 has a way to copy remotely? If it pulls the data locally, we can potentially calculate the hash while we copy data without affecting performance. In that case we would need S3 to support rename, at least.
it should be handled as far as I remember:
No, it is not. Parts can be uploaded in this manner: 5 MiB, then 6 MiB, then 1 byte.
This will result in an ETag of md5hex(md5(5MiB) + md5(6MiB) + md5(1byte))-3.
Now if you assume 3 parts for a content length of 11 MiB, you have no idea what length was used for the 1st and 2nd parts - if you happen to choose 5 MiB for both, you will end up with an incorrect ETag which will mismatch. I can reproduce this right now with dvc using AWS S3. Of course, I assume that this is not handled because it is a corner case and rare. Just so that you are aware, I am clarifying this a bit.
It is still a gray area, since as far as I know it's not officially documented how the ETag is calculated for multipart objects, but I think it is reasonable for our users to rely on that optimization for now. As @efiop mentioned, we can introduce BLAKE2 or something similar if Amazon at some point decides to change the logic behind it.
The multipart ETag is nothing but hexmd5(md5(part1) + md5(part2) + ...)-N.
This is not documented in the AWS S3 docs, but was found while talking to AWS support.
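A hedged sketch of that formula, using the 5 MiB / 6 MiB / 1 byte example from the previous comment:
import hashlib

def multipart_etag(parts):
    # parts: list of bytes objects, one per uploaded part
    concatenated = b"".join(hashlib.md5(p).digest() for p in parts)
    return "{}-{}".format(hashlib.md5(concatenated).hexdigest(), len(parts))

parts = [b"a" * (5 * 1024 * 1024), b"b" * (6 * 1024 * 1024), b"c"]
print(multipart_etag(parts))  # ends with '-3'

# Re-splitting the same bytes at assumed uniform 5 MiB boundaries changes the
# per-part MD5s, and hence produces a different (mismatching) ETag.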
Btw, I wonder if s3.upload_part_copy moves bytes to the local machine first before sending them back to S3, or whether S3 has a way to copy remotely? If it pulls the data locally, we can potentially calculate the hash while we copy data without affecting performance. In that case we would need S3 to support rename, at least.
The server-side copy of parts is called CopyObjectPart() - which I see you are using whenever the ETag has a '-' suffix at the end.
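For reference, a hedged sketch of such a server-side multipart copy with boto3, preserving the source part boundaries by querying HeadObject with PartNumber (roughly the approach in the dvc code linked further below; bucket names, keys, and the endpoint are assumptions):
import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

src_bucket, src_key = "data", "bigfile"
dst_bucket, dst_key = "dvc-cache", "cache/bigfile"

# The "-N" suffix of the source ETag gives the number of parts.
etag = s3.head_object(Bucket=src_bucket, Key=src_key)["ETag"].strip('"')
n_parts = int(etag.split("-")[1])

mpu = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)
parts, pos = [], 0
for i in range(1, n_parts + 1):
    # Exact size of part i of the source object.
    size = s3.head_object(
        Bucket=src_bucket, Key=src_key, PartNumber=i
    )["ContentLength"]
    res = s3.upload_part_copy(
        Bucket=dst_bucket,
        Key=dst_key,
        UploadId=mpu["UploadId"],
        PartNumber=i,
        CopySource={"Bucket": src_bucket, "Key": src_key},
        CopySourceRange="bytes={}-{}".format(pos, pos + size - 1),
    )
    parts.append({"PartNumber": i, "ETag": res["CopyPartResult"]["ETag"]})
    pos += size

s3.complete_multipart_upload(
    Bucket=dst_bucket,
    Key=dst_key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)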
NOTE: This assumption will also fail for SSE-C encrypted objects, because AWS S3 doesn't return a proper ETag when you have SSE-C encrypted objects - meaning an SSE-C object will change its ETag automatically upon an overwrite.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
The entity tag is a hash of the object. The ETag reflects changes only to the contents
of an object, not its metadata. The ETag may or may not be an MD5 digest of the
object data. Whether or not it is depends on how the object was created and how it
is encrypted as described below:
- Objects created by the PUT Object, POST Object, or Copy operation, or through the
AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags
that are an MD5 digest of their object data.
- Objects created by the PUT Object, POST Object, or Copy operation, or through the
AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have
ETags that are not an MD5 digest of their object data.
- If an object is created by either the Multipart Upload or Part Copy operation,
the ETag is not an MD5 digest, regardless of the method of encryption.
@harshavardhana
if you happen to choose 5 MiB for both, you will end up with an incorrect ETag which will mismatch
I think that's not how DVC does this. Please check this code - https://github.com/iterative/dvc/blob/master/dvc/remote/s3.py#L103-L107
Unless I'm missing something, it takes the exact sizes of all parts.
I can reproduce this right now with dvc using AWS S3
could you share the script, please? It might be a bug.
The multipart ETag is nothing but hexmd5(md5(part1) + md5(part2) + ...)-N. This is not documented in the AWS S3 docs, but was found while talking to AWS support.
Yep, I know. That doesn't make it official - thus a "grey" area. Though, like I mentioned, it's very unlikely they will change it, so we made a decision to utilize this behavior.
NOTE: This assumption will also fail for SSE-C encrypted objects, because AWS S3 doesn't return a proper ETag when you have SSE-C encrypted objects - meaning an SSE-C object will change its ETag automatically upon an overwrite.
Yes. See https://github.com/iterative/dvc/issues/2701#issuecomment-548228199 and https://github.com/iterative/dvc.org/issues/774
I think that's not how DVC does this. Please check this code - https://github.com/iterative/dvc/blob/master/dvc/remote/s3.py#L103-L107
Unless I'm missing something, it takes the exact sizes of all parts.
Ah, you are using the PartNumber API @shcheklein, which was undocumented for a few years and then silently appeared.
for i in range(1, n_parts + 1):
    obj = cls.get_head_object(
        s3, from_info.bucket, from_info.path, PartNumber=i
    )
    part_size = obj["ContentLength"]
    lastbyte = byte_position + part_size - 1
    if lastbyte > size:
        lastbyte = size - 1
We have had debates about whether we should ever implement this on MinIO, but it looks like, since you guys use it, it makes sense to do it.
@harshavardhana thanks! 🙏 Will there be a ticket for us to follow?