dvc push: when not all files are in the local cache, it still pushes the hash.dir file, breaking the remote repository.

Created on 5 Aug 2020 · 21 comments · Source: iterative/dvc

EDIT: Solved. The issue was pushing incomplete datasets from my gpu server to the new storage remotes. It pushed all the files present in the local cache (which is what I wanted), but then it also pushed the hash.dir listing, which blocked my local machine from uploading the rest of the files that were not yet present on the cloud server. This is a pretty serious bug IMO (even though my setup was accidental and poor practice!)

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.7.7 on Linux-5.3.0-1032-aws-x86_64-with-debian-buster-sid
Supports: http, https, s3
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

On this VPS I get the following errors repeatedly when trying to pull from my s3 storage:

2020-08-05 18:32:52,114 ERROR: failed to download 's3://[xxx]/repo.dvc/69/763f0cecd801483a1490a0b2a0b84d' to '.dvc/cache/69/763f0cecd801483a1490a0b2a0b84d' - An error occurred (404) when calling the HeadObject operation: Not Found
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/cache/local.py", line 30, in wrapper
    func(from_info, to_info, *args, **kwargs)
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/base.py", line 420, in download
    from_info, to_info, name, no_progress_bar, file_mode, dir_mode
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/base.py", line 478, in _download_file
    from_info, tmp_file, name=name, no_progress_bar=no_progress_bar
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/s3.py", line 341, in _download
    Bucket=from_info.bucket, Key=from_info.path
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/botocore/client.py", line 635, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

This happened on an EC2 Linux VPS. I tried again on an Ubuntu Deep Learning AMI, then again with a fresh python3 virtualenv with only dvc installed. I have not been able to replicate this on any of my local workstations. They are able to clone the dvc directories just fine, even pushing from one and pulling on another.

Also, on any machine, aws s3 ls ... does not return anything for the hashes it is searching for on s3. But I am able to clone the .dvc on my other machines... I am stumped.

For the record, one local dvc version:

DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.0-42-generic-x86_64-with-Ubuntu-18.04-bionic
Supports: http, https, s3, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git


bug p0-critical

Most helpful comment

Can reproduce with this test

# Uses DVC's pytest fixtures (tmp_dir, dvc, mocker, local_remote); the helper
# imports below are assumed from DVC's test suite.
from funcy import first
from dvc.utils.fs import remove

def test_push_incomplete_dir(tmp_dir, dvc, mocker, local_remote):
    (stage,) = tmp_dir.dvc_gen({"dir": {"foo": "foo", "bar": "bar"}})
    remote = dvc.cloud.get_remote("upstream")

    cache = dvc.cache.local
    dir_hash = stage.outs[0].checksum
    used = stage.get_used_cache(remote=remote)

    # remove one of the local cache files for directory
    file_hash = first(used.child_keys(cache.tree.scheme, dir_hash))
    remove(cache.tree.hash_to_path_info(file_hash))

    dvc.push()
    assert not remote.tree.exists(remote.tree.hash_to_path_info(dir_hash))

All 21 comments

I should also note that I have been able to push and pull other folders with no problem, and I can still push and pull these datasets over ssh. I switched to s3 when I outgrew my VPS and encountered this issue.

EDIT: Also, these hashes that can't be found on s3 are present in the local caches on both of the other machines. I wonder how they could be missing in s3??

More info: I seeded this S3 bucket by pushing from the tip of both my devel and master branches. I'm guessing somehow when pushing updates from remote machines some of the files didn't get correctly added to S3. I have updated those since and pushed again and the missing files still do not appear in the S3 bucket.

@bobertlo can you try running dvc status -c?
Is there any output?

Locally:

[sj@control-01 automator (devel)]$ dvc status -r s3 -c
Data and pipelines are up to date.

Remotely, a giant list of files of the form:

ubuntu@ip-XXX-XX-XX-XXX:~/automator$ dvc status -r s3 -c
    deleted:            data/.../ff1ef9891bf2386ec617c11cfc1d3b299256ccc65c68491931958a6d2cab3851.png

And I just did a dvc push -r s3 and dvc pull -r s3 on two local machines successfully again.

Just to clarify this @bobertlo

And I just did a dvc push -r s3 and dvc pull -r s3 on two local machines successfully again.

when you say you are able to push successfully, if you do aws s3 ls from the local machine, are the hashes reported as missing on the remote machine actually present in S3? Also, can you confirm that you are running the same DVC version on both the remote and local machines?

It looks to me like there might be some indexing bug happening here, where the local machine is not actually pushing everything.

When I say successfully I mean the transaction completed. I think they were sharing cache from before moving to S3. I had just maxed out my (ssh remote) VPS and pushed everything to S3.

I first pushed to S3 from work and then later tried to access it from the EC2 instance with no problems. The only broken files I have encountered have been files I pushed from home on my (very bad) DSL. My home laptop and work computer can push files back and forth just fine. And when I push new files from my work computer to the S3 and pull on the VPS it works fine.

It is possible I had a (daily) version mix-up but I just don't see this as a possible outcome of only that?

I'm just really confused that it is checking the remote cache when pushing and not actually sending the missing hashes.

@pmrowla to clarify, they are present in $REPO/.dvc/cache and absent from s3://bucket/repo/...

It is possible I had a (daily) version mix-up but I just don't see this as a possible outcome of only that?

I'm just really confused that it is checking the remote cache when pushing and not actually sending the missing hashes.

There was a change made in DVC 1.0 where we started indexing remotes for performance reasons. It's possible that if something went wrong in the middle of a dvc push in < 1.0, our index could get into a state where DVC on your local machine mistakenly assumes files are already in your S3 remote (and as a result says everything is up-to-date when you run push).

Could you make sure that DVC on your local machine is up to date, and then run the following on that machine:

rm -rf .dvc/tmp/index
dvc push -r s3
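
Roughly, the failure mode is this (an illustrative sketch, not DVC's actual code): hashes recorded in the local index of the remote are assumed to already exist there, and are skipped on later pushes without re-checking the remote.

def hashes_to_push(wanted_hashes, remote_index):
    # remote_index: hashes previously recorded as present on the remote.
    # If the index is stale (e.g. from an interrupted push), files that never
    # reached the remote are silently filtered out here, and `dvc push`
    # reports everything as up to date.
    return {h for h in wanted_hashes if h not in remote_index}

Removing .dvc/tmp/index should force that set to be rebuilt against the actual remote contents.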

I ran it with dvc push -r s3 -v and only got a lot of:

2020-08-06 01:27:05,999 DEBUG: Assuming '/home/robert/automator/.dvc/cache/e9/a4e8393581fc63d72c9788d5fed48f' is unchanged since it is read-only

And then some:

2020-08-06 01:27:08,145 DEBUG: Indexing new .dir 'e9970fd0c951933e4841283cf8f66fc6.dir' with '484' nested files

But, this did not fix things on the other end.

After seeing the bill for two days on S3 (which is not very suitable for my use case, just an escape hatch), I will probably try a MinIO solution and/or just a larger server with ssh access.

I will, however, try to keep this bucket around for troubleshooting at least.

EDIT: and I pushed from my laptop, which seemed to be the original offender.

The confusing part is that I have been able to rectify these differences by pushing to an ssh remote. It cannot handle the large main dataset anymore, but the other datasets are derived from it, and I have been able to push those to the third-party remote and pull them back. I think the bug is isolated somewhere in the stack on the VPS, since I have always been able to push/pull normally between local machines?

It is possible I pushed and pulled these smaller datasets from the old remote out of habit the night of migrating, but I just don't see how that could break the S3 remote. I try to push and it says it is all up to date.

Also, I remember seeing the 1.0 release quite a while ago and generally upgrade all of my installations at least daily.

Anyways, I'm just syncing everything back onto an ssh remote for now. I won't delete this bucket for a while in case there is any troubleshooting to be done. S3 is really the wrong price model for my workflow. :)

Another update: I am getting the same issue over ssh transport. I am pushing from my workstation to a (fresh, brand new) server. I then try to pull on the other end and it cannot find hashes from the commit. I checked the remote cache against the .dvc file: the hashes are valid, and they do not exist on the server!

I got it! I was doing a push from my gpu server to the remote storage server to conserve local bandwidth (they are on the same cloud), and even though it didn't have all the hashes, it still pushed the .dir hash. Seeing that on the remote, my local dvc refuses to upload the files from my local cache. Deleting the .dir hash on the remote storage server fixed it.
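
For anyone hitting the same thing on an S3 remote, the cleanup amounts to deleting the .dir object. A hedged sketch with boto3, assuming the key layout visible in the 404 errors above (first two hex characters of the hash as a prefix); the bucket, prefix, and hash below are placeholders:

import boto3

# Placeholder values -- substitute your own bucket/prefix and the .dir hash
# reported in your .dvc file / dvc status -c output.
BUCKET = "my-bucket"
PREFIX = "repo.dvc"
DIR_HASH = "e9970fd0c951933e4841283cf8f66fc6.dir"

# Remote cache objects appear to be laid out as <prefix>/<first two hex chars>/<rest>,
# matching the paths in the HeadObject errors above.
key = f"{PREFIX}/{DIR_HASH[:2]}/{DIR_HASH[2:]}"

s3 = boto3.client("s3")
s3.delete_object(Bucket=BUCKET, Key=key)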

@bobertlo can you post the output from dvc version on your gpu server?

edit: if by push you mean just copying files from your gpu server to your remote storage without using DVC, then this is expected behavior, and manually pushing things into a DVC remote in this way is an unsafe operation.

In DVC (1.0 and later) we use the .dir hashes as part of our remote storage indexing, and only push .dir hashes once all of the directory content has been uploaded into remote storage. Likewise, when we gc -c a remote, we remove .dir hashes before removing any directory content. So if DVC sees that a .dir hash exists in a remote, we trust that it means the rest of the directory contents also exists in the remote.

The "push .dir hashes last" behavior was not present in DVC < 1.0, which is why mixing <1.0 and >=1.0 DVC versions across all of your machines can also cause issues.

@pmrowla no this was definitely through the dvc client. It happened first on S3 transport then again on ssh. I definitely have the latest release on all machines and was pushing to a new remote because the S3 remote was broken and reproduced the same bug.

As soon as I deleted the .dir files everything worked fine. Your description of the process fits my understanding. I want to check that it does not send the .dir file if hashes fail to push.

If you were pushing it through DVC, it sounds like it's a bug, in which case it will still be helpful for you to post the dvc version output from the gpu machine, as it contains platform-specific config information.

Sure, just had to get back to a terminal and boot it up.

$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.7.7 on Linux-4.14.181-142.260.amzn2.x86_64-x86_64-with-glibc2.10
Supports: http, https, s3, ssh

https://github.com/iterative/dvc/blob/master/dvc/cache/local.py#L333

I'm not super familiar with the code base but it looks like it pushes files and then directories into an executor and only collects failures off of the whole thing. So I think the sending of dirs with files missing is the expected behavior currently?

Edit: I'm not denying I must have done something dumb to get into this state! I don't know how the .dir files ended up in my local cache??

Each file and dir hash to upload has its own executor, and for the dir hash we run:

https://github.com/iterative/dvc/blob/e2a574137430a6beacb86d4eb3ff8d7e4fca6734/dvc/cache/local.py#L382

which should wait for the executors of every file contained in the directory to finish, and then only push the final dir hash if all files were uploaded successfully.

Can you run dvc push -v ... from your gpu machine and post the output?

edit: actually, thinking about it now we might not be accounting for the case where the list of files to push is incomplete from the start. So we upload the "incomplete" list without any errors, and then treat that as successfully uploading the full directory
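
A hedged sketch of that suspected gap (names are illustrative, not the real code): the success check covers only the files that were planned for upload, not the full .dir listing.

def push_dir_from_partial_cache(remote, local_cache, dir_listing):
    # Only hashes actually present in the local cache make it into the plan,
    # so a partial cache produces a partial plan.
    plan = [h for h in dir_listing if h in local_cache]
    failed = [h for h in plan if not remote.upload(h)]
    if not failed:
        # Suspected bug: "no failures" is read as "directory complete",
        # even though the plan never covered the whole listing.
        # A check like set(plan) == set(dir_listing) would catch this.
        remote.upload_dir_marker()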

Unfortunately I was at work and had to fix this, so that local cache and ssh remote are fixed now. I should be able to reproduce a broken pull by checking out an old git commit, and may be able to reproduce a bad push from the other VM I set up to troubleshoot and from the S3 bucket, but I can't be sure. I'll report back if I can reproduce anything.

Can reproduce with this test

# Uses DVC's pytest fixtures (tmp_dir, dvc, mocker, local_remote); the helper
# imports below are assumed from DVC's test suite.
from funcy import first
from dvc.utils.fs import remove

def test_push_incomplete_dir(tmp_dir, dvc, mocker, local_remote):
    (stage,) = tmp_dir.dvc_gen({"dir": {"foo": "foo", "bar": "bar"}})
    remote = dvc.cloud.get_remote("upstream")

    cache = dvc.cache.local
    dir_hash = stage.outs[0].checksum
    used = stage.get_used_cache(remote=remote)

    # remove one of the local cache files for directory
    file_hash = first(used.child_keys(cache.tree.scheme, dir_hash))
    remove(cache.tree.hash_to_path_info(file_hash))

    dvc.push()
    assert not remote.tree.exists(remote.tree.hash_to_path_info(dir_hash))
