Dvc: How is the sync direction controlled when both local and remote files exist?

Created on 3 Jul 2017 · 8 comments · Source: iterative/dvc

I noticed that data files are being downloaded even though there should already be a local copy (I can't verify it right now).

What happens if both the local and remote files exist (with the same hash)?

bug

All 8 comments

Tested again (running the master version, commit 671d87be523e757fbb346213f74e3243304de556): running sync a second time downloads all the files again.
Also, it appears that this can corrupt files if the download fails or is canceled midway, since the original file is overwritten during the download.
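For illustration, one way to avoid clobbering the original file during a failed or canceled download (not necessarily how DVC handles it) is to download into a temporary file and only replace the destination once the transfer finishes. A minimal sketch, assuming boto3 is available; the bucket, key, and destination path arguments are placeholders:

```python
import os
import tempfile

import boto3  # assumed dependency; bucket, key and dest_path below are placeholders


def safe_download(bucket, key, dest_path):
    """Download into a temporary file in the same directory and atomically
    replace the destination, so a crashed or canceled transfer never
    clobbers an existing local copy."""
    s3 = boto3.client("s3")
    tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".")
    os.close(tmp_fd)
    try:
        s3.download_file(bucket, key, tmp_path)
        # A checksum of tmp_path could be verified here before committing.
        os.replace(tmp_path, dest_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach, an interrupted transfer leaves any existing local copy untouched.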

@ophiry I was unable to reproduce this, or perhaps I don't understand the issue correctly. Could you please give more details?

I'd like to clarify: there are two hashes:
1) a filename hash/git-hash, like .cache/file1_2b419f6
2) the md5 sum of the file content; for example, $ md5 ~/src/myrepo/data/Posts.xml gives cfdaa4bba57fa07d81ff96685a9aab2c.

DVC downloads a file if the md5 sums of the local file and the S3 file don't match (it gets the md5 via the AWS API). Yes, if a download fails then DVC should download the file again.

So, if DVC downloads a file again when the md5 sums match, that is an issue. It would also be an issue if DVC didn't download a file when the md5 sums do not match.
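As an illustration of the md5 comparison described above (not DVC's actual code), here is a minimal sketch, assuming boto3 and that the S3 ETag equals the object's md5, which only holds for simple non-multipart uploads; the bucket, key, and local path are placeholders:

```python
import hashlib

import boto3  # assumed dependency; bucket, key and local_path are placeholders


def local_md5(path, chunk_size=1 << 20):
    """Compute the md5 of a local file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(bucket, key, local_path):
    """True if the local copy is missing or its md5 differs from the S3 object's."""
    s3 = boto3.client("s3")
    remote_md5 = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    try:
        return local_md5(local_path) != remote_md5
    except FileNotFoundError:
        return True
```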

What I think happened is that a file was downloaded and the download didn't finish correctly.
Afterwards, it appears that the md5 comparison assumes the local file is correct, so in case of a mismatch it re-uploads it (and overwrites the original copy in the cloud).

Two things that may help:

  1. Allow one-way sync (via a command line flag). In some cases I know that files need to be only uploaded or only downloaded (like a client that only consumes the data); in that case, if a file were corrupted locally, it wouldn't be sent to the cloud.
  2. Save the original md5 in git (in the state file?). That way it's easy to verify which file is OK and which one is corrupt in case of a mismatch, with no need to assume the local copy is correct (a sketch of this follows the list).
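As a rough illustration of the second suggestion (not an actual DVC feature), the sketch below checks local files against md5 sums recorded in a hypothetical state file kept under git; the file name and JSON format are made up for the example:

```python
import hashlib
import json

# Hypothetical state file committed to git, mapping data file paths to md5 sums,
# e.g. {"data/Posts.xml": "cfdaa4bba57fa07d81ff96685a9aab2c"}
STATE_FILE = ".dvc_state.json"


def file_md5(path):
    """md5 of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path):
    """Say whether the local copy matches the md5 recorded in the state file."""
    with open(STATE_FILE) as fobj:
        expected = json.load(fobj)
    if path not in expected:
        return "untracked"
    try:
        actual = file_md5(path)
    except FileNotFoundError:
        return "missing locally"
    return "ok" if actual == expected[path] else "corrupted locally"
```

With the expected md5 versioned alongside the code, a corrupted local copy can be detected instead of being trusted and re-uploaded.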

@ophiry Did #114 fix this issue? Can we close it now?

#114 fixes the most likely cause: cases where a file download crashes, after which future calls to sync assume the corrupted file is the correct one and sync it back to the server.

Still, there may be other causes of corruption, and limiting the sync direction would give a more robust solution.

Got it. Thank you.
Let's keep it and investigate ...

I actually agree with @ophiry; I've also noticed such issues. I think we should split 'sync' into two commands, 'push' and 'pull': both describe nicely what is going on and will be very intuitive to use. Plus, we could add a new command called 'status' that shows new and changed data files and how they relate to the ones in the cloud. All three commands are inspired by git, of course =)

I will prepare a patch set for these shortly, so we could try them out and discuss before merging.
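For a sense of what the split could look like on the command-line side, here is a rough argparse sketch of push/pull/status subcommands; the command names follow the comment above, but the dispatch is a placeholder and this is not the actual DVC implementation:

```python
import argparse


def main(argv=None):
    parser = argparse.ArgumentParser(prog="dvc")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each direction gets its own explicit command instead of a bidirectional 'sync'.
    subparsers.add_parser("push", help="upload local cache files to the cloud")
    subparsers.add_parser("pull", help="download cache files from the cloud")
    subparsers.add_parser("status", help="compare local data files with the cloud")

    args = parser.parse_args(argv)
    print(f"would run: {args.command}")  # placeholder dispatch


if __name__ == "__main__":
    main()
```

Keeping the direction explicit in the command name means a client that only consumes data never uploads anything by accident.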

I've merged the pull/push commands but didn't remove 'sync' just yet; we can declare it obsolete and remove it in the near future. As of now, pull + push explicitly declare the sync direction for both local and remote files, so I'm closing this issue. Feel free to reopen it if something is wrong.
