I noticed that the data files are being downloaded even though a local copy should already exist (I can't verify it right now).
What happens if both the local and remote files exist (with the same hash)?
Tested again (running master, commit 671d87be523e757fbb346213f74e3243304de556): running sync a second time downloads all files again.
Also, it appears that files can get corrupted if a download fails or is canceled midway (as the original file is overwritten during the download).
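One common way to avoid the overwrite-during-download problem is to write to a temporary file and rename it over the original only after the download completes. The sketch below is only an illustration and not DVC's code: `download_atomically` and the `fetch` callable (anything that writes the remote bytes into an open file object) are hypothetical names.

```python
import os
import tempfile


def download_atomically(fetch, dest_path):
    """Write a download to a temp file, then rename it over dest_path.

    `fetch` is a hypothetical callable that writes the remote content
    into the binary file object it receives.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".download-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            fetch(tmp)
        # The original file is replaced only after a complete download.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Failure or cancel (e.g. Ctrl-C): drop the partial temp file
        # and leave the original untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern an interrupted download leaves the previous local copy in place, so a later md5 comparison still sees a complete file.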
@ophiry I was unable to reproduce or I don't understand the issue correctly. Could you please give more details?
To clarify, there are two hashes:
1) the filename hash/git-hash, like .cache/file1_2b419f6
2) the md5 sum of the file content, e.g. $ md5 ~/src/myrepo/data/Posts.xml gives cfdaa4bba57fa07d81ff96685a9aab2c.
DVC downloads a file if the md5 sums of the local file and the S3 file don't match (it gets the S3 md5 through the AWS API). Yes, if downloading fails then DVC should download the file again.
So it is an issue if DVC downloads a file again when both md5 sums match, or if DVC doesn't download a file when the md5 sums do not match.
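To make the expected behavior concrete, here is a minimal sketch of the comparison described above. It is not DVC's implementation: `file_md5` and `needs_download` are hypothetical helper names, and `remote_md5` is assumed to be the hex digest already obtained from the remote side.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 of a file's content, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def needs_download(local_path, remote_md5):
    """Download only if the local file is missing or its md5 differs."""
    if not os.path.exists(local_path):
        return True
    return file_md5(local_path) != remote_md5
```

Under this rule, a second sync with unchanged files should download nothing, while a truncated local file would be fetched again.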
What I think happened is that a file was downloaded, and the download didn't finish correctly.
Afterwards, it appears that the md5 comparison assumes the local file is correct, so in case of a mismatch it will re-upload it (and overwrite the original file).
Two things that may help:
@ophiry Did #114 fix this issue? Can we close it now?
still, there may be other causes for corruption, and limiting the sync direction should give a more robust solution
Got it. Thank you.
Let's keep it and investigate ...
I actually agree with @ophiry, I also noticed such issues. I think we should split 'sync' into two commands: 'push' and 'pull', both describe nicely what is going on and will be very intuitive to use. Plus, we could add a new command called 'status', that will show new data files and changed old data files and how they relate to the ones in the cloud. All three commands are inspired by git, of course =)
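As a rough illustration of what the proposed 'status' command could report, here is a sketch that classifies files by comparing two name-to-md5 mappings. The function name, the input shape, and the state labels are all assumptions made for this example, not the actual patch.

```python
def status(local_md5, remote_md5):
    """Classify each data file by comparing local and cloud md5 sums.

    Both arguments are hypothetical dicts mapping file name -> md5 hex.
    """
    report = {}
    for name in sorted(set(local_md5) | set(remote_md5)):
        if name not in remote_md5:
            report[name] = "new (local only)"   # candidate for push
        elif name not in local_md5:
            report[name] = "missing locally"    # candidate for pull
        elif local_md5[name] != remote_md5[name]:
            report[name] = "changed"            # user picks push or pull
        else:
            report[name] = "in sync"
    return report
```

The point of the split is that a "changed" file is never resolved automatically: the user chooses the direction explicitly with push or pull.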
I will prepare a patch set for these shortly, so we could try them out and discuss before merging.
I've merged the pull/push commands, but didn't remove 'sync' just yet. We can declare it obsolete and remove it in the near future. As of now, pull + push explicitly declare the sync direction for both local and remote files, so I'm closing this issue for now. Feel free to reopen it if something is wrong.