Still an issue on V10.0.8
As per discussion on #220 with @zezha-msft
It should be possible to perform a "one-way sync", where the destination mirrors the source and any newer files in the destination container are wiped (i.e. without the date-modified check that sync performs by default right now).
What's needed
A --check-lmt flag or something similar to indicate whether sync should compare last modified times. If the flag value is false, we'd overwrite all destination files without looking at their lmts. If the flag value is true, it's the current sync behavior.
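A minimal sketch of the decision the proposed flag would control. The function name and the check_lmt parameter are hypothetical, and the "source newer than destination" rule is an assumption about the current sync behaviour based on this thread, not AzCopy's actual code.

```python
from datetime import datetime, timezone


def should_overwrite(source_lmt: datetime, dest_lmt: datetime, check_lmt: bool = True) -> bool:
    """Decide whether a file present on both sides is copied source -> destination.

    check_lmt stands in for the proposed (hypothetical) --check-lmt flag.
    """
    if not check_lmt:
        # Proposed behaviour: ignore timestamps and always overwrite.
        return True
    # Assumed current behaviour: copy only when the source is newer.
    return source_lmt > dest_lmt


older = datetime(2019, 1, 1, tzinfo=timezone.utc)
newer = datetime(2019, 6, 1, tzinfo=timezone.utc)

# Destination was edited after the source was last changed.
print(should_overwrite(older, newer, check_lmt=True))
print(should_overwrite(older, newer, check_lmt=False))
```

With the check enabled the tampered destination file is left alone; with it disabled the source copy wins regardless of timestamps.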
@Kapanther thanks for submitting this feature request!
@artemuwka could you please take a look?
Hi @Kapanther! Thanks for the suggestion. Can you walk me through the scenario you have in mind? It'll help us generalize and formalize the requirement(s).
Hey @artemuwka, there are two types of destination containers (static and dynamic), but only one way to handle them.
The static version is easy to handle, as you can be sure it is the same as when you last synced it, but in a dynamic container files may have been updated, created or even deleted.
Currently it's possible to remove files that have been created in the destination but don't exist in the source with --delete-destination=true. It's also easy to replace deleted files with azcopy sync.
However, if files get updated at the destination but not at the source, they will remain changed on the destination, which might not be your desired outcome.
The Scenario - Student Classes
A university runs a class that requires content to be delivered to the local drive of the end user. The content is stored in an Azure blob container and is constantly being revised and updated by various lecturers to improve it.
When users first log in, the content is synced from the blob to local. During the course of a class this content is tampered with to demonstrate various concepts. At the end of the course these students no longer use these computers, but the content needs to be reset to the original source content.
When the user logs out, the content is reset with azcopy sync <blobsource> <localdest> --recursive=true --delete-destination=true --check-lmt=false. This makes it ready for the next student to log in. The reset could also be done on login. If the lecturers had made any edits to the source blob, these would also be reflected in the destination content.
However, without --check-lmt=false, all the files updated by the previous students would remain updated.
A second azcopy run would have to be made over the files:
azcopy copy <blobsource> <localdest> --recursive=true --overwrite=true
This is slower and creates two separate logs and jobs. I think it overwrites the files updated during testing, but the documentation for the flag is not clear on this (see below): it doesn't state that the DMT is ignored for both newer and older files.
--overwrite overwrite the conflicting files/blobs at the destination if this flag is set to true. (default true)
hope that helps.
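The two-pass workaround from the comment above can be sketched as a small script. This only assembles the two command lines using the flags quoted in the thread; the helper name is hypothetical, the placeholders are left as-is, and nothing is executed here.

```python
from typing import List


def reset_commands(source: str, dest: str) -> List[List[str]]:
    """Build the two azcopy invocations described in the workaround."""
    # Pass 1: sync, removing files that exist only at the destination.
    sync_cmd = ["azcopy", "sync", source, dest,
                "--recursive=true", "--delete-destination=true"]
    # Pass 2: copy with overwrite, to replace files edited at the destination.
    copy_cmd = ["azcopy", "copy", source, dest,
                "--recursive=true", "--overwrite=true"]
    return [sync_cmd, copy_cmd]


for cmd in reset_commands("<blobsource>", "<localdest>"):
    # To actually run a command: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```

Each pass is a separate azcopy job with its own log, which is the overhead the feature request is trying to avoid.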
Thanks @Kapanther! Appreciate the details and the walkthrough. Adding @seguler for further review.
@artemuwka and @seguler any news on this one guys?
This is still the final hurdle for us. It's the only thing azcopy can't do file-comparison wise.
"however if files get updated at the destination, but not the source they will continue to remain changed on the destination. which might not be your desired outcome."
What do you think @zezha-msft ?
@Kapanther @JohnRusk Sorry for the delayed reply. I think this is a very reasonable feature request and relatively straightforward.
@Kapanther what's your timeline like?
@zezha-msft I guess if you want to put a date on it, let's aim for end of November! :) I'm in no rush to be honest, but it would be good to retire some old scripts and speed up some of the resyncs. (This one change would probably save the user 2-3 mins per session.)
@Kapanther that's great to hear! Actually this lines up perfectly with our October release (which will happen at the end of October or early November). I'll move this item to be scheduled for that release then. Thank you!
@zezha-msft and @JohnRusk any news on being able to remove newer files at destination during a sync... ?
Things got kind of busy on the team, and we didn't get to this one, sorry.
@JohnRusk That makes me a sad Panda... :p all good. One day maybe..
Can anyone verify if this issue/feature has formally been resolved please?
azcopy sync 10.7.0 is not mirroring the source to the destination when changes have been made to a file on the destination.
@phrak I imagine @JohnRusk can probably confirm. But I just checked 10.8.0 and I don't see anything in the flag list that would replace files that are newer in the destination using sync... it will still skip them. :(... One day I'll learn to code sufficiently to help implement this... but like... probably not .... because I'm a hack..
`Flags:
--block-size-mb float Use this block size (specified in MiB) when uploading to Azure Storage or downloading from Azure Storage. Default is automatically calculated based on file size. Decimal fractions are allowed (For example: 0.25).
--check-md5 string Specifies how strictly MD5 hashes should be validated when downloading. This option is only available when downloading. Available values include: NoCheck, LogOnly, FailIfDifferent, FailIfDifferentOrMissing. (default 'FailIfDifferent'). (default "FailIfDifferent")
--delete-destination string Defines whether to delete extra files from the destination that are not present at the source. Could be set to true, false, or prompt. If set to prompt, the user will be asked a question before scheduling files and blobs for deletion. (default 'false'). (default "false")
--exclude-attributes string (Windows only) Exclude files whose attributes match the attribute list. For example: A;S;R
--exclude-path string Exclude these paths when comparing the source against the destination. This option does not support wildcard characters (*). Checks relative path prefix(For example: myFolder;myFolder/subDirName/file.pdf).
--exclude-pattern string Exclude files where the name matches the pattern list. For example: *.jpg;*.pdf;exactName
-h, --help help for sync
--include-attributes string (Windows only) Include only files whose attributes match the attribute list. For example: A;S;R
--include-pattern string Include only files where the name matches the pattern list. For example: *.jpg;*.pdf;exactName
--log-level string Define the log verbosity for the log file, available levels: INFO(all requests and responses), WARNING(slow responses), ERROR(only failed requests), and NONE(no output logs). (default INFO). (default "INFO")
--preserve-smb-info False by default. Preserves SMB property info (last write time, creation time, attribute bits) between SMB-aware resources (Azure Files). This flag applies to both files and folders, unless a file-only filter is specified (e.g. include-pattern). The info transferred for folders is the same as that for files, except for Last Write Time which is not preserved for folders.
--preserve-smb-permissions False by default. Preserves SMB ACLs between aware resources (Azure Files). This flag applies to both files and folders, unless a file-only filter is specified (e.g. include-pattern).
--put-md5 Create an MD5 hash of each file, and save the hash as the Content-MD5 property of the destination blob or file. (By default the hash is NOT created.) Only available when uploading.
--recursive True by default, look into sub-directories recursively when syncing between directories. (default true). (default true)
--s2s-preserve-access-tier Preserve access tier during service to service copy. Please refer to Azure Blob storage: hot, cool, and archive access tiers to ensure destination storage account supports setting access tier. In the cases that setting access tier is not supported, please use s2sPreserveAccessTier=false to bypass copying access tier. (default true). (default true)
Flags Applying to All Commands:
--cap-mbps float Caps the transfer rate, in megabits per second. Moment-by-moment throughput might vary slightly from the cap. If this option is set to zero, or it is omitted, the throughput isn't capped.
--output-type string Format of the command's output. The choices include: text, json. The default value is 'text'. (default "text")
--trusted-microsoft-suffixes string Specifies additional domain suffixes where Azure Active Directory login tokens may be sent. The default is '.core.windows.net;.core.chinacloudapi.cn;.core.cloudapi.de;.core.usgovcloudapi.net'. Any listed here are added to the default. For security, you should only put Microsoft Azure domains here. Separate multiple entries with semi-colons.`
@Kapanther I've moved onto a different project, so my memory is getting a bit rusty on AzCopy now. I think you are correct.
@JohnRusk or @zezha-msft who are the new custodians of this one now?
This issue? There's no change in ownership there. It's only me who's moved. The issue is still assigned to Ze.
@Kapanther sorry for the slow reply! We are looking at improving the command line interface to make it more ergonomic (and more familiar for people who came from rsync), we can include this feature ask as part of that. I'll make sure to have it brought up and discussed. We really appreciate your patience on this.
Just touching base on this issue - Has there been any recent progress on being able to properly sync a source:target pair please?
@phrak ehhh... alas i have given up hope...
@Kapanther That's a real shame - The tool has so much potential, but this is a pretty critical gap IMHO.
@zezha-msft - Rather than a full CLI interface redesign, would it be easier to add a way to properly sync SOURCE:TARGET folders?
@phrak @Kapanther sorry for the delay, we are including this feature in the upcoming 10.11.
Please let us know if you have any feedback on the flag name or the description:
"mirror-mode": "Disable last-modified-time based comparison and overwrites the conflicting files and blobs at the destination if this flag is set to true. Default is false"
@zezha-msft and @phrak nice!.. One question for you.. Does "mirror-mode" overwrite files that have the same DM (I see the description says "Conflicting" so I would assume not)? If so though, would it be possible to not overwrite files that are the same, so we still get the benefits of a sync? Otherwise it's just the same as doing "copy --overwrite=true".
In a way the description could be seen as misleading, as there is still a last-modified comparison happening; it just changes it to overwrite _any_ new files. But... semantics really... I'll check in the upcoming release.
Hey @Kapanther ,
As stated in the explanation given by Ze, mirror-mode will only affect files which are present at both source and destination. Previously, to resolve the conflict of whether to transfer the file or not, we used the last-modified time; with mirror-mode enabled, we'll always transfer from source to destination, even if the destination is newer than the source.
Does it answer your question?
@mohsha-msft ah, sorry, but that doesn't really answer my question.. let me try and rephrase..
If the files have the same date-modified stamp, will mirror mode still do the copy?
@Kapanther ,
Yes, for every file which exists on both source and destination, it will perform an overwrite operation and bypass the logic of checking/comparing last modified time.
For all the other files, transfer will go on as usual.
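The behaviour described in this answer can be sketched as a planning function over two file listings. The names below (plan_sync, the action strings, plain floats as timestamps) are illustrative assumptions, not AzCopy internals.

```python
from typing import Dict


def plan_sync(source: Dict[str, float], dest: Dict[str, float],
              mirror_mode: bool = False,
              delete_destination: bool = False) -> Dict[str, str]:
    """Map each relative path to an action, given {path: last-modified time}."""
    plan = {}
    for path, src_lmt in source.items():
        if path not in dest:
            plan[path] = "copy"        # missing at destination
        elif mirror_mode or src_lmt > dest[path]:
            plan[path] = "overwrite"   # mirror mode skips the lmt comparison
        else:
            plan[path] = "skip"        # destination is same age or newer
    for path in dest:
        if path not in source:
            plan[path] = "delete" if delete_destination else "keep"
    return plan


source = {"a.txt": 100.0, "b.txt": 100.0, "c.txt": 100.0}
dest = {"a.txt": 100.0, "b.txt": 250.0, "extra.txt": 300.0}
print(plan_sync(source, dest, mirror_mode=True, delete_destination=True))
```

In this sketch a.txt (same timestamp on both sides) is overwritten in mirror mode too, which is the case the question above was asking about.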
@Kapanther I thought the ask was to do a "brute sync", where everything is copied from source to destination, and the extra files on the destination are removed. Is this not the right understanding?
Even if lmts are the same on both src and dst side, there's no way of knowing whether the file is actually the same since I think your scenario is that the destination will have active writes too, correct?
@zezha-msft essentially yes, brute sync is a great way of describing it. True, destination could have writes as well as new files, and I would expect all of the new files to be removed and any updated files to be replaced with file from source.
But there would still be a bunch of files that are unaltered from the original source copy, particularly if it goes from blob -> filesystem and --preserve-last-modified-time is used. Not having to copy all these unchanged files is still a huge saving.
As a long-time robocopy user, "Mirror" mode should be consistent with the robocopy behaviour.
That is: Mirror = Copy + Delete from destination + Overwrite if different.
AZCopy still has a major functionality gap, not having an "Overwrite if different" capability.
Will the new "Mirror" mode be able to handle this scenario?
@Kapanther ah I see what you mean. But we cannot generalize this desired behavior to the other scenarios, in other words, only x -> local has the possibility of asserting files are equal based on lmts, because this is the only case where we can set lmts ourselves.
For example, we cannot set lmts on the blobs, so even if source and destination lmts are the same, it does not mean anything. E.g. file1 has lmt of 2021-06-01-09:37:27 UTC both on the local machine and the blob endpoint; it first appears that they must be the same (assuming they have the same size), but in reality it may point to the exact opposite in the upload scenario: the blob cannot be the same as the local file, since its lmt was created by the service automatically when the data was committed, so the blob data was already uploaded at 2021-06-01-09:37:27 UTC, and thus the local file1 might very possibly have newer content.
Even if we look at the x -> local scenario only, there is still uncertainty about whether the local lmts were really preserved from the last run (maybe it was AzCopy, maybe it wasn't, we couldn't know).
"Overwrite if different" cannot be achieved by any client side tool right now because of service limitations: we cannot set lmts on the blobs, and we cannot reliably get any hash over the entire blob.
If we could do either of the above, then we have a reliable system where we can generalize the rules to say whether files are equal. Until that happens, the best we can do is 1-way syncs (comparing lmts in a relative way), or brute sync (copy everything, which solves the problem that @Kapanther you described earlier).
@phrak Please let us know if this makes sense, we are all ears if you have any suggestion or different perspective on this topic.
@zezha-msft I had a feeling there might be some difficulties in working out whether two files are exactly the same. I guess my use case is relatively niche and I should probably just write some sort of custom tool to do it.
new mirror copy behaviour is still well useful though! thanks for following up...
@zezha-msft , thanks for the detailed explanation.
Could you use either of these options to determine if the blob is different from the source or target?
Option 1:
Set the file's LMTS as a custom metadata attribute of the blob. Compare the source file's LMTS to the blob's metadata attribute, and overwrite if different/newer/older.
Option 2:
Use the MD5 hash of a file to determine if the blob is different.
You mentioned that _we cannot reliably get any hash over the entire blob_, but I don't quite understand why it's not reliable.
What does the --put-md5 option do in this case?
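As a rough illustration of Option 2, a client could compute the MD5 of a local file and compare it with the Content-MD5 value stored on the blob, which is the base64 encoding of the MD5 digest. The helper names are hypothetical, and the stored value is passed in as a plain string rather than fetched from the service. The comparison is only as trustworthy as whatever tool last set that property.

```python
import base64
import hashlib
from typing import Optional


def content_md5(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Return the base64-encoded MD5 digest of a local file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def differs_from_stored(path: str, stored_content_md5: Optional[str]) -> bool:
    """Treat the file as different unless the stored hash matches exactly."""
    if not stored_content_md5:
        # No stored hash: equality cannot be shown, so assume different.
        return True
    return content_md5(path) != stored_content_md5
```

A missing stored hash is treated as "different" here, so the sketch falls back to overwriting rather than skipping.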
Great questions @phrak!
Option 1: there is no well-known format for such a metadata attribute, and different customers may have used different tools to bring their data into Azure in the first place, so it's not possible to generalize and rely on such a metadata attribute.
Option 2: there is indeed a stored MD5 that the tool can set when uploading to the blob service. However, there is no guarantee that this value is kept up to date or was correct in the first place.
A reliable way to get a hash over the whole blob would be a new service API where we can query a composable hash over a range of the blob, calculated from the actual data in a timely manner. Then we can combine the hashes over all the ranges, and come up with the real hash over the entire blob. Unfortunately, such an API does not exist yet, but I do know that there are already some conversations going on about this exact scenario. I'll keep an eye on relevant topics and advocate for such APIs when possible.
Hope this helps.
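To make the "combine the hashes over all the ranges" idea concrete, here is a purely local sketch using a hash-of-hashes. It is one possible combining scheme chosen for illustration, not the API discussed above (which, per the comment, does not exist); the function names, SHA-256 and the range size are all assumptions.

```python
import hashlib
from typing import List


def range_hashes(data: bytes, range_size: int) -> List[bytes]:
    """Hash each fixed-size range of the data independently."""
    if range_size <= 0:
        raise ValueError("range_size must be positive")
    return [hashlib.sha256(data[i:i + range_size]).digest()
            for i in range(0, len(data), range_size)]


def combine(hashes: List[bytes]) -> str:
    """Combine per-range digests, in order, into one overall digest."""
    outer = hashlib.sha256()
    for h in hashes:
        outer.update(h)
    return outer.hexdigest()


local = b"A" * 8 + b"B" * 8
remote = b"A" * 8 + b"C" * 8  # only the second range differs

local_ranges = range_hashes(local, 8)
remote_ranges = range_hashes(remote, 8)
print(local_ranges[0] == remote_ranges[0])
print(combine(local_ranges) == combine(remote_ranges))
```

Both sides must use the same range size for the combined values to be comparable, and the combined value is not the same as a single hash computed over the whole content.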