10.3.1
Linux
./azcopy copy "https://testbullclip.blob.core.windows.net/grain-blobs?[SAS]" "https://dbtaudevdev.blob.core.windows.net/grain-blobs?[SAS]" --recursive --include-path "DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference="
Neither --include-path nor --include-pattern offers the same performant query as getting blobs by prefix.
This command az storage blob list --container-name $container --account-name testbullclip --prefix $_.name appears to do the same snappy prefix query that azure storage explorer does in that the results are nearly instant. It definitely didn't scan 1.34M blobs.
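For reference, the prefix listing mentioned above could be wrapped in a small helper. This is only a sketch: the function name `list_blobs_by_prefix` is hypothetical, and authentication (SAS token, account key, or login) is assumed to be configured separately.

```shell
# list_blobs_by_prefix ACCOUNT CONTAINER PREFIX   (hypothetical helper name)
# Prints the names of blobs whose names start with PREFIX, one per line.
# The prefix is passed to the service, so only matching blobs are returned.
list_blobs_by_prefix() {
  az storage blob list \
    --account-name "$1" \
    --container-name "$2" \
    --prefix "$3" \
    --num-results "*" \
    --query "[].name" \
    --output tsv
}
```

It would be called as, for example, `list_blobs_by_prefix testbullclip grain-blobs "DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference="`.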
--include-path is fast but appears to only allow complete path segments. I can do /foo/foobar, but /foo/foo complains about no matches because I only provided part of the directory name.
--include-pattern appears to be doing a full scan and is... well, useless in a real world scenario.
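The difference between the two matching behaviours can be illustrated with a small model. This is a sketch of the behaviour reported in this thread, not AzCopy's actual implementation; both function names are hypothetical.

```shell
# starts_with NAME PREFIX: succeeds when NAME begins with PREFIX
# (the semantics of a service-side prefix query).
starts_with() {
  case "$1" in
    "$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# matches_whole_segments NAME PATH: succeeds only when PATH is NAME itself
# or a run of complete leading path segments of NAME
# (the --include-path behaviour as observed in this thread).
matches_whole_segments() {
  _p="${2%/}"                      # ignore one trailing slash on PATH
  [ "$1" = "$_p" ] && return 0
  starts_with "$1" "$_p/"
}

blob="foo/foobar/data.json"
starts_with "$blob" "foo/foo" && echo "prefix match: foo/foo"
matches_whole_segments "$blob" "foo/foo" || echo "segment match fails: foo/foo"
matches_whole_segments "$blob" "foo/foobar" && echo "segment match: foo/foobar"
```

All three messages are printed: `foo/foo` is a valid prefix of the blob name, but it is not a complete path segment, so only `foo/foobar` passes the segment check.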
The problem is the implementation. Give us prefix searching (from the source) please.
No. There is no mitigation/solution. Migrating huge storage accounts remains a hellish nightmare. I was trying to write some script that would batch up these migrations by starting multiple copy jobs on a range of prefixes but ran into this.
If --include-path foo/foo found foo/foobar, would that solve the problem?
That is already what it does, AFAIK
Then maybe there's a bug. Can you run some tests tomorrow please @adreed-msft , and then discuss with me and @zezha-msft ?
Yes @JohnRusk, that is the expected/wanted behaviour. If there's some reason it behaves the way it does now, then maybe add a new arg (--include-prefix). The CLI and Azure Storage Explorer refer to this concept consistently as a _prefix_.
Off-topic: az storage blob copy start-batch ... has a --pattern argument which is about as useful as azcopy's --include-pattern arg.
Here is my test run through.
I have this blob:
DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference=000000000000000000000000000000000600000079a1789a_00ce1ddbd2de5293f65c895c608a73b61a4973a1fdce8810084f8771d180372a/data.json
kubectl.exe run azcopy6 -it --cluster db-t-cluster-au-dev --namespace au-dev-dev --restart=Never --overrides='{\"apiVersion\": \"v1\", \"spec\" : {\"imagePullSecrets\": [{\"name\": \"docker-registry\"}] } }' --rm --image=drawboardau.azurecr.io/azcopy:latest -- ./azcopy copy "https://testbullclip.blob.core.windows.net/grain-blobs[SAS]" "https://dbtaudevdev.blob.core.windows.net/grain-blobs[SAS]" --recursive --include-path DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference=000000000000000000000000000000000600000079a1789a_00ce1ddbd2de5293f65c895c608a73b61a4973a1fdce8810084f8771d180372a
If you don't see a command prompt, try pressing enter.
Job ecc53b4b-3362-9b45-5cfd-8c221b170c6c summary
Elapsed Time (Minutes): 0.0333
Total Number Of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 626
Final Job Status: Completed
pod "azcopy6" deleted
The above copied 1 file (as expected). I will modify the path by dropping the last character.
kubectl.exe run azcopy6 -it --cluster db-t-cluster-au-dev --namespace au-dev-dev --restart=Never --overrides='{\"apiVersion\": \"v1\", \"spec\" : {\"imagePullSecrets\": [{\"name\": \"docker-registry\"}] } }' --rm --image=drawboardau.azurecr.io/azcopy:latest -- ./azcopy copy "https://testbullclip.blob.core.windows.net/grain-blobs[SAS]" "https://dbtaudevdev.blob.core.windows.net/grain-blobs[SAS]" --recursive --include-path DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference=000000000000000000000000000000000600000079a1789a_00ce1ddbd2de5293f65c895c608a73b61a4973a1fdce8810084f8771d180372
INFO: Scanning...
Job 6cf79a98-f44d-f045-7fcc-9f10eec92e80 has started
Log file is located at: /azcopy/logs/6cf79a98-f44d-f045-7fcc-9f10eec92e80.log
failed to perform copy command due to error: no transfers were scheduled because no files matched the specified criteria
pod "azcopy6" deleted
pod au-dev-dev/azcopy6 terminated (Error)
The only difference is I deleted the a from the end of the include path.
We're going to discuss this more, within the project team, then get back to you.
@worldspawn We're still working through this issue in the project team. Need to tread carefully because Blob Storage is not the only thing that AzCopy can read from, and also we don't want to break any existing functionality, and we don't want to make things any more complicated than necessary. But we see your point about this being useful.
BTW, you wrote:
I was trying to write some script that would batch up these migrations by starting multiple copy jobs on a range of prefixes but ran into this.
That hints at some other possible issues, which I'm very interested in. In particular, I'm curious to know why you need to batch it up. Is it to:
Reasons 1 and 2 make a lot of sense to me, and I often recommend them to AzCopy users. For performance though, if you are being driven to batch stuff up for performance reasons, we'd like to know, because that may point out performance bottlenecks that we should fix.
I think the goal I had was to split the work into batches but then run those batches in parallel, just by fanning out more containers, spread over several vms. All in azure.
I have 1.3M blobs encompassing 115GB and was really looking to max out my storage accounts ingress/egress limit and copy the data as fast as possible. I'm trying to migrate this data and will need to have down time while the migration is in progress.
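The batching idea described here could start from a list of prefix ranges, one per job. The sketch below only generates such a list; the function name is hypothetical, and splitting on a single hex digit is an assumption that only balances the work if blob names are evenly distributed at that position. As discussed earlier in the thread, AzCopy 10.3.1 could not consume these partial prefixes through --include-path.

```shell
# hex_prefixes BASE   (hypothetical helper name)
# Prints BASE followed by each hex digit, one candidate prefix per line.
hex_prefixes() {
  for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
    printf '%s%s\n' "$1" "$d"
  done
}

hex_prefixes "DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference="
```

Each output line could then be handed to a separate worker, for example as the `--prefix` argument of `az storage blob list`.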
Thanks for describing what you're looking at doing. FYI we are planning to do some tweaks to AzCopy to make it a bit more efficient in these situations.
Performance tips
In my own testing, I have not observed any benefit from parallelizing the work in cases like yours. My assumption is that setting one instance of AzCopy to use high concurrency (as below) accomplishes the same thing, but in a more convenient way. Also, you don't need multiple VMs, because for account-to-account copies the data flows directly from the source account to the destination account. AzCopy just orchestrates the process; it doesn't actually handle the data. Things that _do_ help with performance for small blobs are:
--check-length=false
--s2s-preserve-access-tier=false and just let your destination blobs pick up the default tier of their destination. That may also save on IO operations.
--log-level WARNING. This has a noticeable effect.
Hopefully that will give you performance just as good as what could be obtained by running parallel jobs, but without the administrative overhead of actually running parallel jobs.
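Put together, the flags above might be combined as in the sketch below. The helper only prints the command rather than running it. Two things are assumptions here, not taken from the comment: that the "high concurrency" setting refers to the `AZCOPY_CONCURRENCY_VALUE` environment variable, and the default value of 256, which is purely illustrative. The function name is hypothetical.

```shell
# tuned_copy_command SRC_URL DST_URL [CONCURRENCY]   (hypothetical helper name)
# Prints an azcopy command line with the small-blob tuning flags applied.
# CONCURRENCY defaults to 256, an illustrative value and not a recommendation.
tuned_copy_command() {
  printf 'AZCOPY_CONCURRENCY_VALUE=%s azcopy copy "%s" "%s" --recursive --check-length=false --s2s-preserve-access-tier=false --log-level WARNING\n' \
    "${3:-256}" "$1" "$2"
}

tuned_copy_command \
  "https://testbullclip.blob.core.windows.net/grain-blobs?[SAS]" \
  "https://dbtaudevdev.blob.core.windows.net/grain-blobs?[SAS]"
```

Printing the command first makes it easy to review before pasting it into a shell with the real SAS tokens substituted.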
Batching tips
BTW, if you have multiple source containers, you may be able to leverage that fact to break the work up into batches, for reasons 1 and 2, above. This is easy if your containers happen to have names that end in numbers: you just use container names like "*0", "*1", "*2" etc. That moves all containers ending in 0, then all ending in 1, and so on.
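A sketch of that container-suffix batching follows. It assumes, as the comment above describes, that a wildcard is accepted in the container position of the source URL and that the destination is the account-level URL. The function name and the example URLs in the usage note are hypothetical, and the commands are printed rather than executed.

```shell
# container_batch_commands SRC_ACCOUNT_URL DST_ACCOUNT_URL   (hypothetical helper name)
# Prints one azcopy command per trailing digit, so that each batch covers
# the containers whose names end in that digit.
container_batch_commands() {
  for d in 0 1 2 3 4 5 6 7 8 9; do
    printf 'azcopy copy "%s/*%s?[SAS]" "%s?[SAS]" --recursive\n' "$1" "$d" "$2"
  done
}
```

For example, `container_batch_commands "https://srcaccount.blob.core.windows.net" "https://dstaccount.blob.core.windows.net"` prints ten commands that can be run one after another, with `[SAS]` replaced by real tokens.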
BTW, if you run all 1.3 M blobs in one AzCopy job, AzCopy can handle that. There's a known performance issue if overwrite=false and the blob count is in the 10's or 100's of millions, but there are no known issues for counts in the 1 million range.
Future
We are also kicking off work to decide how best to support prefixing. As noted above, I'm not sure that prefixing is actually something that would help with "only" 1.3 million small blobs. But for customers with 10's or 100's of times more blobs, it will definitely help for reasons 1 and 2 noted above. We've commenced design work on a prefix search feature, and I'll mark this issue as "feature request".
Summary
Hopefully the above info will let you migrate successfully with the current version of AzCopy. In future, we'll look at prefixing for manageability and performance tuning for speed.
@JohnRusk I just want to share our experience we had regarding this issue.
A current implementation for data persistence relies on blob containers to store entities. They are following a naming convention such as {subscriptionId}_{entityType}_{entityId}. Per container, we have a few blobs, but nothing bigger than a few KBs.
I ran a similar command and it took almost a full day (a little bit more than 24h, including the initial scanning process) to complete in order to copy 4MB of data:
azcopy copy <src> <dest> --recursive --overwrite=ifSourceNewer
Job <jobId> summary
Elapsed Time (Minutes): 635.854
Total Number Of Transfers: 1025116
Number of Transfers Completed: 5529
Number of Transfers Failed: 0
Number of Transfers Skipped: 1019587
TotalBytesTransferred: 4581238
Final Job Status: CompletedWithSkipped
It would have been really convenient to pass something like {subscriptionId}_{entityType} as a prefix to scope the number of lookups and copies down to a minimum. Even without the overwrite switch, it still took 10+ hours to copy almost a GB of data on the initial test drive.
In brief, for our case, it might have taken less time to script something with the Azure CLI using the command detailed by @worldspawn, but we found out about that command too late.
Might get a partial implementation in 10.4, maybe. Tagging this with 10.4 to remind us to post a further update here.
include-pattern uses a service-side prefix when you're only doing non-recursive copies from blob storage as of 10.4. Thus, it should be nice and snappy for that.
However, include-path is a bit of a different story. @JohnRusk this would be a good opportunity for your prefix finding work to handle sub-virtual-directories.
As to _optimizing_ include-path and list of files to support partial directory names on other platforms, this is a metatraverser change that could be made in 10.5, I believe. It'd involve likely passing an additional parameter to the traverser in some nature. In the case of the files traverser, due to its per-directory nature, this would be much easier. In the case of the ADLSG2 traverser, this would be difficult-er, in that we'd need to make a non-recursive search before we went recursive if we found the folder (or started recursive with the folder if it existed)
I will keep this issue open and tag it with 10.5, since it's an ongoing performance and usability concern.
BTW, as of 10.4, include-pattern implements limited prefix-based searching. Specifically, patterns containing a single *, with that * at the end, are interpreted as prefixes... but only when recursive is false. (That's the limitation).
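As a sketch of how that 10.4 behaviour might be used: the helper below prints a non-recursive copy command whose pattern is a prefix followed by a single trailing `*`. The function name is hypothetical, and whether a given pattern is actually served by a service-side prefix query depends on the AzCopy version, as described above.

```shell
# prefix_copy_command SRC_URL DST_URL PREFIX   (hypothetical helper name)
# Prints an azcopy command that uses --include-pattern "PREFIX*" without
# --recursive. PREFIX itself must not contain '*', so the pattern keeps
# exactly one wildcard, at the end.
prefix_copy_command() {
  case "$3" in
    *'*'*) echo "prefix must not contain '*'" >&2; return 1 ;;
  esac
  printf 'azcopy copy "%s" "%s" --include-pattern "%s*"\n' "$1" "$2" "$3"
}
```

Because the command is non-recursive, this would only be expected to help for blobs that sit directly under the source URL.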
Thanks for the updates