Azure-storage-azcopy: Why is there no copy by prefix?

Created on 21 Oct 2019 · 14 Comments · Source: Azure/azure-storage-azcopy

Which version of the AzCopy was used?

10.3.1

Which platform are you using? (ex: Windows, Mac, Linux)

Linux

What command did you run?

./azcopy copy "https://testbullclip.blob.core.windows.net/grain-blobs?[SAS]" "https://dbtaudevdev.blob.core.windows.net/grain-blobs?[SAS]" --recursive --include-path "DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference="

What problem was encountered?

Neither include-path nor include-pattern offers the same performant query as getting blobs by prefix.

This command, az storage blob list --container-name $container --account-name testbullclip --prefix $_.name, appears to do the same snappy prefix query that Azure Storage Explorer does, in that the results are nearly instant. It definitely didn't scan 1.34M blobs.

--include-path is fast but appears to allow only complete path segments. I can do /foo/foobar, but /foo/foo complains about no matches because I provided only part of the directory name.

--include-pattern appears to be doing a full scan and is... well, useless in a real world scenario.

How can we reproduce the problem in the simplest way?

The problem is the implementation. Give us prefix searching (from the source) please.

Have you found a mitigation/solution?

No. There is no mitigation/solution. Migrating huge storage accounts remains a hellish nightmare. I was trying to write some script that would batch up these migrations by starting multiple copy jobs on a range of prefixes but ran into this.

feature request

worldspawn

Most helpful comment

@JohnRusk I just want to share our experience we had regarding this issue.

A current implementation for data persistence relies on blob containers to store entities. They are following a naming convention such as {subscriptionId}_{entityType}_{entityId}. Per container, we have a few blobs, but nothing bigger than a few KBs.

I ran a similar command and it took almost a full day (a little bit more than 24h, including the initial scanning process) to complete in order to copy 4MB of data:

azcopy copy <src> <dest> --recursive --overwrite=ifSourceNewer

Job <jobId> summary
Elapsed Time (Minutes): 635.854
Total Number Of Transfers: 1025116
Number of Transfers Completed: 5529
Number of Transfers Failed: 0
Number of Transfers Skipped: 1019587
TotalBytesTransferred: 4581238
Final Job Status: CompletedWithSkipped

It would have been really convenient to pass something like {subscriptionId}_{entityType} as a prefix to scope down the number of lookups and copies to a minimum. Even without the overwrite switch, it still took 10+ hours to copy almost a GB of data on the initial test drive.

In brief, for our case, it might have taken less time to script something with the Azure CLI by using the command detailed by @worldspawn, but we found out about that command too late.

felpel on 2 Feb 2020

🚀1 👍1

All 14 comments

If --include-path foo/foo found foo/foobar, would that solve the problem?

JohnRusk on 21 Oct 2019

That is already what it does, AFAIK

adreed-msft on 21 Oct 2019

Then maybe there's a bug. Can you run some tests tomorrow please, @adreed-msft, and then discuss with me and @zezha-msft?

JohnRusk on 21 Oct 2019

Yes @JohnRusk, that is the expected/wanted behaviour. If there's some reason it behaves that way, then maybe a new arg (--include-prefix). The CLI and Azure Storage Explorer refer to this concept consistently as a _prefix_.

Off-topic: az storage blob copy start-batch ... has a --pattern argument which is about as useful as azcopy's --include-pattern arg.

Here is my test run through.

I have this blob:
DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference=000000000000000000000000000000000600000079a1789a_00ce1ddbd2de5293f65c895c608a73b61a4973a1fdce8810084f8771d180372a/data.json

kubectl.exe run azcopy6 -it --cluster db-t-cluster-au-dev --namespace au-dev-dev --restart=Never --overrides='{\"apiVersion\": \"v1\", \"spec\" : {\"imagePullSecrets\": [{\"name\": \"docker-registry\"}] } }' --rm --image=drawboardau.azurecr.io/azcopy:latest -- ./azcopy copy "https://testbullclip.blob.core.windows.net/grain-blobs[SAS]" "https://dbtaudevdev.blob.core.windows.net/grain-blobs[SAS]" --recursive --include-path DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference=000000000000000000000000000000000600000079a1789a_00ce1ddbd2de5293f65c895c608a73b61a4973a1fdce8810084f8771d180372a
If you don't see a command prompt, try pressing enter.



Job ecc53b4b-3362-9b45-5cfd-8c221b170c6c summary
Elapsed Time (Minutes): 0.0333
Total Number Of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 626
Final Job Status: Completed

pod "azcopy6" deleted

The above copied 1 file (as expected). I will modify the path by dropping the last character.

kubectl.exe run azcopy6 -it --cluster db-t-cluster-au-dev --namespace au-dev-dev --restart=Never --overrides='{\"apiVersion\": \"v1\", \"spec\" : {\"imagePullSecrets\": [{\"name\": \"docker-registry\"}] } }' --rm --image=drawboardau.azurecr.io/azcopy:latest -- ./azcopy copy "https://testbullclip.blob.core.windows.net/grain-blobs[SAS]" "https://dbtaudevdev.blob.core.windows.net/grain-blobs[SAS]" --recursive --include-path DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference=000000000000000000000000000000000600000079a1789a_00ce1ddbd2de5293f65c895c608a73b61a4973a1fdce8810084f8771d180372
INFO: Scanning...

Job 6cf79a98-f44d-f045-7fcc-9f10eec92e80 has started
Log file is located at: /azcopy/logs/6cf79a98-f44d-f045-7fcc-9f10eec92e80.log


failed to perform copy command due to error: no transfers were scheduled because no files matched the specified criteria

pod "azcopy6" deleted
pod au-dev-dev/azcopy6 terminated (Error)

The only difference is I deleted the a from the end of the include path.
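
The two runs above are consistent with whole-segment matching rather than prefix matching. The sketch below is a toy model of that difference only: matches_path_segment and matches_prefix are hypothetical helper names, the blob name is shortened for readability, and none of this is AzCopy's actual code.

```shell
#!/bin/sh
# Toy model only: hypothetical helpers, not AzCopy's implementation.

# Whole-segment match: the blob name must equal the path,
# or continue with "/" right after it.
matches_path_segment() {
  # $1 = blob name, $2 = include path
  case "$1" in
    "$2" | "$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prefix match: the blob name only has to start with the given string.
matches_prefix() {
  # $1 = blob name, $2 = prefix
  case "$1" in
    "$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

blob='dir/GrainReference=abc/data.json'   # shortened, made-up blob name
matches_path_segment "$blob" 'dir/GrainReference=abc' && echo 'segment: full name matches'
matches_path_segment "$blob" 'dir/GrainReference=ab'  || echo 'segment: truncated name does not match'
matches_prefix       "$blob" 'dir/GrainReference=ab'  && echo 'prefix: truncated name matches'
```

Under this model, dropping the last character of the directory name breaks the segment match but not the prefix match, which is the behaviour the request asks for.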

worldspawn on 21 Oct 2019

👍2

We're going to discuss this more, within the project team, then get back to you.

JohnRusk on 21 Oct 2019

👍1

@worldspawn We're still working through this issue in the project team. Need to tread carefully because Blob Storage is not the only thing that AzCopy can read from, and also we don't want to break any existing functionality, and we don't want to make things any more complicated than necessary. But we see your point about this being useful.

BTW, you wrote:

I was trying to write some script that would batch up these migrations by starting multiple copy jobs on a range of prefixes but ran into this.

That hints at some other possible issues, which I'm very interested in. In particular, I'm curious to know why you need to batch it up. Is it to:

1. Make the work easier to manage, by doing it in batches.
2. Assist with error handling and retries, by using smaller batches.
3. Go faster. (This is the one I'm interested in, because I'd like to understand if there are performance issues affecting you in this regard, and whether they relate to finding the blobs - i.e. scanning - or actually moving the data.)

Reasons 1 and 2 make a lot of sense to me, and I often recommend them to AzCopy users. For performance though, if you are being driven to batch stuff up for performance reasons, we'd like to know, because that may point out performance bottlenecks that we should fix.

JohnRusk on 31 Oct 2019

I think the goal I had was to split the work into batches but then run those batches in parallel, just by fanning out more containers, spread over several vms. All in azure.

I have 1.3M blobs encompassing 115GB and was really looking to max out my storage accounts ingress/egress limit and copy the data as fast as possible. I'm trying to migrate this data and will need to have down time while the migration is in progress.
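
A rough dry-run sketch of that fan-out idea, for illustration: it prints one az storage blob list command per prefix instead of calling Azure. The function name print_prefix_batches is hypothetical, and splitting on the next hex digit is an assumption about how the blob names are distributed; it may not suit every naming scheme. Only the az flags already quoted in this issue are used.

```shell
#!/bin/sh
# Dry run: prints one listing command per prefix; nothing is sent to Azure.
# print_prefix_batches is a hypothetical helper name.
print_prefix_batches() {
  # $1 = storage account, $2 = container, $3 = common leading part of the blob names
  for digit in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
    # Assumption: the character after the common part is a lowercase hex digit.
    printf 'az storage blob list --account-name %s --container-name %s --prefix "%s%s"\n' \
      "$1" "$2" "$3" "$digit"
  done
}

print_prefix_batches testbullclip grain-blobs \
  'DBCloud.ActorCollection.AccessControl.UserActionRatesGrain/GrainReference='
```

Each printed line could be handed to a separate worker, which is the batching this issue was trying to script before running into the include-path behaviour.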

worldspawn on 31 Oct 2019

Thanks for describing what you're looking at doing. FYI we are planning to do some tweaks to AzCopy to make it a bit more efficient in these situations.

Performance tips

In my own testing, I have not observed any benefit from parallelizing the work in cases like yours. My assumption is that setting one instance of AzCopy to use high concurrency (as below) accomplishes the same thing, but in a more convenient way. Also, you don't need multiple VMs, because for account-to-account copies the data flows directly from the source account to the destination account. AzCopy just orchestrates the process and doesn't actually handle the data. Things that _do_ help with performance for small blobs are:

Use a VM with, I'd suggest, at least 8 CPU cores. (Maybe 4 would do, but I haven't tested that small on this type of high-file-count scenario.)
Consider opting out of the length check that AzCopy does by default for each copied file. By opting out of this check, you reduce the number of IO operations per file (which is good for performance). To opt out, add this to the command line: --check-length=false
Consider also setting --s2s-preserve-access-tier=false and just letting your destination blobs pick up the default tier of their destination. That may also save on IO operations.
Turn down the logging level, with --log-level WARNING. This has a noticeable effect.
Use a high concurrency setting, by setting the AZCOPY_CONCURRENCY_VALUE environment variable to about 600.

Hopefully that will give you performance just as good as what could be obtained by running parallel jobs, but without the administrative overhead of actually running parallel jobs.
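
Put together, the tips above might look something like the following. This is a dry-run sketch that only prints the command; print_tuned_copy is a hypothetical helper name, and the URLs are placeholders to be replaced with real container URLs and SAS tokens. The flags and the environment variable are the ones listed above.

```shell
#!/bin/sh
# Dry run: prints the tuned command instead of executing it.
# Placeholders only; substitute real container URLs with SAS tokens.
SRC_URL='https://<source-account>.blob.core.windows.net/<container>?<SAS>'
DST_URL='https://<dest-account>.blob.core.windows.net/<container>?<SAS>'

# print_tuned_copy is a hypothetical helper name.
print_tuned_copy() {
  # $1 = source URL, $2 = destination URL
  printf 'AZCOPY_CONCURRENCY_VALUE=600 azcopy copy "%s" "%s" %s\n' \
    "$1" "$2" \
    '--recursive --check-length=false --s2s-preserve-access-tier=false --log-level WARNING'
}

print_tuned_copy "$SRC_URL" "$DST_URL"
```

Printing first makes it easy to review the exact command line before pasting it into a shell on the VM.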

Batching tips

BTW, if you have multiple source containers, you may be able to leverage that fact to break the work up into batches, for reasons 1 and 2, above. This is easy if your containers happen to have names that end in numbers: you just use container names like "*0", "*1", "*2" etc. That moves all containers ending in zero, then all ending in 1, and so on.

BTW, if you run all 1.3 M blobs in one AzCopy job, AzCopy can handle that. There's a known performance issue if overwrite=false and the blob count is in the 10's or 100's of millions, but there are no known issues for counts in the 1 million range.

Future

We are also kicking off work to decide how best to support prefixing. As noted above, I'm not sure that prefixing is actually something that would help with "only" 1.3 million small blobs. But for customers with 10's or 100's of times more blobs, it will definitely help for reasons 1 and 2 noted above. We've commenced design work on a prefix search feature, and I'll mark this issue as "feature request".

Summary

Hopefully the above info will let you migrate successfully with the current version of AzCopy. In future, we'll look at prefixing for manageability and performance tuning for speed.

JohnRusk on 1 Nov 2019

Might get a partial implementation in 10.4, maybe. Tagging this with 10.4 to remind us to post a further update here.

JohnRusk on 26 Mar 2020

include-pattern uses a service-side prefix when you're only doing non-recursive copies from blob storage as of 10.4. Thus, it should be nice and snappy for that.

However, include-path is a bit of a different story. @JohnRusk this would be a good opportunity for your prefix finding work to handle sub-virtual-directories.

As to _optimizing_ include-path and list-of-files to support partial directory names on other platforms, this is a metatraverser change that could be made in 10.5, I believe. It would likely involve passing an additional parameter to the traverser in some form. In the case of the files traverser, due to its per-directory nature, this would be much easier. In the case of the ADLSG2 traverser, this would be difficult-er, in that we'd need to make a non-recursive search before we went recursive if we found the folder (or started recursive with the folder if it existed).

adreed-msft on 10 Apr 2020

I will keep this issue open and tag it with 10.5, since it's an ongoing performance and usability concern.

adreed-msft on 10 Apr 2020

BTW, as of 10.4, include-pattern implements limited prefix-based searching. Specifically, patterns containing a single *, with that * at the end, are interpreted as prefixes... but only when recursive is false. (That's the limitation).
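
To check in a script whether a pattern has the shape described above (exactly one *, at the very end), something like this sketch could be used. is_trailing_star_pattern is a hypothetical helper, not part of AzCopy, and it only tests the pattern's shape; it knows nothing about whether recursive is set.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the pattern contains exactly one "*"
# and that "*" is the last character.
is_trailing_star_pattern() {
  case "$1" in
    *'*'*'*'*) return 1 ;;  # two or more stars
    *'*')      return 0 ;;  # exactly one star, at the end
    *)         return 1 ;;  # no star, or the star is not at the end
  esac
}

for pattern in 'foo*' 'fo*o' '*foo*' 'foo'; do
  if is_trailing_star_pattern "$pattern"; then
    echo "$pattern: single trailing star"
  else
    echo "$pattern: not a single trailing star"
  fi
done
```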

JohnRusk on 29 May 2020

Thanks for the updates

worldspawn on 29 May 2020
