Team...on a current project we are seeing some drastic performance degradations when attempting to use the *.aio libs from v12.0 of azure-storage-blob. It's quite possible we may be using the library incorrectly, but have not been able to find any documentation stating otherwise. The issue outlined below may in fact be a bug.
ab) utilityThere are 2 endpoints that we're testing to ensure we don't have any unexpected regressions. The first is a simple JSON serialization using coroutines/await+async, and does NOT use the Azure storage SDK in any code path.
Running ab against this endpoint with n=1000, c=50, this is handling ~450 - 550 req/sec (sample output below), showing that even on limited dev/test hardware this is performing reasonably well:
Concurrency Level: 50
Time taken for tests: 1.842 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 330000 bytes
HTML transferred: 38000 bytes
Requests per second: 542.83 [#/sec] (mean)
Time per request: 92.110 [ms] (mean)
Time per request: 1.842 [ms] (mean, across all concurrent requests)
Transfer rate: 174.94 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 2 0.5 2 8
Processing: 4 89 57.0 80 318
Waiting: 4 89 57.0 80 318
Total: 5 91 57.1 82 321
Percentage of the requests served within a certain time (ms)
50% 82
66% 103
75% 120
80% 130
90% 159
95% 186
98% 245
99% 319
100% 321 (longest request)
The second endpoint is simply downloading a VERY small JSON blob (~76 bytes), using the following code:
...
blob_client = BlobClient.from_connection_string(self.connection_string, "jobs", operation_id)
data_stream = await blob_client.download_blob()
raw_value = await data_stream.readall()
...
Using the same ab parameters (n=1000, c=50), we see a DRASTIC bottleneck showing, as the req/sec are reduced to around 20 - 25 rps:
Concurrency Level: 50
Time taken for tests: 43.861 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 431000 bytes
HTML transferred: 126000 bytes
Requests per second: 22.80 [#/sec] (mean)
Time per request: 2193.062 [ms] (mean)
Time per request: 43.861 [ms] (mean, across all concurrent requests)
Transfer rate: 9.60 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 2 0.4 2 8
Processing: 193 2171 1116.8 1969 5131
Waiting: 193 2171 1116.8 1969 5130
Total: 194 2173 1116.8 1970 5133
Percentage of the requests served within a certain time (ms)
50% 1970
66% 2414
75% 2964
80% 3193
90% 3767
95% 4524
98% 4788
99% 5130
100% 5133 (longest request)
This does not seem to match at all what the underlying storage account is reporting via metrics (max around 20ms, allowing for 50ms just to be safe). Additionally, total transactions is reporting nearly exactly the number of tests (~1000), indicated that general purpose storage account throttling is not occuring.
Given the very small size of these blobs, and the fact the underlying pipeline transport should be using the aiohttp provider, something feels like it is either overwhelming the event loop, or still using a blocking call unintentionally.
Would appreciate any insights that can be offered - while there will of course be overhead to calling storage (don't expect the same perf as the first test), there is seemingly quite a bit of un-accounted for time being taken somewhere. We definitely do not want to resort to implementing these calls via Tornado / etc. by hand if we don't have to.
Thanks!
Based on this repo, re-ran tests that re-used a BlobClient. While this isn't applicable to our use case (each request will likely request a different blob), I wanted to share the results.
There was a positive change - up to around 50-60 rps - but nothing significant. Results from ab run are below:
Concurrency Level: 50
Time taken for tests: 17.459 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 365000 bytes
HTML transferred: 73000 bytes
Requests per second: 57.28 [#/sec] (mean)
Time per request: 872.964 [ms] (mean)
Time per request: 17.459 [ms] (mean, across all concurrent requests)
Transfer rate: 20.42 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 2 1.1 2 22
Processing: 44 858 482.3 771 2691
Waiting: 44 857 482.3 771 2690
Total: 45 860 482.4 772 2693
Percentage of the requests served within a certain time (ms)
50% 772
66% 992
75% 1114
80% 1217
90% 1447
95% 1715
98% 2335
99% 2489
100% 2693 (longest request)
@mjyeaney thank you so much for reporting this. Our team will take a look at this //cc: @mayurid, @rakshith91
@mjyeaney Thanks for reporting the issue.
Could you let me know what version of python are you using, so that i can reproduce this locally using the same version.
Local dev environment: Python 3.6.8, pip 9.0.1
Dockerfile: FROM python:3.6
@mjyeaney : I have a few follow up questions
Also, by creating a new blob client every time from a connection string, you are not using the same "connection pool". So the ideal usage for optimal performance would be something like this:
container_client = ContainerClient.from_connection_string(connection_string, "jobs")
async with container_client:
for operation_id in list_of_blobs:
blob_client = container_client.get_blob_client(operation_id)
data_stream = await blob_client.download_blob()
raw_value = await data_stream.readall()
Also note the "async with" block. This is required to close the client session properly, or it would throw a warning something like:
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x0447DCF0>
Finally, if you want to use more than one thread, you must pass max_concurrency param to the download_blob.
For example, await blob_client.download_blob(max_concurrency=50)
Here's the scenario:
With regards to connection pooling per-client, didn't realize BlobClient worked that way (docs don't cover this behavior - this is very important indeed!
Finally - we won't have a loop as shown above. Each incoming request will act independently, but let me re-work this into leveraging a container client first..again, wasn't sure what the guidance was as the docs don't dive into any specifics :-(
Will post updates here shortly.
Just changed our job store to the following (simplified):
class JobStore:
def __init__(self, connection_string):
self.connection_string = connection_string
self.container_client = ContainerClient.from_connection_string(connection_string, "jobs")
async def read_job(self, operation_id):
blob_client = self.container_client.get_blob_client(operation_id)
data_stream = await blob_client.download_blob()
raw_value = await data_stream.readall()
return json.loads(raw_value)
Results are below...no notable change even when re-using the container client to get a blob client.
Concurrency Level: 50
Time taken for tests: 39.199 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 365000 bytes
HTML transferred: 73000 bytes
Requests per second: 25.51 [#/sec] (mean)
Time per request: 1959.974 [ms] (mean)
Time per request: 39.199 [ms] (mean, across all concurrent requests)
Transfer rate: 9.09 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 2 0.8 2 10
Processing: 73 1929 1152.1 1755 5684
Waiting: 73 1929 1152.1 1755 5684
Total: 75 1931 1152.1 1757 5685
Percentage of the requests served within a certain time (ms)
50% 1757
66% 2078
75% 2795
80% 2913
90% 3667
95% 4265
98% 4639
99% 4711
100% 5685 (longest request)
Also tried specifying max_threads=50 during the download_blob call, _slightly_ increased calls up to 26.86, but variance in tests was making it hard to see if it was anything but noise.
Just for comparison, I coded up a manual storage REST API call (implemented in the repo linked above, in src/storage_rest_api.py. Results are below...leveraging aiohttp, consistently hitting ~117 rps from within the datacenter running ab with the same configuration of n=1000, c=50:
Concurrency Level: 50
Time taken for tests: 8.529 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 365000 bytes
HTML transferred: 73000 bytes
Requests per second: 117.25 [#/sec] (mean)
Time per request: 426.451 [ms] (mean)
Time per request: 8.529 [ms] (mean, across all concurrent requests)
Transfer rate: 41.79 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 2 0.6 2 10
Processing: 42 419 147.5 413 828
Waiting: 42 419 147.5 413 828
Total: 45 421 147.5 415 830
Percentage of the requests served within a certain time (ms)
50% 415
66% 481
75% 527
80% 556
90% 622
95% 666
98% 724
99% 775
100% 830 (longest request)
This manual code is hardly optimized and hitting a little over 5x perf gains with far better behaved latency distribution. Seems to reinforce the theory of queueing/contention in the SDK use case...hopefully we can isolate (I really don't want to manage hand-writing calls like this LOL).
QUICK UPDATE
Made some optimizations here
https://github.com/Azure/azure-sdk-for-python/pull/8698
Will create one more PR for updating documentation and will continue inspecting performance with downloading same blob and play with other scenarios
@mjyeaney I have tried downloading the same blob too and surprisingly the performance has improved. I'll try to run the same script you are using instead of my own.
Can you point out how you are running the script from your repo
EDIT:
Also, would you mind installing the azure-core dependency from the repo and see if there are any improvements in your results? Thanks.
Sure thing - the way I was running the load test was the following:
ab -n 1000 -c 50 http://{app_svc_url}/path/to/simple/endpointThat's it..not super scientific, but since these were all HTTP GET requests, it is super-easy to run over and over.
That said, I've been moving forward with using the SDK and am continuing to build out / deploy code, so simply cloning that repo likely won't give you the same endpoint. As an easier test - if you have a beta drop of the SDK I can try that to see how it's working.
As for azure-core, here's what I have currently installed in my virtualenv:
$> pip list | grep azure-core
azure-core (1.0.0)
Is that the dependency you were talking about?
Is that the dependency you were talking about?
Right! If you want to check with the editable version of azure-core temporarily before we release the next version, that'd be great. If not, we can let you know as soon as we release the package. Also, I'll continue inspecting any additional overheads meanwhile.
git clone [email protected]:Azure/azure-sdk-for-python.git
cd sdk/core/azure-core
pip install -e .
@mjyeaney We have just released the latest version of azure-core. Can you upgrade it using pip install azure-core==1.1.0 and see if you see improvements with results? Thanks.
Perfect! I'll test it tomorrow and post results here.
Thanks!
Hi @mjyeaney
Did you get a chance to try this?
Yes @rakshith91 (and apologies for the delay!)
The results with the new azure-core 1.1.0 lib are quite positive - I'm seeing around a ~5x improvement in the initial use case of requesting a blob. Results from ab are below - note this is now yielding a repeatable ~95 - 100 req's/sec (where previously we were only getting ~22):
Concurrency Level: 30
Time taken for tests: 10.623 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 397000 bytes
HTML transferred: 99000 bytes
Requests per second: 94.13 [#/sec] (mean)
Time per request: 318.704 [ms] (mean)
Time per request: 10.623 [ms] (mean, across all concurrent requests)
Transfer rate: 36.49 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 2 0.6 1 8
Processing: 53 314 205.0 269 1116
Waiting: 53 314 205.0 269 1116
Total: 54 316 205.0 271 1118
WARNING: The median and mean for the initial connection time are not within a normal deviation
These results are probably not that reliable.
Percentage of the requests served within a certain time (ms)
50% 271
66% 358
75% 419
80% 461
90% 593
95% 714
98% 923
99% 1020
100% 1118 (longest request)
Also - note this is still on the same low-end dev/test hardware SKU (B2), so moving to production will only improve this.
In all, definitely an improvement - easily approaching the results from a raw hand-coding against the REST API (as reported above). Definitely appreciate the improvements!
I'm hoping this can help guide further improvements in the SDK. Is there any additional information you need from these tests?
@mjyeaney No need to apologize :)
Thanks a lot for updating the results and glad to see the improvement. Will be closing the issue for now, but feel free to keep in touch if you think there may be more overheads.
Thanks for working with Microsoft on GitHub! Tell us how you feel about your experience using the reactions on this comment.