Azure-sdk-for-python: [Storage]: Severe performance degradation observed using v12.0 *.aio BlobClient

Created on 18 Nov 2019 · 18Comments · Source: Azure/azure-sdk-for-python

Severe performance degradation observed using v12.0 *.aio BlobClient

Team...on a current project we are seeing some drastic performance degradations when attempting to use the *.aio libs from v12.0 of azure-storage-blob. It's quite possible we may be using the library incorrectly, but have not been able to find any documentation stating otherwise. The issue outlined below may in fact be a bug.

System Under Test

Repo: https://github.com/mjyeaney/CustomVisionSlidingWindowAPI
Linux App Service Plan
B2 App Service Plan (200 ACU, 3.5 GB RAM, A-series equiv), instance count = 1
Running Gunicorn -> Uvicorn (ASGI server) -> Quart (async REST fx)
Gunicorn worker count: CPU core count + 1
Using azure-storage-blob v.12.0, *.aio methods

Load Testing Environment

Located in same datacenter
Ds2v3 (2 CPU, 8GB RAM)
Tests run using Apache Bench (ab) utility

Tests

There are 2 endpoints that we're testing to ensure we don't have any unexpected regressions. The first is a simple JSON serialization using coroutines/await+async, and does NOT use the Azure storage SDK in any code path.

Running ab against this endpoint with n=1000, c=50, this is handling ~450 - 550 req/sec (sample output below), showing that even on limited dev/test hardware this is performing reasonably well:

Concurrency Level:      50
Time taken for tests:   1.842 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      330000 bytes
HTML transferred:       38000 bytes
Requests per second:    542.83 [#/sec] (mean)
Time per request:       92.110 [ms] (mean)
Time per request:       1.842 [ms] (mean, across all concurrent requests)
Transfer rate:          174.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   0.5      2       8
Processing:     4   89  57.0     80     318
Waiting:        4   89  57.0     80     318
Total:          5   91  57.1     82     321

Percentage of the requests served within a certain time (ms)
  50%     82
  66%    103
  75%    120
  80%    130
  90%    159
  95%    186
  98%    245
  99%    319
 100%    321 (longest request)

The second endpoint is simply downloading a VERY small JSON blob (~76 bytes), using the following code:

...
blob_client = BlobClient.from_connection_string(self.connection_string, "jobs", operation_id)
data_stream = await blob_client.download_blob()
raw_value = await data_stream.readall()
...

Using the same ab parameters (n=1000, c=50), we see a DRASTIC bottleneck showing, as the req/sec are reduced to around 20 - 25 rps:

Concurrency Level:      50
Time taken for tests:   43.861 seconds
Complete requests:      1000
Failed requests:        0
Non-2xx responses:      1000
Total transferred:      431000 bytes
HTML transferred:       126000 bytes
Requests per second:    22.80 [#/sec] (mean)
Time per request:       2193.062 [ms] (mean)
Time per request:       43.861 [ms] (mean, across all concurrent requests)
Transfer rate:          9.60 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   0.4      2       8
Processing:   193 2171 1116.8   1969    5131
Waiting:      193 2171 1116.8   1969    5130
Total:        194 2173 1116.8   1970    5133

Percentage of the requests served within a certain time (ms)
  50%   1970
  66%   2414
  75%   2964
  80%   3193
  90%   3767
  95%   4524
  98%   4788
  99%   5130
 100%   5133 (longest request)

This does not seem to match at all what the underlying storage account is reporting via metrics (max around 20ms, allowing for 50ms just to be safe). Additionally, total transactions is reporting nearly exactly the number of tests (~1000), indicated that general purpose storage account throttling is not occuring.

Given the very small size of these blobs, and the fact the underlying pipeline transport should be using the aiohttp provider, something feels like it is either overwhelming the event loop, or still using a blocking call unintentionally.

Would appreciate any insights that can be offered - while there will of course be overhead to calling storage (don't expect the same perf as the first test), there is seemingly quite a bit of un-accounted for time being taken somewhere. We definitely do not want to resort to implementing these calls via Tornado / etc. by hand if we don't have to.

Thanks!

Client Storage customer-reported

Source

mjyeaney

All 18 comments

UPDATE

Based on this repo, re-ran tests that re-used a BlobClient. While this isn't applicable to our use case (each request will likely request a different blob), I wanted to share the results.

There was a positive change - up to around 50-60 rps - but nothing significant. Results from ab run are below:

Concurrency Level:      50
Time taken for tests:   17.459 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      365000 bytes
HTML transferred:       73000 bytes
Requests per second:    57.28 [#/sec] (mean)
Time per request:       872.964 [ms] (mean)
Time per request:       17.459 [ms] (mean, across all concurrent requests)
Transfer rate:          20.42 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   1.1      2      22
Processing:    44  858 482.3    771    2691
Waiting:       44  857 482.3    771    2690
Total:         45  860 482.4    772    2693

Percentage of the requests served within a certain time (ms)
  50%    772
  66%    992
  75%   1114
  80%   1217
  90%   1447
  95%   1715
  98%   2335
  99%   2489
 100%   2693 (longest request)

mjyeaney on 18 Nov 2019

@mjyeaney thank you so much for reporting this. Our team will take a look at this //cc: @mayurid, @rakshith91

kaerm on 18 Nov 2019

@mjyeaney Thanks for reporting the issue.

Could you let me know what version of python are you using, so that i can reproduce this locally using the same version.

rakshith91 on 18 Nov 2019

Local dev environment: Python 3.6.8, pip 9.0.1

Dockerfile: FROM python:3.6

mjyeaney on 18 Nov 2019

👍1

@mjyeaney : I have a few follow up questions

Is your scenario is to download 1000 blobs from the same container or is it a different container everytime?
I would like to know if you are comparing the results to 2.x version of the storage-blobs?

Also, by creating a new blob client every time from a connection string, you are not using the same "connection pool". So the ideal usage for optimal performance would be something like this:

container_client = ContainerClient.from_connection_string(connection_string, "jobs")
async with container_client:
    for operation_id in list_of_blobs:
        blob_client = container_client.get_blob_client(operation_id)
        data_stream = await blob_client.download_blob()
        raw_value = await data_stream.readall()

Also note the "async with" block. This is required to close the client session properly, or it would throw a warning something like:

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x0447DCF0>

Finally, if you want to use more than one thread, you must pass max_concurrency param to the download_blob.
For example, await blob_client.download_blob(max_concurrency=50)

rakshith91 on 19 Nov 2019

Here's the scenario:

High throughput job status retrieval (many concurrent requests, very high likelyhood that each is after a different blob).
1000 was picked just for a nice averaging load...however, in this very simple test everybody is asking status on the same blob
- TODO: If this is a possible issue, I can modify the tests to all be different

With regards to connection pooling per-client, didn't realize BlobClient worked that way (docs don't cover this behavior - this is very important indeed!

TODO: Should this be a docs PR?

Finally - we won't have a loop as shown above. Each incoming request will act independently, but let me re-work this into leveraging a container client first..again, wasn't sure what the guidance was as the docs don't dive into any specifics :-(

Will post updates here shortly.

mjyeaney on 19 Nov 2019

Just changed our job store to the following (simplified):

class JobStore:
    def __init__(self, connection_string):
        self.connection_string = connection_string
        self.container_client = ContainerClient.from_connection_string(connection_string, "jobs")

    async def read_job(self, operation_id):
        blob_client = self.container_client.get_blob_client(operation_id)
        data_stream = await blob_client.download_blob()
        raw_value = await data_stream.readall()
        return json.loads(raw_value)

Results are below...no notable change even when re-using the container client to get a blob client.

Concurrency Level:      50
Time taken for tests:   39.199 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      365000 bytes
HTML transferred:       73000 bytes
Requests per second:    25.51 [#/sec] (mean)
Time per request:       1959.974 [ms] (mean)
Time per request:       39.199 [ms] (mean, across all concurrent requests)
Transfer rate:          9.09 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   0.8      2      10
Processing:    73 1929 1152.1   1755    5684
Waiting:       73 1929 1152.1   1755    5684
Total:         75 1931 1152.1   1757    5685

Percentage of the requests served within a certain time (ms)
  50%   1757
  66%   2078
  75%   2795
  80%   2913
  90%   3667
  95%   4265
  98%   4639
  99%   4711
 100%   5685 (longest request)

Also tried specifying max_threads=50 during the download_blob call, _slightly_ increased calls up to 26.86, but variance in tests was making it hard to see if it was anything but noise.

mjyeaney on 19 Nov 2019

Just for comparison, I coded up a manual storage REST API call (implemented in the repo linked above, in src/storage_rest_api.py. Results are below...leveraging aiohttp, consistently hitting ~117 rps from within the datacenter running ab with the same configuration of n=1000, c=50:

Concurrency Level:      50
Time taken for tests:   8.529 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      365000 bytes
HTML transferred:       73000 bytes
Requests per second:    117.25 [#/sec] (mean)
Time per request:       426.451 [ms] (mean)
Time per request:       8.529 [ms] (mean, across all concurrent requests)
Transfer rate:          41.79 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   0.6      2      10
Processing:    42  419 147.5    413     828
Waiting:       42  419 147.5    413     828
Total:         45  421 147.5    415     830

Percentage of the requests served within a certain time (ms)
  50%    415
  66%    481
  75%    527
  80%    556
  90%    622
  95%    666
  98%    724
  99%    775
 100%    830 (longest request)

This manual code is hardly optimized and hitting a little over 5x perf gains with far better behaved latency distribution. Seems to reinforce the theory of queueing/contention in the SDK use case...hopefully we can isolate (I really don't want to manage hand-writing calls like this LOL).

mjyeaney on 19 Nov 2019

QUICK UPDATE

Made some optimizations here
https://github.com/Azure/azure-sdk-for-python/pull/8698

Will create one more PR for updating documentation and will continue inspecting performance with downloading same blob and play with other scenarios

rakshith91 on 20 Nov 2019

@mjyeaney I have tried downloading the same blob too and surprisingly the performance has improved. I'll try to run the same script you are using instead of my own.
Can you point out how you are running the script from your repo

EDIT:
Also, would you mind installing the azure-core dependency from the repo and see if there are any improvements in your results? Thanks.

rakshith91 on 20 Nov 2019

Sure thing - the way I was running the load test was the following:

Deployed the test app Docker image to the App Service
SSH to the load test machine (Ds2v3 box) and simply run:
ab -n 1000 -c 50 http://{app_svc_url}/path/to/simple/endpoint

That's it..not super scientific, but since these were all HTTP GET requests, it is super-easy to run over and over.

That said, I've been moving forward with using the SDK and am continuing to build out / deploy code, so simply cloning that repo likely won't give you the same endpoint. As an easier test - if you have a beta drop of the SDK I can try that to see how it's working.

As for azure-core, here's what I have currently installed in my virtualenv:

$> pip list | grep azure-core
azure-core (1.0.0)

Is that the dependency you were talking about?

mjyeaney on 21 Nov 2019

Is that the dependency you were talking about?

Right! If you want to check with the editable version of azure-core temporarily before we release the next version, that'd be great. If not, we can let you know as soon as we release the package. Also, I'll continue inspecting any additional overheads meanwhile.

git clone [email protected]:Azure/azure-sdk-for-python.git
cd sdk/core/azure-core
pip install -e .

rakshith91 on 22 Nov 2019

@mjyeaney We have just released the latest version of azure-core. Can you upgrade it using pip install azure-core==1.1.0 and see if you see improvements with results? Thanks.

rakshith91 on 27 Nov 2019

👍1

Perfect! I'll test it tomorrow and post results here.

Thanks!

mjyeaney on 27 Nov 2019

👍1

Hi @mjyeaney
Did you get a chance to try this?

rakshith91 on 5 Dec 2019

Yes @rakshith91 (and apologies for the delay!)

The results with the new azure-core 1.1.0 lib are quite positive - I'm seeing around a ~5x improvement in the initial use case of requesting a blob. Results from ab are below - note this is now yielding a repeatable ~95 - 100 req's/sec (where previously we were only getting ~22):

Concurrency Level:      30
Time taken for tests:   10.623 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      397000 bytes
HTML transferred:       99000 bytes
Requests per second:    94.13 [#/sec] (mean)
Time per request:       318.704 [ms] (mean)
Time per request:       10.623 [ms] (mean, across all concurrent requests)
Transfer rate:          36.49 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   0.6      1       8
Processing:    53  314 205.0    269    1116
Waiting:       53  314 205.0    269    1116
Total:         54  316 205.0    271    1118
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%    271
  66%    358
  75%    419
  80%    461
  90%    593
  95%    714
  98%    923
  99%   1020
 100%   1118 (longest request)

Also - note this is still on the same low-end dev/test hardware SKU (B2), so moving to production will only improve this.

In all, definitely an improvement - easily approaching the results from a raw hand-coding against the REST API (as reported above). Definitely appreciate the improvements!

I'm hoping this can help guide further improvements in the SDK. Is there any additional information you need from these tests?

mjyeaney on 6 Dec 2019

@mjyeaney No need to apologize :)
Thanks a lot for updating the results and glad to see the improvement. Will be closing the issue for now, but feel free to keep in touch if you think there may be more overheads.

rakshith91 on 6 Dec 2019

Thanks for working with Microsoft on GitHub! Tell us how you feel about your experience using the reactions on this comment.