aws-sdk-net: S3 PutObjectRequest CPU consumption issue

Created on 10 Jan 2018 · 11 Comments · Source: aws/aws-sdk-net

Under CPU load, the AWS S3 SDK (3.3.16.2) performs poorly and consumes CPU needlessly.

Suspected culprit: the S3 PutObjectRequest uses MD5Stream internally (see PutObjectRequestMarshaller.cs).
The retry mechanism implemented in the API also resets the stream and recalculates its MD5 hash on every retry (see RetryHandler.cs).

Expected Behavior

Performance (CPU and I/O) of the SDK's S3 API should be as close as possible to that of a direct S3 call.

Current Behavior

Currently, PutObjectRequest wraps the input stream in an MD5Stream:
https://github.com/aws/aws-sdk-net/blob/master/sdk/src/Services/S3/Custom/Model/Internal/MarshallTransformations/PutObjectRequestMarshaller.cs (line 104)

// Wrap input stream in MD5Stream
var hashStream = new MD5Stream(streamWithLength, null, length);
putObjectRequest.InputStream = hashStream;

The same issue occurs in RetryHandler, which resets the state of the MD5Stream and recalculates the MD5 hash of the whole stream on each retry.

Possible Solution

Remove MD5Stream.
Please explain the reasoning behind MD5Stream usage. Why is it required?

Steps to Reproduce (for bugs)

Context

The S3 SDK implementation consumes CPU needlessly.
Performance is low under CPU load.

Your Environment

AWSSDK.Core version used:
3.3.21.6
Operating System and version:
Ubuntu 16.04 LTS
Targeted .NET platform:
.NET Core 2.0

.NET Core Info

.NET Core version used for development:
2.0
.NET Core version installed in the environment where application runs:
2.0

guidance

Source

slavah

Most helpful comment

Hello,

@sstevenkang it would be good if we could disable this; I've noticed a huge spike in CPU usage on some of our servers because they are uploading multiple 50 MB+ files every few seconds.

Is it possible to let us flag a request so that the MD5 checksum is not calculated? It's too CPU-intensive for our purposes. I assume that, because the object length header is passed, we still get some level of consistency checking.

Thank you

Plasma on 9 Apr 2018

👍3

All 11 comments

Any news?

BogdanovKirill on 23 Jan 2018

From S3's documentation:

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

If close-to-the-metal performance is required, I would suggest rolling your own bare-bones wrapper around HttpRequest. We made this trade-off in the belief that data integrity is more important than performance.
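To make the integrity check in the quoted documentation concrete: the Content-MD5 header carries the Base64-encoded 128-bit MD5 digest of the request body, and for a simple (non-multipart) PUT without SSE-KMS encryption the returned ETag is the same digest in hexadecimal. The following is a minimal sketch that uses only the .NET base class library; the names `S3Checksums`, `ComputeMd5`, `ToContentMd5`, and `ToETagHex` are illustrative and are not part of the AWS SDK.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Illustrative helper, not SDK code.
public static class S3Checksums
{
    // Reads the stream once from its current position and returns the raw MD5 digest.
    public static byte[] ComputeMd5(Stream body)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(body);
        }
    }

    // Base64 form of the digest, which is what the Content-MD5 header carries.
    public static string ToContentMd5(byte[] digest)
    {
        return Convert.ToBase64String(digest);
    }

    // Lowercase hex form of the digest, comparable to the ETag of a simple PUT.
    public static string ToETagHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}
```

The ETag is typically returned wrapped in double quotes, so those would need to be trimmed before comparing it with the hex digest.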

sstevenkang on 2 Feb 2018

So can you please explain how MD5 hashing helps with consistency?
I don't see it being used anywhere in the code except to check the result of a response when reading from S3.

Why do you need MD5Stream anywhere except to check the S3 response?

Thank you.

slavah on 2 Feb 2018

Why do you need MD5 Stream anywhere except to check S3 response?

It's more efficient to calculate the MD5 hash while uploading an object than to calculate the hash after getting a response back. In the latter case we end up reading the object twice.
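The single-pass idea can be sketched as follows: each buffer is hashed as it is copied to the destination, so the source is read only once. This is an illustration built on `IncrementalHash` from the base class library, not the SDK's actual `MD5Stream`; `HashingCopy.CopyWithMd5` is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Illustrative helper, not SDK code.
public static class HashingCopy
{
    // Copies source to destination and returns the MD5 digest of the bytes copied.
    public static byte[] CopyWithMd5(Stream source, Stream destination, int bufferSize = 81920)
    {
        using (var hash = IncrementalHash.CreateHash(HashAlgorithmName.MD5))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                hash.AppendData(buffer, 0, read);    // hash the chunk just read
                destination.Write(buffer, 0, read);  // forward the same chunk
            }
            return hash.GetHashAndReset();
        }
    }
}
```

Note that the hashing work is still proportional to the payload size. Hashing during the copy only avoids the second read of the data; it does not remove the CPU cost discussed in this issue.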

sstevenkang on 2 Feb 2018

Thank you for the prompt response!

Please correct me if I'm wrong: MD5Stream is used in the SDK as the backing stream for checking the optional Content-MD5 hash in the response to a PUT request.

PutObjectRequest.cs#L357

As far as I can see, this is the only place MD5 is used for S3. Please correct me if I am wrong.

Also, RetryHandler always resets the MD5Stream and recalculates the hash.
It would be beneficial performance-wise to calculate the MD5 hash only once.
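The "calculate only once" idea could look roughly like the sketch below: the digest of a seekable payload is computed lazily on first use and then reused, while each retry only rewinds the stream. `CachedDigestPayload` is a hypothetical helper, not something the SDK provides, and it assumes the payload does not change between attempts.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of the AWS SDK.
public sealed class CachedDigestPayload
{
    private readonly Stream _payload;
    private readonly Lazy<byte[]> _md5;

    // Counts how many times the payload was actually hashed.
    public int HashComputations { get; private set; }

    public CachedDigestPayload(Stream seekablePayload)
    {
        if (seekablePayload == null) throw new ArgumentNullException(nameof(seekablePayload));
        if (!seekablePayload.CanSeek) throw new ArgumentException("Payload must be seekable.", nameof(seekablePayload));
        _payload = seekablePayload;
        _md5 = new Lazy<byte[]>(ComputeOnce);
    }

    // Computed on first access, then served from the cache.
    public byte[] Md5 { get { return _md5.Value; } }

    // Rewinds the payload for a new attempt without touching the cached digest.
    public Stream OpenForAttempt()
    {
        _payload.Position = 0;
        return _payload;
    }

    private byte[] ComputeOnce()
    {
        HashComputations++;
        _payload.Position = 0;
        using (var md5 = MD5.Create())
        {
            var digest = md5.ComputeHash(_payload);
            _payload.Position = 0;
            return digest;
        }
    }
}
```

The trade-off in this particular sketch is that hashing up front reads the payload once for the digest and once for the upload; a fuller design might hash during the first attempt and cache the result for later retries.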

slavah on 3 Feb 2018

There also appears to be MD5Stream usage when reading responses: https://github.com/aws/aws-sdk-net/blob/ae822fc19be5d95e38e777672f51c9358be99a6d/sdk/src/Services/S3/Custom/Internal/AmazonS3ResponseHandler.cs#L111

This is really unexpected from a network IO call.

Plasma on 9 Apr 2018

The Ruby SDK seems to have a compute_checksums flag at the client level; something similar would be good for the .NET SDK (either per client or per request).

Plasma on 9 Apr 2018

👍1

Also, the checksum calculations at https://github.com/aws/aws-sdk-net/blob/master/sdk/src/Core/Amazon.Runtime/Internal/Util/ChunkedUploadWrapperStream.cs#L181 are dominating CPU usage.

Plasma on 9 Apr 2018

@slavah as a workaround I've used the SDK's "get pre-signed URL" feature to generate an HTTPS URL and then upload the content with a plain WebClient PUT to that URL.
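A rough sketch of that workaround, assuming the AWSSDK.S3 package and an already configured `AmazonS3Client`. It uses `HttpClient` in place of the `WebClient` mentioned above; `PresignedUpload`, `BuildPutUrl`, and `UploadAsync` are illustrative names, and the bucket, key, and expiry are supplied by the caller.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

// Illustrative helper, not SDK code.
public static class PresignedUpload
{
    private static readonly HttpClient Http = new HttpClient();

    // Signs a time-limited PUT URL locally; no request is sent to S3 here.
    public static string BuildPutUrl(AmazonS3Client s3, string bucket, string key, TimeSpan validFor)
    {
        var request = new GetPreSignedUrlRequest
        {
            BucketName = bucket,
            Key = key,
            Verb = HttpVerb.PUT,
            Expires = DateTime.UtcNow.Add(validFor)
        };
        return s3.GetPreSignedURL(request);
    }

    // Uploads the content with a plain HTTP PUT, bypassing the SDK's upload pipeline.
    public static async Task UploadAsync(string presignedUrl, Stream content)
    {
        using (var body = new StreamContent(content))
        using (var response = await Http.PutAsync(presignedUrl, body))
        {
            response.EnsureSuccessStatusCode();
        }
    }
}
```

The content stream should be seekable so that a Content-Length can be determined for the request. Because this path skips the SDK's upload pipeline, it also skips the client-side MD5 verification discussed above, so any integrity check beyond what HTTPS provides would have to be added by the caller.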

Plasma on 24 Apr 2018

@Plasma Thank you, Andrew, for the suggestion. For now we have removed the underlying MD5Stream in our version of the SDK. CPU usage dropped drastically: we have observed CPU consumption falling by up to several times. We process tens of millions of S3 PUT requests per day.

slavah on 24 Apr 2018
