Aws-sdk-net: s3 putObjectRequest CPU consumption issue

Created on 10 Jan 2018 · 11 Comments · Source: aws/aws-sdk-net


Under CPU load, the AWS S3 SDK (3.3.16.2) performs poorly and consumes CPU needlessly.

Suspected culprit: the S3 PutObjectRequest uses MD5Stream internally (see PutObjectRequestMarshaller.cs).
The retry mechanism implemented in the API also resets the stream and recalculates its MD5 hash on every retry (see RetryHandler.cs).

Expected Behavior



Performance (CPU and IO) of the SDK's S3 API should be as close as possible to that of a direct S3 call.

Current Behavior





Currently, the SDK wraps the PutObjectRequest input stream in an MD5Stream:
https://github.com/aws/aws-sdk-net/blob/master/sdk/src/Services/S3/Custom/Model/Internal/MarshallTransformations/PutObjectRequestMarshaller.cs (line 104)

// Wrap input stream in MD5Stream
var hashStream = new MD5Stream(streamWithLength, null, length);
putObjectRequest.InputStream = hashStream;

The same issue occurs in RetryHandler, which resets the state of the MD5Stream and recalculates the MD5 hash of the whole stream.

Possible Solution


Remove MD5Stream.
Please explain the reasoning behind MD5Stream usage. Why is it required?

Steps to Reproduce (for bugs)




Context



S3 SDK implementation starving CPU needlessly.
Low performance under CPU load.

Your Environment

  • AWSSDK.Core version used:
    3.3.21.6
  • Operating System and version:
    Ubuntu 16.04 LTS
  • Targeted .NET platform:
    .NET Core 2.0

.NET Core Info

  • .NET Core version used for development:
    2.0
  • .NET Core version installed in the environment where application runs:
    2.0

All 11 comments

Any news?

From S3's documentation:

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

If close-to-the-metal performance is required, I would suggest rolling your own bare-bones wrapper around HttpRequest. We made this trade-off on the view that data integrity is more important than performance.
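
For readers unfamiliar with the two checks in the quoted documentation, here is a minimal sketch using only the .NET base class library. The class and method names (ContentMd5Sketch, ToContentMd5, ToHexMd5, MatchesETag) are hypothetical and are not part of the SDK. It assumes a plain single-part upload, where the ETag is the hex MD5 of the object; multipart uploads and some encryption modes return ETags that are not MD5 digests.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only; not part of the AWS SDK.
public static class ContentMd5Sketch
{
    // The Content-MD5 header carries the base64-encoded 128-bit MD5 digest.
    public static string ToContentMd5(byte[] payload)
    {
        using (var md5 = MD5.Create())
        {
            return Convert.ToBase64String(md5.ComputeHash(payload));
        }
    }

    // Lowercase hex form of the digest, for comparison with an ETag.
    public static string ToHexMd5(byte[] payload)
    {
        using (var md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(payload))
                .Replace("-", "")
                .ToLowerInvariant();
        }
    }

    // ETag header values arrive wrapped in double quotes, so strip them first.
    // Assumption: single-part upload whose ETag is an MD5 digest.
    public static bool MatchesETag(byte[] payload, string eTag)
    {
        return string.Equals(ToHexMd5(payload), eTag.Trim('"'),
            StringComparison.OrdinalIgnoreCase);
    }
}
```

Either check costs one MD5 pass over the payload on the client, which is the CPU cost this issue is about.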

So can you please explain how MD5 hashing helps with consistency?
I don't see it being used anywhere in the code except to check the response on a read from S3.

Why do you need MD5 Stream anywhere except to check S3 response?

Thank you.

Why do you need MD5 Stream anywhere except to check S3 response?

It's more efficient to calculate the MD5 hash while uploading an object than to calculate the hash after getting a response back. In the latter case we end up reading the object twice.
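
The single-pass idea described above can be sketched as follows. This is not the SDK's MD5Stream implementation; it is an illustrative stand-in built on IncrementalHash from the base class library, and the names SinglePassHashSketch and CopyAndHash are hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical illustration; the SDK's MD5Stream is a different implementation.
public static class SinglePassHashSketch
{
    // Copies source to destination and returns the MD5 digest of the bytes
    // copied. Each buffer is hashed as it passes through, so the source is
    // read only once.
    public static byte[] CopyAndHash(Stream source, Stream destination)
    {
        using (var hasher = IncrementalHash.CreateHash(HashAlgorithmName.MD5))
        {
            var buffer = new byte[81920];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                hasher.AppendData(buffer, 0, read);   // update the running hash
                destination.Write(buffer, 0, read);   // forward the same bytes
            }
            return hasher.GetHashAndReset();
        }
    }
}
```

Hashing during the copy saves a second read of the data, but the CPU cost of the MD5 computation itself is still paid once per pass, and again on any retry that restarts the stream.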

Thank you for the prompt response!

Please correct me if I'm wrong: the SDK uses MD5Stream as the backing mechanism for checking the optional Content-MD5 hash in the response to a PUT request.

PutObjectRequest.cs#L357

This is the only place MD5 is used for S3 as far as I can see. Please correct me if I am wrong.

Also, RetryHandler always resets the MD5Stream and recalculates the hash.
It would be beneficial performance-wise to calculate the MD5 hash only once.

Hello,

@sstevenkang it would be good if we could disable this; I've noticed a huge spike in CPU usage on some of our servers because it's uploading multiple 50MB+ files every few seconds.

Is it possible to let us flag a request so that the MD5 checksum is not calculated? It's too CPU-intensive for our purposes. I assume that, because the object length header is passed, we still get some level of consistency checking.

Thank you

There also appears to be MD5Stream usage in read responses: https://github.com/aws/aws-sdk-net/blob/ae822fc19be5d95e38e777672f51c9358be99a6d/sdk/src/Services/S3/Custom/Internal/AmazonS3ResponseHandler.cs#L111

This is really unexpected from a network IO call.

The Ruby SDK seems to have a compute_checksums flag at the client level; something similar would be good for the .NET SDK (either per client or per request).

@slavah as a workaround I've used the SDK's "get pre-signed URL" feature to generate an HTTPS URL, and then used a WebClient PUT to upload the content to that URL.
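
A rough sketch of this workaround, under stated assumptions: it uses HttpClient rather than the WebClient mentioned in the comment, the names PreSignedPutSketch and UploadAsync are hypothetical, the 15-minute expiry is an arbitrary choice, and the content stream is assumed to be seekable so that a Content-Length header can be sent. The upload bypasses the SDK's request pipeline, so the SDK does not hash the body, and no client-side integrity check takes place.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

// Hypothetical helper for illustration; requires the AWSSDK.S3 package.
public static class PreSignedPutSketch
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task UploadAsync(
        AmazonS3Client s3, string bucketName, string key, Stream content)
    {
        // Ask the SDK only for a signed URL; no object data passes through it.
        var request = new GetPreSignedUrlRequest
        {
            BucketName = bucketName,
            Key = key,
            Verb = HttpVerb.PUT,
            Expires = DateTime.UtcNow.AddMinutes(15) // arbitrary expiry (assumption)
        };
        string url = s3.GetPreSignedURL(request);

        // Plain HTTP PUT of the body. Disposing StreamContent also disposes
        // the underlying stream.
        using (var body = new StreamContent(content))
        using (var response = await Http.PutAsync(url, body).ConfigureAwait(false))
        {
            response.EnsureSuccessStatusCode();
        }
    }
}
```

The trade-off is the one discussed earlier in the thread: the MD5 cost disappears, and so does the corruption check it paid for.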

@Plasma Thank you, Andrew, for the suggestion. For now we have removed the underlying MD5Stream in our build of the SDK. CPU usage dropped drastically: we have observed CPU consumption fall by up to several times. We process tens of millions of S3 PUT requests per day.
