Under CPU load, the AWS S3 SDK (3.3.16.2) performs poorly and consumes CPU needlessly.
Suspected culprit: the S3 PutObjectRequest uses MD5Stream internally (see PutObjectRequestMarshaller.cs).
The retry mechanism implemented in the API also always resets the stream and recalculates its MD5 hash.
RetryHandler.cs
Performance (CPU and IO) when using the SDK's S3 API should be as close as possible to a direct S3 call.
Currently, the SDK's PutObjectRequest path wraps the input stream in an MD5Stream:
https://github.com/aws/aws-sdk-net/blob/master/sdk/src/Services/S3/Custom/Model/Internal/MarshallTransformations/PutObjectRequestMarshaller.cs (line 104)
// Wrap input stream in MD5Stream
var hashStream = new MD5Stream(streamWithLength, null, length);
putObjectRequest.InputStream = hashStream;
The same issue occurs in RetryHandler, which resets the state of the MD5Stream and recalculates the MD5 hash of the whole stream again.
Remove MD5Stream.
Please explain the reasoning behind MD5Stream usage. Why is it required?
The S3 SDK implementation starves the CPU needlessly.
Low performance under CPU load.
Any news?
From S3's documentation:
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
If close-to-the-metal performance is required, I would suggest rolling a bare-bones wrapper around HttpRequest. We made this trade-off with the opinion that data integrity is more important than performance.
So can you please explain how MD5 hashing helps with consistency?
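As an illustration of the two checks the quoted S3 documentation describes, here is a minimal sketch of how one MD5 digest maps onto them. The class and method names (`Md5Headers`, `ContentMd5`, `ETagHex`) are made up for this example and are not part of the SDK. It also assumes a simple, non-multipart PUT, where the returned ETag is the hex MD5 of the body; that does not hold for multipart uploads.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of the AWS SDK.
public static class Md5Headers
{
    // Content-MD5 carries the raw 16-byte digest encoded as base64.
    public static string ContentMd5(byte[] digest) => Convert.ToBase64String(digest);

    // Assumption: for a simple (non-multipart) PUT the ETag is the digest
    // as lowercase hex. S3 returns it wrapped in double quotes.
    public static string ETagHex(byte[] digest) =>
        BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
}

public static class Md5HeadersDemo
{
    public static void Main()
    {
        byte[] body = Encoding.UTF8.GetBytes("hello");
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(body);
            Console.WriteLine(Md5Headers.ContentMd5(digest)); // XUFAKrxLKna5cZ2REBfFkg==
            Console.WriteLine(Md5Headers.ETagHex(digest));    // 5d41402abc4b2a76b9719d911017c592
        }
    }
}
```

Note that sending Content-MD5 requires knowing the digest before the body is sent, whereas comparing against the ETag can be done with a digest computed during the upload.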
I don't see it being used anywhere in code except to check result of response on Read from S3.
Why do you need MD5 Stream anywhere except to check S3 response?
Thank you.
Why do you need MD5 Stream anywhere except to check S3 response?
It's more efficient to calculate the MD5 hash while uploading an object than to calculate the hash after getting a response back. In the latter case we end up reading the object twice.
Thank you for the prompt response!
Please correct me if I'm wrong: MD5Stream is being used in the SDK as a backing collection to check the optional Content-MD5 hash in the response to a Put request.
This is the only time MD5 is used for S3 as far as I can see. Please correct me if I am wrong.
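The single-pass idea can be sketched roughly as follows. This is not the SDK's MD5Stream. `HashingReadStream` is a hypothetical wrapper written for this example, and `Stream.Null` stands in for the network write.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical wrapper (not the SDK's MD5Stream): hashes bytes as they are read,
// so the payload is traversed once for both the upload and the checksum.
public sealed class HashingReadStream : Stream
{
    private readonly Stream _inner;
    private readonly IncrementalHash _hash = IncrementalHash.CreateHash(HashAlgorithmName.MD5);
    private byte[] _digest;

    public HashingReadStream(Stream inner) { _inner = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        if (n > 0) _hash.AppendData(buffer, offset, n); // hash exactly the bytes handed out
        return n;
    }

    // Finalizes the hash on the first call and caches the result.
    public byte[] GetDigest() => _digest ?? (_digest = _hash.GetHashAndReset());

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) { _hash.Dispose(); _inner.Dispose(); }
        base.Dispose(disposing);
    }
}

public static class HashingReadStreamDemo
{
    public static void Main()
    {
        byte[] payload = Encoding.UTF8.GetBytes("hello");
        using (var body = new HashingReadStream(new MemoryStream(payload)))
        {
            body.CopyTo(Stream.Null); // stands in for writing the body to the network
            string hex = BitConverter.ToString(body.GetDigest()).Replace("-", "").ToLowerInvariant();
            Console.WriteLine(hex); // compare against the ETag of the response
        }
    }
}
```

The hashing still costs CPU for every byte. What the single pass saves is the second read of the payload, not the digest computation itself.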
Also, RetryHandler always resets the MD5Stream and recalculates the hash.
It would be beneficial performance-wise to calculate the MD5 hash only once.
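One way the "hash only once" suggestion could look is sketched below. This is an illustration, not how the SDK's RetryHandler or MD5Stream work. `HashOnceReader` and its `Rewind` method are hypothetical names, and the sketch assumes the underlying stream is seekable and that the bytes do not change between attempts.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical reader: hashes during the first complete pass only.
// After a Rewind() for a retry, a finished digest is reused instead of recomputed.
public sealed class HashOnceReader : IDisposable
{
    private readonly Stream _inner; // not owned; the caller disposes it
    private readonly IncrementalHash _hash = IncrementalHash.CreateHash(HashAlgorithmName.MD5);

    public byte[] Digest { get; private set; } // null until one full pass has completed

    public HashOnceReader(Stream seekable) { _inner = seekable; }

    public int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        if (Digest == null)
        {
            if (n > 0) _hash.AppendData(buffer, offset, n);
            else if (count > 0) Digest = _hash.GetHashAndReset(); // end of stream: finalize once
        }
        return n;
    }

    // Call before a retry attempt.
    public void Rewind()
    {
        _inner.Position = 0;
        if (Digest == null) _hash.GetHashAndReset(); // drop state from an incomplete pass
    }

    public void Dispose() { _hash.Dispose(); }
}

public static class HashOnceReaderDemo
{
    private static void Drain(HashOnceReader reader)
    {
        var buffer = new byte[4096];
        while (reader.Read(buffer, 0, buffer.Length) > 0) { }
    }

    public static void Main()
    {
        using (var source = new MemoryStream(Encoding.UTF8.GetBytes("hello")))
        using (var reader = new HashOnceReader(source))
        {
            Drain(reader);           // first attempt: bytes are hashed
            byte[] first = reader.Digest;
            reader.Rewind();
            Drain(reader);           // simulated retry: no hashing work is done
            Console.WriteLine(ReferenceEquals(first, reader.Digest)); // True
        }
    }
}
```

If the first attempt fails partway through, the partial hash state is discarded and the next pass hashes from the start, so the saving applies only to retries that follow a fully read body.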
Hello,
@sstevenkang it would be good if we could disable this; I've noticed a huge spike in CPU usage on some of our servers because they are uploading multiple 50MB+ files every few seconds.
Is it possible to let us flag the request so that the MD5 sum is not calculated? It's too CPU-intensive for our purposes. I assume that, because the object length header is passed, we still get some level of consistency checking.
Thank you
There also appears to be MD5Stream usage in read responses: https://github.com/aws/aws-sdk-net/blob/ae822fc19be5d95e38e777672f51c9358be99a6d/sdk/src/Services/S3/Custom/Internal/AmazonS3ResponseHandler.cs#L111
This is really unexpected from a network IO call.
The Ruby SDK seems to have a compute_checksums flag at the client level; something similar would be good for the .NET SDK (either per client or per request).
Also, at https://github.com/aws/aws-sdk-net/blob/master/sdk/src/Core/Amazon.Runtime/Internal/Util/ChunkedUploadWrapperStream.cs#L181 the checksum calculations are dominating CPU.
@slavah as a workaround I've used the SDK's "get pre-signed URL" feature to generate an HTTPS URL, and then used a WebClient PUT against that URL to upload the content.
@Plasma Thank you, Andrew, for the suggestion. At this moment we have removed the underlying MD5Stream in our version of the SDK. CPU usage dropped drastically; we have observed a drop of up to several times in CPU consumption. We process tens of millions of S3 PUT requests per day.