This issue arises primarily when copying a list of files from S3. My understanding is that the suggested approach is to copy the files individually by invoking `aws s3 cp`, for example:

```bash
cat files.txt | parallel aws s3 cp s3://remote/path/{/} /local/path/{/}
```
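Here `files.txt` holds one entry per line, and GNU `parallel`'s `{/}` placeholder substitutes the basename of each entry. Hypothetical contents might look like:

```
images/0001.jpg
images/0002.jpg
images/0003.jpg
```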
In our experience, for files under about 1 MB (we work with large image datasets), copy time is _CPU bound_ by the `aws` process. By comparison, the following script, which serves as an alternative to `aws s3 cp`, consumes about _1/10th_ the CPU by my measurements (and thus downloads files much faster):
```bash
#!/bin/bash
# Minimal S3 GET using a raw Signature Version 2 request.
# $1 = S3 resource path (/bucket/key), $2 = local output path.
contentType="text/html; charset=UTF-8"
date="$(date -u +'%a, %d %b %Y %H:%M:%S GMT')"
# String to sign: VERB, Content-MD5 (empty), Content-Type, Date (empty;
# superseded by the x-amz-date header), canonicalized amz headers, resource.
string="GET\n\n${contentType}\n\nx-amz-date:${date}\n${1}"
signature=$(echo -en "${string}" | openssl sha1 -hmac "${AWS_SECRET_KEY}" -binary | base64)
curl -o "${2}" -s \
  -H "x-amz-date: ${date}" \
  -H "Content-Type: ${contentType}" \
  -H "Authorization: AWS ${AWS_ACCESS_KEY}:${signature}" \
  "https://s3.amazonaws.com${1}"
```
where `${1}` is the S3 input path and `${2}` is the local output path.
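For illustration, assuming the script is saved as `s3get.sh` (a name I've made up here) and that `AWS_ACCESS_KEY` and `AWS_SECRET_KEY` are exported, it can replace `aws s3 cp` in the `parallel` pipeline. Note that `${1}` must start with the bucket name, since it is appended directly to `https://s3.amazonaws.com`:

```bash
# hypothetical invocation: /remote-bucket/images/0001.jpg maps to
# https://s3.amazonaws.com/remote-bucket/images/0001.jpg
cat files.txt | parallel ./s3get.sh /remote-bucket/images/{/} /local/path/{/}
```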
I profiled the `aws s3 cp` command, and it seems that most of the time is spent by the Python interpreter initializing the execution environment. If nothing can be done to speed this up, it would be helpful to have an aws command that copies a list of files in parallel, so that this startup cost is incurred only once. This appears possible, since `aws s3 sync` doesn't consume nearly as much CPU, but it doesn't offer an interface suitable for copying a specific list of files.
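A rough way to see this fixed startup cost (my assumption about where the time goes): even a CLI invocation that transfers nothing pays it.

```bash
# both commands pay the Python interpreter/CLI initialization cost;
# only the second does any actual transfer work
time aws --version
time aws s3 cp s3://remote/path/file.jpg /tmp/file.jpg
```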
@jklontz I can see that happening, especially if `parallel` is being used, because by default each invocation of the CLI command spins up 10 threads. Have you tried lowering the number of threads by configuring `max_concurrent_requests`? That may improve CPU usage. The only other option (which is not great) is to use a single `aws s3 cp --recursive` command with the `--exclude` and `--include` parameters to include only the files from the list, as sketched below. The big problem with this is that you still have to iterate over all of the keys under the prefix.
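For concreteness, that combination would look something like this (with made-up paths and file names):

```bash
# limit each CLI invocation to a single transfer thread
aws configure set default.s3.max_concurrent_requests 1

# one process for the whole list; still enumerates every key under the prefix
aws s3 cp s3://remote/path/ /local/path/ --recursive \
    --exclude "*" --include "0001.jpg" --include "0002.jpg"
```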
Otherwise, I am going to mark this as a feature request. Also noting it is similar to this feature request: https://github.com/aws/aws-cli/issues/2463, which proposes using a bucket manifest as the source for objects to transfer.
Thanks for the response @kyleknap. I looked into your suggestion on `max_concurrent_requests` and did not see an improvement when setting it to `1` before using `parallel`.
Profiling the `aws s3 cp` command on a single file download, I see that 95% of the CPU usage is spent in the main thread (presumably initializing the application, as hypothesized), and there is one worker thread consuming the remaining 5% (presumably downloading the file).
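For reference, one way to reproduce this kind of profile, assuming the v1 CLI's Python entry-point script is on `$PATH`:

```bash
# run the CLI under cProfile; the top cumulative-time entries show how much
# of the run is interpreter/CLI initialization versus actual transfer work
python -m cProfile -s cumtime "$(which aws)" s3 cp s3://remote/path/file.jpg /tmp/file.jpg
```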
Unfortunately, `cp --recursive` with `--exclude` and `--include` is not a viable solution for our use case, which involves downloading random sets of 10k to 1M files from a directory of >10M files.
Good Morning!
We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.
This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.
As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.
We've imported existing feature requests from GitHub - search for this issue there!
And don't worry, this issue will still exist on GitHub for posterity's sake. As it's a text-only import of the original post into UserVoice, we'll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.
GitHub will remain the channel for reporting bugs.
Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface
-The AWS SDKs & Tools Team
Based on community feedback, we have decided to return feature requests to GitHub issues.
Question: How can high CPU usage be fixed when uploading a big file to S3, having tried all four of the following methods?
SDK name: `aws-java-sdk-s3`
SDK version: `1.11.731`

Method 1: setting the MD5-validation system properties:

```java
System.setProperty(SkipMd5CheckStrategy.DISABLE_PUT_OBJECT_MD5_VALIDATION_PROPERTY, "false");
System.setProperty(SkipMd5CheckStrategy.DISABLE_GET_OBJECT_MD5_VALIDATION_PROPERTY, "false");
```

Method 2: `AmazonS3Client.putObject`
Method 3: `TransferManager.upload`
Method 4: `AmazonS3Client.uploadPart` + `completeMultipartUpload`