This issue arises primarily when copying a list of files from S3. My understanding is that the suggested approach is to copy the files individually by invoking `aws s3 cp`, for example:

```bash
cat files.txt | parallel aws s3 cp s3://remote/path/{/} /local/path/{/}
```
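Here `files.txt` holds one entry per line, and GNU `parallel`'s `{/}` placeholder substitutes the basename of each entry. Hypothetical contents might look like:

```
images/0001.jpg
images/0002.jpg
images/0003.jpg
```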
In our experience, for files under about 1 MB (we work with large image datasets), copy time is _CPU bound_ by the `aws` process. By comparison, the following script, which serves as an alternative to `aws s3 cp`, consumes about _1/10th_ the CPU by my measurements (and thus downloads files much faster):
```bash
#!/bin/bash
# Minimal S3 GET using a raw Signature Version 2 request.
# $1 = S3 resource path (/bucket/key), $2 = local output path.
contentType="text/html; charset=UTF-8"
date="$(date -u +'%a, %d %b %Y %H:%M:%S GMT')"
# String to sign: VERB, Content-MD5 (empty), Content-Type, Date (empty;
# superseded by the x-amz-date header), canonicalized amz headers, resource.
string="GET\n\n${contentType}\n\nx-amz-date:${date}\n${1}"
signature=$(echo -en "${string}" | openssl sha1 -hmac "${AWS_SECRET_KEY}" -binary | base64)
curl -o "${2}" -s \
  -H "x-amz-date: ${date}" \
  -H "Content-Type: ${contentType}" \
  -H "Authorization: AWS ${AWS_ACCESS_KEY}:${signature}" \
  "https://s3.amazonaws.com${1}"
```
where `${1}` is the S3 input path and `${2}` is the local output path.
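For illustration, assuming the script is saved as `s3get.sh` (a name I've made up here) and that `AWS_ACCESS_KEY` and `AWS_SECRET_KEY` are exported, it can replace `aws s3 cp` in the `parallel` pipeline. Note that `${1}` must start with the bucket name, since it is appended directly to `https://s3.amazonaws.com`:

```bash
# hypothetical invocation: /remote-bucket/images/0001.jpg maps to
# https://s3.amazonaws.com/remote-bucket/images/0001.jpg
cat files.txt | parallel ./s3get.sh /remote-bucket/images/{/} /local/path/{/}
```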
I profiled the `aws s3 cp` command, and it seems that most of the time is spent by the Python interpreter initializing the execution environment. If nothing can be done to speed this up, it would be helpful to have an aws command that copies a list of files in parallel, so that this startup cost is incurred only once. This appears possible, since `aws s3 sync` doesn't consume nearly as much CPU, but it doesn't offer an interface suitable for copying a specific list of files.
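A rough way to see this fixed startup cost (my assumption about where the time goes): even a CLI invocation that transfers nothing pays it.

```bash
# both commands pay the Python interpreter/CLI initialization cost;
# only the second does any actual transfer work
time aws --version
time aws s3 cp s3://remote/path/file.jpg /tmp/file.jpg
```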
@jklontz I can see that happening, especially if `parallel` is being used, because by default each invocation of the CLI command spins up 10 threads. Have you tried lowering the number of threads by configuring `max_concurrent_requests`? That may improve CPU usage. The only other option (which is not great) is to use a single `aws s3 cp --recursive` command with the `--exclude` and `--include` parameters to include only the files from the list, as sketched below. The big problem with this is that you still have to iterate over all of the keys under the prefix.
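For concreteness, that combination would look something like this (with made-up paths and file names):

```bash
# limit each CLI invocation to a single transfer thread
aws configure set default.s3.max_concurrent_requests 1

# one process for the whole list; still enumerates every key under the prefix
aws s3 cp s3://remote/path/ /local/path/ --recursive \
    --exclude "*" --include "0001.jpg" --include "0002.jpg"
```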
Otherwise, I am going to mark this as a feature request. Also noting it is similar to this feature request: https://github.com/aws/aws-cli/issues/2463, which proposes using a bucket manifest as the source for objects to transfer.
Thanks for the response @kyleknap. I looked into your suggestion on `max_concurrent_requests` and did not see an improvement when setting it to `1` before using `parallel`.
Profiling the `aws s3 cp` command on a single file download, I see that 95% of the CPU usage is spent in the main thread (presumably initializing the application, as hypothesized), and there is one worker thread consuming the remaining 5% (presumably downloading the file).
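For reference, one way to reproduce this kind of profile, assuming the v1 CLI's Python entry-point script is on `$PATH`:

```bash
# run the CLI under cProfile; the top cumulative-time entries show how much
# of the run is interpreter/CLI initialization versus actual transfer work
python -m cProfile -s cumtime "$(which aws)" s3 cp s3://remote/path/file.jpg /tmp/file.jpg
```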
Unfortunately, `cp --recursive` with `--exclude` and `--include` is not a viable solution for our use case, which involves downloading random sets of 10k to 1M files from a directory of >10M files.
Good Morning!
We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.
This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.
As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.
We've imported existing feature requests from GitHub - search for this issue there!
And don't worry, this issue will still exist on GitHub for posterity's sake. As it's a text-only import of the original post into UserVoice, we'll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.
GitHub will remain the channel for reporting bugs.
Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface
-The AWS SDKs & Tools Team
Based on community feedback, we have decided to return feature requests to GitHub issues.
Question: How can high CPU usage be fixed when uploading a big file to S3, having tried all four of the following methods?
SDK name: `aws-java-sdk-s3`
SDK version: `1.11.731`

Method 1: setting the MD5-validation system properties:

```java
System.setProperty(SkipMd5CheckStrategy.DISABLE_PUT_OBJECT_MD5_VALIDATION_PROPERTY, "false");
System.setProperty(SkipMd5CheckStrategy.DISABLE_GET_OBJECT_MD5_VALIDATION_PROPERTY, "false");
```

Method 2: `AmazonS3Client.putObject`
Method 3: `TransferManager.upload`
Method 4: `AmazonS3Client.uploadPart` + `completeMultipartUpload`