aws-cli: aws s3 cp doesn't work with prefix

Created on 18 Aug 2015 · 37 Comments · Source: aws/aws-cli

From a bucket containing millions of files, I want to download a few thousand based on their prefix.

Note: using --exclude "*" --include "myprefix" works, but is impractical since it lists all the files and applies the filter afterwards.

The problem is aws s3 cp adds a trailing "/" after the prefix I specify. In the example below you can see that logs/2015-08-17 becomes logs/2015-08-17/.
This additional "/" leads to an empty result set, since the keys in the bucket have the form logs/YYYY-MM-DD-HH-MM-SS-RANDOM.

aws s3 cp s3://mybucket/logs/2015-08-17 . --recursive --dryrun --debug

2015-08-17 21:48:32,026 - MainThread - awscli.clidriver - DEBUG - CLI version: aws-cli/1.7.27 Python/2.7.6 Linux/3.14.13-c9, botocore version: 0.108.0
2015-08-17 21:48:32,026 - MainThread - awscli.clidriver - DEBUG - Arguments entered to CLI: ['s3', 'cp', 's3://mybucket/logs/2015-08-17', '.', '--recursive', '--dryrun', '--debug']
....
2015-08-17 21:48:32,088 - MainThread - botocore.endpoint - DEBUG - Making request for (verify_ssl=True) with params: {'query_string': {u'prefix': u'logs/2015-08-17/', u'encoding-type': 'url'}, 'headers': {}, 'url_path': u'/mybucket', 'body': '', 'method': u'GET'}

feature-request pneeds-review s3 s3filters

All 37 comments

The current CLI "aws s3 cp" behavior (coincidentally?) matches the behavior of the Unix "cp" command. We will need to think about whether or how to support the use case you mentioned.

+1 This issue bit me today.

What is the reason/use case for adding a trailing slash?

I would like to filter on a partial prefix that does not correspond to a "directory". I was able to work around this by using bash to filter the file list and piping it back to aws s3 cp.
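
For reference, here's a minimal sketch of that kind of bash workaround (the bucket and prefix are placeholders, not from the original comment):

aws s3 ls s3://mybucket/logs/2015-08-17 | \
  awk '{ print $4; }' | \
  while IFS= read -r key; do
    # Each listed key is shown relative to logs/, so re-prepend that part of the path.
    aws s3 cp "s3://mybucket/logs/$key" .
  done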

@fabiocesari @geota

We do read the exclude/include rules literally, and do not add the '/' to the end. To do what you want to do, you need to add a * to the end of the partial prefix, which means match all characters after. So a combination like --exclude '*' --include 'myprefix*' should work.
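
For example, applied to the logs case from this issue (bucket and prefix are placeholders), that filter combination would look like:

aws s3 cp s3://mybucket/logs/ . --recursive --exclude '*' --include '2015-08-17*'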

Let me know if you have any more questions.

@kyleknap please read my previous post. The trailing slash is clearly added when the python code places calls to the S3 API (see the debug transcript), and using the wildcards is impractical because they are applied after all files are listed, which results in thousands of unnecessary calls.

+1

@fabiocesari I see what you're saying. It looks like the --recursive is the heart of the issue. We're assuming that when the flag is set, the s3 path is intended to be a directory. As you've demonstrated, this is not always the intent.
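
To make that concrete with the path from the original report, the two commands below end up sending different prefixes to S3 (the cp behavior is what the debug output above shows; ls leaves the prefix as given):

aws s3 cp s3://mybucket/logs/2015-08-17 . --recursive   # lists with prefix 'logs/2015-08-17/'
aws s3 ls s3://mybucket/logs/2015-08-17                 # lists with prefix 'logs/2015-08-17'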

+1

+1, Encountered this issue today

Running into this today as well. Looks like it affects s3 sync as well. s3 ls seems to be working as expected.

+1: Wanted to download a set of S3 log files that are date prefixed, not separated by a '/' path. Had to fetch all 5000+ files instead of just the ~20 matching my requirements (logging prefix set to s3://mybucket/S3logs/prefix/, so files are in s3://mybucket/S3logs/prefix/2015-11-30-03-28-33-XXXXXXXXX). aws-cli/1.7.29

I think instead of monkeying with --recursive in this context, you should concatenate the --include filter with the prefix at https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/filegenerator.py#L317-L318, for which I started a PR: #1707

If you could specify a different delimiter, that would also satisfy the main use case here, but it's hardcoded at https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/subcommands.py#L465 and using boto's default everywhere else, I think -- the paginator code is really hard to follow.
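
For what it's worth, the underlying ListObjects API already accepts an arbitrary partial key as its Prefix parameter, which is what makes server-side narrowing possible in the first place. A quick illustration via the s3api commands (bucket and prefix are placeholders):

aws s3api list-objects-v2 --bucket mybucket --prefix 'logs/2015-08-17' --query 'Contents[].Key'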

Rough and ready: with S3 logging, you end up with one prefix with a very large number of matching objects (under the '/' separator). If you only want a small subset, say everything matching a specific year-month-day-hour:
for F in $(aws s3 ls s3://bucket/pathprefix/2016-04-05-11 | cut -c 32-); do aws s3 cp s3://koondart-logs-prod/S3logs/prodmainpersistentblobstore-prodblobstore-1sa83kxh4c3bl/$F . ; done

I agree with both of ralph's ideas. Efficient downloading by prefix is a must have.

How do I get #1707 reviewed/merged ? Has anyone besides me tested my branch?

Just in case someone wants a script this worked for me:

aws s3 ls s3://bucket/fileprefix | cut -c 32- | while IFS= read -r line; do aws s3 cp s3://bucket/$line .; done

Here is a script that reports the number of hits in a day's logs for a specific string. It does a lot of fetches because AWS S3 is lame about prefixes:

#!/bin/bash

BUCKET=www.domain.com
DIR=log
DATE="$(date +%Y-%m-%d)"
LOGID=EUSMQ0RIJI24Q # varies between S3 buckets
PREFIX=$DIR/$LOGID.$DATE
REFERER=special.com

rm *.gz # clear out previously downloaded logs

aws s3 ls s3://$BUCKET/$PREFIX | \
  awk '{ print $4; }' | \
  while IFS= read -r line; do 
    CMD="aws s3 cp s3://$BUCKET/$DIR/$line ."
    #echo "$CMD" # just for debugging
    $CMD
  done

COUNT="$(zcat *.gz | grep $REFERER | wc -l)"
echo "$COUNT referrals from $REFERER on $DATE"

This workaround uses GNU parallel to run several downloads at the same time, to make it faster to fetch all the files. It will probably break if you hit a certain number of matching keys, but I haven't tested to find out where that limit is.

MATCHING_KEYS="$(
  aws s3api list-objects-v2 \
    --bucket "${BUCKET}" \
    --prefix "${KEY_PREFIX}" |
    jq --raw-output ".Contents[].Key"
)"

# Number of jobs arbitrarily chosen as a balance between speed and stability.
parallel --jobs 12 "aws s3 cp s3://${BUCKET}/{} ." ::: $MATCHING_KEYS

Using @kyleknap's suggested solution, the command hasn't even started copying and it's been running for 10 minutes now. I'm assuming it's trying to list all my objects under the logs/ prefix and then filter client-side:

aws s3 cp --recursive --exclude '*' --include '2016-11-16-17*'  s3://bucket/logs/ .

Best workaround I've seen so far is to abandon the AWS CLI tools completely:

s3cmd get 's3://bucket/logs/2016-11-16-17*' .

@andpol did you try my fix in #1707? I don't think alternative tools are a valid discussion point in this issue.

Yeah, that's fair. I'd love for the issue to be resolved correctly, but since we're already talking about shell scripts with awk and parallel.. ;)

FWIW, I just tried out your PR branch, executing aws s3 cp --recursive --exclude '*' --include '2016-11-16-17*' s3://bucket/logs/ .

Worked perfectly! Hopefully someone on the AWS team can review your changes and get them merged.

@kyleknap @JordonPhillips @jamesls can we get some help closing this out?

+1

+1

+1

+1

+1... Moving 19 TB without the ability to target specific keys on different machines is a nightmare. The include/exclude filters still have to list through every page, unlike the path wildcard behavior of the s3 ls command.

+1

Seems kinda crazy that I've now tried about 40 different aws s3 cp command variants and still can't find the incantation to copy the contents of a bucket folder to my local disk. Puzzled as to why it's so hard... it shouldn't take hours to work out how to copy a few files.

+1 just hit this; would have liked to just use aws s3 cp --recursive s3://bucket/2018-01- to grab logs for 2018-01-XX.

This partial match works in the S3 browser in the AWS console; I wish it worked in the CLI.

+1

@awstools Can we get an update on this?

+1
As andychase mentions above, an update would be great now that the footwork has been done...

1,350,000 objects must be listed prior to returning the 20 we need? This basically renders S3 useless for our use case...

It would be useful if there was at least an option to not add the trailing slash to the prefix. There are many use cases for copying by prefix while NOT treating keys like a directory structure. It would be very helpful to not have to write my own one-off tool to avoid listing potentially tens of millions of keys to copy a few thousand objects to another bucket.
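
As a rough sketch, such a one-off copy-by-prefix can also be expressed with the s3api commands (bucket names and prefix below are placeholders; the CLI paginates list-objects-v2 automatically, so only keys matching the raw prefix are listed):

aws s3api list-objects-v2 \
  --bucket source-bucket \
  --prefix 'logs/2015-08-17' \
  --query 'Contents[].Key' \
  --output text | \
  tr '\t' '\n' | \
  while IFS= read -r key; do
    # Server-side copy of each matching object into the destination bucket.
    aws s3api copy-object \
      --copy-source "source-bucket/$key" \
      --bucket dest-bucket \
      --key "$key"
  done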

+1

So, why can I ls but not cp a prefix? Is there some grand justification I missed in all those replies for the last 3 years?

+1, the same thing happens on 'aws s3 rm'.

+1 just hit this; would have liked to just use aws s3 cp --recursive s3://bucket/2018-01- to grab logs for 2018-01-XX.

Here's a workaround for that case:
aws s3 ls "s3://bucket/2018-01-" | awk '{print $4}' | xargs -I % aws s3 cp s3://bucket/% .
