aws-cli: aws s3 cp doesn't work with prefix

Created on 18 Aug 2015 · 37 Comments · Source: aws/aws-cli

From a bucket containing millions of files, I want to download a few thousand based on their prefix.

Note: using --exclude "*" --include "myprefix" works, but is impractical since it lists all the files and applies the filter afterwards.

The problem is aws s3 cp adds a trailing "/" after the prefix I specify. In the example below you can see that logs/2015-08-17 becomes logs/2015-08-17/.
This additional "/" leads to an empty result set, since the keys in the bucket have the form logs/YYYY-MM-DD-HH-MM-SS-RANDOM.

aws s3 cp s3://mybucket/logs/2015-08-17 . --recursive --dryrun --debug

2015-08-17 21:48:32,026 - MainThread - awscli.clidriver - DEBUG - CLI version: aws-cli/1.7.27 Python/2.7.6 Linux/3.14.13-c9, botocore version: 0.108.0
2015-08-17 21:48:32,026 - MainThread - awscli.clidriver - DEBUG - Arguments entered to CLI: ['s3', 'cp', 's3://mybucket/logs/2015-08-17', '.', '--recursive', '--dryrun', '--debug']
....
2015-08-17 21:48:32,088 - MainThread - botocore.endpoint - DEBUG - Making request for (verify_ssl=True) with params: {'query_string': {u'prefix': u'logs/2015-08-17/', u'encoding-type': 'url'}, 'headers': {}, 'url_path': u'/mybucket', 'body': '', 'method': u'GET'}

feature-request pneeds-review s3 s3filters

All 37 comments

The current CLI "aws s3 cp" behavior (coincidentally?) matches the behavior of the Unix "cp" command. We will need to think about whether or how to support the use case you mentioned.

+1 This issue bit me today.

What is the reason/use case for adding a trailing slash?

I would like to filter on a partial prefix that does not correspond to a "directory". I was able to work around this by using bash to filter the file list and piping it back to aws s3 cp.
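
For reference, here's a minimal sketch of that kind of bash workaround (the bucket and prefix are placeholders, not from the original comment):

aws s3 ls s3://mybucket/logs/2015-08-17 | \
  awk '{ print $4; }' | \
  while IFS= read -r key; do
    # Each listed key is shown relative to logs/, so re-prepend that part of the path.
    aws s3 cp "s3://mybucket/logs/$key" .
  done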

@fabiocesari @geota

We do read the exclude/include rules literally, and do not add the '/' to the end. To do what you want to do, you need to add a * to the end of the partial prefix, which means match all characters after. So a combination like --exclude '*' --include 'myprefix*' should work.
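
For example, applied to the logs case from this issue (bucket and prefix are placeholders), that filter combination would look like:

aws s3 cp s3://mybucket/logs/ . --recursive --exclude '*' --include '2015-08-17*'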

Let me know if you have any more questions.

@kyleknap please read my previous post. The trailing slash is clearly added when the python code places calls to the S3 API (see the debug transcript), and using the wildcards is impractical because they are applied after all files are listed, which results in thousands of unnecessary calls.

+1

@fabiocesari I see what you're saying. It looks like the --recursive is the heart of the issue. We're assuming that when the flag is set, the s3 path is intended to be a directory. As you've demonstrated, this is not always the intent.
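
To make that concrete with the path from the original report, the two commands below end up sending different prefixes to S3 (the cp behavior is what the debug output above shows; ls leaves the prefix as given):

aws s3 cp s3://mybucket/logs/2015-08-17 . --recursive   # lists with prefix 'logs/2015-08-17/'
aws s3 ls s3://mybucket/logs/2015-08-17                 # lists with prefix 'logs/2015-08-17'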

+1

+1, Encountered this issue today

Running into this today as well. Looks like it affects s3 sync as well. s3 ls seems to be working as expected.

+1: Wanted to download a set of S3 log files that are date prefixed, not separated by a '/' path. Had to fetch all 5000+ files instead of just the ~20 matching my requirements (logging prefix set to s3://mybucket/S3logs/prefix/, so files are in s3://mybucket/S3logs/prefix/2015-11-30-03-28-33-XXXXXXXXX). aws-cli/1.7.29

I think instead of monkeying with --recursive in this context, you should concatenate the --include filter with the prefix at https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/filegenerator.py#L317-L318, for which I started a PR: #1707

If you could specify a different delimiter, that would also satisfy the main use case here, but it's hardcoded at https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/subcommands.py#L465 and using boto's default everywhere else, I think -- the paginator code is really hard to follow.
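
For what it's worth, the underlying ListObjects API already accepts an arbitrary partial key as its Prefix parameter, which is what makes server-side narrowing possible in the first place. A quick illustration via the s3api commands (bucket and prefix are placeholders):

aws s3api list-objects-v2 --bucket mybucket --prefix 'logs/2015-08-17' --query 'Contents[].Key'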

Rough and ready: with S3 logging, you end up with one prefix with a very large number of matching objects (under the '/' separator). If you only want a small subset, say everything matching a specific year-month-day-hour:
for F in $(aws s3 ls s3://bucket/pathprefix/2016-04-05-11 | cut -c 32-); do aws s3 cp s3://koondart-logs-prod/S3logs/prodmainpersistentblobstore-prodblobstore-1sa83kxh4c3bl/$F . ; done

I agree with both of ralph's ideas. Efficient downloading by prefix is a must have.

How do I get #1707 reviewed/merged ? Has anyone besides me tested my branch?

Just in case someone wants a script this worked for me:

aws s3 ls s3://bucket/fileprefix | cut -c 32- | while IFS= read -r line; do aws s3 cp s3://bucket/$line .; done

Here is a script that reports the number of hits in a day's logs for a specific string. It does a lot of fetches because AWS S3 is lame about prefixes:

#!/bin/bash

BUCKET=www.domain.com
DIR=log
DATE="$(date +%Y-%m-%d)"
LOGID=EUSMQ0RIJI24Q # varies between S3 buckets
PREFIX=$DIR/$LOGID.$DATE
REFERER=special.com

rm *.gz # clear out previously downloaded logs

aws s3 ls s3://$BUCKET/$PREFIX | \
  awk '{ print $4; }' | \
  while IFS= read -r line; do 
    CMD="aws s3 cp s3://$BUCKET/$DIR/$line ."
    #echo "$CMD" # just for debugging
    $CMD
  done

COUNT="$(zcat *.gz | grep $REFERER | wc -l)"
echo "$COUNT referrals from $REFERER on $DATE"

This workaround uses GNU parallel to run several downloads at the same time, to make it faster to fetch all the files. It will probably break if you hit a certain number of matching keys, but I haven't tested to find out where that limit is.

MATCHING_KEYS="$(
  aws s3api list-objects-v2 \
    --bucket "${BUCKET}" \
    --prefix "${KEY_PREFIX}" |
    jq --raw-output ".Contents[].Key"
)"

# Number of jobs arbitrarily chosen as a balance between speed and stability.
parallel --jobs 12 "aws s3 cp s3://${BUCKET}/{} ." ::: $MATCHING_KEYS

Using @kyleknap's suggested solution, the command hasn't even started copying and it's been running for 10 minutes now. I'm assuming it's trying to list all my objects under the logs/ prefix and then filter client-side:

aws s3 cp --recursive --exclude '*' --include '2016-11-16-17*'  s3://bucket/logs/ .

Best workaround I've seen so far is to abandon the AWS CLI tools completely:

s3cmd get 's3://bucket/logs/2016-11-16-17*' .

@andpol did you try my fix in #1707? I don't think alternative tools are a valid discussion point in this issue.

Yeah, that's fair. I'd love for the issue to be resolved correctly, but since we're already talking about shell scripts with awk and parallel.. ;)

FWIW, I just tried out your PR branch, executing aws s3 cp --recursive --exclude '*' --include '2016-11-16-17*' s3://bucket/logs/ .

Worked perfectly! Hopefully someone on the AWS team can review your changes and get them merged.

@kyleknap @JordonPhillips @jamesls can we get some help closing this out?

+1

+1

+1

+1

+1... Moving 19 TB without the ability to target specific keys on different machines is a nightmare. The include/exclude filters still have to list through every page, unlike the path wildcard behavior of the s3 ls command.

+1

Seems kinda crazy that I've now tried about 40 different aws s3 cp command variants and still can't find the incantation to copy the contents of a bucket folder to my local disk. Puzzled as to why it's so hard... it shouldn't take hours to work out how to copy a few files.

+1 just hit this; would have liked to just use aws s3 cp --recursive s3://bucket/2018-01- to grab logs for 2018-01-XX.

This partial match works in the S3 browser in the AWS console; I wish it worked in the CLI.

+1

@awstools Can we get an update on this?

+1
As andychase mentions above, an update would be great now that the footwork has been done...

1,350,000 objects must be listed prior to returning the 20 we need? This basically renders S3 useless for our use case...

It would be useful if there was at least an option to not add the trailing slash to the prefix. There are many use cases for copying by prefix while NOT treating keys like a directory structure. It would be very helpful to not have to write my own one-off tool to avoid listing potentially tens of millions of keys to copy a few thousand objects to another bucket.
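
As a rough sketch, such a one-off copy-by-prefix can also be expressed with the s3api commands (bucket names and prefix below are placeholders; the CLI paginates list-objects-v2 automatically, so only keys matching the raw prefix are listed):

aws s3api list-objects-v2 \
  --bucket source-bucket \
  --prefix 'logs/2015-08-17' \
  --query 'Contents[].Key' \
  --output text | \
  tr '\t' '\n' | \
  while IFS= read -r key; do
    # Server-side copy of each matching object into the destination bucket.
    aws s3api copy-object \
      --copy-source "source-bucket/$key" \
      --bucket dest-bucket \
      --key "$key"
  done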

+1

So, why can I ls but not cp a prefix? Is there some grand justification I missed in all those replies for the last 3 years?

+1, the same thing happens on 'aws s3 rm'.

+1 just hit this; would have liked to just use aws s3 cp --recursive s3://bucket/2018-01- to grab logs for 2018-01-XX.

Here's a workaround for that case:
aws s3 ls "s3://bucket/2018-01-" | awk '{print $4}' | xargs -I % aws s3 cp s3://bucket/% .
