aws-cli: AWS S3 ls wildcard support

Created on 5 Dec 2018 · 15 Comments · Source: aws/aws-cli

Currently, it seems there is no way to search for file(s) using ls and a wildcard. For example:

aws s3 ls s3://bucket/folder/2018*.txt

This would return nothing, even if the file is present.

I have done some searching online, and it seems wildcards are supported for rm, mv & cp, but not for ls. The common workaround is to ls the entire directory and then grep for the files you are searching for:

aws s3 ls s3://bucket/folder/ | grep '2018.*\.txt'

While looking into this, I also found warnings that this won't work effectively if there are over 1,000 objects in a bucket.

To me, it would be nice to have the aws s3 ls command work with wildcards instead of handling this with grep and also having to deal with the 1,000-object limit.
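
As a point of reference, the grep workaround can be scripted reliably with boto3: a paginator transparently follows the 1,000-keys-per-page limit of the ListObjectsV2 API, and fnmatch applies the shell-style wildcard on the client side. This is only a sketch; ls_glob and the bucket/prefix names are placeholders, not part of the CLI.

import fnmatch

import boto3


def ls_glob(bucket, prefix, pattern):
    # List every key under `prefix` and keep the ones matching a shell-style
    # wildcard. The paginator follows the 1,000-keys-per-page limit of
    # ListObjectsV2 automatically, so results are not truncated.
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if fnmatch.fnmatch(obj['Key'], pattern):
                yield obj['Key']


# Placeholder names, matching the example above:
# for key in ls_glob('bucket', 'folder/', 'folder/2018*.txt'):
#     print(key)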

Labels: guidance, s3

All 15 comments

@Bongani - Thank you for reaching out and reporting this. This would be a feature request for our service team. If the service team approves and adds this feature, it will be exposed from the API to the CLI. I have submitted an internal request to our service team, but I would recommend following up via a case with AWS Support or through the AWS Forum for S3.

@justnance Thanks for submitting the request.

no problem at all.

As per our S3 Service team, _Thanks for the feedback, your feature request will be prioritized with other features planned for S3._

Thanks for the update. It's greatly appreciated.

I would love this also.

In my case I have lots of files in S3 under date folders of the form

bucket/name/YYYY-MM/YYYY-MM-DD/filename.ext

I would love to say "aws s3 ls s3://mybucket/*/2016-03/*/" and list all files from all root prefixes for March 2016.
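
A minimal boto3 sketch of that kind of listing, assuming the mybucket name and key layout from the comment above: Delimiter='/' surfaces the top-level name/ prefixes as CommonPrefixes, and each one is then listed under its 2016-03/ partition.

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Enumerate the top-level "name/" prefixes, then list only the 2016-03
# partition under each one.
for page in paginator.paginate(Bucket='mybucket', Delimiter='/'):
    for cp in page.get('CommonPrefixes', []):
        month_prefix = cp['Prefix'] + '2016-03/'
        for obj_page in paginator.paginate(Bucket='mybucket', Prefix=month_prefix):
            for obj in obj_page.get('Contents', []):
                print(obj['Key'])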

So how is this going now? Is there a roadmap that includes this feature?

Hi All. This was escalated internally and there is no other update at this time. This feature is controlled by the service team, not the CLI. I can escalate it again with the forum link if this feature request is posted on the S3 Service team's forum.

This is really essential. You can grep, but man... really?

I just had to do the following to rename a series of folders from .csv to .jsonl. This is ridiculous!

# Copy
for file in `aws s3 ls s3://stackoverflow-events/07-30-2019/|tr -s ' '|cut -d ' ' -f3`; 
do 
    aws s3 cp --recursive \
        s3://stackoverflow-events/07-30-2019/$file \
        `echo s3://stackoverflow-events/07-30-2019/$file|sed 's/csv/jsonl/'`;
done

# Now delete
for file in `aws s3 ls 's3://stackoverflow-events/07-30-2019/'|grep 'Questions.Stratified.*.csv/$'|tr -s ' '|cut -d ' ' -f3`;
do
    aws s3 rm --recursive s3://stackoverflow-events/07-30-2019/$file;
done

For shame!
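
For what it's worth, the same rename can be sketched with boto3 instead of parsing aws s3 ls output. The bucket and prefix are copied from the snippet above; everything else (the helper name, the first-occurrence replacement mirroring sed 's/csv/jsonl/', and the assumption that every object is under the 5 GB copy_object limit) is illustrative only.

import boto3


def rename_first_occurrence(bucket, prefix, old, new):
    # Copy each object under `prefix` to a key with the first occurrence of
    # `old` replaced by `new` (mirroring sed 's/old/new/'), then delete the
    # original. copy_object only handles objects up to 5 GB; larger objects
    # would need a multipart copy such as boto3's managed s3.copy().
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if old not in key:
                continue
            new_key = key.replace(old, new, 1)
            s3.copy_object(Bucket=bucket,
                           CopySource={'Bucket': bucket, 'Key': key},
                           Key=new_key)
            s3.delete_object(Bucket=bucket, Key=key)


# rename_first_occurrence('stackoverflow-events', '07-30-2019/', 'csv', 'jsonl')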

Python implementation of s3 wildcard search:

import boto3
import re

def search_s3_regex(results, bucket, prefix, regex_path):
    """Recursively match one '/'-separated regex component per key level.

    Each call lists a single level (Delimiter='/') and either descends into
    matching CommonPrefixes or, at the last level, collects matching keys.
    """
    s3_client = boto3.client('s3')
    wc_parts = regex_path.split('/')
    if len(wc_parts) == 1 and len(wc_parts[0]) == 0:
        # No pattern components left: the whole path has matched.
        results.append(prefix)
        return
    else:
        regex = re.compile(wc_parts[0])
        next_regex_path = '/'.join(wc_parts[1:])
        paginator = s3_client.get_paginator('list_objects')
        result = paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=prefix)
        for pref in result.search('CommonPrefixes'):
            if pref is None:
                # No sub-"folders" at this level: check the files instead
                for k in result.search('Contents'):
                    if k is None:
                        continue
                    res = k.get('Key')
                    search_prefix = res if len(prefix) == 0 else res.split(prefix)[1]
                    if re.match(regex, search_prefix):
                        results.append(res)
            else:
                # Sub-"folder": recurse with the remaining pattern components
                res = pref.get('Prefix')
                search_prefix = res if len(prefix) == 0 else res.split(prefix)[1]
                if re.match(regex, search_prefix):
                    search_s3_regex(results, bucket, res, next_regex_path)

After that you can call the function like this:

res = []
search_s3_regex(res, 'my_bucket', 'initial_prefix/blah/', 'b.{2}h/[0-9]{2}-.*-2019/.*')
for p in res:
    print(p)

Just curious, about a year later: is there a plan for native support? The scripts are nice, but they don't help in cases where there are tons of files in the directory that still have to be traversed on the client side. Even the include/exclude filters that are already supported by cp/rm/mv would be nice.
https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters

You can use sync with --dryrun option:

aws s3 sync --dryrun --exclude '*' --include '*/.DS_Store' s3://mybucket ./

This solution is such a great hack. You can distill it into single S3 urls by piping to awk:

aws s3 sync --dryrun --exclude '*' --include '*/.DS_Store' s3://mybucket ./ | awk '/s3:\/\//{print $3}'

This works well for me!
