gcsfs has a very handy feature that lets you fetch multiple files by allowing wildcards in the object path. I think this would be a nice little feature to add to this library.
This example illustrates the idea:
from google.cloud import storage
c = storage.Client()
bucket = c.bucket('mybucket')
blobs = bucket.blob('2017/*.csv')  # proposed: wildcard match (not currently supported)
As far as I know, the current way to accomplish the same thing is to list all the files in the bucket, filter the list, and then fetch the matching files one by one (please correct me if I'm wrong :). The problem with this is that it's slow if you have a bucket with a very large number of files.
Cheers,
Halfdan
@halfdanrump bucket.list_blobs takes an optional prefix parameter. It filters to blobs whose names start with the given string, which can partly solve your problem.
The following will return an iterator over blobs whose names start with test_:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('mybucket')
blobs = bucket.list_blobs(prefix='test_')
@sagarrakshe You're right, I hadn't noticed that parameter. In my case this is sufficient, so thanks for telling me about it! :)
As you also point out, it's only a partial solution. Actually, it might be handy to even allow regex matches on the filenames @lukesneeringer. I'm not into the code, so I don't know how difficult this would be to implement. Do you think it would be worth the effort?
Cheers,
Halfdan
The list objects API doesn't allow any wildcard parameter, apart from prefix:
https://cloud.google.com/storage/docs/json_api/v1/objects/list
So we would need to add a parameter to the bucket.list_blobs method (say pattern=None) that filters objects by applying the pattern to each object's name. Is this the optimal way to do it?
Any thoughts? @lukesneeringer @dhermes
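A minimal sketch of what such a convenience wrapper might look like (the function name and signature here are hypothetical, not part of the library):

```python
import fnmatch


def list_blobs_matching(bucket, pattern, prefix=None):
    """Yield blobs whose names match a glob pattern.

    The optional prefix narrows the listing server-side; the glob
    pattern is then applied client-side with fnmatch.
    """
    for blob in bucket.list_blobs(prefix=prefix):
        if fnmatch.fnmatch(blob.name, pattern):
            yield blob
```

Callers could then write list_blobs_matching(bucket, '2017/*.csv', prefix='2017/'), keeping the server-side listing as narrow as possible before the client-side match.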
@sagarrakshe As you note, the back-end doesn't provide such access, so what we are discussing is really a convenience wrapper for application code which would otherwise be something like:
import fnmatch

for blob in bucket.list_blobs():
    if fnmatch.fnmatch(blob.name, "*something"):
        do_something_with(blob)
Nice. So can we close this issue? @tseaver
I'll close it, as there isn't much we can do to improve on application-level processing.
The fnmatch lib works, but the filtering process is very slow. I don't know how gsutil handles the problem so effectively.
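As I understand it, gsutil narrows the server-side listing by using the literal part of the pattern before the first wildcard as the prefix, then matching client-side. A sketch of that idea (the helper name is made up):

```python
import re


def glob_prefix(pattern):
    """Return the literal part of a glob pattern before the first
    wildcard character (*, ?, or [), usable as a list_blobs prefix."""
    match = re.search(r"[*?\[]", pattern)
    return pattern[: match.start()] if match else pattern


# e.g. glob_prefix("2017/*.csv") == "2017/", so only objects under
# "2017/" are listed before the client-side glob match is applied.
```

Passing that prefix to bucket.list_blobs(prefix=...) should cut the number of objects the client has to filter, which is where most of the slowness comes from.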
At any point will this feature request be re-opened to allow wildcards within the list_blobs method?
For example:
bucket.list_blobs(prefix='2019*.csv')
@ViRaL95
At any point will this feature request be re-opened to allow wildcards within the list_blobs method?
Because the back-end doesn't provide support for that kind of matching, we decided that it was not worth the effort, given how simple it is to do the matching in the application (as my example above illustrates).
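For the 2019*.csv case, the application-level matching can also use fnmatch.filter directly on the listed names (the sample names below are made up for illustration):

```python
import fnmatch

# In real code these would come from
# [b.name for b in bucket.list_blobs(prefix="2019")].
names = ["2019-01.csv", "2019-02.csv", "2019-notes.txt", "2018-12.csv"]

# fnmatch.filter applies the glob to every name and keeps the matches.
matches = fnmatch.filter(names, "2019*.csv")
# matches == ["2019-01.csv", "2019-02.csv"]
```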