gcsfs has a very handy feature that lets you fetch multiple files by allowing wildcards in the object path. I think this would be a nice little feature to add to this library.
This example illustrates the idea:
from google.cloud import storage
c = storage.Client()
bucket = c.bucket('mybucket')
blobs = bucket.blob('2017/*.csv')  # proposed: wildcard match (not currently supported)
As far as I know, the current way to accomplish the same thing is to list all the files in the bucket, filter the list, and then fetch the matching files one by one (please correct me if I'm wrong :). The problem with this is that it's slow if you have a bucket with a very large number of files.
Cheers,
Halfdan
@halfdanrump bucket.list_blobs takes an optional prefix parameter. It filters to blobs whose names start with the given string, which can partly solve your problem.
The following will return an iterator over blobs whose names start with test_:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('mybucket')
blobs = bucket.list_blobs(prefix='test_')
@sagarrakshe You're right, I hadn't noticed that parameter. In my case this is sufficient, so thanks for telling me about it! :)
As you also point out, it's only a partial solution. Actually, it might be handy to even allow regex matches on the filenames @lukesneeringer. I'm not into the code, so I don't know how difficult this would be to implement. Do you think it would be worth the effort?
Cheers,
Halfdan
The list objects API doesn't allow any wildcard parameter, apart from prefix:
https://cloud.google.com/storage/docs/json_api/v1/objects/list
So we would need to add a parameter to the bucket.list_blobs method (say pattern=None) that filters objects by applying the pattern to each object's name. Is this the optimal way to do it?
Any thoughts? @lukesneeringer @dhermes
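A minimal sketch of what such a convenience wrapper might look like (the function name and signature here are hypothetical, not part of the library):

```python
import fnmatch


def list_blobs_matching(bucket, pattern, prefix=None):
    """Yield blobs whose names match a glob pattern.

    The optional prefix narrows the listing server-side; the glob
    pattern is then applied client-side with fnmatch.
    """
    for blob in bucket.list_blobs(prefix=prefix):
        if fnmatch.fnmatch(blob.name, pattern):
            yield blob
```

Callers could then write list_blobs_matching(bucket, '2017/*.csv', prefix='2017/'), keeping the server-side listing as narrow as possible before the client-side match.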
@sagarrakshe As you note, the back-end doesn't provide such access, so what we are discussing is really a convenience wrapper for application code which would otherwise be something like:
import fnmatch

for blob in bucket.list_blobs():
    if fnmatch.fnmatch(blob.name, "*something"):
        do_something_with(blob)
Nice. So can we close this issue? @tseaver
I'll close it, as there isn't much we can do to improve on application-level processing.
The fnmatch lib works, but the filtering process is very slow. I don't know how gsutil handles the problem so effectively.
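As I understand it, gsutil narrows the server-side listing by using the literal part of the pattern before the first wildcard as the prefix, then matching client-side. A sketch of that idea (the helper name is made up):

```python
import re


def glob_prefix(pattern):
    """Return the literal part of a glob pattern before the first
    wildcard character (*, ?, or [), usable as a list_blobs prefix."""
    match = re.search(r"[*?\[]", pattern)
    return pattern[: match.start()] if match else pattern


# e.g. glob_prefix("2017/*.csv") == "2017/", so only objects under
# "2017/" are listed before the client-side glob match is applied.
```

Passing that prefix to bucket.list_blobs(prefix=...) should cut the number of objects the client has to filter, which is where most of the slowness comes from.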
At any point will this feature request be re-opened to allow wildcards within the list_blobs method?
For example:
bucket.list_blobs(prefix='2019*.csv')
@ViRaL95
At any point will this feature request be re-opened to allow wildcards within the list_blobs method?
Because the back-end doesn't provide support for that kind of matching, we decided that it was not worth the effort, given how simple it is to do the matching in the application (as my example above illustrates).
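For the 2019*.csv case, the application-level matching can also use fnmatch.filter directly on the listed names (the sample names below are made up for illustration):

```python
import fnmatch

# In real code these would come from
# [b.name for b in bucket.list_blobs(prefix="2019")].
names = ["2019-01.csv", "2019-02.csv", "2019-notes.txt", "2018-12.csv"]

# fnmatch.filter applies the glob to every name and keeps the matches.
matches = fnmatch.filter(names, "2019*.csv")
# matches == ["2019-01.csv", "2019-02.csv"]
```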