In order to navigate (potentially broad) directory hierarchies in buckets, it would be useful to be able to list the sub-directories in a bucket at a given prefix. Happy to attempt to implement such a method if someone points me in the correct direction. Seems like perhaps a new Iterator?
@samskillman How about
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(max_results=..., prefix=..., delimiter=...)
I think I wasn't completely clear in the description above. Let's say I have a bucket structure like this:
gs://root_bucket/
A/blob1
A/blob2
B/blob3
B/blob4
Is there a way to use the list_blobs (or anything else) to return the first level of subdirectory names ["gs://root_bucket/A/", "gs://root_bucket/B/"] rather than all the blobs? This would have the same behavior as gsutil ls gs://root_bucket. Perhaps I don't understand how delimiter and prefix can be used to accomplish this.
Being more specific, shouldn't this work:
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(delimiter='/')
to get the top level?
And to list subdirectories inside A:
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(delimiter='/', prefix='A/')
?
@jgeewax That is what I had tried originally, but I get empty lists in both cases. From what I can tell in test_bucket.py on line 291, the test actually asserts that the list of blobs is [], so maybe this is an actual bug?
Huh, but I can do:
iterator = bucket.list_blobs(delimiter="/")
response = iterator.get_next_page_response()
blobs = list(iterator.get_items_from_response(response))
dirs = iterator.prefixes
(Got that hint from regression/storage.py)
The docs are a bit clarifying: https://cloud.google.com/storage/docs/json_api/v1/objects/list
If you want "just the stuff immediately in the directory A/", this translates to "Give me the stuff that starts with A/ but doesn't have any other / characters in the non-prefix part of the name, which you do with:
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(delimiter='/', prefix='A/')
This will return all stuff that _starts_ with A/; however, in the rest of the name there can't be another /. So A/foo.txt would come back in the list, but A/sub/foo.txt would _not_ come back.
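Concretely, a small sketch following the example layout above, plus a hypothetical A/sub/foo.txt to show the difference (whether iterator.prefixes covers more than the last page fetched is exactly what gets discussed further down):

from gcloud import storage
bucket = storage.Bucket(...)

# List only the things "directly inside" A/.
iterator = bucket.list_blobs(prefix='A/', delimiter='/')
for blob in iterator:
    print(blob.name)        # A/blob1, A/blob2 -- but not A/sub/foo.txt
print(iterator.prefixes)    # the sub-"directories", e.g. A/sub/, once iterated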
@jgeewax that would list only the files, right? Then to get the directories I'd have to do what I wrote up above, is that right? How about a bucket.list_prefixes method like this:
def list_prefixes(self, prefix=None):
    iterator = self.list_blobs(delimiter='/', prefix=prefix)
    list(iterator)  # Necessary to populate iterator.prefixes
    for p in iterator.prefixes:
        yield p
I think, effectively, what that's doing is going through the list of blobs in the bucket, and collecting all the prefixes. So yes, that should work -- probably just as well as collecting the prefixes from the list of blobs yourself.
If we were going to add a list_prefixes method, it would be tricky. I can't find a way to get "just the top-level list of directories"...
@Capstan : What would I do to get the list of all blobs that are "directories" in the root of the bucket?
I thought prefix='/', delimiter='/' would work, but it seems to (true to form) just return the non-directory items (i.e., "give me everything starting with / that has no other /s in the name").
What I want is "give me everything starting with / that has exactly one more / in the name" (I think...)
@samskillman @jgeewax The list(iterator) will page through results and resets prefixes every time a new HTTP request is issued. This means that the value of iterator.prefixes may not be comprehensive for the entire result set, just for the last request issued. (UPDATE: That's not to say we can't make self.prefixes a set, it just currently isn't implemented that way.)
@jgeewax @dhermes has it right. FWIW, we don't enforce a preceding '/' at all, and many objects don't start with that character. If you haven't iterated over the entire list, you won't have the set of top-level directory-likes.
Given that the separator character is optional, and user-defined, we can't really provide a lot of help for this use case in the library. As @dhermes points out, we could change the iterator such that its prefixes attribute is cumulative across all requested pages: that seems like a reasonable choice.
If we made that change, I imagine that the most reasonable way to accomplish the goal is to write a helper function which "piped the items to /dev/null" and then extracted the relevant prefixes. E.g.:
def extract_prefixes(bucket, prefix=None, delimiter='/'):
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    map(lambda x: None, iterator)  # consume the iterator (Python 2 map is eager)
    return iterator.prefixes
:+1: to that idea, I don't think it belongs in gcloud-python though. However yes, I can see the logic for making prefixes cumulative.
@tseaver I think it'd be worthwhile to at least have a protected attribute on the Iterator called _current_prefixes just to make the distinction from the set() that is self.prefixes. Worth making it public?
@samskillman I am going to implement the cumulative prefixes for the iterator, but likely nothing else.
Do you have an issue with us closing out this issue after that?
This is a very basic operation in other cloud providers, e.g. folders = bucket.list("", "/"). Using Google Cloud it is too complex: you need a function with several lines of code:
def extract_prefixes(bucket, prefix=None, delimiter='/'):
    """Function Option A."""
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    response = iterator.get_next_page_response()
    blobs = list(iterator.get_items_from_response(response))
    dirs = iterator.prefixes
    return dirs
def list_prefixes(self, prefix=None):
    """Function Option B."""
    iterator = self.list_blobs(delimiter='/', prefix=prefix)
    list(iterator)  # Necessary to populate iterator.prefixes
    for p in iterator.prefixes:
        yield p
We understand that you don't want to upset the old users who are used to this, but to new users this looks like a bad-quality API.
Other Google APIs outside of gcloud-python handle delimiter='/' better; an example: https://cloud.google.com/storage/docs/json_api/v1/objects/list#try-it
@dhermes @tseaver
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/storage/google/cloud/storage/bucket.py#L331
The prefixes attribute is set to be empty. How can I extract the prefixes?
Nvm. I found how to use it by searching in your test files.
I have to say I am disappointed with the library. It is far from decent quality.
We've now seen a couple different people agreeing that this is a shortcoming against other libraries. Maybe it's time to explore fixing this? @dhermes
As iterator.get_next_page_response is hidden nowadays, you can use the following code:
def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    response = iterator._get_next_page_response()
    return response['prefixes']
@JorritPosthuma Thanks! To make sure I get all the prefixes, even for large buckets, and using the "public" API, I wrote the following:
def list_gcs_directories(bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print page, page.prefixes
        prefixes.update(page.prefixes)
    return prefixes
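For what it's worth, a usage sketch of the helper above (the bucket name is just a placeholder, assuming the layout from the top of the thread):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('root_bucket')        # placeholder name
dirs = list_gcs_directories(bucket, prefix='')   # top level
# with the example layout this should yield {'A/', 'B/'}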
Feature Requests will now be tracked in our Project Feature Request. Please feel free to continue the discussions here.
@evanj I like your code, but I have a strong feeling that it is many times slower than gsutil ls, especially for buckets with a lot of files deep in the tree. Are there any options to optimize the Python code, or is it better to make a system call to gsutil ls to speed up the listing of sub-folders?
It should be making very similar API calls. I suspect that if you know the bucket is large, you could attempt to list files using multiple API calls in parallel, by splitting the file name space. I've never had to do such a thing. Good luck!
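A rough sketch of that idea, assuming the modern google-cloud-storage client. The shard characters and worker count here are arbitrary choices for illustration, and any object names that don't start with one of the shard characters would be missed entirely, so treat this as a sketch rather than a general-purpose solution:

from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

def prefixes_for_shard(client, bucket_name, shard):
    # One listing per shard of the name space (everything starting with `shard`).
    iterator = client.list_blobs(bucket_name, prefix=shard, delimiter='/')
    found = set()
    for page in iterator.pages:
        found.update(page.prefixes)
    return found

def list_prefixes_parallel(bucket_name, shards='0123456789abcdefghijklmnopqrstuvwxyz'):
    client = storage.Client()
    prefixes = set()
    with ThreadPoolExecutor(max_workers=8) as pool:
        for shard_prefixes in pool.map(
                lambda s: prefixes_for_shard(client, bucket_name, s), shards):
            prefixes.update(shard_prefixes)
    return prefixes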
Yeah, I'm trying to write a web interface so that some images can be hand-checked. I have ~3500 groups (subfolders) and each has 10,000 images (so 35,000,000 blobs). I need to select a random sample from each group, as it would be next to impossible to hand-check them all. I can't loop over all the blobs just to get the prefixes _every page load_. Groups are added and removed constantly, hence I'd like the bucket to be the single source of truth. I guess GCS is not great for BIG DATA :(
@whillas I'm doing something similar; but I use Firestore as the "directory structure" if you will, and only keep the actual image assets in GCS. The firestore queries can be made to be very fast, then you have the GCS URLs right there.
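Something like this, as a minimal sketch; the 'groups' collection and field names are made up for illustration:

from google.cloud import firestore

db = firestore.Client()

def register_group(group_name, gcs_prefix):
    # One document per "folder"; written when a group is created, deleted when it is removed.
    db.collection('groups').document(group_name).set({'gcs_prefix': gcs_prefix})

def list_groups():
    # Fast compared to paging through millions of blobs just to recover the prefixes.
    return [doc.id for doc in db.collection('groups').stream()]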
The code from @evanj works wonders! Here is a version for Python 3, using the client directly since bucket.list_blobs is deprecated:
def list_gcs_directories(client, bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = client.list_blobs(bucket, prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
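Usage sketch ('my-bucket' and the 'images/' prefix are placeholders):

from google.cloud import storage

client = storage.Client()
subdirs = list_gcs_directories(client, 'my-bucket', prefix='images/')
# e.g. {'images/group_0001/', 'images/group_0002/', ...}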
I agree with comments above. Not being able to list top level directories and files for a given prefix is really annoying. This is not big data at all :).
We are currently stuck because we cannot properly list files from Google Storage.
And the best part: since I assume both gsutil and the Console are using an API, why don't you expose that functionality?
Plus, if you propose indexing files in some database, I think that is a very bad idea and design. You have to keep the DB and the filesystem synchronized, which always leads to errors over time and with high volumes.