In order to navigate (potentially broad) directory hierarchies in buckets, it would be useful to be able to list the sub-directories in a bucket at a given prefix. Happy to attempt to implement such a method if someone points me in the correct direction. Seems like perhaps a new Iterator?
@samskillman How about
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(max_results=..., prefix=..., delimiter=...)
I think I wasn't completely clear in the description above. Let's say I have a bucket structure like this:
gs://root_bucket/
A/blob1
A/blob2
B/blob3
B/blob4
Is there a way to use the list_blobs (or anything else) to return the first level of subdirectory names ["gs://root_bucket/A/", "gs://root_bucket/B/"] rather than all the blobs? This would have the same behavior as gsutil ls gs://root_bucket. Perhaps I don't understand how delimiter and prefix can be used to accomplish this.
Being more specific, shouldn't this work:
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(delimiter='/')
to get the top level?
And to list subdirectories inside A:
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(delimiter='/', prefix='A/')
?
@jgeewax That is what I had tried originally, but I get empty lists in both cases. From what I can tell in test_bucket.py on line 291, the test actually asserts that the list of blobs is [], so maybe this is an actual bug?
Huh, but I can do:
iterator = bucket.list_blobs(delimiter="/")
response = iterator.get_next_page_response()
blobs = list(iterator.get_items_from_response(response))
dirs = iterator.prefixes
(Got that hint from regression/storage.py)
The docs are a bit clarifying: https://cloud.google.com/storage/docs/json_api/v1/objects/list
If you want "just the stuff immediately in the directory A/", this translates to "Give me the stuff that starts with A/ but doesn't have any other / characters in the non-prefix part of the name, which you do with:
from gcloud import storage
bucket = storage.Bucket(...)
blob_iter = bucket.list_blobs(delimiter='/', prefix='A/')
This will return all stuff that _starts_ with A/; however, in the rest of the name there can't be another /. So A/foo.txt would come back in the list, but A/sub/foo.txt would _not_ come back.
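Concretely, a small sketch following the example layout above, plus a hypothetical A/sub/foo.txt to show the difference (whether iterator.prefixes covers more than the last page fetched is exactly what gets discussed further down):

from gcloud import storage
bucket = storage.Bucket(...)

# List only the things "directly inside" A/.
iterator = bucket.list_blobs(prefix='A/', delimiter='/')
for blob in iterator:
    print(blob.name)        # A/blob1, A/blob2 -- but not A/sub/foo.txt
print(iterator.prefixes)    # the sub-"directories", e.g. A/sub/, once iterated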
@jgeewax that would list only the files, right? Then to get the directories I'd have to do what I wrote up above, is that right? How about a bucket.list_prefixes method like this:
def list_prefixes(self, prefix=None):
    iterator = self.list_blobs(delimiter='/', prefix=prefix)
    list(iterator)  # Necessary to populate iterator.prefixes
    for p in iterator.prefixes:
        yield p
I think, effectively, what that's doing is going through the list of blobs in the bucket, and collecting all the prefixes. So yes, that should work -- probably just as well as collecting the prefixes from the list of blobs yourself.
If we were going to add a list_prefixes method, it would be tricky. I can't find a way to get "just the top-level list of directories"...
@Capstan : What would I do to get the list of all blobs that are "directories" in the root of the bucket?
I thought prefix='/', delimiter='/' would work, but it seems to (true to form) just return the non-directory items (i.e., "give me everything starting with / that has no other /s in the name").
What I want is "give me everything starting with / that has exactly one more / in the name" (I think...)
@samskillman @jgeewax The list(iterator) will page through results and resets prefixes every time a new HTTP request is issued. This means that the value of iterator.prefixes may not be comprehensive for the entire result set, just for the last request issued. (UPDATE: That's not to say we can't make self.prefixes a set, it just currently isn't implemented that way.)
@jgeewax @dhermes has it right. FWIW, we don't enforce a preceding '/' at all, and many objects don't start with that character. If you haven't iterated over the entire list, you won't have the set of top-level directory-likes.
Given that the separator character is optional, and user-defined, we can't really provide a lot of help for this use case in the library. As @dhermes points out, we could change the iterator such that its prefixes attribute is cumulative across all requested pages: that seems like a reasonable choice.
If we made that change, I imagine that the most reasonable way to accomplish the goal is to write a helper function which "piped the items to /dev/null" and then extracted the relevant prefixes. E.g.:
def extract_prefixes(bucket, prefix=None, delimiter='/'):
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    map(lambda x: None, iterator)  # consume the iterator (Python 2 map is eager)
    return iterator.prefixes
:+1: to that idea, I don't think it belongs in gcloud-python though. However yes, I can see the logic for making prefixes cumulative.
@tseaver I think it'd be worthwhile to at least have a protected attribute on the Iterator called _current_prefixes just to make the distinction from the set() that is self.prefixes. Worth making it public?
@samskillman I am going to implement the cumulative prefixes for the iterator, but likely nothing else.
Do you have an issue with us closing out this issue after that?
This is a very basic operation in other cloud providers, e.g. folders = bucket.list("", "/"). Using Google Cloud it is too complex: you need a function with several lines of code:
def extract_prefixes(bucket, prefix=None, delimiter='/'):
    """Function Option A."""
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    response = iterator.get_next_page_response()
    blobs = list(iterator.get_items_from_response(response))
    dirs = iterator.prefixes
    return dirs
def list_prefixes(self, prefix=None):
    """Function Option B."""
    iterator = self.list_blobs(delimiter='/', prefix=prefix)
    list(iterator)  # Necessary to populate iterator.prefixes
    for p in iterator.prefixes:
        yield p
We understand that you don't want to upset the old users who are used to this, but to new users this looks like a bad-quality API.
Other Google APIs outside of gcloud-python handle delimiter='/' better; an example: https://cloud.google.com/storage/docs/json_api/v1/objects/list#try-it
@dhermes @tseaver
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/storage/google/cloud/storage/bucket.py#L331
The prefixes attribute is set to be empty. How can I extract the prefixes?
Nvm. I found how to use it by searching in your test files.
I have to say I am disappointed with the library. It is far from decent quality.
We've now seen a couple different people agreeing that this is a shortcoming against other libraries. Maybe it's time to explore fixing this? @dhermes
As iterator.get_next_page_response is hidden nowadays, you can use the following code:
def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    response = iterator._get_next_page_response()
    return response['prefixes']
@JorritPosthuma Thanks! To make sure I get all the prefixes, even for large buckets, and using the "public" API, I wrote the following:
def list_gcs_directories(bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print page, page.prefixes
        prefixes.update(page.prefixes)
    return prefixes
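For what it's worth, a usage sketch of the helper above (the bucket name is just a placeholder, assuming the layout from the top of the thread):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('root_bucket')        # placeholder name
dirs = list_gcs_directories(bucket, prefix='')   # top level
# with the example layout this should yield {'A/', 'B/'}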
Feature Requests will now be tracked in our Project Feature Request. Please feel free to continue the discussions here.
@evanj I like your code, but I have a strong feeling that it is many times slower than gsutil ls, especially for buckets with a lot of files deep in the tree. Are there any options to optimize the Python code, or is it better to make a system call to gsutil ls to speed up the listing of sub-folders?
It should be making very similar API calls. I suspect that if you know the bucket is large, you could attempt to list files using multiple API calls in parallel, by splitting the file name space. I've never had to do such a thing. Good luck!
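A rough sketch of that idea, assuming the modern google-cloud-storage client. The shard characters and worker count here are arbitrary choices for illustration, and any object names that don't start with one of the shard characters would be missed entirely, so treat this as a sketch rather than a general-purpose solution:

from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

def prefixes_for_shard(client, bucket_name, shard):
    # One listing per shard of the name space (everything starting with `shard`).
    iterator = client.list_blobs(bucket_name, prefix=shard, delimiter='/')
    found = set()
    for page in iterator.pages:
        found.update(page.prefixes)
    return found

def list_prefixes_parallel(bucket_name, shards='0123456789abcdefghijklmnopqrstuvwxyz'):
    client = storage.Client()
    prefixes = set()
    with ThreadPoolExecutor(max_workers=8) as pool:
        for shard_prefixes in pool.map(
                lambda s: prefixes_for_shard(client, bucket_name, s), shards):
            prefixes.update(shard_prefixes)
    return prefixes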
Yeah, I'm trying to write a web interface so that some images can be hand-checked. I have ~3500 groups (subfolders) and each has 10,000 images (so 35,000,000 blobs). I need to select a random sample from each group, as it would be next to impossible to hand-check them all. I can't loop over all the blobs just to get the prefixes _every page load_. Groups are added and removed constantly, hence I'd like the bucket to be the single source of truth. I guess GCS is not great for BIG DATA :(
@whillas I'm doing something similar; but I use Firestore as the "directory structure" if you will, and only keep the actual image assets in GCS. The firestore queries can be made to be very fast, then you have the GCS URLs right there.
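Something like this, as a minimal sketch; the 'groups' collection and field names are made up for illustration:

from google.cloud import firestore

db = firestore.Client()

def register_group(group_name, gcs_prefix):
    # One document per "folder"; written when a group is created, deleted when it is removed.
    db.collection('groups').document(group_name).set({'gcs_prefix': gcs_prefix})

def list_groups():
    # Fast compared to paging through millions of blobs just to recover the prefixes.
    return [doc.id for doc in db.collection('groups').stream()]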
The code from @evanj works wonders! Here is a version for Python 3, using the client directly since bucket.list_blobs is deprecated:
def list_gcs_directories(client, bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = client.list_blobs(bucket, prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
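Usage sketch ('my-bucket' and the 'images/' prefix are placeholders):

from google.cloud import storage

client = storage.Client()
subdirs = list_gcs_directories(client, 'my-bucket', prefix='images/')
# e.g. {'images/group_0001/', 'images/group_0002/', ...}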
I agree with comments above. Not being able to list top level directories and files for a given prefix is really annoying. This is not big data at all :).
We are currently stuck because we cannot properly list files from Google Storage.
And the best part: since I assume both gsutil and the Console are using an API, why don't you expose that functionality?
Plus, if you propose indexing files in some database, I think that is a very bad idea and design. You have to keep the DB and the filesystem synchronized, which always leads to errors over time and with high volumes.