Google-cloud-python: Tell if blob is a directory?

Created on 9 Nov 2015  ·  12 Comments  ·  Source: googleapis/google-cloud-python

How can you tell if a blob is a directory or file? I know with the old cloud storage API you could just do stat.is_dir, but I don't see any equivalent functionality here?

question storage p2

All 12 comments

There is no notion of a "directory": a blob/object is always a file. The back-end API does not permit creating "folder"/"directory" objects within a bucket: you can emulate (somewhat) a hierarchy by using a separator in object names (see the "Object names" section of https://cloud.google.com/storage/docs/concepts-techniques#concepts).

I sorely beg to differ that there is no notion of a directory. The Google developer console itself allows you to create "folders", and the app-engine-gcs library, which is also maintained by Google, provides directory emulation.

I wanted to switch over to this library because it seems to be what the documentation is pushing and I prefer the way it handles files compared to the gcs-client library. It would be nice if this library incorporated some directory listing / stat abilities of the old library.

My end goal is to use gcloud-python to write a custom GoogleStorage class for Django. This was easy to do with the old library because of the .is_dir function.

In fact, many Python standard library functions, and functions in many third-party libraries, rely on the notion of directories. It would make sense to me for that ability to be incorporated into this library.

What is the end goal for the Google Cloud Storage API in Python? Is the old library going to be maintained? If so, should I just continue using that one?

From the perspective of the API

The UI "does" folders in the sense that it splits filenames on / characters (and has the concept of "creating a folder"), but it's really "folder fiction" in the sense that the API itself doesn't actually have any concept of a "folder".

The definition of is_dir would be:

def is_dir(bucket, folder_name):
    # A "folder" exists if at least one object's name starts with it.
    prefix = folder_name.rstrip('/') + '/'
    for _ in bucket.list_blobs(prefix=prefix):
        return True
    return False

That is... if there is an object called "/home/jj/file.txt", then "/home/jj" is a "folder". But you could create an object named "/home/jj/file.txt" without ever creating the "folders" called "/home" or "/home/jj" -- which doesn't quite line up with the way Unix thinks of "directories" (or "folders").
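
To make that concrete, here is a minimal sketch (the bucket and object names are hypothetical) showing that uploading a blob whose name contains slashes never creates any intermediate "folder" objects:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')  # hypothetical bucket

# Creating this object does NOT create "home/" or "home/jj/" objects.
bucket.blob('home/jj/file.txt').upload_from_string('hello')

print(bucket.get_blob('home/jj/file.txt') is not None)  # True: the file exists
print(bucket.get_blob('home/jj/'))                      # None: no "folder" object
print(bucket.get_blob('home/'))                         # None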

With a bit of magic

This rule-based approach stands in contrast to the way many tools work: they create placeholder objects to mark the existence of folders (such as "dir_$folder$").

In other words, we may just be able to check for the existence of this -- which would correspond to "the UI created this folder".
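
A sketch of that check might look like the following; it assumes the placeholder is an object whose name is the folder path terminated by a / (which is what the console creates), and the marker convention can vary by tool:

def has_folder_placeholder(bucket, folder_name):
    # True if a marker object like "myfolder/" exists, e.g. one created by
    # the developer console. Other tools may use different markers, such as
    # a "dir_$folder$" suffix, so adjust the name accordingly.
    name = folder_name.rstrip('/') + '/'
    return bucket.get_blob(name) is not None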

An app _can_ emulate folders by creating blobs with embedded separators (e.g., /) in their names, and searching for blobs "within" the emulated folder using the prefix argument to Bucket.list_blobs.

However, just like the little man upon the stair, there isn't any "folder" object in the back-end whose name is just the prefix, terminated by the /.

It would be nice if this library incorporated some directory listing / stat abilities of the old library

I think we can probably help with this but need a bit more detail. The basic gist here is "GCS technically does _not_ support folders, and a library that did that was 'faking it' for you."

The "faking" is effectively:

  • A path is a folder if it ends in /
  • You can ls anything (worst case you'll get nothing back)
  • ls'ing a directory is basically a prefix search (give me everything that starts with "/foo/bar/")
  • ls'ing via prefix actually includes _subdirectories_ as well! (i.e., /foo/bar/baz/file.txt will be in the ls for /foo/bar)

    • If you want _just_ the immediate directory, you need to exclude anything containing more slashes (I forget the exact spelling, but I think the delimiter argument does this; see the sketch below)
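
Purely as an illustration (this is not a helper that exists in the library), here is a minimal sketch of that prefix/delimiter-style listing with Bucket.list_blobs; the bucket and prefix names are made up, and the iterator's prefixes attribute is only populated once the results have been consumed:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')  # hypothetical bucket

# Everything "under" foo/bar/, including objects in sub-"directories".
everything = list(bucket.list_blobs(prefix='foo/bar/'))

# With a delimiter, only the immediate "directory" contents come back as
# blobs; the sub-"directories" show up in iterator.prefixes afterwards.
iterator = bucket.list_blobs(prefix='foo/bar/', delimiter='/')
files = list(iterator)         # e.g. a blob named foo/bar/file.txt
subdirs = iterator.prefixes    # e.g. {'foo/bar/baz/'}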

So -- if you give us some more detail about what exactly we're shooting for here, maybe we can figure out what methods make sense and how to fake this folder stuff in a consistent way (it'd be bad if we did it a way that wasn't quite right).

@jgeewax thanks for considering this.

In my mind the most straightforward thing to shoot for here is to duplicate the is_dir property from the old GCSFileStat class onto the blob / object of gcloud-python.

I'll give you an example of how I would use it, so that you get an idea of the intent behind it.

Here is the listdir method of Django's default FileSystemStorage class, which is based on the abstract Storage class:

def listdir(self, path):
    path = self.path(path)
    directories, files = [], []
    for entry in os.listdir(path):
        if os.path.isdir(os.path.join(path, entry)):
            directories.append(entry)
        else:
            files.append(entry)
    return directories, files

And here is the same method from a custom GoogleStorage class, again based on the abstract Storage class, using the old library to emulate the same functionality:

def listdir(self, path):
    directories, files = [], []

    for stat in cloudstorage.listbucket(self.bucket + '/' + path, delimiter='/'):
        logger.debug(stat)
        if stat.is_dir:
            directories.append(stat.filename)
        else:
            files.append(stat.filename)

    return directories, files

Now, I think the .is_dir functionality assumes that if a name ends in a / then it is a directory. I think this would fit the majority of use cases. I can also imagine use cases where it would be nice to have a generator yield the directories so that you can recursively walk the "file system".
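
For comparison, here is a rough sketch of what the same listdir could look like on top of this library's Bucket.list_blobs with a delimiter; this is not an existing API, just an illustration, and it assumes self.bucket is a google.cloud.storage Bucket:

def listdir(self, path):
    # Hypothetical equivalent built on Bucket.list_blobs; `self.bucket` is
    # assumed to be a google.cloud.storage Bucket instance.
    prefix = path.rstrip('/') + '/' if path else ''
    directories, files = [], []

    iterator = self.bucket.list_blobs(prefix=prefix, delimiter='/')
    for blob in iterator:
        if blob.name == prefix:
            continue  # skip a console-created "folder placeholder" object
        files.append(blob.name[len(prefix):])
    for subdir in iterator.prefixes:
        directories.append(subdir[len(prefix):].rstrip('/'))

    return directories, files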

maybe we can figure out what methods make sense and how to fake this folder stuff in a consistent way (it'd be bad if we did it a way that wasn't quite right).

I agree it should be consistent. I'm also a proponent of Kenneth Reitz's philosophy that APIs should be simple and fit 95% of use cases, rather than being overly technical in order to fit 100% of them; the requests library vs. the Python standard library for HTTP requests is a case in point.

It makes a lot of sense to have this in the GCS client library. The concept of folders is important, as the information within a bucket is organized in a folder structure. Yes, the underlying implementation might lay it out as a flat model, but that doesn't help from an API consumer's point of view. Please consider raising its priority.

thanks!

I am going to close this as a duplicate of #920 (which I am keeping open). They are not identical but cover the same territory.

Not sure if this has already been discussed, but I do see that creating a "folder" from the developer console actually makes a difference, in that there is a blob object corresponding to the folder created.
For example, I start with a bucket with no files/objects and do the following:
gsutil cp file gs://my-bucket/dir1/
Now when I list recursively with a blank prefix, the following object is listed:
dir1/file1
However, when I use the Storage.BlobListOption.currentDirectory option (a non-recursive listing) with a blank prefix, I get:
dir1/ (The result is returned as a Blob, but when I try to explicitly call bucket.get("dir1/") I get a null.)
Once I use the "create folder" option from the console to create a directory dir1 (mind you, dir1/file1 already exists at this point) and list again, I get the following:
dir1/
dir1/file1
I haven't found any other way of creating such a folder outside of the console, including with gsutil.
I thought it was worth pointing this out here.

I just noticed the same thing as @debuggger, which was quite surprising to me. I wrote a Python application that downloads files from a bucket by listing the blobs under a prefix. In most cases, I get a list of blobs like this:

myprefix/file1.txt
myprefix/file2.txt

However, because a folder was created through the UI explicitly, I am getting this for one of the prefixes:

myotherprefix/
myotherprefix/file1.txt
myotherprefix/file2.txt

My application is trying to download each of the blobs and can't do it for myotherprefix. My understanding from the docs was that the folder structure is completely fictional and only created by blob objects; I understand now that you can have a blob at the directory level that has no content but causes the prefix/folder to exist, but this isn't what I was expecting.
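
One way to cope with that in a download loop (just a sketch, with hypothetical names) is to skip the zero-byte placeholder blobs whose names end in a /:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')  # hypothetical bucket

for blob in bucket.list_blobs(prefix='myotherprefix/'):
    if blob.name.endswith('/'):
        # A console-created "folder" placeholder, not a real file: skip it.
        continue
    blob.download_to_filename(blob.name.rsplit('/', 1)[-1])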

My God, yes. @debugger and @alexcwatt, I thought I was going insane. My partner and I were running the same code over different bucket instances and getting totally different results. This is it.

There is a difference between the concept of a directory structure (which exists in the mind of the user) and the reality of how the data is stored. When using the API, I do not care about the reality of the data storage. Delimited paths that collate groups of files together are directories, and I need API primitives that allow me to treat them as such, just as if they were right here on my local machine, without writing a bunch of helper functions to mediate between the way the backend storage works (which is a mere implementation detail) and the conceptual interface presented to me (which is de facto a file system).

It's disappointing to see this reality of the situation not being engaged with.
