Make gsutil work with explicit directories

Created on 11 Oct 2016 · 18 comments · Source: GoogleCloudPlatform/gsutil

gcsfuse requires explicit directories by default. So the bucket has these objects:

dir/
dir/a.txt

When I do gsutil cp, mv, or rsync, I need to copy/move the explicit directory too. Reason: gsutil is faster than gcsfuse, and gcsfuse cannot mv (rename) directories. So when I do gsutil mv gs://bucket/dir gs://bucket/newdir, I get the objects:

dir/
newdir/a.txt

but I would expect

newdir/
newdir/a.txt

Similarly with gsutil cp and rsync - the newdir/ explicit directory is not created.

_Note:_ as a workaround I have created a shell script, gsmkdirs.sh, to create explicit dirs.
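The script itself isn't included in this thread; a minimal sketch of the idea might look like the following, assuming the placeholder convention is a zero-byte object whose name ends in "/" and using only stock gsutil commands (the bucket name is hypothetical):

#!/bin/bash
# Hypothetical sketch: create a placeholder object for every "directory"
# prefix that appears in a bucket listing. Only the immediate parent of
# each object is created; deeper empty ancestors would need another pass.
bucket="gs://my-bucket"

# Flat-list all objects, reduce each name to its parent prefix, de-duplicate.
gsutil ls "$bucket/**" | sed 's|/[^/]*$|/|' | sort -u |
while read -r prefix; do
    # Skip the bucket root itself.
    [[ "$prefix" == "$bucket/" ]] && continue
    # gsutil cp reads from stdin when the source is "-"; upload an empty body.
    : | gsutil cp - "$prefix"
done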

All 18 comments

It's unlikely we'll add support for gsutil to create placeholder objects representing directories. gsutil tries to be compatible with these placeholder objects, but it does not create them because it complicates gsutil command semantics.

Well, maybe for the _local disk - bucket_ operations you are right. But I think for the _bucket - bucket_ operations the semantics are simple: treat the directory placeholder as a common file object. If it's in the source bucket, then mv/copy it to the destination bucket.

Changing the semantics based on whether the destination is local or a bucket seems confusing to me, because it's only going halfway to preserving the fiction. We can't preserve the fiction when we copy locally. If we decide to preserve it when copying in the cloud, then it would also make sense to create placeholder objects when we copy from local to cloud, which we've explicitly decided not to do.

As an example of the kind of complication that could arise with this semantic, imagine that you are copying with a customer-supplied encryption key. Should directory placeholder objects then be guarded by the encryption key and inaccessible without it? I think this is hard to reason about.

I do not know how gsutil works with customer-supplied encryption keys, and probably there are other use cases.

Just from my novice point of view: the behaviour of gsutil is confusing to me now, since it does not mirror the source bucket when doing bucket-to-bucket operations. E.g. when I make a backup of a bucket directory to a bucket (another or the same one) and then restore it, I have incomplete information: the explicit dirs are missing. I.e. I would expect the directory placeholder objects to behave as files with zero size.
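For instance (bucket names hypothetical), a round trip like the following loses the placeholders, since - as discussed below - gsutil skips them on bucket-to-bucket copies:

# Back up a prefix and restore it again; per this thread, the zero-byte
# placeholder objects ending in "/" are skipped, so they are gone afterwards.
gsutil -m rsync -r gs://prod-bucket/dir gs://backup-bucket/dir
gsutil -m rsync -r gs://backup-bucket/dir gs://prod-bucket/dir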

IMO it's acceptable if the _local disk - bucket_ operations do not create the explicit dirs. The bucket and the local disk are different kinds of storage; other info is also not preserved (contentType, ACLs, ...). (Although some option, like -E, for creating explicit dirs would be nice, too.)

Although I do not know how customer-supplied encryption works in gsutil, I would guess that the explicit dirs should be treated like other files, i.e. encrypted.

I agree with your point of view that the existing semantics cause confusion, in particular when you are using the browser UI and gsutil interchangeably. What I'm trying to point out is that I think it would be difficult to remedy this confusion completely; instead we'd trade for another potentially confusing set of semantics.

That being said, when you "back up" bucket to bucket and gsutil skips these placeholders, wouldn't restoring from the bucket to local with gsutil work fine even without the placeholder directories? Are you concerned that there are empty directories on your local filesystem that have important meaning?

Our particular reason for explicit directories is gcsfuse (see the first post), not copies on a local fs. gcsfuse cannot access the files without these explicit dirs. (It has the --implicit-dirs switch, but this gcsfuse mode is problematic.) In particular, gcsfuse cannot move/rename directories. So when I gsutil mv a directory, I can no longer see it via gcsfuse.

Note: I know gcsfuse is not and never will be production-ready, but it greatly speeds up and simplifies admin and maintenance tasks, e.g. using the Linux find tool. So we are trying to maintain explicit dirs in our bucket for occasional gcsfuse access.
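For context, a minimal sketch of the two gcsfuse modes discussed here; the bucket name and mount point are placeholders:

# Default mode: only explicit "dir/" placeholder objects appear as dirs.
gcsfuse my-bucket /mnt/my-bucket

# Implicit mode: directories are inferred from object names, at the cost
# of extra listing requests (the performance problem mentioned in this thread).
gcsfuse --implicit-dirs my-bucket /mnt/my-bucket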

I'm still not in favor of making gsutil treat placeholder objects for bucket-to-bucket operations differently than for local-to-bucket or bucket-to-local operations, because causing these objects to have different meaning depending on the destination:

  1. is a confusing semantic that is hard to explain
  2. adds further complexity to the already difficult task of honoring placeholder folders in gsutil

That said, I also think your desire to safely interoperate with gcsfuse is a reasonable one.

What would you think about a command (maybe in gsutil, but it may make more sense elsewhere) that takes a bucket as input and creates placeholder objects for each "directory" that it finds? This wouldn't allow you to preserve empty directories, but presumably those are not of high importance.

Yep, a tool to create explicit dirs would be enough for local-to-bucket operations. I have already created such a script, gsmkdirs.sh, but it's not optimal: it requires gcsfuse. It would be perfect if it were part of gsutil (something like gsutil mkdir).
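Until something like gsutil mkdir exists, a single placeholder can be created with stock gsutil by uploading an empty object from stdin; the trailing slash is the assumed placeholder convention:

# Create an explicit "directory": a zero-byte object whose name ends in "/".
: | gsutil cp - gs://my-bucket/newdir/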

Thanks - leaving this issue open to consider post-hoc population of placeholder directory objects (and potentially gsutil mkdir to create an empty directory placeholder).

Not that it will be implemented, but this really is a missing feature, because anything copied to GCS from a local filesystem using gsutil cannot actually be seen using gcsfuse. We can set the --implicit-dirs option for gcsfuse, but this results in unacceptable performance for us. If only there were a mode in gsutil that honored and maintained these directory placeholders for recursive cp and rsync (an after-the-fact pass is still useful, but less so)...

The weirdness of the current semantics is also noticeable together with the Hadoop connector. While the directory structure is preserved and fully understood by the Hadoop connector, gsutil is still the better choice for GCS-to-GCS copy operations. While it is the only (?) out-of-the-box way to do metadata-level copies, it becomes virtually unusable when the directory structure must be preserved.

@thobrla So would you think it acceptable to add an explicit flag like -E/--preserve-explicit-dirs for bucket-to-bucket operations?

I think adding --preserve-explicit-dirs would need to be supported bidirectionally. This is complicated because what it means for a GCS object to be a "dir" is not clear. What is the expected behavior if I store data in an object ending with / and then try to copy it locally? There are other edge cases as well.

I'm open to suggestions for resolving this, but without a cohesive design I think it's more appropriate to add the ability to populate placeholder dir objects post-hoc based on an object listing.

Just a quick note: I'm still open to listening to potential solutions, but even if we were able to agree on an approach to reduce some of the confusion around pseudo-directory semantics, it's unlikely that we (the gsutil team) would be able to implement that solution in the near future. Unfortunately, we've had quite a few high-priority work items come up in recent months, and we're having to deprioritize and put other items on hold as a result. (My intention isn't to put a damper on the conversation, but to set expectations on when we might see this fixed if we get to an agreeable solution.)

+1 on something built into gsutil. I can do it from the web console; I wonder why the CLI doesn't support it.

Very confusing. I just followed the documentation at https://cloud.google.com/storage/docs/gsutil/commands/mv, and gsutil mv gs://my_bucket/olddir gs://my_bucket/newdir won't rename olddir to newdir, let alone move a folder into another folder.
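A workaround suggested by the behavior reported in this thread (object names taken from the example above): move the contents, then patch up the placeholders by hand.

# gsutil mv moves the contents but leaves the old placeholder behind and
# does not create a new one, so fix both up afterwards.
gsutil mv gs://my_bucket/olddir gs://my_bucket/newdir
: | gsutil cp - gs://my_bucket/newdir/
gsutil rm gs://my_bucket/olddir/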

Use this simple code to mirror the src directory structure into the dest folder:

# Recursively recreate the directory tree of srcD under dstD
# (directories only; no files are copied).
mimicDirectories()
{
    local srcD="$1"
    local dstD="$2"
    local p=""
    local file=""

    # Nothing to do if the source is not a directory.
    if [[ ! -d "$srcD" ]]; then
        return
    fi

    mkdir -p "$dstD"

    # Immediate subdirectories of srcD (note: word-splitting breaks on
    # paths that contain spaces).
    local arr=($(find "$srcD" -maxdepth 1 -mindepth 1 -type d))

    for ((p = 0; p < ${#arr[*]}; p++))
    do
        file=$(basename "${arr[p]}")
        if mkdir -p "$dstD/$file"; then
            # Recurse into the subdirectory.
            mimicDirectories "${arr[p]}" "$dstD/$file"
        fi
    done
}
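A hypothetical usage, assuming a gcsfuse mount at /mnt/my-bucket: recreate the local tree's directories through the mount so they become explicit placeholder objects, then copy the files with gsutil.

# Create the explicit dirs via the gcsfuse mount, then let gsutil copy files.
mimicDirectories "./data" "/mnt/my-bucket/data"
gsutil -m rsync -r ./data gs://my-bucket/data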

Any update on this? I am facing the same issue, and enabling --implicit-dirs comes with a performance tradeoff which I cannot afford.

@vishivish18 Just use the mimicDirectories() shell function above before using gsutil (https://github.com/GoogleCloudPlatform/gsutil/issues/388#issuecomment-387357551). It is a recursive function which duplicates the src directory structure to dst.
