Gsutil: How to exclude multiple directories with gsutil rsync?

Created on 6 May 2019 · 12 Comments · Source: GoogleCloudPlatform/gsutil

I have some subdirectories a, b, and c under directory "d".
How can I exclude all of them at once?

question

All 12 comments

Hi @zffocussss !

You can use the -x flag to exclude many directories or files using a regex pattern. There's some more info in this doc: https://cloud.google.com/storage/docs/gsutil/commands/rsync

Here are more examples from the doc linked above:

-x pattern

Causes files/objects matching pattern to be excluded, i.e., any matching files/objects will not be copied or deleted. Note that the pattern is a Python regular expression, not a wildcard (so, matching any string ending in "abc" would be specified using ".*abc$" rather than "*abc"). Note also that the exclude path is always relative (similar to Unix rsync or tar exclude options). For example, if you run the command:

    gsutil rsync -x "data./.*\.txt$" dir gs://my-bucket

it will skip the file dir/data1/a.txt.

You can use regex alternation to specify multiple exclusions, for example:

    gsutil rsync -x ".*\.txt$|.*\.jpg$" dir gs://my-bucket

NOTE: When using this on the Windows command line, use ^ as an escape character instead of \ and escape the | character.
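
As a rough illustration of the alternation idea, here is a small Python sketch that joins several patterns with `|` and checks a few sample relative paths. The helper name `combine_patterns` and the sample paths are made up for this example, and it assumes gsutil applies the pattern with `re.match` to the path relative to the source directory, as the doc excerpt above suggests.

```python
import re

def combine_patterns(patterns):
    """Join independent regexes into one alternation usable with -x."""
    # Wrap each pattern in a non-capturing group so "|" cannot leak
    # between neighbouring patterns.
    return "|".join("(?:%s)" % p for p in patterns)

# Hypothetical example values, not taken from a real bucket.
pattern = combine_patterns([r".*\.txt$", r".*\.jpg$"])
sample_paths = ["data1/a.txt", "img/photo.jpg", "notes/readme.md"]

compiled = re.compile(pattern)
excluded = [p for p in sample_paths if compiled.match(p)]
kept = [p for p in sample_paths if not compiled.match(p)]

print(pattern)
print("excluded:", excluded)
print("kept:", kept)
```

The printed pattern can then be passed as the argument to `-x`, quoted appropriately for your shell.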

Please let me know if that helps or if you have any other questions!

Updated the comment above with a few more details specific to your question. :)

Hi @catleeball, I tried it:
gsutil -d -x "a/|b/|c/" -r d gs://my-bucket
but it does not work. I checked my bucket in the GCP console, and a, b, and c are still there.
I think -x can only exclude files, not directories.

Hi @zffocussss ! It looks like the issue might be with your regex. Here's an example I just tested:

Given this local directory structure rsync-test

cball@cball:~$ tree rsync-test/
rsync-test/
├── dirA
│   └── bar.txt
├── dirB
│   └── baz.txt
├── dirC
│   ├── baq.txt
│   └── dirCA
│       └── bat.txt
└── foo.txt

Let's say we want to upload everything except dirA and dirCA. We can do that by writing a regex to say "check the path string for substring 'dirA' or substring 'dirCA'". Here's one way to do that:

cball@cball:~$ gsutil rsync -r -x '^.*dirA.*$|^.*dirCA.*$' rsync-test gs://rsync-test-cball
Building synchronization state...
Starting synchronization...
Copying file://rsync-test/dirB/baz.txt [Content-Type=text/plain]...
Copying file://rsync-test/dirC/baq.txt [Content-Type=text/plain]...
Copying file://rsync-test/foo.txt [Content-Type=text/plain]...
/ [3 files][    0.0 B/    0.0 B]
Operation completed over 3 objects.

Now let's check and make sure the bucket looks like we want it to:

cball@cball:~$ gsutil ls gs://rsync-test-cball
gs://rsync-test-cball/foo.txt
gs://rsync-test-cball/dirB/
gs://rsync-test-cball/dirC/
cball@cball:~$ gsutil ls gs://rsync-test-cball/dirB/
gs://rsync-test-cball/dirB/baz.txt
cball@cball:~$ gsutil ls gs://rsync-test-cball/dirC
gs://rsync-test-cball/dirC/baq.txt

If it's helpful to you in writing your regex, I've found https://regex101.com/ to be a handy website for testing regexes. You can mouse over each part of the regex and it tells you what it does. 🙂

Oh my god, thanks for your help. Now I know it is a Python regex. I was using PCRE and shell-style patterns.
You are right. I need to check my regex for gsutil.

By the way, how do you test this regex against paths on a Linux filesystem, given that they are paths and not plain strings?

Hi @zffocussss ! When gsutil rsync runs, it walks the directory tree of the source directory. If you include an exclusion pattern, each file / directory gets matched against your provided regex:

https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/commands/rsync.py#L745

If you open the Python REPL, you can test your regex with something like this:

cball@cball:~$ python
Python 3.7.3 (default, Apr 25 2019, 13:07:15) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile('^.*dirA.*$|^.*dirCA.*$')
>>> dirs = ['rsync-test/dirA', 'rsync-test/dirB', 'rsync-test/dirC', 'rsync-test/dirC/dirCA']
>>> for d in dirs:
...   if r.match(d):
...     print('Regex matches: ' + d)
...   else:
...     print('Regex does not match: ' + d)
... 
Regex matches: rsync-test/dirA
Regex does not match: rsync-test/dirB
Regex does not match: rsync-test/dirC
Regex matches: rsync-test/dirC/dirCA

Or if you're using the online regex tester, you can plug in different directories and see which ones match or don't. 🙂
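
To go one step beyond a hand-written list of strings, the sketch below walks a real directory tree and applies the exclusion regex to each file's path relative to the source directory. It is only an approximation of what gsutil does: the function name `simulate_exclude` is invented here, and it assumes the pattern is applied with `re.match` to forward-slash relative paths. The tree is recreated in a temporary directory so the example does not touch any real data.

```python
import os
import re
import tempfile

def simulate_exclude(source_dir, pattern):
    """Return (kept, excluded) lists of file paths relative to source_dir."""
    compiled = re.compile(pattern)
    kept, excluded = [], []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), source_dir)
            rel = rel.replace(os.sep, "/")  # compare with forward slashes
            (excluded if compiled.match(rel) else kept).append(rel)
    return sorted(kept), sorted(excluded)

# Recreate the rsync-test layout from this thread in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["foo.txt", "dirA/bar.txt", "dirB/baz.txt",
                "dirC/baq.txt", "dirC/dirCA/bat.txt"]:
        path = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    kept, excluded = simulate_exclude(tmp, r"^.*dirA.*$|^.*dirCA.*$")

print("kept:", kept)
print("excluded:", excluded)
```

With the pattern from this thread, the kept list should line up with the three files that were copied in the earlier rsync output.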

I hope that helps! Please let me know if you have any other questions @zffocussss !

It's also worth making use of the rsync command's -n flag to run in dry-run mode. This will let you see if you would have copied files you didn't intend to.

Smart thinking, @houglum ! 💡
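
If you script your syncs, the dry-run idea could look something like the following Python sketch. The helper names `build_rsync_command` and `run` are hypothetical, and the source and bucket names are just the ones used earlier in this thread. Only the flags already mentioned here (`-n`, `-r`, `-x`) are used, and `run` needs gsutil installed and on your PATH.

```python
import subprocess

def build_rsync_command(source, destination, exclude_pattern, dry_run=True):
    """Assemble a gsutil rsync argument list; nothing is executed here."""
    cmd = ["gsutil", "rsync", "-r", "-x", exclude_pattern]
    if dry_run:
        cmd.insert(2, "-n")  # -n: dry run, reports actions without copying
    return cmd + [source, destination]

def run(cmd):
    """Run the command and return its combined output (requires gsutil)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr

# Build (but do not run) a dry-run command for the example from this thread.
preview = build_rsync_command(
    "rsync-test", "gs://rsync-test-cball", r"^.*dirA.*$|^.*dirCA.*$")
print(preview)
```

Passing the arguments as a list avoids shell quoting problems with `|` and `$` in the regex. Once the output of `run(preview)` looks right, the same call with `dry_run=False` builds the real command.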

Okay, I see. Thanks.

Nice advice about the -n flag. I can use it to see what will happen.

r = re.compile('^.*/dirA/.*$|^.*/dirA$|^dirA')
dirs = ['rsync-test/dirA', 'rsync-test/dirB', 'rsync-test/dirC', 'rsync-test/dirC/dirCA', 'a/dirAk/b', 'a/dirA/b', 'dirA/A/B/C']
In [18]: for d in dirs:
    ...:     if r.match(d):
    ...:         print('Regex matches: ' + d)
    ...:     else:
    ...:         print('Regex does not match: ' + d)
    ...:

Regex matches: rsync-test/dirA
Regex does not match: rsync-test/dirB
Regex does not match: rsync-test/dirC
Regex does not match: rsync-test/dirC/dirCA
Regex does not match: a/dirAk/b
Regex matches: a/dirA/b
Regex matches: dirA/A/B/C

I think I have found what I want. I need to take "/" into account, because the target is a subdirectory.
I also suggest that the GCP gsutil team provide more regex examples, since it is a little complex but commonly needed.
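
For anyone landing here later, one way to handle the "/" consideration in general is to build the pattern from directory names so that each name only matches as a whole path segment. This is just a sketch under the same assumption as above (the pattern is applied with `re.match` to a relative path), and `dir_exclude_pattern` is a made-up helper name, not part of gsutil.

```python
import re

def dir_exclude_pattern(names):
    """Build a regex matching paths that contain any name as a whole segment."""
    alternation = "|".join(re.escape(n) for n in names)
    # (?:.*/)?   optional parent directories
    # (?:...)    one of the directory names, as a complete segment
    # (?:/.*)?$  the path either ends here or continues below the directory
    return r"(?:.*/)?(?:%s)(?:/.*)?$" % alternation

pattern = dir_exclude_pattern(["dirA", "dirCA"])
compiled = re.compile(pattern)

paths = ["rsync-test/dirA", "a/dirAk/b", "a/dirA/b", "dirA/A/B/C",
         "rsync-test/dirC/dirCA", "rsync-test/dirB"]
matches = {p: bool(compiled.match(p)) for p in paths}

print(pattern)
for path, hit in matches.items():
    print(("matches:        " if hit else "does not match: ") + path)
```

Unlike a plain substring pattern such as `.*dirA.*`, this leaves `a/dirAk/b` alone. Note that it also matches a file that happens to be named exactly like one of the directories, so a dry run with -n is still worth doing.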
