Google-cloud-python: Batch uploading of files

Created on 14 Mar 2017 · 6 comments · Source: googleapis/google-cloud-python

I'm confused about this comment https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L589

What is the correct way of uploading blobs in a batch?

    client = storage.Client.from_service_account_json('client-secret.json', project=PROJECT_NAME)
    bucket = client.get_bucket(BUCKET_NAME)
    with client.batch():
        for i in range(10):
            with open('base.py', 'rb') as my_file:
                blob = storage.Blob('/test/{}'.format(i), bucket)
                blob.upload_from_file(my_file, client=client)

This, of course, won't work, because it uses client._base_connection instead of client.current_batch.

Label: storage

All 6 comments

@tartavull Thanks for the report! Unfortunately, the back-end API doesn't support batching "media" operations, which means we don't support it either.

Examples of situations when you might want to use batching:

- Updating metadata, such as permissions, on many objects.
- Deleting many objects.
...

* Note: Currently, Google Cloud Storage does not support batch operations for media,
  either for upload or download.
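
By contrast, here is a minimal sketch of a metadata-style operation that batching does support: deleting many objects in a single request. BUCKET_NAME and the 'test/' prefix are placeholders for illustration, not part of the API:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket(BUCKET_NAME)

    # Materialize the listing first: calls made inside the batch context
    # are deferred, so only the deletes should be issued there.
    blobs = list(bucket.list_blobs(prefix='test/'))

    with client.batch():
        for blob in blobs:
            blob.delete()  # queued, then sent as one batched request on exit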

Any update on this? How do we upload thousands of files to GCS? Should we always iterate and upload each file individually?

@arvindnrbt With my consulting hat on: "It depends...". You could:

- Iterate and upload files one by one.
- Populate a queue.Queue with the filenames and have worker threads upload items from the queue (see the sketch below).
- Populate a cross-process queue (e.g., multiprocessing.Queue) and have worker processes upload items from the queue.

You can't:

- Upload a zipfile and have GCS automagically explode its entries into blobs / objects.
- Create a "batched" MIME request and do multiple uploads in a single request.
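
A minimal sketch of the queue.Queue option above, assuming filenames is a list of local file paths and BUCKET_NAME is a placeholder; each worker thread builds its own client, in line with the per-thread-client advice later in this thread:

    import queue
    import threading

    from google.cloud import storage

    def worker(q):
        # A separate client (and connection) per thread.
        client = storage.Client()
        bucket = client.get_bucket(BUCKET_NAME)
        while True:
            try:
                filename = q.get_nowait()
            except queue.Empty:
                return
            # One regular (non-batched) upload request per file.
            bucket.blob(filename).upload_from_filename(filename)

    q = queue.Queue()
    for filename in filenames:  # e.g., thousands of local paths
        q.put(filename)

    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()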

Any update on this? @tseaver

@limbuu I'm not sure what you're asking: the Storage API supports batch requests only for object metadata, not for uploads / downloads:

Note: Cloud Storage does not support batch operations for media, either for upload or download.

which means we can't support it in the google-cloud-storage client library, either.

The other solutions I outlined above aren't suitable for the library: they are application-dependent.

For anyone who might want to batch upload/download: I wanted to speed up this process, and it turned out that spawning a few threads and using a different google storage client instance in each improved the speed significantly. Tested on a 6-core machine. I just needed to partition the list of files to upload/download myself. Hope this will be helpful for some.
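
A sketch of that approach under stated assumptions: FILES is the full list of local paths, BUCKET_NAME is a placeholder, and the list is partitioned round-robin with one client per thread:

    import threading

    from google.cloud import storage

    def upload_chunk(filenames):
        client = storage.Client()  # separate client instance per thread
        bucket = client.get_bucket(BUCKET_NAME)
        for filename in filenames:
            bucket.blob(filename).upload_from_filename(filename)

    n_threads = 6  # matched to the core count mentioned above
    chunks = [FILES[i::n_threads] for i in range(n_threads)]  # round-robin split
    threads = [threading.Thread(target=upload_chunk, args=(chunk,))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()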
