We use Google Cloud Storage to store .parquet files as part of our data processing. We often want to load a .parquet file into memory and read it directly into a pandas DataFrame, without first downloading it to disk.
The code below does the trick. My question is: would it be useful to include a download_as_buffer method in storage.blob?
from io import BytesIO
from google.oauth2.service_account import Credentials
from google.cloud.storage import Client
import pandas as pd
SERVICE_ACCOUNT = '/some/path/to/service-account.json'
credentials = Credentials.from_service_account_file(SERVICE_ACCOUNT)
bucket = Client(credentials=credentials).bucket('mediquest-closed-data')
f = BytesIO()
bucket.get_blob(blob_name='some_file.parquet').download_to_file(f)
f.seek(0)  # rewind so readers that start at the current position see the data
df = pd.read_parquet(f)
Proposal: add a new method, or modify download_as_string so that it can optionally return the BytesIO buffer itself rather than the result of getvalue():
def download_as_string(self, client=None):
    """Download the contents of this blob as a string.

    :type client: :class:`~google.cloud.storage.client.Client` or
                  ``NoneType``
    :param client: Optional. The client to use. If not passed, falls back
                   to the ``client`` stored on the blob's bucket.

    :rtype: bytes
    :returns: The data stored in this blob.
    :raises: :class:`google.cloud.exceptions.NotFound`
    """
    string_buffer = BytesIO()
    self.download_to_file(string_buffer, client=client)
    return string_buffer.getvalue()
Or am I overlooking a similar method that is already included elsewhere in the API?
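To make the proposal concrete, here is a rough sketch of what such a helper might look like as a free function. The name download_as_buffer is hypothetical and not part of the library; the only assumption about `blob` is that it exposes `download_to_file(file_obj, client=None)`, as used in the snippets in this thread.

```python
from io import BytesIO


def download_as_buffer(blob, client=None):
    """Hypothetical helper: return the blob's contents as a rewound BytesIO.

    `blob` is assumed to expose `download_to_file(file_obj, client=None)`.
    """
    buffer = BytesIO()
    blob.download_to_file(buffer, client=client)
    # Writing leaves the position at the end of the buffer;
    # rewind so that callers start reading at byte 0.
    buffer.seek(0)
    return buffer
```

With a helper like this, the loading step above would presumably shrink to something like `pd.read_parquet(download_as_buffer(bucket.get_blob('some_file.parquet')))`.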
@dkapitan Blob.download_to_file does what you want (it takes a file object, versus the filename taken by Blob.download_to_filename).
import io
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
blob = bucket.get_blob('my-blob-name')
buffer = io.BytesIO()
blob.download_to_file(buffer)
For anyone working with this later, don't forget to call buffer.seek(0) before reading it.
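A small illustration of why the rewind matters, using only the standard library. The `write` call here is a stand-in for `blob.download_to_file(buffer)`, which likewise leaves the position at the end of the buffer.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"example payload")  # stand-in for blob.download_to_file(buffer)

at_end = buffer.read()      # position is at the end, so this returns b""
buffer.seek(0)              # rewind to the start of the buffer
from_start = buffer.read()  # now the full payload is returned

print(at_end)      # b''
print(from_start)  # b'example payload'
```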