Google-cloud-python: BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages.

Created on 6 Feb 2019 · 6 comments · Source: googleapis/google-cloud-python

Environment details

python version: 3.7.2
virtual environment: Conda managed

pip freeze | grep google

  • google-api-core==1.7.0
  • google-auth==1.6.2
  • google-auth-oauthlib==0.2.0
  • google-cloud-bigquery==1.9.0
  • google-cloud-core==0.29.1
  • google-resumable-media==0.3.2
  • googleapis-common-protos==1.5.6

Problem

The strategy for paginating through a table in BigQuery with RowIterator.to_dataframe() does not work as expected.
Either

  • [ ] the documentation should be updated to show that it will iterate automatically over all pages,
  • [ ] or it should only fill the DataFrame with the first page, and let the user iterate and join the frames as needed.

Steps to reproduce

  1. Check the instructions for paginating through a query.
  2. Note the suggestion for how to read subsequent pages
    >A page is a subset of the total number of rows. If your results are more than one page of data, the result data will have a pageToken property. To retrieve the next page of results, make another list call and include the token value as a URL parameter named pageToken.
  3. Check the response for the Client.list_rows() method.
  4. See that the Iterator may be transformed to a dataframe, and assume that it will only transform the page which was already loaded.
  5. Call RowIterator.to_dataframe() and see that it loads the rest of the pages and unions them into a single dataframe, issuing additional API requests.
  6. Be surprised that the rest of the pages were loaded without warning, and that the native implementation gives you no way to work with one page's dataframe at a time.
  7. Check the source code and confirm your suspicions.
    def _to_dataframe_tabledata_list(self, dtypes):
        """Use (slower, but free) tabledata.list to construct a DataFrame."""
        column_names = [field.name for field in self.schema]
        frames = []
        for page in iter(self.pages):
            frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
        return pandas.concat(frames)

    def to_dataframe(self, bqstorage_client=None, dtypes=None):
        . . .
        if pandas is None:
            raise ValueError(_NO_PANDAS_ERROR)
        if dtypes is None:
            dtypes = {}

        if bqstorage_client is not None:
            return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
        else:
            return self._to_dataframe_tabledata_list(dtypes)
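The all-pages consumption in _to_dataframe_tabledata_list can be sketched with a toy iterator (no BigQuery or pandas needed; FakeRowIterator and its methods are made up for illustration, not part of the library):

```python
class FakeRowIterator:
    """Toy stand-in for RowIterator: `pages` yields one list of rows per page."""

    def __init__(self, rows, page_size):
        self._rows = rows
        self._page_size = page_size

    @property
    def pages(self):
        # Yield the rows in page-sized chunks, like the real iterator does.
        for start in range(0, len(self._rows), self._page_size):
            yield self._rows[start:start + self._page_size]

    def to_rows(self):
        # Mirrors _to_dataframe_tabledata_list: loops over *every* page,
        # so the entire result set ends up in memory at once.
        merged = []
        for page in self.pages:
            merged.extend(page)
        return merged


it = FakeRowIterator(list(range(20)), page_size=5)
print(len(it.to_rows()))  # 20 -- all four pages, not just the first page of 5
```

This is the behaviour the issue describes: iterating `self.pages` consumes every remaining page, even though only the first one had been fetched so far.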

Code example

table = client.get_table(destination_table)
row_iterator = client.list_rows(destination_table,
                                selected_fields=table.schema,
                                page_size=5,
                                max_results=20)
df = row_iterator.to_dataframe()
df.info()


Int64Index: 20 entries, 0 to X
Data columns (total X columns):

bigquery docs

All 6 comments

@tswast I don't know whether this is just a docs issue (clarify that RowIterator.to_dataframe consumes / merges all results into a single dataframe), or whether we need to add a feature which allows getting a dataframe from only a subset (a single page, perhaps?).

That whole https://cloud.google.com/bigquery/docs/paging-results page is pretty irrelevant when using the client library. The RowIterator takes care of pagination for you.

What is the reason that you'd want a dataframe for only a single page? Potentially this could be a feature request to provide a helper on the page property of the RowIterator to construct a dataframe if there is a reason this is needed.

I couldn't find any documentation about how the RowIterator takes care of pagination, at least nothing beyond using the next_page_token, at the time I wrote this issue. I have since found the documentation about it.

With that in mind, this is really just a documentation problem. However, I'll describe what I was trying to do, and if you think it falls in line with good practices, we can open a feature request. (I am also willing to make the pull request myself, but might need some help, since I have not contributed to this repo before.)

What I was doing

We have a small dataset in the cloud which is normally processed with C, and we have some tools that wrap that C library in Python 3. We need to re-process this data for an analysis, but since Python 3 is not yet supported by Dataflow, the plan was to process it by paging through it, uploading results one page at a time, since we will be doing it locally on a RAM-limited machine.

I tried to load a page and wanted to process just that one page, for the sake of keeping memory free for the analysis. Working with a dataframe would be ideal, since some of our methods accept dataframes as input. However, when using the to_dataframe method, we ended up with much more memory in use, and when we increased the amount of data imported by this tool, we did not have enough memory left to do anything with it.

Maybe this is worth opening the feature request that @tswast suggested.
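In the meantime, one possible workaround is to build a DataFrame per page and process each one before moving on, rather than calling to_dataframe on the whole iterator. The pages list below is simulated data standing in for row_iterator.pages, and the column names are made up for illustration:

```python
import pandas as pd

# Simulated pages of row dicts; with the real client these would come from
# iterating row_iterator.pages (each page yields mapping-like rows).
pages = [
    [{"id": i, "value": i * 2} for i in range(start, start + 5)]
    for start in range(0, 20, 5)
]

# Process one page at a time: only a single page's DataFrame is alive at once.
totals = []
for page in pages:
    df = pd.DataFrame(page)           # DataFrame for this page only
    totals.append(df["value"].sum())  # do the per-page work here, then drop df

print(totals)  # [20, 70, 120, 170]
```

This keeps peak memory bounded by the page size instead of the full result set, which was the goal in the RAM-limited scenario above.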

However for now,

Documentation fixes

I think we need to

  • [x] document in the docstring for to_dataframe that it will load the rest of the pages.
  • [ ] add a relevant link to the Python example code at the paging results page
  • [x] add a more obvious link to the documentation for the ListIterator's base class HTTPIterator, which I linked to above.

The most needed change is the first one. For the generated documentation, I am unsure how to add a hyperlink, or what the best solution is. Shouldn't the documentation language automatically find the HTTPIterator and let us click on the name to go to it, since it belongs to the same repo?

>Shouldn't the documentation language automatically find the HTTPIterator?

Sphinx doesn't show inheritance by default. I believe we'd need to add :show-inheritance: to the autoclass at the bottom of the template at https://github.com/googleapis/google-cloud-python/blob/master/third_party/sphinx/sphinx/ext/autosummary/templates/autosummary/class.rst
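If it helps, the change would presumably look something like this in the autosummary template (a sketch only; the template's actual contents and field list may differ):

```rst
.. autoclass:: {{ objname }}
   :members:
   :show-inheritance:
```

With :show-inheritance: set, Sphinx emits a "Bases: ..." line linking to HTTPIterator, since it is documented in the same project.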

I've sent #7338 to fix the docs issue and created #7339 to track the feature request to get a DataFrame for each page.
