python version: 3.7.2
virtual environment: Conda managed
pip freeze | grep 'google'
The strategy for paginating through a table in BigQuery with RowIterator.to_dataframe() does not work as expected.
Pagination is set up via the page_size and max_results arguments of the Client.list_rows() method. The relevant library code is:

def _to_dataframe_tabledata_list(self, dtypes):
    """Use (slower, but free) tabledata.list to construct a DataFrame."""
    column_names = [field.name for field in self.schema]
    frames = []
    for page in iter(self.pages):
        frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
    return pandas.concat(frames)
def to_dataframe(self, bqstorage_client=None, dtypes=None):
    ...
    if pandas is None:
        raise ValueError(_NO_PANDAS_ERROR)
    if dtypes is None:
        dtypes = {}

    if bqstorage_client is not None:
        return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
    else:
        return self._to_dataframe_tabledata_list(dtypes)
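For reference, the merging behaviour above can be reproduced without BigQuery: _to_dataframe_tabledata_list builds one DataFrame per page and concatenates them, so the result always contains every row from every page. A minimal sketch, using plain lists of dicts as stand-ins for the API pages (the data is made up):

```python
import pandas as pd

# Stand-ins for the pages a RowIterator would yield (hypothetical data).
pages = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}, {"id": 4, "name": "d"}],
]

# Mirrors _to_dataframe_tabledata_list: one frame per page, then concat.
frames = [pd.DataFrame(page) for page in pages]
df = pd.concat(frames, ignore_index=True)

print(len(df))  # every row from every page ends up in one DataFrame
```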
table = client.get_table(destination_table)
row_iterator = client.list_rows(destination_table,
                                selected_fields=table.schema,
                                page_size=5,
                                max_results=20)
df = row_iterator.to_dataframe()
df.info()
Int64Index: 20 entries, 0 to X
Data columns (total X columns):
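In other words, page_size only controls how many rows each underlying API call fetches, while max_results caps the total; to_dataframe() still consumes every page. A hypothetical illustration of how the two parameters interact for the call above:

```python
# Hypothetical illustration: page_size chunks the transport, while
# max_results caps the total row count; to_dataframe() consumes all pages.
page_size = 5
max_results = 20

pages = [
    list(range(start, min(start + page_size, max_results)))
    for start in range(0, max_results, page_size)
]

print(len(pages), sum(len(p) for p in pages))  # 4 pages, 20 rows in total
```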
@tswast I don't know whether this is just a docs issue (clarify that RowIterator.to_dataframe consumes / merges all results into a single dataframe), or whether we need to add a feature which allows getting a dataframe from only a subset (the page, perhaps?).
That whole https://cloud.google.com/bigquery/docs/paging-results page is pretty irrelevant when using the client library. The RowIterator takes care of pagination for you.
What is the reason that you'd want a dataframe for only a single page? Potentially this could be a feature request to provide a helper on the page property of the RowIterator to construct a dataframe if there is a reason this is needed.
I couldn't find any documentation about how the RowIterator takes care of pagination; at the time of writing this issue, there was nothing beyond using the next_page_token. I have since found the documentation about it.
With that in mind, this is really just a documentation problem. However, I'll describe what I was trying to do, and if you think that it falls in line with good practice, we can open a feature request (I am also willing to make the pull request myself, but might need some help since I have not contributed to this repo before).
We have a small dataset in the cloud which is normally processed with C, and we have some tools for that C library which wrap it in Python3. It is required that we re-process this data for an analysis, but since Python3 is not yet supported by DataFlow, the plan was to just process it by paging through it, and uploading results one page at a time since we will be doing it locally on a RAM limited machine.
I tried to load a page, and wanted to process just that one page for the sake of keeping memory free for the analysis. Working with a dataframe would be ideal, since some of our methods support dataframes as input. But when using the to_dataframe method, we ended up with a lot more memory in use, and when increasing the amount of data being imported by this tool, we would not have enough memory left to do anything with it.
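A possible workaround along these lines (and roughly what a per-page helper would formalise) is to iterate the iterator's pages property and build a DataFrame for one page at a time, so only a single page is ever materialised. This is only a sketch: with the real client you would loop over row_iterator.pages and convert each Row with dict(row); here a fake generator stands in for the API so the idea is self-contained.

```python
import pandas as pd

def fake_pages():
    """Stand-in for RowIterator.pages; yields lists of row dicts.

    With the real client you would iterate row_iterator.pages and
    convert each Row to a mapping with dict(row).
    """
    yield [{"id": 1}, {"id": 2}]
    yield [{"id": 3}, {"id": 4}]

page_sizes = []
for page in fake_pages():
    # Materialise only the current page as a DataFrame.
    page_df = pd.DataFrame(page)
    page_sizes.append(len(page_df))  # hypothetical per-page analysis step
    del page_df  # free the frame before the next page is fetched
```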
Maybe this is worth opening the feature request that @tswast suggested..
However, for now, I think the most needed change is in the generated documentation. I am unsure about how to add a hyperlink, or what the best solution is. Shouldn't the documentation tooling automatically find the HTTPIterator, and let us click on the name to go to it, since it belongs to the same repo?
> Shouldn't the documentation tooling automatically find the HTTPIterator?
Sphinx doesn't show inheritance by default. I believe we'd need to add :show-inheritance: to the autoclass at the bottom of the template at https://github.com/googleapis/google-cloud-python/blob/master/third_party/sphinx/sphinx/ext/autosummary/templates/autosummary/class.rst
I've sent #7338 to fix the docs issue and created #7339 to track the feature request to get a DataFrame for each page.