client.load_table_from_dataframe() results in the DataFrame index being loaded into the BigQuery table.
Could the capability to load data into a table from a DataFrame without also loading the index be implemented?
Details for how to do this are here: https://github.com/pydata/pandas-gbq/issues/133#issuecomment-411119426
More than happy for this to be implemented here rather than pandas-gbq...
Any updates on this being implemented in the BigQuery client library vs. the pandas-gbq module?
@sungchun12 I just tried the solution @max-sixty posted above with the BigQuery client API and it worked fine. Create the job configuration and override the schema as suggested.
The schema should be a list with the format:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField('field1_name', 'field1_type'),
    ...,
    bigquery.SchemaField('fieldn_name', 'fieldn_type'),
]
@mikeymezher , thanks for getting back to me! I'll try it out and let you know. Do you know if one is more performant over the other in your hands-on experience?
Haven't tested, but anecdotally I've noticed pandas-gbq to be faster than the client library. But there are instances where the client library is needed. Writing to partitioned tables for instance.
They are implementation details, but pandas-gbq uses CSV whereas google-cloud-bigquery uses parquet as the serialization format. The reason for this is to support STRUCT / ARRAY BigQuery columns (though these aren't supported in pandas, anyway).
Implementation-wise, I just noticed pandas provides a way to override the parquet engine's default behavior with an index argument. I'm open to adding a similar argument to google-cloud-bigquery.
df.to_parquet('test.parquet', index=False)
From https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
@tswast , I would LOVE index=False functionality in the google-cloud-bigquery package, as it would let me remove my pandas-gbq imports and work with a single consistent BigQuery API. At least for my use cases, it would mean not having to build schema overrides.
@tswast Has the index=False functionality been added to google-cloud-bigquery?? Thanks.
@cavindsouza not yet. Right now you can avoid writing indexes by passing in a job_config argument with the desired job_config.schema. I agree this would still be a useful feature.
FYI: https://github.com/googleapis/google-cloud-python/pull/9064 and https://github.com/googleapis/google-cloud-python/pull/9049 are changing the index behavior, as a schema will be automatically populated in more cases now.
We might actually have a need to explicitly add indexes to the table. Currently it's inconsistent when an index is added and when it isn't: it depends on whether the schema is populated and on which parquet engine is used to serialize the DataFrame.
Preferred option
Check if index (or indexes if multi-index) name(s) are present in job_config.schema. If so, include the index(es) that are specified. If not, omit the indexes.
Edge case: What if index name matches that of a column name? Prefer serializing the column. Don't add the index in this case.
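The preferred option above can be sketched as a small helper. This is a hypothetical function (the name and signature are mine, not the library's) that applies the rule: include an index only when its name appears in the supplied schema and no column shadows that name:

```python
# Hypothetical helper sketching the preferred option: an index is included
# only if it is named, its name is in the user-supplied schema, and no
# DataFrame column has the same name (columns win the edge case).
import pandas as pd

def indexes_to_include(df: pd.DataFrame, schema_field_names: set) -> list:
    included = []
    for name in df.index.names:  # covers both single and multi-index
        if name is None:
            continue  # an unnamed index can't be matched to a schema field
        if name in df.columns:
            continue  # edge case: prefer serializing the column, not the index
        if name in schema_field_names:
            included.append(name)
    return included

df = pd.DataFrame(
    {"city": ["NYC"], "pop": [8_000_000]},
    index=pd.Index([1], name="city_id"),
)
print(indexes_to_include(df, {"city", "pop", "city_id"}))  # index is included
print(indexes_to_include(df, {"city", "pop"}))             # index is omitted
```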
Alternative
Add an index=True / index=False option to the load_table_from_dataframe function.
This makes it explicit when to include indexes. It would allow the index dtype to be used to automatically determine the schema in some cases.
When the index dtype is object, we'll need to add the index to the job_config.schema, anyway, so this requires the same implementation as the preferred option.
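The object-dtype caveat can be illustrated with a toy dtype lookup (this mapping is my own simplification, not the library's actual conversion table): numeric and datetime dtypes map to a single BigQuery type, but object could hold strings, bytes, or mixed values, so the type can't be auto-detected and must come from job_config.schema:

```python
# Illustrative dtype-to-BigQuery-type mapping (a simplification, not the
# library's real lookup table). An object dtype has no unambiguous BigQuery
# type, so auto-detection fails and an explicit schema entry is required.
import pandas as pd

_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
}

def detect_bq_type(series_or_index):
    """Return a BigQuery type name, or None when an explicit schema is needed."""
    return _DTYPE_TO_BQ.get(str(series_or_index.dtype))

int_index = pd.Index([1, 2, 3], name="row_id")
obj_index = pd.Index(["a", "b", "c"], dtype=object, name="label")

print(detect_bq_type(int_index))  # auto-detectable from the dtype
print(detect_bq_type(obj_index))  # None -> must appear in job_config.schema
```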
Follow-up from https://github.com/googleapis/google-cloud-python/pull/9064
When this feature is added, update the test_load_table_from_dataframe snippets to show overriding the schema for columns whose types can't be autodetected. Explicitly add indexes to the partial schema as well.

Once this feature is released, do the following to omit indexes:

- Install the pyarrow library.
- Supply the LoadJobConfig.schema for any object dtype columns.

To include indexes:

- Install the pyarrow library.
- Name your indexes, either by passing name="my_index_name" to the Index constructor or by constructing the DataFrame with an index=my_index argument.
- Supply the LoadJobConfig.schema for any object dtype columns and any indexes you want to include.

Code sample:
https://github.com/googleapis/google-cloud-python/blob/a6ed9451cf92fede076ccde28e6914a380ec7878/bigquery/samples/load_table_dataframe.py#L18-L71