client.load_table_from_dataframe() results in the DataFrame index being loaded into the BigQuery table.
Could the capability to load data into a table from a DataFrame without also loading the index be implemented?
Details for how to do this are here: https://github.com/pydata/pandas-gbq/issues/133#issuecomment-411119426
More than happy for this to be implemented here rather than pandas-gbq...
Any updates on this being implemented in the BigQuery client library vs. the pandas-gbq module?
@sungchun12 I just tried the solution @max-sixty posted above with the BigQuery client API and it worked fine. Create the job configuration and override the schema as suggested.
The schema should be a list with the format:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField('field1_name', 'field1_type'),
    ...,
    bigquery.SchemaField('fieldn_name', 'fieldn_type'),
]
@mikeymezher , thanks for getting back to me! I'll try it out and let you know. Do you know if one is more performant over the other in your hands-on experience?
Haven't tested, but anecdotally I've noticed pandas-gbq to be faster than the client library. But there are instances where the client library is needed. Writing to partitioned tables for instance.
They are implementation details, but pandas-gbq uses CSV whereas google-cloud-bigquery uses parquet as the serialization format. The reason for this is to support STRUCT / ARRAY BigQuery columns (though these aren't supported in pandas, anyway).
Implementation-wise, I just noticed pandas provides a way to override the parquet engine's default behavior with an index argument. I'm open to adding a similar argument to google-cloud-bigquery.
df.to_parquet('test.parquet', index=False)
From https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
@tswast , I would LOVE index=False functionality in the google-cloud-bigquery package, as it would let me remove my pandas-gbq imports and work with a single consistent BigQuery API. At least for my use cases, it would mean not having to build schema overrides.
@tswast Has the index=False functionality been added to google-cloud-bigquery?? Thanks.
@cavindsouza not yet. Right now you can avoid writing indexes by passing in a job_config argument with the desired job_config.schema. I agree this would still be a useful feature.
FYI: https://github.com/googleapis/google-cloud-python/pull/9064 and https://github.com/googleapis/google-cloud-python/pull/9049 are changing the index behavior, as a schema will be automatically populated in more cases now.
We might actually have a need to explicitly add indexes to the table. Currently it's inconsistent when an index is added and when it isn't: it depends on whether the schema is populated and on which parquet engine is used to serialize the DataFrame.
Preferred option
Check if index (or indexes if multi-index) name(s) are present in job_config.schema. If so, include the index(es) that are specified. If not, omit the indexes.
Edge case: What if index name matches that of a column name? Prefer serializing the column. Don't add the index in this case.
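The preferred option above can be sketched as a small helper. This is a hypothetical function (the name and signature are mine, not the library's) that applies the rule: include an index only when its name appears in the supplied schema and no column shadows that name:

```python
# Hypothetical helper sketching the preferred option: an index is included
# only if it is named, its name is in the user-supplied schema, and no
# DataFrame column has the same name (columns win the edge case).
import pandas as pd

def indexes_to_include(df: pd.DataFrame, schema_field_names: set) -> list:
    included = []
    for name in df.index.names:  # covers both single and multi-index
        if name is None:
            continue  # an unnamed index can't be matched to a schema field
        if name in df.columns:
            continue  # edge case: prefer serializing the column, not the index
        if name in schema_field_names:
            included.append(name)
    return included

df = pd.DataFrame(
    {"city": ["NYC"], "pop": [8_000_000]},
    index=pd.Index([1], name="city_id"),
)
print(indexes_to_include(df, {"city", "pop", "city_id"}))  # index is included
print(indexes_to_include(df, {"city", "pop"}))             # index is omitted
```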
Alternative
Add an index=True / index=False option to the load_table_from_dataframe function.
This makes it explicit when to include indexes. It would allow the index dtype to be used to automatically determine the schema in some cases.
When the index dtype is object, we'll need to add the index to the job_config.schema, anyway, so this requires the same implementation as the preferred option.
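The object-dtype caveat can be illustrated with a toy dtype lookup (this mapping is my own simplification, not the library's actual conversion table): numeric and datetime dtypes map to a single BigQuery type, but object could hold strings, bytes, or mixed values, so the type can't be auto-detected and must come from job_config.schema:

```python
# Illustrative dtype-to-BigQuery-type mapping (a simplification, not the
# library's real lookup table). An object dtype has no unambiguous BigQuery
# type, so auto-detection fails and an explicit schema entry is required.
import pandas as pd

_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
}

def detect_bq_type(series_or_index):
    """Return a BigQuery type name, or None when an explicit schema is needed."""
    return _DTYPE_TO_BQ.get(str(series_or_index.dtype))

int_index = pd.Index([1, 2, 3], name="row_id")
obj_index = pd.Index(["a", "b", "c"], dtype=object, name="label")

print(detect_bq_type(int_index))  # auto-detectable from the dtype
print(detect_bq_type(obj_index))  # None -> must appear in job_config.schema
```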
Follow-up from https://github.com/googleapis/google-cloud-python/pull/9064
When this feature is added, update the test_load_table_from_dataframe snippets to show overriding the schema for columns whose types can't be autodetected. Explicitly add indexes to the partial schema as well.

Once this feature is released, do the following to omit indexes:

- Install the pyarrow library.
- Supply the LoadJobConfig.schema for any object dtype columns.

To include indexes:

- Install the pyarrow library.
- Name your indexes, either by passing name="my_index_name" to the Index constructor or by constructing the DataFrame with an index=my_index argument.
- Supply the LoadJobConfig.schema for any object dtype columns and any indexes you want to include.

Code sample:
https://github.com/googleapis/google-cloud-python/blob/a6ed9451cf92fede076ccde28e6914a380ec7878/bigquery/samples/load_table_dataframe.py#L18-L71