I am encountering the following problem when uploading a Pandas DataFrame to a partitioned table:
API: BigQuery
OS: macOS High Sierra 10.13.6
Python: 3.5.7
Packages:
google-api-core==1.11.0
google-api-python-client==1.7.8
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.12.1
google-cloud-core==1.0.0
google-cloud-dataproc==0.3.1
google-cloud-datastore==1.8.0
google-cloud-storage==1.16.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.10
parquet==1.2
To reproduce, create a partitioned table on BigQuery and load a DataFrame into one of its partitions:
import pandas as pd
from google.cloud import bigquery
PROJECT = "my-project"
DATASET = "my_dataset"
TABLE = "my_table"
# My table schema
schema = [
    bigquery.SchemaField("foo", "FLOAT", mode="REQUIRED"),
    bigquery.SchemaField("bar", "INTEGER", mode="REQUIRED"),
]
# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)
# Delete the table if it exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)
# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)
print("Table schema:")
print(table.schema)
print("Table partitioning:")
print(table.time_partitioning)
# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [2.0, 3.0, 4.0]})
client.load_table_from_dataframe(df, table_ref).result()
Deleting table if exists...
Creating table...
Table schema:
[SchemaField('foo', 'FLOAT', 'REQUIRED', None, ()), SchemaField('bar', 'INTEGER', 'REQUIRED', None, ())]
Table partitioning:
TimePartitioning(type=DAY)
Traceback (most recent call last):
  File "<my-project>/bigquery_failure.py", line 49, in <module>
    client.load_table_from_dataframe(df, table_ref).result()
  File "<my-env>/lib/python3.5/site-packages/google/cloud/bigquery/job.py", line 732, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "<my-env>/lib/python3.5/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest:
400 Provided Schema does not match Table my-project:my_dataset.my_table$20190522.
Field bar has changed mode from REQUIRED to NULLABLE
Process finished with exit code 1
@tswast ISTM that Client.load_table_from_dataframe is generating a schema with NULLABLE mode, which isn't compatible with the original table's schema, presumably in the process of calling Client.load_table_from_file with the generated parquet file.
Hmm, looks like this one is related to #7370.
I think BigQuery is probably auto-detecting the column as nullable since it's a parquet file. I don't think parquet has the option of required types.
@timocb Does this error still occur when you supply a schema manually to the load job? e.g.
job_config = bigquery.LoadJobConfig(schema=schema)
load_job = client.load_table_from_dataframe(
    df, table_ref, job_config=job_config
)
load_job.result()
@tswast Using your suggestion of passing the schema using the job_config, I get the following error:
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message:
Provided schema is not compatible with the file 'prod-scotty-e26a7c4b-827d-4d3e-bb1f-002c27becd42'.
Field 'bar' is specified as REQUIRED in provided schema which does not match NULLABLE as specified in the file.
It seems like what @tseaver is saying is correct. Parquet specifies the fields as NULLABLE, but the schema we provide to the job specifies them as REQUIRED.
@timocb Thanks for reporting. As far as I can tell, there's no way to mark a column as REQUIRED in a parquet file, so I've raised this as a backend feature request at https://issuetracker.google.com/133415569. Feel free to "star" it to watch for updates.
Turns out Parquet does have the ability to mark columns as required, but there's an open issue in Arrow to support it. https://issues.apache.org/jira/browse/ARROW-5169
Hi @tswast, does #8105 fix this issue?
@timocb #8105 gets us a step closer, but I need to follow up and populate the requiredness bit in the parquet file based on the BQ schema.
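Until that lands, one possible workaround (untested against the original setup; `client`, `table_ref`, and `schema` are assumed from the repro above) is to sidestep parquet entirely and load the DataFrame as CSV with an explicit schema, since CSV carries no nullability metadata of its own:

```python
import io

import pandas as pd

df = pd.DataFrame({"foo": [1.0, 2.0, 3.0], "bar": [2, 3, 4]})

# Serialize without header or index so columns line up with the schema.
buf = io.StringIO()
df.to_csv(buf, header=False, index=False)

# Hypothetical load call, reusing client/table_ref/schema from the repro:
# job_config = bigquery.LoadJobConfig(
#     schema=schema, source_format=bigquery.SourceFormat.CSV
# )
# client.load_table_from_file(
#     io.BytesIO(buf.getvalue().encode()), table_ref, job_config=job_config
# ).result()
print(buf.getvalue())
```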