Google-cloud-python: BigQuery: Field <field> has changed mode from REQUIRED to NULLABLE

Created on 22 May 2019 · 8 comments · Source: googleapis/google-cloud-python

I am encountering the following problem, when uploading a Pandas DataFrame to a partitioned table:

Environment details

API: BigQuery
OS: macOS High Sierra 10.13.6
Python: 3.5.7
Packages:

google-api-core==1.11.0
google-api-python-client==1.7.8
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.12.1
google-cloud-core==1.0.0
google-cloud-dataproc==0.3.1
google-cloud-datastore==1.8.0
google-cloud-storage==1.16.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.10
parquet==1.2

Steps to reproduce

Create a table on BigQuery with the following fields:

  • foo, FLOAT, required
  • bar, INTEGER, required

Reproducible code example (includes creating table)

import pandas as pd
from google.cloud import bigquery


PROJECT = "my-project"
DATASET = "my_dataset"
TABLE = "my_table"


# My table schema
schema = [
    bigquery.SchemaField("foo", "FLOAT", mode="REQUIRED"),
    bigquery.SchemaField("bar", "INTEGER", mode="REQUIRED"),
]


# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)


# Delete the table if it exists
print("Deleting table if it exists...")
client.delete_table(table_ref, not_found_ok=True)


# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)

print("Table schema:")
print(table.schema)

print("Table partitioning:")
print(table.time_partitioning)

# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [2.0, 3.0, 4.0]})
client.load_table_from_dataframe(df, table_ref).result()

Output:

Deleting table if exists...
Creating table...
Table schema:
[SchemaField('foo', 'FLOAT', 'REQUIRED', None, ()), SchemaField('bar', 'INTEGER', 'REQUIRED', None, ())]
Table partitioning:
TimePartitioning(type=DAY)
Traceback (most recent call last):
  File "<my-project>/bigquery_failure.py", line 49, in <module>
    client.load_table_from_dataframe(df, table_ref).result()
  File "<my-env>/lib/python3.5/site-packages/google/cloud/bigquery/job.py", line 732, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "<my-env>/lib/python3.5/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest:
400 Provided Schema does not match Table my-project:my_dataset.my_table$20190522.
Field bar has changed mode from REQUIRED to NULLABLE

Process finished with exit code 1
Labels: bug, bigquery, p2

All 8 comments

@tswast ISTM that Client.load_table_from_dataframe is generating a schema with NULLABLE mode, which isn't compatible with the original table's schema, presumably in the process of calling Client.load_table_from_file with the generated parquet file.

Hmm, looks like this one is related to #7370.

I think BigQuery is probably auto-detecting the column as nullable since it's a parquet file. I don't think parquet has the option of required types.

@timocb Does this error still occur when you supply a schema manually to the load job? e.g.

job_config = bigquery.LoadJobConfig(schema=schema)
load_job = client.load_table_from_dataframe(
    df, table_ref, job_config=job_config
)
load_job.result()

@tswast Using your suggestion of passing the schema using the job_config, I get the following error:

google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: 
Provided schema is not compatible with the file 'prod-scotty-e26a7c4b-827d-4d3e-bb1f-002c27becd42'.
Field 'bar' is specified as REQUIRED in provided schema which does not match NULLABLE as specified in the file.

It seems like what @tseaver is saying is correct. Parquet specifies the fields as NULLABLE, but the schema we provide to the job specifies them as REQUIRED.

@timocb Thanks for reporting. As far as I can tell, there's no way to mark a column as REQUIRED in a parquet file, so I've raised this as a backend feature request at https://issuetracker.google.com/133415569. Feel free to "star" it to watch for updates.

Turns out Parquet does have the ability to mark columns as required, but there's an open issue in Arrow to support it. https://issues.apache.org/jira/browse/ARROW-5169

Hi @tswast, does #8105 fix this issue?

@timocb #8105 gets us a step closer, but I need to follow up and populate the requiredness bit in the parquet file based on the BQ schema.
