Google-cloud-python: Autodetect feature is not available in BigQuery client

Created on 9 Jan 2017 · 22 comments · Source: googleapis/google-cloud-python

Through the client, the autodetect feature can be used only if the body of the request contains a schema field. For example, this raw job body works fine when sent directly to the API:

```python
job = {
    'projectId': PROJECT_ID,
    'configuration': {
        'load': {
            'autodetect': True,  # works without it, by default autodetect = True
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': self.dataset,
                'tableId': self.table,
            },
            'sourceUris': [self.path],
            'writeDisposition': self.write_disposition
        }
    }
}
```

The task fails if a schema path is not specified in the task params.

Maybe this code in the client behaves unexpectedly (job.py):

```python
resource = {
    'jobReference': {
        'projectId': self.project,
        'jobId': self.name,
    },
    'configuration': {
        self._JOB_TYPE: {
            'sourceUris': self.source_uris,
            'destinationTable': {
                'projectId': self.destination.project,
                'datasetId': self.destination.dataset_name,
                'tableId': self.destination.name,
            },
        },
    },
}
configuration = resource['configuration'][self._JOB_TYPE]
self._populate_config_resource(configuration)

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}
```
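This guard only adds a `schema` key when fields exist, but nothing copies an `autodetect` flag into the resource. A minimal sketch of the behavior being asked for (the function name and signature are illustrative, not the library's API):

```python
def build_load_config(source_uris, autodetect=False, schema_fields=None):
    """Build a 'load' configuration dict, adding optional keys only when set."""
    config = {'sourceUris': source_uris}
    if autodetect:
        # forward the flag so BigQuery infers the schema server-side
        config['autodetect'] = True
    if schema_fields:
        # omit 'schema' entirely when no fields were provided
        config['schema'] = {'fields': schema_fields}
    return config

config = build_load_config(['gs://bucket/data.csv'], autodetect=True)
```

With this shape, passing an empty schema never produces an empty `schema` field in the request body, which is what causes the error described above.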
Labels: question, bigquery, p2


All 22 comments

@ivvory Thank you for your submission.
Unfortunately, I am unsure what it is you want us to change. Would you be willing to rephrase?

@lukesneeringer Sorry for the bad explanation; let me try again. The problem is that when I tried to use the autodetect feature to upload data without a schema, I found the feature doesn't work correctly through the client. I tried to use autodetect by passing an empty schema, because I couldn't find how to use it through the client directly. So I decided to build the jobs myself, specifying the params manually. When I passed autodetect=True and an empty fields param (like fields=None or fields=''), I got an error. But when I omitted the empty fields param, the job completed successfully.

May be the code below from the client adds the fields param to the job in any case.

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}

P.S. Thanks for autodetect feature in gc web interface :)

@ivvory Thanks, that makes sense. Thank you.

I'm trying to parse through the posts above.

Is there a way to use autodetect with the google-cloud-bigquery library at the moment?

@MaximilianR We don't have any explicit support for it (yet).

@tswast AFAICT autodetect only shows up in

  • Job.configuration.load.autodetect
  • Job.configuration.query.tableDefinitions.autodetect
  • Table.externalDataConfiguration.autodetect

The relevant methods for autodetect at creation time seem to be

  • bigquery.jobs.insert
  • bigquery.tables.insert

Is this correct? (I based this off the discovery doc.)

I believe that is correct.

There are two times when you would use the autodetect feature: creating a new table from loaded data and making an external table definition. https://cloud.google.com/bigquery/external-table-definition
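Expressed as raw REST request bodies (field names follow the discovery doc; the project, dataset, table, and URIs here are placeholders), those two creation-time uses might look like this:

```python
# load job: a jobs.insert body that asks BigQuery to infer the schema
load_job_body = {
    'configuration': {
        'load': {
            'autodetect': True,
            'sourceUris': ['gs://bucket/data.csv'],
            'destinationTable': {
                'projectId': 'PROJECT_ID',
                'datasetId': 'DATASET_ID',
                'tableId': 'TABLE_ID',
            },
        },
    },
}

# external table: a tables.insert body with an external data configuration
external_table_body = {
    'tableReference': {
        'projectId': 'PROJECT_ID',
        'datasetId': 'DATASET_ID',
        'tableId': 'TABLE_ID',
    },
    'externalDataConfiguration': {
        'autodetect': True,
        'sourceFormat': 'CSV',
        'sourceUris': ['gs://bucket/data.csv'],
    },
}
```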

@MaximilianR maybe you are interested in some notes about using autodetect.

Now you can use autodetect simply by polling the job:

```python
from time import sleep

def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)

jobs = GbqConnector(project_id=project_id).service.jobs()
params = {}  # declare the full load job body here

# use autodetect if needed
params['configuration']['load']['autodetect'] = True

response = jobs.insert(projectId=project_id, body=params).execute()
job_id = response['jobReference']['jobId']

result = poll_job(jobs, job_id)
```

---
Sometimes you want to append data that has new columns absent from the existing table. That case is more common with autodetect than with a manual schema. So you may consider the schemaUpdateOptions param with the new `'ALLOW_FIELD_ADDITION'` option, which allows fields missing from the target table to be added:

```python
if write_disposition == 'WRITE_APPEND':
    params['configuration']['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
```

One more problem comes from using these features together. I guess the workflow is as follows: if you have a table and try to load data with autodetect, the schema is first detected from the data and then compared with the schema of the existing table. That causes errors, because sometimes you don't know which type a column has in the current data. For example, if every value of some field in all processed rows is null, autodetect recognizes it as a string type (not sure exactly), while in the existing table the field actually has another type. The simplest solution that can help:
```python
def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        # check for this situation
        if 'errorResult' in status:
            if str(status['errorResult']['message']).startswith('Invalid schema update'):
                return 'Autodetect failed'
            else:
                raise Exception(str(status))

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)

# param declaring...

result = poll_job(jobs, job_id)

if result == 'Autodetect failed':
    # retry against the existing table's schema
    del params['configuration']['load']['autodetect']
    del params['configuration']['load']['schemaUpdateOptions']

    response = jobs.insert(projectId=project_id, body=params).execute()
    job_id = response['jobReference']['jobId']
    poll_job(jobs, job_id)
```
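The error check in that loop can also be factored into a pure function, which makes the fallback logic easier to test (a sketch; the `status` dict mirrors the shape of the `jobs.get` response used above):

```python
def classify_status(status):
    """Map a BigQuery job status dict to an outcome, raising on unexpected errors."""
    if 'errorResult' in status:
        if str(status['errorResult']['message']).startswith('Invalid schema update'):
            return 'Autodetect failed'
        raise Exception(str(status))
    if status['state'] == 'DONE':
        return 'DONE'
    return 'RUNNING'  # caller should keep polling
```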


One flexible option related to autodetect is unavailable: you cannot create a table this way without loading data. So I had to write my own schema generator, copying the workflow of the autodetect option :)
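Such a generator can stay quite small. Here is a sketch that maps Python value types onto BigQuery legacy type names (the mapping rules are my assumption, not what autodetect actually does; all-null columns fall back to STRING, matching the behavior described above):

```python
# assumed mapping from Python value types to BigQuery legacy types
_TYPE_MAP = {bool: 'BOOLEAN', int: 'INTEGER', float: 'FLOAT', str: 'STRING'}

def generate_schema(rows):
    """Derive a schema from sample dict rows, preserving column order."""
    names, types = [], {}
    for row in rows:
        for name, value in row.items():
            if name not in names:
                names.append(name)
            if value is not None and name not in types:
                # first non-null value decides the column type
                types[name] = _TYPE_MAP.get(type(value), 'STRING')
    return [{'name': n, 'type': types.get(n, 'STRING'), 'mode': 'NULLABLE'}
            for n in names]
```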

Hope you found that helpful.

Where is the GbqConnector module?

_UPDATE: the code below doesn't work. try this code instead!_

here's workaround code that works with the current released google-cloud-python (0.24.0). the key part is passing `{'configuration': {'load': {'autodetect': True}}}` to `LoadTableFromStorageJob.from_api_repr()`.

```python
from google.cloud.bigquery import Client
from google.cloud.bigquery.job import LoadTableFromStorageJob
from google.cloud.bigquery.table import Table, Dataset

client = Client(...)
job = LoadTableFromStorageJob.from_api_repr({
    'jobReference': {
        'jobId': 'UNIQUE_ID',
    },
    'configuration': {
        'load': {
            'destinationTable': {
                'datasetId': 'analytics',
                'tableId': table.name,
            },
            'autodetect': True,
        },
    },
}, client=client)
job.destination = Table('TABLE_NAME', Dataset('DATASET_NAME', client))
job.source_uris = ['file.csv', ...]

# set other options, e.g.:
job.source_format = 'CSV'
...

job.begin()
```

FWIW the latest is 0.25.0

oops, the code i gave in https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2926#issuecomment-312121757 doesn't actually work after all. i ended up having to monkey patch to get autodetect into the API call. here's that code:

```python
def _add_autodetect():
    resource = LoadTableFromStorageJob._build_resource(job)
    resource['configuration']['load']['autodetect'] = True
    return resource

job._build_resource = _add_autodetect
```
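A slightly safer variant of this patch (a sketch; it assumes `_build_resource` is a zero-argument bound method on the job instance) wraps the original method rather than calling it on the class, so any configuration the instance already builds is preserved:

```python
def patch_autodetect(job):
    """Wrap job._build_resource so the built resource carries autodetect=True."""
    original_build = job._build_resource  # keep the original bound method
    def _build_with_autodetect():
        resource = original_build()
        resource['configuration']['load']['autodetect'] = True
        return resource
    job._build_resource = _build_with_autodetect
    return job

# demonstration with a stand-in job object (not a real client class)
class _FakeJob:
    def _build_resource(self):
        return {'configuration': {'load': {'sourceUris': ['gs://bucket/data.csv']}}}

patched = patch_autodetect(_FakeJob())
resource = patched._build_resource()
```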

can't wait for official support! :P

@tswast Is what is left of this issue covered in the redesign? If so, I would like to close this out.

@snarfed Which file (and what directory is that file in) did you change to add this new `_add_autodetect` function?

@hemanthk92 it goes in your own code, not bigquery/google cloud's...but ignore it! they've fixed this bug. just set job.autodetect = True on your LoadTableFromStorageJob now.

@snarfed thanks for your response.
I'm on google-cloud-bigquery==0.28.0
I don't see an autodetect parameter as an option.
_(screenshot: autocomplete listing no autodetect parameter, 12 Dec 2017)_

@hemanthk92 It is a property on LoadJobConfig in 0.28.0. The migration guide has a sample for providing a configuration.

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True

with open('data_sample.txt', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)  # Start the job.

job.result()  # Wait for the job to complete.
```

I am using streaming insertion. Is there any way to use this schema autodetect with streaming?

Sorry, @ravi45722 the BigQuery API does not have an auto-detect feature for the streaming API. I recommend filing an issue requesting this feature at https://issuetracker.google.com/issues/new?component=187149&template=0

@yiga2 FWIW, that feature is marked "experimental" in the docs.

The workaround you propose won't work; instead, I would use:

```python
job_config._properties['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
```

@tseaver did you mean to reply to #4728?
