Google-cloud-python: Autodetect feature is not available in BigQuery client

Created on 9 Jan 2017 · 22 comments · Source: googleapis/google-cloud-python

Through the client, the autodetect feature can be used only if the body of the request contains a schema field. For example, this raw job body works fine when sent directly to the API:

```python
job = {
    'projectId': PROJECT_ID,
    'configuration': {
        'load': {
            'autodetect': True,  # works without it, by default autodetect = True
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': self.dataset,
                'tableId': self.table,
            },
            'sourceUris': [self.path],
            'writeDisposition': self.write_disposition
        }
    }
}
```

The task fails if a schema path is not specified in the task params.

Maybe this code in the client behaves unexpectedly (job.py):

```python
resource = {
    'jobReference': {
        'projectId': self.project,
        'jobId': self.name,
    },
    'configuration': {
        self._JOB_TYPE: {
            'sourceUris': self.source_uris,
            'destinationTable': {
                'projectId': self.destination.project,
                'datasetId': self.destination.dataset_name,
                'tableId': self.destination.name,
            },
        },
    },
}
configuration = resource['configuration'][self._JOB_TYPE]
self._populate_config_resource(configuration)

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}
```
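This guard only adds a `schema` key when fields exist, but nothing copies an `autodetect` flag into the resource. A minimal sketch of the behavior being asked for (the function name and signature are illustrative, not the library's API):

```python
def build_load_config(source_uris, autodetect=False, schema_fields=None):
    """Build a 'load' configuration dict, adding optional keys only when set."""
    config = {'sourceUris': source_uris}
    if autodetect:
        # forward the flag so BigQuery infers the schema server-side
        config['autodetect'] = True
    if schema_fields:
        # omit 'schema' entirely when no fields were provided
        config['schema'] = {'fields': schema_fields}
    return config

config = build_load_config(['gs://bucket/data.csv'], autodetect=True)
```

With this shape, passing an empty schema never produces an empty `schema` field in the request body, which is what causes the error described above.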
Labels: question, bigquery, p2


All 22 comments

@ivvory Thank you for your submission.
Unfortunately, I am unsure what it is you want us to change. Would you be willing to rephrase?

@lukesneeringer Sorry for the bad explanation; let me try again. The problem is that when I tried to use the autodetect feature to upload data without a schema, I found the feature doesn't work correctly through the client. I tried to use autodetect by passing an empty schema, because I couldn't find how to use it through the client directly. So I decided to build the jobs myself, specifying the params manually. When I passed autodetect=True and an empty fields param (like fields=None or fields=''), I got an error. But when I omitted the empty fields param, the job completed successfully.

May be the code below from the client adds the fields param to the job in any case.

if len(self.schema) > 0:
    configuration['schema'] = {
        'fields': _build_schema_resource(self.schema)}

P.S. Thanks for autodetect feature in gc web interface :)

@ivvory Thanks, that makes sense. Thank you.

I'm trying to parse through the posts above.

Is there a way to use autodetect with the google-cloud-bigquery library at the moment?

@MaximilianR We don't have any explicit support for it (yet).

@tswast AFAICT autodetect only shows up in

  • Job.configuration.load.autodetect
  • Job.configuration.query.tableDefinitions.autodetect
  • Table.externalDataConfiguration.autodetect

The relevant methods for autodetect at creation time seem to be

  • bigquery.jobs.insert
  • bigquery.tables.insert

Is this correct? (I based this off the discovery doc.)

I believe that is correct.

There are two times when you would use the autodetect feature: creating a new table from loaded data and making an external table definition. https://cloud.google.com/bigquery/external-table-definition
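Expressed as raw REST request bodies (field names follow the discovery doc; the project, dataset, table, and URIs here are placeholders), those two creation-time uses might look like this:

```python
# load job: a jobs.insert body that asks BigQuery to infer the schema
load_job_body = {
    'configuration': {
        'load': {
            'autodetect': True,
            'sourceUris': ['gs://bucket/data.csv'],
            'destinationTable': {
                'projectId': 'PROJECT_ID',
                'datasetId': 'DATASET_ID',
                'tableId': 'TABLE_ID',
            },
        },
    },
}

# external table: a tables.insert body with an external data configuration
external_table_body = {
    'tableReference': {
        'projectId': 'PROJECT_ID',
        'datasetId': 'DATASET_ID',
        'tableId': 'TABLE_ID',
    },
    'externalDataConfiguration': {
        'autodetect': True,
        'sourceFormat': 'CSV',
        'sourceUris': ['gs://bucket/data.csv'],
    },
}
```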

@MaximilianR maybe you are interested in some notes about using autodetect.

Now you can use autodetect simply by polling the job:

```python
from time import sleep

def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)

jobs = GbqConnector(project_id=project_id).service.jobs()
params = {}  # declare the full load job body here

# use autodetect if needed
params['configuration']['load']['autodetect'] = True

response = jobs.insert(projectId=project_id, body=params).execute()
job_id = response['jobReference']['jobId']

result = poll_job(jobs, job_id)
```

---
Sometimes you want to append data that has new columns absent from the existing table. That case is more common with autodetect than with a manual schema. So you may consider the schemaUpdateOptions param with the new `'ALLOW_FIELD_ADDITION'` option, which allows fields missing from the target table to be added:

```python
if write_disposition == 'WRITE_APPEND':
    params['configuration']['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
```

One more problem comes from using these features together. I guess the workflow is as follows: if you have a table and try to load data with autodetect, the schema is first detected from the data and then compared with the schema of the existing table. That causes errors, because sometimes you don't know which type a column has in the current data. For example, if every value of some field in all processed rows is null, autodetect recognizes it as a string type (not sure exactly), while in the existing table the field actually has another type. The simplest solution that can help:
```python
def poll_job(jobs, job_id):
    while True:
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']

        # check for this situation
        if 'errorResult' in status:
            if str(status['errorResult']['message']).startswith('Invalid schema update'):
                return 'Autodetect failed'
            else:
                raise Exception(str(status))

        if status['state'] == 'DONE':
            return 'DONE'

        sleep(1)

# param declaring...

result = poll_job(jobs, job_id)

if result == 'Autodetect failed':
    # retry against the existing table's schema
    del params['configuration']['load']['autodetect']
    del params['configuration']['load']['schemaUpdateOptions']

    response = jobs.insert(projectId=project_id, body=params).execute()
    job_id = response['jobReference']['jobId']
    poll_job(jobs, job_id)
```
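The error check in that loop can also be factored into a pure function, which makes the fallback logic easier to test (a sketch; the `status` dict mirrors the shape of the `jobs.get` response used above):

```python
def classify_status(status):
    """Map a BigQuery job status dict to an outcome, raising on unexpected errors."""
    if 'errorResult' in status:
        if str(status['errorResult']['message']).startswith('Invalid schema update'):
            return 'Autodetect failed'
        raise Exception(str(status))
    if status['state'] == 'DONE':
        return 'DONE'
    return 'RUNNING'  # caller should keep polling
```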


One flexible option related to autodetect is unavailable: you cannot create a table this way without loading data. So I had to write my own schema generator, copying the workflow of the autodetect option :)
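Such a generator can stay quite small. Here is a sketch that maps Python value types onto BigQuery legacy type names (the mapping rules are my assumption, not what autodetect actually does; all-null columns fall back to STRING, matching the behavior described above):

```python
# assumed mapping from Python value types to BigQuery legacy types
_TYPE_MAP = {bool: 'BOOLEAN', int: 'INTEGER', float: 'FLOAT', str: 'STRING'}

def generate_schema(rows):
    """Derive a schema from sample dict rows, preserving column order."""
    names, types = [], {}
    for row in rows:
        for name, value in row.items():
            if name not in names:
                names.append(name)
            if value is not None and name not in types:
                # first non-null value decides the column type
                types[name] = _TYPE_MAP.get(type(value), 'STRING')
    return [{'name': n, 'type': types.get(n, 'STRING'), 'mode': 'NULLABLE'}
            for n in names]
```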

Hope you found that helpful.

Where is the GbqConnector module?

_UPDATE: the code below doesn't work. try this code instead!_

here's workaround code that works with the current released google-cloud-python (0.24.0). the key part is passing `{'configuration': {'load': {'autodetect': True}}}` to `LoadTableFromStorageJob.from_api_repr()`.

```python
from google.cloud.bigquery import Client
from google.cloud.bigquery.job import LoadTableFromStorageJob
from google.cloud.bigquery.table import Table, Dataset

client = Client(...)
job = LoadTableFromStorageJob.from_api_repr({
    'jobReference': {
        'jobId': 'UNIQUE_ID',
    },
    'configuration': {
        'load': {
            'destinationTable': {
                'datasetId': 'analytics',
                'tableId': table.name,
            },
            'autodetect': True,
        },
    },
}, client=client)
job.destination = Table('TABLE_NAME', Dataset('DATASET_NAME', client))
job.source_uris = ['file.csv', ...]

# set other options, e.g.:
job.source_format = 'CSV'
...

job.begin()
```

FWIW the latest is 0.25.0

oops, the code i gave in https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2926#issuecomment-312121757 doesn't actually work after all. i ended up having to monkey patch to get autodetect into the API call. here's that code:

```python
def _add_autodetect():
    resource = LoadTableFromStorageJob._build_resource(job)
    resource['configuration']['load']['autodetect'] = True
    return resource

job._build_resource = _add_autodetect
```
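A slightly safer variant of this patch (a sketch; it assumes `_build_resource` is a zero-argument bound method on the job instance) wraps the original method rather than calling it on the class, so any configuration the instance already builds is preserved:

```python
def patch_autodetect(job):
    """Wrap job._build_resource so the built resource carries autodetect=True."""
    original_build = job._build_resource  # keep the original bound method
    def _build_with_autodetect():
        resource = original_build()
        resource['configuration']['load']['autodetect'] = True
        return resource
    job._build_resource = _build_with_autodetect
    return job

# demonstration with a stand-in job object (not a real client class)
class _FakeJob:
    def _build_resource(self):
        return {'configuration': {'load': {'sourceUris': ['gs://bucket/data.csv']}}}

patched = patch_autodetect(_FakeJob())
resource = patched._build_resource()
```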

can't wait for official support! :P

@tswast Is what is left of this issue covered in the redesign? If so, I would like to close this out.

@snarfed Which file (and what directory is that file in) did you change to add this new `_add_autodetect` function?

@hemanthk92 it goes in your own code, not bigquery/google cloud's...but ignore it! they've fixed this bug. just set job.autodetect = True on your LoadTableFromStorageJob now.

@snarfed thanks for your response.
I'm on google-cloud-bigquery==0.28.0
I don't see an autodetect parameter as an option.
_(screenshot: autocomplete listing no autodetect parameter, 12 Dec 2017)_

@hemanthk92 It is a property on LoadJobConfig in 0.28.0. The migration guide has a sample for providing a configuration.

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True

with open('data_sample.txt', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)  # Start the job.

job.result()  # Wait for the job to complete.
```

I am using streaming insertion. Is there any way to use this schema autodetect with streaming?

Sorry, @ravi45722 the BigQuery API does not have an auto-detect feature for the streaming API. I recommend filing an issue requesting this feature at https://issuetracker.google.com/issues/new?component=187149&template=0

@yiga2 FWIW, that feature is marked "experimental" in the docs.

The workaround you propose won't work; instead, I would use:

```python
job_config._properties['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
```

@tseaver did you mean to reply to #4728?
