I know there is an issue with inserting data after deleting and recreating a table:
https://github.com/googleapis/google-cloud-python/issues/3822
The workaround, as we can see in the comments, is to wait for 2 minutes.
Here I had the same problem inserting data, but it does not seem to be related to time.
After updating a table, I always have a problem inserting data with the insert_rows method.
So I made a test.
1) Update Table:
```python
table.schema = schema
bigquery_client.update_table(table, ['schema'])
```
2) Try to insert data by calling insert_rows twice:
```python
bigquery_client.insert_rows(table, payload)
bigquery_client.insert_rows(table, payload)
```
After a table update, the payload is always written only once: the first call never works, only the second one does.
I think it shows that this case is not related with time for buffer cache. The table issue seems solved after first insert_rows operation.
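For reference, here is a minimal self-contained version of this test. It is only a sketch: the project, dataset, and table names are placeholders from my setup, and the new field is the phone field I mention below.

```python
from google.cloud import bigquery

bigquery_client = bigquery.Client()

# fetch the current table and append the new field to its schema
table_ref = bigquery_client.dataset("my_dataset").table("my_table")
table = bigquery_client.get_table(table_ref)
table.schema = list(table.schema) + [bigquery.SchemaField("phone", "STRING")]
table = bigquery_client.update_table(table, ["schema"])

payload = [{"phone": "555-0100"}]

# first call right after the update: fails with a "no such field" error
print(bigquery_client.insert_rows(table, payload))
# second call: returns [] and the row is actually written
print(bigquery_client.insert_rows(table, payload))
```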
I can't leave that workaround in the code, because if I didn't update the table, that code would duplicate the payload in my table. Also, it is strange to keep the code with that workaround.
@tswast Does this ring any bells for you?
@bgbraga Does this issue still occur if you wait 2 minutes? I believe the buffer issue might affect table updates as well as recreating a table, especially if the table update adds a new column to the schema. If this is the case, I'd expect the first call to insert_rows to fail with an error about an unexpected column.
/cc @shollyman, who is working on additional streaming docs.
> I can't leave that workaround in the code, because if I didn't update the table, that code would duplicate the payload in my table. Also, it is strange to keep the code with that workaround.
If you specify a manual row ID, BigQuery will de-duplicate the writes.
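A sketch of what I mean, using insert_rows_json (the ID here is a placeholder; in practice you would derive a stable ID from the row's content or your source system):

```python
rows = [{"phone": "555-0100"}]
# stable, caller-generated insert IDs let the backend de-duplicate retries
row_ids = ["user-42-phone-update"]

errors = bigquery_client.insert_rows_json(table, rows, row_ids=row_ids)
```

With the same row IDs, retrying the request within the de-duplication window should not create duplicate rows.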
@tswast sorry, I still have to test what happens if I wait for 2 minutes.
But if, immediately after a table update (adding a field), I have two insert_rows calls returning success but only the second one writing data, that breaks the 2-minute theory... or not?
Two minutes could be the time to process/update the cache area.
But inserting data into this cache area also seems to force the update.
So after the first insert, the cache area is ready to receive data.
To me it is very strange. If there is a way to force the cache update, it should be done before the insert call, to fix this problem without losing any data.
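What I have in mind is something like this (just a sketch of the idea; I am assuming that re-fetching the table metadata right before inserting might refresh whatever cache is involved, but I have not confirmed that it helps):

```python
# re-fetch the table right before inserting, hoping to refresh the cache
table = bigquery_client.get_table(table.reference)
response = bigquery_client.insert_rows(table, payload)
```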
What does the response of the first call to bigquery_client.insert_rows(table, payload) look like? Note that you need to check for errors in the response manually because insert_rows is designed to support partial failures.
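For example:

```python
errors = bigquery_client.insert_rows(table, payload)
if errors:
    # insert_rows returns one mapping per failed row, with the row index
    # and the list of errors for that row
    for entry in errors:
        print(entry["index"], entry["errors"])
```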
@tswast you are right about the first insert_rows call. This is the error:
```python
[{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'phone', 'debugInfo': '', 'message': 'no such field.'}]}]
```
Note: phone is the field that I'm inserting.
But the second insert_rows call, made immediately afterwards, always works.
```python
response = self.bigquery_client.insert_rows(table, payload)
if self.context.table_updated is True:  # hack to force insert in case of table update
    response = self.bigquery_client.insert_rows(table, payload)
```
Apparently the first call is forcing the cache area update or something like that.
So I made two extra tests:
1) Add a sleep time (5 seconds) before the first insert_rows call.
Purpose: check whether the second insert_rows only works because it is called about 1 second after the first one, and that delay is enough to update the table.
```python
import time

time.sleep(5)
response = self.bigquery_client.insert_rows(table, payload)
if self.context.table_updated is True:  # hack to force table refresh
    response = self.bigquery_client.insert_rows(table, payload)
```
Result: no. The first call still fails.
2) Add a sleep time (2 minutes) before the first insert_rows call.
Purpose: validate the 2-minute theory about the cache area.
```python
import time

time.sleep(121)
response = self.bigquery_client.insert_rows(table, payload)
if self.context.table_updated is True:  # hack to force table refresh
    response = self.bigquery_client.insert_rows(table, payload)
```
Result: no. The first call still fails, even when I wait 2 minutes before calling insert_rows.
Conclusion:
In my tests, the first insert_rows after a table update always fails; after that failure, everything works again.
Maybe that failure forces the cache update. I don't know.
Do you have any guess about this?
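For now, the least bad workaround I can think of is to retry only when the response reports that specific error, instead of blindly calling insert_rows twice. A sketch (insert_with_retry is just my own helper, not part of the library):

```python
import time

def insert_with_retry(client, table, payload, retries=3):
    # retry only while the insert fails with the schema-mismatch error
    errors = client.insert_rows(table, payload)
    for _ in range(retries):
        if not errors:
            return []
        reasons = {err["reason"] for entry in errors for err in entry["errors"]}
        if reasons != {"invalid"}:
            return errors  # a different failure; don't blindly retry
        time.sleep(1)
        errors = client.insert_rows(table, payload)
    return errors
```

Note that re-sending the whole payload could duplicate rows that already succeeded, unless manual row IDs are used as @tswast suggested.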
In my tests I'm calling Google remotely from my machine, and my code uses these Google API versions:
```
google==2.0.1
google-api-core==1.4.0
google-api-python-client==1.7.4
google-auth==1.5.1
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.9.0
google-cloud-core==0.28.1
google-cloud-datastore==1.7.0
google-cloud-error-reporting==0.30.0
google-cloud-logging==1.7.0
google-cloud-storage==1.12.0
```
My files (data1.json is the original file, data2.json is the new file with the phone field, and schema.txt is the new schema that I used to update the table):
data1.json.txt
data2.json.txt
schema.txt
Are there any updates on this issue?
I have the same problem using the Go library ;/
I have this problem with the Python Apache Beam SDK when trying to update a table schema as well. Note that this doesn't happen with the Java SDK.
This is a known issue with the backend, due to the same problem as deleting and recreating a table. https://issuetracker.google.com/64329577#comment3
A workaround is to update the table by using a load job with the "ALLOW_FIELD_ADDITION" value set in schema_update_options rather than updating the schema and inserting. https://github.com/googleapis/google-cloud-python/blob/7ba0220ff9d0d68241baa863b3152c34dc9f7a1a/bigquery/docs/snippets.py#L1425-L1431
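Roughly like this (a sketch; it assumes a client version that has load_table_from_json, and rows, new_schema, and table_id are placeholders for your data, the schema including the added column, and the destination table):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    schema=new_schema,  # schema including the newly added column
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_json(rows, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
```

Unlike streaming inserts, the load job applies the schema change and the data write together, so it avoids the stale-schema window that streaming hits.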
This is closed? How is this an acceptable issue? Data gets sprayed into the void.
@lordnynex This repo only holds issues for bugs in the client. Since this is a backend issue, it needs to go to the API team by going through support https://cloud.google.com/support/ or filing an issue here: https://issuetracker.google.com/issues/new?component=187149&template=1162659