I know there is an issue with inserting data after deleting and recreating a table:
https://github.com/googleapis/google-cloud-python/issues/3822
The workaround, as we can see in the comments, is to wait for 2 minutes.
Here I had the same problem inserting data, but it does not seem to be related to time.
After updating a table, I always have a problem inserting data with the insert_rows method.
So I made a test.
1) Update Table:
```python
table.schema = schema
bigquery_client.update_table(table, ['schema'])
```
2) Try to insert data by calling insert_rows twice:
```python
bigquery_client.insert_rows(table, payload)
bigquery_client.insert_rows(table, payload)
```
After a table update, the payload is always written only once: the first call never works, only the second one does.
I think it shows that this case is not related with time for buffer cache. The table issue seems solved after first insert_rows operation.
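For reference, here is a minimal self-contained version of this test. It is only a sketch: the project, dataset, and table names are placeholders from my setup, and the new field is the phone field I mention below.

```python
from google.cloud import bigquery

bigquery_client = bigquery.Client()

# fetch the current table and append the new field to its schema
table_ref = bigquery_client.dataset("my_dataset").table("my_table")
table = bigquery_client.get_table(table_ref)
table.schema = list(table.schema) + [bigquery.SchemaField("phone", "STRING")]
table = bigquery_client.update_table(table, ["schema"])

payload = [{"phone": "555-0100"}]

# first call right after the update: fails with a "no such field" error
print(bigquery_client.insert_rows(table, payload))
# second call: returns [] and the row is actually written
print(bigquery_client.insert_rows(table, payload))
```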
I can't leave that workaround in the code, because if I didn't update the table, that code would duplicate the payload in my table. Also, it is strange to keep the code with that workaround.
@tswast Does this ring any bells for you?
@bgbraga Does this issue still occur if you wait 2 minutes? I believe the buffer issue might affect table updates as well as recreating a table, especially if the table update adds a new column to the schema. If this is the case, I'd expect the first call to insert_rows to fail with an error about an unexpected column.
/cc @shollyman, who is working on additional streaming docs.
> I can't leave that workaround in the code, because if I didn't update the table, that code would duplicate the payload in my table. Also, it is strange to keep the code with that workaround.
If you specify a manual row ID, BigQuery will de-duplicate the writes.
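A sketch of what I mean, using insert_rows_json (the ID here is a placeholder; in practice you would derive a stable ID from the row's content or your source system):

```python
rows = [{"phone": "555-0100"}]
# stable, caller-generated insert IDs let the backend de-duplicate retries
row_ids = ["user-42-phone-update"]

errors = bigquery_client.insert_rows_json(table, rows, row_ids=row_ids)
```

With the same row IDs, retrying the request within the de-duplication window should not create duplicate rows.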
@tswast sorry, I still have to test what happens if I wait for 2 minutes.
But if, immediately after a table update (adding a field), I have two insert_rows calls returning success but only the second one writing data, that breaks the 2-minute theory... or not?
Two minutes could be the time to process/update the cache area.
But inserting data into this cache area also seems to force the update.
So after the first insert, the cache area is ready to receive data.
To me it is very strange. If there is a way to force the cache update, it should be done before the insert call, to fix this problem without losing any data.
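What I have in mind is something like this (just a sketch of the idea; I am assuming that re-fetching the table metadata right before inserting might refresh whatever cache is involved, but I have not confirmed that it helps):

```python
# re-fetch the table right before inserting, hoping to refresh the cache
table = bigquery_client.get_table(table.reference)
response = bigquery_client.insert_rows(table, payload)
```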
What does the response of the first call to bigquery_client.insert_rows(table, payload) look like? Note that you need to check for errors in the response manually because insert_rows is designed to support partial failures.
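For example:

```python
errors = bigquery_client.insert_rows(table, payload)
if errors:
    # insert_rows returns one mapping per failed row, with the row index
    # and the list of errors for that row
    for entry in errors:
        print(entry["index"], entry["errors"])
```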
@tswast you are right about the first insert_rows call. This is the error:
```python
[{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'phone', 'debugInfo': '', 'message': 'no such field.'}]}]
```
Note: phone is the field that I'm inserting.
But the second insert_rows call, made immediately afterwards, always works.
```python
response = self.bigquery_client.insert_rows(table, payload)
if self.context.table_updated is True:  # hack to force insert in case of table update
    response = self.bigquery_client.insert_rows(table, payload)
```
Apparently the first call is forcing the cache area update or something like that.
So I made two extra tests:
1) Add a sleep time (5 seconds) before the first insert_rows call.
Purpose: check whether the second insert_rows only works because it is called about 1 second after the first one, and that delay is enough to update the table.
```python
import time

time.sleep(5)
response = self.bigquery_client.insert_rows(table, payload)
if self.context.table_updated is True:  # hack to force table refresh
    response = self.bigquery_client.insert_rows(table, payload)
```
Result: no. The first call still fails.
2) Add a sleep time (2 minutes) before the first insert_rows call.
Purpose: validate the 2-minute theory about the cache area.
```python
import time

time.sleep(121)
response = self.bigquery_client.insert_rows(table, payload)
if self.context.table_updated is True:  # hack to force table refresh
    response = self.bigquery_client.insert_rows(table, payload)
```
Result: no. The first call still fails, even when I wait 2 minutes before calling insert_rows.
Conclusion:
In my tests, the first insert_rows after a table update always fails; after that failure, everything works again.
Maybe that failure forces the cache update. I don't know.
Do you have any guess about this?
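For now, the least bad workaround I can think of is to retry only when the response reports that specific error, instead of blindly calling insert_rows twice. A sketch (insert_with_retry is just my own helper, not part of the library):

```python
import time

def insert_with_retry(client, table, payload, retries=3):
    # retry only while the insert fails with the schema-mismatch error
    errors = client.insert_rows(table, payload)
    for _ in range(retries):
        if not errors:
            return []
        reasons = {err["reason"] for entry in errors for err in entry["errors"]}
        if reasons != {"invalid"}:
            return errors  # a different failure; don't blindly retry
        time.sleep(1)
        errors = client.insert_rows(table, payload)
    return errors
```

Note that re-sending the whole payload could duplicate rows that already succeeded, unless manual row IDs are used as @tswast suggested.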
In my tests I'm calling Google remotely from my machine, and my code uses these Google API versions:
```
google==2.0.1
google-api-core==1.4.0
google-api-python-client==1.7.4
google-auth==1.5.1
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-bigquery==1.9.0
google-cloud-core==0.28.1
google-cloud-datastore==1.7.0
google-cloud-error-reporting==0.30.0
google-cloud-logging==1.7.0
google-cloud-storage==1.12.0
```
My files (data1.json is the original file, data2.json is the new file with the phone field, and schema.txt is the new schema that I used to update the table):
data1.json.txt
data2.json.txt
schema.txt
Are there any updates on this issue?
I have the same problem using the Go library ;/
I have this problem with the Python Apache Beam SDK when trying to update a table schema as well. Note that this doesn't happen with the Java SDK.
This is a known issue with the backend, due to the same problem as deleting and recreating a table. https://issuetracker.google.com/64329577#comment3
A workaround is to update the table by using a load job with the "ALLOW_FIELD_ADDITION" value set in schema_update_options rather than updating the schema and inserting. https://github.com/googleapis/google-cloud-python/blob/7ba0220ff9d0d68241baa863b3152c34dc9f7a1a/bigquery/docs/snippets.py#L1425-L1431
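Roughly like this (a sketch; it assumes a client version that has load_table_from_json, and rows, new_schema, and table_id are placeholders for your data, the schema including the added column, and the destination table):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    schema=new_schema,  # schema including the newly added column
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_json(rows, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
```

Unlike streaming inserts, the load job applies the schema change and the data write together, so it avoids the stale-schema window that streaming hits.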
This is closed? How is this an acceptable issue? Data gets sprayed into the void.
@lordnynex This repo only holds issues for bugs in the client. Since this is a backend issue, it needs to go to the API team by going through support https://cloud.google.com/support/ or filing an issue here: https://issuetracker.google.com/issues/new?component=187149&template=1162659