Google-cloud-go: bigquery: detecting when the streaming buffer is empty

Created on 10 Feb 2017  路  11Comments  路  Source: googleapis/google-cloud-go

With BigQuery, how can I poll when the streaming buffer is empty and it's safe to copy a table without dropping down to the lower level library?
I presume it should return a Job you can Wait() on similar to copies?

bigquery question

Most helpful comment

Hi Jonathan,
Let's say you've streamed data, then you'd like to clean data up by deleting some duplicates / updating some records - i.e. you need to execute DML query (UPDATE and/or DELETE) when streaming buffer isn't empty. In such case BigQuery will throw "Error 400: UPDATE or DELETE statements are not supported on table XYZ with streaming buffer". And you have to wait ~90 minutes to do that. So it would be nice to have API to flush streaming buffer and make table available for DELETE/UPDATE DML.

All 11 comments

We blew it on this one. We didn't mean to ignore you. Do you still have this question? If so, can you explain what you mean by "the streaming buffer"? If you're calling Uploader.Put, it should be synchronous.

BigQuery doesn't stream directly into their long term storage, they first put it into a write optimized store and periodically flush that to the main storage. Queries are able to immediately use the streaming buffer, but since other BQ functions ignore the buffer, he's wanting to wait for the buffer to clear.

For anyone interested in the details, this is a great article.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert

@mikebell-org We didn't write a poll method, but you can now find streaming buffer info in TableMetadata, so you can write your own polling. (Feel free to lift our own Retry function. Sorry it's not public.)

Ok, thank you. Is it possible to flush it if it isn't empty? What if process needs bulk UPDATE after streaming?

There's no way to flush it.

What if process needs bulk UPDATE after streaming?

Can you explain more what you mean by this?

Hi Jonathan,
Let's say you've streamed data, then you'd like to clean data up by deleting some duplicates / updating some records - i.e. you need to execute DML query (UPDATE and/or DELETE) when streaming buffer isn't empty. In such case BigQuery will throw "Error 400: UPDATE or DELETE statements are not supported on table XYZ with streaming buffer". And you have to wait ~90 minutes to do that. So it would be nice to have API to flush streaming buffer and make table available for DELETE/UPDATE DML.

OK, I see what you mean.

This is really a question for the BigQuery service itself, not the Go client. But I'll attempt an answer.

The reason there is unlikely to be a flush is that the streaming buffer is a performance optimization designed to keep queries fast, by batching changes to the underlying storage. So if there was a flush, everyone would use it, and the performance optimization would be gone.

Have you looked at using a load job as an alternative to a streaming insert?

Thank you Jonathan. I see the reason, of course it should be expensive operation however "impossible" sounds like impossible. To be honest, my plan B is to put more functionality on the client side to minimize DML necessity.

@jba I am doing some testing, using dataflow to stream to bigquery from pubsub, and I'd like to have a process to publish some message, test what appears in bigquery, and then reset it (i.e. delete/truncate the bigquery table).
The performance value of the streaming buffer makes total sense, and the tradeoffs are worth it, but wondering what to do in this minor use case. I've been using in-line SQL e.g. DELETE FROM {table_name} WHERE 1=1 , the fake where clause is because Bigquery requires a where clause on a delete statement. The query works if the streaming buffer is empty - 90 minutes later after I stop publishing messages to pubsub. If it's not empty, it throws the error.

Is there a way to delete records that are already materialized, ignoring the streaming buffer records? That's probably not very useful, just wondering.

Would deleting & recreating the table be an alternative? Is it going to let me drop & recreate it within that 90 minutes? I will have a try.

I would be also interested in an option to do iterative development on streaming inserts - i.e. being able to flush the buffer.

Was this page helpful?
0 / 5 - 0 ratings