If https://github.com/google/google-api-go-client/issues/142 doesn't get fixed, it might be useful to retry inserting rows on transient errors. The AWS API does this transparently, and lots of users work around this issue in different ways for the Google APIs in Go.
Hi Ingo,
We're moving in this direction. We're rebuilding our clients on top of a framework that will handle retries for many RPCs. Look for that to land in the next few months.
@guregu also wants streaming to work.
Just FYI: I added an ugly hack along the lines of
// isHTTP2ConnectionClose implements temporary classification by string matching
// for errors that are misclassified due to missing propagation of Temporary in google apis.
func isHTTP2ConnectionClose(err error) bool {
	if err == nil {
		return false
	}
	es := err.Error()
	if !strings.Contains(es, "http2: server sent GOAWAY and closed the connection; ") {
		return false
	}
	switch {
	case strings.Contains(es, `ErrCode=NO_ERROR, debug="max_age"`):
		return true
	case strings.Contains(es, `ErrCode=NO_ERROR, debug="session_timed_out"`):
		return true
	}
	return false
}
to detect BigQuery expiring connections on timeout and/or connection age while streaming. I am aware this is a gross hack and totally against the HTTP2 protocol. But it doesn't change the fact that this condition is always temporary (server pruning connections, Go 1.7.1 http2 client detecting old connection on sending the first bytes).
This is a great example of why retry should be configured at the API client level, but managed by the API clients.
@jba Any update on the status of retrying on failure?
No, but I'll bump the priority of this.
This is now at the top of my list.
Isn't there a concern that a row will be inserted multiple times if we retry?
@jba That's always a possibility with the streaming API, so it's not changing the contract at all. Retrying immediately means that the InsertID is more likely to filter out duplicate entries. If the first try fails, most people will have it try again later from a task queue, at which point the InsertID deduping window will have passed, so this should decrease duplicate entries rather than increase them.
Is there an easy way to replicate an error? I've uploaded a couple of million rows now with no errors. I'd like to have something I can see before changing the code.
Not that I'm aware of. This is my string matching retry code that hits every so often.
switch {
// These are identified network errors, so we'll loop and try again
case strings.Contains(errStr, "i/o timeout"):
	logger.WarningInfo(c, "Google do() attempt #%d :: '%s'", i+1, "i/o timeout")
case strings.Contains(errStr, "connection reset by peer"):
	logger.WarningInfo(c, "Google do() attempt #%d :: '%s'", i+1, "connection reset by peer")
// There was an error not due to a network timeout, so retrying won't help
default:
	return nil, ge
}
OK, giving up trying to reproduce. https://code-review.googlesource.com/10132 does retrying by the book, just for uploading rows. @nightlyone @derekperkins I'm uncomfortable introducing error-string matching into the BigQuery client.
LGTM. I don't like the string matching either, those were just the errors I encountered. I also use a global client for Google requests, so I couldn't just rely on the BQ definitions.
Also, I love that it retries indefinitely. Using context to cancel the request is 100x better than having another option to pass into Uploader or having a hardcoded retry limit.
@jba where did you get the retry delay values from? Are these from the BigQuery team?
I made them up. The docs say "wait a few seconds" so I thought those values made sense. Do you have another suggestion?
@jba BigQuery claims to be able to insert up to 100k rows/s, so at that rate you will quickly hit memory limits with multiple-second waits. I have a minimum wait of 20 milliseconds and go up to 2 seconds max. After this my queues fill up and I need to push back.
Getting official numbers on rates and retries at which clients will be throttled from the api would inform this decision better. Could you have an internal chat with the BigQuery team to get these?
@nightlyone I think that's the value of using context, at least for the upper limit. For people not maxing out 100k/sec, I think his retry defaults are pretty reasonable. If you wanted to limit it to a 2 sec max, you could easily send in a context created with context.WithTimeout or context.WithDeadline.
@derekperkins I understand that I can configure the maximum now. But I would also like to configure the intervals or get confirmation that they cannot be lower to avoid being throttled (at which point we need to reconsider our architecture).
@nightlyone Finally had that chat. They recommend a 1 second initial sleep for retry, which is what we do. If you are experiencing memory issues, then I suggest gating your inserts on a semaphore.