Google-cloud-go: Controlling backoff and retry parameters spanner client

Created on 11 Nov 2018  路  25Comments  路  Source: googleapis/google-cloud-go

Client

Spanner go client

Is there any way to control retry and backoff parameters when I am creating a client like the following.

     client, err := spanner.NewClientWithConfig(ctx, opts.DatabasePath(),
         spanner.ClientConfig{NumChannels: numSpannerChannels, SessionPoolConfig: spanner.SessionPoolConfig{
             MaxOpened:     maxSpannerSessions,
             MinOpened:     minSpannerSessions,
             MaxBurst:      maxSpannerSessionBurst,
             WriteSessions: spannerWriteSessions,
         }})

Our usage might have some aborted transactions because multiple parallel calls trying to update similar records , and we want to fail fast and retry by first looking up using read txn.

spanner feature request

Most helpful comment

Quick note: I've sent out https://code-review.googlesource.com/c/gocloud/+/38210, which further lowers min backoff from 100ms to 20ms.

All 25 comments

Also we are getting a lot of unavailable errors from spanner, with 502.Bad gateway and retries from the client happen with a 1 second backoff and which is making our p99 write latencies bad.

We are also seeing lot of errors of the type lastErr is < rpc error: code = Unavailable desc = The server cannot handle this request right now. Please retry with backoff > There's not much info on why spanner is throwing that error. Any information regarding this would be helpful

Is there any way to control retry and backoff parameters

Unfortunately not at the moment. I'll label this a feature request.

Also we are getting a lot of unavailable errors from spanner, with 502.Bad gateway and retries from the client happen with a 1 second backoff and which is making our p99 write latencies bad.

Could you create a separate issue for this?

We are also seeing lot of errors of the type "lastErr is ". There's not much info on why spanner is throwing that error. Any information regarding this would be helpful

Just to check - are you seeing "lastErr is ", or "lastErr is "?

@jadekler
Hey I edited the comment above with actual error. Sorry my markdown skillset is too bad. Apparently
< > without spaces doesn't include the content between it

Thank you!

Is this acceptable feature request? If so I have an interested in sending a patch.

Regrettably the design here is not yet clear to me. If we choose to allow retrying in one client we must consider all clients. If you have a detailed design, please discuss it here.

The current defaultBackoff is like this.

var defaultBackoff = exponentialBackoff{minBackoff, maxBackoff}

The exponentialBackoff just has delay method to determine duration from retry count.
So I'm thinking to introduce an interface:

type backoff interface {
   Delay(retries int) time.Duration
}

Then add initialization function like SetDefaultBackoff() to set defaultBackoff. This is intended to be called in init(). This is package-level configuration, but I think it's enough.

How do you think?

There are problems with that: users expect to set configuration in NewClient calls, not at init time. Also, any design here should be applicable to other clients. Probably we should make use of gax's CallOption, but I'll have to investigate some.

@jadekler

How about to add new fields to ClientConfig and Client like below and propagate the backoff configurations to runRetryable and resumableStreamDecoder?

type Client struct {
    // ...
    backoff exponentialBackoff
}

type ClientConfig struct {
    // ...
    MinBackoff int64
    MaxBackoff int64
}

If this is acceptable, I'll try to make patch.

That's a possibility that is similar to Pub/Sub, which is good. I unfortunately haven't yet had the time to explore this enough to definitively say that's a good way forward, though.

@jadekler

I've created a patch and sent to Gerrit: https://code-review.googlesource.com/c/gocloud/+/36750.
I'd appreciate it if you could review this CL.

I can't see any example of an existing package exposing retry/backoff settings.

I think it requires more thought and deliberation.

@broady

I've observed that Cloud Spanner returns errors in variety of cases.
For example, I've sometimes observed calling BeginTransaction failed unexpectedly and calling StreamingRead aborted due to stale transactions.

Since the default backoff duration of this library is set to 1s, it always takes additional 1s latency whenever errors described above occurred.
I think that this additional latency is too high for user-facing requests.

Furthermore, since my team is architecting systems in Microservices manner, my systems call Cloud Spanner many times per one user request.
So, if several of my microservices take additional latency due to backoff, the total latency will be quite high.

Therefore, I think that this library should allow to set backoff for each requirements.

Does the first retry takes too long to start? i.e. min backoff is too high?

Also, which error codes are being returned? Does the BeginTransaction failure return Aborted, for example?

To make the problem simple, I fixed my comment..

We have observed all methods including BeginTransaction, StreamingRead, Commit returns Aborted. This library retries when aborted for all methods, but sees the backoff duration from the error(RetryInfo) only for Commit. That results in too long backoff for Aborted error except for Commit method.

Is it correct for us to just fix the backoff for those calls? It sounds like a broken configuration, and not just for you, but all users of the spanner package.

I'm trying to figure out if we just need to fix the defaults, rather than needing to add customization for backoffs (which can also lead to poor outcomes, as well as being difficult to design in a cohesive and complete way).

FWIW https://code-review.googlesource.com/c/gocloud/+/38130 reduces the minBackoff from 1s to 100ms, bringing it more inline with our other clients.

Is it correct for us to just fix the backoff for those calls? It sounds like a broken configuration, and not just for you, but all users of the spanner package.

No. Our requests are just that how to avoid 1 second back off for Avorted error. That is the problem everyone MUST suffer from.

I'm trying to figure out if we just need to fix the defaults, rather than needing to add customization for backoff.

You are right. That is just one of solution for that.
I know that the problem can be solved by fix error handling because we have observed all methods such as StreamingRead or BeginTransacstion also have RetryInfo in the error when Aborted happens.

Why we (just) suggest to allow us to configure default back off is:

  1. When Aborted happens, the RetryInfo in the error is always 0 sec for now at least while we have observed. So there is no problem for too small back off. But off course, it will be bad if server side wants to control duration in future.
  2. Configuring back off also solves the issue https://github.com/googleapis/google-cloud-go/issues/1223 . This is why this issue is complicated. We really really want to fix both problems and both are different off course. When we ask about configuring backoff, you seem feeling not bad. So then we suggest this solution.

re: #1223, does your application perform well if you don't wrap the errors?

Even if we don't wrap the errors, this problem still exists. It works only for Commit error.

We should try going even lower than 100ms.

Spanner team suggested 20-50ms.

My small app that I used to reproduce the issue returns Aborted only on Commit and Update (not for Query)

As noted above, RetryInfo for Commit is already respected. The RetryInfo returned by the Spanner server for Update calls appears to be very consistent, between 10-15ms, so it sounds like 20ms might be very reasonable.


Log of RetryInfo values for Update calls

2019/02/07 17:49:26 Update trailers: 11.997761ms, true
2019/02/07 17:49:28 Update trailers: 11.997453ms, true
2019/02/07 17:49:34 Update trailers: 10.998325ms, true
2019/02/07 17:49:39 Update trailers: 15.998403ms, true
2019/02/07 17:49:39 Update trailers: 10.994829ms, true
2019/02/07 17:49:41 Update trailers: 11.996834ms, true
2019/02/07 17:49:47 Update trailers: 10.998464ms, true
2019/02/07 17:49:48 Update trailers: 14.998592ms, true
2019/02/07 17:49:53 Update trailers: 11.998235ms, true
2019/02/07 17:49:56 Update trailers: 10.998543ms, true
2019/02/07 17:49:58 Update trailers: 10.997684ms, true
2019/02/07 17:50:06 Update trailers: 12.997582ms, true
2019/02/07 17:50:06 Update trailers: 12.998171ms, true
2019/02/07 17:50:08 Update trailers: 13.998741ms, true
2019/02/07 17:50:09 Update trailers: 15.998094ms, true
2019/02/07 17:50:10 Update trailers: 10.99823ms, true
2019/02/07 17:50:11 Update trailers: 14.998151ms, true
2019/02/07 17:50:12 Update trailers: 13.998703ms, true
2019/02/07 17:50:13 Update trailers: 13.998546ms, true
2019/02/07 17:50:19 Update trailers: 11.998267ms, true
2019/02/07 17:50:20 Update trailers: 9.998016ms, true
2019/02/07 17:50:21 Update trailers: 15.998502ms, true
2019/02/07 17:50:22 Update trailers: 12.997663ms, true
2019/02/07 17:50:24 Update trailers: 10.998756ms, true
2019/02/07 17:50:29 Update trailers: 11.998432ms, true
2019/02/07 17:50:31 Update trailers: 15.998712ms, true
2019/02/07 17:50:36 Update trailers: 11.998063ms, true
2019/02/07 17:50:37 Update trailers: 12.997537ms, true
2019/02/07 17:50:41 Update trailers: 15.998469ms, true
2019/02/07 17:50:42 Update trailers: 11.99858ms, true
2019/02/07 17:50:46 Update trailers: 12.998666ms, true
2019/02/07 17:50:47 Update trailers: 10.99805ms, true
2019/02/07 17:50:50 Update trailers: 11.997795ms, true

Quick note: I've sent out https://code-review.googlesource.com/c/gocloud/+/38210, which further lowers min backoff from 100ms to 20ms.

Marking this as closed/fixed. It looks like we could reasonably go down to 10ms, but it seems 20ms is quite good for now.

As I wrote in #1309, we could also respect the RetryInfo from the server, but I don't think it provides very much value right now.

Was this page helpful?
0 / 5 - 0 ratings