Google-cloud-go: storage: High spontaneous latencies

Created on 8 Sep 2017 · 15Comments · Source: googleapis/google-cloud-go

We're currently facing issues with high latencies on GCS operations for 10-30 minutes a time, multiple times a day, causing timeouts on our APIs. Most issues surprisingly seem to happen with delete operations, even though they should be the easiest to process.

99th percentile
95th percentile
90th percentile
50th percentile (for comparison)

Our upload/delete code is similar to this:

func Upload(ctx context.Context, path string, src []byte) error {
    wc := s.bucket.Object(path).NewWriter(ctx)
    wc.ContentType = contentType
    wc.CacheControl = "max-age=31556926" // 1 year

    wc.Write(src)
    return wc.Close()
}

func Delete(ctx context.Context, path string) error {
    return s.bucket.Object(path).Delete(ctx)
}

Are you aware of anyone else having these troubles? Or do you have a solution for this?
I personally suspect these issues to stem from my request for #701, which - due to retrying on REFUSED_STREAM, causes those high latencies (tenfold even!). Could that be the case? And if so: Could it be solved somehow?

bigtable storage question

Source

lhecker

👍1

All 15 comments

No one else has mentioned this problem (yet).

Could you instrument shouldRetry to see if your theory is correct?

jba on 8 Sep 2017

@jba I just deployed our service with an instrumented shouldRetry function. I‘ll keep you posted. 🙂

lhecker on 11 Sep 2017

@jba As already suspected it indeed is caused by spontaneous excessive amounts of REFUSED_STREAM errors. In our case 40% of uploads and 10% of all deletes fail that way, or 25 and 5 requests per second respectively.
The retrying that was added sadly does not solve the problem - it only prolongs it until the deadline runs out. For instance the above deletes run with a deadline of 25s, which is still seemingly easily exceeded.
Our applications are furthermore even running within the Google Container Engine by the way.

Do you have ideas as to why that is the case, or how it could be solved?
Could it be an infrastructure problem on your side? Something like your GCS load-balancer/front-proxy being overloaded? Because HTTP/2 is using fewer connections than HTTP/1?
Either way this is a very uncomfortable problem since sometimes 40% of all our requests fail to complete within an arguably generous deadline of 5s. 😕

lhecker on 12 Sep 2017

Bringing it up internally. Will keep you posted.

jba on 12 Sep 2017

We are seeing similar problems with uploads, high rate of failure that can spike up to 30% (but probably hovers around 1-5%). Our code looks something like:

our timeout is long, 2min, but still see failures.

ctx, _ := context.WithTimeout(context.Background(), time.Duration(config.GcsTimedContextSec)*time.Second)
writer := object.NewWriter(ctx)
writer.ChunkSize = 0
for i := 0; i < config.GcsWriteRetries; i++ {
if _, err := writer.Write(body); err != nil {
log.WithFields(
logrus.Fields{
"reason": err.Error(),
"uri": uri,
"bodySize": len(body),
"retries": config.GcsWriteRetries,
"bucketName": config.HlsBucket,
"attempt": i,
},
).Error(errGcsWrite)
continue
} else {
log.WithFields(
logrus.Fields{
"reason": "wrote to gcs",
"uri": uri,
"bodySize": len(body),
"retries": config.GcsWriteRetries,
"bucketName": config.HlsBucket,
"attempt": i,
},
).Info(errGcsWrite)
}
return
}

chasmastin on 15 Sep 2017

Is there any update or progress on this? we are still seeing a very high level of uploads fail with REFUSED_STREAM, and very high latencies as suggested above.

chasmastin on 11 Dec 2017

@chasmastin We "solved" the issue in the interim, by keeping a running average of failed GCS requests. If that value is too high we forcefully instantiate a new connection pool.
You can do so by providing the GCS library with your own http.Transport using the option.WithHTTPClient() option each time, causing a new HTTP pool to be created and used.

This did not solve the underlying issue though of course.

lhecker on 11 Dec 2017

We have seen somewhat similar things with the bigtable library. This uses GRPC. We have been struggling with occasional latency for ~30 minutes where our upper percentiles timeout. We often see this across hosts client, but have occasionally seen it against one particular host.

We have been heavily instrumenting the connection logic and the request logic. It is very hard to pinpoint what is actually going on. For bigtable, our fear was either large rows or hot keys but this does not appear to be the case. Our current main suspect is the GRPC connection. I wonder if we are seeing similar issues manifested in different API's.

jonomacd on 6 Mar 2018

We are seeing the same behavior in BigTable, in the graph below you can see the periods of high latency. We have a 1 minute timeout set, so it never exceeds 60s, but these periods of high latency are occurring fairly often. Not that all of the requests plotted here are BigTable reads, not writes.

jcgregorio on 14 Oct 2018

Interesting fact: storage uses JSON/HTTP, while bigtable uses gRPC. So if these are related, they are at a deeper level—the Go networking stack, perhaps, or something in Google's network infrastructure.

It would be useful to know if these same issues crop up in other languages' clients.

jba on 15 Oct 2018

Off-the-cuff observation: if the BigTable client maintains long-lived connections, then those spikes are probably the GFE killing connections every hour (WAI).

redacted: gRPC should handle connection closures gracefully and with minimal downtime. Nevermind. :)

jadekler on 15 Oct 2018

@jba At my company we solved the initial issue above by recreating the GCS client and subsequently its http.Transport whenever the error rate is >10% in the last 100 requests.
This simple heuristic triggers around 1-2 times per hour per instance in average, but mostly solved the perceived issue in the meantime.

But in fact we indeed have more issues like that with other services for which we didn't find any easy solution yet. For instance sometimes PubSub clients of our low-traffic services simply stop receiving any messages at all for up to multiple minutes at a time - far more than the client-side timeout of 60s.

(That's probably not related and I hope I'm not mistaken, but... Every single day at some time each database in our Spanner instances increase their respective CPU usage. Which is kinda understandable, since they're probably running some kind of DB compaction etc. But what's interesting is that each single day at some consistently random _other_ time Spanner just "decides" to have a 3-10x (not a typo) increase in latency for a few minutes without any increase in OPS or CPU usage. Strange stuff. 😄)

lhecker on 15 Oct 2018

At a high level, I feel like these issues are best discussed with the product teams via their support channels. Latency is pretty hard to debug, and I don't think that the problems are specific to GoLang. I think that we should have some good documentation about who to talk to about performance issues in a centralized place. I do think that we ought to close this issue, since there isn't anything actionable from the client team.

@jadekler, we did some research on the Cloud Bigtable side. There's an interesting client-side issue with a subtle interaction between connection pooling, low QPS clients, connection refreshes and retries-with-exponential-backoff. That issue ought to be discussed off line, and in the context of performance testing and fixes.

sduskis on 6 Dec 2018

I suspect this issue is another manifestation of https://github.com/golang/go/issues/27208, which we discovered in https://github.com/googleapis/google-cloud-go/issues/1211. In short, it's an http/2 bug, which has since been fixed and will arrive in Go 1.11.3.

I'm going to close this. If the problem appears again with Go 1.11.3, feel free to re-open.