I'm using v1.12.0 with go1.15.6.
Hello,
I make GCS API calls from a daemon that itself serves HTTP requests. I had a case where a client of my daemon would repeatedly time out due to the GCS request taking too long.
To try to avoid this, I have a health check that the daemon can contact GCS and mark itself offline if the check fails for too long, so requests go to other instances of the daemon. (It checks whether a bucket exists, in a loop with sleeps).
This check periodically fails for unknown reasons, though my guess is that GCS has issues periodically, which probably I must expect.
The check is also flawed in that there are multiple GCS IPs (not to mention hosts I'm sure), plus the HTTP/2 connections are cached. Which is to say some IPs/connections could be fine while others are not.
In debugging this, I noticed a cached HTTP/2 connection sticks around for ~4 minutes, even if it is consistently not returning a response (I have a context timeout in my health check). This makes me wonder whether I should disable keepalives and/or add a shorter timeout and retry in my own code, to avoid these connections that could be causing my clients to time out. Neither of these options seem attractive but waiting on a connection that is dead while using a new one that may succeed seems not ideal either.
Basically it seems like from the perspective of my clients there could be an outage due to a cached dead connection (and possibly other reasons), and I wonder if I could improve things somehow.
A couple questions then:
Thanks for your question. A few thoughts on this:
GODEBUG=http2client=0 and see if that changes the behavior you see at all.package main
import (
"context"
"net/http"
"cloud.google.com/go/storage"
"google.golang.org/api/option"
raw "google.golang.org/api/storage/v1"
htransport "google.golang.org/api/transport/http"
)
func main() {
ctx := context.Background()
// Standard way to initialize client:
// client, err := storage.NewClient(ctx)
// if err != nil {
// // handle error
// }
// Instead, create a http.Client using a custom transport.
base := http.DefaultTransport.(*http.Transport).Clone()
// TODO: configure base as desired.
trans, err := htransport.NewTransport(ctx, base, option.WithScopes(raw.DevstorageFullControlScope),
option.WithUserAgent("custom-user-agent"))
if err != nil {
// Handle error.
}
c := http.Client{Transport:trans}
// Supply this client to storage.NewClient
client, err := storage.NewClient(ctx, option.WithHTTPClient(&c))
if err != nil {
// Handle error.
}
}
Thank you for your thoughts!
Yeah, I agree trying to manage the connections is not desirable.
Regarding disabling HTTP/2 - I tried this and surprisingly there was a difference. With HTTP/2, the dead cached connection (blocked with iptables) keeps being tried and my requests repeatedly time out (for ~4 minutes):
2021/01/12 15:57:02 success
# Added iptables rule here
2021/01/12 15:57:13 error: error retrieving bucket attributes: context deadline exceeded
2021/01/12 15:57:24 error: error retrieving bucket attributes: context deadline exceeded
2021/01/12 15:57:35 error: error retrieving bucket attributes: context deadline exceeded
But if I disable HTTP/2, the dead connection seems to be recycled after it fails:
2021/01/12 15:57:55 success
# Added iptables rule here
2021/01/12 15:58:06 error: error retrieving bucket attributes: context deadline exceeded
2021/01/12 15:58:13 success
I wonder if that suggests the net/http client could be improved in this situation. It also suggests to me that disabling keepalives, though not ideal, would help as well. That way I could control the timeout with the dial timeout options at least, as well as rely on my context timeout and a retry actually trying a new connection. It's not clear to me what timeout to adjust for the cached HTTP/2 connection.
Regarding the healthcheck not being indicative of subsequent requests failing - agreed. It may be better than nothing as things stand though. Last evening for example 2 of my compute engine VMs (differente zones, same region) had apparent GCS issues leading to client timeouts, and the check did eventually pull both out of rotation for a time.
Yeah, retrying with a shorter timeout in my application code could help I think. My concern is that a dead HTTP/2 connection is going to keep being used, as above, making the retry kind of pointless. Together with either disabling HTTP/2 or keepalives though it might be an improvement at least...
This issue sounds related to what I'm seeing with cached HTTP/2 connections being reused - https://github.com/golang/go/issues/36026. As well: https://github.com/golang/go/issues/30702.
That's interesting about disabling HTTP/2, but I guess not entirely surprising given that HTTP/1.1 does not have long-lived connections in the same way. I would definitely file an issue for net/http (or comment on one of the existing ones with your experience).
If you use my code snippet above to create your storage client, you can play with some of these settings yourself via fields in Transport and see if that helps (e.g. DisableKeepAlives or IdleConnTimeout). You could also manually close idle open connections in the transport via Transport.CloseIdleConnections when your health check fails. I'd be curious to hear if any of these lead to an improvement on your side.
@horgh I looked at the golang issue and it looks like you used ReadIdleTimeout as a workaround; is that through golang.org/x/net/http2 ? Curious about how you did that.
Thanks for looking :). Yeah, through golang.org/x/net/http2 together with your example:
httpTransport := http.DefaultTransport.(*http.Transport).Clone()
http2Transport, err := http2.ConfigureTransports(httpTransport)
if err != nil {
return nil, errors.Wrap(err, "error configuring http/2 transport")
}
// Setting this enables the ping timeout functionality.
http2Transport.ReadIdleTimeout = 5 * time.Second
htrans, err := htransport.NewTransport(ctx, httpTransport)
if err != nil {
return nil, errors.Wrap(err, "error creating transport")
}
httpClient := &http.Client{
Transport: htrans,
}
client, err := storage.NewClient(ctx, option.WithHTTPClient(httpClient))
if err != nil {
return nil, errors.Wrap(err, "error creating storage client")
}
Awesome, it's very helpful to have this example!
I don't think we can do this by default because it would require depending on the x version of net/http but I can definitely suggest this to others who might run into a similar issue.
By the way, the one change we do make in the default transport is to increase the value for MaxIdleConnectionsPerHost from 2 to 100. So you might want to do the same in your setup if you're doing workloads that require high read throughput.
Just to circle back on this closed issue-- after some internal discussion, we decided to add a ReadIdleTimeout of 15s to the transport by default (for Go 1.16+). This change went out today in the latest release of the storage client, v1.15.0.
That's awesome! Thank you for letting me/everyone know :-).