How to configure Envoy to retry on gRPC status codes?
Title: Need to enable retries on Envoy for transient failures
Description:
We have a gRPC backend fronted by Envoy for load balancing. The entire infrastructure runs on Kubernetes with HPA enabled. Whenever a pod is scaled down we get the error "upstream connect error or disconnect/reset before headers. reset reason: connection failure", so we need to enable retries in Envoy for such errors.
We followed the documentation (Envoy reference).
Client-side code:
func main() {
    // Set up a connection to the server.
    grpcAddress := os.Getenv("GRPC_SERVER_ADDRESS")
    conn, err := grpc.Dial(grpcAddress, grpc.WithInsecure())
    if err != nil {
        log.Fatalf("did not connect: %v", err)
    }
    defer conn.Close()
    c := pb.NewGreeterClient(conn)

    // Contact the server and print out its response.
    name := defaultName
    if len(os.Args) > 1 {
        name = os.Args[1]
    }

    // Attach the x-envoy retry headers to every outgoing request.
    outgoingContext := metadata.AppendToOutgoingContext(context.Background(),
        "x-envoy-retry-grpc-on", "cancelled,deadline-exceeded,internal,resource-exhausted,unavailable",
        "x-envoy-max-retries", "50",
        "test-key", "test-val",
        "x-envoy-upstream-rq-timeout-ms", "15000")

    // Call the server in a loop so that failures during pod churn show up.
    for {
        var header, trailer metadata.MD
        // grpc.FailFast(false) is deprecated; grpc.WaitForReady(true) is the equivalent option.
        r, err := c.SayHello(outgoingContext, &pb.HelloRequest{Name: name}, grpc.Header(&header), grpc.Trailer(&trailer), grpc.FailFast(false))
        log.Printf("Response headers: %v\n", header)
        log.Printf("Response trailers: %v\n", trailer)
        if err != nil {
            // Extract and log the gRPC status before exiting; log.Fatalf never returns.
            fromError, _ := status.FromError(err)
            log.Printf("error status: %v", fromError)
            log.Fatalf("could not greet: %v", err)
        }
        log.Printf("Greeting: %s\n", r.Message)
    }
}
We are passing the x-envoy headers with each request, and we print the response headers that we get back. The response headers do not contain any information about the grpc-status.
These are the response headers and trailers that we get:
2019/06/27 09:51:46 Response headers: map[content-type:[application/grpc] date:[Thu, 27 Jun 2019 09:51:45 GMT] server:[envoy] test-header:[test-val] x-envoy-upstream-service-time:[101]]
2019/06/27 09:51:46 Response trailers: map[]
Server-side code:
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
    log.Printf("Received: %v", in.Name)
    meta, _ := metadata.FromIncomingContext(ctx)
    log.Printf("metadata - %v\n", meta)
    time.Sleep(time.Millisecond * 100)
    // Note: grpc-status is normally written by the gRPC library itself in the
    // response trailers; setting it as a header here is only for experimentation.
    header := metadata.Pairs("grpc-status", "0", "test-header", "test-val")
    err := grpc.SetHeader(ctx, header)
    if err != nil {
        log.Printf("Error while setting header: %v\n", err)
    }
    return &pb.HelloReply{Message: "Hello " + in.Name}, nil
}

func main() {
    lis, err := net.Listen("tcp", port)
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }
    s := grpc.NewServer()
    pb.RegisterGreeterServer(s, &server{})
    stopping := make(chan bool, 1)
    finished := make(chan bool)
    go handleSignals(s, stopping, finished)
    err = s.Serve(lis)
    select {
    case <-stopping:
        // Graceful shutdown in progress; wait for it to finish.
        <-finished
    default:
        fmt.Println("Error serving: ", err.Error())
    }
    fmt.Println("Done")
}

func handleSignals(s *grpc.Server, stopping, finished chan bool) {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGHUP)
    <-c
    fmt.Println("Shutting down gracefully...")
    stopping <- true
    log.Println("Graceful stop")
    s.GracefulStop()
    finished <- true
}
When we print the incoming request metadata on the server, we do not see any of the x-envoy headers that we sent from the client:
metadata - map[:authority:[10.202.184.12:80] content-type:[application/grpc] test-key:[test-val] user-agent:[grpc-go/1.17.0] x-envoy-expected-rq-timeout-ms:[15000] x-forwarded-proto:[http] x-request-id:[80eca672-8fc3-46d5-a18b-9fa107b4e9d9]]
Note: even though we send the grpc-status header explicitly from the server, we do not see it in the response on the client side.
Envoy config:
filters:
  - name: "envoy.http_connection_manager"
    config:
      stat_prefix: "ingress"
      route_config:
        name: "local_route"
        virtual_hosts:
          - name: "http-route"
            domains:
              - "*"
            routes:
              - match:
                  prefix: "/"
                route:
                  cluster: "simple-grpc-server"
      http_filters:
        - name: "envoy.router"
          config:
            upstream_log:
              - name: envoy.file_access_log
                config:
                  path: "/tmp/envoy-http-upstream"
                  format: "%START_TIME(%Y/%m/%dT%H:%M:%S%z %s)% %PROTOCOL% %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\n\t(Response code):(response flags): %RESPONSE_CODE%:%RESPONSE_FLAGS%\n\tUpstream host: %UPSTREAM_HOST%\n\tUpstream Cluster: %UPSTREAM_CLUSTER%\n\tUser Agent %REQ(USER-AGENT)%\n\tX-ENVOY-ORIGINAL-DESTINATION-HOST: %REQ(X-ENVOY-ORIGINAL-DESTINATION-HOST)%\n\tAUTHORITY: %REQ(:AUTHORITY)%\n\tX-FORWARDED-FOR: %REQ(X-FORWARDED-FOR)%\n"
            suppress_envoy_headers: false
      access_log:
        - name: "envoy.file_access_log"
          config:
            path: "/tmp/envoy-http-access"
            format: "%START_TIME(%Y/%m/%dT%H:%M:%S%z %s)% %PROTOCOL% %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\n\t(Response code):(response flags): %RESPONSE_CODE%:%RESPONSE_FLAGS%\n\tUpstream host: %UPSTREAM_HOST%\n\tUpstream Cluster: %UPSTREAM_CLUSTER%\n\tUser Agent %REQ(USER-AGENT)%\n\tX-ENVOY-ORIGINAL-DESTINATION-HOST: %REQ(X-ENVOY-ORIGINAL-DESTINATION-HOST)%\n\tAUTHORITY: %REQ(:AUTHORITY)%\n\tX-FORWARDED-FOR: %REQ(X-FORWARDED-FOR)%\n"
When we checked the Envoy logs, we could not find any information about whether the request was retried or not. We only see two kinds of responses:
Success
2019/06/27T09:55:12+0000 1561629312 HTTP/2 POST /helloworld.Greeter/SayHello
(Response code):(response flags): 200:-
Upstream host: 10.163.130.185:50051
Upstream Cluster: simple-grpc-server
User Agent grpc-go/1.17.0
X-ENVOY-ORIGINAL-DESTINATION-HOST: -
AUTHORITY: 10.202.184.12:80
X-FORWARDED-FOR: -
Failure
2019/06/27T09:06:09+0000 1561626369 HTTP/2 POST /helloworld.Greeter/SayHello
(Response code):(response flags): 0:UF
Upstream host: 10.163.128.108:50051
Upstream Cluster: simple-grpc-server
User Agent grpc-go/1.17.0
X-ENVOY-ORIGINAL-DESTINATION-HOST: -
AUTHORITY: 10.202.184.12:80
X-FORWARDED-FOR: -
We always get 0:UF for requests that were sent to the terminating pods.
Is there anything that we missed here?
Steps to reproduce:
upstream connect error or disconnect/reset before headers. reset reason: connection failure

Have you looked at https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_routing.html?highlight=retry#arch-overview-http-routing-retry and https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#config-http-filters-router-x-envoy-retry-grpc-on?
Yes, we have added those retry headers to our requests on the client side:
outgoingContext := metadata.AppendToOutgoingContext(context.Background(), "x-envoy-retry-grpc-on", "cancelled,deadline-exceeded,internal,resource-exhausted,unavailable", "x-envoy-max-retries", "50", "test-key", "test-val", "x-envoy-upstream-rq-timeout-ms", "15000")
But we are still getting the same error.
How do we verify whether Envoy is retrying or not?
Look at the "upstream_rq_retry" stat on the cluster; it should tell you whether it is retrying.
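For example, assuming the Envoy admin interface is enabled in the bootstrap config (the log path and port below are placeholders, not values from this thread), the retry counters can be read from the admin /stats endpoint:
admin:
  access_log_path: "/tmp/envoy-admin-access.log"   # placeholder path
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901   # placeholder admin port
With that in place, fetching http://<envoy-host>:9901/stats and filtering for upstream_rq_retry shows whether any retries were attempted for the cluster.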
cluster.simple-grpc-server.external.upstream_rq_503: 3
cluster.simple-grpc-server.upstream_cx_connect_fail: 3
cluster.simple-grpc-server.upstream_rq_retry: 0
cluster.simple-grpc-server.upstream_rq_retry_overflow: 0
cluster.simple-grpc-server.upstream_rq_retry_success: 0
cluster.simple-grpc-server.upstream_rq_pending_failure_eject: 3
@ramaraochavali this is what we see, even though we send those retry headers in the request
Works for us now with the route policy below:
retry_policy:
  retry_on: "5xx"
  num_retries: 10
Our backend is gRPC, so we were retrying on gRPC status codes. But after adding 5xx it seems to work now.
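For reference, this is a sketch of where that retry_policy sits in the route configuration posted earlier; the retry values are the ones from this thread and the surrounding fields are copied from the config above:
routes:
  - match:
      prefix: "/"
    route:
      cluster: "simple-grpc-server"
      # Route-level retry policy; the same block can also be set on the virtual_host.
      retry_policy:
        retry_on: "5xx"
        num_retries: 10
Configuring retries in the route this way applies them to every client, without relying on the x-envoy-* request headers.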
Thanks everyone.
@roobalimsab glad this worked out. Normally, gRPC will not 5xx, since it should only return 200 and we use gRPC status codes instead in the trailer to determine the response status. So, if your backend is 5xx'ing, then you will need a retry policy as above.
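As a side note, a sketch based on the documented x-envoy-retry-on conditions (not something verified in this thread): the failing requests above carry the UF response flag, i.e. the upstream connection failed before headers, and Envoy answers them with a locally generated 503. A retry_on list that names the connection-level conditions explicitly should therefore also cover this case:
retry_policy:
  # "connect-failure" and "reset" are documented retry_on conditions for
  # connection-level failures; "5xx" (as used above) matches the local 503.
  retry_on: "connect-failure,reset,5xx"
  num_retries: 10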
@htuch
... and we use gRPC status codes instead in the trailer to determine the response status.
How can this be achieved if, according to the x-envoy-retry-grpc-on documentation, only status codes in response headers are supported?
Edit:
The aforementioned doc says:
... gRPC retries are currently only supported for gRPC status codes in response headers. gRPC status codes in trailers will not trigger retry logic. ...
But, according to this document, gRPC status can only be passed in trailers.