How to configure Envoy to retry on gRPC status codes?
Title: Need to enable retries on Envoy for transient failures
Description:
We have a gRPC backend fronted by Envoy for load balancing. The entire infrastructure runs on Kubernetes with HPA enabled. Whenever a pod is scaled down we get the error "upstream connect error or disconnect/reset before headers. reset reason: connection failure", so we need to enable retries in Envoy for such errors.
We followed the documentation (Envoy reference).
Client-side code:
func main() {
    // Set up a connection to the server.
    grpcAddress := os.Getenv("GRPC_SERVER_ADDRESS")
    conn, err := grpc.Dial(grpcAddress, grpc.WithInsecure())
    if err != nil {
        log.Fatalf("did not connect: %v", err)
    }
    defer conn.Close()
    c := pb.NewGreeterClient(conn)

    // Contact the server and print out its response.
    name := defaultName
    if len(os.Args) > 1 {
        name = os.Args[1]
    }

    // Attach the x-envoy retry headers to every outgoing request.
    outgoingContext := metadata.AppendToOutgoingContext(context.Background(),
        "x-envoy-retry-grpc-on", "cancelled,deadline-exceeded,internal,resource-exhausted,unavailable",
        "x-envoy-max-retries", "50",
        "test-key", "test-val",
        "x-envoy-upstream-rq-timeout-ms", "15000")

    // Call the server in a loop so that failures during pod churn show up.
    for {
        var header, trailer metadata.MD
        // grpc.FailFast(false) is deprecated; grpc.WaitForReady(true) is the equivalent option.
        r, err := c.SayHello(outgoingContext, &pb.HelloRequest{Name: name}, grpc.Header(&header), grpc.Trailer(&trailer), grpc.FailFast(false))
        log.Printf("Response headers: %v\n", header)
        log.Printf("Response trailers: %v\n", trailer)
        if err != nil {
            // Extract and log the gRPC status before exiting; log.Fatalf never returns.
            fromError, _ := status.FromError(err)
            log.Printf("error status: %v", fromError)
            log.Fatalf("could not greet: %v", err)
        }
        log.Printf("Greeting: %s\n", r.Message)
    }
}
We are passing the x-envoy headers with each request, and we print the response headers that we get back. The response headers do not contain any information about the grpc-status.
These are the response headers and trailers that we get:
2019/06/27 09:51:46 Response headers: map[content-type:[application/grpc] date:[Thu, 27 Jun 2019 09:51:45 GMT] server:[envoy] test-header:[test-val] x-envoy-upstream-service-time:[101]]
2019/06/27 09:51:46 Response trailers: map[]
Server-side code:
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
    log.Printf("Received: %v", in.Name)
    meta, _ := metadata.FromIncomingContext(ctx)
    log.Printf("metadata - %v\n", meta)
    time.Sleep(time.Millisecond * 100)
    // Note: grpc-status is normally written by the gRPC library itself in the
    // response trailers; setting it as a header here is only for experimentation.
    header := metadata.Pairs("grpc-status", "0", "test-header", "test-val")
    err := grpc.SetHeader(ctx, header)
    if err != nil {
        log.Printf("Error while setting header: %v\n", err)
    }
    return &pb.HelloReply{Message: "Hello " + in.Name}, nil
}

func main() {
    lis, err := net.Listen("tcp", port)
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }
    s := grpc.NewServer()
    pb.RegisterGreeterServer(s, &server{})
    stopping := make(chan bool, 1)
    finished := make(chan bool)
    go handleSignals(s, stopping, finished)
    err = s.Serve(lis)
    select {
    case <-stopping:
        // Graceful shutdown in progress; wait for it to finish.
        <-finished
    default:
        fmt.Println("Error serving: ", err.Error())
    }
    fmt.Println("Done")
}

func handleSignals(s *grpc.Server, stopping, finished chan bool) {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGHUP)
    <-c
    fmt.Println("Shutting down gracefully...")
    stopping <- true
    log.Println("Graceful stop")
    s.GracefulStop()
    finished <- true
}
When we print the incoming request metadata on the server, we do not see any of the x-envoy headers that we sent from the client:
metadata - map[:authority:[10.202.184.12:80] content-type:[application/grpc] test-key:[test-val] user-agent:[grpc-go/1.17.0] x-envoy-expected-rq-timeout-ms:[15000] x-forwarded-proto:[http] x-request-id:[80eca672-8fc3-46d5-a18b-9fa107b4e9d9]]
Note: even though we send the grpc-status header explicitly from the server, we do not see it in the response on the client side.
Envoy config:
filters:
  - name: "envoy.http_connection_manager"
    config:
      stat_prefix: "ingress"
      route_config:
        name: "local_route"
        virtual_hosts:
          - name: "http-route"
            domains:
              - "*"
            routes:
              - match:
                  prefix: "/"
                route:
                  cluster: "simple-grpc-server"
      http_filters:
        - name: "envoy.router"
          config:
            upstream_log:
              - name: envoy.file_access_log
                config:
                  path: "/tmp/envoy-http-upstream"
                  format: "%START_TIME(%Y/%m/%dT%H:%M:%S%z %s)% %PROTOCOL% %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\n\t(Response code):(response flags): %RESPONSE_CODE%:%RESPONSE_FLAGS%\n\tUpstream host: %UPSTREAM_HOST%\n\tUpstream Cluster: %UPSTREAM_CLUSTER%\n\tUser Agent %REQ(USER-AGENT)%\n\tX-ENVOY-ORIGINAL-DESTINATION-HOST: %REQ(X-ENVOY-ORIGINAL-DESTINATION-HOST)%\n\tAUTHORITY: %REQ(:AUTHORITY)%\n\tX-FORWARDED-FOR: %REQ(X-FORWARDED-FOR)%\n"
            suppress_envoy_headers: false
      access_log:
        - name: "envoy.file_access_log"
          config:
            path: "/tmp/envoy-http-access"
            format: "%START_TIME(%Y/%m/%dT%H:%M:%S%z %s)% %PROTOCOL% %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\n\t(Response code):(response flags): %RESPONSE_CODE%:%RESPONSE_FLAGS%\n\tUpstream host: %UPSTREAM_HOST%\n\tUpstream Cluster: %UPSTREAM_CLUSTER%\n\tUser Agent %REQ(USER-AGENT)%\n\tX-ENVOY-ORIGINAL-DESTINATION-HOST: %REQ(X-ENVOY-ORIGINAL-DESTINATION-HOST)%\n\tAUTHORITY: %REQ(:AUTHORITY)%\n\tX-FORWARDED-FOR: %REQ(X-FORWARDED-FOR)%\n"
When we checked the Envoy logs, we could not find any information about whether the request was retried or not. We only see two kinds of responses:
Success
2019/06/27T09:55:12+0000 1561629312 HTTP/2 POST /helloworld.Greeter/SayHello
(Response code):(response flags): 200:-
Upstream host: 10.163.130.185:50051
Upstream Cluster: simple-grpc-server
User Agent grpc-go/1.17.0
X-ENVOY-ORIGINAL-DESTINATION-HOST: -
AUTHORITY: 10.202.184.12:80
X-FORWARDED-FOR: -
Failure
2019/06/27T09:06:09+0000 1561626369 HTTP/2 POST /helloworld.Greeter/SayHello
(Response code):(response flags): 0:UF
Upstream host: 10.163.128.108:50051
Upstream Cluster: simple-grpc-server
User Agent grpc-go/1.17.0
X-ENVOY-ORIGINAL-DESTINATION-HOST: -
AUTHORITY: 10.202.184.12:80
X-FORWARDED-FOR: -
We always get 0:UF for requests that were sent to the terminating pods.
Is there anything that we missed here?
Steps to reproduce:
upstream connect error or disconnect/reset before headers. reset reason: connection failure

Have you looked at https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_routing.html?highlight=retry#arch-overview-http-routing-retry and https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#config-http-filters-router-x-envoy-retry-grpc-on?
Yes, we have added those retry headers to our requests on the client side:
outgoingContext := metadata.AppendToOutgoingContext(context.Background(), "x-envoy-retry-grpc-on", "cancelled,deadline-exceeded,internal,resource-exhausted,unavailable", "x-envoy-max-retries", "50", "test-key", "test-val", "x-envoy-upstream-rq-timeout-ms", "15000")
But we are still getting the same error.
How do we verify whether Envoy is retrying or not?
Look at the "upstream_rq_retry" stat on the cluster; it should tell you whether it is retrying.
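For example, assuming the Envoy admin interface is enabled in the bootstrap config (the log path and port below are placeholders, not values from this thread), the retry counters can be read from the admin /stats endpoint:
admin:
  access_log_path: "/tmp/envoy-admin-access.log"   # placeholder path
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901   # placeholder admin port
With that in place, fetching http://<envoy-host>:9901/stats and filtering for upstream_rq_retry shows whether any retries were attempted for the cluster.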
cluster.simple-grpc-server.external.upstream_rq_503: 3
cluster.simple-grpc-server.upstream_cx_connect_fail: 3
cluster.simple-grpc-server.upstream_rq_retry: 0
cluster.simple-grpc-server.upstream_rq_retry_overflow: 0
cluster.simple-grpc-server.upstream_rq_retry_success: 0
cluster.simple-grpc-server.upstream_rq_pending_failure_eject: 3
@ramaraochavali this is what we see, even though we send those retry headers in the request
Works for us now with the route policy below:
retry_policy:
  retry_on: "5xx"
  num_retries: 10
Our backend is gRPC, so we were retrying on gRPC status codes. But after adding 5xx it seems to work now.
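For reference, this is a sketch of where that retry_policy sits in the route configuration posted earlier; the retry values are the ones from this thread and the surrounding fields are copied from the config above:
routes:
  - match:
      prefix: "/"
    route:
      cluster: "simple-grpc-server"
      # Route-level retry policy; the same block can also be set on the virtual_host.
      retry_policy:
        retry_on: "5xx"
        num_retries: 10
Configuring retries in the route this way applies them to every client, without relying on the x-envoy-* request headers.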
Thanks everyone.
@roobalimsab glad this worked out. Normally, gRPC will not 5xx, since it should only return 200 and we use gRPC status codes instead in the trailer to determine the response status. So, if your backend is 5xx'ing, then you will need a retry policy as above.
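As a side note, a sketch based on the documented x-envoy-retry-on conditions (not something verified in this thread): the failing requests above carry the UF response flag, i.e. the upstream connection failed before headers, and Envoy answers them with a locally generated 503. A retry_on list that names the connection-level conditions explicitly should therefore also cover this case:
retry_policy:
  # "connect-failure" and "reset" are documented retry_on conditions for
  # connection-level failures; "5xx" (as used above) matches the local 503.
  retry_on: "connect-failure,reset,5xx"
  num_retries: 10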
@htuch
... and we use gRPC status codes instead in the trailer to determine the response status.
How can this be achieved if, according to the x-envoy-retry-grpc-on documentation, only status codes in response headers are supported?
Edit:
The aforementioned doc says:
... gRPC retries are currently only supported for gRPC status codes in response headers. gRPC status codes in trailers will not trigger retry logic. ...
But, according to this document, gRPC status can only be passed in trailers.