Feature request.
Envoy could adopt a standard for gRPC health checking, e.g.:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
syntax = "proto3";

package grpc.health.v1;

message HealthCheckRequest {
  string service = 1;
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
}

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
}
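For illustration, a health check against this API is just a unary Check call. A minimal client-side sketch, assuming C++ stubs generated from the proto above (the generated header name, target address, and service name below are placeholders):

#include <iostream>
#include <memory>

#include <grpcpp/grpcpp.h>

#include "health.grpc.pb.h"  // assumed name of the header generated from the proto above

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

int main() {
  // Plaintext channel to the backend being checked; the address is a placeholder.
  auto channel = grpc::CreateChannel("127.0.0.1:50052", grpc::InsecureChannelCredentials());
  std::unique_ptr<Health::Stub> stub = Health::NewStub(channel);

  HealthCheckRequest request;
  request.set_service("userservice");  // empty string asks about overall server health

  HealthCheckResponse response;
  grpc::ClientContext context;
  grpc::Status status = stub->Check(&context, request, &response);

  // A checker would treat anything other than (OK, SERVING) as unhealthy.
  const bool healthy = status.ok() && response.status() == HealthCheckResponse::SERVING;
  std::cout << (healthy ? "healthy" : "unhealthy") << std::endl;
  return healthy ? 0 : 1;
}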
Anyone wanting to do this would have to implement this service and configure the health check type to be grpc.
@mattklein123 are you OK with adding gRPC dependencies to achieve this? If not, then any suggestions?
We won't take the gRPC runtime as a dependency, mostly due to the lack of ability to control threading and transport. Instead, we should do the following:
1) Define a .proto gRPC HC API (this should be proposed here so community can discuss). This would get added to the Envoy repo and compiled like this proto: https://github.com/lyft/envoy/blob/master/source/common/ratelimit/ratelimit.proto
2) Use rpc_channel_impl, like ratelimit does, to call the HC API
3) Define new gRPC HC checker type here: https://github.com/lyft/envoy/blob/master/source/common/upstream/health_checker_impl.h (a rough sketch is below)
This is probably a ~1 week item for someone not familiar with the code but comfortable w/ C++. If you're interested feel free, but please keep us in the loop so that we can make sure you are on the right track.
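To make step 3 a bit more concrete, here is a very rough, hypothetical skeleton of what such a checker could look like. The class, base class, and callback names below are illustrative placeholders, not the actual Envoy interfaces:

#include <string>
#include <utility>

#include "health.pb.h"  // assumed name of the header generated from the HC proto

namespace Upstream {

// Hypothetical sketch only: the real checker would derive from the existing
// health checker base class and plug into the cluster's rpc channel (step 2).
class GrpcHealthCheckerImpl /* : public HealthCheckerImplBase */ {
public:
  explicit GrpcHealthCheckerImpl(std::string service_name)
      : service_name_(std::move(service_name)) {}

  // Called on each health check interval: issue Health.Check for the configured
  // service name over the upstream connection.
  void onInterval() {
    grpc::health::v1::HealthCheckRequest request;
    request.set_service(service_name_);
    // sendRpc(request) is placeholder plumbing; the response would come back
    // through onRpcComplete() below.
  }

  // Map the RPC result onto host health: only a successful RPC returning
  // SERVING counts as healthy, everything else is a failure.
  void onRpcComplete(bool rpc_ok, const grpc::health::v1::HealthCheckResponse& response) {
    const bool healthy =
        rpc_ok && response.status() == grpc::health::v1::HealthCheckResponse::SERVING;
    (void)healthy;  // handleSuccess()/handleFailure() equivalents would be called here
  }

private:
  const std::string service_name_;
};

} // namespace Upstream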
@mattklein123 i would be interested for sure. ill spend some time understanding the health_checker first.
Would Envoy also support streaming gRPC in addition to the current unary RPC? Health checks would then probably need something like a DRAIN status in addition to SERVING and NOT_SERVING.
Apparently there is this proto that I didn't know about:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
So we should use that and define a gRPC HTTP/2 HC type.
I think @craffert0 is going to implement this as an onboarding project as there is a place we can make use of this at Lyft and this is a generally useful feature.
@htuch @fengli79 @lizan do you know if the proto listed in the above MD file is hosted anywhere official? Or should we copy it into data-plane-api? (That seems not great.)
I believe https://github.com/grpc/grpc-proto is meant for proto-only hosting, but it looks like health.proto is not there.
@htuch what is the best way of importing this into Envoy? Just make the repo a Bazel dependency, like we do for data-plane-api?
I know that there has been some work started on this, and in the long run it would be the best way to do a health check against a gRPC service.
In the meantime, we were able to use the existing TCP health check support to construct binary data that we can send to a gRPC service (not over SSL, using HTTP/2 prior knowledge). By cramming all of the HTTP/2 frames into the packets sent by the health check, we are able to send a valid health check to the gRPC service and then look for it to respond with the "SERVING" enum status value.
For this to work, it requires a build of Envoy that has #2001 merged in, because otherwise the _second_ health check request will reuse the same TCP connection, and the hardcoded prior-knowledge magic bytes that we send first become invalid (they get interpreted as a frame size, causing the server to see an invalid request and send a GOAWAY / terminate the connection).
In any case... for anyone who is interested in using something as a stopgap measure, here's an example of a YAML config that sets up a gRPC health check (I'm using YAML here so I can add comments about what the different pieces of the bytes are):
health_check:
  type: tcp
  reuse_connection: false
  timeout_ms: 150
  interval_ms: 750
  unhealthy_threshold: 2
  healthy_threshold: 2
  # as may be obvious to the casual observer, this is a simple gRPC health check.
  send:
    # the HTTP/2 prior-knowledge connection preface
    - binary: 505249202a20485454502f322e300d0a0d0a534d0d0a0d0a
    # SETTINGS frame
    - binary: 00002404000000000000020000000000030000000000040000ffff000500010005000600002000fe0300000001
    # HEADERS frame (:scheme, :method, :path, :authority, content-type, etc.)
    - binary: 0000f301040000000140073a736368656d65046874747040073a6d6574686f6404504f535400053a706174681c2f677270632e6865616c74682e76312e4865616c74682f436865636b400a3a617574686f726974790f3132372e302e302e313a35303035324002746508747261696c657273400c636f6e74656e742d74797065106170706c69636174696f6e2f67727063400a757365722d6167656e7432677270632d6e6f64652f312e362e3620677270632d632f342e302e3020286f73783b206368747470323b20676172636961294014677270632d6163636570742d656e636f64696e67156964656e746974792c6465666c6174652c677a6970
    # DATA frame - with the gRPC service name in it at the end in ASCII/hex - in this case "UserService"
    - binary: 000012000100000001000000000d0a0b5573657253657276696365
  # look for the "SERVING" status protobuf message in the response (08 01 is the protobuf encoding of status = SERVING)
  receive: [ { "binary": "00020801" } ]
  interval_jitter_ms: 68
In the config above, you will need to change the "# DATA frame" entry to contain a valid gRPC health check message for your service... To get this value, we used Wireshark and the grpcc command line to send a health check to some service, told Wireshark to decode the stream as HTTP/2, and then copied the hex stream value of the DATA frame itself...
If you are familiar with the gRPC wire format and the HTTP/2 spec, you could construct this frame by hand: start with the serialized protobuf for the health check message, wrap it in the gRPC length-prefixed message framing, and then put the HTTP/2 DATA frame header in front (the header carries the number of payload bytes)... but we found it a lot easier to just Wireshark it.
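To make that concrete, here is a small standalone sketch (plain C++, no gRPC or Envoy dependencies) that assembles the same DATA frame by hand for a given service name; running it with "UserService" reproduces the DATA frame bytes used in the config above:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Build the HTTP/2 DATA frame carrying a grpc.health.v1.HealthCheckRequest for
// the given service name. Layouts follow the protobuf wire format, the gRPC
// length-prefixed message framing, and the HTTP/2 framing spec.
std::vector<uint8_t> buildHealthCheckDataFrame(const std::string& service) {
  // 1. Protobuf body: field 1 (service), wire type 2 (length-delimited).
  //    Assumes the name is shorter than 128 bytes so the length fits in one varint byte.
  std::vector<uint8_t> proto;
  proto.push_back(0x0a);                                  // tag: field 1, length-delimited
  proto.push_back(static_cast<uint8_t>(service.size()));  // length
  proto.insert(proto.end(), service.begin(), service.end());

  // 2. gRPC message framing: 1-byte compressed flag + 4-byte big-endian length.
  std::vector<uint8_t> grpc_msg;
  grpc_msg.push_back(0x00);  // not compressed
  const uint32_t proto_len = static_cast<uint32_t>(proto.size());
  grpc_msg.push_back((proto_len >> 24) & 0xff);
  grpc_msg.push_back((proto_len >> 16) & 0xff);
  grpc_msg.push_back((proto_len >> 8) & 0xff);
  grpc_msg.push_back(proto_len & 0xff);
  grpc_msg.insert(grpc_msg.end(), proto.begin(), proto.end());

  // 3. HTTP/2 frame header: 24-bit payload length, type 0x00 (DATA),
  //    flags 0x01 (END_STREAM), 31-bit stream id 1.
  std::vector<uint8_t> frame;
  const uint32_t payload_len = static_cast<uint32_t>(grpc_msg.size());
  frame.push_back((payload_len >> 16) & 0xff);
  frame.push_back((payload_len >> 8) & 0xff);
  frame.push_back(payload_len & 0xff);
  frame.push_back(0x00);  // type: DATA
  frame.push_back(0x01);  // flags: END_STREAM
  frame.push_back(0x00);
  frame.push_back(0x00);
  frame.push_back(0x00);
  frame.push_back(0x01);  // stream id 1
  frame.insert(frame.end(), grpc_msg.begin(), grpc_msg.end());
  return frame;
}

int main() {
  // Prints 000012000100000001000000000d0a0b5573657253657276696365
  for (uint8_t byte : buildHealthCheckDataFrame("UserService")) {
    std::printf("%02x", byte);
  }
  std::printf("\n");
  return 0;
}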
... Again - this is just in case it helps anyone who needs a stopgap until a "grpc" type health check exists at the top level.
^ is awesome. Thank you @ryangardner!
@mattklein123 , that proto is implemented here:
https://github.com/grpc/grpc/blob/master/src/cpp/server/health/default_health_check_service.cc
And the gRPC C++ server will install it if there's no user-provided one.
https://github.com/grpc/grpc/blob/b0bad8f3864dc9c8745736fe68efe513b2b84932/src/cpp/server/server_cc.cc#L532
And in the request proto, there's a service field. I think it should be used to distinguish different gRPC backends. For example:
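On the gRPC C++ server side, the default health check service lets you mark individual service names as serving, and the Envoy check would then query with the matching service value in the request. A rough sketch (the port and service name are placeholders, and the include paths may differ between gRPC versions):

#include <memory>

#include <grpcpp/grpcpp.h>
#include <grpcpp/health_check_service_interface.h>

int main() {
  // Ask gRPC to install its built-in grpc.health.v1.Health implementation.
  grpc::EnableDefaultHealthCheckService(true);

  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:50052", grpc::InsecureServerCredentials());
  // builder.RegisterService(&user_service);  // the real application services

  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();

  // Mark a specific service as serving; a check with service: "userservice"
  // now gets SERVING back, while names never set (or set to false) do not.
  server->GetHealthCheckService()->SetServingStatus("userservice", true);

  server->Wait();
  return 0;
}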
I'm going to work on this in the next couple of weeks.