Feature request.
Envoy could adopt a standard for gRPC health checking, e.g.:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
syntax = "proto3";

package grpc.health.v1;

message HealthCheckRequest {
  string service = 1;
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
}

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
}
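For illustration, a health check against this API is just a unary Check call. A minimal client-side sketch, assuming C++ stubs generated from the proto above (the generated header name, target address, and service name below are placeholders):

#include <iostream>
#include <memory>

#include <grpcpp/grpcpp.h>

#include "health.grpc.pb.h"  // assumed name of the header generated from the proto above

using grpc::health::v1::Health;
using grpc::health::v1::HealthCheckRequest;
using grpc::health::v1::HealthCheckResponse;

int main() {
  // Plaintext channel to the backend being checked; the address is a placeholder.
  auto channel = grpc::CreateChannel("127.0.0.1:50052", grpc::InsecureChannelCredentials());
  std::unique_ptr<Health::Stub> stub = Health::NewStub(channel);

  HealthCheckRequest request;
  request.set_service("userservice");  // empty string asks about overall server health

  HealthCheckResponse response;
  grpc::ClientContext context;
  grpc::Status status = stub->Check(&context, request, &response);

  // A checker would treat anything other than (OK, SERVING) as unhealthy.
  const bool healthy = status.ok() && response.status() == HealthCheckResponse::SERVING;
  std::cout << (healthy ? "healthy" : "unhealthy") << std::endl;
  return healthy ? 0 : 1;
}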
Anyone wanting to do this would have to implement this service and configure the health check type to be grpc.
@mattklein123 are you OK with adding gRPC dependencies to achieve this? If not, then any suggestions?
We won't take the gRPC runtime as a dependency, mostly due to the lack of ability to control threading and transport. Instead, we should do the following:
1) Define a .proto gRPC HC API (this should be proposed here so community can discuss). This would get added to the Envoy repo and compiled like this proto: https://github.com/lyft/envoy/blob/master/source/common/ratelimit/ratelimit.proto
2) Use rpc_channel_impl, like ratelimit does, to call the HC API
3) Define new gRPC HC checker type here: https://github.com/lyft/envoy/blob/master/source/common/upstream/health_checker_impl.h (a rough sketch is below)
This is probably a ~1 week item for someone not familiar with the code but comfortable w/ C++. If you're interested feel free, but please keep us in the loop so that we can make sure you are on the right track.
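To make step 3 a bit more concrete, here is a very rough, hypothetical skeleton of what such a checker could look like. The class, base class, and callback names below are illustrative placeholders, not the actual Envoy interfaces:

#include <string>
#include <utility>

#include "health.pb.h"  // assumed name of the header generated from the HC proto

namespace Upstream {

// Hypothetical sketch only: the real checker would derive from the existing
// health checker base class and plug into the cluster's rpc channel (step 2).
class GrpcHealthCheckerImpl /* : public HealthCheckerImplBase */ {
public:
  explicit GrpcHealthCheckerImpl(std::string service_name)
      : service_name_(std::move(service_name)) {}

  // Called on each health check interval: issue Health.Check for the configured
  // service name over the upstream connection.
  void onInterval() {
    grpc::health::v1::HealthCheckRequest request;
    request.set_service(service_name_);
    // sendRpc(request) is placeholder plumbing; the response would come back
    // through onRpcComplete() below.
  }

  // Map the RPC result onto host health: only a successful RPC returning
  // SERVING counts as healthy, everything else is a failure.
  void onRpcComplete(bool rpc_ok, const grpc::health::v1::HealthCheckResponse& response) {
    const bool healthy =
        rpc_ok && response.status() == grpc::health::v1::HealthCheckResponse::SERVING;
    (void)healthy;  // handleSuccess()/handleFailure() equivalents would be called here
  }

private:
  const std::string service_name_;
};

} // namespace Upstream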
@mattklein123 i would be interested for sure. ill spend some time understanding the health_checker first.
Would Envoy also support streaming gRPC in addition to the current unary RPC? Health checks would then probably need something like a DRAIN status in addition to SERVING and NOT_SERVING.
Apparently there is this proto that I didn't know about:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
So we should use that and define a gRPC HTTP/2 HC type.
I think @craffert0 is going to implement this as an onboarding project as there is a place we can make use of this at Lyft and this is a generally useful feature.
@htuch @fengli79 @lizan do you know if the proto listed in the above MD file is hosted anywhere official? Or should we copy it into data-plane-api? (That seems not great.)
I believe https://github.com/grpc/grpc-proto is meant for proto-only hosting, but it looks like health.proto is not there.
@htuch what is the best way of importing this into Envoy? Just make the repo a Bazel dependency, like we do for data-plane-api?
I know that there has been some work started on this, and in the long run it would be the best way to do a health check against a gRPC service.
In the meantime, we were able to use the existing TCP health check support to construct binary data that we can send to a gRPC service (not over SSL, using HTTP/2 prior knowledge). By cramming all of the HTTP/2 frames into the packets sent by the health check, we are able to send a valid health check to the gRPC service and then look for it to respond with the "SERVING" enum status value.
For this to work, it requires a build of Envoy that has #2001 merged in, because otherwise the _second_ health check request will reuse the same TCP connection, and the hardcoded prior-knowledge magic bytes that we send first become invalid (they get interpreted as a frame size, causing the server to see an invalid request and send a GOAWAY / terminate the connection).
In any case... for anyone who is interested in using something as a stopgap measure, here's an example of a YAML config that sets up a gRPC health check (I'm using YAML here so I can add comments about what the different pieces of the bytes are):
health_check:
  type: tcp
  reuse_connection: false
  timeout_ms: 150
  interval_ms: 750
  unhealthy_threshold: 2
  healthy_threshold: 2
  # as may be obvious to the casual observer, this is a simple gRPC health check.
  send:
    # the HTTP/2 prior-knowledge connection preface
    - binary: 505249202a20485454502f322e300d0a0d0a534d0d0a0d0a
    # SETTINGS frame
    - binary: 00002404000000000000020000000000030000000000040000ffff000500010005000600002000fe0300000001
    # HEADERS frame (:scheme, :method, :path, :authority, content-type, etc.)
    - binary: 0000f301040000000140073a736368656d65046874747040073a6d6574686f6404504f535400053a706174681c2f677270632e6865616c74682e76312e4865616c74682f436865636b400a3a617574686f726974790f3132372e302e302e313a35303035324002746508747261696c657273400c636f6e74656e742d74797065106170706c69636174696f6e2f67727063400a757365722d6167656e7432677270632d6e6f64652f312e362e3620677270632d632f342e302e3020286f73783b206368747470323b20676172636961294014677270632d6163636570742d656e636f64696e67156964656e746974792c6465666c6174652c677a6970
    # DATA frame - with the gRPC service name in it at the end in ASCII/hex - in this case "UserService"
    - binary: 000012000100000001000000000d0a0b5573657253657276696365
  # look for the "SERVING" status protobuf message in the response (08 01 is the protobuf encoding of status = SERVING)
  receive: [ { "binary": "00020801" } ]
  interval_jitter_ms: 68
In the config above, you will need to change the "# DATA frame" entry to contain a valid gRPC health check message for your service... To get this value, we used Wireshark and the grpcc command line to send a health check to some service, told Wireshark to decode the stream as HTTP/2, and then copied the hex stream value of the DATA frame itself...
If you are familiar with the gRPC wire format and the HTTP/2 spec, you could construct this frame by hand: start with the serialized protobuf for the health check message, wrap it in the gRPC length-prefixed message framing, and then put the HTTP/2 DATA frame header in front (the header carries the number of payload bytes)... but we found it a lot easier to just Wireshark it.
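To make that concrete, here is a small standalone sketch (plain C++, no gRPC or Envoy dependencies) that assembles the same DATA frame by hand for a given service name; running it with "UserService" reproduces the DATA frame bytes used in the config above:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Build the HTTP/2 DATA frame carrying a grpc.health.v1.HealthCheckRequest for
// the given service name. Layouts follow the protobuf wire format, the gRPC
// length-prefixed message framing, and the HTTP/2 framing spec.
std::vector<uint8_t> buildHealthCheckDataFrame(const std::string& service) {
  // 1. Protobuf body: field 1 (service), wire type 2 (length-delimited).
  //    Assumes the name is shorter than 128 bytes so the length fits in one varint byte.
  std::vector<uint8_t> proto;
  proto.push_back(0x0a);                                  // tag: field 1, length-delimited
  proto.push_back(static_cast<uint8_t>(service.size()));  // length
  proto.insert(proto.end(), service.begin(), service.end());

  // 2. gRPC message framing: 1-byte compressed flag + 4-byte big-endian length.
  std::vector<uint8_t> grpc_msg;
  grpc_msg.push_back(0x00);  // not compressed
  const uint32_t proto_len = static_cast<uint32_t>(proto.size());
  grpc_msg.push_back((proto_len >> 24) & 0xff);
  grpc_msg.push_back((proto_len >> 16) & 0xff);
  grpc_msg.push_back((proto_len >> 8) & 0xff);
  grpc_msg.push_back(proto_len & 0xff);
  grpc_msg.insert(grpc_msg.end(), proto.begin(), proto.end());

  // 3. HTTP/2 frame header: 24-bit payload length, type 0x00 (DATA),
  //    flags 0x01 (END_STREAM), 31-bit stream id 1.
  std::vector<uint8_t> frame;
  const uint32_t payload_len = static_cast<uint32_t>(grpc_msg.size());
  frame.push_back((payload_len >> 16) & 0xff);
  frame.push_back((payload_len >> 8) & 0xff);
  frame.push_back(payload_len & 0xff);
  frame.push_back(0x00);  // type: DATA
  frame.push_back(0x01);  // flags: END_STREAM
  frame.push_back(0x00);
  frame.push_back(0x00);
  frame.push_back(0x00);
  frame.push_back(0x01);  // stream id 1
  frame.insert(frame.end(), grpc_msg.begin(), grpc_msg.end());
  return frame;
}

int main() {
  // Prints 000012000100000001000000000d0a0b5573657253657276696365
  for (uint8_t byte : buildHealthCheckDataFrame("UserService")) {
    std::printf("%02x", byte);
  }
  std::printf("\n");
  return 0;
}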
... Again - this is just in case it helps anyone who needs a stopgap until a "grpc" type health check exists at the top level.
^ is awesome. Thank you @ryangardner!
@mattklein123 , that proto is implemented here:
https://github.com/grpc/grpc/blob/master/src/cpp/server/health/default_health_check_service.cc
And the gRPC C++ server will install it if there's no user-provided one.
https://github.com/grpc/grpc/blob/b0bad8f3864dc9c8745736fe68efe513b2b84932/src/cpp/server/server_cc.cc#L532
And in the request proto, there's a service field. I think it should be used to distinguish different gRPC backends. For example:
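On the gRPC C++ server side, the default health check service lets you mark individual service names as serving, and the Envoy check would then query with the matching service value in the request. A rough sketch (the port and service name are placeholders, and the include paths may differ between gRPC versions):

#include <memory>

#include <grpcpp/grpcpp.h>
#include <grpcpp/health_check_service_interface.h>

int main() {
  // Ask gRPC to install its built-in grpc.health.v1.Health implementation.
  grpc::EnableDefaultHealthCheckService(true);

  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:50052", grpc::InsecureServerCredentials());
  // builder.RegisterService(&user_service);  // the real application services

  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();

  // Mark a specific service as serving; a check with service: "userservice"
  // now gets SERVING back, while names never set (or set to false) do not.
  server->GetHealthCheckService()->SetServingStatus("userservice", true);

  server->Wait();
  return 0;
}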
I'm going to work on this in the next couple of weeks.