Title: Envoy returns 403 for failed gRPC ext_authz requests
Description:
The gRPC implementation for ext_authz returns 403(forbidden), when the request to the gRPC server is failed on connection error. It can be an temporary error. However, 403 doesn't recommend an automatic retry.
In order to have client retry, can we change it to 503?
Config:
http_filters:
- name: envoy.ext_authz
config:
failure_mode_allow: false
grpc_service:
envoy_grpc:
cluster_name: token-grpc
timeout: 10.0s
Call Stack:
The FORBIDDEN is hard coded at: https://github.com/envoyproxy/envoy/blob/master/source/extensions/filters/common/ext_authz/ext_authz_grpc_impl.cc#L77
Could you add a bit of clarification why you want to retry authz failure? I'm unsure of your use case but most authz failures I've seen have been pretty deterministic, such that retries are going to result in another failure.
Sure.
When an existing gRPC service is un-deployed, Envoy may run into a race condition with connection closure - the tcp package is out before FIN is received by Envoy. As such, the request is failed due to network issue.
I'm referring to the failures due to network issue, rather than explicit failed authorization. As network is not always reliable anyway, I suppose it not a 4xx, since it's more on the server side.
Please feel free to let me know if anything is unclear to you.
Ah, I think I've got you. Seems reasonable to me but I think @gsagula is likely to have a more informed opinion. Gabriel, any thoughts?
@hanyu-liu I see your point. The hard-coded 403 may seem a bit opinionated, but do you really want to give away what's going on with the authorization system? It's not uncommon for authorization servers to always return unauthorized or forbidden, and then treat underlying issues internally. However, I think we could enhance the filter by making it configurable. Please, let me know what you think. Thanks!
I feel that clients may get confused by the 403 for an invalid authorization. I am okay with a hard coded 503 returned to client. I can probably make the PR myself, if it sounds good to you.
It also sounds good to me to have a configurable response code for the purpose of backward compatibility. @alyssawilk @gsagula
Thanks! It sounds good to me.
Continuing the talk from https://github.com/envoyproxy/envoy/pull/6148.
Do you actually think an enum can work well for status_code_on_error? Don't want users to abuse the system by returning like 200, when the ext_authz is down.
Alternatively, can we fix it by introducing a hard-coded error code as a breaking change in the next major release?
Link to https://github.com/envoyproxy/envoy/issues/5974, who suffers from similar symptoms on http ext_authz
I have the same issue with ext_authz. When I remove or update the ext_authz service. (I run 2 instances in a cluster), the client may have to do multiple retries before the removed service endpoint is properly removed from load balancing. Can the filter be configured to do retries much as a route would. I don't believe the status code received by the downstream would need to be changed in this case. I like to know that the failure (4xx) is from the ext_authz service and not the target upstream.
Link to #5974, who suffers from similar symptoms on http ext_authz
I have this done, just need to write a test to it.
@gsagula any change in status?
@gdheller42 @hanyu-liu ^^
Looks like it will not work for gRPC? @gsagula
@gdheller42 @hanyu-liu ^^
@hanyu-liu Why not?
Looks like the status code has been set on gRPC failures, while the change seems only handle the http based ext auth(https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/http/ext_authz/v2/ext_authz.proto#config-filter-http-ext-authz-v2-httpservice). Did I miss anything? @gsagula
This is the default response code and it will get returned to the client unless the different code is configured on the filter level.
@gsagula Has this been merged into master yet? Thanks.
Not yet, I'm addressing the last comment. It should get merged soon though.
@gsagula Thanks for the update..
@gsagula pulled a docker image of the update. still need to grab v2 api and update my auth code. Will test next week. Thanks Have a good weekend.
Most helpful comment
I have this done, just need to write a test to it.