If we enable Ext Autz filter with a wrong cluster name, Envoy crashes during request process with the following exception
libc++abi.dylib: terminating with uncaught exception of type Envoy::EnvoyException: Unknown gRPC client cluster 'ext_autz_cluster'
[2018-03-13 13:05:21.874][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] Caught Abort trap: 6, suspect faulting address 0x7fff73ec2e3e
[2018-03-13 13:05:21.876][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:87] Backtrace obj<envoy-static> thr<123145484828672>:
[2018-03-13 13:05:21.876][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:99] thr<123145484828672> obj<envoy-static 0x000000010349ede6 _ZN8backward7details6unwindINS_14StackTraceImplINS_10system_tag10darwin_tagEE8callbackEEEmT_m>
[2018-03-13 13:05:21.876][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:105] thr<123145484828672> #0 0x10349ede6:
[2018-03-13 13:05:21.876][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:99] thr<123145484828672> obj<envoy-static>
[2018-03-13 13:05:21.876][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:105] thr<123145484828672> #1 0x10349e975: backward::StackTraceImpl<backward::system_tag::darwin_tag>::load_here(unsigned long) + 101
[2018-03-13 13:05:21.877][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:105] thr<123145484828672> #2 0x10349e771: backward::StackTraceImpl<backward::system_tag::darwin_tag>::load_from(void*, unsigned long) + 49
[2018-03-13 13:05:21.877][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:105] thr<123145484828672> #3 0x10349cffe: Envoy::BackwardsTrace::captureFrom(void*) + 46
[2018-03-13 13:05:21.877][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:105] thr<123145484828672> #4 0x10349cedf: Envoy::SignalAction::sigHandler(int, __siginfo*, void*) + 143
[2018-03-13 13:05:21.877][3347742][critical][backtrace] bazel-out/darwin-fastbuild/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:110] end backtrace thread 123145484828672
While this exception is valid that the cluster is not a known cluster, it would be good if we do this validation during server startup.
Also if this filter is enabled with listener served via LDS/ADS, we should validate the cluster is available and then only update the listener otherwise listener update should fail.
@saumoh can you take a look at fixing?
@mattklein123 @saumoh could we divide this into two tickets?
envoy_grpc then proto_config.grpc_service().envoy_grpc.cluster_name() is inside the current bootstrapped context.clusterManager().clusters().If it's OK, I could probably help working on the first one, while still investigating on the second one. Thank you.
@dio yes we can split. The dynamic case is being discussed in https://github.com/envoyproxy/envoy/issues/2762. I would check to see if @saumoh is fixing or not. The static case should be trivial to fix.
@dio and @mattklein123 I was thinking of addressing item 2 directly in which case we won't need to add a check for 1. wdyt?
Will get to it this week.
@saumoh cool!
@saumoh SGTM, thanks.
@mattklein123 @htuch
To make the authz filter use a dynamically learned gRPC cluster how about I implement the following:
AsyncClientFactoryImpl. The inherited class will allow the authz filter to accept the cluster_name from v2api.client_ may be nullptr and we can use the failure_mode_allow_(https://github.com/envoyproxy/envoy/blob/9271200ee1799c5cddf9785c5a63d0634cfcbf02/source/common/http/filter/ext_authz.h#L53) option to decide if the filter should permit|deny the request. There is already a counter for CheckStatus::Error that will click when we hit such a condition and that should provide the required debugging. See here.CheckStatus::Error status. Does this sound reasonable approach to you. Any condition that I am not thinking about?
@saumoh the issue here is that if an AsyncClient is stored, you will still run into issues. I think this is what you need to do:
factoryForGrpcService() take a boolean as to whether the client should require the cluster to exist and be a primary cluster (effectively).AsyncClientImpl::send and AsyncClientImpl::start safe if the cluster does not exist. In these cases it should return nullptr and not crash.Then I think you can continue on with the rest of what you propose which sounds fine.
@mattklein123 Makes sense.
For your point 2. In AsyncClientImpl I'll catch the exception thrown here so that the code is safe.
Will start implementing it.
This is fixed.