_Title_: allow extension factories to be asynchronous
_Context_:
At the moment, extension factories implement a create*(config) method, assuming that a given configuration itself is enough to return a ready-to-use extension instance or *FactoryCb. E.g.:
- envoy.access_loggers: AccessLog::InstanceSharedPtr createAccessLogInstance(config)
- envoy.filters.http: Http::FilterFactoryCb createFilterFactoryFromProto(config)
- envoy.filters.network: Network::FilterFactoryCb createFilterFactoryFromProto(config)
- envoy.filters.listener: Network::ListenerFilterFactoryCb createListenerFilterFactoryFromProto(config)

_Problem_:
In the case of Wasm extensions, the config object itself is not enough to return a ready-to-use extension instance or *FactoryCb, e.g.:
- Wasm code needs to be loaded from a remote location
- a Wasm extension is allowed to define _start / proxy_on_vm_start / proxy_on_configure callbacks that, in the general case, might need to use async Envoy APIs, such as the HTTP Client / gRPC Client

As a result, envoy.access_loggers.wasm, envoy.filters.network.wasm and envoy.filters.http.wasm implement the create*(config) method in a way that lets Envoy proceed and start serving requests even though the Wasm extension has not been loaded, configured and warmed up yet.

_Proposal_:
- allow extension factories to return a Future/Promise instead of a ready-to-use object
- it will give Wasm extensions flexibility in deciding when they are ready-to-use
- it will make Envoy behaviour consistent - the enclosing object (e.g., Listener) will remain in the warming state until all extensions are actually ready-to-use

For wasm extensions, it's a burden for users to decide whether it's ok to serve requests while the extension is not ready yet. Just don't serve the requests until configuration is complete. I think the big question is what happens if configuration fails? There is no exit state from warming other than success.
Eventually, config-based soft-failure or hard-failure? Presumably the filter owner can decide if they want the server to limp along without that functionality (500 on routes using WASM, fail to access log) or hard-fail (not start up / never listen).
Would it be sufficient to hook into the listener's Init::Manager and use that to defer finalizing the listener load until the filter factory is ready?
If a failure happens, it might be desirable to backtrack to a previously working listener, or fast-track to the next incoming listener update and interrupt the configuration future.
I feel very strongly that we should have a holistic design for this issue which also covers https://github.com/envoyproxy/envoy/issues/12183. This is also highly related to the work that @kyessenov already did on the filter config discovery service implementation inside the HCM. At a high level the problem is very similar and we should be very consistent in how we handle it, both in terms of config and implementation (like in FilterConfigDS we should have default config, skip warming, fail until success, etc.).
Can we please do a design doc that covers all of these different uses cases and possibly consolidate issues where possible? Thank you!
cc @ASOPVII @lambdai @kyessenov @lizan
@ASOPVII has switched to ondemand filter chain, which is independent from the proposal here.
The major effort of the on-demand filter chain is to trigger initialization upon traffic at the data plane (on a worker thread),
which can never be addressed by master thread tricks.
@yskopets
This looks reasonable except:
> - it will give Wasm extensions flexibility in deciding when they are ready-to-use
> - it will make Envoy behaviour consistent - the enclosing object (e.g., Listener) will remain in the warming state until all extensions are actually ready-to-use
The existing code doesn't introduce a future concept, but Envoy does already achieve "remain in the warming state until all extensions are actually ready-to-use" with the init manager.
The callback embedded in the watcher of Init::Manager acts like Promise::set_value(). Could you look into it?
xref: https://github.com/envoyproxy/envoy/issues/12061 (a local init manager for extension config loading was cut out since it's complicated)
> @ASOPVII has switched to ondemand filter chain, which is independent from the proposal here.
OK.
Please no coding on any of this until we have design documents for discussion. Thank you!
Thanks everyone for the feedback!
I will need some time to look into the related issues.
Hopefully, by the end of the week.
Here is the draft version of the proposal.
It takes into account this issue, #12183 and FCDS.
If you find that the overall direction is correct, then I will move it into a doc to continue with more focused threads of discussion.
_Goal_: a Listener/FilterChain should remain in the warming state until all of its extensions have been warmed up.

_Outline_:
1. Hierarchy of Init::Managers
2. Use of Init::Manager by extension factories
3. Changes to Init::Manager
4. wasm extensions

**Hierarchy of Init::Managers**

_Current behaviour_:
- a Listener has only one Init::Manager for everything under it, i.e. ListenerFilter, AccessLogger, NetworkFilter, HttpFilter extensions and their xDS

_Proposed behaviour_:
- a Listener should have a hierarchy of Init::Managers:
  - one Init::Manager for the Listener itself
    - this Init::Manager will be used by ListenerFilter and AccessLogger extensions (at listener level) and the relevant xDS
  - one Init::Manager per FilterChain
    - these Init::Managers will be used by NetworkFilter, HttpFilter and AccessLogger extensions (at filter level) and the relevant xDS

_Motivation_:
- the hierarchy of Init::Managers is meant to address #12183
- it allows NetworkFilter/HttpFilter extensions to always register their init targets under the FilterChain-specific Init::Manager
- the behaviour of the FilterChain-specific Init::Managers will depend on the "warming mode":
  - in the "default" warming mode (current behaviour):
    - the Init::Manager of the Listener will be waiting for the Init::Managers of the FilterChains
    - the Listener will be considered "warmed up" (ready to accept connections) only after all ListenerFilter, AccessLogger, NetworkFilter and HttpFilter extensions have been warmed up and the relevant xDS resources fetched
  - in the "on demand" warming mode (#12183):
    - the Init::Manager of the Listener will not wait for the Init::Managers of the FilterChains
    - the Listener will be considered "warmed up" (ready to accept connections) after:
      - ListenerFilter and AccessLogger extensions (at listener level) have been warmed up and the relevant xDS resources fetched
      - the Init::Managers of all FilterChains have been initialized (kicked off but not ready)
    - worker threads will have to pause connections until the Init::Manager of the respective FilterChain becomes ready
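As a rough illustration of the intended wiring in the "default" warming mode (SimpleInitManager and its methods are made up for this sketch; the real Init::Manager API looks different), the per-FilterChain manager completes a target registered with the Listener-level manager:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for Init::Manager, only to show the wiring.
class SimpleInitManager {
public:
  using ReadyCb = std::function<void()>;

  // Register an async init target; the returned callback is invoked by the
  // target once it has finished warming.
  ReadyCb createTarget(const std::string& name) {
    ++pending_;
    return [this, name] {
      std::cout << name << " ready\n";
      if (--pending_ == 0 && on_ready_) on_ready_();
    };
  }

  // Start watching; fires immediately if there are no pending targets.
  void initialize(ReadyCb on_ready) {
    on_ready_ = std::move(on_ready);
    if (pending_ == 0) on_ready_();
  }

private:
  int pending_{0};
  ReadyCb on_ready_;
};

int main() {
  SimpleInitManager listener_manager; // Init::Manager of the Listener itself
  SimpleInitManager chain_manager;    // Init::Manager of one FilterChain

  // A FilterChain-level extension (e.g. a wasm HTTP filter) registers its init target.
  auto wasm_ready = chain_manager.createTarget("wasm_http_filter");

  // "default" warming mode: the Listener waits for the FilterChain's manager.
  chain_manager.initialize(listener_manager.createTarget("filter_chain_0"));
  listener_manager.initialize([] { std::cout << "listener warmed up\n"; });

  // Later, e.g. once the remote wasm module has been fetched and configured:
  wasm_ready(); // FilterChain becomes ready -> Listener becomes ready
}
```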
**Use of Init::Manager by extension factories**

To support async initialization, extension factories should use the Init::Manager from the *Context object passed in as an argument (namely, ListenerFilter and AccessLogger extensions (at listener level) should use the Init::Manager of the Listener, while other extensions should use the Init::Manager of the FilterChain).
E.g., in the case of HttpFilter extensions:
- HttpFilterFactory implementations should return shared_ptr<Http::FilterFactoryCb> (instead of Http::FilterFactoryCb at the moment)
- if initialization is asynchronous, the HttpFilterFactory should:
  - register an init target with the Init::Manager (of the FilterChain)
  - return a shared_ptr(nullptr) (the assumption is that the extension should not be used anyway until warming is complete)
- once initialization succeeds, the HttpFilterFactory should:
  - fill the shared_ptr with the actual implementation
  - call ready(Success) on its init target
- if initialization fails, the HttpFilterFactory should call ready(Failed)
- once the Init::Manager of the FilterChain becomes ready:
  - if there is a Failed init target, consider the overall FilterChain configuration invalid:
    - in the "default" warming mode, propagate ready(Failed) to the Init::Manager of the Listener, which should mean that the overall Listener configuration is invalid and such a Listener should be discarded
    - in the "on demand" warming mode, replace that filter chain with a stub rejecting TCP connections
  - if all init targets report Success:
    - in the "default" warming mode, propagate ready(Success) to the Init::Manager of the Listener
    - in the "on demand" warming mode, consider the FilterChain ready to accept connections

A rough sketch of this flow is shown after the NOTE below.

NOTE:
"default" warming mode, none of the init targets is allowed to fail; otherwise the entire Listener will be deemed invalid and will be discarded"on demand" warming mode, none of the init targets of a FilterChain is allowed to fail; otherwise the entire FilterChain will be deemed invalid and will be replaced by a stub rejecting TCP connectionsFilterChain are using the same Init::Manager. When one of the init targets calls ready(Failed), there is not enough context to understand which particular extension has failed and apply more local corrective actionInit::Manager_Current behaviour_:
**Changes to Init::Manager**

_Current behaviour_:
- Init::Manager and Init::Targets have only 1 outcome - ready()

_Proposed behaviour_:
- Init::Manager should be parameterized to support various outcomes:
  - Init::Manager<T> where T is enum { Done }, to represent the current behaviour
    - ready() calls should be replaced with ready(Done)
  - Init::Manager<T> where T is enum { Success, Failure }, to represent the behaviour of this proposal
    - ready() calls should be replaced with ready(Success) and ready(Failure)
- the Init::Manager implementation should be parameterized with a strategy to aggregate outcomes, e.g.:
  - Done + Done = Done
  - Success + Failure = Failure
  - Success + Failure = Failure and notify the watcher immediately
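One possible shape of that parameterization, sketched as a hypothetical template (the existing Init::Manager is not a template; InitManagerT and the callable aggregation strategy are made up here):

```cpp
#include <functional>
#include <iostream>

// Hypothetical parameterized manager: T is the outcome type, Aggregate combines them.
template <typename T>
class InitManagerT {
public:
  using Aggregate = std::function<T(T /*acc*/, T /*next*/)>;
  using Watcher = std::function<void(T)>;

  InitManagerT(T initial, Aggregate aggregate)
      : acc_(initial), aggregate_(std::move(aggregate)) {}

  // Each target reports its own outcome exactly once.
  std::function<void(T)> createTarget() {
    ++pending_;
    return [this](T outcome) {
      acc_ = aggregate_(acc_, outcome);
      if (--pending_ == 0 && watcher_) watcher_(acc_);
    };
  }

  void initialize(Watcher watcher) {
    watcher_ = std::move(watcher);
    if (pending_ == 0) watcher_(acc_);
  }

private:
  T acc_;
  Aggregate aggregate_;
  Watcher watcher_;
  int pending_{0};
};

enum class Outcome { Success, Failure };

int main() {
  // Success + Failure = Failure (one failed target poisons the whole FilterChain).
  InitManagerT<Outcome> chain(Outcome::Success, [](Outcome a, Outcome b) {
    return (a == Outcome::Failure || b == Outcome::Failure) ? Outcome::Failure
                                                            : Outcome::Success;
  });

  auto wasm_filter = chain.createTarget();
  auto access_log = chain.createTarget();
  chain.initialize([](Outcome o) {
    std::cout << (o == Outcome::Success ? "FilterChain warmed\n"
                                        : "FilterChain config invalid\n");
  });

  wasm_filter(Outcome::Success);
  access_log(Outcome::Failure); // -> "FilterChain config invalid"
}
```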
**wasm extensions**

- a wasm extension requires initialization on the main thread and on each of the worker threads
- a wasm extension should complete initialization on all threads (main + all worker threads), and only after that should it consider itself warmed and report that to the relevant Init::Manager
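For the per-thread requirement, a minimal sketch of the idea (hypothetical; real code would post work through Envoy's dispatchers/TLS rather than spawn raw std::threads): the extension's init target completes only after the VM has started on the main thread and on every worker.

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical: mark the wasm extension's init target ready only after the wasm
// VM has been started on the main thread and on every worker thread.
int main() {
  const int num_workers = 4;
  std::atomic<int> remaining{num_workers + 1}; // main thread + workers
  auto ready_success = [] { std::cout << "wasm extension warmed on all threads\n"; };

  auto start_vm_on_this_thread = [&](const char* name) {
    // ... create a per-thread Wasm VM, run proxy_on_vm_start / proxy_on_configure ...
    std::cout << "wasm VM started on " << name << "\n";
    if (--remaining == 0) ready_success(); // last thread completes the init target
  };

  start_vm_on_this_thread("main");

  std::vector<std::thread> workers;
  for (int i = 0; i < num_workers; ++i)
    workers.emplace_back(start_vm_on_this_thread, "worker");
  for (auto& w : workers) w.join();
}
```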
Hi yskopets, great proposal. I just want to introduce the current progress of the on-demand filter chain as well.
As you mentioned, in the default warming mode, the listener's init manager will wait for the filter chain's init manager.
In the current implementation of the on-demand filter chain, the filter chain will not pass its init manager to the transport_factory_context. Instead, we just provide an empty placeholder. Hence the listener's init manager will not include any dependencies of this filter chain as targets.
Instead, a PerFilterChainRebuilder, which includes an init manager, has been implemented for passing the init manager and getting callbacks from dependencies later, when rebuilding the real filter chain. This might be the hierarchy of init managers you mentioned.
@ASOPVII Having per-FilterChain Init::Managers from the beginning (as opposed to only at the time of PerFilterChainRebuilder) is meant to:
- complete Listener config processing while still in the context of onUpdate of the LDS subscription (so that an invalid Listener config could be rejected asap)
- avoid processing FilterChain configs twice (once without an Init::Manager and another time with an Init::Manager)
- avoid worker (thread) -> main (thread) communication in the "on demand" warming mode:
  - assuming that every FilterChain has to be initialized anyway, the main thread can kick off FilterChain initialization without waiting for an explicit request for that (and then handle deduplication and so on)
  - worker threads could use a thread-local waiting list to hold paused connections
  - once a FilterChain has been warmed, the main thread can use Envoy's TLS mechanism to deliver the new Listener state to every worker thread and resume the connections in the local waiting list (see the sketch below)

I'm not saying that you shouldn't use PerFilterChainRebuilder and worker (thread) -> main (thread) communication. Just explaining what other options this proposal can give you.
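A very rough, single-threaded sketch of that waiting-list idea (WorkerFilterChainState and resumeWithWarmedChain are made up; the actual TLS delivery is elided):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical per-worker state for one FilterChain.
struct WorkerFilterChainState {
  bool warmed{false};
  std::vector<std::string> waiting_connections; // thread-local waiting list

  void onNewConnection(const std::string& conn) {
    if (!warmed) {
      waiting_connections.push_back(conn); // pause until the chain is warmed
      return;
    }
    serve(conn);
  }

  // Called when the main thread delivers the warmed FilterChain via TLS.
  void resumeWithWarmedChain() {
    warmed = true;
    for (auto& conn : waiting_connections) serve(conn);
    waiting_connections.clear();
  }

  static void serve(const std::string& conn) {
    std::cout << "serving " << conn << "\n";
  }
};

int main() {
  WorkerFilterChainState chain;
  chain.onNewConnection("conn-1"); // arrives while the chain is still warming
  chain.onNewConnection("conn-2");
  chain.resumeWithWarmedChain();   // main thread: chain warmed -> flush the list
  chain.onNewConnection("conn-3"); // served immediately
}
```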
@yskopets
> - assuming that every FilterChain has to be initialized anyway,
Sorry, but avoiding exactly that is the core value of on-demand.
What's more, I am not sure I get the story of init failure. Under the LDS scenario, why should the listener address the failure?
If the LDS server cannot respond to a failure, the listener should not be aware of that failure.
A complete-or-block init manager supports the listener behaviour that the listener is always fully initialized: you either see the new listener and drain the old listener, or you stay at the old listener. Which connections should you drain when the listener is only partially ready once you make the init manager asynchronous?
@lambdai
From the description of #12183 I understand that the problem it is solving is how to enable individual FilterChains to start serving requests as soon as they are ready. I assume that you still need all FilterChains, just at a different speed.
Regarding why to handle configuration errors and discard invalid Listener configurations:
- To release unused resources in a timely manner

  wasm extensions have non-negligible overhead:
  - memory for the Wasm VM: each worker thread needs a separate instance of the Wasm VM

  There is no reason to hold this memory if we know that the Listener will never become ready to serve requests.
- To make sense of the xDS protocol and the metrics inferred from it

  If Envoy ACKs a Listener configuration after just 1 second but in the background spends minutes trying to warm it up and silently fails in the end, it doesn't help with understanding the status of the system and where to look for an error (in Envoy / in the control plane / in user-defined policies / etc.?).

Regarding draining the old listener and making the init manager asynchronous, I don't understand what you mean:
- if the new Listener gets discarded, the old listener will not be drained
- Wasm VMs get fully initialized before the Listener is considered warmed up, and only then is it time to drain the old listener

At a high level this proposal makes sense to me and I think we should move it into a doc which also covers ECDS (sorry I was calling it FCDS before) as well as #12183. See also: https://docs.google.com/document/d/1OT7L5GbsJ03njPsSscZVnh-bz-t80z4rzdX8a4ICd8A/edit?ts=5f20b724.
A few comments:
@yskopets
> I assume that you still need all FilterChains, just at a different speed.
I am afraid not. There are several follow-ups on top of the on-demand feature. There is no concrete roadmap, but they definitely are unblocked once we have on-demand.
@mattklein123
From what we discussed offline I think what you want is
@yskopets
I feel you may want to add the asynchronous factory interface before introducing the failure callback at the init manager.
The failure callback at the init manager is an optimization; it could be done later.
Currently we can simulate a failure by calling ready() but providing a non-working stub such as NotReadyTransportSocket (see the sketch below). https://github.com/envoyproxy/envoy/blob/a5d56f5c2e027fa3a3965487a06035de272648df/source/extensions/transport_sockets/tls/ssl_socket.cc#L28
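For illustration, a minimal sketch of that workaround in the same spirit as NotReadyTransportSocket (all types here are simplified stand-ins, not Envoy's real filter interfaces): the factory reports ready() immediately but serves a stub that rejects traffic until the real implementation is in place.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Simplified stand-ins: a "filter" here is just a callable handling one connection.
using Filter = std::function<void()>;
using FilterFactoryCb = std::function<Filter()>;

// Hypothetical NotReadyFilter: accepted into the chain, but refuses to do work.
Filter makeNotReadyFilter() {
  return [] { std::cout << "extension not ready: closing connection\n"; };
}

// The factory completes its init target immediately (plain ready()), but serves a
// stub until the real implementation becomes available.
FilterFactoryCb createFactoryWithStub(std::shared_ptr<Filter> real_filter) {
  return [real_filter]() -> Filter {
    if (real_filter && *real_filter) return *real_filter; // warmed: use the real filter
    return makeNotReadyFilter();                          // not warmed: reject traffic
  };
}

int main() {
  auto real = std::make_shared<Filter>(); // empty until warming finishes
  auto factory = createFactoryWithStub(real);

  factory()();                            // -> "extension not ready..."
  *real = [] { std::cout << "wasm filter handling connection\n"; };
  factory()();                            // -> real filter
}
```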
@lambdai yes. Please look at https://github.com/envoyproxy/envoy/blob/5fb236b4c3376e2b761d540a54b4772d620740aa/api/envoy/config/core/v3/extension.proto#L48-L56 and discuss with @kyessenov. We need to be consistent with how we handle failure across all of these features.