In order to support filter chain updates without forcing a full listener drain, it has been suggested that we add a FilterChain Discovery Service (FCDS). This would allow individual FilterChains belonging to a Listener to be independently discovered and updated.
This would also lend support to finer-grained Listener resource naming for incremental and on-demand xDS (generalizing https://github.com/envoyproxy/envoy/issues/2500).
Opening this issue to track any design work and ownership.
@ggreenway @andraxylia @lizan @PiotrSikora @mattklein123
We need this fix for Istio for long-lived connections with SNI. @rshriram @costinm
Related add-on feature: lazy-loading of FilterChains, triggered by a listener-filter that is watching for unknown SNI values.
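As a rough sketch of what that lazy-loading hook could look like (all names here -- `FcdsClient`, `SniInspectorFilter`, `onSniParsed` -- are hypothetical placeholders, not the actual Envoy listener-filter API): a listener filter peeks at the SNI from the ClientHello and, if no loaded FilterChain matches, requests the chain on demand before letting filter chain selection continue.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical sketch only; these are illustrative stand-ins for real Envoy types.
struct FilterChain { /* filter factories, TLS context, etc. */ };

// Client for an on-demand FilterChain discovery request.
class FcdsClient {
public:
  using Callback = std::function<void(std::shared_ptr<const FilterChain>)>;
  // Ask the management server for the chain matching this SNI; invoke cb when it
  // arrives (or with nullptr on failure/timeout).
  void requestChain(const std::string& sni, Callback cb) {
    (void)sni;
    cb(nullptr);  // Stub: a real client would issue an async xDS request.
  }
};

class SniInspectorFilter {
public:
  SniInspectorFilter(
      std::unordered_map<std::string, std::shared_ptr<const FilterChain>>& known,
      FcdsClient& fcds)
      : known_chains_(known), fcds_(fcds) {}

  // Called once the SNI has been parsed from the ClientHello. Returns true if the
  // connection can continue immediately, false if it must wait for FCDS.
  bool onSniParsed(const std::string& sni, std::function<void()> resume) {
    if (known_chains_.count(sni) != 0) {
      return true;  // A matching chain is already loaded.
    }
    // Unknown SNI: lazily fetch the chain, then resume filter chain selection.
    fcds_.requestChain(sni, [this, sni, resume](std::shared_ptr<const FilterChain> chain) {
      if (chain != nullptr) {
        known_chains_[sni] = std::move(chain);
      }
      resume();  // Re-run matching; fall back to the default chain if still absent.
    });
    return false;
  }

private:
  std::unordered_map<std::string, std::shared_ptr<const FilterChain>>& known_chains_;
  FcdsClient& fcds_;
};
```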
@htuch does the drain happen when a particular FilterChain is updated, and also when any FilterChain is added or deleted in the listener?
We need this fix for Istio for long-lived connections with SNI
@andraxylia the intention here is to only affect new connections, right? Obviously previous connections will use the old filter chain? I just want to make sure we are on the same page. If so this sounds like a useful and straightforward feature to me.
@mattklein123 All is good if a change in the FilterChain does not affect existing TCP and gRPC connections - for instance, the addition of a host in server_names in the FilterChainMatch.
@rshriram @costinm told me that because of issues with full drain we cannot move Istio from multiple listeners to 1 listener with multiple FilterChains.
I brainstormed with @htuch and here's what we propose for now:
Any comments or concerns with this plan?
I don't understand how filter chains interact with draining? Can't old connections use the old filter chain and new ones use the new filter chain? If we want to add draining on filter chain change I think that's totally orthogonal?
@mattklein123 Yes, that's basically the conclusion @htuch and I came to. There are two orthogonal feature requests/requirements here:
1) Be able to add/remove filter chains without draining. Right now this is not possible, purely due to the implementation of listener config changes. The API does not need to change to fix this.
2) Lazy-load filter chains.
The implementation of these two features is mostly/entirely non-intersecting.
OK, thanks for the clarification. IMO we don't ever need to implement draining on a per-filter-chain basis. I would track that as a totally separate issue? Fixing (1) should be simple compared to also having to do draining (not simple).
Actually, thinking through this, I think (2) would depend on (1) to work as expected. Modifying one of the filter chains via FCDS would need to update the listener and drain connections on that filter chain, but not drain connections on unchanged filter chains.
But even with lazy load, why is draining required? It seems like we can write the code so that a filter chain is latched onto a connection at initial connect time? Why is it required to drain? (I can see why someone might want that, but it seems not required).
That's an interesting point. I suppose it's not required. But it would be inconsistent with current behavior for listeners: there's not currently a way to modify/remove a listener without draining. But the answer to (1) may be as simple as adding a new DrainType of "hot-restart only". That's not quite as fine-grained, but would be much less effort for most of the potential gains. It would entirely solve the case of filterchain add/remove, but would not allow forcing draining for a filterchain modify.
FWIW, my view of filter chain is that it's something that is only consulted when a connection is created. As such, draining is not required. The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains). I agree that we could add a configuration option to listeners to not drain and just hard swap, though it seems like few would want this behavior?
The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains).
I don't understand why filter chains would be any different in this way. We still need to ensure that all of the filter configs, etc are not destroyed until all connections using them have closed. This could be done with shared_ptr or similar, but the same could be said for listeners (I'm not arguing that we want to do that for listeners, just that we could). I don't think I understand what you have in mind for the different lifetimes of listeners vs filter chains and why they're fundamentally different.
I don't think I understand what you have in mind for the different lifetimes of listeners vs filter chains and why they're fundamentally different.
The reason that I added draining for the listener implementation was honestly mainly simplicity, due to connection ownership. I agree that connections could live beyond the life of a listener, but it would be a very large change. Obviously doing nothing at all (neither draining nor shared_ptr ownership) would not be good for many users.
For filter chains, I think keeping the filter chain alive via shared pointer if necessary is substantially simpler, and if it were me that is how I would start. I don't object to adding draining in any way; I just don't think it adds much and it would be pretty complicated.
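To make the "latch at connect time, keep alive via shared_ptr" idea concrete, here is a minimal sketch (hypothetical classes, not the real Envoy `Listener`/`Connection`; real Envoy would deliver the current config through ThreadLocal slots rather than the mutex used here to keep the sketch self-contained):

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-ins for the real classes.
struct FilterChain { /* immutable filter factories, TLS context, ... */ };

class Connection {
public:
  explicit Connection(std::shared_ptr<const FilterChain> chain)
      : chain_(std::move(chain)) {}
  // The chain (and its filter configs) stays alive as long as this connection
  // holds the shared_ptr, even if the listener has since replaced it.
private:
  std::shared_ptr<const FilterChain> chain_;
};

class Listener {
public:
  // Called for each accepted socket: the connection latches whatever chain is
  // current at accept time.
  std::unique_ptr<Connection> onAccept() {
    std::shared_ptr<const FilterChain> chain;
    {
      std::lock_guard<std::mutex> lock(mu_);
      chain = current_chain_;
    }
    return std::make_unique<Connection>(std::move(chain));
  }

  // Called when FCDS (or an LDS update) replaces the chain: new connections see
  // the new config, existing connections keep the old one; no drain needed.
  void updateFilterChain(std::shared_ptr<const FilterChain> updated) {
    std::lock_guard<std::mutex> lock(mu_);
    current_chain_ = std::move(updated);
  }

private:
  std::mutex mu_;
  std::shared_ptr<const FilterChain> current_chain_ = std::make_shared<FilterChain>();
};
```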
Ok, that makes sense. I just wanted to make sure I understood your reasoning.
What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?
What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?
Yup, makes sense, and I don't think it would be that hard.
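A minimal sketch of how that check could look (the config struct below is a simplified hypothetical stand-in, not the real Listener proto): compare everything except the filter chains, and only fall back to a full drain when something else changed.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified listener config for illustration only.
struct FilterChainConfig { std::string match; std::string filters; };
struct ListenerConfig {
  std::string name;
  std::string address;
  std::vector<FilterChainConfig> filter_chains;
};

// True if the two configs differ only in their filter_chains, in which case the
// listener can swap chains in place and skip the drain entirely.
bool onlyFilterChainsChanged(const ListenerConfig& old_cfg, const ListenerConfig& new_cfg) {
  // Real listeners have many more fields; the idea is simply: compare every
  // field except filter_chains.
  return old_cfg.name == new_cfg.name && old_cfg.address == new_cfg.address;
}
```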
SGTM as well. FWIW, I hadn't grokked the full tradeoff between shared_ptr and drain in the Listener implementation, so thanks @mattklein123 for explaining this for posterity.
@mattklein123 I've been thinking this through, and simply using a shared_ptr for the filter chain will run into other issues, because right now ThreadLocal slots must be destroyed on the main thread. But I can probably make it post deletions to the main thread when the refcount goes to zero somehow.
Hmm, yeah, that's definitely a complexity with the stored lambdas. I think we could have some type of shared_ptr wrapper that on destruction posts back to the main thread for deletion. It's definitely not trivial but doable. I suppose if TLS is the only issue you could fix that to allow deletion from any thread, but I'm not sure whether that is worth it.
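A sketch of the wrapper being discussed, with a hypothetical `MainThreadDispatcher` standing in for the main thread's event loop (not Envoy's actual `Event::Dispatcher` API): the custom deleter posts the actual destruction back to the main thread, so a worker dropping the last reference never tears down ThreadLocal-owning state in place.

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for the main thread's event loop. The only assumption
// is that post() queues the closure to run on the main thread later.
class MainThreadDispatcher {
public:
  void post(std::function<void()> work);
};

// Wrap an object so that, when the last shared_ptr reference is released on any
// worker thread, its destructor actually runs on the main thread. The dispatcher
// must outlive every wrapped object.
template <class T>
std::shared_ptr<T> makeMainThreadDeleted(std::unique_ptr<T> obj,
                                         MainThreadDispatcher& main) {
  T* raw = obj.release();
  return std::shared_ptr<T>(raw, [&main](T* ptr) {
    // Last reference dropped (possibly on a worker): defer the delete, which may
    // destroy ThreadLocal slots, to the main thread.
    main.post([ptr]() { delete ptr; });
  });
}
```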
My two cents: the draining is totally optional. If we can have the old filter chain with the old connection live forever in the same listener, then, as Matt says, I see no reason to drain. Unless it's a security issue and I want to terminate old connections in a timely fashion, and not have any connections with expired TLS certs lingering around for ages.
Wouldn't not draining lead to a situation where we have long-running connections (e.g. HTTP/2, gRPC) that are latched to a filter chain that no longer exists, along with its outdated configuration, with no way to force those clients to re-connect and re-select a new filter chain?
It does, but the point is that unless these configurations pertain to some security concern, this "forcing" should be optional. In a way, we put the onus on the end user - if they want to take advantage of the newer config, they should reconnect. It gives them the option of staying with the old config or reconnecting.
End users don't know that there is a new config, how would they?
The old config might be pointing to non-existent RDS/CDS/EDS targets, in which case already-established connections will be permanently broken at the stream level, whereas new connections will work perfectly fine. I don't believe that's an acceptable state to be in.
@PiotrSikora I don't think anyone is saying we shouldn't implement filter chain draining. All we are saying is that it's extra complexity that is orthogonal to the main implementation. I would suggest we track as a separate issue once the primary feature is implemented.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
Ack: working on this
@silentdai would you be able to post a brief status update on this issue for followers? Thanks, Dave.
I have several draft PRs on the benchmark and the FCDS proto. Blocked by #7246.
I also have a draft design doc that I expect to publish by the end of this week.
One more thing to consider as part of this work is decoupling the filter chains from the filter chain matchers. This would not only help with the future on-demand work, it would also allow multiple filter chain matchers to refer to a single filter chain, instead of keeping a separate filter chain instance per matcher, since those instances can be exact copies in some cases.
fwiw, for our use case it would be great to allow multiple matchers to refer to the same chain and avoid the extra copies.
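One way to picture that decoupling (hypothetical structures, purely illustrative, not a concrete API proposal): matchers map match criteria to a chain name, and the listener keeps a single pool of named chains, so two matchers that reference the same name share one FilterChain instance instead of holding two copies.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of matcher/chain decoupling.
struct FilterChain { /* filter factories, TLS context, ... */ };

struct FilterChainMatcher {
  std::vector<std::string> server_names;  // e.g. SNI values to match.
  std::string chain_name;                 // Reference into the shared pool.
};

class FilterChainPool {
public:
  void addChain(const std::string& name, std::shared_ptr<const FilterChain> chain) {
    chains_[name] = std::move(chain);
  }

  // Several matchers can resolve to the same underlying chain instance.
  std::shared_ptr<const FilterChain> resolve(const FilterChainMatcher& m) const {
    auto it = chains_.find(m.chain_name);
    return it == chains_.end() ? nullptr : it->second;
  }

private:
  std::unordered_map<std::string, std::shared_ptr<const FilterChain>> chains_;
};
```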
@mattklein123 @lambdai do you think this is needed for 1.12.0 or can we do this as a v3 add-on once shipped?
Add-on is fine with me.
I was curious about the timeline for this feature (FCDS). Looks like a lot of the work has already been committed.