Hi,
At Atlassian we are a pretty big user of envoy (a few thousand services). We recently upgraded from 1.12.3 to 1.14.1 and we discovered that the circuit breaker behavior, given our existing configuration, had changed.
Our config up to that point had intended to disable circuit breakers on the ingress by default, and so we had:
circuit_breakers:
thresholds:
- priority: "DEFAULT"
max_connections: 0xffffffff
We discovered that as of 1.14.1 there was a circuit breaker that was limiting the number of concurrent connections through the ingress. We believe this came about as a result of this change:
upstream: combined HTTP/1 and HTTP/2 connection pool code. This means that circuit breaker limits for both requests and connections apply to both pool types. Also, HTTP/2 now has the option to limit concurrent requests on a connection, and allow multiple draining connections. The old behavior is deprecated, but can be used during the deprecation period by disabling runtime feature envoy.reloadable_features.new_http1_connection_pool_behavior or envoy.reloadable_features.new_http2_connection_pool_behavior and then re-configure your clusters or restart Envoy. The behavior will not switch until the connection pools are recreated. The new circuit breaker behavior is described here.
In order for us to restore the same behavior we needed to change our config to:
circuit_breakers:
thresholds:
- priority: "DEFAULT"
max_connections: 0xffffffff
max_requests: 0xffffffff
We got lucky on this, because one of our teams happened to be doing load testing just when this change was undergoing soaking in a development environment. They reported that during their test the success rate of the calling service went from 98% to 4% or so. As a result, the version bump never got rolled out all the way, but it was a close call for us.
We reviewed the release notes at the time, but the full implications of the change did not occur to us. We'd like to propose that in the future backwards incompatible changes be spelled out more clearly in the release notes, and when changes to config are required to preserve existing behavior those instructions are given explicitly as well.
Probably @alyssawilk has the context for this?
@ggreenway
We reviewed the release notes at the time, but the full implications of the change did not occur to us. We'd like to propose that in the future backwards incompatible changes be spelled out more clearly in the release notes, and when changes to config are required to preserve existing behavior those instructions are given explicitly as well.
Yes, this is good feedback. We actually had a similar problem at Lyft and had to temporarily revert the new connection pool behavior. I think we could have done a better job here of calling out the implications of this change. cc @tonya11en
One idea is to break up the release notes into some different sections: new features, deprecations, behavior changes from a previous version, and configuration changes. There would be some overlap sometimes, but some duplication would be ok.
When someone is evaluating a version upgrade, the first question is usually "will anything break", and they could answer that by looking at behavior and config changes sections, which hopefully is much shorter (and thus easier to digest) than what we have now.
I like the idea of breaking up the release notes into different sections.
I really like this new separation of concerns! I think it will make it much easier for us in the future to gauge the impact of a new version.
I really like this new separation of concerns! I think it will make it much easier for us in the future to gauge the impact of a new version.
Thanks! Let us know if there are any more suggestions to further clarify things.
Closing this issue, as I don't think there's anything further to do here.
Most helpful comment
I really like this new separation of concerns! I think it will make it much easier for us in the future to gauge the impact of a new version.