Envoy: [circuit_breakers] MaxConnections not being honored with http1 traffic

Created on 28 Mar 2018 · 6 Comments · Source: envoyproxy/envoy

Description:

Context: https://envoyproxy.slack.com/archives/C78M4KW76/p1522105647000209

Been playing around with circuit breaking and found what I think to be some unexpected behavior with the max connections circuit breaker. When setting the value to something small like 0 or 1, I'd expect no requests to make it to the upstream cluster. However, I'm seeing that they do.

Repro steps:
I've been testing with the front-proxy example using the following configuration: front-envoy.yaml. Running this, I'd expect requests to start getting 503'd (with the x-envoy-overloaded header set) after the first few, but they end up making it to the upstream service1 host.
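
For context, here is a minimal sketch of the kind of cluster circuit_breakers block being tested (this is not the attached front-envoy.yaml; the cluster name and max_connections: 1 mirror the admin output further down, and the rest is illustrative v2-API boilerplate):

```yaml
clusters:
- name: service1
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  circuit_breakers:
    thresholds:
    # DEFAULT priority: cap the cluster at a single upstream connection
    - priority: DEFAULT
      max_connections: 1
  hosts:
  - socket_address:
      address: service1
      port_value: 80
```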

I ran some curl commands like so:

$ (printf '%s\n' {1..10}) | xargs -I % -P 20 curl -s "http://localhost:8000/service/1"
Hello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4
Hello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4
upstream connect error or disconnect/reset before headersHello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4
upstream connect error or disconnect/reset before headersHello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4
Hello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4
upstream connect error or disconnect/reset before headersHello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4
Hello from behind Envoy (service 1)! hostname: f2e2cdd50d1b resolvedhostname: 172.18.0.4

(Note: the upstream connect errors were a result of the upstream rejecting connections, not Envoy applying backpressure.)

This gave me the following stats:

curl -s localhost:8001/stats | grep service1.upstream
cluster.service1.upstream_cx_active: 2
cluster.service1.upstream_cx_close_notify: 0
cluster.service1.upstream_cx_connect_attempts_exceeded: 0
cluster.service1.upstream_cx_connect_fail: 0
cluster.service1.upstream_cx_connect_timeout: 0
cluster.service1.upstream_cx_destroy: 0
cluster.service1.upstream_cx_destroy_local: 0
cluster.service1.upstream_cx_destroy_local_with_active_rq: 0
cluster.service1.upstream_cx_destroy_remote: 0
cluster.service1.upstream_cx_destroy_remote_with_active_rq: 0
cluster.service1.upstream_cx_destroy_with_active_rq: 0
cluster.service1.upstream_cx_http1_total: 2
cluster.service1.upstream_cx_http2_total: 0
cluster.service1.upstream_cx_idle_timeout: 0
cluster.service1.upstream_cx_max_requests: 0
cluster.service1.upstream_cx_none_healthy: 0
cluster.service1.upstream_cx_overflow: 14
cluster.service1.upstream_cx_protocol_error: 0
cluster.service1.upstream_cx_rx_bytes_buffered: 280
cluster.service1.upstream_cx_rx_bytes_total: 4766
cluster.service1.upstream_cx_total: 2
cluster.service1.upstream_cx_tx_bytes_buffered: 0
cluster.service1.upstream_cx_tx_bytes_total: 4440

Note that cluster.service1.upstream_cx_active is 2, which I'd expect to be capped at 1.

And from clusters admin:

$ curl -s localhost:8001/clusters
version_info::static
service1::default_priority::max_connections::1
service1::default_priority::max_pending_requests::1024
service1::default_priority::max_requests::1024
service1::default_priority::max_retries::3
service1::high_priority::max_connections::1024
service1::high_priority::max_pending_requests::1024
service1::high_priority::max_requests::1024
service1::high_priority::max_retries::3
service1::added_via_api::false
service1::172.18.0.4:80::cx_active::2
service1::172.18.0.4:80::cx_connect_fail::0
service1::172.18.0.4:80::cx_total::2
service1::172.18.0.4:80::rq_active::0
service1::172.18.0.4:80::rq_error::5
service1::172.18.0.4:80::rq_success::15
service1::172.18.0.4:80::rq_timeout::0
service1::172.18.0.4:80::rq_total::20
service1::172.18.0.4:80::health_flags::healthy
service1::172.18.0.4:80::weight::1

As mentioned in Envoy Slack, I imagine this has to do with connection warming bypassing the resource manager that tracks max connections, but I may just be misunderstanding the semantics of this circuit breaker.

Label: bug

All 6 comments

I honestly don't recall why this code is the way it is, but it's because of this logic block: https://github.com/envoyproxy/envoy/blob/master/source/common/http/http1/conn_pool.cc#L95

I would probably call this a bug, but would need to page back in why the code is doing this (assuming there was a good reason at some point).

Referring to the statistics above, cluster.service1.upstream_cx_overflow: 14 seems to be a red herring. Is it supposed to contribute to 5xx counters? If so, under what conditions does it currently do so?
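
One way to sanity-check that locally is to compare the overflow counters against the upstream 5xx counters for the cluster via the admin stats endpoint; the stat names below are standard cluster stats, though exact availability can vary by Envoy version:

```sh
# Overflow counters vs. upstream 5xx responses observed for the service1 cluster
curl -s localhost:8001/stats | \
  grep -E 'cluster\.service1\.(upstream_cx_overflow|upstream_rq_pending_overflow|upstream_rq_5xx)'
```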

I think this is fixed/consistent now with the recent conn pool refactor so will close this for now.

@mattklein123 we are not only seeing counters being incremented but also more connections than expected being permitted. We posted a Stack Overflow question here outlining the details. Should Envoy be enforcing strict limits on max_connections, or only approximate ones (+/- a few)?

Please try again with current master. The HTTP1/HTTP2 connection pool implementations have been unified and the logic is now shared. cc @ggreenway

@rohaldbUni There are a couple things going on that may be affecting your test results:

  • In order to prevent requests from getting "stuck" (queued in a way where they won't ever get processed), Envoy will allow 1 connection to each instance in the upstream pool, per Envoy worker. So if you have a cluster of 100 instances and Envoy concurrency set to 10 (the default is the number of CPUs on the box running Envoy), Envoy will allow up to 1000 concurrent requests if the requests are distributed perfectly between workers and load balanced perfectly between all instances in the cluster. If it is important to your use case to enforce low limits, try it with Envoy concurrency set to 1 (see the command sketch after this list).

  • The circuit breakers are intended to prevent too much load from propagating through the system, not enforce a strict limit. The system is implemented in a way that is simpler and more performant, but can slightly exceed the limits in some cases. Here's a comment from the implementation of the circuit breaker limit tracking:
    ```cpp
    /**
     * Implementation of ResourceManager.
     * NOTE: This implementation makes some assumptions which favor simplicity over correctness.
     * 1) Primarily, it assumes that traffic will be mostly balanced over all the worker threads
     *    since no attempt is made to balance resources between them. It is possible that
     *    starvation can occur during high contention.
     * 2) Though atomics are used, it is possible for resources to temporarily go above the
     *    supplied maximums. This should not effect overall behavior.
     */
    ```
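
As a concrete example of the first point above, here is a sketch of running Envoy with a single worker so the per-worker, per-host connection allowance doesn't multiply the effective limit (shown as a direct invocation; in the front-proxy example the flag would go on the container's envoy command line):

```sh
# With a single worker thread, the max_connections breaker is only "stretched"
# by the one-connection-per-host allowance, not by the worker count as well.
envoy -c front-envoy.yaml --concurrency 1
```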