Please describe your use case / problem.
Hi,
we are seeing some stale connections on our k8s cluster between Ambassador and upstream services. There are not many of them, but they affect our SLA. The root cause is probably that something in the middle (conntrack, IPVS) drops the connection, and when Envoy tries to reuse that removed connection we get an RST.
Describe the solution you'd like
Enable TCP keepalive socket options in Envoy.
Describe alternatives you've considered
no alternatives
Initial PR (updated): https://github.com/datawire/ambassador/pull/1994
We also need TCP keepalive on upstream connections.
I've tested the PR and it works. In addition, we would like a way to enable keepalives without having to specify these values, so that kernel defaults apply, and a way to configure keepalives as a global default rather than (necessarily) on each mapping individually.
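For anyone following along, here is a minimal sketch of the difference between "keepalive on with kernel defaults" and "keepalive on with explicit values" at the socket level. It is plain Python, not Envoy or Ambassador code. The helper name `enable_keepalive` and its parameters are made up for illustration, and the `TCP_KEEP*` option names are the Linux ones, so they may be missing on other platforms.

```python
import socket


def enable_keepalive(sock, time=None, interval=None, probes=None):
    """Turn on TCP keepalive for sock (hypothetical helper).

    Any argument left as None is not set, so the kernel default
    for that value stays in effect.
    """
    # SO_KEEPALIVE alone enables probing with the kernel defaults.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    overrides = (
        ("TCP_KEEPIDLE", time),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in overrides:
        opt = getattr(socket, name, None)  # Linux names; may not exist elsewhere
        if value is not None and opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)  # kernel defaults
        print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

Calling `enable_keepalive(s, time=10, interval=1, probes=100)` instead would override all three values for that one socket, which is the per-connection analogue of setting them on a single mapping.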
I just added global config support for keepalive :)
Here is how you can use keepalive as a global configuration:
---
apiVersion: v1
kind: Service
metadata:
  labels:
    service: ambassador
  name: ambassador
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: ambassador/v1
      kind: Module
      name: ambassador
      config:
        keepalive:
          time: 2
          interval: 2
          probes: 100
spec:
  type: ClusterIP
  ports:
  - port: 443
    name: ambassador-https
    targetPort: 8443
  selector:
    service: ambassador
or on a per-service basis:
apiVersion: ambassador/v1
kind: Mapping
name: tour-backend_mapping
connect_timeout_ms: 3000
prefix: /backend/
service: tour:8080
labels:
  ambassador:
  - request_label:
    - backend
keepalive:
  time: 10
  interval: 1
  probes: 100
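As a rough guide to what these numbers mean in practice, the sketch below estimates how long a dead peer can go unnoticed. It assumes the usual Linux keepalive semantics (first probe after `time` idle seconds, then one probe every `interval` seconds, connection dropped after `probes` unanswered probes) and assumes the three values reach the socket unchanged. The function name is made up for illustration.

```python
def worst_case_detection_seconds(time, interval, probes):
    """Approximate seconds until an idle connection to a dead peer is dropped.

    Assumes Linux-style keepalive: wait `time` seconds of idleness, then
    send up to `probes` probes spaced `interval` seconds apart.
    """
    return time + interval * probes


if __name__ == "__main__":
    # Values copied from the two configurations above.
    print("global :", worst_case_detection_seconds(time=2, interval=2, probes=100))
    print("mapping:", worst_case_detection_seconds(time=10, interval=1, probes=100))
```

With these settings the first probe goes out quickly, but because `probes` is 100 it still takes 202 seconds (global) or 110 seconds (mapping) before a connection to an unresponsive peer is actually dropped. If the goal is mostly to keep conntrack/IPVS entries alive, that is fine; if the goal is fast failure detection, a smaller `probes` value would be the knob to look at.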
I see the expected settings in the Envoy configuration, and the underlying TCP connections are sending ACKs to the upstream:
22:59:26.486121 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36200: Flags [.], ack 1, win 243, options [nop,nop,TS val 3058661490 ecr 707636676], length 0
22:59:28.278315 IP ambassador-6f47699486-2wwtv.36206 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 237, options [nop,nop,TS val 707654864 ecr 3058661234], length 0
22:59:28.278411 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36206: Flags [.], ack 1, win 235, options [nop,nop,TS val 3058663282 ecr 707636452], length 0
22:59:28.534090 IP ambassador-6f47699486-2wwtv.36200 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 245, options [nop,nop,TS val 707655119 ecr 3058661490], length 0
22:59:28.534206 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36200: Flags [.], ack 1, win 243, options [nop,nop,TS val 3058663538 ecr 707636676], length 0
22:59:28.534314 IP ambassador-6f47699486-2wwtv.36152 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 237, options [nop,nop,TS val 707655120 ecr 3058661490], length 0
22:59:28.534395 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36152: Flags [.], ack 1, win 235, options [nop,nop,TS val 3058663538 ecr 707630585], length 0
22:59:30.326440 IP ambassador-6f47699486-2wwtv.36206 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 237, options [nop,nop,TS val 707656912 ecr 3058663282], length 0
22:59:30.326548 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36206: Flags [.], ack 1, win 235, options [nop,nop,TS val 3058665330 ecr 707636452], length 0
22:59:30.582205 IP ambassador-6f47699486-2wwtv.36200 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 245, options [nop,nop,TS val 707657167 ecr 3058663538], length 0
22:59:30.582207 IP ambassador-6f47699486-2wwtv.36152 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 237, options [nop,nop,TS val 707657167 ecr 3058663538], length 0
22:59:30.582288 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36200: Flags [.], ack 1, win 243, options [nop,nop,TS val 3058665586 ecr 707636676], length 0
22:59:30.582288 IP tour.default.svc.cluster.local.8080 > ambassador-6f47699486-2wwtv.36152: Flags [.], ack 1, win 235, options [nop,nop,TS val 3058665586 ecr 707630585], length 0
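To check that the spacing in a trace like this matches the configured values, something along these lines can help. It is only a sketch: the regular expression is written for the tcpdump line format shown above and nothing more general, and the function name is made up. The sample lines are copied from the trace.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches bare-ACK lines in the tcpdump format shown above.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > (\S+): Flags \[\.\]")


def probe_gaps(lines, sender_prefix):
    """Return {source endpoint: [seconds between its consecutive bare ACKs]}."""
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2).startswith(sender_prefix):
            seen[m.group(2)].append(datetime.strptime(m.group(1), "%H:%M:%S.%f"))
    return {
        src: [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        for src, stamps in seen.items()
    }


SAMPLE = [
    "22:59:28.278315 IP ambassador-6f47699486-2wwtv.36206 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 237, options [nop,nop,TS val 707654864 ecr 3058661234], length 0",
    "22:59:28.534090 IP ambassador-6f47699486-2wwtv.36200 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 245, options [nop,nop,TS val 707655119 ecr 3058661490], length 0",
    "22:59:30.326440 IP ambassador-6f47699486-2wwtv.36206 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 237, options [nop,nop,TS val 707656912 ecr 3058663282], length 0",
    "22:59:30.582205 IP ambassador-6f47699486-2wwtv.36200 > tour.default.svc.cluster.local.8080: Flags [.], ack 1, win 245, options [nop,nop,TS val 707657167 ecr 3058663538], length 0",
]

if __name__ == "__main__":
    for src, gaps in probe_gaps(SAMPLE, "ambassador-").items():
        print(src, [round(g, 3) for g in gaps])
```

For the two Ambassador-side ports in the sample, the gap between consecutive probes comes out at roughly 2.05 seconds, which lines up with the 2-second values in the global configuration.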
❤️ ❤️ ❤️ ❤️
Added documentation.
I would normally ask for a test, but I'm not coming up with any simple way to write that test, so I'm gonna go ahead and accept it. 😂
Thank you @kflynn. I promise that before the next PR I will prepare my local test env :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.