Describe the bug
After deploying Ambassador on an AKS cluster, Ambassador stops picking up service configuration changes after 5-10 minutes.
To Reproduce
Expected behavior
Service configurations continue to update.
Versions (please complete the following information):
+1 I was just about to write a bug report myself. I have the exact same issue in the exact same configuration (v0.40.2 and AKS with Kubernetes 1.11.5)
I even built the Ambassador Docker image myself and added a few extra log messages in kubewatch.py. It appears that the events from the Kubernetes API server don't reach kubewatch. It can't be a permission issue because it works for a few minutes after redeploying Ambassador.
We've seen this on v1.11.13, v1.11.14, and 1.9.11, in both RBAC and non-RBAC mode. This appears to be an issue with clusters deployed more recently, i.e., clusters deployed in September do not have this issue.
We're pinging AKS engineering on this. If others on this thread can open up AKS support tickets on this issue that would be helpful. This issue is easily reproducible on AKS, and does not seem to exist on other hosted Kubernetes providers.
This Slack bot, which also watches the kube-apiserver, does not appear to have any issue receiving events: https://github.com/bitnami-labs/kubewatch
As someone above reported, Ambassador's kubewatch stops receiving events on Azure after a couple of minutes, with no errors logged.
@richarddli Did you check if all of those clusters are running on the Moby engine? https://github.com/Azure/acs-engine/pull/3896
The following will output the docker engine version:
kubectl describe nodes | grep 'Container Runtime Version'
3.0.1 indicates the Moby engine.
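For a cluster with many nodes, the same check can be scripted. The following is a minimal sketch, assuming the input is the output of `kubectl get nodes -o json` and that the runtime version is reported under `status.nodeInfo.containerRuntimeVersion` (for example `docker://3.0.1`). The helper names `runtime_versions` and `is_moby` are hypothetical, and treating 3.0.1 as the Moby marker is taken from the comment above, not from any official list.

```python
import json

# Assumption from this thread: 3.0.1 is the version string reported by the Moby engine.
MOBY_VERSION = "3.0.1"


def runtime_versions(nodes_json: str) -> dict:
    """Map node name -> container runtime version from `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]: node["status"]["nodeInfo"]["containerRuntimeVersion"]
        for node in nodes
    }


def is_moby(runtime_version: str) -> bool:
    """Return True if the version string matches the Moby marker version."""
    # Drop a scheme prefix such as "docker://" if one is present.
    version = runtime_version.split("://", 1)[-1]
    return version == MOBY_VERSION
```

Feeding it the saved JSON and filtering with `is_moby` gives the list of nodes running the Moby engine.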
Per here, Moby went GA on all new node deployments on December 5th.
I opened a support case with Microsoft on the issue.
@HoveringHalibut Interesting find.
I just checked the other environments I've run Ambassador on to see the Docker version they're running.
AKS: 3.0.1 (Fork of 18.06)
GKE: 17.3.2
EKS: 18.6.1
Docker: 18.9.0
Minikube: 17.12.1-ce
Steps to reproduce:
We are experiencing the same thing. We had previously noticed #928 and built a custom Ambassador with the updated kube-client suggested in e5dcd66 applied to the 0.40.2 code, but the issue persisted.
I've run into this issue too. Deleting all Ambassador pods refreshes Ambassador's routing table when they are recreated, but it only works briefly: updates stop again after a few minutes.
kubectl delete pods -l service=ambassador
Just a quick update. It's not Moby, but working with the Azure engineering team we believe we are zeroing in on the root cause. We hope to provide a more detailed update soon.
The underlying reason for this issue is that Ambassador talks to the kube-apiserver via a series of proxies. It seems that at some point, one of these proxies drops the connection with the python-client that Ambassador uses.
The fix we are evaluating is taking advantage of the mutating webhook admissions controller feature AKS recently implemented to bypass this series of proxies with the go-client.
We will provide more details as progress is made.
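To illustrate the failure mode described above (a watch connection that goes silent without raising an error), here is a minimal, stdlib-only sketch of a watch loop that abandons a stream after an idle timeout and reconnects. This is not the fix being evaluated in this thread and it is not Ambassador's kubewatch code. `watch_with_reconnect`, `open_stream`, and `handle` are hypothetical names, and `open_stream` stands in for whatever call opens a watch against the kube-apiserver.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel: the underlying stream ended


def _pump(stream: Iterable, q: queue.Queue) -> None:
    """Forward events from a blocking stream into a queue, then signal the end."""
    try:
        for event in stream:
            q.put(event)
    finally:
        q.put(_DONE)


def watch_with_reconnect(
    open_stream: Callable[[], Iterable],
    handle: Callable[[object], None],
    idle_timeout: float,
    max_connections: int,
) -> int:
    """Consume events, reopening the stream when it ends or stays silent
    for longer than idle_timeout seconds. Returns the number of connections opened."""
    connections = 0
    while connections < max_connections:
        connections += 1
        q: queue.Queue = queue.Queue()
        # The reader runs in a daemon thread so a stalled stream cannot block the loop.
        threading.Thread(target=_pump, args=(open_stream(), q), daemon=True).start()
        while True:
            try:
                event = q.get(timeout=idle_timeout)
            except queue.Empty:
                break  # no events within the timeout: assume a dropped connection
            if event is _DONE:
                break  # stream closed: reconnect
            handle(event)
    return connections
```

A quiet cluster would also trigger the timeout, so a loop like this reconnects more often than strictly necessary. A real consumer would also need to relist or resume from the last seen resource version after reconnecting, so that events emitted while the stream was stalled are not lost.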
@nbkrause and @richarddli Thanks for the update and the work on this issue. I'm continuing to push on my support case. A proxy timeout problem makes me nervous about its effect on other services that use hooks similar to Ambassador's.
Testing seems to have shown that #1087 fixes this issue. Targeting for rc4.