Ambassador: Configuration updates stop on Azure Kubernetes Service (AKS)

Created on 17 Dec 2018 · 12 Comments · Source: datawire/ambassador

Describe the bug
After deploying Ambassador on an AKS cluster, Ambassador stops picking up service configuration changes after 5-10 minutes.

To Reproduce

  1. Deploy new AKS cluster (Tested with RBAC enabled and disabled)
  2. Deploy Ambassador per https://www.getambassador.io/user-guide/getting-started
  3. Wait 10 minutes
  4. Deploy httpbin per the getting-started doc
  5. Check if routes have updated via Ambassador diagnostics

Expected behavior
Service configurations continue to update.

Versions (please complete the following information):

  • Ambassador: 0.40.2
  • Kubernetes environment: AKS
  • Kubernetes version: 1.11.5

All 12 comments

+1 I was just about to write a bug report myself. I have the exact same issue in the exact same configuration (v0.40.2 and AKS with Kubernetes 1.11.5)

I even built the Ambassador Docker image myself and added a few extra log messages in kubewatch.py. It appears that the events from the Kubernetes API server don't reach kubewatch. It can't be a permission issue because it works for a few minutes after redeploying Ambassador.
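The extra logging described above could be complemented by a small "time since last event" tracker. The sketch below is only an illustration, not code from kubewatch.py: the class name `EventHeartbeat` and the threshold are hypothetical, and a quiet cluster can legitimately produce no events, so a stale reading is a hint rather than proof of a dropped watch.

```python
import time
from typing import Callable


class EventHeartbeat:
    """Tracks how long it has been since the last watch event arrived.

    Hypothetical helper for illustration; it is not part of Ambassador.
    """

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._last_event = clock()

    def record(self) -> None:
        # Call this from the event handler every time an event is received.
        self._last_event = self._clock()

    def silence(self) -> float:
        # Seconds elapsed since the last recorded event.
        return self._clock() - self._last_event

    def is_stale(self) -> bool:
        # True when no event has been seen for longer than the threshold.
        return self.silence() > self._max_silence
```

A periodic check of `is_stale()` could then write a log line, which would make the "events stop arriving with no errors" symptom visible in the pod logs.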

We've seen this on v1.11.13, v1.11.14, and 1.9.11, in both RBAC and non-RBAC mode. This appears to be an issue with clusters deployed more recently, i.e., clusters deployed in September do not have this issue.

We're pinging AKS engineering on this. If others on this thread can open up AKS support tickets on this issue that would be helpful. This issue is easily reproducible on AKS, and does not seem to exist on other hosted Kubernetes providers.

This Slack bot, which also watches the kube-apiserver, does not appear to have any issue receiving events: https://github.com/bitnami-labs/kubewatch

As someone above reported, Ambassador's kubewatch stops receiving events on Azure after a couple of minutes, with no errors.

@richarddli Did you check if all of those clusters are running on the Moby engine? https://github.com/Azure/acs-engine/pull/3896

The following will output the docker engine version:
kubectl describe nodes | grep 'Container Runtime Version'
3.0.1 indicates the Moby engine.
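For clusters with many nodes, the same check can be scripted. This is a rough sketch that assumes the usual `Container Runtime Version:  docker://3.0.1` line format in the `kubectl describe nodes` output; the function names are made up for illustration, and the "3.0.1 means Moby" rule is taken from the comment above rather than verified independently.

```python
import re
import subprocess

# Matches lines such as "Container Runtime Version:  docker://3.0.1"
RUNTIME_LINE = re.compile(r"Container Runtime Version:\s*(\S+)://(\S+)")


def runtime_versions(describe_output: str) -> list[tuple[str, str]]:
    """Return (runtime, version) pairs found in `kubectl describe nodes` output."""
    return [match.groups() for match in RUNTIME_LINE.finditer(describe_output)]


def looks_like_moby(version: str) -> bool:
    # Assumption from the thread: 3.0.1 indicates the Moby engine.
    return version == "3.0.1"


if __name__ == "__main__":
    # Requires kubectl on PATH and a configured cluster context.
    output = subprocess.run(
        ["kubectl", "describe", "nodes"],
        capture_output=True, text=True, check=True,
    ).stdout
    for runtime, version in runtime_versions(output):
        label = "Moby" if looks_like_moby(version) else "not Moby"
        print(f"{runtime} {version} ({label})")
```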

Per here, Moby went GA on all new node deployments on December 5th.

I opened a support case with Microsoft on the issue.

@HoveringHalibut Interesting find.

I just checked the other environments I've run Ambassador on to see the Docker version they're running.
AKS: 3.0.1 (Fork of 18.06)
GKE: 17.3.2
EKS: 18.6.1
Docker: 18.9.0
Minikube: 17.12.1-ce

Steps to reproduce:

  1. Deploy a cluster
    a. RBAC or non-RBAC
    b. I have reproduced it on v1.9.11, 1.11.13 and 1.11.14
  2. Extract the YAML files from the attached AKS_deployment.zip
  3. Apply the yaml
    a. kubectl apply -f ambassador-deploy.yaml
    b. kubectl apply -f ambassador-service.yaml
    c. kubectl apply -f qotm/qotm-deploy.yaml
    d. kubectl apply -f qotm/qotm1.yaml
    This will deploy ambassador and create a route to a service running in the cluster
  4. Get the IP of the Ambassador load balancer service (ambassador-external-ip)
  5. Test the mapping with curl -v http://<ambassador-external-ip>/qotm/
  6. Wait 5-10 minutes
  7. Apply a mapping for the URL http://<ambassador-external-ip>/qotm2/
    a. kubectl apply -f qotm/qotm2.yaml
  8. Test the mapping with curl and observe that Ambassador does not pick up the new mapping.
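The manual curl checks in the steps above could be automated with a small polling helper. This is a minimal sketch using only the Python standard library; `wait_for_route`, its defaults, and the idea that an HTTP 200 means "mapping is live" are assumptions made for illustration, not part of the original reproduction steps.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_for_route(url: str, attempts: int = 10, delay: float = 3.0,
                   fetch: Callable[[str], int] = http_status,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url until it answers 200, giving up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_for_route` on the /qotm2/ URL after step 7 would return False on an affected cluster, where the new mapping never becomes reachable.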

We are experiencing the same thing. We had previously noticed #928 and built a custom Ambassador, applying the updated kube-client suggested in e5dcd66 to the 0.40.2 code, but the issue persisted.

I've run into this issue too. However, deleting all Ambassador pods refreshes Ambassador's routing table when they are recreated. It worked, then stopped updating again after a few minutes.

kubectl delete pods -l service=ambassador

Just a quick update. It's not Moby, but working with the Azure engineering team we believe we are zeroing in on the root cause. We hope to provide a more detailed update soon.

The underlying reason for this issue is that Ambassador talks to the kube-apiserver through a series of proxies. It seems that at some point one of these proxies drops the connection with the python-client that Ambassador uses.

The fix we are evaluating is taking advantage of the mutating webhook admissions controller feature AKS recently implemented to bypass this series of proxies with the go-client.

We will provide more details as progress is made.
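For readers hitting similar symptoms with other long-lived watches, one generic mitigation (not the fix being evaluated here) is to replace a single indefinite watch with a loop of short, bounded watches that resume from the last seen resource version. The sketch below is hypothetical: `open_watch` stands in for whatever client call opens a watch, and it is assumed to enforce both a server-side timeout and a client-side read timeout, so that a silently dropped connection eventually returns or raises instead of hanging.

```python
from typing import Callable, Iterable, Optional

# An event is modeled as a hypothetical (resource_version, payload) pair.
Event = tuple[str, str]


def watch_forever(
    open_watch: Callable[[Optional[str], int], Iterable[Event]],
    handle: Callable[[Event], None],
    timeout_seconds: int = 60,
    max_rounds: Optional[int] = None,
) -> Optional[str]:
    """Run bounded watches back to back instead of one long-lived watch.

    open_watch(resource_version, timeout_seconds) is assumed to yield events
    until the watch ends at the timeout. Returns the last resource version seen.
    """
    last_seen: Optional[str] = None
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        try:
            for event in open_watch(last_seen, timeout_seconds):
                last_seen = event[0]  # remember where to resume from
                handle(event)
        except OSError:
            # Connection-level failure: open a fresh watch on the next round.
            continue
    return last_seen
```

The `max_rounds` parameter exists only so the loop can be exercised in a test; a real watcher would leave it as None and run until the process exits.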

@nbkrause and @richarddli Thanks for the update and the work on this issue. I'm continuing to push on my support case for this issue. A problem with a proxy timeout makes me nervous about its effect on other services that use hooks similar to Ambassador's.

Testing seems to have shown that #1087 fixes this issue. Targeting for rc4.
