Describe the bug
On a new Azure Kubernetes Service (AKS) cluster with 1.10.6, as soon as I install Istio 1.0.0 (I've tried the official release and the daily istio-release-1.0-20180803-09-15), requests in the K8S dashboard take 5-10 seconds or time out completely. Additionally, commands to Tiller time out retrieving configmaps.
All kubectl commands I can think to run succeed and run quickly. Installing istio 0.8 does not have this issue.
Expected behavior
No negative impact to other services when installing istio.
Steps to reproduce the bug
1) Create new AKS cluster.
2) Install Istio. I used the following helm command (and the corresponding kubectl apply; a sketch of that apply step follows this list):
helm template install/kubernetes/helm/istio --name istio --set servicegraph.enabled=true --set grafana.enabled=true --set tracing.enabled=true --set galley.enabled=false --set telemetry-gateway.grafanaEnabled=true --set telemetry-gateway.prometheusEnabled=true --namespace istio-system
3) Wait a few minutes for the various pods to start up.
4) Run kubectl proxy (or az aks browse) and try to navigate in the dashboard, or run helm ls.
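For reference, a minimal sketch of the "corresponding kubectl apply" from step 2 (the --set flags are abbreviated to the ones shown above, and the istio-system namespace has to exist before applying the rendered manifest):
kubectl create namespace istio-system
helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
  --set servicegraph.enabled=true --set grafana.enabled=true --set tracing.enabled=true \
  --set galley.enabled=false > istio.yaml
kubectl apply -f istio.yaml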
Version
Istio: release-1.0-20180803-09-15
K8S: 1.10.6
Is Istio Auth enabled or not?
No
Environment
Azure AKS
I ran your commands on bare metal; note that I don't immediately have access to AKS. I suspect you are in an OOM situation where the kernel continually kills processes and Kubernetes continually restarts them (hence the helm version/helm ls lag and the dashboard lag). This is hard to detect, but it can be seen with kubectl describe on a restarted pod (grep for OOM).
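A couple of hedged ways to check for that from kubectl alone (the pod name below is a placeholder; use any pod showing restarts):
kubectl -n istio-system describe pod <restarted-pod-name> | grep -i -A3 oom
kubectl get events --all-namespaces | grep -i oom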
Also, a namespace was not created for istio-system above. Are you executing an upgrade, or a fresh install? I suspect an upgrade will require more memory. Please reference the documentation for installation instructions here:
https://istio.io/docs/setup/kubernetes/helm-install/#option-1-install-with-helm-via-helm-template
and for Azure platform setup here:
https://istio.io/docs/setup/kubernetes/platform-setup/azure/
Note I have not personally validated the Azure platform setup instructions.
You can see from my AIO workflow below that a very bare-bones Ubuntu 16.04.4 bare metal system requires 13 GB of RAM for Kubernetes + Istio. Reading the Azure documentation on istio.io, you might try increasing the node count beyond 3 nodes to give the cluster more memory to work with. It also took around 6 minutes to deploy Kubernetes and Istio on my bare metal system (which is a beast of a server). You mentioned you waited a few minutes; this may not be sufficient for Istio to initialize.
sdake@falkor-07:~$ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
grafana-5fb774bcc9-ds2xs 1/1 Running 0 5m
istio-citadel-5b956fdf54-tmqnh 1/1 Running 0 5m
istio-cleanup-secrets-qcks7 0/1 Completed 0 5m
istio-egressgateway-6cff45b4db-jwchv 1/1 Running 0 5m
istio-grafana-post-install-wkfrp 0/1 Completed 0 5m
istio-ingressgateway-fc648887c-fg56t 1/1 Running 0 5m
istio-pilot-6cd95f9cc4-kpxbh 1/2 Running 0 5m
istio-policy-75f75cc6fd-h5hzb 2/2 Running 0 5m
istio-sidecar-injector-6d59d46ff4-rmh2t 1/1 Running 0 5m
istio-statsd-prom-bridge-7f44bb5ddb-k96tx 1/1 Running 0 5m
istio-telemetry-544b8d7dcf-tpmp7 2/2 Running 0 5m
istio-tracing-ff94688bb-wzrs9 1/1 Running 0 5m
prometheus-84bd4b9796-cc4dl 1/1 Running 0 5m
servicegraph-6c6dbbf599-9q2wb 1/1 Running 2 5m
sdake@falkor-07:~$ vmstat -s --unit M
128829 M total memory
3709 M used memory
12993 M active memory
7355 M inactive memory
105425 M free memory
1306 M buffer memory
18387 M swap cache
0 M total swap
0 M used swap
0 M free swap
161538896 non-nice user cpu ticks
207515 nice user cpu ticks
194624795 system cpu ticks
11580707701 idle cpu ticks
44524010 IO-wait cpu ticks
0 IRQ cpu ticks
16863622 softirq cpu ticks
0 stolen cpu ticks
4849792 pages paged in
602829514 pages paged out
0 pages swapped in
0 pages swapped out
2308409615 interrupts
2356661043 CPU context switches
1529810860 boot time
146777354 forks
sdake@falkor-07:~$ helm version
Client: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}
sdake@falkor-07:~$ kubectl get services -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana ClusterIP 10.110.229.156 <none> 3000/TCP 6m
istio-citadel ClusterIP 10.110.55.202 <none> 8060/TCP,9093/TCP 6m
istio-egressgateway ClusterIP 10.110.133.106 <none> 80/TCP,443/TCP 6m
istio-ingressgateway LoadBalancer 10.110.22.27 10.23.220.90 80:31380/TCP,443:31390/TCP,31400:31400/TCP,15011:31234/TCP,8060:30143/TCP,15030:31404/TCP,15031:31465/TCP 6m
istio-pilot ClusterIP 10.110.60.79 <none> 15010/TCP,15011/TCP,8080/TCP,9093/TCP 6m
istio-policy ClusterIP 10.110.22.189 <none> 9091/TCP,15004/TCP,9093/TCP 6m
istio-sidecar-injector ClusterIP 10.110.11.218 <none> 443/TCP 6m
istio-statsd-prom-bridge ClusterIP 10.110.60.203 <none> 9102/TCP,9125/UDP 6m
istio-telemetry ClusterIP 10.110.183.250 <none> 9091/TCP,15004/TCP,9093/TCP,42422/TCP 6m
jaeger-agent ClusterIP None <none> 5775/UDP,6831/UDP,6832/UDP 6m
jaeger-collector ClusterIP 10.110.165.191 <none> 14267/TCP,14268/TCP 6m
jaeger-query ClusterIP 10.110.56.112 <none> 16686/TCP 6m
prometheus ClusterIP 10.110.75.218 <none> 9090/TCP 6m
servicegraph ClusterIP 10.110.175.150 <none> 8088/TCP 6m
tracing ClusterIP 10.110.235.175 <none> 80/TCP 6m
zipkin ClusterIP 10.110.158.41 <none> 9411/TCP 6m
sdake@falkor-07:~$
It's not an issue that went away with time; I only mentioned the few minutes as the point at which the problem becomes visible for the repro. It's definitely in its own namespace, and no pods are being restarted regularly. Galley was, but I disabled it via the Helm config.
Istio, Prometheus, Grafana, Jaeger, and Servicegraph are all functioning, as is the BookInfo demo app.
The cluster has 10.5 GB of memory and doesn't run its own masters (because AKS provides those).
⟩ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
grafana-5b575487bc-5w6c6 1/1 Running 0 1d
istio-citadel-5856986bb6-mc9l5 1/1 Running 0 1d
istio-egressgateway-68d9f9946-mw24c 1/1 Running 0 1d
istio-ingressgateway-5986d965fc-6smmr 1/1 Running 0 1d
istio-pilot-54f6fbc998-78ctf 2/2 Running 0 1d
istio-policy-55cd59d88d-bcjgl 2/2 Running 0 1d
istio-sidecar-injector-69b8d5f76b-hfrch 1/1 Running 0 1d
istio-statsd-prom-bridge-7f44bb5ddb-7bnq6 1/1 Running 0 1d
istio-telemetry-569ccddd69-wvnkj 2/2 Running 1 1d
istio-tracing-ff94688bb-plplr 1/1 Running 0 1d
prometheus-84bd4b9796-bwwpj 1/1 Running 1 1d
servicegraph-6c986fd7fc-hbb7t 1/1 Running 2 1d
⟩ kubectl get services -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana ClusterIP 10.0.3.48 <none> 3000/TCP 1d
istio-citadel ClusterIP 10.0.135.147 <none> 8060/TCP,9093/TCP 1d
istio-egressgateway ClusterIP 10.0.91.217 <none> 80/TCP,443/TCP 1d
istio-ingressgateway LoadBalancer 10.0.182.248 xx.xx.xx.xx 80:31380/TCP,443:31390/TCP,31400:31400/TCP,15011:31788/TCP,8060:30603/TCP,15030:30099/TCP,15031:32626/TCP 1d
istio-pilot ClusterIP 10.0.70.93 <none> 15010/TCP,15011/TCP,8080/TCP,9093/TCP 1d
istio-policy ClusterIP 10.0.137.122 <none> 9091/TCP,15004/TCP,9093/TCP 1d
istio-sidecar-injector ClusterIP 10.0.179.103 <none> 443/TCP 1d
istio-statsd-prom-bridge ClusterIP 10.0.94.214 <none> 9102/TCP,9125/UDP 1d
istio-telemetry ClusterIP 10.0.96.225 <none> 9091/TCP,15004/TCP,9093/TCP,42422/TCP 1d
jaeger-agent ClusterIP None <none> 5775/UDP,6831/UDP,6832/UDP 1d
jaeger-collector ClusterIP 10.0.174.70 <none> 14267/TCP,14268/TCP 1d
jaeger-query ClusterIP 10.0.147.114 <none> 16686/TCP 1d
prometheus ClusterIP 10.0.110.122 <none> 9090/TCP 1d
servicegraph ClusterIP 10.0.112.147 <none> 8088/TCP 1d
tracing ClusterIP 10.0.151.228 <none> 80/TCP 1d
zipkin ClusterIP 10.0.16.118 <none> 9411/TCP 1d
*Edit: Adding a list of kube-system pods to show that dashboard/tiller aren't restarting regularly
⟩ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
azureproxy-6496d6f4c6-hb8ht 1/1 Running 2 4d
heapster-864b6d7fb7-dk989 2/2 Running 0 4d
kube-dns-v20-55645bfd65-88prw 3/3 Running 0 4d
kube-dns-v20-55645bfd65-zgvtz 3/3 Running 0 4d
kube-proxy-5lltm 1/1 Running 0 4d
kube-proxy-lz2bg 1/1 Running 0 4d
kube-proxy-vrb96 1/1 Running 0 4d
kube-state-metrics-688bbf7446-sffqw 2/2 Running 2 1d
kube-svc-redirect-4ff6d 1/1 Running 0 4d
kube-svc-redirect-h2pch 1/1 Running 1 4d
kube-svc-redirect-lsmz4 1/1 Running 0 4d
kubernetes-dashboard-66bf8db6cf-f4xkc 1/1 Running 2 4d
tiller-deploy-84f4c8bb78-pfhwb 1/1 Running 0 1d
tunnelfront-78b8fc8485-c6jcr 1/1 Running 0 4d
Same issue here.
@blackbaud-brandonstirnaman - I don't immediately have access to an AKS system. Is there any chance you can increase the cluster memory (possibly by adding more nodes)? 10.5 GB sounds pretty tight for Istio with what is enabled, even with the Kubernetes control plane on a different node. It is possible other parts of the system (kubelet, for example) are being killed by the kernel OOM killer. One way to verify this (if a shell is available on AKS) is to check the dmesg output, which will tell you whether the OOM killer is being triggered.
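A hedged example, if a node shell is available:
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'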
Other possibilities include a defect in Istio, memory overcommit ratios being set too high on the cluster and causing swapping (which triggers blocking and sluggish performance), and possibly other things. I am not all that familiar with Azure's Kubernetes offering or their cloud in general. It would be helpful to know whether allocating more memory to the cluster alleviates the problems, so we can make appropriate recommendations in the documentation.
Cheers
-steve
My current cluster is 6 nodes with 8 GB of memory each, so roughly 48 GB total should be enough. There are no other services on the cluster at the moment.
Thanks @BernhardRode - that helps possibly eliminate OOM problems. I'm on PTO until the 10th, just thought I'd offer some quick help here, but I don't have time at this immediate moment to spin up AKS. Once PTO finishes up, will have time.
Sounds like a common problem people are suffering with.
If I can assist in any way, just contact me.
I see the same issue: the cluster ran fine up until the point where I installed Istio 1.0.0, after which I get laggy kubectl, Helm misbehaving, etc. One important point: the k8s conformance tests were all green before installing Istio; I have not run them again yet.
I had the same scenario with two other clusters a couple of days ago, where the conformance tests would not run at all afterwards.
possibly related (OOM problem): https://github.com/istio/istio/issues/7734
Any updates on this? I've confirmed I don't have any real resource contention going on in my 'repro' cluster, but the issue persists.
A bit more data:
⟩ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-nodepool1-78495379-2 390m 39% 2195Mi 65%
aks-nodepool1-78495379-0 48m 4% 339Mi 10%
aks-nodepool1-78495379-1 197m 19% 1974Mi 59%
⟩ kubectl top pod --all-namespaces
NAMESPACE NAME CPU(cores) MEMORY(bytes)
istio-system istio-statsd-prom-bridge-7f44bb5ddb-gk7fq 11m 33Mi
istio-system servicegraph-6c986fd7fc-mwxtr 0m 6Mi
istio-system istio-egressgateway-68d9f9946-jr8bb 1m 29Mi
istio-system prometheus-84bd4b9796-hrmlc 10m 434Mi
default reviews-v2-7ff5966b99-8wfht 2m 200Mi
kube-system kube-svc-redirect-lsmz4 18m 36Mi
kube-system kube-state-metrics-688bbf7446-tz4p2 2m 40Mi
kube-system heapster-864b6d7fb7-p2sd6 0m 37Mi
kube-system kubernetes-dashboard-66bf8db6cf-7fqsl 0m 26Mi
kube-system kube-svc-redirect-4ff6d 4m 45Mi
kube-system kube-dns-v20-55645bfd65-pf4cp 3m 27Mi
istio-system istio-telemetry-569ccddd69-snrbt 63m 487Mi
istio-system istio-ingressgateway-5986d965fc-8qt8j 2m 33Mi
kube-system kube-dns-v20-55645bfd65-4rdt8 3m 21Mi
istio-system istio-sidecar-injector-69b8d5f76b-kpgxr 9m 25Mi
istio-system istio-tracing-ff94688bb-9mlwj 5m 270Mi
kube-system azureproxy-6496d6f4c6-4wchw 165m 73Mi
kube-system kube-proxy-lz2bg 1m 36Mi
default ratings-v1-77f657f55d-r6f5l 2m 46Mi
istio-system istio-pilot-54f6fbc998-sxmqm 67m 99Mi
kube-system kube-proxy-5lltm 1m 31Mi
kube-system kube-proxy-vrb96 1m 63Mi
istio-system grafana-5b575487bc-925nv 4m 43Mi
kube-system kube-svc-redirect-h2pch 18m 34Mi
istio-system istio-policy-55cd59d88d-d4ljk 58m 492Mi
istio-system istio-citadel-5856986bb6-76t95 0m 79Mi
default productpage-v1-f8c8fb8-fwvlz 5m 70Mi
kube-system tiller-deploy-84f4c8bb78-j4zcv 0m 16Mi
kube-system tunnelfront-78b8fc8485-dq5rs 18m 10Mi
default reviews-v3-5df889bcff-cx2f4 2m 120Mi
Just jumping in here: we are experiencing the same issue on AKS that Brandon has documented above. I would be happy to help with a resolution, but given that this setup is my first foray into Istio, I could use a little direction on where to get started.
I just created a new AKS cluster with Kubernetes 1.11.1.
Installing Istio with Helm worked after I created a cluster role like this:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: cluster-admin
rules:
- apiGroups: ['*']
  resources: ['*']
  verbs: ['*']
- nonResourceURLs: ['*']
  verbs: ['*']
Everything was running for more than an hour. Then I deployed 10 simple pods to the cluster and everything started to go downhill again: Helm and the Kubernetes dashboard became really slow.
I'm going to keep the cluster for some days.
I'm having a similar issue on AKS as well. I set up a new cluster (v1.11.1), installed Istio, and deployed some pods and a gateway. Everything worked fine for a few hours, then it all went down and I can see the istio-policy and istio-telemetry pods constantly restarting. I disabled Galley during the Helm install since it was constantly crashing on my initial installation.
Once this happens, the Kubernetes Dashboard and Helm are both completely unresponsive. Once I delete the istio-system namespace, everything goes back to normal.
helm install install/kubernetes/helm/istio \
--name istio \
--namespace istio-system \
--set gateways.istio-ingressgateway.loadBalancerIP=$PUBLIC_IP \
--set grafana.enabled=true \
--set tracing.enabled=true \
--set certmanager.enabled=true \
--set galley.enabled=false
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 59m (x76 over 15h) kubelet, aks-nodepool1-54255075-1 Started container
Normal Killing 44m (x79 over 5h) kubelet, aks-nodepool1-54255075-1 Killing container with id docker://mixer:Container failed liveness probe.. Container will be killed and recreated.
Normal Created 24m (x85 over 15h) kubelet, aks-nodepool1-54255075-1 Created container
Warning Unhealthy 14m (x263 over 5h) kubelet, aks-nodepool1-54255075-1 Liveness probe failed: Get http://10.200.0.67:9093/version: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Normal Pulled 8m (x88 over 5h) kubelet, aks-nodepool1-54255075-1 Container image "docker.io/istio/mixer:1.0.0" already present on machine
Warning BackOff 4m (x623 over 5h) kubelet, aks-nodepool1-54255075-1 Back-off restarting failed container
I have also noticed that when this happens the istio-policy and istio-telemetry HPA targets are above 100%.
NAMESPACE NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
istio-system horizontalpodautoscaler.autoscaling/istio-egressgateway Deployment/istio-egressgateway 30%/60% 1 5 1 22h
istio-system horizontalpodautoscaler.autoscaling/istio-ingressgateway Deployment/istio-ingressgateway 30%/60% 1 5 1 22h
istio-system horizontalpodautoscaler.autoscaling/istio-pilot Deployment/istio-pilot 20%/55% 1 1 1 22h
istio-system horizontalpodautoscaler.autoscaling/istio-policy Deployment/istio-policy 313%/80% 1 5 5 22h
istio-system horizontalpodautoscaler.autoscaling/istio-telemetry Deployment/istio-telemetry 319%/80% 1 5 5 22h
@douglas-reid any ideas on the liveness probe timeout, or suggestions for further debugging? Is it just sluggish responsiveness on the network? A slew of people are suffering from this problem.
Cheers
-steve
@sdake unfortunately, I don't have any real insight on the liveness probe timeouts (or why they would impact the k8s dashboard or tiller). I have not experienced these issues on my test clusters.
One thing to try, perhaps, is giving istio-policy and istio-telemetry more CPU and seeing if that resolves OOM issues (this came up in this week's WG meeting).
@mandarjog any thoughts?
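One hedged way to raise mixer's CPU request directly on a running install (the container name "mixer" is assumed from the default chart; this triggers a rollout of both deployments):
kubectl -n istio-system set resources deployment istio-policy -c mixer --requests=cpu=500m
kubectl -n istio-system set resources deployment istio-telemetry -c mixer --requests=cpu=500m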
I got an email from the AKS team tonight. They told me that they increased the proxy_read_timeout for our clusters and everything should work now, so I started to create a fresh cluster (see description below).
At the moment I'm waiting to see whether things get worse. If everything stays the same, I'm going to deploy some services and monitor the system for some more time. Hopefully everything will just work and this problem is gone for me. I'll let you know.
Helm: v2.9.1
Kubectl: v1.11.2
az group create --location westeurope --name istio-issue
az aks create --resource-group istio-issue --name istio-aks-cluster --node-count 3 --node-vm-size Standard_D2_v3 --kubernetes-version 1.11.1
az aks get-credentials --resource-group istio-issue --name istio-aks-cluster
kubectl config use-context istio-aks-cluster
kubectl create clusterrolebinding kubernetes-dashboard -n kube-system --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller -n tiller --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller
Download the 1.0.0 release and go to the folder, then:
kubectl apply -f install/kubernetes/helm/istio/templates/crds.yaml
kubectl apply -f install/kubernetes/istio-demo.yaml
az aks browse --resource-group istio-issue --name istio-aks-cluster
I just tried to reconnect to the cluster and the issue is still there :(
istio-galley is crashing all the time.
➜ ~ kubectl get pods --all-namespaces=true -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
istio-system grafana-86645d6b4d-66kt4 1/1 Running 0 8h 10.244.1.8 aks-nodepool1-25917760-1
istio-system istio-citadel-55d9bb9b5f-w2l66 1/1 Running 0 8h 10.244.0.6 aks-nodepool1-25917760-0
istio-system istio-cleanup-secrets-7sff5 0/1 Completed 0 8h 10.244.0.4 aks-nodepool1-25917760-0
istio-system istio-egressgateway-74bbdd9669-dwt5b 1/1 Running 0 8h 10.244.1.6 aks-nodepool1-25917760-1
istio-system istio-galley-d4bc6c974-ppcr6 0/1 CrashLoopBackOff 127 8h 10.244.1.10 aks-nodepool1-25917760-1
istio-system istio-grafana-post-install-6pt4v 0/1 Completed 0 8h 10.244.2.5 aks-nodepool1-25917760-2
istio-system istio-ingressgateway-756584cc64-kvkq5 1/1 Running 0 8h 10.244.1.7 aks-nodepool1-25917760-1
istio-system istio-pilot-7dd78846f5-hg67f 2/2 Running 0 8h 10.244.2.7 aks-nodepool1-25917760-2
istio-system istio-policy-b9d65465-5jzfw 2/2 Running 0 2h 10.244.2.8 aks-nodepool1-25917760-2
istio-system istio-policy-b9d65465-7k767 2/2 Running 0 1h 10.244.0.12 aks-nodepool1-25917760-0
istio-system istio-policy-b9d65465-pz7bb 2/2 Running 0 8h 10.244.0.5 aks-nodepool1-25917760-0
istio-system istio-policy-b9d65465-qn2kl 2/2 Running 0 5h 10.244.1.11 aks-nodepool1-25917760-1
istio-system istio-policy-b9d65465-tzll5 2/2 Running 0 2h 10.244.1.13 aks-nodepool1-25917760-1
istio-system istio-sidecar-injector-854f6498d9-jdxf5 1/1 Running 0 8h 10.244.0.9 aks-nodepool1-25917760-0
istio-system istio-statsd-prom-bridge-549d687fd9-p5xzf 1/1 Running 0 8h 10.244.1.5 aks-nodepool1-25917760-1
istio-system istio-telemetry-64fff55fdd-94qjg 2/2 Running 0 3h 10.244.1.12 aks-nodepool1-25917760-1
istio-system istio-telemetry-64fff55fdd-fjg8q 2/2 Running 0 8h 10.244.2.6 aks-nodepool1-25917760-2
istio-system istio-telemetry-64fff55fdd-hscs2 2/2 Running 0 2h 10.244.0.10 aks-nodepool1-25917760-0
istio-system istio-telemetry-64fff55fdd-krc4w 2/2 Running 0 2h 10.244.0.11 aks-nodepool1-25917760-0
istio-system istio-telemetry-64fff55fdd-pnl7n 2/2 Running 0 2h 10.244.1.14 aks-nodepool1-25917760-1
istio-system istio-tracing-7596597bd7-4fj7z 1/1 Running 0 8h 10.244.0.8 aks-nodepool1-25917760-0
istio-system prometheus-6ffc56584f-r9ls6 1/1 Running 0 8h 10.244.1.9 aks-nodepool1-25917760-1
istio-system servicegraph-7bdb8bfc9d-gmhk6 1/1 Running 0 8h 10.244.0.7 aks-nodepool1-25917760-0
kube-system azureproxy-58b96f4d87-78n7q 1/1 Running 2 9h 10.244.2.3 aks-nodepool1-25917760-2
kube-system heapster-6fdcf4f4f4-fp9wb 2/2 Running 0 9h 10.244.0.2 aks-nodepool1-25917760-0
kube-system kube-dns-v20-56b5b568d-hrmvz 3/3 Running 0 9h 10.244.0.3 aks-nodepool1-25917760-0
kube-system kube-dns-v20-56b5b568d-prlng 3/3 Running 0 9h 10.244.1.2 aks-nodepool1-25917760-1
kube-system kube-proxy-2t4f2 1/1 Running 0 9h 10.240.0.4 aks-nodepool1-25917760-1
kube-system kube-proxy-4hljf 1/1 Running 0 9h 10.240.0.6 aks-nodepool1-25917760-0
kube-system kube-proxy-ngks4 1/1 Running 0 9h 10.240.0.5 aks-nodepool1-25917760-2
kube-system kube-svc-redirect-vkw9r 1/1 Running 0 9h 10.240.0.6 aks-nodepool1-25917760-0
kube-system kube-svc-redirect-zjgx9 1/1 Running 0 9h 10.240.0.4 aks-nodepool1-25917760-1
kube-system kube-svc-redirect-zz5lh 1/1 Running 0 9h 10.240.0.5 aks-nodepool1-25917760-2
kube-system kubernetes-dashboard-7979b9b5f4-qg4kx 1/1 Running 2 9h 10.244.2.4 aks-nodepool1-25917760-2
kube-system metrics-server-789c47657d-6q88f 1/1 Running 2 9h 10.244.2.2 aks-nodepool1-25917760-2
kube-system tiller-deploy-759cb9df9-8gx7g 1/1 Running 0 8h 10.244.1.4 aks-nodepool1-25917760-1
kube-system tunnelfront-6dc6bd7cb8-ntvjn 1/1 Running 0 9h 10.244.1.3 aks-nodepool1-25917760-1
Pods
➜ ~ k describe pods istio-galley-d4bc6c974-ppcr6 -n istio-system
Name: istio-galley-d4bc6c974-ppcr6
Namespace: istio-system
Priority: 0
PriorityClassName: <none>
Node: aks-nodepool1-25917760-1/
Start Time: Fri, 17 Aug 2018 09:24:28 +0200
Labels: istio=galley
pod-template-hash=806727530
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
sidecar.istio.io/inject=false
Status: Running
IP: 10.244.1.10
Controlled By: ReplicaSet/istio-galley-d4bc6c974
Containers:
validator:
Container ID: docker://e7cb57eb08156d2ed8ca24648dde9779d350534cd446e727ad1fb9555205d24a
Image: gcr.io/istio-release/galley:1.0.0
Image ID: docker-pullable://gcr.io/istio-release/galley@sha256:01394fea1e55de6d4c7fbfc28c2dd7462bd26e093008367972b04e29d5b475cf
Ports: 443/TCP, 9093/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/galley
validator
--deployment-namespace=istio-system
--caCertFile=/etc/istio/certs/root-cert.pem
--tlsCertFile=/etc/istio/certs/cert-chain.pem
--tlsKeyFile=/etc/istio/certs/key.pem
--healthCheckInterval=2s
--healthCheckFile=/health
--webhook-config-file
/etc/istio/config/validatingwebhookconfiguration.yaml
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Fri, 17 Aug 2018 17:47:32 +0200
Finished: Fri, 17 Aug 2018 17:47:47 +0200
Ready: False
Restart Count: 131
Requests:
cpu: 10m
Liveness: exec [/usr/local/bin/galley probe --probe-path=/health --interval=4s] delay=4s timeout=1s period=4s #success=1 #failure=3
Readiness: exec [/usr/local/bin/galley probe --probe-path=/health --interval=4s] delay=4s timeout=1s period=4s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/istio/certs from certs (ro)
/etc/istio/config from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from istio-galley-service-account-token-5slbs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.istio-galley-service-account
Optional: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-galley-configuration
Optional: false
istio-galley-service-account-token-5slbs:
Type: Secret (a volume populated by a Secret)
SecretName: istio-galley-service-account-token-5slbs
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 1h (x167 over 8h) kubelet, aks-nodepool1-25917760-1 Readiness probe failed: fail on inspecting path /health: stat /health: no such file or directory
Normal Started 28m (x122 over 8h) kubelet, aks-nodepool1-25917760-1 Started container
Warning BackOff 3m (x1249 over 8h) kubelet, aks-nodepool1-25917760-1 Back-off restarting failed container
Created a brand new AKS cluster with much larger nodes (8 cores, 28 GB of memory, 3 nodes). There are definitely zero resource issues occurring in this setup, but the moment Istio is installed, Helm and the Dashboard are unusable and some kubectl commands begin to fail randomly.
From the Helm logs, it seems to be having issues accessing the API.
[storage/driver] 2018/08/19 16:31:51 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
After removing the istio-system namespace and everything in it, the Dashboard becomes responsive and Helm works 100% again.
Emailed [email protected] to try to get some additional assistance and to point out this issue to them. Not sure what else I can do to help debug.
Just got this from Azure Support:
Our engineering team was able to see that the istio-pilot pod eventually fails its health check and goes into CrashLoopBackoff state and restarts. The health check fails with:
Readiness probe failed: Get http://10.200.0.39:8080/debug/endpointz: dial tcp 10.200.0.39:8080: connect: connection refused.
It appears either this health check is misconfigured, or the istio-proxy container in the istio-pilot pod is failing to open a listen socket as expected.
Our engineering team ran commands from within both containers in that pod and from the istio-ingress controller, and "connection refused" everywhere tells them that the connectivity is good but that nothing is listening at the expected address http://10.200.0.39:8080 even though that IP was successfully assigned to the pod.
In reading through Istio documentation, there is not much troubleshooting information for our engineering team to assist with. It is suggested that you compare this health check configuration to their working Istio installation, and perhaps engage with the support team at Istio.
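For that comparison, a hedged way to dump the configured probes on the pilot deployment (the output can then be diffed against a known-good installation):
kubectl -n istio-system get deployment istio-pilot \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.readinessProbe}{"\n"}{end}'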
@costinm @andraxylia any thoughts here re: istio-pilot health check?
@blackbaud-brandonstirnaman Were you able to get any response back from [email protected] regarding this issue?
We have a fix for the Pilot readiness probe in 1.0.1, as well as several memory optimizations.
However, I'm a bit confused: a Pilot (or any other app) crash loop or failure should not impact the dashboard or the apiserver. It's a separate, isolated container, and the probes don't happen that often.
We will need to find a way to run tests on multiple platforms; right now we rely on volunteers testing each individual platform (which really translates into each vendor contributing to Istio testing the platforms they support).
@costinm How can I test the changes from 1.0.1 in my AKS cluster?
I don't see it tagged in the daily builds. Is 1.0.1 master?
https://gcsweb.istio.io/gcs/istio-prerelease/daily-build/
@costinm is this the fix you are talking about?
https://github.com/istio/istio/issues/7586
I just installed a freshly created AKS cluster following:
https://gist.github.com/BernhardRode/57099e039c75072ba04d91ed3d22935a
Instead of downloading the 1.0 release, I used:
https://gcsweb.istio.io/gcs/istio-prerelease/daily-build/release-1.0-20180822-09-15/
After a few minutes, everything seems fine. I'm going to deploy some services and give it a shot.
Thanks to @ayj
https://github.com/istio/istio/issues/7586#issuecomment-415192552
@CapTaek radio silence from AKS-Help so far.
Installed the latest daily on a new cluster (with Galley enabled this time). Nothing is crashing or restarting, but the same K8S Dashboard performance problems appear with Istio installed, and there are no issues with it removed.
I also installed the latest daily istio-release-1.0-20180822-09-15 build on my AKS cluster. Everything was running smoothly for a bit, so I deployed a simple application with a gateway configuration, and then I noticed the istio-telemetry and istio-policy pods using a lot of CPU. When they autoscaled to 2 replicas, their replicas went into a CrashLoopBackOff state with the error:
Liveness probe failed: Get http://10.200.0.90:9093/version: dial tcp 10.200.0.90:9093: connect: connection refused
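A hedged way to check whether mixer itself answers on that port, independent of the kubelet probe (the pod name is a placeholder; substitute one of the failing telemetry pods):
kubectl -n istio-system port-forward <istio-telemetry-pod-name> 9093:9093 &
curl -s http://localhost:9093/version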
Looking at my logs there are a lot of these errors:
Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)
and these
gc 233 @3240.155s 0%: 0.044+5.9+7.6 ms clock, 0.089+0.24/2.8/8.2+15 ms cpu, 15->15->7 MB, 16 MB goal, 2 P\n
My Kubernetes Dashboard is still unresponsive, but Helm is working.
Microsoft has been responsive to me through the Azure support channel, but they are out of ways to troubleshoot the issue.
Ran the same daily (release-1.0-20180822-09-15) overnight on AKS (Istio installed via Helm with no options) and I also put in a couple of test services. There is no load on the cluster; no one is using it. As @rsnj reported, telemetry and policy are having a bad time:
❯ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
istio-system istio-citadel-5d85b758f4-vkr5z 1/1 Running 0 16h
istio-system istio-egressgateway-5764c598cf-qckqm 1/1 Running 0 16h
istio-system istio-galley-5f595485b9-g9fb4 1/1 Running 0 16h
istio-system istio-ingressgateway-6647dd4b64-4pmc2 1/1 Running 0 16h
istio-system istio-pilot-57ffcdc795-74pq7 2/2 Running 0 16h
istio-system istio-policy-87bfd665b-mvjmz 1/2 CrashLoopBackOff 219 9h
istio-system istio-policy-87bfd665b-psqjh 1/2 CrashLoopBackOff 317 14h
istio-system istio-policy-87bfd665b-qjjvg 1/2 CrashLoopBackOff 283 12h
istio-system istio-policy-87bfd665b-shmx4 2/2 Running 0 16h
istio-system istio-policy-87bfd665b-zhxxm 1/2 CrashLoopBackOff 341 15h
istio-system istio-sidecar-injector-6677558cfc-jlbp6 1/1 Running 0 16h
istio-system istio-statsd-prom-bridge-7f44bb5ddb-fgxkv 1/1 Running 0 16h
istio-system istio-telemetry-696487b84f-2n92r 2/2 Running 0 16h
istio-system istio-telemetry-696487b84f-4b9px 1/2 CrashLoopBackOff 329 15h
istio-system istio-telemetry-696487b84f-7mdg8 1/2 CrashLoopBackOff 339 15h
istio-system istio-telemetry-696487b84f-pxgkq 1/2 CrashLoopBackOff 303 13h
istio-system istio-telemetry-696487b84f-wpr6h 1/2 CrashLoopBackOff 281 12h
istio-system prometheus-84bd4b9796-lfhd4 1/1 Running 0 16h
kube-system azureproxy-6496d6f4c6-4hx8w 1/1 Running 2 17h
kube-system heapster-864b6d7fb7-gjgj6 2/2 Running 0 17h
kube-system kube-dns-v20-5695d5c69d-879xd 3/3 Running 0 17h
kube-system kube-dns-v20-5695d5c69d-l897h 3/3 Running 0 17h
kube-system kube-proxy-7pfbs 1/1 Running 0 17h
kube-system kube-proxy-whzvk 1/1 Running 0 17h
kube-system kube-proxy-zpkjt 1/1 Running 0 17h
kube-system kube-svc-redirect-6bksg 1/1 Running 0 17h
kube-system kube-svc-redirect-j8jj5 1/1 Running 1 17h
kube-system kube-svc-redirect-k7w48 1/1 Running 0 17h
kube-system kubernetes-dashboard-66bf8db6cf-pcqhv 1/1 Running 3 17h
kube-system metrics-server-64f6d6b47-9922n 1/1 Running 0 17h
kube-system tiller-deploy-895d57dd9-2k6ns 1/1 Running 0 16h
kube-system tunnelfront-dcc6d8447-gpgsq 1/1 Running 0 17h
I was using istio-release-1.0-20180820-09-15 and Galley was crashing, so the problem seemed to move around (see #7586).
So, with no load, the policy and telemetry services are crashing at startup? @rsnj and @polothy, is it possible that this is a recurrence of https://github.com/istio/istio/issues/6152? There was a PR to fix that a while back that was shelved because Istio 1.0 went with k8s 1.9 as the base. Maybe it is still needed?
I have no idea, sorry. The cluster is still standing, would you like me to run any commands for you?
@douglas-reid I'm using a brand new AKS Cluster running Kubernetes 1.11.2 that has no load on it.
I installed Istio via helm using the default settings.
I then deployed a simple service and connected it to a gateway.
After the services deployed, the entire system went into deadlock and istio-policy and istio-telemetry started using more and more CPU until they autoscaled, and the second replica just went into CrashLoopBackOff. My service was never accessible.
Looking at the logs, I can see my services deployed and then it's just a steady stream of the same errors coming from istio-mixer and istio-pilot.
There are thousands of errors just like these:
2018-08-23T14:59:29.341Z | istio-release/mixer | 2018-08-23T14:59:29.341670Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.175Z | istio-release/mixer | 2018-08-23T14:59:29.175730Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get redisquotas.config.istio.io)\n
2018-08-23T14:59:29.163Z | istio-release/mixer | 2018-08-23T14:59:29.163277Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get tracespans.config.istio.io)\n
2018-08-23T14:59:29.162Z | istio-release/pilot | 2018-08-23T14:59:29.162425Z\terror\tistio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:217: Failed to list *v1.Endpoints: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints)\n
2018-08-23T14:59:29.158Z | istio-release/mixer | 2018-08-23T14:59:29.158004Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.157Z | istio-release/mixer | 2018-08-23T14:59:29.157718Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:145: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)\n
2018-08-23T14:59:29.157Z | istio-release/mixer | 2018-08-23T14:59:29.157722Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get stdios.config.istio.io)\n
2018-08-23T14:59:29.157Z | istio-release/mixer | 2018-08-23T14:59:29.157766Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:145: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)\n
2018-08-23T14:59:29.156Z | istio-release/mixer | 2018-08-23T14:59:29.156579Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.156Z | istio-release/mixer | 2018-08-23T14:59:29.155984Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:148: Failed to list *v1beta1.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.extensions)\n
2018-08-23T14:59:29.078Z | istio-release/mixer | 2018-08-23T14:59:29.077170Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.078Z | istio-release/pilot | 2018-08-23T14:59:29.078054Z\terror\tistio.io/istio/pilot/pkg/config/kube/crd/controller.go:208: Failed to list *crd.EnvoyFilter: the server was unable to return a response in the time allotted, but may still be processing the request (get envoyfilters.networking.istio.io)\n
2018-08-23T14:59:29.077Z | istio-release/mixer | 2018-08-23T14:59:29.076728Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.071Z | istio-release/mixer | 2018-08-23T14:59:29.071149Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get listcheckers.config.istio.io)\n
2018-08-23T14:59:28.914Z | prometheus | level=error ts=2018-08-23T14:59:28.894316964Z caller=main.go:218 component=k8s_client_runtime err=\"github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:287: Failed to list *v1.Endpoints: the server cannot complete the requested operation at this time, try again later (get endpoints)\"\n
2018-08-23T14:59:28.914Z | istio-release/mixer | 2018-08-23T14:59:28.893873Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:147: Failed to list *v1.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
@rsnj Wow: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods). That is not good. I wonder why everything with the API Server is so slow.
@douglas-reid I'm not sure, but I'm happy to help debug. I also have a ticket open with Azure Support, who are trying to troubleshoot on their end. They applied some patches, but nothing has fixed the issue so far. You can see their response here.
Installing Istio makes the entire Kubernetes Dashboard unusable and removing it immediately resolves the issue.
@rsnj any chance they could report on the health of the API server itself?
In the meantime, here's an experiment to try: delete the deployments for istio-policy and istio-telemetry. This should remove ~8 watches on the API server. If that helps, it might indicate that Istio is overwhelming the API server on startup.
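Spelled out, the experiment would be roughly the following (deployment names taken from the default chart; re-applying the rendered Helm manifest restores them afterwards):
kubectl -n istio-system delete deployment istio-policy istio-telemetry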
Just jumping in here: I was experiencing the exact same issue (policy and telemetry scaling up, then entering CrashLoopBackOff, etc.). I just removed those two deployments; the load dropped and everything seems to be responsive again (kubectl, dashboard, etc.).
@douglas-reid I deleted both deployments and the system came back. The Dashboard is now working and my log is no longer being flooded with errors. The only issue now is that my service is still not accessible: the web frontend went from a 404 error to a 503: UNAVAILABLE:no healthy upstream.
@rsnj OK. It seems that the API Server in your cluster is possibly getting overwhelmed by the watch clients in the various components. Can you try adding back the istio-policy deployment and see if things still work?
@douglas-reid If I just start istio-policy, everything looks like it's working. I also tried starting istio-telemetry as well, and then the entire system went down.
@rsnj this seems like an issue with the resources given to the API Server. I'd suggest trying the experiment in reverse (delete both, then add back istio-telemetry and then, after testing, add back istio-policy).
Mixer (which backs both istio-policy and istio-telemetry) opens a fair number of watches (~40) on CRDs and otherwise. I suspect that the API Server in these clusters is just not set up to handle this.
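A rough, hedged way to see how many Istio config CRDs exist on a cluster (each type contributes watch load from Mixer):
kubectl get crd | grep -c 'istio.io'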
If Azure Support has any information on how to increase resources for the API Server, that'd be the best way to resolve the issue. Maybe @lachie83 has some ideas (or contacts that do) here?
@douglas-reid thanks for all your help with this issue. I forwarded all your comments to Azure support.
So I redeployed the istio-telemetry deployment and that pod just went into a CrashLoopBackOff state. So I deleted the istio-policy deployment, and the istio-telemetry pod was stable, but my website went down with the error: UNAVAILABLE:no healthy upstream.
So far the only way the system is stable is if I have everything running except for istio-telemetry. Any ideas?
@rsnj It appears, at the moment, that on AKS you have to choose between policy and telemetry. If you aren't enforcing any policies in the Mixer layer (rate limits, whitelists, etc.), then I would recommend prioritizing telemetry (but that's the part of the system I spend the most time on, so I may be slightly biased). Istio RBAC currently does not require Mixer, so you'll still have some functionality policy-wise.
To be successful without istio-policy running, you'll need to turn off check calls (otherwise you'll get connectivity issues as requests are denied because the proxy cannot reach the policy service). To do that, you need to install Istio with global.disablePolicyChecks set to true. I haven't spent much time trying this out, but I know that others have done this, so if this is of interest, I'm sure we can get this working. Istio is working on documentation for piecemeal installs. This would be a good test case.
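A hedged sketch of that install variant (add whatever other --set flags you already use; the rendered output is applied the same way as before):
helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
  --set global.disablePolicyChecks=true > istio-no-policy-checks.yaml
kubectl apply -f istio-no-policy-checks.yaml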
In the slightly longer term, Mixer should reduce the number of CRDs down to 3, which should help reduce the burden on the API Server. Sometime after that, Mixer will receive config directly from Galley, reducing the burden even further.
Does that help?
Just an update: I left the cluster running for about 8 hours without istio-telemetry running. The istio-policy pod autoscaled to 5 instances that were all in a CrashLoopBackOff state, and my entire cluster went down again. The cluster has zero load on it and only has a simple web service running without any external dependencies.
@douglas-reid I will try your suggestion next to enable telemetry and disable policy.
@douglas-reid @rsnj I appreciate the heads up. I'm investigating and will report back.
Related - https://github.com/Azure/AKS/issues/620
Is it related to https://github.com/Azure/AKS/issues/618? Deleting the HPA stops the crash loop backoff in telemetry and policy.
Galley only starts in validation mode in Istio 1.0, so I do not think Galley contributes to this issue on the 1.0 release.
@lachie83 any updates?
All, a recent PR was merged to add initial support for CRD consolidation within Mixer. I cannot promise that it solves this issue, but I am looking for brave volunteers to try from tip with an edited deployment that sets UseAdapterCRDs=false in the Mixer deployments and see if that improves things.
I'm happy to help guide that process if someone has access to AKS clusters and some free time.
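For anyone who wants to try this on an already-running install, one hedged approach is a JSON patch on the two Mixer deployments (this assumes the mixer container is first in each pod spec; verify with kubectl get -o yaml before patching):
for d in istio-policy istio-telemetry; do
  kubectl -n istio-system patch deployment "$d" --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--useAdapterCRDs=false"}]'
done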
I've given this flag a try in a test environment on AKS, using release istio-release-1.1-20181021-09-15. As far as I can see, there is immediate improvement.
With 1.0.2, I had similar problems as others describe above; a cluster with mTLS enabled for the default namespace (not globally) and a few sample services deployed would become sluggish after a few hours. At this point, I could see that the telemetry and policy pods would restart frequently, and requests to the API server (from e.g. Tiller and NMI (https://github.com/Azure/aad-pod-identity)) would time out. Deleting the istio-telemetry Deployment would immediately make the cluster responsive again.
With istio-release-1.1-20181021-09-15, and the flag applied, the cluster seems to be stable.
Installation was performed as follows, using Helm 2.9.1 and Kubernetes 1.11.2:
1. Edit ${ISTIO_HOME}/install/kubernetes/helm/subcharts/mixer/templates/deployment.yaml, adding - --useAdapterCRDs=false to the args of the mixer container in the policy_container and telemetry_container sections.
2. helm dependency update ${ISTIO_HOME}/install/kubernetes/helm/istio
3. helm install ${ISTIO_HOME}/install/kubernetes/helm/istio --name istio --namespace istio-system --tls --wait --set global.configValidation=true --set sidecarInjectorWebhook.enabled=true --set gateways.istio-ingressgateway.loadBalancerIP=${PUBLIC_IP}
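As a hedged sanity check afterwards, you can confirm the flag landed in the rendered deployments:
kubectl -n istio-system get deployment istio-policy istio-telemetry -o yaml | grep useAdapterCRDs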
Are there any plans to explicitly support the useAdapterCRDs flag in Istio 1.1, i.e. either exposing it through Helm variables or setting the default to false? Not having to manually edit the deployment file would definitely be preferable.
@fhoy @douglas-reid I just updated my existing PR for the helm chart to include the switch for useAdapterCRDs: https://github.com/istio/istio/pull/9435/files
@dtzar switching the default is definitely on the agenda for 1.1. I'll be raising that issue at the P&T WG meeting today.
FWIW, in the WG meeting today, we opted for the following approach: UseAdapterCRDs defaulting to true for 1.1 (switching to false in 1.2 and removed altogether in 1.3). Hope that helps.
Finally had a chance to do some testing around useAdapterCRDs in my AKS test cluster today. Definitely had a massive positive impact.
Edit: a few days later, still seeing a negative impact on API calls from Dashboard. 😢
To report back: I deployed a new cluster with useAdapterCRDs=false and it's been running for ~10 days or so now without a recurrence of the watch issue slowing Helm/the Dashboard.
Good work! It'd be great if we could get the helm option @douglas-reid mentioned so my scripts can stop hacking the subchart in 1.1 releases. Can't find the PR mentioned though.
Hopefully this PR gets merged soon; then we can close this issue: https://github.com/istio/istio/pull/10247
The existing PR is still held up in a debate about what to do with the conditional CRDs in the helm chart, so I split enabling useAdapterCRDs in the helm chart out into a separate PR.
@douglas-reid I believe I addressed all your feedback from the first PR, so theoretically this one should be good to merge.
https://github.com/istio/istio/pull/10404 was merged so this can probably be closed
Took me some time to get back and close this. Apologies and thanks for the fix!