Istio: K8S Dashboard loads slowly, tiller unresponsive once istio 1.0.0 installed

Created on 7 Aug 2018  ·  57 Comments  ·  Source: istio/istio

Describe the bug
On a new Azure Container Service (AKS) cluster running Kubernetes 1.10.6, as soon as I install istio 1.0.0 (I've tried the official release and the daily istio-release-1.0-20180803-09-15), requests in the K8S dashboard take 5-10 seconds or time out completely. Additionally, commands to Tiller time out retrieving ConfigMaps.

All kubectl commands I can think to run succeed and run quickly. Installing istio 0.8 does not have this issue.

Expected behavior
No negative impact to other services when installing istio.

Steps to reproduce the bug
1) Create new AKS cluster.
2) Install Istio. I used the following helm template command (and the corresponding kubectl apply; the full render-and-apply flow is sketched after this list):
helm template install/kubernetes/helm/istio --name istio --set servicegraph.enabled=true --set grafana.enabled=true --set tracing.enabled=true --set galley.enabled=false --set telemetry-gateway.grafanaEnabled=true --set telemetry-gateway.prometheusEnabled=true --namespace istio-system
3) Wait a few minutes for the various pods to start up.
4) Run kubectl proxy (or az aks browse) and try to navigate in the dashboard. Or run helm ls.
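
For completeness, here is a sketch of the render-and-apply flow from step 2 (the explicit namespace creation and the istio.yaml file name are additions for illustration, not part of the original report):

# Render the chart to plain manifests with the flags from step 2, then apply them.
kubectl create namespace istio-system
helm template install/kubernetes/helm/istio --name istio \
    --namespace istio-system \
    --set servicegraph.enabled=true \
    --set grafana.enabled=true \
    --set tracing.enabled=true \
    --set galley.enabled=false \
    --set telemetry-gateway.grafanaEnabled=true \
    --set telemetry-gateway.prometheusEnabled=true > istio.yaml
kubectl apply -f istio.yaml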

Version
Istio: release-1.0-20180803-09-15
K8S: 1.10.6

Is Istio Auth enabled or not?
No

Environment
Azure AKS

All 57 comments

I ran your commands on bare metal. Note I don't immediately have access to AKS. I suspect you are in an OOM situation where the kernel continually kills processes and Kubernetes continually restarts them (hence the helm version/helm ls lag and the dashboard lag). This is hard to detect, but it can be seen with kubectl describe on a restarted pod (grep for OOM).
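
A minimal sketch of that check (the pod name is a placeholder for whichever pod shows restarts):

# Look for an OOMKilled termination reason on a pod that has been restarting.
kubectl get pods -n istio-system
kubectl describe pod <restarted-pod> -n istio-system | grep -i -A 2 oom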

Also, the istio-system namespace was not created above. Are you executing an upgrade or a fresh install? I suspect an upgrade will require more memory. Please reference the documentation for installation instructions here:

https://istio.io/docs/setup/kubernetes/helm-install/#option-1-install-with-helm-via-helm-template

and for Azure platform setup here:

https://istio.io/docs/setup/kubernetes/platform-setup/azure/

Note I have not personally validated the Azure platform setup instructions.

You can see from my AIO workflow below that a very bare-bones Ubuntu 16.04.4 bare metal system requires 13 GB of RAM for Kubernetes + Istio. Reading the Azure documentation on istio.io, you might try increasing the node count beyond 3 nodes to give the cluster more memory to work with. It also took around 6 minutes to deploy Kubernetes and Istio on my bare metal system (which is a beast of a server). You mentioned you waited a few minutes - this may not be sufficient for Istio to initialize.
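
If adding nodes is the route you take, something like this should do it on AKS (a sketch; the resource group and cluster names are placeholders and the node count is only an example):

az aks scale --resource-group <resource-group> --name <cluster-name> --node-count 5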

sdake@falkor-07:~$ kubectl get pods -n istio-system
NAME                                        READY     STATUS      RESTARTS   AGE
grafana-5fb774bcc9-ds2xs                    1/1       Running     0          5m
istio-citadel-5b956fdf54-tmqnh              1/1       Running     0          5m
istio-cleanup-secrets-qcks7                 0/1       Completed   0          5m
istio-egressgateway-6cff45b4db-jwchv        1/1       Running     0          5m
istio-grafana-post-install-wkfrp            0/1       Completed   0          5m
istio-ingressgateway-fc648887c-fg56t        1/1       Running     0          5m
istio-pilot-6cd95f9cc4-kpxbh                1/2       Running     0          5m
istio-policy-75f75cc6fd-h5hzb               2/2       Running     0          5m
istio-sidecar-injector-6d59d46ff4-rmh2t     1/1       Running     0          5m
istio-statsd-prom-bridge-7f44bb5ddb-k96tx   1/1       Running     0          5m
istio-telemetry-544b8d7dcf-tpmp7            2/2       Running     0          5m
istio-tracing-ff94688bb-wzrs9               1/1       Running     0          5m
prometheus-84bd4b9796-cc4dl                 1/1       Running     0          5m
servicegraph-6c6dbbf599-9q2wb               1/1       Running     2          5m
sdake@falkor-07:~$ vmstat -s --unit M
       128829 M total memory
         3709 M used memory
        12993 M active memory
         7355 M inactive memory
       105425 M free memory
         1306 M buffer memory
        18387 M swap cache
            0 M total swap
            0 M used swap
            0 M free swap
    161538896 non-nice user cpu ticks
       207515 nice user cpu ticks
    194624795 system cpu ticks
  11580707701 idle cpu ticks
     44524010 IO-wait cpu ticks
            0 IRQ cpu ticks
     16863622 softirq cpu ticks
            0 stolen cpu ticks
      4849792 pages paged in
    602829514 pages paged out
            0 pages swapped in
            0 pages swapped out
   2308409615 interrupts
   2356661043 CPU context switches
   1529810860 boot time
    146777354 forks
sdake@falkor-07:~$ helm version
Client: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}
sdake@falkor-07:~$ kubectl get services -n istio-system
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                                                                                                     AGE
grafana                    ClusterIP      10.110.229.156   <none>         3000/TCP                                                                                                    6m
istio-citadel              ClusterIP      10.110.55.202    <none>         8060/TCP,9093/TCP                                                                                           6m
istio-egressgateway        ClusterIP      10.110.133.106   <none>         80/TCP,443/TCP                                                                                              6m
istio-ingressgateway       LoadBalancer   10.110.22.27     10.23.220.90   80:31380/TCP,443:31390/TCP,31400:31400/TCP,15011:31234/TCP,8060:30143/TCP,15030:31404/TCP,15031:31465/TCP   6m
istio-pilot                ClusterIP      10.110.60.79     <none>         15010/TCP,15011/TCP,8080/TCP,9093/TCP                                                                       6m
istio-policy               ClusterIP      10.110.22.189    <none>         9091/TCP,15004/TCP,9093/TCP                                                                                 6m
istio-sidecar-injector     ClusterIP      10.110.11.218    <none>         443/TCP                                                                                                     6m
istio-statsd-prom-bridge   ClusterIP      10.110.60.203    <none>         9102/TCP,9125/UDP                                                                                           6m
istio-telemetry            ClusterIP      10.110.183.250   <none>         9091/TCP,15004/TCP,9093/TCP,42422/TCP                                                                       6m
jaeger-agent               ClusterIP      None             <none>         5775/UDP,6831/UDP,6832/UDP                                                                                  6m
jaeger-collector           ClusterIP      10.110.165.191   <none>         14267/TCP,14268/TCP                                                                                         6m
jaeger-query               ClusterIP      10.110.56.112    <none>         16686/TCP                                                                                                   6m
prometheus                 ClusterIP      10.110.75.218    <none>         9090/TCP                                                                                                    6m
servicegraph               ClusterIP      10.110.175.150   <none>         8088/TCP                                                                                                    6m
tracing                    ClusterIP      10.110.235.175   <none>         80/TCP                                                                                                      6m
zipkin                     ClusterIP      10.110.158.41    <none>         9411/TCP                                                                                                    6m
sdake@falkor-07:~$ 

It's not an issue that went away with time. I just mentioned the few minutes so you can see it happening for the repro. It's definitely in its own namespace, and no pods are being restarted regularly. Galley was, but I disabled it via helm config.

Istio, Prometheus, Grafana, Jaeger, and Servicegraph are all functioning, as is the BookInfo demo app.

Cluster has 10.5 GB of memory and doesn't run its own masters (because AKS provides those).

⟩ kubectl get pods -n istio-system
NAME                                        READY     STATUS    RESTARTS   AGE
grafana-5b575487bc-5w6c6                    1/1       Running   0          1d
istio-citadel-5856986bb6-mc9l5              1/1       Running   0          1d
istio-egressgateway-68d9f9946-mw24c         1/1       Running   0          1d
istio-ingressgateway-5986d965fc-6smmr       1/1       Running   0          1d
istio-pilot-54f6fbc998-78ctf                2/2       Running   0          1d
istio-policy-55cd59d88d-bcjgl               2/2       Running   0          1d
istio-sidecar-injector-69b8d5f76b-hfrch     1/1       Running   0          1d
istio-statsd-prom-bridge-7f44bb5ddb-7bnq6   1/1       Running   0          1d
istio-telemetry-569ccddd69-wvnkj            2/2       Running   1          1d
istio-tracing-ff94688bb-plplr               1/1       Running   0          1d
prometheus-84bd4b9796-bwwpj                 1/1       Running   1          1d
servicegraph-6c986fd7fc-hbb7t               1/1       Running   2          1d

⟩ kubectl get services -n istio-system
NAME                       TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)                                                                                                     AGE
grafana                    ClusterIP      10.0.3.48      <none>            3000/TCP                                                                                                    1d
istio-citadel              ClusterIP      10.0.135.147   <none>            8060/TCP,9093/TCP                                                                                           1d
istio-egressgateway        ClusterIP      10.0.91.217    <none>            80/TCP,443/TCP                                                                                              1d
istio-ingressgateway       LoadBalancer   10.0.182.248   xx.xx.xx.xx   80:31380/TCP,443:31390/TCP,31400:31400/TCP,15011:31788/TCP,8060:30603/TCP,15030:30099/TCP,15031:32626/TCP   1d
istio-pilot                ClusterIP      10.0.70.93     <none>            15010/TCP,15011/TCP,8080/TCP,9093/TCP                                                                       1d
istio-policy               ClusterIP      10.0.137.122   <none>            9091/TCP,15004/TCP,9093/TCP                                                                                 1d
istio-sidecar-injector     ClusterIP      10.0.179.103   <none>            443/TCP                                                                                                     1d
istio-statsd-prom-bridge   ClusterIP      10.0.94.214    <none>            9102/TCP,9125/UDP                                                                                           1d
istio-telemetry            ClusterIP      10.0.96.225    <none>            9091/TCP,15004/TCP,9093/TCP,42422/TCP                                                                       1d
jaeger-agent               ClusterIP      None           <none>            5775/UDP,6831/UDP,6832/UDP                                                                                  1d
jaeger-collector           ClusterIP      10.0.174.70    <none>            14267/TCP,14268/TCP                                                                                         1d
jaeger-query               ClusterIP      10.0.147.114   <none>            16686/TCP                                                                                                   1d
prometheus                 ClusterIP      10.0.110.122   <none>            9090/TCP                                                                                                    1d
servicegraph               ClusterIP      10.0.112.147   <none>            8088/TCP                                                                                                    1d
tracing                    ClusterIP      10.0.151.228   <none>            80/TCP                                                                                                      1d
zipkin                     ClusterIP      10.0.16.118    <none>            9411/TCP                                                                                                    1d

Edit: Adding a list of kube-system pods to show that dashboard/tiller aren't restarting regularly

⟩ kubectl get pods -n kube-system
NAME                                    READY     STATUS    RESTARTS   AGE
azureproxy-6496d6f4c6-hb8ht             1/1       Running   2          4d
heapster-864b6d7fb7-dk989               2/2       Running   0          4d
kube-dns-v20-55645bfd65-88prw           3/3       Running   0          4d
kube-dns-v20-55645bfd65-zgvtz           3/3       Running   0          4d
kube-proxy-5lltm                        1/1       Running   0          4d
kube-proxy-lz2bg                        1/1       Running   0          4d
kube-proxy-vrb96                        1/1       Running   0          4d
kube-state-metrics-688bbf7446-sffqw     2/2       Running   2          1d
kube-svc-redirect-4ff6d                 1/1       Running   0          4d
kube-svc-redirect-h2pch                 1/1       Running   1          4d
kube-svc-redirect-lsmz4                 1/1       Running   0          4d
kubernetes-dashboard-66bf8db6cf-f4xkc   1/1       Running   2          4d
tiller-deploy-84f4c8bb78-pfhwb          1/1       Running   0          1d
tunnelfront-78b8fc8485-c6jcr            1/1       Running   0          4d

Same issue here.

@blackbaud-brandonstirnaman - I don't immediately have access to an AKS system. Is there any chance you can increase the cluster memory (possibly by adding more nodes)? 10.5 GB sounds pretty tight for Istio with what is enabled, even with the Kubernetes control plane on a different node. It is possible other parts of the system (kubelet, for example) are being killed by the kernel OOM killer. One way to verify this (if a shell is available on AKS) is to check the dmesg output. This will tell you if the OOM killer is being triggered.
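
If node shell access is hard to get on AKS, OOM activity also tends to surface through the Kubernetes API, so a rough substitute for reading dmesg is (a sketch, not AKS-specific):

# System-level OOM kills usually show up as node events; container-level
# kills appear as OOMKilled in pod status.
kubectl get events --all-namespaces | grep -i oom
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'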

Other possibilities include a defect in Istio, memory overcommit ratios being set too high on the cluster and causing swapping (which triggers blocking and sluggish performance), and possibly other things. I am not all that familiar with Azure's Kubernetes offering or their cloud in general. It would be helpful to know whether allocating more memory to the cluster alleviates the problems, so we can make appropriate recommendations in the documentation.

Cheers
-steve

My current cluster is 6 nodes with 8 GB of memory each, so around 40 GB of memory should be enough. There are no other services on the cluster at the moment.

Thanks @BernhardRode - that helps possibly eliminate OOM problems. I'm on PTO until the 10th, just thought I'd offer some quick help here, but I don't have time at this immediate moment to spin up AKS. Once PTO finishes up, will have time.

Sounds like a common problem people are suffering with.

If I can assist in any way, just contact me.

I see the same issue: the cluster was running fine up until the point where I installed istio 1.0.0, after which I get laggy kubectl, helm misbehaving, etc. One important point: the k8s conformance tests were all green before installing istio; however, I have not run them again yet.

I had the same scenario with 2 other clusters a couple of days ago, where conformance tests would not run at all afterwards.

possibly related (OOM problem): https://github.com/istio/istio/issues/7734

Any updates on this? Confirmed I don't have any real resource contention going on in my 'repro' cluster but the issue persists.

A bit more data:

⟩ kubectl top node
NAME                       CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
aks-nodepool1-78495379-2   390m         39%       2195Mi          65%
aks-nodepool1-78495379-0   48m          4%        339Mi           10%
aks-nodepool1-78495379-1   197m         19%       1974Mi          59%
⟩ kubectl top pod --all-namespaces
NAMESPACE      NAME                                        CPU(cores)   MEMORY(bytes)
istio-system   istio-statsd-prom-bridge-7f44bb5ddb-gk7fq   11m          33Mi
istio-system   servicegraph-6c986fd7fc-mwxtr               0m           6Mi
istio-system   istio-egressgateway-68d9f9946-jr8bb         1m           29Mi
istio-system   prometheus-84bd4b9796-hrmlc                 10m          434Mi
default        reviews-v2-7ff5966b99-8wfht                 2m           200Mi
kube-system    kube-svc-redirect-lsmz4                     18m          36Mi
kube-system    kube-state-metrics-688bbf7446-tz4p2         2m           40Mi
kube-system    heapster-864b6d7fb7-p2sd6                   0m           37Mi
kube-system    kubernetes-dashboard-66bf8db6cf-7fqsl       0m           26Mi
kube-system    kube-svc-redirect-4ff6d                     4m           45Mi
kube-system    kube-dns-v20-55645bfd65-pf4cp               3m           27Mi
istio-system   istio-telemetry-569ccddd69-snrbt            63m          487Mi
istio-system   istio-ingressgateway-5986d965fc-8qt8j       2m           33Mi
kube-system    kube-dns-v20-55645bfd65-4rdt8               3m           21Mi
istio-system   istio-sidecar-injector-69b8d5f76b-kpgxr     9m           25Mi
istio-system   istio-tracing-ff94688bb-9mlwj               5m           270Mi
kube-system    azureproxy-6496d6f4c6-4wchw                 165m         73Mi
kube-system    kube-proxy-lz2bg                            1m           36Mi
default        ratings-v1-77f657f55d-r6f5l                 2m           46Mi
istio-system   istio-pilot-54f6fbc998-sxmqm                67m          99Mi
kube-system    kube-proxy-5lltm                            1m           31Mi
kube-system    kube-proxy-vrb96                            1m           63Mi
istio-system   grafana-5b575487bc-925nv                    4m           43Mi
kube-system    kube-svc-redirect-h2pch                     18m          34Mi
istio-system   istio-policy-55cd59d88d-d4ljk               58m          492Mi
istio-system   istio-citadel-5856986bb6-76t95              0m           79Mi
default        productpage-v1-f8c8fb8-fwvlz                5m           70Mi
kube-system    tiller-deploy-84f4c8bb78-j4zcv              0m           16Mi
kube-system    tunnelfront-78b8fc8485-dq5rs                18m          10Mi
default        reviews-v3-5df889bcff-cx2f4                 2m           120Mi

Just jumping in here: we are experiencing the same issue on AKS as Brandon has documented above. I would be happy to help with a resolution, but given that this setup is my first foray into Istio, I could use a little direction on where to get started.

I just created a new AKS cluster with Kubernetes 1.11.1.

Installing Istio with Helm worked after I created a cluster role like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: cluster-admin
rules:
- apiGroups: ['*']
  resources: ['*']
  verbs: ['*']
- nonResourceURLs: ['*']
  verbs: ['*']

Everything was running for more than one hour. Then I deployed 10 simple pods to the cluster and everything started to go downhill again: Helm and the Kubernetes dashboard became really slow.

I'm going to keep the cluster for some days.

I'm having a similar issue on AKS as well. I set up a new cluster (v 1.11.1), installed Istio, deployed some pods and a gateway. Everything worked fine for a few hours and then it all went down and I can see the istio-policy and istio-telemetry pods are constantly restarting. I disabled galley with the helm install since that was constantly crashing on my initial installation.
Once this happens the Kubernetes Dashboard and helm are both completely unresponsive. Once I delete the istio-system namespace everything goes back to normal.

helm install install/kubernetes/helm/istio \
    --name istio \
    --namespace istio-system \
    --set gateways.istio-ingressgateway.loadBalancerIP=$PUBLIC_IP \
    --set grafana.enabled=true \
    --set tracing.enabled=true \
    --set certmanager.enabled=true \
    --set galley.enabled=false
Events:
  Type     Reason     Age                 From                               Message
  ----     ------     ----                ----                               -------
  Normal   Started    59m (x76 over 15h)  kubelet, aks-nodepool1-54255075-1  Started container
  Normal   Killing    44m (x79 over 5h)   kubelet, aks-nodepool1-54255075-1  Killing container with id docker://mixer:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Created    24m (x85 over 15h)  kubelet, aks-nodepool1-54255075-1  Created container
  Warning  Unhealthy  14m (x263 over 5h)  kubelet, aks-nodepool1-54255075-1  Liveness probe failed: Get http://10.200.0.67:9093/version: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Normal   Pulled     8m (x88 over 5h)    kubelet, aks-nodepool1-54255075-1  Container image "docker.io/istio/mixer:1.0.0" already present on machine
  Warning  BackOff    4m (x623 over 5h)   kubelet, aks-nodepool1-54255075-1  Back-off restarting failed container

I have also noticed that when this happens the istio-policy and istio-telemetry targets are > 100%.

NAMESPACE      NAME                                                       REFERENCE                         TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
istio-system   horizontalpodautoscaler.autoscaling/istio-egressgateway    Deployment/istio-egressgateway    30%/60%    1         5         1          22h
istio-system   horizontalpodautoscaler.autoscaling/istio-ingressgateway   Deployment/istio-ingressgateway   30%/60%    1         5         1          22h
istio-system   horizontalpodautoscaler.autoscaling/istio-pilot            Deployment/istio-pilot            20%/55%    1         1         1          22h
istio-system   horizontalpodautoscaler.autoscaling/istio-policy           Deployment/istio-policy           313%/80%   1         5         5          22h
istio-system   horizontalpodautoscaler.autoscaling/istio-telemetry        Deployment/istio-telemetry        319%/80%   1         5         5          22h

@douglas-reid any ideas on the liveness probe timeout or further debug? Just sluggish responsiveness on the network? A slew of people are suffering from this problem.

Cheers
-steve

@sdake unfortunately, I don't have any real insight on the liveness probe timeouts (or why that would impact the k8s dashboard or tiller). I have not experienced these issues on my test clusters.

One thing to try, perhaps, is giving istio-policy and istio-telemetry more CPU and see if that resolves OOM issues (this came up in this week's WG meeting).
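
A minimal sketch of that experiment without re-rendering the charts (the resource values are arbitrary examples; the container name mixer matches the events above):

# Raise the CPU/memory requests on the mixer container of both deployments.
# Note the HPAs target a percentage of the CPU request, so raising the
# request also lowers the reported utilization percentage.
kubectl -n istio-system set resources deployment istio-policy \
    -c mixer --requests=cpu=1000m,memory=1Gi
kubectl -n istio-system set resources deployment istio-telemetry \
    -c mixer --requests=cpu=1000m,memory=1Gi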

@mandarjog any thoughts?

I got an email from the AKS team tonight. They told me that they increased the proxy_read_timeout for our clusters and that everything should work now. So I started to create a fresh cluster (see description below).

  • After helm install, the dashboard/helm is very responsive
  • After istio install, the dashboard/helm is slower, but still usable.

At the moment I'm waiting to see whether things get worse. If everything stays the same, I'm going to deploy some services and monitor the system for some more time. Hopefully everything will just work and this problem is gone for me. I'll let you know.

Istio AKS Demo Setup

CLI Versions

Helm: v2.9.1
Kubectl: v1.11.2

Setup

AKS

az group create --location westeurope --name istio-issue
az aks create --resource-group istio-issue --name istio-aks-cluster --node-count 3 --node-vm-size Standard_D2_v3 --kubernetes-version 1.11.1
az aks get-credentials --resource-group istio-issue --name istio-aks-cluster
kubectl config use-context istio-aks-cluster
kubectl create clusterrolebinding kubernetes-dashboard -n kube-system --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard

Helm

kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller -n tiller --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller

istio

Download the 1.0.0 release and go to the folder, then:

kubectl apply -f install/kubernetes/helm/istio/templates/crds.yaml
kubectl apply -f install/kubernetes/istio-demo.yaml

Dashboard

az aks browse --resource-group istio-issue --name istio-aks-cluster

I just tried to reconnect to the cluster and the issue is still there :(

istio-galley is crashing all the time.

➜  ~ kubectl get pods --all-namespaces=true -o wide

NAMESPACE      NAME                                        READY     STATUS             RESTARTS   AGE       IP            NODE
istio-system   grafana-86645d6b4d-66kt4                    1/1       Running            0          8h        10.244.1.8    aks-nodepool1-25917760-1
istio-system   istio-citadel-55d9bb9b5f-w2l66              1/1       Running            0          8h        10.244.0.6    aks-nodepool1-25917760-0
istio-system   istio-cleanup-secrets-7sff5                 0/1       Completed          0          8h        10.244.0.4    aks-nodepool1-25917760-0
istio-system   istio-egressgateway-74bbdd9669-dwt5b        1/1       Running            0          8h        10.244.1.6    aks-nodepool1-25917760-1
istio-system   istio-galley-d4bc6c974-ppcr6                0/1       CrashLoopBackOff   127        8h        10.244.1.10   aks-nodepool1-25917760-1
istio-system   istio-grafana-post-install-6pt4v            0/1       Completed          0          8h        10.244.2.5    aks-nodepool1-25917760-2
istio-system   istio-ingressgateway-756584cc64-kvkq5       1/1       Running            0          8h        10.244.1.7    aks-nodepool1-25917760-1
istio-system   istio-pilot-7dd78846f5-hg67f                2/2       Running            0          8h        10.244.2.7    aks-nodepool1-25917760-2
istio-system   istio-policy-b9d65465-5jzfw                 2/2       Running            0          2h        10.244.2.8    aks-nodepool1-25917760-2
istio-system   istio-policy-b9d65465-7k767                 2/2       Running            0          1h        10.244.0.12   aks-nodepool1-25917760-0
istio-system   istio-policy-b9d65465-pz7bb                 2/2       Running            0          8h        10.244.0.5    aks-nodepool1-25917760-0
istio-system   istio-policy-b9d65465-qn2kl                 2/2       Running            0          5h        10.244.1.11   aks-nodepool1-25917760-1
istio-system   istio-policy-b9d65465-tzll5                 2/2       Running            0          2h        10.244.1.13   aks-nodepool1-25917760-1
istio-system   istio-sidecar-injector-854f6498d9-jdxf5     1/1       Running            0          8h        10.244.0.9    aks-nodepool1-25917760-0
istio-system   istio-statsd-prom-bridge-549d687fd9-p5xzf   1/1       Running            0          8h        10.244.1.5    aks-nodepool1-25917760-1
istio-system   istio-telemetry-64fff55fdd-94qjg            2/2       Running            0          3h        10.244.1.12   aks-nodepool1-25917760-1
istio-system   istio-telemetry-64fff55fdd-fjg8q            2/2       Running            0          8h        10.244.2.6    aks-nodepool1-25917760-2
istio-system   istio-telemetry-64fff55fdd-hscs2            2/2       Running            0          2h        10.244.0.10   aks-nodepool1-25917760-0
istio-system   istio-telemetry-64fff55fdd-krc4w            2/2       Running            0          2h        10.244.0.11   aks-nodepool1-25917760-0
istio-system   istio-telemetry-64fff55fdd-pnl7n            2/2       Running            0          2h        10.244.1.14   aks-nodepool1-25917760-1
istio-system   istio-tracing-7596597bd7-4fj7z              1/1       Running            0          8h        10.244.0.8    aks-nodepool1-25917760-0
istio-system   prometheus-6ffc56584f-r9ls6                 1/1       Running            0          8h        10.244.1.9    aks-nodepool1-25917760-1
istio-system   servicegraph-7bdb8bfc9d-gmhk6               1/1       Running            0          8h        10.244.0.7    aks-nodepool1-25917760-0
kube-system    azureproxy-58b96f4d87-78n7q                 1/1       Running            2          9h        10.244.2.3    aks-nodepool1-25917760-2
kube-system    heapster-6fdcf4f4f4-fp9wb                   2/2       Running            0          9h        10.244.0.2    aks-nodepool1-25917760-0
kube-system    kube-dns-v20-56b5b568d-hrmvz                3/3       Running            0          9h        10.244.0.3    aks-nodepool1-25917760-0
kube-system    kube-dns-v20-56b5b568d-prlng                3/3       Running            0          9h        10.244.1.2    aks-nodepool1-25917760-1
kube-system    kube-proxy-2t4f2                            1/1       Running            0          9h        10.240.0.4    aks-nodepool1-25917760-1
kube-system    kube-proxy-4hljf                            1/1       Running            0          9h        10.240.0.6    aks-nodepool1-25917760-0
kube-system    kube-proxy-ngks4                            1/1       Running            0          9h        10.240.0.5    aks-nodepool1-25917760-2
kube-system    kube-svc-redirect-vkw9r                     1/1       Running            0          9h        10.240.0.6    aks-nodepool1-25917760-0
kube-system    kube-svc-redirect-zjgx9                     1/1       Running            0          9h        10.240.0.4    aks-nodepool1-25917760-1
kube-system    kube-svc-redirect-zz5lh                     1/1       Running            0          9h        10.240.0.5    aks-nodepool1-25917760-2
kube-system    kubernetes-dashboard-7979b9b5f4-qg4kx       1/1       Running            2          9h        10.244.2.4    aks-nodepool1-25917760-2
kube-system    metrics-server-789c47657d-6q88f             1/1       Running            2          9h        10.244.2.2    aks-nodepool1-25917760-2
kube-system    tiller-deploy-759cb9df9-8gx7g               1/1       Running            0          8h        10.244.1.4    aks-nodepool1-25917760-1
kube-system    tunnelfront-6dc6bd7cb8-ntvjn                1/1       Running            0          9h        10.244.1.3    aks-nodepool1-25917760-1

Pods

➜  ~ k describe pods istio-galley-d4bc6c974-ppcr6  -n istio-system
Name:               istio-galley-d4bc6c974-ppcr6
Namespace:          istio-system
Priority:           0
PriorityClassName:  <none>
Node:               aks-nodepool1-25917760-1/
Start Time:         Fri, 17 Aug 2018 09:24:28 +0200
Labels:             istio=galley
                    pod-template-hash=806727530
Annotations:        scheduler.alpha.kubernetes.io/critical-pod=
                    sidecar.istio.io/inject=false
Status:             Running
IP:                 10.244.1.10
Controlled By:      ReplicaSet/istio-galley-d4bc6c974
Containers:
  validator:
    Container ID:  docker://e7cb57eb08156d2ed8ca24648dde9779d350534cd446e727ad1fb9555205d24a
    Image:         gcr.io/istio-release/galley:1.0.0
    Image ID:      docker-pullable://gcr.io/istio-release/galley@sha256:01394fea1e55de6d4c7fbfc28c2dd7462bd26e093008367972b04e29d5b475cf
    Ports:         443/TCP, 9093/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /usr/local/bin/galley
      validator
      --deployment-namespace=istio-system
      --caCertFile=/etc/istio/certs/root-cert.pem
      --tlsCertFile=/etc/istio/certs/cert-chain.pem
      --tlsKeyFile=/etc/istio/certs/key.pem
      --healthCheckInterval=2s
      --healthCheckFile=/health
      --webhook-config-file
      /etc/istio/config/validatingwebhookconfiguration.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 17 Aug 2018 17:47:32 +0200
      Finished:     Fri, 17 Aug 2018 17:47:47 +0200
    Ready:          False
    Restart Count:  131
    Requests:
      cpu:        10m
    Liveness:     exec [/usr/local/bin/galley probe --probe-path=/health --interval=4s] delay=4s timeout=1s period=4s #success=1 #failure=3
    Readiness:    exec [/usr/local/bin/galley probe --probe-path=/health --interval=4s] delay=4s timeout=1s period=4s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/istio/certs from certs (ro)
      /etc/istio/config from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from istio-galley-service-account-token-5slbs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.istio-galley-service-account
    Optional:    false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-galley-configuration
    Optional:  false
  istio-galley-service-account-token-5slbs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio-galley-service-account-token-5slbs
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                 From                               Message
  ----     ------     ----                ----                               -------
  Warning  Unhealthy  1h (x167 over 8h)   kubelet, aks-nodepool1-25917760-1  Readiness probe failed: fail on inspecting path /health: stat /health: no such file or directory
  Normal   Started    28m (x122 over 8h)  kubelet, aks-nodepool1-25917760-1  Started container
  Warning  BackOff    3m (x1249 over 8h)  kubelet, aks-nodepool1-25917760-1  Back-off restarting failed container

Created a brand new AKS cluster with much larger nodes (8 cores, 28 GB memory, x3). Definitely zero resource issues occurring in this setup, but the moment Istio is installed, Helm/Dashboard are unusable and some kubectl commands begin to fail randomly.

From the helm logs, it seems Tiller is having issues accessing the API.

[storage/driver] 2018/08/19 16:31:51 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
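
For anyone wanting to check for the same symptom, the Tiller logs can be pulled directly (a sketch, assuming Tiller runs in kube-system as in the pod listings above):

kubectl -n kube-system logs deployment/tiller-deploy --tail=50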

After removing the istio-system namespace and everything in it, the Dashboard becomes responsive and Helm works 100% again.

Emailed [email protected] to try and get some additional assistance/point out this issue to them. Not sure what else I can do to help debug.

Just got this from Azure Support:

Our engineering team was able to see that the istio-pilot pod eventually fails its health check and goes into CrashLoopBackoff state and restarts. The health check fails with:
Readiness probe failed: Get http://10.200.0.39:8080/debug/endpointz: dial tcp 10.200.0.39:8080: connect: connection refused.

It appears either this health check is misconfigured, or the istio-proxy container in the istio-pilot pod is failing to open a listen socket as expected.

Our engineering team ran commands from within both containers in that pod and from the istio-ingress controller, and "connection refused" everywhere tells them that the connectivity is good but that nothing is listening at the expected address http://10.200.0.39:8080 even though that IP was successfully assigned to the pod.

In reading through Istio documentation, there is not much troubleshooting information for our engineering team to assist with. It is suggested that you compare this health check configuration to their working Istio installation, and perhaps engage with the support team at Istio.

@costinm @andraxylia any thoughts here re: istio-pilot health check?

@blackbaud-brandonstirnaman Were you able to get any response back from [email protected] regarding this issue?

We have a fix to the Pilot readiness in 1.0.1 - as well as several memory optimizations.

However I'm a bit confused - a pilot (or any other app) crash loop or failure should not impact the dashboard or apiserver. It's a different, isolated container, and the probes are not happening that often.

We will need to find a way to run tests on multiple platforms - right now we rely on volunteers testing on each individual platform (which really translates into each vendor contributing to Istio testing the platforms they support).

@costinm How can I test the changes from 1.0.1 in my AKS cluster?
I don't see it tagged in the daily builds. Is 1.0.1 master?
https://gcsweb.istio.io/gcs/istio-prerelease/daily-build/

@costinm is this the fix you are talking about?
https://github.com/istio/istio/issues/7586

I just installed a freshly created AKS cluster following:
https://gist.github.com/BernhardRode/57099e039c75072ba04d91ed3d22935a

Instead of downloading the 1.0 release, I used:

https://gcsweb.istio.io/gcs/istio-prerelease/daily-build/release-1.0-20180822-09-15/

After some minutes, everything seems to be cool. I'm going to deploy some services and give it a shot.

Thanks to @ayj
https://github.com/istio/istio/issues/7586#issuecomment-415192552

@CapTaek radio silence from AKS-Help so far.

Installed the latest daily on a new cluster (with galley enabled this time). Nothing is crashing/restarting, but the same K8S Dashboard performance problems remain with istio installed; no issues with it removed.

I also installed the latest daily istio-release-1.0-20180822-09-15 build on my AKS cluster. Everything was running smoothly for a bit, so I deployed a simple application with a gateway configuration, and then I noticed the istio-telemetry and istio-policy pods using a lot of CPU. When they autoscaled to 2 replicas, their replicas went into a CrashLoopBackOff state with the error:
Liveness probe failed: Get http://10.200.0.90:9093/version: dial tcp 10.200.0.90:9093: connect: connection refused

Looking at my logs there are a lot of these errors:
Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)
and these
gc 233 @3240.155s 0%: 0.044+5.9+7.6 ms clock, 0.089+0.24/2.8/8.2+15 ms cpu, 15->15->7 MB, 16 MB goal, 2 P\n
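
For reference, errors like these can be pulled straight from the Mixer and Pilot containers with something like the following (a sketch; the deployment names match the listings above, and the container names mixer and discovery are the stock chart names):

kubectl -n istio-system logs deployment/istio-telemetry -c mixer --tail=200 | grep -i 'failed to list'
kubectl -n istio-system logs deployment/istio-pilot -c discovery --tail=200 | grep -i 'failed to list'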

My Kubernetes Dashboard is still unresponsive, but Helm is working.
Microsoft has been responsive to me through the Azure support channel, but they are out of ways to troubleshoot the issue.

Ran the same daily (release-1.0-20180822-09-15) overnight on AKS (Istio installed via Helm with no options) and I also put in a couple of test services. There is no load on the cluster, no one is using it. As @rsnj reported, telemetry and policy are having a bad time:

❯ kubectl get pods --all-namespaces
NAMESPACE      NAME                                        READY     STATUS             RESTARTS   AGE
istio-system   istio-citadel-5d85b758f4-vkr5z              1/1       Running            0          16h
istio-system   istio-egressgateway-5764c598cf-qckqm        1/1       Running            0          16h
istio-system   istio-galley-5f595485b9-g9fb4               1/1       Running            0          16h
istio-system   istio-ingressgateway-6647dd4b64-4pmc2       1/1       Running            0          16h
istio-system   istio-pilot-57ffcdc795-74pq7                2/2       Running            0          16h
istio-system   istio-policy-87bfd665b-mvjmz                1/2       CrashLoopBackOff   219        9h
istio-system   istio-policy-87bfd665b-psqjh                1/2       CrashLoopBackOff   317        14h
istio-system   istio-policy-87bfd665b-qjjvg                1/2       CrashLoopBackOff   283        12h
istio-system   istio-policy-87bfd665b-shmx4                2/2       Running            0          16h
istio-system   istio-policy-87bfd665b-zhxxm                1/2       CrashLoopBackOff   341        15h
istio-system   istio-sidecar-injector-6677558cfc-jlbp6     1/1       Running            0          16h
istio-system   istio-statsd-prom-bridge-7f44bb5ddb-fgxkv   1/1       Running            0          16h
istio-system   istio-telemetry-696487b84f-2n92r            2/2       Running            0          16h
istio-system   istio-telemetry-696487b84f-4b9px            1/2       CrashLoopBackOff   329        15h
istio-system   istio-telemetry-696487b84f-7mdg8            1/2       CrashLoopBackOff   339        15h
istio-system   istio-telemetry-696487b84f-pxgkq            1/2       CrashLoopBackOff   303        13h
istio-system   istio-telemetry-696487b84f-wpr6h            1/2       CrashLoopBackOff   281        12h
istio-system   prometheus-84bd4b9796-lfhd4                 1/1       Running            0          16h
kube-system    azureproxy-6496d6f4c6-4hx8w                 1/1       Running            2          17h
kube-system    heapster-864b6d7fb7-gjgj6                   2/2       Running            0          17h
kube-system    kube-dns-v20-5695d5c69d-879xd               3/3       Running            0          17h
kube-system    kube-dns-v20-5695d5c69d-l897h               3/3       Running            0          17h
kube-system    kube-proxy-7pfbs                            1/1       Running            0          17h
kube-system    kube-proxy-whzvk                            1/1       Running            0          17h
kube-system    kube-proxy-zpkjt                            1/1       Running            0          17h
kube-system    kube-svc-redirect-6bksg                     1/1       Running            0          17h
kube-system    kube-svc-redirect-j8jj5                     1/1       Running            1          17h
kube-system    kube-svc-redirect-k7w48                     1/1       Running            0          17h
kube-system    kubernetes-dashboard-66bf8db6cf-pcqhv       1/1       Running            3          17h
kube-system    metrics-server-64f6d6b47-9922n              1/1       Running            0          17h
kube-system    tiller-deploy-895d57dd9-2k6ns               1/1       Running            0          16h
kube-system    tunnelfront-dcc6d8447-gpgsq                 1/1       Running            0          17h

I was using istio-release-1.0-20180820-09-15 and Galley was crashing, so the problem seemed to move around (See #7586).

So, with no load, the policy and telemetry services are crashing at startup? @rsnj and @polothy is it possible that this is a recurrence of https://github.com/istio/istio/issues/6152? There was a PR to fix that a while back that was shelved because Istio 1.0 went with k8s 1.9 as the base. Maybe it is still needed?

I have no idea, sorry. The cluster is still standing, would you like me to run any commands for you?

@douglas-reid I'm using a brand new AKS Cluster running Kubernetes 1.11.2 that has no load on it.
I installed Istio via helm using the default settings.
I then deployed a simple service and connected it to a gateway.
After the services deployed, the entire system went into deadlock, and istio-policy and istio-telemetry started using more and more CPU until they replicated and the second replica just went into CrashLoopBackOff. My service was never accessible.

Looking at the logs, I can see my services deployed and then its just a steady stream of the same error coming from istio-mixer and istio-pilot.

There are thousands of errors just like these:

2018-08-23T14:59:29.341Z | istio-release/mixer | 2018-08-23T14:59:29.341670Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.175Z | istio-release/mixer | 2018-08-23T14:59:29.175730Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get redisquotas.config.istio.io)\n
2018-08-23T14:59:29.163Z | istio-release/mixer | 2018-08-23T14:59:29.163277Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get tracespans.config.istio.io)\n
2018-08-23T14:59:29.162Z | istio-release/pilot | 2018-08-23T14:59:29.162425Z\terror\tistio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:217: Failed to list *v1.Endpoints: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints)\n
2018-08-23T14:59:29.158Z | istio-release/mixer | 2018-08-23T14:59:29.158004Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.157Z | istio-release/mixer | 2018-08-23T14:59:29.157718Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:145: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)\n
2018-08-23T14:59:29.157Z | istio-release/mixer | 2018-08-23T14:59:29.157722Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get stdios.config.istio.io)\n
2018-08-23T14:59:29.157Z | istio-release/mixer | 2018-08-23T14:59:29.157766Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:145: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)\n
2018-08-23T14:59:29.156Z | istio-release/mixer | 2018-08-23T14:59:29.156579Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.156Z | istio-release/mixer | 2018-08-23T14:59:29.155984Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:148: Failed to list *v1beta1.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.extensions)\n
2018-08-23T14:59:29.078Z | istio-release/mixer | 2018-08-23T14:59:29.077170Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.078Z | istio-release/pilot | 2018-08-23T14:59:29.078054Z\terror\tistio.io/istio/pilot/pkg/config/kube/crd/controller.go:208: Failed to list *crd.EnvoyFilter: the server was unable to return a response in the time allotted, but may still be processing the request (get envoyfilters.networking.istio.io)\n
2018-08-23T14:59:29.077Z | istio-release/mixer | 2018-08-23T14:59:29.076728Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:146: Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n
2018-08-23T14:59:29.071Z | istio-release/mixer | 2018-08-23T14:59:29.071149Z\terror\tistio.io/istio/mixer/pkg/config/crd/store.go:119: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get listcheckers.config.istio.io)\n
2018-08-23T14:59:28.914Z | prometheus | level=error ts=2018-08-23T14:59:28.894316964Z caller=main.go:218 component=k8s_client_runtime err=\"github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:287: Failed to list *v1.Endpoints: the server cannot complete the requested operation at this time, try again later (get endpoints)\"\n
2018-08-23T14:59:28.914Z | istio-release/mixer | 2018-08-23T14:59:28.893873Z\terror\tistio.io/istio/mixer/adapter/kubernetesenv/cache.go:147: Failed to list *v1.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)\n

@rsnj Wow: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods). That is not good. I wonder why everything with the API Server is so slow.

@douglas-reid I'm not sure, but I'm happy to help debug. I also have a ticket open with Azure Support who are trying to troubleshoot on their end. They applied some patches on their end, but nothing has seemed to fix the issue. You can see their response here.

Installing Istio makes the entire Kubernetes Dashboard unusable and removing it immediately resolves the issue.

@rsnj any chance they could report on the health of the API server itself?

In the meantime, here's an experiment to try: delete the deployments for istio-policy and istio-telemetry. This should remove ~8 watches on the API server. If that helps, it might indicate that Istio is overwhelming the API server on startup.
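
Concretely, the experiment is just (a sketch; re-applying the rendered chart, or a helm upgrade, restores the deployments afterwards):

# Temporarily remove the two Mixer deployments, then re-test the dashboard and helm ls.
kubectl -n istio-system delete deployment istio-policy istio-telemetry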

Just jumping in here, I was experiencing the exact same issue (policy & telemetry scaling up, then entering CrashLoopBackOff, etc). I just removed those two deployments and the load dropped and everything seems to be responsive again (kubectl, dashboard, etc).

@douglas-reid I deleted both deployments and the system came back. The Dashboard is now working and my log is no longer being flooded with errors. The only issue now is that my service is still not accessible. The web frontend went from a 404 error to a 503: UNAVAILABLE:no healthy upstream.

@rsnj OK. It seems that the API Server in your cluster is possibly getting overwhelmed by the watch clients in the various components. Can you try adding back the istio-policy deployment and see if things still work?

@douglas-reid If I just start istio-policy everything looks like its working.
I also tried starting istio-telemetry and then the entire system went down.

@rsnj this seems like an issue with the resources given to the API Server. I'd suggest trying the experiment in reverse (delete both, then add back istio-telemetry and then, after testing, add back istio-policy).

Mixer (which backs both istio-policy and istio-telemetry) opens a fair number of watches (~40) on CRDs and otherwise. I suspect that the API Server in these clusters is just not set up to handle this.
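
A rough way to see the scale of that in a given cluster (a sketch; it simply counts the Istio CRDs registered with the API server - the config.istio.io group is the set Mixer watches):

kubectl get crd | grep -c 'istio.io'
kubectl get crd | grep -c 'config.istio.io'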

If Azure Support has any information on how to increase resources for the API Server, that'd be the best way to resolve the issue. Maybe @lachie83 has some ideas (or contacts that do) here?

@douglas-reid thanks for all your help with this issue. I forwarded all your comments to Azure support.

So I redeployed the istio-telemetry deployment and that pod just went into a CrashLoopBackOff state. So I deleted the istio-policy deployment and the istio-telemetry pod was stable, but my website went down with the error: UNAVAILABLE:no healthy upstream.

So far the only way the system is stable is if I have everything running except for istio-telemetry. Any ideas?

@rsnj It appears, at the moment, that on AKS you have to choose between policy or telemetry. If you aren't enforcing any policies in the Mixer layer (rate limits, whitelists, etc.), then I would recommend prioritizing telemetry (but that's the part of the system I spend the most time on, so I may be slightly biased). Istio RBAC currently does not require Mixer, so you'll still have some functionality policy-wise.

To be successful without istio-policy running, you'll need to turn off check calls (otherwise you'll get connectivity issues as requests are denied because the proxy cannot reach the policy service). To do that, you need to install Istio with global.disablePolicyChecks set to true. I haven't spent much time trying this out, but I know that others have done this, so if this is of interest, I'm sure we can get this working. Istio is working on documentation for piecemeal installs. This would be a good test case.
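
Assuming the stock chart values, that install would look something like this (a sketch; only the flag relevant to this experiment is shown, and the output file name is arbitrary):

# Render the chart with Mixer policy checks disabled so the proxies don't
# block requests on an unreachable istio-policy service.
helm template install/kubernetes/helm/istio --name istio \
    --namespace istio-system \
    --set global.disablePolicyChecks=true > istio-no-policy.yaml
kubectl apply -f istio-no-policy.yaml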

In the slightly longer term, Mixer should reduce the number of CRDs down to 3, which should help reduce the burden on the API Server. Sometime after that, Mixer will receive config directly from Galley, reducing the burden even further.

Does that help?

Just an update: I left the cluster running for about 8 hours without istio-telemetry running. The istio-policy pods autoscaled to 5 instances that were all in a CrashLoopBackOff state, and my entire cluster went down again. The cluster has zero load on it and only has a simple web service running without any external dependencies.

@douglas-reid I will try your suggestion next to enable telemetry and disable policy.

@douglas-reid @rsnj I appreciate the heads up. I'm investigating and will report back.

Is it related to https://github.com/Azure/AKS/issues/618? Deleting the HPA stops the crash loop backoff in telemetry and policy.
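
For anyone who wants to test that quickly, the autoscalers from the earlier listing can be removed with (a sketch; the chart recreates them on the next install/upgrade):

kubectl -n istio-system delete hpa istio-policy istio-telemetry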

Galley only runs in validation mode in Istio 1.0, so I do not think Galley contributes to this issue on the 1.0 release.

@lachie83 any updates?

All, a recent PR was merged to add initial support for CRD consolidation within Mixer. I cannot promise that it solves this issue, but I am looking for brave volunteers to try from tip with an edited deployment to set UseAdapterCRDs=false in the Mixer deployments and see if that improves things.

I'm happy to help guide that process, if someone has access to AKS clusters and some free time.

I've given this flag a try in a test environment on AKS, using release istio-release-1.1-20181021-09-15. As far as I can see, there is immediate improvement.

With 1.0.2, I had similar problems as others describe above; a cluster with mTLS enabled for the default namespace (not globally) and a few sample services deployed would become sluggish after a few hours. At this point, I could see that the telemetry and policy pods would restart frequently, and requests to the API-server (from e.g. Tiller and NMI (https://github.com/Azure/aad-pod-identity)) would time out. Deleting the istio-telemetry Deployment would immediately make the cluster responsive again.

With istio-release-1.1-20181021-09-15, and the flag applied, the cluster seems to be stable.

Installation was performed as follows, using Helm 2.9.1 and Kubernetes 1.11.2:

  • edit ${ISTIO_HOME}/install/kubernetes/helm/subcharts/mixer/templates/deployment.yaml, adding - --useAdapterCRDs=false to the args of the mixer container in the policy_container and telemetry_container sections.
  • helm dependency update ${ISTIO_HOME}/install/kubernetes/helm/istio
  • helm install ${ISTIO_HOME}/install/kubernetes/helm/istio --name istio --namespace istio-system --tls --wait --set global.configValidation=true --set sidecarInjectorWebhook.enabled=true --set gateways.istio-ingressgateway.loadBalancerIP=${PUBLIC_IP}

Are there any plans to explicitly support the useAdapterCRDs flag in Istio 1.1, i.e. either exposing it through Helm variables or setting the default to false? Not having to manually edit the deployment file would definitely be preferable.

@fhoy @douglas-reid I just updated my existing PR for the helm chart to include the switch for useAdapterCRDs https://github.com/istio/istio/pull/9435/files

@dtzar switching the default is definitely on the agenda for 1.1. I'll be raising that issue at the P&T WG meeting today.

fwiw, in the WG meeting today, we opted for the following approach:

  • leave UseAdapterCRDs defaulting to true for 1.1 (switching to false in 1.2 and removing it altogether in 1.3).
  • add helm option for controlling value (PR out)
  • add deprecation warnings for adapter CRDs
  • build tool to convert between the models in 1.2 timeframe

Hope that helps.

Finally had a chance to do some testing around useAdapterCRDs in my AKS test cluster today. Definitely had a massive positive impact.

Edit: a few days later, still seeing a negative impact on API calls from Dashboard. 😢

To report back: I deployed a new cluster with useAdapterCRDs=false and it's been running for ~10 days or so now without a recurrence of the watch issue slowing helm/dashboard.

Good work! It'd be great if we could get the helm option @douglas-reid mentioned so my scripts can stop hacking the subchart in 1.1 releases. Can't find the PR mentioned though.

Hopefully this PR gets merged soon, and then we can close this issue: https://github.com/istio/istio/pull/10247

The existing PR is still held up by debate about what to do with the conditional CRDs in the helm chart, so I split enabling useAdapterCRDs in the helm chart into a separate PR.

@douglas-reid I believe I addressed all your feedback from the 1st PR, so theoretically this one should be good to merge.

https://github.com/istio/istio/pull/10404 was merged so this can probably be closed

Took me some time to get back and close this. Apologies and thanks for the fix!
