cluster-autoscaler is not launching correctly on AWS; the pod logs show the following error:

```
E0129 06:27:24.264163 1 static_autoscaler.go:135] Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.ap-southeast-2b.amazonaws.com/: dial tcp: lookup autoscaling.ap-southeast-2b.amazonaws.com on 172.20.0.2:53: no such host
```
Bug report information:
I have a running cluster with a number of instance groups. I am using the following config for the autoscaler deployment:
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: cluster-autoscaler
  template:
    metadata:
      labels:
        k8s-addon: cluster-autoscaler.addons.k8s.io
        k8s-app: cluster-autoscaler
      annotations:
        # For 1.6, we keep the old tolerations in case of a downgrade to 1.5
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"master"}]'
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: gcr.io/google_containers/cluster-autoscaler:v1.1.0
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes={{MIN_NODES_1}}:{{MAX_NODES_1}}:{{ASG_NAME_1}}
          env:
            - name: AWS_REGION
              value: {{AWS_REGION}}
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: /etc/ssl/certs/ca-certificates.crt
      dnsPolicy: Default
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule
```
I have gone through the solutions in Issue 1796, but none of them helped. I have verified that {{ASG_NAME_1}} is being replaced with the correct name of the AWS autoscaling group. Also, the minimum number of nodes for the ASG is 0.
Finally, I have tried different DNS policies and different image versions (v0.4.0 and v0.6.0, neither of which I could get to spin up without crashing), all to no avail.
Any help is greatly appreciated :)
I realised that I had been putting in the AZ instead of the region into the AWS_REGION placeholder. Please disregard.
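For anyone hitting the same error: the AWS SDK builds the Auto Scaling endpoint hostname from the region, so passing an Availability Zone (e.g. `ap-southeast-2b`) instead of the region (`ap-southeast-2`) yields a hostname that does not exist in DNS, which is exactly the `no such host` failure in the log above. A minimal illustrative sketch (the helper names here are made up for illustration, not part of cluster-autoscaler or the AWS SDK):

```python
def autoscaling_endpoint(region: str) -> str:
    """Build the regional AWS Auto Scaling endpoint hostname."""
    return f"autoscaling.{region}.amazonaws.com"

def az_to_region(az: str) -> str:
    """Strip the trailing zone letter from an AZ name,
    e.g. 'ap-southeast-2b' -> 'ap-southeast-2'."""
    return az.rstrip("abcdefghijklmnopqrstuvwxyz")

# The AZ produces the hostname that failed to resolve in the log:
print(autoscaling_endpoint("ap-southeast-2b"))
# -> autoscaling.ap-southeast-2b.amazonaws.com (does not exist in DNS)

# The region produces the real endpoint:
print(autoscaling_endpoint(az_to_region("ap-southeast-2b")))
# -> autoscaling.ap-southeast-2.amazonaws.com
```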