External-dns: Debug "failed to sync cache: timed out waiting for the condition"

Created on 20 Sep 2019 · 14 Comments · Source: kubernetes-sigs/external-dns

Can anyone suggest how I might go about debugging the error message "failed to sync cache: timed out waiting for the condition"? I'm trying to set up external-dns against a private DNS zone in GCP. I've granted the GKE cluster the correct OAuth role, and my deployment (rendered from Helm charts) is below:

---
# Source: external-dns/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: development-external-dns
  labels: 
    app.kubernetes.io/name: external-dns
    helm.sh/chart: external-dns-2.6.3
    app.kubernetes.io/instance: development
    app.kubernetes.io/managed-by: Tiller
spec:
  replicas: 1
  selector:
    matchLabels: 
      app.kubernetes.io/name: external-dns
      app.kubernetes.io/instance: development
  template:
    metadata:
      labels: 
        app.kubernetes.io/name: external-dns
        helm.sh/chart: external-dns-2.6.3
        app.kubernetes.io/instance: development
        app.kubernetes.io/managed-by: Tiller
      annotations:
    spec:      
      securityContext: 
        fsGroup: 1001
        runAsUser: 1001

      serviceAccountName: "default"
      containers:
      - name: external-dns
        image: "docker.io/bitnami/external-dns:0.5.17-debian-9-r0"
        imagePullPolicy: "IfNotPresent"
        args:
        # Generic arguments
        - --log-level=debug
        - --domain-filter=detection.int
        - --policy=upsert-only
        - --provider=google
        - --registry=txt
        - --interval=1m
        - --source=service
        - --source=ingress
        # AWS arguments
        # Azure Arguments
        # Cloudflare arguments
        # Google Arguments
        - --google-project=test-project
        # Infloblox Arguments
        # RFC 2136 arguments
        # PowerDNS arguments
        # Extra arguments
        env:
        # AWS environment variables
        # Cloudflare environment variables
        # CoreDNS environment variables
        # DigitalOcean environment variables
        # Google environment variables
        # Infloblox environment variables
        # RFC 2136 environment variables
        # PowerDNS environment variables
        # Extra environment variables
        ports:
        - name: http
          containerPort: 7979
        readinessProbe: 
          failureThreshold: 6
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5

        livenessProbe: 
          failureThreshold: 2
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5

        volumeMounts:
        # AWS mountPath(s)
        # Azure mountPath(s)
        # CoreDNS mountPath(s)
        # Google mountPath(s)
        # Designate mountPath(s)
      volumes:
      # AWS volume(s)
      # Azure volume(s)
      # CoreDNS volume(s)
      # Google volume(s)
      # Designate volume(s)

Given that the service account GKE uses has the DNS Admin role, I think permissions are ok, but the only logging I get before my container fails is as follows:

time="2019-09-20T10:37:20Z" level=info msg="config: {Master: KubeConfig: RequestTimeout:30s IstioIngressGatewayServices:[istio-system/istio-ingressgateway] ContourLoadBalancerService:heptio-contour/contour Sources:[service ingress] Namespace: AnnotationFilter: FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false Compatibility: PublishInternal:false PublishHostIP:false ConnectorSourceServer:localhost:8080 Provider:google GoogleProject:test-project DomainFilter:[detection.int] ExcludeDomains:[] ZoneIDFilter:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: CloudflareProxied:false CloudflareZonesPerPage:50 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml InMemoryZones:[] PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSTLSEnabled:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:upsert-only Registry:txt TXTOwnerID:default TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 LogLevel:debug TXTCacheInterval:0s ExoscaleEndpoint:https://api.exoscale.ch/dns ExoscaleAPIKey: ExoscaleAPISecret: CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false NS1Endpoint: NS1IgnoreSSL:false TransIPAccountName: TransIPPrivateKeyFile:}"
time="2019-09-20T10:37:20Z" level=info msg="Created Kubernetes client https://10.1.48.1:443"
time="2019-09-20T10:38:20Z" level=fatal msg="failed to sync cache: timed out waiting for the condition"

Thoughts?

lifecycle/rotten

aodj

Most helpful comment

OK, I finally solved my case: I'm not using Tiller, but rather rendering Helm charts locally. The Helm chart I'm using for external-dns is this one: https://github.com/helm/charts/tree/master/stable/external-dns

When I rendered out the Helm chart, I was specifying a namespace with helm template --namespace external-dns ... which results in the rendered clusterrolebinding.yaml having a subject.namespace of external-dns.

Unfortunately, nothing else in the chart seems to respect the namespacing; everything was being deployed to the default namespace. When I changed the value to default and reapplied my configuration, the pod ran successfully.

_Edit: I realise this might be a shortcoming of rendering locally rather than relying upon Tiller; YMMV._
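
For anyone else rendering the chart locally, here is a minimal Python sketch of a check that would flag this kind of mismatch. It is not part of external-dns or Helm: the function names are made up for illustration, the manifests are assumed to be already parsed into dicts, and a manifest without metadata.namespace is assumed to end up in the default namespace.

```python
def pod_service_accounts(manifests, default_namespace="default"):
    """Return the (namespace, serviceAccountName) pairs used by Deployments."""
    accounts = set()
    for m in manifests:
        if m.get("kind") != "Deployment":
            continue
        # Assumption: no metadata.namespace means the object lands in "default".
        namespace = m.get("metadata", {}).get("namespace", default_namespace)
        pod_spec = m.get("spec", {}).get("template", {}).get("spec", {})
        accounts.add((namespace, pod_spec.get("serviceAccountName", "default")))
    return accounts


def unbound_service_accounts(manifests, default_namespace="default"):
    """Return Deployment service accounts that no ClusterRoleBinding subject matches."""
    bound = set()
    for m in manifests:
        if m.get("kind") != "ClusterRoleBinding":
            continue
        for subject in m.get("subjects", []):
            if subject.get("kind") == "ServiceAccount":
                bound.add((subject.get("namespace"), subject.get("name")))
    return pod_service_accounts(manifests, default_namespace) - bound


# Hypothetical rendered output that mirrors the situation described above.
rendered = [
    {"kind": "Deployment",
     "metadata": {"name": "development-external-dns"},
     "spec": {"template": {"spec": {"serviceAccountName": "default"}}}},
    {"kind": "ClusterRoleBinding",
     "metadata": {"name": "development-external-dns"},
     "subjects": [{"kind": "ServiceAccount", "name": "default",
                   "namespace": "external-dns"}]},
]

print(unbound_service_accounts(rendered))  # prints {('default', 'default')}
```

A non-empty result means the pod runs under a service account that the binding does not cover. The sketch only looks at ServiceAccount subjects, so bindings to groups or users are ignored.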

aodj on 30 Sep 2019

👍3

All 14 comments

I'm facing the same problem with external-dns and DigitalOcean.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: external-dns
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      containers:
      - name: external-dns
        image: bitnami/external-dns:latest
        args:
        - --source=service # ingress is also possible
        - --domain-filter=k8sdo.ml
        - --provider=digitalocean
        - --policy=upsert-only
        env:
        - name: DO_TOKEN
          valueFrom:
            secretKeyRef:
              name: digitalocean
              key: token

joaovitor on 23 Sep 2019

Same problem for me: https://github.com/kubernetes-incubator/external-dns/issues/1201

Running via the command line works fine. I have no idea what's going on, and debug logging doesn't show any more details.

Update: Got it running now; please check the URL above for the fix.

themac13 on 23 Sep 2019

Deploying external-dns with Helm worked.
It uses this image: docker.io/bitnami/external-dns:0.5.17-debian-9-r0 and these args.

      - args:
        - --log-level=info
        - --domain-filter=k8sdo.ml
        - --policy=sync
        - --provider=digitalocean
        - --registry=txt
        - --interval=1m
        - --source=service
        - --source=ingress

joaovitor on 24 Sep 2019

Maybe helm takes care of the permissions setup?

themac13 on 24 Sep 2019

Hi there 👋

Just faced the same problem.

Using the Helm chart with rbac.create: false (the default) and Cloudflare yielded the same error: time="2019-09-25T17:23:12Z" level=fatal msg="failed to sync cache: timed out waiting for the condition"

I fixed it by enabling RBAC in the Helm chart, and everything worked just fine.

Does anyone have any clue why this is happening? Should we enable RBAC by default in the Helm chart?

jonatasbaldin on 25 Sep 2019

Is there really no path to debugging this? We're not currently using Tiller or RBAC in our cluster, since I'm just trying to test it out at first. The whole "enable RBAC and hope for the best" seems pretty sketchy to me, since it sidesteps the issue rather than identifying the actual cause of the problem.
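
One way to narrow this down without changing anything in the cluster is to ask the API server whether the pod's service account may list and watch the resources that the configured sources need. Below is a small Python sketch that only builds the kubectl auth can-i commands. The helper name is hypothetical, and the resource list (services and ingresses, taken from the --source flags above) and the default/default service account are assumptions to adjust for your setup.

```python
import shlex


def can_i_commands(namespace, service_account,
                   resources=("services", "ingresses"),
                   verbs=("list", "watch")):
    """Build one `kubectl auth can-i` command per (verb, resource) pair."""
    # Service accounts are impersonated with this user name format.
    user = f"system:serviceaccount:{namespace}:{service_account}"
    commands = []
    for resource in resources:
        for verb in verbs:
            argv = ["kubectl", "auth", "can-i", verb, resource,
                    "--as", user, "--all-namespaces"]
            commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


for command in can_i_commands("default", "default"):
    print(command)
```

Each command answers yes or no when run. A no for any of them points at a missing or mismatched role binding. Running them requires an account that is allowed to impersonate service accounts.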

aodj on 30 Sep 2019

It's an issue with missing RBAC permissions.

The code sets up an informer and waits one minute for it to sync. But this never happens due to missing permissions and then prints the confusing error message.

Is there a way in the Informers framework to tell if we are lacking permissions of some sort?
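
To illustrate why a permissions problem can surface as a timeout: if the list call keeps failing and the caller only waits for the sync to complete, the underlying error never reaches the final message. Here is a rough Python sketch of that pattern, with the last error kept around so it can be included in the message. This is not external-dns or client-go code; wait_for_sync and forbidden_list are hypothetical names, and the "forbidden" text is a stand-in for whatever the API server would return.

```python
import time


def wait_for_sync(list_once, timeout=1.0, interval=0.05):
    """Retry list_once() until it succeeds or the timeout expires.

    The most recent failure is remembered so the timeout message can
    name it instead of reporting only that the wait ran out.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            list_once()
            return
        except Exception as err:
            last_error = err
            time.sleep(interval)
    message = "failed to sync cache: timed out waiting for the condition"
    if last_error is not None:
        message += f" (last error: {last_error})"
    raise TimeoutError(message)


def forbidden_list():
    # Stand-in for the API server rejecting a list call (hypothetical message).
    raise PermissionError("forbidden: cannot list services")


try:
    wait_for_sync(forbidden_list, timeout=0.2)
except TimeoutError as err:
    print(err)
```

Without the last_error bookkeeping, the caller would see only the generic timeout text, which is the situation described in this thread.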

linki on 30 Sep 2019

Do you happen to know what the missing permission is?

aodj on 30 Sep 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 29 Dec 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 28 Jan 2020

This sounds like we should fix the error message, though. A timeout hints at connectivity issues, not permission errors.

We would need to fix the error messages here and in all the other resources.

phillebaba on 12 Feb 2020

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 13 Mar 2020

@fejta-bot: Closing this issue.
