Can anyone suggest how I might go about debugging the error message "failed to sync cache: timed out waiting for the condition"? I'm trying to set up external-dns against a private DNS zone in GCP. I've granted the GKE cluster the correct OAuth role, and my deployment (from Helm charts) is below:
---
# Source: external-dns/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: development-external-dns
  labels:
    app.kubernetes.io/name: external-dns
    helm.sh/chart: external-dns-2.6.3
    app.kubernetes.io/instance: development
    app.kubernetes.io/managed-by: Tiller
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: external-dns
      app.kubernetes.io/instance: development
  template:
    metadata:
      labels:
        app.kubernetes.io/name: external-dns
        helm.sh/chart: external-dns-2.6.3
        app.kubernetes.io/instance: development
        app.kubernetes.io/managed-by: Tiller
      annotations:
    spec:
      securityContext:
        fsGroup: 1001
        runAsUser: 1001
      serviceAccountName: "default"
      containers:
      - name: external-dns
        image: "docker.io/bitnami/external-dns:0.5.17-debian-9-r0"
        imagePullPolicy: "IfNotPresent"
        args:
        # Generic arguments
        - --log-level=debug
        - --domain-filter=detection.int
        - --policy=upsert-only
        - --provider=google
        - --registry=txt
        - --interval=1m
        - --source=service
        - --source=ingress
        # AWS arguments
        # Azure Arguments
        # Cloudflare arguments
        # Google Arguments
        - --google-project=test-project
        # Infloblox Arguments
        # RFC 2136 arguments
        # PowerDNS arguments
        # Extra arguments
        env:
        # AWS environment variables
        # Cloudflare environment variables
        # CoreDNS environment variables
        # DigitalOcean environment variables
        # Google environment variables
        # Infloblox environment variables
        # RFC 2136 environment variables
        # PowerDNS environment variables
        # Extra environment variables
        ports:
        - name: http
          containerPort: 7979
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        volumeMounts:
        # AWS mountPath(s)
        # Azure mountPath(s)
        # CoreDNS mountPath(s)
        # Google mountPath(s)
        # Designate mountPath(s)
      volumes:
      # AWS volume(s)
      # Azure volume(s)
      # CoreDNS volume(s)
      # Google volume(s)
      # Designate volume(s)
Given that the service account GKE uses has the DNS Admin role, I think permissions are ok, but the only logging I get before my container fails is as follows:
time="2019-09-20T10:37:20Z" level=info msg="config: {Master: KubeConfig: RequestTimeout:30s IstioIngressGatewayServices:[istio-system/istio-ingressgateway] ContourLoadBalancerService:heptio-contour/contour Sources:[service ingress] Namespace: AnnotationFilter: FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false Compatibility: PublishInternal:false PublishHostIP:false ConnectorSourceServer:localhost:8080 Provider:google GoogleProject:test-project DomainFilter:[detection.int] ExcludeDomains:[] ZoneIDFilter:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: CloudflareProxied:false CloudflareZonesPerPage:50 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml InMemoryZones:[] PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSTLSEnabled:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:upsert-only Registry:txt TXTOwnerID:default TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 LogLevel:debug TXTCacheInterval:0s ExoscaleEndpoint:https://api.exoscale.ch/dns ExoscaleAPIKey: ExoscaleAPISecret: CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false NS1Endpoint: NS1IgnoreSSL:false TransIPAccountName: TransIPPrivateKeyFile:}"
time="2019-09-20T10:37:20Z" level=info msg="Created Kubernetes client https://10.1.48.1:443"
time="2019-09-20T10:38:20Z" level=fatal msg="failed to sync cache: timed out waiting for the condition"
Thoughts?
I'm facing the same problem with external-dns and digitalocean.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: external-dns
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      containers:
      - name: external-dns
        image: bitnami/external-dns:latest
        args:
        - --source=service # ingress is also possible
        - --domain-filter=k8sdo.ml
        - --provider=digitalocean
        - --policy=upsert-only
        env:
        - name: DO_TOKEN
          valueFrom:
            secretKeyRef:
              name: digitalocean
              key: token
Same problem for me: https://github.com/kubernetes-incubator/external-dns/issues/1201
Running via the command line works fine. I have no idea what's going on, and debug logging doesn't show any more details.
Update: Got it running now; please check the URL above for the fix.
Deploying external dns with helm worked.
It uses this image: docker.io/bitnami/external-dns:0.5.17-debian-9-r0 and these args.
- args:
  - --log-level=info
  - --domain-filter=k8sdo.ml
  - --policy=sync
  - --provider=digitalocean
  - --registry=txt
  - --interval=1m
  - --source=service
  - --source=ingress
Maybe helm takes care of the permissions setup?
Hi there :wave:
Just faced the same problem.
Using the Helm chart with rbac: create: false (default) and Cloudflare yielded the same error: time="2019-09-25T17:23:12Z" level=fatal msg="failed to sync cache: timed out waiting for the condition"
I fixed it by enabling RBAC in the Helm chart, after which everything worked just fine.
Does anyone have any clue why this is happening? Should we enable RBAC by default in the Helm chart?
Is there really no path to debugging this? We're not currently using Tiller or RBAC in our cluster, since I'm just trying to test it out at first. The whole "enable RBAC and hope for the best" seems pretty sketchy to me, since it sidesteps the issue rather than identifying the actual cause of the problem.
It's an issue with missing RBAC permissions.
The code sets up an informer and waits one minute for it to sync. Without the required permissions the sync never completes, so the wait times out and the code prints this confusing error message.
Is there a way in the Informers framework to tell if we are lacking permissions of some sort?
Do you happen to know what the missing permission is?
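For reference, a minimal sketch of the read access the service and ingress sources need in order to list and watch their resources, modelled on the RBAC manifest in the external-dns tutorials. The name external-dns is a placeholder, and which ingress API group applies (extensions or networking.k8s.io) depends on the cluster version, so treat these rules as an assumption to check against your own setup:

```yaml
# Sketch only: cluster-wide read access for the external-dns sources.
# The ClusterRole name is a placeholder, not taken from this thread.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
# --source=service reads services, plus endpoints and pods for some service types
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "watch", "list"]
# --source=ingress reads ingresses; the API group depends on the cluster version
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "watch", "list"]
# nodes are included in the tutorial manifest (assumed here to be for NodePort services)
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list", "watch"]
```

A role like this has no effect until it is bound to the service account the pod actually runs as.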
OK I finally solved my case: I'm not using Tiller, but rather rendering Helm charts locally. The Helm chart I'm using for external-dns is this one: https://github.com/helm/charts/tree/master/stable/external-dns
When I rendered out the Helm chart, I was specifying a namespace with helm template --namespace external-dns ... which results in the rendered clusterrolebinding.yaml having a subject.namespace of external-dns.
Unfortunately nothing else in the chart seems to respect the namespacing; everything was being deployed to the default namespace. When I changed the value to default and reapplied my configuration, the pod ran successfully.
_Edit: I realise this might be a shortcoming of rendering locally rather than relying upon Tiller; YMMV_
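To illustrate the mismatch, here is a sketch of a binding whose subject matches where the pod really runs. It assumes a service account named default in the default namespace; the binding and role names are placeholders for whatever the chart rendered:

```yaml
# Sketch only: names are placeholders for whatever the chart rendered.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer   # placeholder binding name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns          # placeholder ClusterRole name
subjects:
- kind: ServiceAccount
  name: default               # assumed: the pod's serviceAccountName
  namespace: default          # must be the namespace the pod is deployed to
```

If subjects[].namespace points at a namespace the pod is not running in, the binding applies cleanly but grants nothing to the pod. A check that needs no redeploy is kubectl auth can-i list services --all-namespaces --as=system:serviceaccount:default:default, which should answer yes once the two agree.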
This sounds like we should fix the error message though. A timeout hints at connectivity issues, not permission errors.
We would need to fix the error messages here and in all the other resources.
@fejta-bot: Closing this issue.