What happened: After update external-dns image from 0.7.4-debian-10-r67 to 0.7.5-debian-10-r0, external-dns try to delete a record and ns record of domain-filter DNS name. After we change image from 0.7.5 to 0.7.4, it's ok.
How to reproduce it (as minimally and precisely as possible): make hostedzone with filtered-domain name and use external-dns to manage that hosted zone
Anything else we need to know?:
Environment:
external-dns --version): 0.7.5logs:
time="2020-12-16T06:32:45Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s ContourLoadBalancerService:heptio-contour/contour SkipperRouteGroupVersion:zalando.org/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:google GoogleProject:develop-project GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s DomainFilter:[a.bc.def.com] ExcludeDomains:[] ZoneNameFilter:[] ZoneIDFilter:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: CloudflareProxied:false CloudflareZonesPerPage:50 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: AkamaiClientToken: AkamaiClientSecret: AkamaiAccessToken: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSTLSEnabled:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: Interval:1m0s Once:false DryRun:false UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint:https://api.exoscale.ch/dns ExoscaleAPIKey: ExoscaleAPISecret: CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC2136MinTTL:0s NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50}"
time="2020-12-16T06:32:45Z" level=info msg="Instantiating new Kubernetes client"
time="2020-12-16T06:32:45Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2020-12-16T06:32:45Z" level=info msg="Created Kubernetes client https://10.110.0.1:443"
time="2020-12-16T06:32:56Z" level=info msg="Change zone: dev-hostedzone batch #0"
time="2020-12-16T06:32:56Z" level=info msg="Del records: a.bc.def.com. NS [ns-cloud-d100.googledomains.com ns-cloud-d12.googledomains.com ns-cloud-d43.googledomains.com ns-cloud-d22.googledomains.com] 21600"
time="2020-12-16T06:32:56Z" level=info msg="Del records: a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/dev\"] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: a.bc.def.com. A [11.111.11.111] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/dev\"] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p66.a.bc.def.com. A [11.111.11.111] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p66.a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/p66\"] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p67.a.bc.def.com. A [11.111.11.111] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p67.a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/p67\"] 300"
time="2020-12-16T06:32:57Z" level=error msg="googleapi: Error 400: Invalid value for 'entity.change.deletions[0].rrdata[0]': 'ns-cloud-d100.googledomains.com'\nMore details:\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[0]': 'ns-cloud-d100.googledomains.com'\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[1]': 'ns-cloud-d12.googledomains.com'\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[2]': 'ns-cloud-d43.googledomains.com'\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[3]': 'ns-cloud-d22.googledomains.com'\n"
Thanks for the report, we'll look into this issue.
@Riley-sohn can you also try the official image mentioned in the release (k8s.gcr.io/external-dns/external-dns:v0.7.5) instead of the bitnami one? I don't know what kind of release process bitnami has and thus not sure of what version they were actually running, given that 0.7.5 is officially available as of today.
Can you also provide more info on when you see it and other possible information on how to fully reproduce the bug?
We are also seeing this bug on AWS - using k8s.gcr.io/external-dns/external-dns:v0.7.5 after upgrading from k8s.gcr.io/external-dns/external-dns:v0.7.4.
904d8e4 (#1813) seems to have included support for NS records being returned from the registry. In our case, we are managing a hosted zone for foo.bar.com. We also have an A record for the same domain name, pointing to an AWS Load Balancer.
It looks like External DNS is seeing the NS record being returned from the registry, not finding any resources which map to it (since NS records are usually managed outside of external DNS), and tries to delete it since it thinks it's 'managed'. There doesn't appear to be enough metadata for External DNS to distinguish whether it's supposed to manage the NS record (which is a new feature) or if it is managed externally. (as would be the case for anyone previously using 0.7.4)
@backjo I will take a deeper look at the problem tomorrow. We can think of feature-flagging it for now, but I'd be happy to hear suggestions before I analyze the problem.
@Raffo haven't thought about it too deeply, but the first thought that comes to mind is that for this use case, there should probably be metadata in the registry that tracks which specific records for a domain name are being managed by external-dns and which records are not.
Thank you for the context @backjo , that seems to be quite bad. I will try to reproduce it and fix it as soon as possible which means we'll likely rush 0.7.6 (till we have a better versioning convention, coming next year).
@backjo are you also using google as provider? I don't have a GKE cluster to use to reproduce the bug, but I tried deploying to AWS and I was not able to reproduce the error. Can you please provide the full manifest that you use to deploy ExternalDNS and the full example of what you are doing?
@Raffo unlike the original poster, we are using AWS / EKS , not GKE - it appears to be provider agnostic.
Try the following setup:
@Raffo @backjo maybe we should provide a way to skip a specific DNS record type from plan? E.g. make plan condition list configurable, somewhere around https://github.com/kubernetes-sigs/external-dns/pull/1813/files#diff-0967da8a495108f825298b0bc67c95fe9d301a1bd86baadc65a7cb3623686cc3R245
WDYT about storing information in the registry around which specific record types are being managed for a given domain - I could see an eventual case where I might want to manage some domain NS records via external DNS, but others only maybe the A record.
Ok, we took some time with @njuettner to try to reproduce the bug and we managed to reproduce it following the steps that @backjo provided. The issue is clearly created by https://github.com/kubernetes-sigs/external-dns/pull/1813 and we discussed at length how we want to fix it. ExternalDNS was not designed to handle custom records as we are doing with the DNSEndpoint resource and it never had the goal to be the one solution to manage everything DNS related in the cloud by using Kubernetes for everything. If the community is trying to achieve this with ExternalDNS, we'll have to talk more about this, as it could be a significant shift for the project.
That said, we will take another week to consider the fix for this, with the final decision to be taken on Jan the 6th.
We're now considering the following options:
0.7.6 without it. This will allow people to safely use the changes that we shipped in 0.7.5 without this bug.0.7.6 with this change. The side effects could be still surprising for users that decide to opt in for "managing NS records"type of record to the TXT record. We'll have to assess the impact of that to the rest of the functionality that we provide.Of course, a revert is to ensure our priority which is stability, not to say that features should not be added. If we ended up deciding for such a solution, we will start a discussion with the community to understand the need to manage custom DNS records with ExternalDNS.
@backjo: I have one more question for you: given that a NS record is delegating to another zone, can you please tell me what is the use case to mix A/CNAME and TXT records with the same name?
@Raffo In this case, these NS records are in the delegated zone, not the root zone - i.e. this zone is being delegated to from another, higher level hosted zone.
foo.com Zone
NS bar.foo.com ns1.org
bar.foo.com Zone
NS bar.foo.com
A bar.foo.com 192.168.1.1
Thank you for the update @backjo , very appreciated. As I said, I will give an update tomorrow with a plan to move forward.
We are surprised by this bug happening in all of our clusters in AWS Route53 as well!
This is pretty scary that it tries to delete the NS records!
time="2021-01-11T09:16:25Z" level=info msg="Desired change: DELETE zone.reducted.org NS [Id: /hostedzone/xxxxxxxxxx]"
time="2021-01-11T09:16:25Z" level=info msg="Desired change: CREATE zone.reducted.org A [Id: /hostedzone/xxxxxxxxxx]"
time="2021-01-11T09:16:25Z" level=error msg="Failure in zone zone.reducted.org. [Id: /hostedzone/xxxxxxxxxx]"
time="2021-01-11T09:16:25Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='zone.reducted.org.', type='A'] but it already exists, A Hosted Zone must contain at least one NS record for the zone itself.]\n\tstatus code: 400, request id: 7a331ff2-dee8-4f9f-a31e-xxxxx"
time="2021-01-11T09:16:25Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/xxxxxxxxxx]"
@universam1 I'm a bit late with a fix, but I'm going to be publishing an update today.
Thank you for quick response!
Guess what surprised us most that such a change came with a patch version only- we would expect that to be reflected by a feature or major version bump at least. Our review process is based on semantic versioning principle trust.
Thank you for considering that.
For sure we've been lacking a proper versioning strategy that could give better ideas of the change contained in a release. We did however take care of putting good information in the release notes and in fact the release for 0.7.5 contained information on this bug as soon as it was discovered. Again, that's a bug, not a breaking change designed on purpose to be one. I'm in the process of working around it.
This is what I came up with to work around this issue: https://github.com/kubernetes-sigs/external-dns/pull/1915 . We basically feature flag the NS records support, thus "protecting" the default functionality from ExternalDNS.
All right, the fix was merged, I will start the release process later today, we will have an official 0.7.6 image released by the end of the week.
I manually tested the change on AWS and I can confirm that the regression is fixed. We proceed with the release of v0.7.6.