External-dns: ExternalDNS deleting and then creating records. Constantly. Azure.

Created on 31 Jan 2019 · 47 Comments · Source: kubernetes-sigs/external-dns

As you can see below, this is not ideal behaviour.

The logs from the pod just show constantly deleting/updating records. It doesn't have any information as to why it's doing it.

I've checked, my ingress addresses are not disappearing, at least not that I can see.

image

kind/bug needs-clarification

All 47 comments

We're seeing the same behaviour on GKE (Google).

What version of external-dns are you currently running?

Seems related to #879

v0.5.10 has the problem, we have reverted to v0.5.9 which does not.

Exactly the same here. v0.5.9 works fine, v0.5.10 does this constantly.

We are having the same issue, I posted an example in #543
We will try to revert to v0.5.9 for now.

I've had the same issue this morning. Thankfully you guys already reported this as I was aware of the loop but did not know the cause... I've also reverted to v0.5.9 (running AKS 1.11.3 in Azure by the way)

yep same issue here, we were saved by keeping a lock on our resource groups in azure for delete :)

We are facing the same issue starting with 0.5.10. 0.5.9 works fine.

Same issue, but only on 0.5.10, reverting to 0.5.9 works perfectly fine:

The following loop is happening every minute.
Logs from external-dns (debug level):

level=debug msg="Retrieving Azure DNS zones."
level=debug msg="Found 1 Azure DNS zone(s)."
level=debug msg="Retrieving Azure DNS records for zone 'fulldomain.com'."
level=debug msg="Found A record for 'test-app.fulldomain.com' with target 'XX.XX.XX.XX'."
level=debug msg="Found TXT record for 'test-app.fulldomain.com' with target '\"heritage=external-dns,external-dns/owner=prod,external-dns/resource=ingress/test-app/test-app\"'."
level=debug msg="Endpoints generated from ingress: test-app/test-app: [test-app.fulldomain.com 300 IN A XX.XX.XX.XX [] test-app.fulldomain.com 300 IN A XX.XX.XX.XX []]"
level=debug msg="Removing duplicate endpoint test-app.fulldomain.com 300 IN A XX.XX.XX.XX []"
level=debug msg="Retrieving Azure DNS zones."
level=debug msg="Found 1 Azure DNS zone(s)."
level=info msg="Deleting A record named 'test-app' for Azure DNS zone 'fulldomain.com'."
level=info msg="Deleting TXT record named 'test-app' for Azure DNS zone 'fulldomain.com'."
level=info msg="Updating A record named 'test-app' to 'XX.XX.XX.XX' for Azure DNS zone 'fulldomain.com'."
level=info msg="Updating TXT record named 'test-app' to '\"heritage=external-dns,external-dns/owner=prod,external-dns/resource=ingress/test-app/test-app\"' for Azure DNS zone 'fulldomain.com'."
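
For anyone trying to confirm they are hitting this loop, here is a minimal sketch (not part of external-dns) that scans log lines for records that get deleted and then written again. It assumes the Azure provider message format shown in the log above; `find_churn` and `PATTERN` are names made up for this example, and other providers word their messages differently, so the pattern would need adjusting.

```python
import re
from collections import Counter

# Assumption: matches only the Azure provider messages quoted in this thread.
PATTERN = re.compile(
    r"""msg="(?P<action>Deleting|Updating) (?P<rtype>\w+) record named '(?P<name>[^']+)'"""
    r""".* for Azure DNS zone '(?P<zone>[^']+)'"""
)


def find_churn(lines):
    """Count how often each record is deleted and then updated again."""
    pending = set()    # records deleted but not yet written back
    churn = Counter()  # (type, name, zone) -> number of delete/update cycles
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue
        key = (match["rtype"], match["name"], match["zone"])
        if match["action"] == "Deleting":
            pending.add(key)
        elif key in pending:
            # An update right after a delete of the same record is one cycle.
            pending.discard(key)
            churn[key] += 1
    return churn
```

Feeding it the output of `kubectl logs` for the external-dns pod should be enough: a count that keeps growing with every sync interval for an unchanged ingress suggests the loop described here.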

Thanks for all the other reports. I tried to downgrade to 0.5.9 and in Azure I'm now getting an API version error.

I then tried 0.5.8, same problem. Went back to 0.5.10, same problem.

I'm really confused now because up until 10 minutes ago, my External DNS was running the :latest tag and was constantly recycling DNS records.

I deleted that deployment (kubectl delete -f external-dns-manifest.yaml), and then created it. And now for some reason I'm getting API errors.

Wondering if somehow Azure is rate limiting these requests which just coincided with me trying to downgrade?

level=error msg="dns.ZonesClient#ListByResourceGroup: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code=\"InvalidApiVersionParameter\" Message=\"The api-version '2016-04-01' is invalid. The supported versions are '2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'.\""

@PirateBread

Could you try this build for Azure to see if it addresses your issue?

registry.opensource.zalan.do/teapot/external-dns:v0.5.10-16-gfe39b46

@jhohertz

Just deployed v0.5.10-16-gfe39b46 and I'm still seeing the following:

time="2019-02-08T16:05:52Z" level=info msg="Created Kubernetes client https://xxxxxxx-2b0c5b7a.hcp.uksouth.azmk8s.io:443"
time="2019-02-08T16:05:52Z" level=info msg="Using client_id+client_secret to retrieve access token for Azure API."
time="2019-02-08T16:05:52Z" level=error msg="dns.ZonesClient#ListByResourceGroup: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code=\"InvalidApiVersionParameter\" Message=\"The api-version '2016-04-01' is invalid. The supported versions are '2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'.\""

If I get a chance this weekend I'm going to try and reproduce this in a completely fresh environment in my own subscription to rule out some kind of configuration issue but at this point I can't see what would be wrong?

I can confirm that v0.5.10-16-gfe39b46 solves the eternal delete/update loop of doom on GKE.

Thanks for the feedback, we will work on an official release which will probably land tomorrow.

I have a similar problem, but on AWS with version 0.5.11.
ExternalDNS is constantly updating the same record every two minutes (--interval=2m).

time="2019-02-19T14:21:45Z" level=error msg="getting records failed: Throttling: Rate exceeded\n\tstatus code: 400, request id: af6f41c7-3451-11e9-bb90-1939f5de72e5"
time="2019-02-19T14:21:52Z" level=error msg="getting records failed: Throttling: Rate exceeded\n\tstatus code: 400, request id: b3bb1bbc-3451-11e9-92a8-118f2457694e"
time="2019-02-19T14:22:10Z" level=info msg="Desired change: UPSERT *.mydomain.com A"
time="2019-02-19T14:22:10Z" level=info msg="Desired change: UPSERT *.mydomain.com TXT"
time="2019-02-19T14:22:10Z" level=info msg="2 record(s) in zone incapsula-qa.de. were successfully updated"
time="2019-02-19T14:24:06Z" level=info msg="Desired change: UPSERT *.mydomain.com A"
time="2019-02-19T14:24:06Z" level=info msg="Desired change: UPSERT *.mydomain.com TXT"
time="2019-02-19T14:24:06Z" level=info msg="2 record(s) in zone incapsula-qa.de. were successfully updated"
time="2019-02-19T14:26:25Z" level=error msg="getting records failed: Throttling: Rate exceeded\n\tstatus code: 400, request id: 5676a7c3-3452-11e9-b59c-ddd6f4af4826"
time="2019-02-19T14:26:25Z" level=info msg="Desired change: UPSERT *.mydomain.com A"
time="2019-02-19T14:26:25Z" level=info msg="Desired change: UPSERT *.mydomain.com TXT"
time="2019-02-19T14:26:25Z" level=info msg="2 record(s) in zone incapsula-qa.de. were successfully updated"

My arguments:

      --log-level=info
      --policy=upsert-only
      --provider=aws
      --registry=txt
      --interval=2m
      --source=service

Also same behavior on 0.5.9.

I have the same issue as @omegarus.

I'm not seeing the needless updates on AWS as others are experiencing, but one difference may be that I don't have any cases of trying to publish wildcard DNS records, so I am wondering if the issue is somewhat specific to the wildcard?

@jhohertz The DNS records I'm trying to publish don't contain wildcards, they are configured for different ingresses that contain different service host names (for ex. service.internal.domain, app.internal.domain), and I'm still experiencing this issue (I've tried to downgrade as far as v0.5.7 and it still happens).

I'm sorry @FridaGo I'm not sure what you're experiencing. This issue and the ones I have recently posted about are all relating to a problem that was introduced in v0.5.10.

All I can suggest is try watching the status field of the services you are attaching the DNS records to, to see if something is causing updates you aren't expecting to that status, which external-dns might be picking up on. I've seen some ingress configurations cause things like that to occur.

Can we close this issue as v0.5.11 was released?

@jhohertz Status field is constant and not changing.

status:
  loadBalancer:
    ingress:
    - hostname: x8076o593986511e9b2dc86r8d247u18-9901230772.us-west-1.elb.amazonaws.com

dnslog.txt
I'm seeing this same behavior with Infoblox after upgrading from 0.5.9 to 0.5.11. I'm going to try to downgrade to 0.5.9 to see if it resolves it. There was so much churn with the recycling bin that it blew up the Infoblox DB.
Sample logs attached.

Have the same issue on v0.5.11 on GKE

For me, on AWS, running both v0.5.9 and v0.5.11, I haven't seen such a problem. Maybe it has something to do with what @jhohertz mentioned?

Found a solution to the problem.
If another ExternalDNS instance uses the same owner value in its TXT records, the first instance will delete the records of the second, and vice versa.
You should set a different "txtOwnerId" value for each ExternalDNS deployment.
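
To check which owner a record is currently registered to, the TXT value can be split into its labels. This is only a rough sketch based on the TXT value format visible in the debug logs earlier in this thread (`heritage=external-dns,external-dns/owner=...`); the helper names `parse_registry_txt` and `owner_of` are hypothetical and not part of external-dns.

```python
def parse_registry_txt(value):
    """Split a registry TXT value into a dict of labels."""
    labels = {}
    # The value is usually wrapped in double quotes in DNS output.
    for part in value.strip('"').split(","):
        key, sep, val = part.partition("=")
        if sep:
            labels[key] = val
    return labels


def owner_of(value):
    """Return the owner ID, or None if the TXT record is not an ExternalDNS one."""
    labels = parse_registry_txt(value)
    if labels.get("heritage") != "external-dns":
        return None
    return labels.get("external-dns/owner")
```

If two deployments managing the same zone report the same owner for their records, that matches the situation described above, and giving each deployment its own owner ID is worth trying.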

@medanasslim great, thanks for posting an update.

Ping to @PirateBread and @aslimacc , do you have additional info to share and/or are you still experiencing this issue?

Works for me

Experiencing the same issue with Cloudflare and both registry.opensource.zalan.do/teapot/external-dns:v0.5.9 and registry.opensource.zalan.do/teapot/external-dns:v0.5.12.

...
    spec:
      containers:
      - args:
        - --source=ingress
        - --domain-filter=my-domain.com
        - --provider=cloudflare
        - --cloudflare-proxied
        env:
        - name: CF_API_KEY
          value: 
        - name: CF_API_EMAIL
          value: 
        image: registry.opensource.zalan.do/teapot/external-dns:v0.5.9
        imagePullPolicy: Always
...

I am on Cloudflare and as I said above, you should add "txt-owner-id"

Example below:

      - args:
        - --log-level=info
        - --registry=txt
        - --interval=1m
        - --txt-owner-id=instance1

Thank you for the advice but this doesn't fix the issue.
This is useful if you have multiple clusters using the same DNS zone.

Can you please share your logs so we can see the behavior of the app?

Sure, you can see logs here https://github.com/kubernetes-incubator/external-dns/issues/992

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

I'm also seeing this with the infoblox provider running v0.5.15. Removing my TTL annotations as per a previous comment resolved this issue.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

Hi, sorry to open up this ticket again, but I've faced the same issue. Once I removed all sources other than istio-gateway, the problem ~disappeared~.

Edit: actually it didn't. I'm investigating it further.

Seeing this as well with Istio gateways and TransIP provider. We do have two instances of external-DNS for the same zone but with different txt-owner-id so that shouldn't be a problem.

@fejta-bot: Closing this issue.

/remove-lifecycle rotten

/reopen

@Xnyle: You can't reopen an issue/PR unless you authored it or you are a collaborator.

txt-owner-id

works for me
