Test-infra: end DNS flakes

Created on 25 Sep 2018 · 14 Comments · Source: kubernetes/test-infra

Tests flake when connecting to external services. We should minimize how much we depend on these, but we should also deflake DNS. The first step was using FQDNs for in-cluster services: https://github.com/kubernetes/test-infra/pull/9547

TODO:

[x] investigate options with @MrHohn (and also read all of the k8s dns docs :-))
[x] use fully qualified domain names for in-cluster services in the build clusters https://github.com/kubernetes/test-infra/pull/9547
[x] upgrade prow to k8s 1.10.x so we can use dnsConfig
[x] try lowering ndots using dnsConfig https://github.com/kubernetes/test-infra/pull/9556
[ ] try rolling this out more broadly
profit? less flakes? ❄️
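
To make the FQDN and ndots items above concrete, here is a minimal sketch of how a stub resolver expands a name using its search list. It is a simplified model of the behaviour described in resolv.conf(5), not the resolver's actual code. The `test-pods` namespace and the function name `candidate_queries` are hypothetical; the search list mirrors the usual in-cluster defaults (real pods also inherit the node's search domains), and the ndots value of 1 is only an illustrative lower setting.

```python
def candidate_queries(name, search, ndots):
    """Return the names a stub resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as fully qualified: no search expansion.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    # With at least `ndots` dots the name is tried as-is first;
    # otherwise the search list is walked before the literal name.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Assumed search list for a pod in a hypothetical "test-pods" namespace.
SEARCH = ["test-pods.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    # ndots=5 is the Kubernetes default for ClusterFirst pods; 1 is illustrative.
    for ndots in (5, 1):
        print(ndots, candidate_queries("github.com", SEARCH, ndots))
```

In this model, an external name such as `github.com` goes through every search domain before the real name is tried when ndots is 5, so each lookup is several chances for a dropped query. With a lower ndots, or a trailing-dot FQDN, the real name is queried first.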

/area prow
/area jobs
/kind bug
/priority important-soon

/assign
/assign @MrHohn

area/deflake area/jobs area/prow kind/bug lifecycle/rotten priority/important-soon

Source

BenTheElder

All 14 comments

now that we are on 1.10 for prow we can start to leverage dnsConfig on some pods :-)

BenTheElder on 25 Sep 2018

🎉2 👍1

https://github.com/kubernetes/kubernetes/pull/68932 👀

BenTheElder on 25 Sep 2018

DNS changes in https://github.com/kubernetes/test-infra/pull/9556 look very promising so far -- no discernible network flakes. Will roll out to critical k/k jobs tomorrow.

We may need a better mechanism to set this on all jobs, perhaps extending presets ...? @cjwagner
@krzyzacy any ideas? Basically we will want to set dnsConfig on ~all agent: kubernetes job specs.

Kubernetes has no mechanism for this currently ... cluster level DNS config is only about name servers, which we do not need or want to change. We _could_ just add this to all ~800 jobs but ideally we'd have a preset or some other defaulting mechanism for this.
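
As a rough illustration of the defaulting idea, the sketch below fills in a `dnsConfig` on every `agent: kubernetes` job that does not already set one. It is not prow's implementation (prow is written in Go and its preset code is not shown here). The job dictionaries, the function name `apply_dns_default`, and the ndots value are all assumptions for illustration; only the `dnsConfig.options` name/value shape follows the Kubernetes pod spec.

```python
import copy

# Illustrative default; the ndots value is an assumption, not taken from the PR.
DEFAULT_DNS_CONFIG = {"options": [{"name": "ndots", "value": "1"}]}


def apply_dns_default(job, dns_config=DEFAULT_DNS_CONFIG):
    """Return a copy of `job` with dnsConfig set on its pod spec if missing."""
    job = copy.deepcopy(job)
    # Only pod-based jobs carry a spec that can be defaulted.
    if job.get("agent") != "kubernetes" or "spec" not in job:
        return job
    # Respect an explicit per-job override.
    job["spec"].setdefault("dnsConfig", copy.deepcopy(dns_config))
    return job


if __name__ == "__main__":
    # Hypothetical job entry, shaped like a parsed job config.
    job = {"name": "example-job", "agent": "kubernetes", "spec": {"containers": []}}
    print(apply_dns_default(job)["spec"]["dnsConfig"])
```

Applying the default at config load time would keep the ~800 job definitions unchanged while still letting an individual job override the value.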

BenTheElder on 25 Sep 2018

add to defaults, like https://github.com/kubernetes/test-infra/blob/master/prow/config/config.go#L1134-L1169?

krzyzacy on 25 Sep 2018

@krzyzacy it's probably not preferable to hard code this for _all_ prow users.

Side note: there has been _one_ flake since.

BenTheElder on 25 Sep 2018

or once we move to podutils, we can put this in our default decoration config

krzyzacy on 25 Sep 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 24 Dec 2018

/remove-lifecycle stale
/area deflake
@BenTheElder what remains to be done here?

spiffxp on 5 Jan 2019

It's unclear whether our current attempts made a significant difference. We probably can't do much more ourselves beyond the dnsConfig, which unfortunately also makes the jobs more verbose :(

BenTheElder on 7 Jan 2019

NodeLocal DNS Cache might help with this as well :)
cc @prameshj

MrHohn on 13 Mar 2019

👀1

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 14 Jun 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 14 Jul 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 13 Aug 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot on 13 Aug 2019
