Test-infra: Scalability presubmits: Properly cope with termination

Created on 31 Jul 2019 · 15 comments · Source: kubernetes/test-infra

What happened:

According to @BenTheElder, the sig-scalability presubmits cannot cope with getting terminated. This forces us to keep an option in Prow that prevents presubmit jobs that are superseded by a newer push to the pull request from getting terminated.

This is an option we do not want to keep; jobs should be able to cope properly with their termination. @stevekuznetsov suggested using an appropriate terminationGracePeriodSeconds setting on the job instead.
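
For illustration only (the real job definition lives in YAML config, and the exact value would need tuning), this is roughly what that knob looks like expressed with the upstream PodSpec type:

```go
// A minimal sketch of the suggested setting, not an actual Prow job config.
// The image, command, and 30-minute value are assumptions for illustration.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	grace := int64(30 * 60) // leave time for the test to tear the cluster down
	spec := corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Containers: []corev1.Container{{
			Name: "test",
			// Placeholder image/command standing in for the real job spec.
			Image:   "gcr.io/k8s-testimages/kubekins-e2e",
			Command: []string{"runner.sh"},
		}},
	}
	fmt.Printf("grace period: %ds\n", *spec.TerminationGracePeriodSeconds)
}
```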

Related discussion in Slack: https://kubernetes.slack.com/archives/C7J9RP96G/p1563978338040400

/assign @wojtek-t

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Please provide links to example occurrences, if any:

Anything else we need to know?:

kind/bug sig/scalability

All 15 comments

What is the exact problem with "not being able to cope with termination"?
How does it manifest?

I think I don't fully understand the problem, which makes it hard to answer how to solve it :)

@kubernetes/sig-scalability-bugs @shyamjvs @mm4tt

How does it manifest?

According to Ben, the scalability jobs need to do cleanup and don't have a proper grace period, etc.

If that is not accurate and we can send a DELETE to the pods that contain sig-scalability presubmits without causing problems, this issue can be closed :)

Our presubmits use common infra (kubekins image / kubetest binary) to set up clusters and run tests. They shouldn't be any different from any other e2e tests.
If other e2e tests can handle a pod DELETE properly (e.g. by terminating the test and calling kube-down), then our presubmits should be doing it as well.

If that's not the case, then could you please explain what is so special about our jobs that they "cannot cope with getting terminated"? It's not obvious to me.

Hi, there's quite a lot of context missing with those partial quotes...

NONE of the presubmits on prow.k8s.io do anything special to deal with termination.
However, most of them use boskos projects (and therefore retain resources in borrowed projects with a heartbeat) or don't have external resources that need cleanup.

As I remember, the last time we started terminating jobs early when a new commit was pushed to the PR, we found this disruptive to scale presubmits that use a shared project: the cluster isn't torn down, and it takes much longer for the janitor to clean it up than if the job had exited normally.

Some of the testing still works this way as far as I know.

I just saw a thread in Slack the other day about terminating early and remembered the problems we had with leaked scale clusters last time: https://github.com/kubernetes/test-infra/issues/7673

One possible solution is to ensure the test script / tool handles SIGTERM and starts cleaning up, and set a grace period on the job long enough for teardown.
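
Roughly what I mean, as a sketch only (this is not the actual kubetest code; the script paths and timeout are made up):

```go
// A test wrapper that traps SIGTERM, stops the test, and always runs cluster
// teardown before exiting, so the pod's grace period can cover kube-down.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func run(ctx context.Context, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Cancel the test context as soon as the pod is asked to terminate.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	testErr := run(ctx, "./hack/run-scale-test.sh") // hypothetical test entrypoint

	// Always tear the cluster down, whether the test finished or was interrupted.
	// Give teardown its own deadline that fits inside terminationGracePeriodSeconds.
	downCtx, cancel := context.WithTimeout(context.Background(), 25*time.Minute)
	defer cancel()
	if err := run(downCtx, "./cluster/kube-down.sh"); err != nil {
		log.Printf("teardown failed: %v", err)
	}
	if testErr != nil {
		log.Fatalf("test failed or was terminated: %v", testErr)
	}
}
```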

Another is to use boskos for everything.

To be clear: nearly all Kubernetes presubmits depended on this same functionality for some time; scale testing specifically has done nothing different there. However, @krzyzacy and I were able to move most of the presubmits to using projects from the boskos pool in the past. Scale testing requires more resources.

Anyone could PR this repo to do the first option... I'm spread a bit thin at the moment. Ideally Prow would not break users, especially including Kubernetes presubmits.

Every job should _always_ expect to be pre-empted without any notice and handle SIGTERM appropriately.

As a Prow admin that wants to provide a good experience to job executors, you want to have Pods for aborted ProwJobs deleted, anyway -- Prow has determined that those jobs will never see the light of day and it is not useful to continue having them use up quota or shared resources. Deleting those Pods when the job gets aborted improves your useful throughput.
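
For context, and not Prow's real implementation, this is roughly what deleting the pod of an aborted ProwJob amounts to with client-go; the namespace, pod name, and grace override are placeholders:

```go
// Hedged sketch: delete a job pod, letting the kubelet send SIGTERM and wait
// for the grace period before SIGKILL -- the window a scale job has for kube-down.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	grace := int64(1800) // assumed override, not an actual prow.k8s.io setting
	err = client.CoreV1().Pods("test-pods").Delete(context.Background(),
		"aborted-prowjob-pod", metav1.DeleteOptions{GracePeriodSeconds: &grace})
	if err != nil {
		log.Fatal(err)
	}
}
```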

Every job should always expect to be pre-empted without any notice and handle SIGTERM appropriately.

This is true, but we can't really expect authors/users to have already handled this; it has not been documented.

In practice this does not really occur much at the moment, and those jobs specifically have previously been modified with mitigations against being terminated early (large resource requests, and possibly their own build cluster now?)

As a Prow admin that wants to provide a good experience to job executors, you want to have Pods for aborted ProwJobs deleted, anyway -- Prow has determined that those jobs will never see the light of day and it is not useful to continue having them use up quota or shared resources. Deleting those Pods when the job gets aborted improves your useful throughput.

That makes sense, and I agree... Scale testing still needs fixing and runs on every Kubernetes PR; we should have empathy for our users... cc @cjwagner @fejta @spiffxp

@alvaroaleman If we can gracefully handle cleanup for scalability jobs, that'll be great. IIRC back then there were a couple of concerns: a new commit push abruptly kills the previous pod, and we also cannot start the new pod until the previous one has cleaned up, because otherwise we can have capacity issues.

Can you elaborate a bit on how this new mechanism you're proposing will work?

Can you elaborate a bit on how this new mechanism you're proposing will work?

One of:

  • Use Boskos and make it take care of the account/project lifecycle management (a rough sketch of that flow follows this list)
  • Make sure the tests start cleaning up once they get a SIGTERM, and set a high enough terminationGracePeriodSeconds for the cleanup to complete before the pod gets killed
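
Very roughly, the Boskos option looks like this. This is a sketch against the plain HTTP endpoints as I understand them (not the Go client), and the resource type and job name are made up:

```go
// Hedged sketch of the Boskos flow: acquire a project, heartbeat while the
// test runs, release it dirty so the janitor scrubs it even on early exit.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"time"
)

const boskosURL = "http://boskos" // in-cluster service name assumed

func post(path string, params url.Values) (*http.Response, error) {
	return http.Post(boskosURL+path+"?"+params.Encode(), "application/json", nil)
}

func main() {
	owner := "pull-kubernetes-e2e-gce-scale" // hypothetical job name

	// Borrow a project from the pool; Boskos and its janitor now own its lifecycle,
	// so a killed pod only delays cleanup instead of leaking the cluster forever.
	resp, err := post("/acquire", url.Values{
		"type":  {"scalability-project"}, // hypothetical resource type
		"state": {"free"},
		"dest":  {"busy"},
		"owner": {owner},
	})
	if err != nil {
		log.Fatal(err)
	}
	var res struct{ Name string }
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// Heartbeat so Boskos does not reap the project while the test is running.
	go func() {
		for range time.Tick(5 * time.Minute) {
			post("/update", url.Values{"name": {res.Name}, "state": {"busy"}, "owner": {owner}})
		}
	}()

	fmt.Println("running tests in project", res.Name)
	// ... run the scale test against res.Name ...

	// Hand the project back as dirty; the janitor cleans it before reuse.
	post("/release", url.Values{"name": {res.Name}, "dest": {"dirty"}, "owner": {owner}})
}
```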

Using Boskos is the way we would like to go.
@mm4tt actually started setting up the necessary projects, but he just disappeared for vacation for a month.

So the plan is to use boskos for all our presubmits, though it won't happen this month (but we have already started working in that direction).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

@mm4tt what's the Boskos migration status? It's done, isn't it?

Yes, we can close this one.
All scalability presubmits are behind boskos - https://github.com/kubernetes/perf-tests/issues/650
/assign
/close

@mm4tt: Closing this issue.

In response to this:

Yes, we can close this one.
All scalability presubmits are behind boskos - https://github.com/kubernetes/perf-tests/issues/650
/assign
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
