@timothysc I think you were suggesting this during the sig-scale meeting today. Let me know if you meant something else.
@krzyzacy - Could you help with this? To clarify - we don't want to run this as a required test on PRs (due to resource constraints), but we do want it to run during the batch-testing stage right before merging.
This should help catch regressions without having to run it against each PR. Also it's quick (~10m for build + ~50m to run).
cc @kubernetes/sig-scalability-misc @wojtek-t @countspongebob @smarterclayton
/assign @rmmh
batch master Ryan, any thoughts?
Batches currently take ~46m to run. This would slow that down a fair amount.
Do you want the failure of this to block PRs? Or do you just want more data on its health?
We already have a CI test that runs continuously for health data. It'd be preferable to block PRs on this test, at least in the period before the release when we want to stabilize the code, especially since scale tests are part of the release-blocking suite. To clarify - we should only block the faulty PRs, not the SQ.
Comments anyone?
Also given the low influx of PRs now, the extra time shouldn't hurt us much.
(I'm not sure how to configure so that it will only block batch?)
Even if the batch fails, the one-at-a-time testing could succeed.
If you want to block PRs that cause performance regressions, you need something more precise.
I think that should be a rare occurrence. And even if it does happen, we have a situation something like:
In this case batch-merging A & B together shouldn't be allowed and that's the right behavior we want.
And if later we retry both of them but in different batches, the first one would go in and the second one would fail which is also the right behavior imo.
The only problem I see here is if we keep retrying the same batch over and over again. But it doesn't seem hard to prevent SQ from doing that.
If you don't have a blocking job that will catch regressions, the current system can't block them.
Blocking the batch merge of A+B just means that A and B will merge one at a time. You need a way to block the merge of B if it introduces a regression that's incompatible with A.
> Blocking the batch merge of A+B just means that A and B will merge one at a time. You need a way to block the merge of B if it introduces a regression that's incompatible with A.
Not sure I understand - could you clarify? FWIU:
Batches are an optimization that runs in parallel to normal testing of the head of the queue. Blocking (or stopping) batch testing does not prevent merges from happening.
I see. It seems then that we can include the job as part of the normal testing against the queue head. The main concern was to avoid running against all open PRs (as we can't afford the resources). Running against just the head should be fine, as we'd have at most one instance of the job running at any given moment.
Having the queue run more tests than standard presubmits for resource consumption reasons sounds fine, but it will take some changes to add that new concept.
Cool, thanks. Do you have a rough idea of how much time this would take? The earlier before the release, the better.
Roughly:
Should be a couple days' work, assuming the job can run with 1 or 2 parallelism.
That seems like it will halve the number of PRs we can merge, no?
Ideally this job will run in 45 minutes-- is there anything that can be cut out to make that happen?
This sounds like a step backward to me. We've worked pretty hard to get to a world where if a PR passes presubmits then it's okay to merge, and this neatly undoes that. I'm guessing @fejta has an opinion here.
More concretely, let's say a PR finds its way onto the queue, but the scalability pre-merge test fails. The author of the PR now has no way to rerun the test and decide whether it was a flake. They simply have to push a change to their PR, get it LGTM'd, and then hope that their change fixes the issue, or else they need to go through the whole ordeal again. That's not an acceptable workflow.
https://github.com/kubernetes/test-infra/issues/3740 sounds related.
> More concretely, let's say a PR finds its way onto the queue, but the scalability pre-merge test fails. The author of the PR now has no way to rerun the test and decide whether it was a flake. They simply have to push a change to their PR, get it LGTM'd, and then hope that their change fixes the issue, or else they need to go through the whole ordeal again. That's not an acceptable workflow.
@spxtr This is not the case. We already have a kubemark-500 presubmit, but it doesn't run automatically on all PRs. It can be triggered by commenting /test pull-kubernetes-kubemark-e2e-gce-big (which is exactly what someone whose pre-merge run failed would do on their PR after making fixes).
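For reference, a manually-triggered presubmit like this is declared in the Prow job config roughly as in the sketch below. This is a minimal sketch, not the actual job definition: the trigger regex and pod spec are illustrative, and the exact field set may differ across Prow versions.

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-kubemark-e2e-gce-big
    always_run: false    # not run automatically on every PR push
    optional: true       # does not block merge on its own
    trigger: "(?m)^/test pull-kubernetes-kubemark-e2e-gce-big"
    rerun_command: "/test pull-kubernetes-kubemark-e2e-gce-big"
    context: pull-kubernetes-kubemark-e2e-gce-big
    # spec: ... (pod spec for the kubemark-500 run, omitted here)
```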
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
This is temporarily possible with splice (we do it with pull-kubernetes-cross) but I'm not sure how we should handle this in tide. cc @cjwagner @kargakis @spxtr, thoughts?
I agree with @spxtr, I don't know if we should support this.
> This sounds like a step backward to me. We've worked pretty hard to get to a world where if a PR passes presubmits then it's okay to merge, and this neatly undoes that.
If individual PRs don't need to pass this job to be serially merged then it isn't a true merge requirement. This sounds like it should either be a blocking presubmit or a postsubmit to me.
@cjwagner We can't afford to run it automatically against every single PR that's opened, since this is a scalability job and hence resource-intensive (we cannot have multiple parallel runs of it due to resource limitations, and running it automatically for each opened PR would also be costly). That said, the job is already available as a presubmit - it's just not triggered automatically and needs to be invoked manually with a trigger command.
I addressed @spxtr 's concern in https://github.com/kubernetes/test-infra/issues/4445#issuecomment-339258137
On another note, with Tide we optimize by merging PRs that have already been tested against HEAD without a retest, IIRC. This would break that.
Also, with Tide we aren't planning to guarantee any particular batch size, just whatever works for throughput, so this job would not be triggered only on batches of ~5 or similar; even today we sometimes merge PRs serially when batches conflict.
@shyamjvs My concern is more that this introduces inconsistent merge requirements depending on how your PR is tested. From a PR author's perspective it would be confusing to have different merge requirements based on whether there are other PRs in the pool. This would effectively make the merge requirements for a PR dependent on the state of other PRs.
If this job is important enough to block PRs from merging, it shouldn't just block batches; it should block serial tests too, to prevent an individual PR that fails the job from merging. If we allow an individual PR that fails the job to merge, all batches after that will fail. In other words, it is dangerous to have a blocking presubmit that not all PRs need to pass, because a PR that doesn't run the job can break the job, which would prevent merges until the bad PR is reverted.
If this job isn't important enough to block merges, it could be a postsubmit job.
Another option is to make this only run after a successful pull-kubernetes-bazel-build so that it doesn't run on every commit. xref: https://github.com/kubernetes/test-infra/issues/5922#issuecomment-353684407
We'd like to make jobs only run after a successful build and make all e2es share this build again after that issue is fixed anyhow.
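If I recall the job config of that era correctly, this kind of chaining was expressed with a nested run_after_success list; the sketch below is from memory, and the structure is an assumption rather than a copy of the real config:

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-bazel-build
    always_run: true
    context: pull-kubernetes-bazel-build
    run_after_success:    # only triggered if the build job above passes
    - name: pull-kubernetes-kubemark-e2e-gce-big
      always_run: false
      context: pull-kubernetes-kubemark-e2e-gce-big
```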
/remove-lifecycle frozen
@BenTheElder @cjwagner I generally understand and agree with most of your concerns about how this can potentially affect batch merging. However, I'd like to understand your concern here a bit better. A few points:
Wrt 'inconsistent merge requirements' and 'confusing from PR author perspective': IMO it is not unreasonable to hide this detail from the PR author (that their PR is being scale-tested), as it is an extra check made under the covers that likely wouldn't affect most PRs (let's say 95% - just a rough estimate from my experience). For the few problematic ones, we'll catch them, and only then would the PR author need to go back and worry about the scalability impact of their PR.
Wrt 'merge requirements for a PR dependent on the state of other PRs': I definitely agree that we shouldn't only test batches with this job, as then we may not know which PR(s) in the batch created the problem. I'm fine with also testing the job against the other PRs in the batch; what isn't an option, however, is testing every single opened PR with the job. If the batch size is going to be ~6 or so, I'm fine with running the job against each of them (we can probably afford that).
The only concern I have is that this job should run against a PR only when it absolutely needs to - that is, after all other presubmits have passed, the PR has made it to the submit-queue, and it is in the batch currently being pre-merge tested. Given all that, unless the batch is too large, this shouldn't cause many problems.
> We'd like to make jobs only run after a successful build and make all e2es share this build again after that issue is fixed anyhow.
@BenTheElder Yes, that'd be more of a last resort, as it defeats the whole purpose of trying to 'prevent' regressions from coming in rather than 'curing' them after the fact - the latter costs the community much more engineering time.
> Wrt 'merge requirements for a PR dependent on the state of other PRs': I definitely agree that we shouldn't only test batches with this job, as then we may not know which PR(s) in the batch created the problem. I'm fine with also testing the job against the other PRs in the batch; what isn't an option, however, is testing every single opened PR with the job.
That's not quite what I was trying to convey. I'm mainly concerned that this pattern could stop all batches from merging.
Not all PRs will be tested as part of a batch. A PR is only tested in a batch if there happen to be other PRs to be tested, so if there are no other PRs, the PR may merge without batch testing. If we only run this job against batches that means that code that fails the job can still merge if tested serially, blocking all subsequent batches.
This job may be important enough to be worth the cost of hunting down a bad PR that might merge serially. I just want to be clear that this change could allow a serially-tested PR to merge and block batch merges. If that possibility is acceptable then we can make this job blocking for batches.
> Wrt 'inconsistent merge requirements' and 'confusing from PR author perspective': IMO it is not unreasonable to hide this detail from the PR author (that their PR is being scale-tested), as it is an extra check made under the covers that likely wouldn't affect most PRs (let's say 95% - just a rough estimate from my experience). For the few problematic ones, we'll catch them, and only then would the PR author need to go back and worry about the scalability impact of their PR.
How does the author of this PR know it hasn't passed and how to fix this? Right now all tests required for merge show up as statuses on the PR...
Since we aren't running this against all PRs now, I don't see how we can be so confident about the failure rate either.
> Wrt 'merge requirements for a PR dependent on the state of other PRs': I definitely agree that we shouldn't only test batches with this job, as then we may not know which PR(s) in the batch created the problem. I'm fine with also testing the job against the other PRs in the batch; what isn't an option, however, is testing every single opened PR with the job. If the batch size is going to be ~6 or so, I'm fine with running the job against each of them (we can probably afford that).
With Tide the batch size depends on the conditions of the pool - it will be whatever we manage to pack together and test. Even with the submit-queue today we often have to serialize PRs when two PRs in a batch conflict, because the queue doesn't know which one is at fault, so it falls back to serialized testing/merging.
With Tide (replacing submit-queue soon :tm:) PRs may be tested in arbitrary batch sizes or may be serialized so if the concern is "this is expensive to run per PR" we won't be able to guarantee that it only runs on large batches.
> We'd like to make jobs only run after a successful build and make all e2es share this build again after that issue is fixed anyhow.
> @BenTheElder Yes, that'd be more of a last resort, as it defeats the whole purpose of trying to 'prevent' regressions from coming in rather than 'curing' them after the fact - the latter costs the community much more engineering time.
Er, maybe we're misunderstanding each other. The idea is to have a common build step that must pass before other tests (including merge-required ones generally), which would let us limit how much any other test is wastefully run against PRs that don't even build. Expensive tests could be made "cheaper" in this way by making sure other tests have passed before we run them.
> This job may be important enough to be worth the cost of hunting down a bad PR that might merge serially. I just want to be clear that this change could allow a serially-tested PR to merge and block batch merges. If that possibility is acceptable then we can make this job blocking for batches.
This isn't very reasonable imo. All PRs should have the same merge requirements regardless of the queue/pool state. If we do this we're essentially guaranteeing that what will happen is:
@cjwagner @BenTheElder Your above concerns sound quite reasonable to me. So to begin with, I'm fine with adding the job in the regular way (i.e. run twice on PRs before merge). I checked that most of our blocking presubmit e2e jobs (node-e2e, kubemark-e2e, kops-aws-e2e) run with a max_concurrency of 12. IMO that should be fine for now if we add the kubemark-big job with that concurrency as well (it should consume at most <1k cores, at ~80 cores per run). The CI job for it finishes in ~1 hr (I can try to reduce that further if needed) and is very healthy. Besides, once this is in place, we can get rid of the other kubemark presubmit (pull-kubernetes-kubemark-e2e-gce), which will become redundant.
Does that SGTY?
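To make that concrete, the proposed change would roughly amount to flipping the job to always run and capping its parallelism, along the lines of this sketch (the numbers are the ones discussed above; all other fields are assumed to stay as in the existing job definition, which is omitted here):

```yaml
  - name: pull-kubernetes-kubemark-e2e-gce-big
    always_run: true       # run on every PR before merge, like the other e2e presubmits
    optional: true         # start non-blocking; revisit once runtime and concurrency are under control
    max_concurrency: 12    # same cap as the node-e2e / kubemark-e2e / kops-aws-e2e presubmits
    # ~80 cores per run, so under ~1k cores at this concurrency cap
```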
Yes, though I want to add that there's now in theory some process around making jobs blocking so we should start non-blocking and try to push it down under an hour as much as possible. cc @spiffxp
FWIW I'm currently working on making presubmits faster (#6865), so I'll probably be relaxing the concurrency restrictions on those jobs a bit someday soon.
I think we can keep build under 10min, unit tests under 6-7min, and e2es / verify at around 30m, and soon we will have https://github.com/kubernetes/kubernetes/pull/59289 instead of pull-kubernetes-cross, so I'll personally be a little sad to add another hour-long job, but I understand the need to keep scalability healthy. :man_shrugging:
Thanks. I'll begin by making it non-optional, but non-blocking. Let's take further steps (reducing time, dealing with concurrency) as needed.
Btw, the rationale for why it is very important to run this job (backed by some past data) is explained here: https://github.com/kubernetes/community/blob/master/sig-scalability/processes/formal-scalability-processes.md#implementation--pre-submit-phase
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale