@timothysc I think you were suggesting this during the sig-scale meeting today. Let me know if you meant something else.
@krzyzacy - Could you help with this? To clarify - we don't want to run this as a required test on PRs (due to resource constraints), but we do want it to run during the batch-testing stage right before merging.
This should help catch regressions without having to run it against each PR. Also it's quick (~10m for build + ~50m to run).
cc @kubernetes/sig-scalability-misc @wojtek-t @countspongebob @smarterclayton
/assign @rmmh
batch master Ryan, any thoughts?
Batches currently take ~46m to run. This would slow that down a fair amount.
Do you want the failure of this to block PRs? Or do you just want more data on its health?
We already have a CI test that runs continuously for health data. It'd be preferable to block PRs on this test, at least in the period before the release when we want to stabilize the code, especially since scale tests are part of the release-blocking suite. To clarify - we should only block the faulty PRs, not the SQ.
Comments anyone?
Also given the low influx of PRs now, the extra time shouldn't hurt us much.
(I'm not sure how to configure so that it will only block batch?)
Even if the batch fails, the one-at-a-time testing could succeed.
If you want to block PRs that cause performance regressions, you need something more precise.
I think that should be a rare occurrence. And even if it does happen, we have a situation something like:
In this case batch-merging A & B together shouldn't be allowed and that's the right behavior we want.
And if later we retry both of them but in different batches, the first one would go in and the second one would fail which is also the right behavior imo.
The only problem I see here is if we keep retrying the same batch over and over again. But it doesn't seem hard to prevent SQ from doing that.
If you don't have a blocking job that will catch regressions, the current system can't block them.
Blocking the batch merge of A+B just means that A and B will merge one at a time. You need a way to block the merge of B if it introduces a regression that's incompatible with A.
> Blocking the batch merge of A+B just means that A and B will merge one at a time. You need a way to block the merge of B if it introduces a regression that's incompatible with A.
Not sure I understand - could you clarify? FWIU:
Batches are an optimization that runs in parallel to normal testing of the head of the queue. Blocking (or stopping) batch testing does not prevent merges from happening.
I see. It seems then that we can include the job as part of the normal testing against the queue head. The main concern was to avoid running against all open PRs (as we can't afford the resources). Running against just the head should be fine, as we'd have at most one instance of the job running at any given moment.
Having the queue run more tests than standard presubmits for resource consumption reasons sounds fine, but it will take some changes to add that new concept.
Cool, thanks. Do you have a rough idea of how much time this would take? The earlier before the release, the better.
Roughly:
Should be a couple days' work, assuming the job can run with 1 or 2 parallelism.
That seems like it will halve the number of PRs we can merge, no?
Ideally this job will run in 45 minutes-- is there anything that can be cut out to make that happen?
This sounds like a step backward to me. We've worked pretty hard to get to a world where if a PR passes presubmits then it's okay to merge, and this neatly undoes that. I'm guessing @fejta has an opinion here.
More concretely, let's say a PR finds its way onto the queue, but the scalability pre-merge test fails. The author of the PR now has no way to rerun the test and decide whether it was a flake. They simply have to push a change to their PR, get it LGTM'd, and then hope that their change fixes the issue, or else they need to go through the whole ordeal again. That's not an acceptable workflow.
https://github.com/kubernetes/test-infra/issues/3740 sounds related.
> More concretely, let's say a PR finds its way onto the queue, but the scalability pre-merge test fails. The author of the PR now has no way to rerun the test and decide whether it was a flake. They simply have to push a change to their PR, get it LGTM'd, and then hope that their change fixes the issue, or else they need to go through the whole ordeal again. That's not an acceptable workflow.
@spxtr This is not the case. We already have a kubemark-500 presubmit, but it doesn't run automatically on all PRs. It can be triggered by commenting /test pull-kubernetes-kubemark-e2e-gce-big (which is exactly what someone whose pre-merge run failed would do on their PR after making fixes).
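For reference, a manually-triggered presubmit like this is declared in the Prow job config roughly as in the sketch below. This is a minimal sketch, not the actual job definition: the trigger regex and pod spec are illustrative, and the exact field set may differ across Prow versions.

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-kubemark-e2e-gce-big
    always_run: false    # not run automatically on every PR push
    optional: true       # does not block merge on its own
    trigger: "(?m)^/test pull-kubernetes-kubemark-e2e-gce-big"
    rerun_command: "/test pull-kubernetes-kubemark-e2e-gce-big"
    context: pull-kubernetes-kubemark-e2e-gce-big
    # spec: ... (pod spec for the kubemark-500 run, omitted here)
```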
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
This is temporarily possible with splice (we do it with pull-kubernetes-cross) but I'm not sure how we should handle this in tide. cc @cjwagner @kargakis @spxtr, thoughts?
I agree with @spxtr, I don't know if we should support this.
> This sounds like a step backward to me. We've worked pretty hard to get to a world where if a PR passes presubmits then it's okay to merge, and this neatly undoes that.
If individual PRs don't need to pass this job to be serially merged then it isn't a true merge requirement. This sounds like it should either be a blocking presubmit or a postsubmit to me.
@cjwagner We can't afford to run it automatically against every single PR that's opened, since this is a scalability job and hence resource-intensive (we cannot have multiple parallel runs of it due to resource limitations, and running it automatically for each opened PR would also be costly). That said, the job is already available as a presubmit - it's just not triggered automatically and needs to be invoked manually with a trigger command.
I addressed @spxtr 's concern in https://github.com/kubernetes/test-infra/issues/4445#issuecomment-339258137
On another note, with Tide we optimize by merging PRs that have already been tested against HEAD without a retest, IIRC. This would break that.
Also, with Tide we aren't planning to guarantee any particular batch size, just whatever works for throughput, so this job would not be triggered only on batches of ~5 or similar; even today we sometimes merge PRs serially when batches conflict.
@shyamjvs My concern is more that this introduces inconsistent merge requirements depending on how your PR is tested. From a PR author's perspective it would be confusing to have different merge requirements based on whether there are other PRs in the pool. This would effectively make the merge requirements for a PR dependent on the state of other PRs.
If this job is important enough to block PRs from merging, it shouldn't just block batches; it should block serial tests too, to prevent an individual PR that fails the job from merging. If we allow an individual PR that fails the job to merge, all batches after that will fail. In other words, it is dangerous to have a blocking presubmit that not all PRs need to pass, because a PR that doesn't run the job can break the job, which would prevent merges until the bad PR is reverted.
If this job isn't important enough to block merges, it could be a postsubmit job.
Another option is to make this only run after a successful pull-kubernetes-bazel-build so that it doesn't run on every commit. xref: https://github.com/kubernetes/test-infra/issues/5922#issuecomment-353684407
We'd like to make jobs only run after a successful build and make all e2es share this build again after that issue is fixed anyhow.
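If I recall the job config of that era correctly, this kind of chaining was expressed with a nested run_after_success list; the sketch below is from memory, and the structure is an assumption rather than a copy of the real config:

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-bazel-build
    always_run: true
    context: pull-kubernetes-bazel-build
    run_after_success:    # only triggered if the build job above passes
    - name: pull-kubernetes-kubemark-e2e-gce-big
      always_run: false
      context: pull-kubernetes-kubemark-e2e-gce-big
```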
/remove-lifecycle frozen
@BenTheElder @cjwagner I generally understand and agree with most of your concerns about how this can potentially affect batch merging. However, I'd like to understand your concern here a bit better. A few points:
Wrt 'inconsistent merge requirements' and 'confusing from PR author perspective': IMO it is not unreasonable to hide this detail from the PR author (that their PR is being scale-tested), as it is an extra check made under the covers that likely wouldn't affect most PRs (let's say 95% - just a rough estimate from my experience). For the few problematic ones, we'll catch them, and only then would the PR author need to go back and worry about the scalability impact of their PR.
Wrt 'merge requirements for a PR dependent on the state of other PRs': I definitely agree that we shouldn't only test batches with this job, as then we may not know which PR(s) in the batch created the problem. I'm fine with also testing the job against the other PRs in the batch; what isn't an option, however, is testing every single opened PR with the job. If the batch size is going to be ~6 or so, I'm fine with running the job against each of them (we can probably afford that).
The only concern I have is that this job should run against a PR only when it absolutely needs to - that is, after all other presubmits have passed, the PR has made it to the submit-queue, and it is in the batch currently being pre-merge tested. Given all that, unless the batch is too large, this shouldn't cause many problems.
> We'd like to make jobs only run after a successful build and make all e2es share this build again after that issue is fixed anyhow.
@BenTheElder Yes, that'd be more of a last resort, as it defeats the whole purpose of trying to 'prevent' regressions from coming in rather than 'curing' them after the fact - the latter costs the community much more engineering time.
> Wrt 'merge requirements for a PR dependent on the state of other PRs': I definitely agree that we shouldn't only test batches with this job, as then we may not know which PR(s) in the batch created the problem. I'm fine with also testing the job against the other PRs in the batch; what isn't an option, however, is testing every single opened PR with the job.
That's not quite what I was trying to convey. I'm mainly concerned that this pattern could stop all batches from merging.
Not all PRs will be tested as part of a batch. A PR is only tested in a batch if there happen to be other PRs to be tested, so if there are no other PRs, the PR may merge without batch testing. If we only run this job against batches that means that code that fails the job can still merge if tested serially, blocking all subsequent batches.
This job may be important enough to be worth the cost of hunting down a bad PR that might merge serially. I just want to be clear that this change could allow a serially-tested PR to merge and block batch merges. If that possibility is acceptable then we can make this job blocking for batches.
> Wrt 'inconsistent merge requirements' and 'confusing from PR author perspective': IMO it is not unreasonable to hide this detail from the PR author (that their PR is being scale-tested), as it is an extra check made under the covers that likely wouldn't affect most PRs (let's say 95% - just a rough estimate from my experience). For the few problematic ones, we'll catch them, and only then would the PR author need to go back and worry about the scalability impact of their PR.
How does the author of this PR know it hasn't passed and how to fix this? Right now all tests required for merge show up as statuses on the PR...
Since we aren't running this against all PRs now, I don't see how we can be so confident about the failure rate either.
> Wrt 'merge requirements for a PR dependent on the state of other PRs': I definitely agree that we shouldn't only test batches with this job, as then we may not know which PR(s) in the batch created the problem. I'm fine with also testing the job against the other PRs in the batch; what isn't an option, however, is testing every single opened PR with the job. If the batch size is going to be ~6 or so, I'm fine with running the job against each of them (we can probably afford that).
With Tide the batch size depends on the conditions of the pool - it will be whatever we manage to pack together and test. Even with the submit-queue today we often have to serialize PRs when two PRs in a batch conflict, because the queue doesn't know which one is at fault, so it falls back to serialized testing/merging.
With Tide (replacing submit-queue soon :tm:) PRs may be tested in arbitrary batch sizes or may be serialized so if the concern is "this is expensive to run per PR" we won't be able to guarantee that it only runs on large batches.
> We'd like to make jobs only run after a successful build and make all e2es share this build again after that issue is fixed anyhow.
> @BenTheElder Yes, that'd be more of a last resort, as it defeats the whole purpose of trying to 'prevent' regressions from coming in rather than 'curing' them after the fact - the latter costs the community much more engineering time.
Er, maybe we're misunderstanding each other. The idea is to have a common build step that must pass before other tests (including merge-required ones generally), which would let us limit how much any other test is wastefully run against PRs that don't even build. Expensive tests could be made "cheaper" in this way by making sure other tests have passed before we run them.
> This job may be important enough to be worth the cost of hunting down a bad PR that might merge serially. I just want to be clear that this change could allow a serially-tested PR to merge and block batch merges. If that possibility is acceptable then we can make this job blocking for batches.
This isn't very reasonable imo. All PRs should have the same merge requirements regardless of the queue/pool state. If we do this we're essentially guaranteeing that what will happen is:
@cjwagner @BenTheElder Your above concerns sound quite reasonable to me. So to begin with, I'm fine with adding the job in the regular way (i.e. run twice on PRs before merge). I checked that most of our blocking presubmit e2e jobs (node-e2e, kubemark-e2e, kops-aws-e2e) run with a max_concurrency of 12. IMO that should be fine for now if we add the kubemark-big job with that concurrency as well (it should consume at most <1k cores, at ~80 cores per run). The CI job for it finishes in ~1 hr (I can try to reduce that further if needed) and is very healthy. Besides, once this is in place, we can get rid of the other kubemark presubmit (pull-kubernetes-kubemark-e2e-gce), which will become redundant.
Does that SGTY?
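To make that concrete, the proposed change would roughly amount to flipping the job to always run and capping its parallelism, along the lines of this sketch (the numbers are the ones discussed above; all other fields are assumed to stay as in the existing job definition, which is omitted here):

```yaml
  - name: pull-kubernetes-kubemark-e2e-gce-big
    always_run: true       # run on every PR before merge, like the other e2e presubmits
    optional: true         # start non-blocking; revisit once runtime and concurrency are under control
    max_concurrency: 12    # same cap as the node-e2e / kubemark-e2e / kops-aws-e2e presubmits
    # ~80 cores per run, so under ~1k cores at this concurrency cap
```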
Yes, though I want to add that there's now in theory some process around making jobs blocking so we should start non-blocking and try to push it down under an hour as much as possible. cc @spiffxp
FWIW I'm currently working on making presubmits faster (#6865), so I'll probably be relaxing the concurrency restrictions on those jobs a bit someday soon.
I think we can keep build under 10min, unit tests under 6-7min, and e2es / verify at around 30m, and soon we will have https://github.com/kubernetes/kubernetes/pull/59289 instead of pull-kubernetes-cross, so I'll personally be a little sad to add another hour-long job, but I understand the need to keep scalability healthy. :man_shrugging:
Thanks. I'll begin by making it non-optional, but non-blocking. Let's take further steps (reducing time, dealing with concurrency) as needed.
Btw, the rationale for why it is very important to run this job (backed by some past data) is explained here: https://github.com/kubernetes/community/blob/master/sig-scalability/processes/formal-scalability-processes.md#implementation--pre-submit-phase
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale