Some flakiness in individual integration tests is outside the current control of this repository. We should implement an auto-retry strategy consisting of the following:
This change means the build will only fail if one or more tests fail twice across consecutive launches of the IDE instance used for testing.
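For illustration, a minimal sketch of that retry shape, assuming an async test body; the `FlakyTestRetry`/`RunWithRetryAsync` helper is hypothetical, not an existing Roslyn API:

```csharp
using System;
using System.Threading.Tasks;

public static class FlakyTestRetry
{
    public static async Task RunWithRetryAsync(Func<Task> testBody, int maxAttempts = 2)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await testBody();   // run the test once per fresh attempt
                return;             // passed on this attempt
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // The first failure is swallowed and the body is retried; only a
                // second consecutive failure propagates and fails the build.
            }
        }
    }
}
```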
I don't agree with this approach. This does nothing to address reliability and just means PR validation will take even longer. I think a much more pragmatic approach is the following:
That means our PRs can finally be green. Developers can also trust that when integration tests fail it's likely due to the change they're making and hence worth taking a look at. Keeping tests we know are unreliable in CI gives developers no incentive to look into them.
Note: the goal of reliable integration tests has been around for more than a decade now. We were only able to get there with 'editor test app'. It provided the appropriate isolation to keep integration tests working consistently well. We abandoned that due to political pressure, but it's put us in this position. While I want things to be reliable, as long as there isn't enough top-down pressure to get all of VS isolated enough, then we likely need to accept unreliable (but still insanely valuable) integration tests here. Given the value, but also given the unreliability we don't have the clout to change, I think it would be a bad idea to limit the PR tests.
i think it would be a bad idea to limit the PR tests.
I think having a PR system with a near 100% failure rate is a bad idea.
Given the value
What value? A test leg that fails 95% of the time has no value. Keeping it on means developers have to investigate flaky tests in 95% of their changes. That is a huge burden on developers who have very likely done nothing wrong here. Massive productivity loss.
If the owners of the tests believe there is value in unreliability, then they should bear that burden, not the rest of the team. The tests should be limited to the developers who are committed to investigating these tests every time they fail and ensuring the value is maintained. Running them for the entire team when they can't reliably pass, and not requiring developers to investigate, is just a waste of time and money.
What value? A test leg that fails 95% of the time has no value.
Note: this was not the case. With EditorTestApp it was far more often that things were working well, issues were caught, and the occasional flakiness issue could be investigated and stamped out. The decision to disallow EditorTestApp was made for the team, but it then put things in a very difficult position. Effectively it made us dependent on other parts of the system that aren't resiliently isolated, but which we have no carrot or stick to get changes for.
If the owners of the tests believe there is value in unreliability
The owners believe strongly in reliability, but the mechanisms to make things reliable were removed from our toolkit for political reasons :-/
is just a waste of time and money.
They still catch issues (I believe just this week it caught issues with the rename feature). I agree it's wasteful how things are done here. But it's not "just a waste". It's an unfortunate waste that people don't like, but which seems to be the only viable path given management's rules on how integration test harnesses need to be built (i.e. you can't build your own isolated scaffolding, lest that give customers the means to build their own mini-VS).
Perhaps integration tests should go back to being internal, and they could use EditorTestApp again? What's more important, that we have a system that doesn't function well, but is open source? Or that we have something that works great, is extremely reliable, but is closed?
Tagging @KirillOsenkov for thoughts here as well.
Note: this was not the case. With EditorTestApp it was far more often that things were working well
Indeed. Completely agree. But at this point in time that's not a solution available to us.
They still catch issues (I believe just this week it caught issues with the rename feature).
They catch issues when there is an IDE person available to interpret your PR. That means that we can't merge a PR unless there is an IDE person available to bless every test failure. That's slow, not scalable and very wasteful.
A much more efficient system is to run the tests only during merge runs. The IDE team can then interpret the results on a schedule that is convenient to them. If legitimate errors are found they can file bugs to have the developer responsible fix the issue.
The burden should be on the system owners, not the individual contributors.
I sympathize with all parties.
Unfortunately I do agree that EditorTestApp is currently unavailable, not only for the reasons that Cyrus mentions, but also because there isn't a person who would be passionate about maintaining it. So even if it were reactivated internally, there's non-trivial effort in keeping the lights on and keeping up with VS host parity (to flesh out the host enough to fool Roslyn into thinking that it's running inside VS). Unfortunately I also notice that in the absence of EditorTestApp some design and layering violations keep sneaking into Roslyn, which ETA would have prevented, and VSMac catches these pretty late in the cycle. But that's a separate issue from the integration tests.
A pragmatic tactical recommendation would be to split the integration tests into two categories: "flaky" and "reliable". Run the reliable tests on CI. Run flaky tests nightly or weekly and only check on them then to catch regressions. It'd be OK to run those twice as well. Have the IDE team own the flaky ones (like the QA team did with Maddog runs back in the day).
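For concreteness, a sketch of that split using xUnit traits and standard `dotnet test` trait filtering; the trait value "Flaky" and the test below are illustrative, not existing Roslyn test names:

```csharp
using System.Threading.Tasks;
using Xunit;

public class RenameIntegrationTests
{
    [Fact]
    [Trait("Category", "Flaky")]   // excluded from PR validation; run nightly/weekly instead
    public async Task RenameAcrossProjects_UpdatesAllReferences()
    {
        // ... drive the IDE and assert on the rename result ...
        await Task.CompletedTask;
    }
}

// PR validation would run only the reliable set:
//   dotnet test --filter "Category!=Flaky"
// while the nightly/weekly leg runs the rest:
//   dotnet test --filter "Category=Flaky"
```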
VSMac will be investing in more integration tests later this year. Hopefully if our tests are robust enough we can add them to the Roslyn CI. That would be something like EditorTestApp hopefully, but also with the added benefit of actually helping the VSMac team!
Run the reliable tests on CI. Run flaky tests nightly or weekly and only check on them then to catch regressions
This is exactly how I feel as well.