Some flakiness in individual integration tests is outside the current control of this repository. We should implement an auto-retry strategy consisting of the following:
This change means the build will only fail if one or more tests fail twice across consecutive launches of the IDE instance used for testing.
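For illustration, a minimal sketch of that retry shape, assuming an async test body; the `FlakyTestRetry`/`RunWithRetryAsync` helper is hypothetical, not an existing Roslyn API:

```csharp
using System;
using System.Threading.Tasks;

public static class FlakyTestRetry
{
    public static async Task RunWithRetryAsync(Func<Task> testBody, int maxAttempts = 2)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await testBody();   // run the test once per fresh attempt
                return;             // passed on this attempt
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // The first failure is swallowed and the body is retried; only a
                // second consecutive failure propagates and fails the build.
            }
        }
    }
}
```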
I don't agree with this approach. This does nothing to address reliability and just means PR validation will take even longer. I think a much more pragmatic approach is the following:
That means our PRs can finally be green. Developers can also trust that when integration tests fail it's likely due to the change they're making and hence worth taking a look at. Keeping tests we know are unreliable in CI gives developers no incentive to look into them.
Note: the goal of reliable integration tests has been around for more than a decade now. We were only able to get there with 'editor test app'. It provided the appropriate isolation to keep integration tests working consistently well. We abandoned that due to political pressure, but it's put us in this position. While I want things to be reliable, as long as there isn't enough top-down pressure to get all of VS isolated enough, then we likely need to accept unreliable (but still insanely valuable) integration tests here. Given the value, but also given the unreliability we don't have the clout to change, I think it would be a bad idea to limit the PR tests.
i think it would be a bad idea to limit the PR tests.
I think having a PR system with a near 100% failure rate is a bad idea.
Given the value
What value? A test leg that fails 95% of the time has no value. Keeping it on means developers have to investigate flaky tests in 95% of their changes. That is a huge burden on developers who have very likely done nothing wrong here. Massive productivity loss.
If the owners of the tests believe there is value in unreliability, then they should bear that burden, not the rest of the team. The tests should be limited to the developers who are committed to investigating these tests every time they fail and ensuring the value is maintained. Running them for the entire team when they can't reliably pass, and not requiring developers to investigate, is just a waste of time and money.
What value? A test leg that fails 95% of the time has no value.
Note: this was not the case. With EditorTestApp it was far more often that things were working well, issues were caught, and the occasional flakiness issue could be investigated and stamped out. The decision to disallow EditorTestApp was made for the team, but it then put things in a very difficult position. Effectively it made us dependent on other parts of the system that aren't resiliently isolated, but which we have no carrot or stick to get changes for.
If the owners of the tests believe there is value in unreliability
The owners believe strongly in reliability, but the mechanisms to make things reliable were removed from our toolkit for political reasons :-/
is just a waste of time and money.
They still catch issues (I believe just this week it caught issues with the rename feature). I agree it's wasteful how things are done here. But it's not "just a waste". It's an unfortunate waste that people don't like, but which seems to be the only viable path given management's rules on how integration test harnesses need to be built (i.e. you can't build your own isolated scaffolding, lest that give customers the means to build their own mini-VS).
Perhaps integration tests should go back to being internal, and they could use EditorTestApp again? What's more important, that we have a system that doesn't function well, but is open source? Or that we have something that works great, is extremely reliable, but is closed?
Tagging @KirillOsenkov for thoughts here as well.
Note: this was not the case. With EditorTestApp it was far more often that things were working well
Indeed. Completely agree. But at this point in time that's not a solution available to us.
They still catch issues (I believe just this week it caught issues with the rename feature).
They catch issues when there is an IDE person available to interpret your PR. That means that we can't merge a PR unless there is an IDE person available to bless every test failure. That's slow, not scalable and very wasteful.
A much more efficient system is to run the tests only during merge runs. The IDE team can then interpret the results on a schedule that is convenient to them. If legitimate errors are found they can file bugs to have the developer responsible fix the issue.
The burden should be on the system owners, not the individual contributors.
I sympathize with all parties.
Unfortunately I do agree that EditorTestApp is currently unavailable, not only for the reasons that Cyrus mentions, but also because there isn't a person who would be passionate about maintaining it. So even if it were reactivated internally, there's non-trivial effort in keeping the lights on and keeping up with VS host parity (to flesh out the host enough to fool Roslyn into thinking that it's running inside VS). Unfortunately I also notice that in the absence of EditorTestApp some design and layering violations keep sneaking into Roslyn, which ETA would have prevented, and VSMac catches these pretty late in the cycle. But that's a separate issue from the integration tests.
A pragmatic tactical recommendation would be to split the integration tests into two categories: "flaky" and "reliable". Run the reliable tests on CI. Run flaky tests nightly or weekly and only check on them then to catch regressions. It'd be OK to run those twice as well. Have the IDE team own the flaky ones (like the QA team did with Maddog runs back in the day).
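For concreteness, a sketch of that split using xUnit traits and standard `dotnet test` trait filtering; the trait value "Flaky" and the test below are illustrative, not existing Roslyn test names:

```csharp
using System.Threading.Tasks;
using Xunit;

public class RenameIntegrationTests
{
    [Fact]
    [Trait("Category", "Flaky")]   // excluded from PR validation; run nightly/weekly instead
    public async Task RenameAcrossProjects_UpdatesAllReferences()
    {
        // ... drive the IDE and assert on the rename result ...
        await Task.CompletedTask;
    }
}

// PR validation would run only the reliable set:
//   dotnet test --filter "Category!=Flaky"
// while the nightly/weekly leg runs the rest:
//   dotnet test --filter "Category=Flaky"
```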
VSMac will be investing in more integration tests later this year. Hopefully if our tests are robust enough we can add them to the Roslyn CI. That would be something like EditorTestApp hopefully, but also with the added benefit of actually helping the VSMac team!
Run the reliable tests on CI. Run flaky tests nightly or weekly and only check on them then to catch regressions
This is exactly how I feel as well.