when we started Roslyn, we want to catch bug as soon as possible so we made any unexpected exception to fail fast (crash the host, most of time, VS).
it was good while we were developing Roslyn, but even for RTM, we are still doing this. so something like a bug in todo list scanner will cause VS to crash. or a bug in colorizer, error list, navigation bar scanner will crash the VS. (some features that don't require explicit user action, basically it means, VS crash time to time randomly)
this seems only serve us good not customer and we get this kind of bug (Devdiv # 1192910).
I dont think having a bug in type coloring, or todo list scanning is important enough to crash user's VS with possible user data loss.
we should change our policy to report non-fatal watson rather than crashing in most of those cases. and only let internals to get crashed in those situation (like the way we do in integration tests for those small cases where we use non-fatal watson)
Does this qualify as "irony"? #119
You mentioned something about this today.
@Pilchie here "you" means me? or someone else?
@jasonmalinowski. I assigned the bug to him right as I commented.
we should change our policy to report non-fatal watson rather than crashing in most of those cases. and only let internals to get crashed in those situation
I have heard contradicting views from within the team on this. Tagging @dotnet/roslyn-ide as this _mostly_ affects the IDE scenarios.
@mavasani This bug is more or less the "figure out what the policy should be, and then make it so" bug.
@jasonmalinowski I'll sync up with Ling (Reliability tenet owner) to understand the division guidance around this. Regardless, we surely need to do a better job of looking at non fatal watsons currently being reported.
We need to identify recovery boundaries. If we can recover then reporting NFW and recovering is definitely the right thing to do. If we can't then we should rather crash then continue in possibly corrupt state.
Moving it to @tmeschter, the new reliable champ. Tom, we need to take this discussion forward with Ling and others involved in this discussion to figure out the correct strategy for reporting Non-fatal Watsons versus fatal Watsons.
There are a couple of issues with non-fatal Watsons:
I think the inevitable result is that at least some bugs we would fix today (because they crash the product) would simply never be fixed.
There are a handful of places where we definitely need to prevent unhandled exceptions from crashing the product, like 3rd-party analyzers and code fixes. We're not generally in a position to fix the underlying exception, so the best we can do is catch it, report it to the user somehow, and move on.
There are other places where we should simply be handling the exception because it isn't really exceptional--like many IO errors.
Beyond that, though, I generally think we're better off preventing the underlying exception from happening, rather than spending time trying to recover from it.
@tmeschter while I agree with your argument from the "fix bugs" point of view, crashing the IDE risks the loss of customer data. Unsaved files, incomplete version control actions, etc. are all abruptly terminated. The result can range from lost work to corrupted repositories. VS should not crash to desktop, if Roslyn wants to fast-fail then it likely needs to be out of process.
"Beyond that, though, I generally think we're better off preventing the underlying exception from happening, rather than spending time trying to recover from it."
my point is not on we need to spend time on recovering, my point is we need to spend time on not crashing VS while customer is using.
it will be completely fine for us to crash while dogfooding, but once it is out to customer, crashing is bad since like you said, customer can't fix it to stop crashing. so crashing doesn't benefit customers at all.
"We don't tend to get crash dumps for non-fatal Watsons, which makes it harder to actually fix them".
I am not sure whether that is true. it shouldn't different than fatal watson. if it means we don't look at it as much as fatal watson, than it is our internal process, so we can just make it higher priority than now.
One item I'm unclear on. What happens in the following sequence of events:
Does this trigger a second bogus Watson for developers to investigate?
@jaredpar I find it unlikely that the entire IDE would be in a bad state. VS has become something of a user experience on top a bag of state, with all the interesting bits being plug-ins. Even if Roslyn becomes unstable or unusable, allowing the IDE to survive to complete other operations is a fairly important thing.
Consider a user who realizes that they cannot compile because the compiler is emitting errors. First thing they'll often do is save their work, and potentially commit it to version control. If the compiler were to crash the IDE, then the commit to version control could be stopped mid-transaction. In theory, TFVS and Git are both resilient to this kind of thing, but depending on this resilience is asking for trouble (in my opinion).
Is the VBCSCompiler your own process? Couldn't you crash that, get a Watson dump, and not cause devenv.exe to fault?
@whoisj
I find it unlikely that the entire IDE would be in a bad state.
I find it very likely though that Roslyn will be in a bad state. That can definitely lead to further crashes. Any such crashes would be bogus and really not worth investigating. If we continue producing Watson in such a case it would be very counter productive.
@jaredpar Yes, you would get a second Watson. Probably with no obvious connection to the first.
@whoisj VBCSCompiler.exe is only used for actual builds. It doesn't communicate with devenv.exe in any way.
@tmeschter bummer. So then you really are in-process with devenv.exe then. I understand your conundrum, we have the exact same problems with Team Explorer.
Even if Roslyn becomes unstable or unusable, allowing the IDE to survive to complete other operations is a fairly important thing.
The problem is that in many cases we're highly unlikely to survive. And the later crashes will likely be much harder to track down the root cause for.
@jaredpar I understand it could be more work for us. but not sure why customer need to care about that? or what benefit customer get by reducing work for us.
@heejaechang
The value the customer gets from us is not running in a bad state. Once we've detected we're in a bad state we've lost all possible understanding of how the product will behave.
The best possible thing we can do for customers is stop, get the report and fix the bug.
I feel the first step in this work is to look at Watson telemetry/history for past cycle and identify important crashing components/code paths and then look for the points around them where we can safely assume that IDE can be recovered from a crash. For example, IDE component which runs analyzers can execute it without crashing VS, there is no reason to crash VS if analyzer driver faults - we can probably also report a diagnostic about the crash.
@jaredpar I see your point. but to make it real, we need customer agreeing with that argument. if customer actually doesn't. not sure, whether we believe it is benefit to customer or not matter.
Note I only wanted to call out the Non-fatal / Fatal pairing as a specific example of why running in a bad state isn't a good idea. There are plenty of other scenarios that you can dream up to cause customer and developer frustration here. The possibilities are virtually endless because, by definition, we have no idea what state we are in :)
@jaredpar I understand, but like @mavasani said, we can see what is common case, or like Anson said, we could put some kind of framework down to prevent we go on in unknown state. and fundamentally, crash is about customer experience so I think those should be determined by customer not us.
This sort of capability (recovering from an unexpected error) needs to be designed in on a feature-by-feature basis. It's a better fit for some than others. For example, with analyzers we have a high level of control over when and how they run and what state they can modify. We can reason about the effects of capturing all exceptions, turning them into non-fatal Watsons, and moving on. On the other hand, I would not want to catch any unexpected exceptions coming out of Code Model. There's a lot more state, and other VS components depend on it. If we avoid a crash the problem is likely to keep occurring, and we've now passed the buck to anyone building on top of it.
I don't see a "one size fits all" approach working here. As such I'm inclined to leave it to each feature to determine how fault-tolerant it wants to be (or can be), and close this issue.
@Pilchie I don't think this meets the bar for 15.1. Push it out?
Yes.
@tmeschter - this isn't happening for 15.3, right?
@Pilchie I don't think there's going to be a point when we're obviously done with this bug and can call it fixed. Rather we need to go through a continual process of identifying places where we currently crash but don't need to, and figure out how to make them more resilient. I think we should just close this bug.
Done 馃槃
Most helpful comment
Does this qualify as "irony"? #119