Roslyn: change Roslyn not to crash but do non-fatal watson on most of exception case

Created on 29 Jun 2015 · 30Comments · Source: dotnet/roslyn

when we started Roslyn, we want to catch bug as soon as possible so we made any unexpected exception to fail fast (crash the host, most of time, VS).

it was good while we were developing Roslyn, but even for RTM, we are still doing this. so something like a bug in todo list scanner will cause VS to crash. or a bug in colorizer, error list, navigation bar scanner will crash the VS. (some features that don't require explicit user action, basically it means, VS crash time to time randomly)

this seems only serve us good not customer and we get this kind of bug (Devdiv # 1192910).

I dont think having a bug in type coloring, or todo list scanning is important enough to crash user's VS with possible user data loss.

we should change our policy to report non-fatal watson rather than crashing in most of those cases. and only let internals to get crashed in those situation (like the way we do in integration tests for those small cases where we use non-fatal watson)

Area-IDE Feature Request Tenet-Reliability

Source

heejaechang

Most helpful comment

Does this qualify as "irony"? #119

HaloFour on 30 Jun 2015

😄3

All 30 comments

Does this qualify as "irony"? #119

HaloFour on 30 Jun 2015

😄3

You mentioned something about this today.

Pilchie on 31 Jul 2015

@Pilchie here "you" means me? or someone else?

heejaechang on 31 Jul 2015

@jasonmalinowski. I assigned the bug to him right as I commented.

Pilchie on 31 Jul 2015

we should change our policy to report non-fatal watson rather than crashing in most of those cases. and only let internals to get crashed in those situation

I have heard contradicting views from within the team on this. Tagging @dotnet/roslyn-ide as this _mostly_ affects the IDE scenarios.

mavasani on 11 Aug 2016

@mavasani This bug is more or less the "figure out what the policy should be, and then make it so" bug.

jasonmalinowski on 11 Aug 2016

@jasonmalinowski I'll sync up with Ling (Reliability tenet owner) to understand the division guidance around this. Regardless, we surely need to do a better job of looking at non fatal watsons currently being reported.

mavasani on 11 Aug 2016

We need to identify recovery boundaries. If we can recover then reporting NFW and recovering is definitely the right thing to do. If we can't then we should rather crash then continue in possibly corrupt state.

tmat on 11 Aug 2016

Moving it to @tmeschter, the new reliable champ. Tom, we need to take this discussion forward with Ling and others involved in this discussion to figure out the correct strategy for reporting Non-fatal Watsons versus fatal Watsons.

mavasani on 19 Sep 2016

There are a couple of issues with non-fatal Watsons:

They're non-fatal, which automatically makes them a lower priority.
We don't tend to get crash dumps for non-fatal Watsons, which makes it harder to actually fix them.

I think the inevitable result is that at least some bugs we would fix today (because they crash the product) would simply never be fixed.

There are a handful of places where we definitely need to prevent unhandled exceptions from crashing the product, like 3rd-party analyzers and code fixes. We're not generally in a position to fix the underlying exception, so the best we can do is catch it, report it to the user somehow, and move on.

There are other places where we should simply be handling the exception because it isn't really exceptional--like many IO errors.

Beyond that, though, I generally think we're better off preventing the underlying exception from happening, rather than spending time trying to recover from it.

tmeschter on 7 Feb 2017

@tmeschter while I agree with your argument from the "fix bugs" point of view, crashing the IDE risks the loss of customer data. Unsaved files, incomplete version control actions, etc. are all abruptly terminated. The result can range from lost work to corrupted repositories. VS should not crash to desktop, if Roslyn wants to fast-fail then it likely needs to be out of process.

whoisj on 7 Feb 2017

"Beyond that, though, I generally think we're better off preventing the underlying exception from happening, rather than spending time trying to recover from it."

my point is not on we need to spend time on recovering, my point is we need to spend time on not crashing VS while customer is using.

it will be completely fine for us to crash while dogfooding, but once it is out to customer, crashing is bad since like you said, customer can't fix it to stop crashing. so crashing doesn't benefit customers at all.

"We don't tend to get crash dumps for non-fatal Watsons, which makes it harder to actually fix them".
I am not sure whether that is true. it shouldn't different than fatal watson. if it means we don't look at it as much as fatal watson, than it is our internal process, so we can just make it higher priority than now.

heejaechang on 7 Feb 2017

👍1

One item I'm unclear on. What happens in the following sequence of events:

Non Fatal Watson is issued but IDE keeps going.
IDE is in a bad state (see NFW above) and predictably crashes later.

Does this trigger a second bogus Watson for developers to investigate?

jaredpar on 7 Feb 2017

@jaredpar I find it unlikely that the entire IDE would be in a bad state. VS has become something of a user experience on top a bag of state, with all the interesting bits being plug-ins. Even if Roslyn becomes unstable or unusable, allowing the IDE to survive to complete other operations is a fairly important thing.

Consider a user who realizes that they cannot compile because the compiler is emitting errors. First thing they'll often do is save their work, and potentially commit it to version control. If the compiler were to crash the IDE, then the commit to version control could be stopped mid-transaction. In theory, TFVS and Git are both resilient to this kind of thing, but depending on this resilience is asking for trouble (in my opinion).

Is the VBCSCompiler your own process? Couldn't you crash that, get a Watson dump, and not cause devenv.exe to fault?

whoisj on 7 Feb 2017

@whoisj

I find it unlikely that the entire IDE would be in a bad state.

I find it very likely though that Roslyn will be in a bad state. That can definitely lead to further crashes. Any such crashes would be bogus and really not worth investigating. If we continue producing Watson in such a case it would be very counter productive.

jaredpar on 7 Feb 2017

👍1

@jaredpar Yes, you would get a second Watson. Probably with no obvious connection to the first.

@whoisj VBCSCompiler.exe is only used for actual builds. It doesn't communicate with devenv.exe in any way.

tmeschter on 7 Feb 2017

@tmeschter bummer. So then you really are in-process with devenv.exe then. I understand your conundrum, we have the exact same problems with Team Explorer.

whoisj on 7 Feb 2017

Even if Roslyn becomes unstable or unusable, allowing the IDE to survive to complete other operations is a fairly important thing.

The problem is that in many cases we're highly unlikely to survive. And the later crashes will likely be much harder to track down the root cause for.

CyrusNajmabadi on 7 Feb 2017

@jaredpar I understand it could be more work for us. but not sure why customer need to care about that? or what benefit customer get by reducing work for us.

heejaechang on 7 Feb 2017

@heejaechang

The value the customer gets from us is not running in a bad state. Once we've detected we're in a bad state we've lost all possible understanding of how the product will behave.

Does compilation actually write the correct bytes to disk? Imagine debugging when we're not actually writing your changes to disk.
Is save actually functioning or are we swallowing an exception down the stack? How much frustration could this generate for the customer?

The best possible thing we can do for customers is stop, get the report and fix the bug.

jaredpar on 7 Feb 2017

I feel the first step in this work is to look at Watson telemetry/history for past cycle and identify important crashing components/code paths and then look for the points around them where we can safely assume that IDE can be recovered from a crash. For example, IDE component which runs analyzers can execute it without crashing VS, there is no reason to crash VS if analyzer driver faults - we can probably also report a diagnostic about the crash.

mavasani on 7 Feb 2017

@jaredpar I see your point. but to make it real, we need customer agreeing with that argument. if customer actually doesn't. not sure, whether we believe it is benefit to customer or not matter.

heejaechang on 7 Feb 2017

Note I only wanted to call out the Non-fatal / Fatal pairing as a specific example of why running in a bad state isn't a good idea. There are plenty of other scenarios that you can dream up to cause customer and developer frustration here. The possibilities are virtually endless because, by definition, we have no idea what state we are in :)

jaredpar on 7 Feb 2017

@jaredpar I understand, but like @mavasani said, we can see what is common case, or like Anson said, we could put some kind of framework down to prevent we go on in unknown state. and fundamentally, crash is about customer experience so I think those should be determined by customer not us.

heejaechang on 7 Feb 2017

This sort of capability (recovering from an unexpected error) needs to be designed in on a feature-by-feature basis. It's a better fit for some than others. For example, with analyzers we have a high level of control over when and how they run and what state they can modify. We can reason about the effects of capturing all exceptions, turning them into non-fatal Watsons, and moving on. On the other hand, I would not want to catch any unexpected exceptions coming out of Code Model. There's a lot more state, and other VS components depend on it. If we avoid a crash the problem is likely to keep occurring, and we've now passed the buck to anyone building on top of it.

I don't see a "one size fits all" approach working here. As such I'm inclined to leave it to each feature to determine how fault-tolerant it wants to be (or can be), and close this issue.

tmeschter on 7 Feb 2017

👍1

@Pilchie I don't think this meets the bar for 15.1. Push it out?

tmeschter on 20 Mar 2017

Yes.

Pilchie on 20 Mar 2017

@tmeschter - this isn't happening for 15.3, right?

Pilchie on 26 May 2017

@Pilchie I don't think there's going to be a point when we're obviously done with this bug and can call it fixed. Rather we need to go through a continual process of identifying places where we currently crash but don't need to, and figure out how to make them more resilient. I think we should just close this bug.

tmeschter on 26 May 2017

Done 😄

Pilchie on 26 May 2017

Was this page helpful?

0 / 5 - 0 ratings