Roslyn: Only format source generated file on demand

Created on 1 Dec 2020  路  17Comments  路  Source: dotnet/roslyn

This is probably more of a VS issue than a roslyn one, but it will demand some coordination between the two so I'll post it here.

I maintain the StrongInject Source Generator.

When generating the code, I generate everything on a single line, without any superfluous whitespace.

I then call CSharpSyntaxTree.ParseText(SourceText.From(file, Encoding.UTF8)).GetRoot().NormalizeWhitespace().SyntaxTree.GetText(); to format the generated code.

Well over 2/3ds of the time in my Source Generator is spent just on this, even though StrongInject is a very computation intensive SG.

However in the majority of cases this is completely unnecessary, as the user will never see the code.

Instead it would be useful if when adding a file to a compilation via the SourceGenerator I could set an unformatted flag. Then when VS displays the file to the user it will format it for me. I imagine this would lead to extremely large savings in a lot of SG projects.

Area-IDE Concept-Continuous Improvement New Feature - Source Generators

All 17 comments

This is an interesting question. Essentially how do we balance speed in source generators vs. displaying the text nicely in the IDE when customers do step through it. Think it's probably more on the IDE side though so moving there for now.

cc @jasonmalinowski , @chsienki

I do wonder what the expensive part is in that dance -- is the intermediate parse or the normalization? And if it's the normalization is there some low hanging fruit we can fix there?

@chsienki: as it stands @YairHalberstadt is having to do two text reparses: the initial parse to get a tree to pass to NormalizeWhitespace, and then there's a second parse of the tree he had to convert back to. If the "add text" API had a "please format it" option, then we could do a parse, and do NormalizeWhitespace, and presume internally that NormalizeWhitespace doesn't a valid tree into an invalid one. (If it does, that's our own bug.) Even if we can't reduce the cost of the normalization (or the need to do it entirely), at least we can reduce a parse here.

@jaredpar: the tricky bit here is debugging: if the compiled version is the unformatted code, then the spans embedded in the PDB won't be right.

@AArnott also reported this internally a few weeks ago.

Do we have a feedback ticket with a performance trace to make sure NormalizeWhitespace isn't doing more work than necessary?

It's almost all NormalizeWhitespace, although the other parts aren't cheap either.

Also doing the double parse creates a lot of garbage.

@sharwell: @AArnott did share an ETL internally. I can fully imagine NormalizeWhitespace was never carefully optimized.

If this is only formatted on demand by VS, it would also allow us to use the simplify APIs.

Currently I generate fully qualified code for everything (including global::), to remove any chance of a conflict. However users would much rather see the file with imports added and names simplified. That API is both expensive, and only available in the IDE layer, so not possible to do in a source generator.

I'd rather we just fixup NormalizeWhitespace to be fast. I don't see why it would need to be slow.

Do we have a feedback ticket with a performance trace to make sure NormalizeWhitespace isn't doing more work than necessary?

My understanding of NW (from conversations over hte years) is taht it's completely unoptimized. So every token is visited and every token is rewritten no matter what. Though, in yair's case, that just might what needs to happen as he's just emitting one long string.

If someone submits a feedback ticket with a performance trace I'll be all over it 馃槃

Jason sent me a trace. First thing I notice is #48568 will help this. I'll send a PR shortly with additional improvements.

I submitted #49758 to implement a targeted improvement, and filed #49759 to address one of the more significant sources of performance overhead later. With both of those implemented, the next issue is finding a way to avoid calling GetNextToken so many times during the normalization process.

As much as I love performance improvements to NormalizeWhitespace: shouldn't we avoid computations in SGs as much as possible to keep their performance high? I would expect it to be a tiny fraction of generated source files that will ever be seen by a human, so operations that only do cosmetic changes to the generated code seem badly placed when they're happening essentially on every keystroke rather than just when the user actually wants to see the code.

@stefanloerwald We should normalize anyway. Imagine the debugging experience and exception stack traces. When an exception is thrown and shows a line number from a generated source file, do you want it to always say it's on "line 1"? If I were the one investigating a failure, I would certainly want a meaningful line number shown there.
Beyond stack traces, the pdb will only match against your unformatted source file. So if the IDE were to format it just as the document opened, the debugger would reject the file. While that might be a solvable problem, it would involve several teams and be quite expensive. If we can make NormalizeWhitespace decently fast, it's a better use of resources and ends up with VS shipping more meaningful features.

Good points, @AArnott, thanks for elaborating. Although of course not everything has to be in just one line, even if you don't normalize whitespace ;-)

Although of course not everything has to be in just one line, even if you don't normalize whitespace ;-)

Yep. The complexity of indenting my generated code is much higher than that of adding new lines.

The overhead can be nearly eliminated by making NormalizeWhitespace operate one token "ahead" of where it does today (i.e. instead of calling GetNextToken it just reads a field storing the previous token), in combination with #49759. I'm not a fan of different content for different build scenarios because it breaks debugging and all editor features. If we were to optimize a fast-path here, it would be the creation of a skeleton document (member signatures present, but the implementations all throw null) so IntelliSense understands the available members quickly.

Was this page helpful?
0 / 5 - 0 ratings