PowerShell: Use Roslyn source generators for cdxml cmdletization

Created on 22 May 2020 · 31 Comments · Source: PowerShell/PowerShell

Summary of the new feature/enhancement

Currently, when a cdxml module is imported the engine takes the following steps:

  1. Parse an XML document
  2. Build a string representing an entire PowerShell module based on the XML
  3. The generated module essentially just builds args for a C# based CmdletAdapter implementation

If source generators were used instead, the first two steps would be moved to compile time and the last would have less overhead. They would also be less prone to some of PowerShell's module scope quirks like ErrorAction inheritance.
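
To make the three steps concrete, here is a minimal Python sketch of the import-time flow. It is not the engine's actual code: the XML layout is a hypothetical, heavily simplified stand-in for real cdxml, and `Invoke-ExampleAdapter` is a made-up command name standing in for the call into a CmdletAdapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a cdxml document.
# The real cdxml schema is richer; these element names are illustrative only.
SAMPLE = """
<Module adapter="ExampleAdapter">
  <Cmdlet verb="Get" noun="Widget">
    <Parameter name="Name" type="string"/>
    <Parameter name="Id" type="int"/>
  </Cmdlet>
</Module>
"""


def generate_script(xml_text: str) -> str:
    """Steps 1 and 2: parse the XML, then emit PowerShell source as one string."""
    root = ET.fromstring(xml_text)
    adapter = root.get("adapter")
    lines = []
    for cmdlet in root.findall("Cmdlet"):
        name = f"{cmdlet.get('verb')}-{cmdlet.get('noun')}"
        params = [
            f"[{p.get('type')}] ${p.get('name')}"
            for p in cmdlet.findall("Parameter")
        ]
        lines.append(f"function {name} {{")
        lines.append("    param(" + ", ".join(params) + ")")
        # Step 3: the generated body only forwards the bound arguments.
        # Invoke-<adapter> is a hypothetical command, not a real cmdlet.
        lines.append(f"    Invoke-{adapter} -Arguments $PSBoundParameters")
        lines.append("}")
    return "\n".join(lines)


script = generate_script(SAMPLE)
print(script)
```

All of this work happens every time the module is imported, which is the cost the proposal wants to move to build time.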

Issue-Enhancement WG-Engine

All 31 comments

the last would have less overhead.

What do you mean? Performance? Faster startup?

What do you mean? Performance? Faster startup?

Both (assuming by start up you mean import time, afaik no cdxml modules are imported at process init).

What will the steps be?

Import-Module cdxml

  • Parse xml
  • Generate C# file
  • Compile C# to binary module
  • Load binary module

I think what @SeeminglyScience is suggesting is that rather than shipping the XML and parsing it into a module at import time, we could create a build task to generate a binary module so that we only need to ship the module DLL. That way, all of that processing is front-loaded and users just end up importing the module DLL directly.

Yeah, Roslyn source generators attach like analyzers. Except instead of emitting diagnostic records, they add additional source files to the obj path.

I don't understand why it should be in the repo. If a developer wants to create a binary module, the developer can use any tool.

@iSazonov Not sure what you mean.

This issue is about creating a Roslyn source generator that generates binary modules from cdxml files.

This issue is about creating a Roslyn source generator that generates binary modules from cdxml files.

You can use a Roslyn source generator or another tool to create a custom binary module. I am asking why it should be in the PowerShell repo.

I'm not sure why it wouldn't be, but sure it can go in a new repo I guess.

From my understanding, you are talking about a new tool/module to create specific binary modules.
I would expect you to describe how this should work in a full-fledged RFC.

From my understanding, you are talking about a new tool/module to create specific binary modules.

I would say that the suggestion is an alternative/addition to the current cmdletization API that would result in better performance and UX. It's like saying PowerShell is a compiler, it's not wrong but it's not the whole picture.

I would expect you to describe how this should work in a full-fledged RFC.

Yeah I think if this gets implemented it might be a good idea for the implementer to submit an RFC. If this is something you're interested in pursuing, I'd be happy to expand on any particular detail that may be unclear.

I would say that

This is too vague. If you have a clear idea, then share the exact specifications.

So currently when a cdxml module is imported, PowerShell reads the cdxml file and uses it to generate a PowerShell script module. My suggestion is that instead of generating a PowerShell script module, it generates a binary module. And instead of doing it at import time, it's done at compile time. A good way to do this would be via Roslyn source generators.
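
As a rough illustration of the build-time alternative, the Python sketch below emits C# source for one cmdlet class from a descriptor and writes it to an output folder, which is roughly the job a source generator or a standalone tool would do. `CmdletSpec`, `emit_csharp`, `generate` and the `.g.cs` file name are assumptions made for illustration; a real Roslyn source generator would be written in C# against the Roslyn APIs, and the emitted class body is deliberately left incomplete.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CmdletSpec:
    """Hypothetical descriptor for one cmdlet read from a cdxml file."""
    verb: str
    noun: str
    parameters: dict  # parameter name -> C# type name


def emit_csharp(spec: CmdletSpec) -> str:
    """Build C# source text for a cmdlet class (illustrative shape only)."""
    props = "\n".join(
        f"    [Parameter] public {ctype} {name} {{ get; set; }}"
        for name, ctype in spec.parameters.items()
    )
    return (
        "using System.Management.Automation;\n\n"
        f'[Cmdlet("{spec.verb}", "{spec.noun}")]\n'
        f"public sealed class {spec.verb}{spec.noun}Command : PSCmdlet\n"
        "{\n"
        f"{props}\n"
        "    // ProcessRecord would pass the bound values to the CmdletAdapter (omitted).\n"
        "}\n"
    )


def generate(spec: CmdletSpec, out_dir: Path) -> Path:
    """Write the generated source to the output folder and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{spec.verb}{spec.noun}Command.g.cs"
    path.write_text(emit_csharp(spec), encoding="utf-8")
    return path


spec = CmdletSpec("Get", "Widget", {"Name": "string", "Id": "int"})
with tempfile.TemporaryDirectory() as tmp:
    generated = generate(spec, Path(tmp) / "obj")
    source = generated.read_text(encoding="utf-8")
print(source)
```

The generated file would then be compiled with the rest of the project, so nothing has to be parsed or generated when the module is imported.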

it's done at compile time

What will the source be in that case? cdxml? So a developer creates a cdxml, runs the generator to get a dll, then signs, packages and publishes the new binary module.
This workflow raises questions:

  • Why does it have to be a Roslyn source generator?
    It is one option among many. The tool could be implemented in PowerShell or something else.
  • Why does it have to be in the PowerShell engine?
    Obviously it is a _tool_. Current policy is to publish a new PowerShell module/tool on the PowerShellGet site to get feedback before deciding to include it in the engine.
  • Why are performance gains expected?
    PowerShell uses the psd1 file for discovery, so we get no performance wins from a binary module.
    Module loading is a one-time operation, so we get no perceptible performance wins.
    Such modules use the same cmdlet adapter, so we get no performance wins.

The vast majority of cdxml modules are owned by Microsoft. Unless we want to make it a separate tool which is just pulled in during build time for any cdxml modules we're shipping with pwsh, it makes the most sense to live in this repo since it's for building PowerShell modules imo.

Modules can go on PowerShellGet, not so much a Roslyn source generator. As far as I know, the current cdxml parser/script generator lives in this code base. It makes more sense for this to be here in the same way.

The vast majority of cdxml modules are owned by Microsoft.

So the new tool is needed only for MSFT? And the ask is "MSFT, please convert all cdxml modules to binary"?

it's done at compile time

What will the source be in that case? cdxml? So a developer creates a cdxml, runs the generator to get a dll, then signs, packages and publishes the new binary module.
This workflow raises questions:

  • Why does it have to be a Roslyn source generator?
    It is one option among many. The tool could be implemented in PowerShell or something else.

I said:

A good way to do this would be via Roslyn source generators.

It seems like the obvious choice to me, but an implementer is free to investigate other tools.

  • Why does it have to be in the PowerShell engine?
    Obviously it is a _tool_. Current policy is to publish a new PowerShell module/tool on the PowerShellGet site to get feedback before deciding to include it in the engine.

For sure, the implementer is welcome to do that.

  • Why are performance gains expected?
    PowerShell uses the psd1 file for discovery, so we get no performance wins from a binary module.

In discovery sure, but then you go to run a script module and the binary module is faster.

Module loading is a one-time operation, so we get no perceptible performance wins.
Such modules use the same cmdlet adapter, so we get no performance wins.

There's a lot of PowerShell code generated. There would be performance gains.

From https://github.com/PowerShell/PowerShell/issues/13236#issuecomment-662673511

I think the issues with source generators (or some other DLL generation mechanism) for cmdletisation are:

  • CDXML is checked into Windows, so generating DLLs would need to be done in that code's build steps. But APIs catered to tend to be native/COM APIs rather than .NET ones, so even writing a Source Generator we would (1) need them to be using .NET and (2) need them to be using .NET Core
  • CDXML not being DLLs is actually a huge win for PS 7, since CDXML modules were all instantly compatible with PS 7. Compiling things to DLLs would likely set up more issues
  • CDXML is checked into Windows, so generating DLLs would need to be done in that code's build steps.

Fair enough, that does diminish the most obvious value a bit. It's worth noting though that cmdletization is a public API and cdxml can be used to generate modules based on anything (assuming a custom CmdletAdapter is used). I use it in EditorServicesCommandSuite to generate functions from C# classes that can't inherit PSCmdlet.

But APIs catered to tend to be native/COM APIs rather than .NET ones, so even writing a Source Generator we would (1) need them to be using .NET and (2) need them to be using .NET Core

Depends where you put it. It could easily be a console app for instance.

  • CDXML not being DLLs is actually a huge win for PS 7, since CDXML modules were all instantly compatible with PS 7. Compiling things to DLLs would likely set up more issues

Generated code wouldn't be high risk since it would still fall back to whatever CmdletAdapter it used for actual implementation. It's mostly just hooking up arguments.

When I was doing startup performance work for Windows PowerShell 5.1, CDXML was probably the biggest issue that I did not have time to address.

Ideally CDXML is translated to real code exactly once as part of a Windows build, but one could still generate code on demand and cache the translation.

Also note the generated PowerShell seems overly verbose. I'd assumed (but never verified) that a more data driven approach was possible - and that might feel safer than caching generated code.
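
A data-driven approach could look something like the Python sketch below: instead of emitting one generated function per cmdlet, a table describes each cmdlet and a single generic dispatcher validates the arguments and forwards them. Everything here (`CMDLETS`, `RecordingAdapter`, `invoke_cmdlet`, the method names) is hypothetical and only meant to show the shape of the idea, not how the engine or CmdletAdapter actually work.

```python
# Hypothetical descriptor table: one small entry per cmdlet instead of
# hundreds of lines of generated script per cmdlet.
CMDLETS = {
    "Get-Widget": {"method": "Query", "parameters": {"Name": str, "Id": int}},
    "Remove-Widget": {"method": "Delete", "parameters": {"Id": int}},
}


class RecordingAdapter:
    """Stand-in for a CmdletAdapter; it only records what it was asked to do."""

    def __init__(self):
        self.calls = []

    def invoke(self, method, arguments):
        self.calls.append((method, arguments))


def invoke_cmdlet(name, adapter, **bound):
    """Single generic entry point shared by every cmdlet in the table."""
    spec = CMDLETS[name]
    for key, value in bound.items():
        expected = spec["parameters"].get(key)
        if expected is None:
            raise TypeError(f"{name} has no parameter -{key}")
        if not isinstance(value, expected):
            raise TypeError(f"-{key} expects {expected.__name__}")
    adapter.invoke(spec["method"], dict(bound))


adapter = RecordingAdapter()
invoke_cmdlet("Get-Widget", adapter, Name="gear")
invoke_cmdlet("Remove-Widget", adapter, Id=7)
print(adapter.calls)
```

With this shape, the only thing that would need to be produced from the cdxml is the table, which is data rather than code.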

When I was doing startup performance work for Windows PowerShell 5.1, CDXML was probably the biggest issue that I did not have time to address.

Interesting! Perhaps one possibility is for PowerShell to generate and cache code from CDXML on loading, in a similar way to how Add-Type can generate assemblies or how PowerShell classes work.

A Source Generator is a .NET Standard 2.0 assembly that is loaded by the compiler along with any analyzers.

@SeeminglyScience My understanding is that (_please correct me if I'm wrong_): a source generator is an assembly that is shipped along with your SDK nuget package (e.g. Microsoft.PowerShell.SDK), that can generate code into the user assembly at its compilation time. In the case of CDXML modules, there is no dotnet code for a CDXML module and thus no dotnet compilation involved, so how could a source generator kick in? It looks to me like a cmdletization tool that consumes CDXML and spits out C# code is what you want.

@lzybkr Does the current way PowerShell deals with CDXML affect the startup time? The parsing and script writing only happen if you are importing a CDXML module, right? In my understanding, if we have a tool to turn CDXML modules into dotnet assemblies that interact with CmdletAdapter directly, the benefits are:

  1. Faster loading of CDXML modules and faster execution of CDXML cmdlets
  2. Reduce the size of System.Management.Automation.dll

And IMHO, only (2) can help startup time a bit.

@SeeminglyScience My understanding is that (_please correct me if I'm wrong_): a source generator is an assembly that is shipped along with your SDK nuget package (e.g. Microsoft.PowerShell.SDK), that can generate code into the user assembly at its compilation time. In the case of CDXML modules, there is no dotnet code for a CDXML module and thus no dotnet compilation involved, so how could a source generator kick in?

It looks pretty flexible in that as long as you have a csproj you can use it. I haven't used it myself so it's totally possible I'm wrong, but basically my idea was to have:

  1. a csproj referencing either the SDK or a nuget with just the source generator. Probably also referencing the cdxml file with a None node
  2. the cdxml file
  3. Run dotnet, it spits out a binary module

It looks to me like a cmdletization tool that consumes CDXML and spits out C# code is what you want.

Yeah. Well, truthfully a Roslyn source generator would be ideal for what I personally use it for, so I don't have to ship a separate assembly with just the binary cmdlets. But yeah, something that consumes cdxml and spits out a ready-to-ship assembly would also work.

@daxian-dbw - IIRC, command discovery would generate PowerShell from CDXML and you don't always search CDXML modules, but when you do, it stands out in a profile.

One could modify command discovery, but that solution feels too targeted. CDXML modules can be slow to load and the generated code feels quite bloated. For example, the Storage module takes 2s to load on my machine and generates nearly 40,000 lines of PowerShell.

The simplest thing I see that we can do is add a cache for compiled CDXML modules. Since we already have a cache for command discovery, it does not add more security risk. (I'd work on this if the MSFT team wants, and we could include it in 7.1.)

The next thing we could do is try to improve the CDXML module compiler.

I suggest opening a new issue for these optimizations and closing this one, because Roslyn source generators are better suited to external projects like https://github.com/PowerShell/PlasterBuild and https://github.com/PowerShell/generator-powershell.

Thanks @lzybkr, I absolutely agree there is space to improve the CDXML modules (loading and execution).

@iSazonov do you mean caching the generated PowerShell script for CDXML modules? It's different from the cache for command discovery. The latter only contains metadata (which command name belongs to which module), but no code for execution. I don't recommend doing that. It's not an urgent problem that needs to be mitigated with a workaround. If we want to invest in this area, then we should directly improve the ScriptWriter and work on the tool that produces assemblies from CDXML files.

do you mean caching the generated powershell script for CDXML modules?

No, I mean generating the CDXML script, compiling it and caching the assembly. Then we could improve ScriptWriter.

No, I mean generating the CDXML script, compiling it and caching the assembly.

This already sounds like the tool that @SeeminglyScience is asking for 😄

A tool is an external thing. I suggest doing this on the fly and checking the cache folder before compiling the CDXML the next time.
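
For illustration, an on-the-fly cache of this kind might be keyed on a hash of the cdxml content, so that an edited file is recompiled automatically. This is only a Python sketch under that assumption: `load_with_cache`, the `.bin` file layout and `fake_compile` are hypothetical, and the real compile step is replaced by a placeholder.

```python
import hashlib
import tempfile
from pathlib import Path


def load_with_cache(cdxml_path: Path, cache_dir: Path, compile_fn) -> bytes:
    """Return the compiled artifact, compiling only when no cached copy exists."""
    source = cdxml_path.read_bytes()
    # Key the cache on the file content so a changed cdxml never hits a stale entry.
    key = hashlib.sha256(source).hexdigest()
    cached = cache_dir / f"{key}.bin"
    if cached.exists():
        return cached.read_bytes()
    artifact = compile_fn(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(artifact)
    return artifact


compile_calls = []


def fake_compile(source: bytes) -> bytes:
    """Placeholder for the expensive generate-and-compile step."""
    compile_calls.append(len(source))
    return b"compiled:" + source


with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "Example.cdxml"
    module.write_bytes(b"<Module/>")
    cache = Path(tmp) / "cache"
    first = load_with_cache(module, cache, fake_compile)   # compiles
    second = load_with_cache(module, cache, fake_compile)  # served from cache
    module.write_bytes(b"<Module changed='1'/>")
    third = load_with_cache(module, cache, fake_compile)   # content changed, recompiles
print(len(compile_calls))
```

A real implementation would also need to decide where the cache lives and who is allowed to write to it, which is where the security question raised earlier comes in.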

I don't see a need to duplicate the effort there. If we're going to have such functionality it makes more sense to me that we segregate its code to a separate dll that a small console app can be built to interface with in a minimal fashion. That way, PowerShell can cache built CDXML modules until their current authors get around to shipping compiled assemblies for them, and authors wanting to write CDXML modules have the option of just shipping the compiled assembly and not requiring PowerShell to perform the caching.

Over time we could deprecate the caching method and eventually remove the CDXML caching entirely, requiring authors to precompile the modules.

If one was considering building something that might someday get integrated into the Windows build - I would start with a standalone tool that does not depend on PowerShell - probably with a simple exe wrapper around a dll that could be consumed by PowerShell.
