Original proposal by @jkotas
The generation of Json serializers via reflection at runtime has non-trivial startup costs. This has been identified as a bottleneck during prototyping of fast small cloud-first micro-services:
Repro: https://gist.github.com/jkotas/b0671e154791e287c38a627ca81d7197
The Json serializer generated using reflection at runtime has a startup cost of ~30ms. The manually written Json serializer has a startup cost of ~1ms.
_Edited by @kevinwkt and @layomia :_
There are comprehensive documents detailing the needs and benefits of generating JSON serializers at compile time. Some of these benefits are improved startup time, reduction in private memory usage, faster throughput for serialization and deserialization, and being ILLinker-friendly due to avoiding reflection at run-time. There is also an opportunity to reduce the size of the trimmed System.Text.Json.dll after source generation and linker trimming, because code-paths that use reflection can potentially be removed, along with unused built-in JsonConverter<T>s such as Uri, Ulong64, etc.
After discussing some approaches and pros/cons of some of them we decided to implement this feature using Roslyn source generators. Implementation details and code/usage examples can be seen in the design document. This document will outline the roadmap for the initial experiment and highlight actionable items.
This project requires numerous API changes and the design is being iterated on which is why we will be using the dotnet/runtimelab repository instead of dotnet/runtime. The main goal of this project is to get something up and running while changing implementation and iterating on public API without committing to dotnet/runtime master. We hope to share the project and get feedback for potential release on .NET 6.0. The project will be consumable through a prerelease package until then. Progress can be tracked through the JSON Code Gen project board in dotnet/runtimelab.
There are 3 main points in this project: type discovery, source code generation, generated source code integration (with user applications).
Type discovery can be thought of in two ways, an implicit model (where the user does not have to specify which types to generate code for) and an explicit model (user specifies through code or configuration which types to generate code for).
Various implicit approaches have been discussed, such as source generating for all partial classes or scanning for calls into the serializer using the Roslyn syntax tree. These models can be revisited in the future as the value/feasibility of the approach becomes clearer based on user feedback. It is important to note that some downsides to such a model include missing types to generate source for, or generating source for types when not needed, due to a bug or edge cases we didn't consider.
The proposed approach for type discovery requires an explicit indication of serializable types by the user. This model supports indicating both owned and non-owned types. A new JsonSerializableAttribute will be used to detect these types. There are two patterns for JsonSerializableAttribute. The first consists of applying the attribute on a type that the user owns, and the second consists of the user passing a non-owned serializable type into the constructor of the attribute.
We believe that an explicit model using attributes would be a simple first approach to the problem. Within the Roslyn source generator, we parse the syntax tree to find usages of the JsonSerializableAttribute. The output of this phase would be a list of input types for the generator in order to code-gen recursively for each type in all the object graphs.
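To make the two attribute patterns and the "list of input types" concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `SketchJsonSerializableAttribute` is a hypothetical stand-in rather than the real attribute, the non-owned pattern is shown as an assembly-level attribute, and run-time reflection is used only as a substitute for the Roslyn syntax-tree walk that the actual generator would perform at compile time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Pattern 2 (non-owned type): the type is passed to the attribute constructor.
[assembly: SketchJsonSerializable(typeof(Version))]

// Hypothetical stand-in for the proposed attribute; not the System.Text.Json type.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Struct | AttributeTargets.Assembly,
    AllowMultiple = true)]
public sealed class SketchJsonSerializableAttribute : Attribute
{
    public SketchJsonSerializableAttribute() { }
    public SketchJsonSerializableAttribute(Type type) => Type = type;
    public Type Type { get; } // null when the attribute sits on an owned type
}

public class Customer { public string Name { get; set; } }

// Pattern 1 (owned type): the attribute is applied directly to the type.
[SketchJsonSerializable]
public class Order
{
    public int Id { get; set; }
    public Customer Customer { get; set; }
}

public static class TypeDiscovery
{
    // Collects the root types indicated by either attribute pattern.
    public static IReadOnlyList<Type> FindRoots(Assembly assembly)
    {
        var owned = assembly.GetTypes()
            .Where(t => t.GetCustomAttributes<SketchJsonSerializableAttribute>().Any(a => a.Type == null));
        var nonOwned = assembly.GetCustomAttributes<SketchJsonSerializableAttribute>()
            .Where(a => a.Type != null)
            .Select(a => a.Type);
        return owned.Concat(nonOwned).Distinct().ToList();
    }

    // Walks public instance properties so every type in each object graph is visited once.
    public static ISet<Type> ExpandGraph(IEnumerable<Type> roots)
    {
        var seen = new HashSet<Type>();
        var pending = new Stack<Type>(roots);
        while (pending.Count > 0)
        {
            Type current = pending.Pop();
            if (!seen.Add(current)) continue;                               // already visited
            if (current.IsPrimitive || current == typeof(string)) continue; // leaf types
            foreach (PropertyInfo property in current.GetProperties(BindingFlags.Public | BindingFlags.Instance))
                pending.Push(property.PropertyType);
        }
        return seen;
    }
}
```

In this sketch `Customer` carries no attribute but is still reached through `Order`, which mirrors the recursive code-gen over the object graph described above.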
The design for the generated source focuses mainly on performance gains and extensibility to existing JsonSerializer functionality. Performance is improved in two ways. The first is during the first-time/warm-up performance for both CPU and memory by avoiding costly reflection to build up a Type metadata cache during runtime and moving it to compile time. These type metadata are then represented as JsonTypeInfo classes that can be used for (de)serialization at runtime. The second is throughput improvement by avoiding the initial metadata-dictionary lookup on calls to the serializer by generating an instance of the type's JsonTypeInfo (metadata). These instances will be passed to new (de)serialize overloads.
We will use the types discovered in the type discovery phase and recurse through the type graph in order to source generate the functions mentioned above within each JsonTypeInfo and register them inside the user-facing wrapper JsonSerializerContext.
There are discussions regarding integration of generated metadata source code with user apps. The proposed approach consists of the generator creating a context class (JsonSerializerContext) which takes an options instance and contains references to the generated JsonTypeInfos for each type seen above. This relies on the creation of new overloads to the current serializer mentioned before that can be retrieved from the context. An example of the overload and usage can be seen here, while examples and details of the end to end approach can be seen in the design document.
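As a rough illustration of the context-plus-overload idea, the sketch below uses the API shape that later shipped with the System.Text.Json source generator in .NET 6, which may differ in detail from the experimental design discussed here (for instance, the attribute is placed on the context class rather than on the serializable type). `WeatherForecast` and `AppJsonContext` are hypothetical names, and the example assumes a project where the source generator runs.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

public class WeatherForecast
{
    public string Summary { get; set; }
    public int TemperatureC { get; set; }
}

// The source generator supplies the rest of this partial class at compile time,
// including a JsonTypeInfo<WeatherForecast> property named after the type.
[JsonSerializable(typeof(WeatherForecast))]
internal partial class AppJsonContext : JsonSerializerContext
{
}

public static class Program
{
    public static string RoundTrip()
    {
        var forecast = new WeatherForecast { Summary = "Mild", TemperatureC = 21 };

        // Overloads that take JsonTypeInfo<T>: the metadata is handed in directly,
        // so the serializer does not have to build it with reflection.
        string json = JsonSerializer.Serialize(forecast, AppJsonContext.Default.WeatherForecast);
        var copy = JsonSerializer.Deserialize(json, AppJsonContext.Default.WeatherForecast);

        return copy.Summary + ":" + copy.TemperatureC;
    }

    public static void Main() => Console.WriteLine(RoundTrip());
}
```

Passing `AppJsonContext.Default.WeatherForecast` at the call site is what makes the usage explicit: the caller names exactly which generated metadata it needs.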
Progress of this effort can be observed through the JSON Code Gen project board in dotnet/runtimelab.
The source generator (System.Text.Json.SourceGeneration.dll) and updated System.Text.Json.dll can be consumed via an experimental NuGet package. Issues can be logged at https://github.com/dotnet/runtimelab/issues/new?labels=area-JsonCodeGen with the area-JsonCodeGen label.
cc @jkotas @davidfowl @stephentoub @mjsabby @terrajobst @pranavkm @ericstj @layomia @steveharter @chsienki
In theory, any startup-only reflection/delegate initialization can be done AOT. Popular scenarios include:
Please consider building some infrastructure to let the library provide AOT generation.
Custom converter support for AOT serialization is also important.
The existing design depends on either manual storage of the JsonSerializerOptions class (e.g. held by your own static variable) or on using the default instance, which is held in a private static variable. Using the global ensures the options are not re-initialized unnecessarily.
However, there is a first-time perf hit of initializing the options for each new Type encountered; this involves using reflection to look up the properties and various attributes.
See issue https://github.com/dotnet/runtime/issues/1562, which could be used to help facilitate custom converters per POCO type and collection type. For performance, these will likely be generated IL (run-time or ahead-of-time) and/or Roslyn-generated source, pending requirements/design. This wouldn't require the reflection hit.
Cool to see progress on this @kevinwkt @layomia. For folks interested please check out @kevinwkt's post above and the links to work going on.
How would this be consumed by non C# languages?
What about XML serializers at build time?
What about XML serializers at build time?
It is a very interesting scenario. PowerShell uses XML serializers in remoting and it could benefit from this greatly.
This raises the question of whether we should have the same design (and code base) for Json and XML serializers.
How would this be consumed by non C# languages?
There's some mention of that here, but I don't think it's yet implemented. cc @chsienki
I believe there's already a tool that can generate build-time XML serializers cc @mconnew. https://docs.microsoft.com/en-us/dotnet/core/additional-tools/xml-serializer-generator
Certainly Source Generators are a better fit for this task than the current tool. I think that's a separate issue. Feel free to open it. Probably related to https://github.com/dotnet/runtime/issues/630.
While it seems attractive to have a common source-generator architecture for many different serialization technologies, I suspect that they all have significantly different metadata and extension points that would make full sharing impractical. It's definitely something we should be looking for. @davidfowl was mentioning to me the other day the similarities in the task of emitting source generators for Configuration binding, MVC, and JSON. We'll likely find common code to share (e.g. a Reflection abstraction over Roslyn), common patterns, and potentially common architecture. We should definitely think about this in our architecture and look for those opportunities to share code or build common primitives.
@iSazonov Does PowerShell make use of the compiler at runtime such that this would be valuable for user types, or were you thinking more about PowerShell/BCL types?
I believe there's already a tool that can generate build-time XML serializers
Yes, I know about Xml Serializer Generator (sgen.exe) but it doesn't work in half of our cases. I see Source Generators as a wonderful opportunity to unify some way of generating things. I agree that XML serializers should be discussed in a separate issue.
@iSazonov Does powershell make use of the compiler at runtime such that this would be valuable for user types or were you thinking more about PowerShell/BCL types?
I hope @PaulHigin could share more info about how PowerShell uses XML serializations and maybe share thoughts about using source generators in the area.
I know PowerShell has predefined serializers for some well-known types. I believe we could extend this list if we had XML source generators. We could also expose the feature in the PowerShell SDK so that users can benefit from it in their PowerShell hosting applications and get better remoting performance.
PowerShell remoting serialization is custom, and based around the PSObject type and PowerShell conversion routines. I doubt we would use .NET serializers except perhaps in special cases.
There are some design decisions being made here which experience with XmlSerializer has shown are going to be limiting. There are also some small performance implications with the current design which will affect initial load time. Generating the serializer source to be compiled into the project assembly isn't a good idea for multiple reasons.
This has already been mentioned, but re-iterating for completeness.
The pre-generated serializer is owned by and situated with the code which needs an instance of the serializer. If that owning code is a library and the type to be serialized is in a different library (e.g. via nuget) and that library is upgraded to a later version by another dependency in an application, then you might have a breaking mismatch between pre-generated serializer and implementation. There are only 2 correct owners of the pre-generated serializer. Either the parent application that's being developed, or the library where the type lives. They are the only ones which can ensure the generated serializer and the type implementation match. The current design encourages intermediary libraries to create and ship their own version which is fragile. You could mitigate the ambiguity by recording the full assembly name of the serializable type name with the implementation and hopefully the assembly version would be different in a later version of the library. If they don't match, then only use a runtime generated serializer. One problem that XmlSerializer has had and that we still don't have a solution for is detecting a version mismatch mistake at deployment time and then the application has a perf regression due to falling back to runtime generation due to mismatched versions. Having a flag on the serializer itself to ask if a pre-generated serializer is being used might be a worthwhile feature to add.
An application could depend on 2 libraries which each have their own pre-generated serializers for the same type. This causes multiple problems. It increases the memory usage as both library assemblies would have their own implementation. I didn't dig in too much how this is integrated into the Json serializer, but you will have one of two problems depending on how it's implemented. If it's an explicit usage and both libraries do need to serialize at runtime, you JIT two different copies, so you might actually use more CPU overall as well as more RAM. If it's wired up implicitly, the serializer has two possibilities. It could find both, in which case that could be an exception. Or it might pick the first it finds, and if they are based off of two different versions of the type being implemented, you could have different outputs, and it would be non-deterministic which gets picked.
As the serializer is compiled into the same assembly as the consumer, in the case of the application generating the serializer, it increases the size on disk of the application assembly which increases startup load time. In the case of the application not owning the type, there's the question of where to apply the JsonSerializableAttribute. A natural location would be to apply it as an assembly attribute. As the type would be referenced by a constructor parameter, I believe it will trigger jitting the type at application load (I could be wrong here, but that's my understanding) which would further increase startup time. I don't know if the code generator has the ability to modify existing code but if it does, removing the attribute would mitigate this.
Take the scenario where a library has a code path which uses the Json serializer but isn't used in all scenarios. The memory cost of having the serializer code loaded into memory has been paid even when an application using the library doesn't need that scenario. You always pay the cost of the code being loaded. If my assumption is right about assembly attributes, this could also cause extra load time to jit types which might never get used.
Having a model of the pre-generated serializer in an external separate assembly and dynamically loading that assembly in a similar way to how XmlSerializer does this would resolve all these problems.
PowerShell remoting serialization is custom, and based around the PSObject type and PowerShell conversion routines. I doubt we would use .NET serializers except perhaps in special cases.
@PaulHigin Yes, it will be _special cases_. I believe not as an exception but as a great feature for implementing convenient (and high performance) PowerShell remoting. Currently PowerShell generates "proxy" objects based on PSObject. The use of such objects imposes many restrictions: loss of functionality and performance. If a PowerShell module defines a CustomType and that module is present on both the remote and local side, then that module could restore that CustomType as a live object and not a proxy and get all the benefits of it. I guess some applications (like Exchange Server Management Shell (EMS)) already do something like this, but with source generators we could enhance the PowerShell Engine and SDK and simplify implementing such scenarios for developers.
Having a model of the pre-generated serializer in an external separate assembly and dynamically loading that assembly in a similar way to how XmlSerializer does this would resolve all these problems.
This sounds like a plug-in model. But do we really need a separate assembly? It seems to complicate the dev process. A source analyzer acting as a plug-in could have a unique identifier (Guid), and XmlSerializer could check whether the local and remote sides use the same plug-in and then use it, otherwise fall back to the runtime serializer.
@mconnew thanks for your feedback.
Firstly I'll note that we are not solely generating "serializers" here. The original proposal mentioned generating serializers, but the core of the generation (at compile time) is for type metadata (i.e. reflection independent funcs to member accessors; property/field names etc. that allow us to (de)serialize JSON). This is where the expected wins for start-up time and reduced private bytes come from (due to avoiding object-graph walkthrough and not caching Reflection.Emit output at run-time).
When there is limited use of JsonSerializerOptions and/or System.Text.Json.Json*Attribute configurations, the generator will also generate serialization/deserialization logic which comes with throughput wins due to avoiding multiple layers of indirection in the regular serializer implementation. Otherwise, there'll still be a start-up win due to pre-generated metadata, but throughput should stay the same since there'll be a fallback to the regular serialization codepaths.
Re: Incorrect ownership of pre-generated serializers
I agree that various assemblies of a compilation should not maintain duplicate copies of the generated metadata. Even with linker trimming to ensure only what is being used is being kept, this redundancy aggravates other issues such as version mismatches between types and their generated metadata. The two strategies mentioned make sense to me:
Generate metadata in a shared separate assembly.
In this scenario, we'd want to avoid dynamically loading the new assembly. I believe the dynamic load would work against the goal to improve start-up/first-time serialization perf. We'd effectively avoid reflection to walk-through the type object graphs at run-time, only to introduce reflection to load new types dynamically. It could be the case that this assembly load is negligible compared to other work being done, but it would be good to avoid it nonetheless. Did you find dynamic assembly loading to be an issue with XML serializer pre-generation?
FYI the generated code being added to the application assembly is the only supported location as of right now using the C# source generators infrastructure. cc @chsienki, if generation to a new assembly is considered in C# source generators, this would be a use case where it would be great for the generated assembly not to have to be loaded dynamically at run-time.
Generate metadata only in parent application that kicked off the compilation
I need to verify locally that metadata generated in an application assembly is available to external libraries in the same compilation. If so, this is a likely direction to go (requires no additional work from Roslyn).
We do plan to record version information on the cached metadata so that we can detect if it is no longer compatible with the type it represents. This mechanism will also help prevent invoking stale versions of the metadata which are known to have bugs. In these cases, rather than falling back to the dynamic code-paths of the serializer, would it be reasonable to throw an exception so that app authors would know that they should recompile/regenerate?
Re: Multiple versions of pre-generated serializers
didn't dig in too much how this is integrated into the Json serializer
We employ an explicit model where the serialization method is passed to the serializer via new overloads on the (de)serialization methods on JsonSerializer (see overloads that take JsonTypeInfo<T> here).
Following from my notes above, I believe having a central/shared location for generated metadata alleviates concerns about increased CPU and RAM usage. The serialization behavior would also be deterministic since there'd be only one metadata pool to choose from.
Code generation aside, I don't think we can avoid the problem of having multiple handwritten/hand-cached metadata classes in a compilation due to the JSON metadata infrastructure being made public.
Re: Greater application startup time
there's the question of where to apply the JsonSerializableAttribute. A natural location would be to apply it as an assembly attribute. As the type would be referenced by a constructor parameter, I believe it will trigger jitting the type at application load (I could be wrong here, but that's my understanding) which would further increase startup time.
I have recently made a PR for the attribute to be applied on a module, for both owned and non-owned types. This will be changed to assembly shortly.
If there are concerns around jitting the type at application load (cc @steveharter/@jkotas) there are perhaps a few options:
AdditionalFiles in the input GeneratorExecutionContext to the generator).
JsonSerializer.(De)Serialize*<T>.
Re: Higher memory usage even when not needed
Take the scenario where a library has a code path which uses the Json serializer but isn't used in all scenarios. The memory cost of having the serializer code loaded into memory has been paid even when an application using the library doesn't need that scenario. You always pay the cost of the code being loaded. If my assumption is right about assembly attributes, this could also cause extra load time to jit types which might never get used.
When we have static usages of the generated metadata, e.g. a caller using a method which takes the specific metadata needed to serialize a type, then it is straightforward for the ILLinker to trim generated code which is not used.
When we have dynamic usages of the generated metadata (e.g. it is not known what type is being passed to the serializer), then an overload which takes the generated context class (and has a method to return the correct type metadata based on a type) may be called. In these cases, the linker is likely to preserve all the generated type metadata since it is not clear which will be needed at run-time.
I think the linking characteristics here provide the right trade-off given static vs dynamic usage of the generated metadata.
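The difference between the two kinds of call site can be sketched as follows. This again borrows the API shape that later shipped in .NET 6 and may not match the experimental overloads under discussion; `Invoice`, `Receipt`, and `BillingJsonContext` are hypothetical names, and the example assumes the source generator runs in the project.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

public class Invoice { public int Id { get; set; } }
public class Receipt { public int Id { get; set; } }

[JsonSerializable(typeof(Invoice))]
[JsonSerializable(typeof(Receipt))]
internal partial class BillingJsonContext : JsonSerializerContext
{
}

public static class Program
{
    // Static usage: the call site names Invoice's metadata directly, so a linker
    // can see that Receipt's generated metadata is not needed on this path.
    public static string SerializeStatic(Invoice invoice) =>
        JsonSerializer.Serialize(invoice, BillingJsonContext.Default.Invoice);

    // Dynamic usage: the type is only known at run time, so the whole context is
    // passed and it resolves the matching metadata from the Type.
    public static string SerializeDynamic(object value) =>
        JsonSerializer.Serialize(value, value.GetType(), BillingJsonContext.Default);

    public static void Main()
    {
        Console.WriteLine(SerializeStatic(new Invoice { Id = 7 }));
        Console.WriteLine(SerializeDynamic(new Receipt { Id = 9 }));
    }
}
```

With the dynamic form, nothing at the call site tells the linker which types will flow through, which is why all metadata reachable from the context is likely to be preserved.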
@layomia, we only care about the root class when loading a serializer from a separate assembly. Basically here are the steps we go through. When a serializer for type Foo is created, we first check if we already have one in our serializer cache. If not, we get the assembly name for type Foo and look for an assembly with the name FooAssemblyName.XmlSerializer.dll and load it. This assembly has all the serializers that were generated for types where the root class type was in the same assembly. There's a static method we call in this assembly which I think returns us a dictionary of serializers to populate our internal cache. We then check that cache again for an appropriate serializer and if it's not there, we dynamically generate one. So there's very little reflection that actually happens, and we take the hit of the assembly load at first use and not process startup. We have a few other checks such as versioning but that's the gist of it. The assembly load is always significantly quicker than all the reflection code that goes into generating one at runtime. We don't need to walk the entire object graph for each serializer as we only locate the pre-generated assembly by the root type and it has all the types already in code to populate the cache.
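The lookup order described above (cache, then companion assembly, then run-time generation) can be sketched like this. It is not the XmlSerializer implementation: `SerializerCache` and both delegates are hypothetical, the assembly load is abstracted behind a delegate so the sketch stays runnable, versioning checks and thread safety are omitted, and the companion name simply follows the naming given in the comment.

```csharp
using System;
using System.Collections.Generic;

public sealed class SerializerCache
{
    private readonly Dictionary<Type, object> _cache = new Dictionary<Type, object>();

    // Hypothetical: loads the companion assembly by name and returns its
    // pre-generated serializers, or null when no such assembly exists.
    private readonly Func<string, IReadOnlyDictionary<Type, object>> _loadCompanion;

    // Hypothetical: the slow path that builds a serializer with reflection.
    private readonly Func<Type, object> _generateAtRuntime;

    public SerializerCache(
        Func<string, IReadOnlyDictionary<Type, object>> loadCompanion,
        Func<Type, object> generateAtRuntime)
    {
        _loadCompanion = loadCompanion;
        _generateAtRuntime = generateAtRuntime;
    }

    public object GetSerializer(Type rootType)
    {
        // 1. Check the cache first.
        if (_cache.TryGetValue(rootType, out object cached))
            return cached;

        // 2. Locate the companion assembly from the root type's assembly name only.
        string companionName = rootType.Assembly.GetName().Name + ".XmlSerializer";
        IReadOnlyDictionary<Type, object> pregenerated = _loadCompanion(companionName);
        if (pregenerated != null)
        {
            foreach (KeyValuePair<Type, object> pair in pregenerated)
                _cache[pair.Key] = pair.Value;
        }

        // 3. Check the cache again; fall back to run-time generation if still missing.
        if (!_cache.TryGetValue(rootType, out object serializer))
        {
            serializer = _generateAtRuntime(rootType);
            _cache[rootType] = serializer;
        }
        return serializer;
    }
}
```

Note that the cost of the companion load is paid on the first request for a serializer, not at process startup, which matches the behaviour described in the comment.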
Generating in a separate assembly allows greater flexibility with your build process, and when combined with the IL Linker, you can negate the extra assembly load for a published app. There's also the non-trivial cost of the edit/compile/test loop being made a lot slower if you are doing code generation every compile for a type which is in a library you aren't currently compiling. To avoid this, you need some way of detecting that the types you are generating for haven't been modified since the last time you generated the code. You can't base it on source file timestamp as you might not have the source code for the types.
FYI the generated code being added to the application assembly is the only supported location as of right now using the C# source generators infrastructure.
How would a library such as the Azure SDK use pre-generated serializers? If it's only created in the application assembly, does that mean a library can't use it because it needs to be explicitly used and you can't rely on the application generating them? How does this work with the scenario where an application with a Main entrypoint method can also be used as a library? It would have the generated serializer baked in, but would it be able to use it when loaded as a library?
In these cases, rather than falling back to the dynamic code-paths of the serializer, would it be reasonable to throw an exception so that app authors would know that they should recompile/regenerate?
My preference would be not to throw because you can still function, just in a degraded case. I don't think it's the right thing to do to automatically turn a performance degradation scenario into a failing scenario. Maybe add an option either as a parameter on a method/constructor or as you have a settings class, on that which specifies that you must use the pre-generated implementation. Or maybe have that as a property on your attribute which triggers the code gen which can specify that it's an exception warranting error if there's a mismatch. There's also the issue of discoverability after an app is running, e.g. on a customer computer where they don't have the source. Maybe an environment variable or an app setting to trigger exceptions on failure to use pre-generated serializer?
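One way to picture the opt-in behaviour suggested above is a policy that defaults to falling back and only throws when asked to. This is purely a sketch of the idea: `MismatchPolicy`, `PregeneratedGuard`, and the environment variable name are all hypothetical and are not part of any proposed API.

```csharp
using System;

public enum MismatchPolicy
{
    FallBackToRuntime, // degrade silently to the reflection-based path
    Throw              // treat a stale pre-generated serializer as an error
}

public static class PregeneratedGuard
{
    // Hypothetical environment variable that lets an operator force strict mode
    // on a deployed app without access to its source.
    public const string StrictEnvVar = "SKETCH_JSON_REQUIRE_PREGENERATED";

    public static MismatchPolicy ResolvePolicy(MismatchPolicy configured)
    {
        string value = Environment.GetEnvironmentVariable(StrictEnvVar);
        return value == "1" ? MismatchPolicy.Throw : configured;
    }

    // Returns true when the pre-generated code matches the type's assembly version.
    // The result can also back a "was pre-generated code used?" diagnostic flag.
    public static bool ShouldUsePregenerated(
        Version typeAssemblyVersion, Version recordedVersion, MismatchPolicy policy)
    {
        if (typeAssemblyVersion == recordedVersion)
            return true;

        if (policy == MismatchPolicy.Throw)
            throw new InvalidOperationException(
                "Pre-generated serializer was built against version " + recordedVersion +
                " but the loaded type is version " + typeAssemblyVersion + ".");

        return false; // caller falls back to run-time generation
    }
}
```

Keeping the default at `FallBackToRuntime` preserves the "still functions, just degraded" behaviour, while the strict mode covers the discoverability concern.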
@mconnew again thank you for your insights and experience here.
One note, as @layomia also pointed out, is that the current code-gen is not about generating self-contained "serializers" but about generating metadata and callbacks:
This achieves the primary goals of fast startup and minimizing private bytes, both by avoiding reflection and reflection emit. A secondary goal of increased throughput is met when the generated callbacks for serialize()/deserialize() can be used. Another secondary goal is to support the ILLinker in reducing the size of STJ.dll.
Incorrect ownership of pre-generated serializer
Multiple versions of pre-generated serializers
The current design and constraints of Roslyn source generators mean:
Greater application startup time
Higher memory usage even when not needed
The [JsonSerializable] attribute is only used during ahead-of-time code generation, and not at run-time. There is no "global" assembly walk at run-time of all types that have [JsonSerializable].
The new "context class" programming model is pay-to-play, meaning generated code is directly called at run-time for each type, and thus only that code is JITted. If there are 1,000 generated types, for example, only the ones accessed at run-time (by calling the appropriate member on the context class) should be JITted (along with any dependent generated types). This lazy JIT assumption should of course be verified.
@steveharter @layomia
Can we update the issue description to make sure this item tracks both perf improvements and trimming? IOW, we need to make it clear that the path towards making JSON serializable types trimmable is via source generation.
Thanks @terrajobst. I've added notes about goals to facilitate more trimming (removing unused converters, reflection code-paths) and be linker friendly (due to avoiding run-time reflection) and action items (https://github.com/dotnet/runtime/issues/36782, https://github.com/dotnet/runtimelab/projects/1#card-49468644).
This issue was originally created to track multiple goals achievable with AOT source generation including:
I created https://github.com/dotnet/runtime/issues/45441 to track the user story "Developers can safely trim their apps which use System.Text.Json to reduce the size of their apps" which depends on the work in this issue.
@layomia we're trying to title all our User Stories in terms of customer benefit (WHO gets WHAT), so we focus on the result we are aiming for. Stories can depend on each other, but the actual work is in the issues parented by the stories.
So your title for #45441 is perfect, and I've retitled this story in that format as well. Feel free to adjust.
@layomia looking a little more here, I think this user story is missing the child issues that encompass the work required to achieve it. We should have issues for the various parts of the source generator work -- I assume you have an idea what those parts are, you will want to create them at some point and parent them under this story.
I suggest something like
Developers apps using JSON serialization start up and run faster #1568 (User Story)
|---------- JSON source generator (just issue)
| |---------------
| |-------------- various issues breaking up the work for the source generator
|
Developers can safely trim their apps which use System.Text.Json to reduce the size of their apps #45441 (User Story)
|----------JSON source generator (same issue as above - it has two parents)
|----------- etc.
does that seem reasonable?
Thanks @danmosemsft that makes sense. I created https://github.com/dotnet/runtime/issues/45448 (to be further fleshed out) to track the source generation work items which should satisfy these user stories