WPF: Startup cost of XamlServices parsing is 27% slower on .NET Core than on .NET Framework

Created on 6 Dec 2018 · 38 Comments · Source: dotnet/wpf

  • .NET Core Version: 3.0 Preview 1
  • Windows version: 1803
  • Does the bug reproduce also in WPF for .NET Framework 4.8?: No

Problem description:
I expected/hoped that, with the XmlReader hopefully being Span<T>-based, the XAML parser would be faster in .NET Core than the .NET Framework equivalent. However, that's not what I'm seeing with this benchmark:

    public class BenchmarkTests
    {
        static readonly string xamlString = @"<MyObject xmlns=""clr-namespace:XamlBenchmark;assembly=XamlBenchmark"" StringProperty=""Hello World"" Int32Property=""1234"" DoubleProperty=""123.4567890"" FloatProperty=""-0.9876"" />";

        [Benchmark]
        public object Test1()
        {
            object instance = System.Xaml.XamlServices.Parse(xamlString);
            return instance;
        }
    }
    public class MyObject
    {
        public string StringProperty { get; set; }
        public int Int32Property { get; set; }
        public double DoubleProperty { get; set; }
        public float FloatProperty { get; set; }
    }

Here are the results:


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.1 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1620 v3 3.50GHz, 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3190.0
  Job-FQKGZY : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3190.0
  Job-YTBJBT : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT


| Method | Runtime | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
|------- |-------- |-------------- |---------:|---------:|---------:|------:|--------:|
| Test1 | Clr | net472 | 363.3 us | 7.139 us | 7.639 us | 1.00 | 0.00 |
| Test1 | Core | netcoreapp3.0 | 461.6 us | 9.078 us | 8.048 us | 1.27 | 0.03 |

issue-type-bug rank20 regression tenet-performance

Most helpful comment

I love seeing this kind of issue. Great to see the focus on fundamentals.

All 38 comments

Thanks @dotMorten. This is likely to be expected, as Core is not currently ngened. We'll address before Core3 ships.

cc @brianrob @billwert

Baiting @benaadams @AndreyAkinshin if they have interest to look for low-hanging fruit here 😉

Also looking at allocations by running this parse method many many times, I see this hit over 40 times per parsing:
[image]

And here's some of the CPU analysis:
[image]

Loading the assemblies' custom attributes takes a lot of time: the more assemblies that are loaded, the longer it takes. If you create a simple hello-world app that tests this, .NET Framework loads 5 assemblies but .NET Core 3.0 preview loads 23 assemblies.
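A quick, self-contained way to see how many assemblies a given process has pulled in (a sketch of the kind of check behind the counts above, not the exact tool used):

```csharp
using System;

static class LoadedAssemblyCount
{
    static void Main()
    {
        // AppDomain.CurrentDomain.GetAssemblies() returns every assembly
        // currently loaded into the process; the count grows as code paths
        // (like XAML parsing) force more assemblies in.
        var assemblies = AppDomain.CurrentDomain.GetAssemblies();
        Console.WriteLine($"Loaded assemblies: {assemblies.Length}");
        foreach (var asm in assemblies)
            Console.WriteLine(asm.GetName().Name);
    }
}
```

Running this right before and right after the first `XamlServices.Parse` call makes the difference between the two runtimes visible.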

There's a XamlSchemaContext that could perhaps be reused since it's the class that loads all assemblies.

EDIT: If a XamlSchemaContext is used, .NET Core 3.0 preview is faster than .NET Framework 4.7.2 (tested 64-bit)

I love seeing this kind of issue. Great to see the focus on fundamentals.

Re GetCustomAttributes, it was improved by @nickcraver in https://github.com/dotnet/coreclr/pull/20779 but that change would already be in preview 1.

@danmosemsft It's also 24 loaded assemblies (.NET Core 3.0 latest daily build) vs 5 loaded assemblies (.NET Framework 4.7.2).

If you patch System.Xaml to only check the first 5 returned assemblies (.NET Framework only loads 5 assemblies), then .NET Core 3.0 code executes faster.

So it seems the 5x number of assemblies is causing the slowdown. It has to execute more code.

Original .NET Core 3.0 daily build:

Name | Exc % | Exc | Inc % | Inc
-- | -- | -- | -- | --
module system.xaml <> | 24.0 | 1,073 | 91.6 | 4,090
module system.private.corelib <> | 14.8 | 662 | 36.3 | 1,619
module coreclr <> | 10.3 | 462 | 10.6 | 472
module clrjit <> | 5.4 | 239 | 5.4 | 243
module coreclr <> | 5.4 | 239 | 5.7 | 253
module coreclr <> | 5.2 | 234 | 5.2 | 234
module coreclr <> | 5.2 | 231 | 5.6 | 248
module system.private.corelib <> | 4.9 | 219 | 22.9 | 1,023
module coreclr <> | 4.1 | 183 | 5.4 | 241
module coreclr <> | 3.4 | 152 | 4.0 | 178
module ntdll <> | 2.6 | 116 | 2.6 | 116
module coreclr <> | 2.6 | 116 | 2.8 | 127

Finally had a chance to look into this. Really great to see folks engaged here.

I was able to reproduce the difference and captured traces of 10K iterations of the parse operation to compare.

As was mentioned above, the biggest issue that I see is significantly higher cost in and caused
by Attribute.GetCustomAttributes. The higher cost shows up in both CPU usage in Attribute.GetCustomAttributes and in allocation helpers that are zeroing memory to support increased allocations in this path.

The biggest CPU usage differences are:

  • Attribute.GetCustomAttributes: 560ms on Core vs. 265ms on Desktop = +111%
  • Allocation helpers (JIT_New and JIT_NewArr1): 907ms on Core vs. 597ms on Desktop = +52%

Breaking down the allocations that contribute to the additional allocation helper CPU (as well as memory cost), these are the top 3 allocators:

  • System.Reflection.CustomAttributeRecord[]
  • System.Xaml.MS.Impl.XmlNsInfo
  • Node[System.Reflection.Assembly.System.Xaml.MS.Impl.XmlNsInfo]

The first step here should be for this to be investigated from the reflection angle. @joshfree, can you get eyes on this from a reflection expert? I can share traces offline and help ramp someone up on the data.

Having taken some steps here already, it definitely seems like we can eliminate CustomAttributeRecord[] in some way. These are allocated only to be iterated in the caller in all cases, but the way they're populated is non-trivial to change. I have a branch locally that tried to eliminate these but it creates some very odd null refs in ways I didn't readily understand (but someone else may quickly see why I'm an idiot there).

If you're curious, this is where they're generated for a great many attribute code paths.

@brianrob

Attribute.GetCustomAttributes: 560ms on Core vs. 265ms on Desktop = +111%

This is probably because .NET Core loads 24 assemblies but .NET Framework only loads 5 assemblies. The code gets custom attributes of all loaded assemblies.

@0xd4d, you're right - the iteration over all assemblies happens here: https://github.com/dotnet/wpf/blob/08d1133388dac9f033e16bb7309d0d2d409845a2/src/Microsoft.DotNet.Wpf/src/System.Xaml/System/Xaml/XamlSchemaContext.cs#L1125

As @NickCraver points out, there is possibly some optimization that can happen in the reflection path, but it is also true that there may need to be some work done in Xaml parsing to either limit the number of assemblies that get processed, or cache the results. Without a clear understanding of the WPF technology, it's not clear to me what should be done here. @fabiant3, are you the right person to look at this?

cc @steveharter for Reflection

18k calls from Parse -> XamlXmlReader.Initialize -> NodeStreamSorter.ctor -> StartNewNodeStreamWithSettingsPreamble become 14M calls to FilterCustomAttributeRecord, and that's 44% of the time for parsing.

[image]

[image]

or cache the results.

XamlSchemaContext has 10 ConcurrentDictionary instances (per-instance rather than static), so I wonder if it had caching in mind?

The main cost for parsing (64%) happens under System.Xaml.XamlSchemaContext.TryGetCompatibleXamlNamespace(String, out String), where it's still the same number of calls as Parse (18k in this case) and then begins to fan out.

Would it be valid to make those ConcurrentDictionaries static?

Sounds like the context is quite flexible on usage:

/// SchemaContexts can potentially be shared between multiple callers in an AppDomain, including
/// both full and partial trust callers. To be safe for sharing, the default implementation should
/// be idempotent and order-independent--i.e. functionally immutable.

Though I don't know if lookup should be unique in some way (e.g. different Xaml files having explicitly different contexts)

There are multiple things we can still do in reflection and I'd like to get to them, but I second @benaadams's point that the actual solution lies above reflection. The problem with reflection performance optimizations is that they hit a hard wall with contracts: they must return a new attribute set on each call; we cannot return the same instances (this would break all sorts of existing code).

The only way we can cache instances (after deciding it's valid to do so) is well above the layer those allocations happen at, e.g. in a lookup dictionary such as those in play here.
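To illustrate that contract, here's a minimal sketch (the `MarkerAttribute`/`Target` types are made up for the demo) showing that `Attribute.GetCustomAttributes` hands back a fresh array and fresh attribute instances on every call, which is why any caching has to live in the caller:

```csharp
using System;

[AttributeUsage(AttributeTargets.Class)]
sealed class MarkerAttribute : Attribute { }

[Marker]
sealed class Target { }

static class Program
{
    static void Main()
    {
        // Each call allocates a new array and new attribute instances,
        // per the reflection contract -- identical calls never share results.
        Attribute[] first = Attribute.GetCustomAttributes(typeof(Target));
        Attribute[] second = Attribute.GetCustomAttributes(typeof(Target));

        Console.WriteLine(ReferenceEquals(first, second));       // False
        Console.WriteLine(ReferenceEquals(first[0], second[0])); // False
    }
}
```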

I just wanted to add here: in doing some benchmarking we discovered that while my optimizations went into 3.0 Preview 1, some steps backwards also landed in there. Details are in https://github.com/dotnet/coreclr/issues/21456 and a PR to fix that regression is in https://github.com/dotnet/coreclr/pull/21462. That should help out here, but only marginally compared to not actually calling this path more than once as proposed above.

@steveharter can you assist in the investigation from the coreclr/Reflection side of the house?

Sure, I'll take a look ASAP at the reflection side.

Moving this to 3.1. The discussion has targeted the initialization of XamlSchemaContext within XamlServices.Parse (et al.), which examines more assemblies than it did in NETFx. I don't think it's critical to fix this for 3.0, because

  1. it only affects parse/load of loose XAML
  2. it's an initialization expense, whose relative effect diminishes as the string being parsed grows longer or more complex
  3. it's already possible for an app to mitigate this

In short, the effect will be felt in practice only by apps that load lots of small XAML snippets. This isn't the mainstream case, and such apps have a workaround.

Details supporting these claims.

All apps load BAML, which uses a built-in SchemaContext. But relatively few apps load loose XAML via explicit calls to XamlServices.Parse et al. At least that's my impression from looking at apps that have come my way (largely through bug reports) - we don't have telemetry for this.

The 27% figure comes from parsing a very simple string (one XAML tag); a real app typically parses longer strings, and will see less of a difference. I tried strings with a few hundred tags, and the difference was not worth noting.

Most of the XamlServices methods call new XamlSchemaContext(), but an app can avoid the initialization expense by using the one method that doesn't. Replace XamlServices.Parse(mystring) with

            StringReader stringReader = new StringReader(mystring);
            using (XmlReader xmlReader = XmlReader.Create(stringReader))
            {
                XamlXmlReader xamlReader = new XamlXmlReader(xmlReader, CachedSchemaContext);
                return XamlServices.Load(xamlReader);
            }

where CachedSchemaContext is a XamlSchemaContext that the app creates once and re-uses every time it calls Parse. (Similarly for XamlServices.Load(stream) et al.)

An app that does this is responsible for releasing the cached context when it's no longer needed, if that's relevant to the overall behavior. This is probably not an issue, except perhaps for apps with complex interaction with AppDomain, or similar advanced architecture.

I've tried building this idea into XamlServices itself. The early results are good, but it needs more investigation. For example, XamlServices can't know when to release the cached context, and I don't fully understand the consequences of keeping it alive "forever".

and I don't fully understand the consequences of keeping it alive "forever"

If that's the chosen path, the cache might need weak references (making it even more expensive) so it doesn't block unloadable assemblies "forever"; otherwise it would be a memory leak for apps with reloadable plugin architectures, or apps emitting generated code into collectible assemblies at runtime. We are doing the latter on desktop. I haven't used AssemblyLoadContext in Core yet, so I don't entirely know the implications, but it's something to keep in mind.
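One way to get weak-key behavior without hand-rolled WeakReference bookkeeping is a `ConditionalWeakTable` keyed by the `Assembly`: entries live only as long as the key is otherwise reachable, so a collectible AssemblyLoadContext can still unload. A sketch, with the caveat that all names here are hypothetical stand-ins, not WPF APIs:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

// Hypothetical per-assembly cache standing in for the expensive
// custom-attribute scan that XmlNsInfo performs.
sealed class NamespaceInfoCache
{
    sealed class Entry { public string[] Namespaces = Array.Empty<string>(); }

    readonly ConditionalWeakTable<Assembly, Entry> _cache = new();

    public string[] GetNamespaces(Assembly assembly)
    {
        // GetValue computes the entry once per live assembly, then reuses it;
        // when the assembly becomes unreachable, its entry is collected too.
        Entry entry = _cache.GetValue(assembly, static asm => new Entry
        {
            Namespaces = new[] { asm.GetName().Name ?? string.Empty }
        });
        return entry.Namespaces;
    }
}
```

The trade-off the comment above alludes to: `ConditionalWeakTable` lookups are costlier than a plain dictionary hit, so this only pays off when the cached computation is expensive, as the attribute scan is here.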

Changed the title to reflect the real issue.

Executing HTTP requests in a WPF command is much slower on .NET Core; I don't know what causes this, but it works fine in .NET Framework. (I'm using Refit, btw.)

Not sure if it's a WPF problem, a .NET Core problem, or a Refit problem... https://github.com/dotnet/corefx/issues/23401

EDIT:

SOLVED. See this comment https://github.com/dotnet/corefx/issues/23401#issuecomment-561671803

For reference: @bugproof comment is unrelated to this issue or to WPF. (.NET Core implements HTTPRequest differently from .NET Framework.)

Just FYI it is issues like this that have caused us to stop our port to netcore. I had hoped performance would be better. I hope these issues continue to see activity and improvement.

@rolandh - We determined that this issue only affects apps that load lots of small XAML snippets using XamlServices.Parse. Is that your case? (We believe it's not all that common, but we don't have any hard evidence to either support or refute that belief.)

Please continue to open issues, or upvote existing issues, about specific perf concerns. Comments about "issues like this" are not really actionable, as no one knows what you're referring to specifically.

I haven't experienced any issues, as we haven't ported to .NET Core yet. I was just saying that seeing multiple issues like this has meant that we won't be considering porting over from .NET 4.7 in the short term. I've seen quite a few benchmarks showing slowdowns, and I haven't seen any showing performance increases in WPF yet; when I do, I'll look at porting. I thought that might be good information for MS on how to prioritise issues and development.

I know it's not an actionable comment for you, but it might be for someone higher up who wants to see a higher percentage of people port their applications.

I was just saying that seeing multiple issues like this has meant that we won't be considering porting over from .NET 4.7 in the short term.
@rolandh can you articulate the other issues that you've seen and that concern you about moving to .NET Core, besides this one? That will help us prioritize perf investigation.

@Lxiamail

Pretty much every major library's first iteration on .NET Core has had slowdowns, sometimes dramatic, e.g. 10x slower with Entity Framework's 2.1 release.

I see almost nothing showing performance benefits; only contrived small applications seem to benefit. Everything that is a large, in-production library seems to go backwards and introduce bugs.

There is no incentive for us to upgrade after seeing issues like this repeatedly occur after porting to netcore.

Here is a list of relevant cases I've seen when researching whether netcore is stable enough to upgrade yet.

https://github.com/dotnet/wpf/issues/1549
https://github.com/dotnet/wpf/issues/1748

https://supportcenter.devexpress.com/ticket/details/T824984/wpf-core-3-0-performance

This was nopcommerce after upgrading to netcore
https://www.nopcommerce.com/en/boards/topic/54986/nopcommerce-410-high-memory-usage-and-how-weve-handled-it

EF Core large slow downs vs the older entity framework
https://github.com/nopSolutions/nopCommerce/issues/3419

Just my 2 cents: I also ported a large WPF+EntityFramework app to core 3.0 and found that load times increased from <4 seconds for .Net Framework to 5+ seconds with .Net Core 3.0. This is with creating native images for both in release and debug builds. Runtime performance seemed largely unchanged so I abandoned this porting effort and will try this again with .Net 5 when it is released. There are numerous reports of slow-downs like in the post above and also this one: https://github.com/dotnet/wpf/issues/94

My impression is that some work is required to address all these already reported WPF + .Net Core performance issues in order to make it worthwhile to perform this migration for more complex applications with lots of assemblies and XAML resources.

Cheers
Philip

@ajcvickers re EF feedback above.

I seem to be hitting the performance problem too. The same WPF code targets .NET Core 3.1 in one build and .NET 4.8 in another; the latter's memory usage is about 20 MB at program startup, while the former's is 32 MB+, almost 60 MB.
[image]
Reading the same file as hex and showing it in a TextBox gives the memory usage above; .NET Core still costs more memory than .NET 4.8. I'm confused.
[image]

@laggage your .NET 4.8 app is running as 32-bit and your .NET Core app is running as 64-bit.

Does adding <RuntimeIdentifier>win-x86</RuntimeIdentifier> to your .csproj help? (moving it to 32-bit)
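For reference, the property goes in a PropertyGroup of the app's .csproj (assuming an SDK-style project):

```xml
<PropertyGroup>
  <!-- Run the app as 32-bit, matching the .NET Framework comparison. -->
  <RuntimeIdentifier>win-x86</RuntimeIdentifier>
</PropertyGroup>
```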

@benaadams I tried your suggestion, but Core WPF still costs more memory than .NET 4.x WPF.

[image]

@SamBent This issue was milestoned to 5.0, but I don't see any updates even though 5.0 is being released. Is there reason to believe these performance regressions were fixed, and that the people on this thread should evaluate this again in their applications?

The original issue - startup cost of XamlServices.Parse - is not yet addressed in 5.0. It only comes up if the app explicitly asks for it, so it's not a factor in load times of most apps.

A number of other issues have been mentioned here, most of them outside WPF's scope. You can follow the links to check on the status with the appropriate owners.

@SamBent In my case the problems may be DataTemplate instantiation via XamlReader.Parse which I assume internally calls XamlServices.Parse.

Will this issue be assigned to a new milestone, or will WPF apps that make heavy use of XAML parsing see permanent load-time degradation if ported to .NET Core?
