Runtime: ReadyToRun images are not as efficient for .NET Core 3 as NGen for .NET Framework

Created on 29 Aug 2019 · 35 Comments · Source: dotnet/runtime

@AlexChuev commented on Thu Aug 29 2019

  • .NET Core Version: 3.0 Preview 8
  • Windows version: 1809
  • Does the bug reproduce also in WPF for .NET Framework 4.8?: No
  • Is this bug related specifically to tooling in Visual Studio (e.g. XAML Designer, Code editing, etc...)? No

Problem description:
Since Ngen.exe and the Native Image Task are not available in .NET Core 3, ReadyToRun images are the only way to reduce application startup time and avoid delays caused by JIT compilation. However, our tests show that a WPF application that initializes three times faster after running Ngen.exe gets only a 25% improvement from ReadyToRun images.
This is a major setback for our users: with many libraries and theme resources, JIT compilation is one of the main factors affecting load times. For some real-life applications, Ngen.exe allowed our users to shave up to 6 seconds (half of the total) off the initial startup time.

Minimal repro:
You can find samples for .NET Core 3 and .NET Framework here: https://github.com/AlexChuev/ReadyToRunPerformanceTest
These samples use DevExpress WPF assemblies to demonstrate how Ngen.exe and ReadyToRun images affect projects with many classes and theme resources. If you need published apps or other samples for your tests, please let me know.

P.S. I'm posting this in the WPF repo because the difference between Ngen.exe and ReadyToRun images strongly affects WPF applications. Normally, the JIT compiler processes classes and methods only when program execution needs them. However, when WPF loads theme resources for a control, all classes referenced in these resources (including classes referenced in currently unused or invisible parts) are processed by the JIT compiler. In addition, the static constructors of many WPF classes contain DependencyProperty registration code, which may cause even more classes and methods to be JIT-compiled.


@rladuca commented on Thu Aug 29 2019

@fadimounir Where should we file a companion bug on .NET?


@fadimounir commented on Thu Aug 29 2019

@rladuca Sure, you can file a companion bug on dotnet/coreclr. We'd be interested in diagnosing this further, although keep in mind that, by design, R2R will always be a bit slower than fragile native images: R2R images are version resilient, so they do not have the same fragility as the old NGen images.

If you need published apps or other samples for your tests, please let me know.

@AlexChuev That would be very helpful.

cc @jkotas FYI

area-ReadyToRun-coreclr tenet-performance

Copied from WPF issue #1478

Setting COMPlus_TC_QuickJitForLoops=1 slightly improved the startup time of my hand-made UI app (even with full R2R).
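The setting is an environment variable that the runtime reads at startup, so one way to A/B it is to launch the same published app twice, once with the variable set. A minimal Python sketch, assuming the published app is a plain executable; the helper names and the executable path are hypothetical:

```python
import os
import subprocess
from typing import Mapping, Optional


def quick_jit_for_loops_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of `base` (default: the current environment) with
    COMPlus_TC_QuickJitForLoops enabled. The input mapping is not modified."""
    env = dict(os.environ if base is None else base)
    env["COMPlus_TC_QuickJitForLoops"] = "1"
    return env


def launch(exe_path: str, quick_jit_for_loops: bool) -> subprocess.Popen:
    """Start the app with or without the setting (exe_path is hypothetical)."""
    env = quick_jit_for_loops_env() if quick_jit_for_loops else dict(os.environ)
    return subprocess.Popen([exe_path], env=env)
```

Passing an explicit `env` to the child process keeps the comparison isolated: the parent shell's environment is left untouched, so the "off" run is not accidentally affected.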

@fadimounir You can find published samples here:
https://devexpress-my.sharepoint.com/:u:/p/alexander_chuyev/EfQCpVoz181Nj61vjW-N1Q4BjgXkVMHMOXnzIym4bXQvYg?e=4iMy56
The source code for these samples is here:
https://github.com/AlexChuev/ReadyToRunPerformanceTest

Results that I get on my test machine:

| Application | Time to load assemblies |
| ------------- | ------------- |
| .NET Core (self-contained, no r2r) | 680 ms |
| .NET Core (self-contained, with r2r) | 450 ms |
| .NET Framework (no native images) | 1000 ms |
| .NET Framework (with native images) | 250 ms |
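A quick way to compare the two precompilation technologies is to compute the speedup each one gives over its own JIT-only baseline. The sketch below does that arithmetic in Python using only the numbers from the table; the dictionary layout and helper names are illustrative:

```python
# Assembly load times from the table above, in milliseconds.
results_ms = {
    ".NET Core (self-contained, r2r)": {"jit_only": 680, "precompiled": 450},
    ".NET Framework (native images)": {"jit_only": 1000, "precompiled": 250},
}


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the precompiled run is."""
    return before_ms / after_ms


def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage of load time removed by precompilation."""
    return 100.0 * (before_ms - after_ms) / before_ms


for name, r in results_ms.items():
    b, a = r["jit_only"], r["precompiled"]
    print(f"{name}: {speedup(b, a):.2f}x faster, {reduction_pct(b, a):.0f}% less time")
```

With these numbers, NGen makes loading 4.00x faster (75% less time), while ReadyToRun makes it about 1.51x faster (34% less time).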

Is there any progress on this, or a workaround?

@AlexChuev Is it possible for you to run another test using the latest .NET 5 preview? There is a very recent blog post about performance as well (https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/). Although it doesn't mention any ReadyToRun improvements other than size reduction, there are a lot of JIT changes.

@elachlan here are my results for the same test app converted to .NET 5 and targeting x64:

| Application | Time to load assemblies (ms) |
| ------------- | ------------- |
| .NET Framework (no native images) | 1448 |
| .NET Framework (with native images) | 277 |
| .NET Core 3 (self-contained, no r2r) | 957 |
| .NET Core 3 (self-contained, with r2r) | 586 |
| .NET 5 (self-contained, no r2r) | 1104 |
| .NET 5 (self-contained, with r2r) | 708 |

.NET 5 is slower than .NET Core 3 in my tests, for which I have no explanation so far. Note that I had to use a different machine than before, so the absolute numbers are a bit different.

@AlexChuev thanks for the tests. Not so great that we have gone backwards again.

@stephentoub Your article was a great rundown of the performance improvements and was much appreciated. There was a comment on it that Preview 7 was going to have more improvements for AOT/R2R. Is that the intention?

Based on Alex's tests, .NET 5 with R2R looks to be about 155% slower than .NET Framework with native images and 20% slower than .NET Core 3 with R2R.

Thanks!
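The percentages above can be checked directly against the x64 table. A small Python sketch (helper name illustrative) that computes how much longer the .NET 5 R2R run takes relative to each precompiled baseline:

```python
# Precompiled load times from the x64 table above, in milliseconds.
net5_r2r_ms = 708
baselines_ms = {
    ".NET Framework (with native images)": 277,
    ".NET Core 3 (self-contained, with r2r)": 586,
}


def percent_slower(candidate_ms: float, baseline_ms: float) -> float:
    """Extra time taken by the candidate, as a percentage of the baseline."""
    return 100.0 * (candidate_ms - baseline_ms) / baseline_ms


for name, ms in baselines_ms.items():
    print(f".NET 5 (r2r) vs {name}: {percent_slower(net5_r2r_ms, ms):.1f}% slower")
```

This gives roughly 155.6% and 20.8%, consistent with the approximate figures quoted above.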

Your article was a great rundown of the performance improvements and was much appreciated.

Thanks.

There was a comment on it that Preview 7 was going to have more improvements for AOT/R2R.

Which comment?

Sorry, I meant a comment someone else left in the comments section, not one from you specifically.
https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/#comment-6772

I imagine there will be some sort of effort before release; I'm just wondering if that was the aim.

I see, thanks. @jeffschwMSFT may be able to provide more information.

Adding @mangod9 and @jkotas. We are aware that there are performance gaps between R2R and NGen. One of the larger advantages of R2R is the less fragile nature of its images, which comes with a performance trade-off. The way we like to approach these problems is with real-world examples that we can look at and tune.

Compiling a small program with the F# compiler is a good example. A significant amount of time is spent in JIT with R2R.

@AlexChuev might be able to provide a couple of example apps. DevExpress has a myriad of demo applications that would make great examples of functional apps.

@jeffschwMSFT Is there a plan to revisit R2R performance before the .NET 5 release, or are you after specific test scenarios?

I assume these perf numbers are for specific UI scenarios?

@mangod9 Yes, AlexChuev's example is at https://github.com/AlexChuev/ReadyToRunPerformanceTest

@AlexChuev is it possible for you to update your repo to include the changes for .NET 5?

It might be worthwhile to try the crossgen2 composite functionality available in .NET 5 to check whether it improves this scenario. I tried to build a composite image for the repro, but it looks like we have an issue we need to fix first. I'll create a separate issue for that.

@elachlan Sure, updated.

@jeffschwMSFT @mangod9 I'll take a quick look at this.

cc: @dotnet/crossgen-contrib

@jkotas @davidwrighton
Comparing the profiles, most of the additional time shows up here:

Name | Inc % | Inc
-- | -- | --
coreclr!ClassLoader::LoadTypeHandleForTypeKey | 73.4 | 83.991
+ coreclr!ClassLoader::LoadTypeDefThrowing | 33.2 | 37.991
+ coreclr!SigPointer::GetTypeHandleThrowing | 30.6 | 35
+ coreclr!ClassLoader::EnsureLoaded | 11.4 | 13
+ coreclr!ClassLoader::LoadTypeHandleForTypeKey_Body | -6.1 | -7
+ coreclr!ClassLoader::LoadConstructedTypeThrowing | 4.4 | 5

That's the bulk of the regression.

I validated that the number of assemblies loaded is the same (though obviously different versions). I assume the number of types is roughly the same also given it is the same scenario.

Is this enough of a breadcrumb? I can share the profiles with you if you'd like to take a look.

@billwert Thanks for collecting the profiles. Could you please share them?

@MichalStrehovsky pointed out that since WPF implies mixed-mode C++/CLI (mc++) assemblies, crossgen2 (cg2) wouldn't currently support it. We will continue to investigate the 3.1 -> 5.0 regression.

Last fall, I built up 3/4 of the logic necessary to see a view of the type system load events in a useful way in PerfView. I'll see if I can resurrect that work, and build up an understanding of what types we're actually working with here, and if there are significant differences from 3.1.

I see a non-trivial amount of time spent in the new covariant return checks:

Name | Inc % | Inc
-- | -- | --
coreclr!ClassLoader::ValidateMethodsWithCovariantReturnTypes | 6.6 | 40

After resolving the crossgen2 issue, the test app now compiles in composite mode. It would be good to add that case to the perf numbers once we've looked into the covariant return issue.

@davidwrighton Do you have an issue/PR tracking that work on the PerfView GitHub repo?

Yes: microsoft/perfview#1232. I need to address the feedback from @brianrob, and I've found that inclusive-time processing for the non-tree view isn't actually correct, so it may take a week or two before I find the time to finish that work.

Also, https://github.com/davidwrighton/coreclr/tree/runtime_startup_events_3_1_5 holds an implementation of these events on .NET Core 3.1 for comparative investigations.

It looks like all the related issues are resolved. There are a couple more optimisations that could be made in https://github.com/dotnet/wpf/pull/3278, but they appear to have been pushed out to another issue/PR.

@AlexChuev did crossgen2 make a difference?

@davidwrighton Is the plan for PerfView to ship a release in time for the .NET 5 release?

@jkotas / @billwert do the profiles look better now?

@elachlan PerfView ships independently of .NET and effectively has a 1-2 month release cadence. Once I finish the PR, the updated logic will be available to general users of PerfView.

So far, I have been unable to make crossgen2 work on my machine. When I set PublishReadyToRunUseCrossgen2 to true and try to publish my test app targeting win-x64, the following error appears:

error NETSDK1095: Optimizing assemblies for performance is not supported for the selected target platform or architecture. Please verify you are using a supported runtime identifier, or set the PublishReadyToRun property to false.

SDK support for crossgen2 (cg2) with composite mode will work in Preview 8. For now, I have been able to run it manually.
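Once SDK support is in place, the publish step from the error report above can be scripted. A minimal Python sketch that only builds the `dotnet publish` argument list, assuming a self-contained win-x64 publish; `PublishReadyToRun` and `PublishReadyToRunUseCrossgen2` are the properties named in this thread, while `PublishReadyToRunComposite` is an assumption about the property name the SDK exposes for composite mode, and the project file name is hypothetical:

```python
from typing import List


def r2r_publish_args(project: str, rid: str = "win-x64",
                     use_crossgen2: bool = True,
                     composite: bool = False) -> List[str]:
    """Build a `dotnet publish` command line for a ReadyToRun publish."""
    args = ["dotnet", "publish", project,
            "-c", "Release",
            "-r", rid,                      # runtime identifier, e.g. win-x64
            "--self-contained", "true",
            "-p:PublishReadyToRun=true"]
    if use_crossgen2:
        args.append("-p:PublishReadyToRunUseCrossgen2=true")
    if composite:
        # Assumed property name for composite mode; verify against your SDK.
        args.append("-p:PublishReadyToRunComposite=true")
    return args
```

For example, `subprocess.run(r2r_publish_args("ReadyToRunPerformanceTest.csproj", composite=True), check=True)` would run the publish; keeping the arguments as a list avoids shell-quoting issues.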

The changes to perfview have been merged. We should now be able to profile the startups a lot easier. Thanks @davidwrighton !

I tested it with crossgen2 in .NET 5 RC2 on my machine but got an exception at runtime:

Unhandled exception. System.TypeInitializationException: The type initializer for 'System.Windows.Media.FontFamily' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'MS.Internal.FontCache.DWriteFactory' threw an exception.
 ---> System.InvalidCastException: Specified cast is not valid.
   at MS.Internal.Text.TextInterface.Native.Util.ConvertHresultToException(Int32 hr)
   at MS.Internal.Text.TextInterface.Factory.Initialize(FactoryType factoryType)
   at MS.Internal.Text.TextInterface.Factory..ctor(FactoryType factoryType, IFontSourceCollectionFactory fontSourceCollectionFactory, IFontSourceFactory fontSourceFactory)
   at MS.Internal.FontCache.DWriteFactory..cctor()
   --- End of inner exception stack trace ---
   at MS.Internal.FontCache.DWriteFactory.get_SystemFontCollection()
   at System.Windows.Media.FontFamily..cctor()
   --- End of inner exception stack trace ---
   at MS.Internal.Text.DynamicPropertyReader.GetTypeface(DependencyObject element)
   at MS.Internal.Text.TextProperties.InitCommon(DependencyObject target)
   at MS.Internal.Text.TextProperties..ctor(FrameworkElement target, Boolean isTypographyDefaultValue)
   at System.Windows.Controls.TextBlock.GetLineProperties()
   at System.Windows.Controls.TextBlock.EnsureTextBlockCache()
   at System.Windows.Controls.TextBlock.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at MS.Internal.Helper.MeasureElementWithSingleChild(UIElement element, Size constraint)
   at System.Windows.Controls.ContentPresenter.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Controls.Border.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Controls.Control.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Controls.Grid.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at MS.Internal.Helper.MeasureElementWithSingleChild(UIElement element, Size constraint)
   at System.Windows.Controls.ContentPresenter.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Controls.Decorator.MeasureOverride(Size constraint)
   at System.Windows.Documents.AdornerDecorator.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Controls.Border.MeasureOverride(Size constraint)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Window.MeasureOverrideHelper(Size constraint)
   at System.Windows.Window.MeasureOverride(Size availableSize)
   at System.Windows.FrameworkElement.MeasureCore(Size availableSize)
   at System.Windows.UIElement.Measure(Size availableSize)
   at System.Windows.Interop.HwndSource.SetLayoutSize()
   at System.Windows.Interop.HwndSource.set_RootVisualInternal(Visual value)
   at System.Windows.Interop.HwndSource.set_RootVisual(Visual value)
   at System.Windows.Window.SetRootVisual()
   at System.Windows.Window.SetRootVisualAndUpdateSTC()
   at System.Windows.Window.SetupInitialState(Double requestedTop, Double requestedLeft, Double requestedWidth, Double requestedHeight)
   at System.Windows.Window.CreateSourceWindow(Boolean duringShow)
   at System.Windows.Window.CreateSourceWindowDuringShow()
   at System.Windows.Window.SafeCreateWindowDuringShow()
   at System.Windows.Window.ShowHelper(Object booleanBox)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at System.Windows.Threading.ExceptionWrapper.TryCatchWhen(Object source, Delegate callback, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.DispatcherOperation.InvokeImpl()
   at System.Windows.Threading.DispatcherOperation.InvokeInSecurityContext(Object state)
   at MS.Internal.CulturePreservingExecutionContext.CallbackWrapper(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at MS.Internal.CulturePreservingExecutionContext.Run(CulturePreservingExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Windows.Threading.DispatcherOperation.Invoke()
   at System.Windows.Threading.Dispatcher.ProcessQueue()
   at System.Windows.Threading.Dispatcher.WndProcHook(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at System.Windows.Threading.ExceptionWrapper.TryCatchWhen(Object source, Delegate callback, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Threading.Dispatcher.PushFrame(DispatcherFrame frame)
   at System.Windows.Application.RunDispatcher(Object ignore)
   at System.Windows.Application.RunInternal(Window window)
   at ReadyToRunPerformanceTest.App.Main()

Is there a setting that lets R2R produce slow-to-compile tier-1 code instead of just quick tier-0 code? I don't want tiered compilation (TC) to use either additional CPU or additional memory, because the server app is hosted some hundreds of times on the same server (strict tenant isolation by IIS app pools).
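One knob related to this question is the `System.Runtime.TieredCompilation` runtime setting, which turns tiered compilation off entirely: R2R code is then used as-is, and methods without R2R code are JIT-compiled once with full optimization. Whether this meets the CPU and memory goals above is not established in this thread and would need measuring. A minimal Python sketch that writes the setting into a `runtimeconfig.template.json` next to the project file; the helper names and directory are hypothetical:

```python
import json
from pathlib import Path


def runtimeconfig_template(tiered_compilation: bool) -> str:
    """Return runtimeconfig.template.json text setting the
    System.Runtime.TieredCompilation runtime knob."""
    config = {"configProperties": {"System.Runtime.TieredCompilation": tiered_compilation}}
    return json.dumps(config, indent=2)


def write_template(project_dir: str, tiered_compilation: bool = False) -> Path:
    """Write the template next to the project file (project_dir is hypothetical).
    Note: this overwrites any existing runtimeconfig.template.json."""
    path = Path(project_dir) / "runtimeconfig.template.json"
    path.write_text(runtimeconfig_template(tiered_compilation) + "\n", encoding="utf-8")
    return path
```

The SDK merges the template into the generated `runtimeconfig.json` at build time, so the setting travels with the published app rather than depending on per-app-pool environment variables.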
