Runtime: [ Question ] Reduce memory consumption of CoreCLR

Created on 22 Mar 2017 · 17 comments · Source: dotnet/runtime

Hello.

I am wondering about possible ways to reduce memory consumption of CoreCLR.
Do you have any ideas about how it is possible to reduce the working set size?
Please, share any related ideas, and also general opinions about this direction of development.

By the way, is there any defined set of rules for choosing between higher performance and lower memory consumption?
Is it accepted practice to add compile-time or runtime switches that allow switching between the two options?

Design Discussion

Most helpful comment

Don't allocate 😄

All 17 comments

Don't allocate 😄

Hi Ruben, do you have any profiling results?

Hi SaeHie,

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

A typical profile of CoreCLR's memory on the GUI applications is the following:

  1. Mapped assembly images - 4.2 megabytes (50%)
  2. JIT-compiler's memory - 1.7 megabytes (20%)
  3. Execution engine - about 1 megabyte (11%)
  4. Code heap - about 1 megabyte (11%)
  5. Type information - about 0.5 megabyte (6%)
  6. Objects heap - about 0.2 megabyte (2%)

JIT-compiler memory - 1.7 megabytes (20%)

Compiler itself or generated code?

JIT-compiler memory - 1.7 megabytes (20%)

Compiler itself or generated code?

Yes, the memory for compilation itself, excluding the size of the JIT-compiled code (the code's size is accounted for under "Code heap").

Yes, the memory for compilation itself

This memory should be transient. It is not needed once the JIT is done JITing. The JIT keeps some of it around to avoid asking the OS for it again and again. Is the 1.7MB number the high watermark, or do you see it kept around permanently?

The JIT should need less than 100kB to JIT most methods. You may want to look at which (large?) methods take a large amount of memory to JIT, and do something about them.

Don't allocate :-)

This is not necessarily the right answer for optimizing the fixed footprint that this issue is about. The techniques for avoiding allocations (generics, etc.) often make the fixed footprint worse than just writing simple code that allocates a bit of temporary garbage.
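A hedged illustration of that trade-off (the types and method names here are invented for the example, not taken from the applications above): in CoreCLR, each value-type instantiation of a generic method gets its own native code body, so genericizing code to avoid boxing allocations can grow the fixed code footprint, while the simple object-based version compiles once and merely produces transient garbage.

```csharp
using System;
using System.Collections.Generic;

static class FootprintTradeoff
{
    // Allocation-free, but CoreCLR compiles a separate native body for
    // every value type T used (int, long, double, ...), growing the
    // fixed code footprint.
    public static T Max<T>(List<T> items) where T : IComparable<T>
    {
        T best = items[0];
        for (int i = 1; i < items.Count; i++)
            if (items[i].CompareTo(best) > 0)
                best = items[i];
        return best;
    }

    // Simple object-based version: one native body in total, but each
    // value-type element is boxed, producing temporary garbage that the
    // GC cleans up.
    public static IComparable MaxBoxed(List<IComparable> items)
    {
        IComparable best = items[0];
        for (int i = 1; i < items.Count; i++)
            if (items[i].CompareTo(best) > 0)
                best = items[i];
        return best;
    }
}
```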

Typical profile of CoreCLR's memory on the GUI applications

Excellent! It is always good to start a performance investigation with measurements.

Higher performance and lower memory consumption? Is it accepted practice to add compile-time or runtime switches that allow switching between the two options?

We do have prior art here: the server GC vs. workstation GC setting is exactly that. The server GC is higher performance, but it has higher memory consumption as well. We can discuss other similar switches like this.
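For concreteness, a sketch of how such a switch is surfaced to an application via `runtimeconfig.json` (the `System.GC.Server` property is the real knob; the surrounding file contents are the usual boilerplate):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true
    }
  }
}
```

The same setting can also be toggled with the `COMPlus_gcServer` environment variable.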

Mapped assembly images - 4.2 megabytes (50%)
JIT-compiler's memory - 1.7 megabytes (20%)

These two are obviously the buckets to focus on. For optimizing the footprint of mapped assembly images, you may take a look at the mono linker (https://github.com/mono/linker) - @russellhadley and @erozenfeld are looking into using it for .NET Core.

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

Thanks for sharing the results!

@jkotas,

Thank you very much for your comments.

We clarified the measurements.

We also need to add some comments about them:

  • The measurements were performed with assemblies precompiled in the ReadyToRun format, which currently isn't the default in Tizen. When the Fragile format of assemblies is used, the distribution of memory consumption looks quite different.

  • The measurements show the "Private" memory usage of the process, i.e. only the part that is not shared with other processes; the "Shared" part is not accounted for at all in these measurements. Most of the "Mapped assembly images" bucket above is "Private_Clean" (unmodified) memory, which automatically becomes "shared" as soon as the same assembly is mapped into another process. So the actual per-application consumption of the mapped assembly images is much lower in ReadyToRun mode. Please see the new measurements below.

@seanshpark, @jkotas, please see the clarified measurements below.
The following measurements are for the Puzzle sample application (https://developer.tizen.org/sites/default/files/documentation/puzzle2.zip), which is started along with another .NET application (so the mapped files are mostly shared).

ReadyToRun mode means the Tizen-default set of precompiled assemblies is in ReadyToRun format.
Fragile mode means the Tizen-default set of precompiled assemblies is in Fragile format (the format currently used in Tizen).
The values in cells represent "Private" (per-application) memory consumption of CoreCLR.

Component | ReadyToRun mode | Fragile mode
------------ | ------------ | -------------
Mapped assembly images | 1921 kilobytes (37%) | 5130 kilobytes (76%)
Execution engine | 1309 kilobytes (25.2%) | 795 kilobytes (11.8%)
Objects heap | 690 kilobytes (13.3%) | 506 kilobytes (7.5%)
Code heap | 549 kilobytes (10.5%) | 119 kilobytes (1.7%)
Type information | 654 kilobytes (12.6%) | 106 kilobytes (1.5%)
JIT-compiler's memory | 64 kilobytes (1.2%) | 64 kilobytes (0.9%)
Total | 5187 kilobytes (100%) | 6720 kilobytes (100%)

Do we understand correctly that the differences in memory distribution between ReadyToRun and Fragile mode are caused by storing preinitialised data in the Fragile format? Could you please point us to documentation or places in the code base that explain the difference?

differences in memory distribution between ReadyToRun and Fragile mode are caused by storing preinitialised data in the Fragile format?

I think so.

documentation or places in the code base that explain the difference?

The pre-initialized data structures in the Fragile format contain a lot of pointers that need to be updated. This is called "restoring" in the code; e.g. look for MethodTable::Restore. Updating the pointers produces the private memory pages.

Creating the data structures at runtime on demand gives you dense packing for free: the private pages contain just the data structures that are needed. The pre-initialized data structures in the fragile images do not have this property (e.g. the program may only need a 100-byte data structure from a given page, but the whole 4k page becomes private memory).
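A back-of-the-envelope sketch of that page-granularity effect, with invented numbers purely for illustration:

```csharp
using System;

class PageGranularity
{
    const int PageSize = 4096; // typical page size

    static void Main()
    {
        // Hypothetical: the program touches one 100-byte pre-initialized
        // structure on each of 500 fragile-image pages.
        int pagesTouched = 500;
        long usefulBytes  = pagesTouched * 100L;           // ~49 KB actually needed
        long privateBytes = pagesTouched * (long)PageSize; // ~2 MB of private pages

        // Creating the structures on demand instead packs them densely,
        // so private memory stays close to usefulBytes.
        Console.WriteLine($"useful: {usefulBytes / 1024} KB, private: {privateBytes / 1024} KB");
    }
}
```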

@jkotas, thank you for the information!

@ruben-ayrapetyan As I read it, this is answered now; please reopen if not.

@jkotas,

We have performed an initial comparison of CoreCLR and CoreRT from the viewpoint of memory consumption, using benchmarks from http://benchmarksgame.alioth.debian.org.

The initial measurements show that CoreCLR consumes approximately 41% more memory on average than CoreRT and is approximately 4% slower (x64 release build).

In particular, the binary-trees benchmark (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=csharpcore&id=5) shows the following:

  • Peak RSS on CoreCLR: about 1.5 gigabytes
  • Peak RSS on CoreRT: about 1 gigabyte
  • Running time on CoreCLR: about 46.7 seconds
  • Running time on CoreRT: about 29.6 seconds

As far as we currently see, the difference in memory consumption is mostly related to differences in GC heuristics.
In particular, we could reduce the memory consumption of CoreCLR on binary-trees by about a factor of two by invoking the GC more frequently (see the sketch below).
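A minimal sketch of what "invoking the GC more frequently" means here, assuming an allocation-heavy loop like the one in binary-trees (the method name and the collection interval are illustrative, not the actual benchmark code):

```csharp
using System;

class FrequentGcSketch
{
    static void Main()
    {
        for (int i = 0; i < 1000; i++)
        {
            BuildAndDiscardTree(); // allocates heavily, like binary-trees

            // Collecting every N iterations trades throughput for a lower
            // peak heap; left alone, the GC heuristics may let the heap
            // grow much larger before collecting.
            if (i % 100 == 0)
                GC.Collect();
        }
    }

    static void BuildAndDiscardTree()
    {
        // Placeholder for the benchmark's tree-building work.
        var tmp = new int[64 * 1024];
        GC.KeepAlive(tmp);
    }
}
```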

Do we see correctly that the main cause of the difference is related to the GC?
Could you please clarify what the differences in GC are between CoreRT and CoreCLR?

cc @lemmaa @egavrin @Dmitri-Botcharnikov @sergign60 @BredPet @gbalykov @kvochko

As far as we currently see, the difference in memory consumption is mostly related to differences in GC heuristics.

Unfortunately, it does not explain why we see performance improvements on memory-intensive benchmarks like binary-trees or spectral-norm.

Launch time is better on CoreRT, obviously. ~45% faster with CoreRT.

The GC PAL is incomplete in CoreRT; the performance-related parts are missing:

  • The concurrent/background GC is not enabled in CoreRT yet (it is the default in CoreCLR). You can try rerunning CoreCLR with the concurrent GC disabled to see whether it is causing the difference; see the configuration sketch after this list.
  • The L1/L2 cache size detection is missing: https://github.com/dotnet/corert/blob/master/src/Native/gc/unix/gcenv.unix.cpp#L389. You can try hardcoding the number that CoreCLR uses on your machine to see whether it is causing the difference.
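For the first experiment, a sketch of disabling the concurrent GC through `runtimeconfig.json` (the `System.GC.Concurrent` property is the documented knob):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Concurrent": false
    }
  }
}
```

Alternatively, setting the `COMPlus_gcConcurrent=0` environment variable has the same effect.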

@jkotas, thank you very much for the advice.

We checked CoreCLR with the concurrent GC turned off.

In this configuration, CoreCLR consumes about half as much RSS at peak and is about 30% faster than CoreRT on the binary-trees benchmark.

You may be running into https://github.com/dotnet/corert/issues/3784.

These kinds of differences between CoreCLR and CoreRT are a point-in-time problem. The GC performance characteristics should be within noise between CoreCLR and CoreRT by the time we are done.
