Runtime: Performance regression: 6x slower array allocation on Alpine

Created on 26 Aug 2020 · 36 comments · Source: dotnet/runtime

From the data I got from @danmosemsft, which was collected by running the dotnet/performance microbenchmarks on Alpine 3.11 via WSL2, it looks like allocating arrays of both value and reference types has become 6 times slower compared to 3.1.

Initially I thought it was just an outlier, but I can see the same pattern for other collections that internally use arrays (Queue, List, Stack, etc.). The regression is specific to Alpine; Ubuntu 18.04 (with and without WSL2) is fine.

@jkotas @janvorli who would be the best person to investigate that?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

System.Collections.CtorGivenSize.Array(Size: 512)

| Conclusion | Base | Diff | Base/Diff | Modality | Operating System | Bit | Processor Name | Base Runtime | Diff Runtime |
| ---------- | ------:| ------:| ---------:| -------- | ----------------------- | ----- | --------------------------------------------- | --------------- | ----------------------- |
| Same | 181.79 | 183.96 | 0.99 | | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 92.89 | 94.47 | 0.98 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20404.3 |
| Same | 96.05 | 94.36 | 1.02 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 114.74 | 111.94 | 1.03 | | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 80.49 | 79.98 | 1.01 | | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.9 |
| Same | 67.30 | 67.66 | 0.99 | bimodal | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 86.10 | 79.17 | 1.09 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.14 |
| Same | 97.50 | 98.77 | 0.99 | | Windows 10.0.18363.959 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20420.14 |
| Slower | 127.02 | 150.46 | 0.84 | bimodal | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.5 |
| Slower | 193.61 | 287.83 | 0.67 | bimodal | ubuntu 18.04 | Arm64 | Unknown processor | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6|
| Same | 99.85 | 103.42 | 0.97 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20403.23 |
| Slower | 138.73 | 151.37 | 0.92 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20404.2 |
| Slower | 72.85 | 515.56 | 0.14 | | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6|
| Slower | 78.85 | 90.76 | 0.87 | | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |

Labels: area-GC-coreclr, os-alpine, tenet-performance


All 36 comments

@billwert

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

These tests are measuring raw GC allocation speed. This is a GC performance regression.

@mangod9 Could you please find somebody to investigate this?

OK, will have someone take a look. @adamsitnik, is the regression specifically with Alpine on WSL2, or Alpine in general?

> Is the regression specifically with Alpine on WSL2, or Alpine in general?

@mangod9 The benchmarks were executed with Alpine on WSL2; I don't have "Alpine only" results, so I can't say for sure. FWIW, the Ubuntu on WSL2 results are fine, so it might be a general Alpine problem.

It repros without WSL2 too.

The problem is that the GC is running something like 100x more often than it should. It is likely a problem in the budget computation. One of the places to check is PAL_GetLogicalProcessorCacheSizeFromOS.

OK, thanks for the context. We will look into whether there is any Alpine-specific behavior that is leading to this.

Here is a simple repro:

```csharp
using System.Diagnostics;

namespace System.Collections
{
    class Program
    {
        static void Main(string[] args)
        {
            // Allocate 1,000,000 short-lived int[512] arrays, then report how
            // many gen0 collections were triggered and how long the loop took.
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000000; i++)
            {
                var foo = new int[512];
            }
            sw.Stop();
            Console.WriteLine(System.GC.CollectionCount(0));
            Console.WriteLine(sw.ElapsedTicks);
        }
    }
}
```

| | 3.1 (Alpine) | 5.0 p8 (Alpine) |
|------------------|-------------:|----------------:|
| Gen0 Collections | 330 | 12653 |
| ElapsedTicks | 68689200 | 368227200 |
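
As a quick back-of-the-envelope check on those numbers (assuming roughly 2 KB per new int[512], i.e. 512 × 4 bytes plus object header): the loop allocates about 2 GB in total, so 330 gen0 collections works out to roughly 6 MB allocated per collection on 3.1, while 12,653 collections is only about 160 KB per collection on 5.0 p8, which is consistent with the gen0 budget collapsing to a tiny value.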

Yeah, this is clearly related to calculating the minimal gen0 budget: this test allocates only temporary objects, so it will only trigger gen0 GCs, which always get the min gen0 budget. The min gen0 budget is calculated in get_gen0_min_size, so something in that method changed.

I investigated this more and it appears that none of the _SC_LEVEL1_DCACHE_SIZE family of constants are defined for Alpine (musl), so PAL_GetLogicalProcessorCacheSizeFromOS will always return 0.
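
To illustrate the pattern being described (a minimal standalone sketch, not the actual PAL implementation; the function name is made up):

```cpp
// Minimal sketch: query cache sizes via the glibc-specific sysconf constants.
// On musl (Alpine) none of the _SC_LEVEL*_CACHE_SIZE constants are defined,
// so every block below compiles out and the function reports 0.
#include <unistd.h>
#include <algorithm>
#include <cstddef>

static size_t GetCacheSizeViaSysconf()
{
    size_t cacheSize = 0;
#ifdef _SC_LEVEL1_DCACHE_SIZE
    { long s = sysconf(_SC_LEVEL1_DCACHE_SIZE); if (s > 0) cacheSize = std::max(cacheSize, (size_t)s); }
#endif
#ifdef _SC_LEVEL2_CACHE_SIZE
    { long s = sysconf(_SC_LEVEL2_CACHE_SIZE); if (s > 0) cacheSize = std::max(cacheSize, (size_t)s); }
#endif
#ifdef _SC_LEVEL3_CACHE_SIZE
    { long s = sysconf(_SC_LEVEL3_CACHE_SIZE); if (s > 0) cacheSize = std::max(cacheSize, (size_t)s); }
#endif
    return cacheSize; // 0 on musl, which later starves the GC's gen0 budget
}
```

A zero result here feeds into the GC's generation size heuristics, which lines up with the gen0-collection explosion in the table above.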

Wonder if https://github.com/dotnet/runtime/pull/34488 caused the regression, since I notice this case is missing when compared to 3.1

I think you are right. It means that we had a performance bug for Alpine here even before. The code that I have deleted in #34488 produced provably wrong results for AMD and the logic was questionable for Intel too.

Would you like me to work on getting this fixed?

I notice that you had an aborted PR, https://github.com/dotnet/runtime/pull/34484, to fix GetCacheSizeFromCpuId. Does that need to be resurrected? If so, I think I can get that code back.

#34484 fixed just one of the multiple bugs that this code had. I think it needs to be rewritten more or less from scratch using the current editions of the Intel and AMD manuals.

We can also gather this information from /sys/devices/system/cpu without doing Intel/AMD-specific kung-fu. Each /sys/devices/system/cpu/cpu0/cache/indexN directory describes one of cpu0's caches: its size file gives the size and its level file says which cache level it is (for example, /sys/devices/system/cpu/cpu0/cache/index0/size is typically the L1 data cache).

We have this code for ARM64, so we can just remove the ifdef around it and use it unconditionally.
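
For illustration, a rough standalone sketch of reading cache sizes from that sysfs hierarchy (not the actual ARM64 code path referenced above; the function name and the index cap are made up):

```cpp
// Rough sketch: scan /sys/devices/system/cpu/cpu0/cache/indexN/size files.
// Each file contains a value such as "32K" or "8192K"; keep the largest.
#include <cstdio>
#include <cstddef>

static size_t GetLargestCacheSizeFromSysfs()
{
    size_t largest = 0;
    for (int index = 0; index < 16; index++)
    {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/size", index);

        FILE* f = fopen(path, "r");
        if (f == nullptr)
            break;                      // no more indexN directories

        unsigned long value = 0;
        char unit = '\0';
        if (fscanf(f, "%lu%c", &value, &unit) >= 1)
        {
            size_t bytes = value;
            if (unit == 'K') bytes *= 1024;
            else if (unit == 'M') bytes *= 1024 * 1024;
            if (bytes > largest)
                largest = bytes;
        }
        fclose(f);
    }
    return largest;
}
```

Each indexN directory also exposes level and type files, should the caller need to distinguish the L1 data and instruction caches from L2/L3.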

Yeah, just removing it seems like the best option for 5.0. Will create a PR. Do we need to rethink this for 6.0?

Agree. We have a duplicate of this code in gcenv.unix.cpp for standalone GC, so please remove it there as well.

I think it is fine as a permanent fix ... until we discover the next problem with this.

imo, the better fix would be:

```diff
- #if defined(HOST_ARM64)
+ #if defined(TARGET_LINUX)
```

to keep support for non-Linux Unix-like operating systems intact (macOS, FreeBSD, SunOS, and so forth).

Reopening, so it can be ported to .NET 5.

I have validated that the simplified repro shows low gen0 collection counts. I am unable to get the benchmarks working on WSL2 (will investigate). @adamsitnik, if you could please validate that the benchmarks are showing comparable perf to 3.1, that would be great (guess we need to wait for a daily build anyway). Thx.

> I am unable to get the benchmarks working on WSL2 (will investigate)

@mangod9 what error are you getting? You should be able to run the python script and get 3.1 results:

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

And then run the benchmarks as 5.0 using your local build of dotnet/runtime by specifying the path to corerun:

python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array' \
 --corerun './runtime/artifacts/bin/testhost/net5.0-Alpine-Release-x64/shared/Microsoft.NETCore.App/5.0.0/corerun'

You can read more about benchmarking local builds of dotnet/runtime here

> if you could please validate that the benchmarks are showing comparable perf to 3.1, that would be great

I currently don't have access to an Alpine machine, but as soon as #41547 gets propagated to the SDK I can spin up a new Azure VM and run the benchmarks.

The container was missing libgdiplus on Alpine, but I was able to resolve it by adding it from http://dl-3.alpinelinux.org/alpine/edge/testing/.

I was able to validate that the fix gets the perf to be comparable with 3.1:

| Alpine | 3.1 | 5.0 | 6.0 |
|------------------|-------------:|----------------:|----------------:|
| CtorGivenSize.Array | 80.30 ns | 737.6 ns | 87.62 ns |

Thanks @adamsitnik and @mangod9 for resolving, I'm very happy that my data turned out to be rather useful!

@danmosemsft @mangod9 I wonder how we could prevent a similar problem in the future. Perhaps we should add some asserts to the method to ensure that it never returns 0 again?
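
Something along these lines, purely as a hypothetical illustration of that suggestion (the function name and the fallback value are made up):

```cpp
// Hypothetical illustration: assert in debug builds if the cache size query
// comes back as 0, and fall back to a conservative default otherwise, so a
// missing OS-specific code path can no longer silently collapse the gen0 budget.
#include <cassert>
#include <cstddef>

static size_t EnsureNonZeroCacheSize(size_t reported)
{
    const size_t fallbackBytes = 256 * 1024; // arbitrary conservative default
    assert(reported != 0 && "cache size query returned 0; check the OS-specific path");
    return reported != 0 ? reported : fallbackBytes;
}
```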

@danmosemsft Yeah, it was very helpful in pinpointing where the issue might be. @adamsitnik Yes, I will create a separate issue to track how we can add a test/asserts for this.

@Lxiamail @billwert @DrewScoggins this is evidence I think that regular Alpine runs in the lab are important. This would have justified servicing, I think. I'm not sure where we left that conversation @billwert ?

> @Lxiamail @billwert @DrewScoggins this is evidence I think that regular Alpine runs in the lab are important. This would have justified servicing, I think. I'm not sure where we left that conversation @billwert ?

@danmosemsft Yes, we have an email discussion going about adding additional OS coverage in the perf lab. We will look into @adamsitnik's finalized report of the data from this exercise, and I'm trying to get .NET Core OS usage telemetry data. Hopefully we can identify the commonly used OSes and add them to the perf lab.

@Lxiamail sounds good, thanks! Telemetry aside, I think if we can only do 2 Linux flavors, Alpine probably needs to be one of those because of its unique characteristics.

> Telemetry aside, I think if we can only do 2 Linux flavors, Alpine probably needs to be one of those because of its unique characteristics.

@danmosemsft I'm not familiar with the unique characteristics of different Linux OSes. It would be great if someone could articulate what unique Alpine characteristics we are referring to. From PR #41547, I didn't see any Alpine-specific runtime code change.

Just as a side note, https://stackshare.io/stackups/alpine-linux-vs-ubuntu shows Alpine has far fewer users than Ubuntu. If we only use OS telemetry to pick the Linux OSes for the perf lab, Alpine may not rank near the top.

Alpine is different since it uses musl libc instead of glibc, which sets it apart from other distros. For this particular fix, constants such as _SC_LEVEL1_DCACHE_SIZE are not defined for musl, hence the fallback to a different method of retrieving the cache size was mostly Alpine-specific (there are several such subtle differences on Alpine).

Right, anything related to libc can have different characteristics. Beyond musl, Alpine is the most "size optimized" distro we currently support, so it may have unique characteristics elsewhere because of those optimizations (eg possibly missing bits of /proc - I don't know, just an example). Plus, folks who deploy to Alpine are presumably particularly sensitive to perf (or at least size) so we should particularly care about the perf on Alpine.

cc @richlander in case he wants to add anything.

@mangod9 @danmosemsft thanks for the info! Looks like we definitely should add Alpine to the perf lab. Are there any other OSes with unique characteristics that should be considered for monitoring in the perf lab?

Others are better positioned to answer that one; off the top of my head, I cannot remember any Linux regressions that wouldn't show up in either Ubuntu or Alpine.

Adding Alpine is a good start, we can always re-evaluate if we find any other distro specific issues.
