Core: Horrible performance on Linux compared to Windows

Created on 14 Jan 2019 · 17 comments · Source: dotnet/core

I've written a program that decodes BUFR files, which are used by met offices to describe weather observations. A big part of this format is using a bit stream rather than a byte stream to encode the data. This code runs reasonably fast on my Windows PC: the files available for two days are processed in around 15 (HDD) to 20 (SSD) seconds. On my Linux web server (HDD, but with sufficient RAM to cache all files), the same process takes 180 seconds. (Not counting the time to download any files.)
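For context on what "bit stream rather than byte stream" means here: BUFR packs values MSB-first into a continuous bit stream that crosses byte boundaries. The following is only a rough illustration of that kind of reader, not the attached project's actual code; the `BitReader` type and its members are hypothetical names.

```csharp
using System;

// Minimal big-endian (MSB-first) bit reader over a byte buffer — a sketch
// of the kind of primitive a BUFR decoder needs. All names are hypothetical.
sealed class BitReader
{
    private readonly byte[] _data;
    private int _bitPos; // absolute bit position from the start of the buffer

    public BitReader(byte[] data) => _data = data;

    // Reads the next 'count' bits (1..32) as an unsigned big-endian value,
    // transparently crossing byte boundaries.
    public uint ReadBits(int count)
    {
        uint value = 0;
        for (int i = 0; i < count; i++, _bitPos++)
        {
            int byteIndex = _bitPos >> 3;
            int bitInByte = 7 - (_bitPos & 7); // MSB first
            uint bit = (uint)(_data[byteIndex] >> bitInByte) & 1u;
            value = (value << 1) | bit;
        }
        return value;
    }
}

class Program
{
    static void Main()
    {
        var r = new BitReader(new byte[] { 0b1011_0100, 0b1100_0001 });
        Console.WriteLine(r.ReadBits(3)); // 0b101 = 5
        Console.WriteLine(r.ReadBits(5)); // 0b10100 = 20
        Console.WriteLine(r.ReadBits(8)); // 0b11000001 = 193
    }
}
```

Note that a decoder like this allocates and branches on every value read, which is exactly the kind of hot loop where allocator or GC behavior differences between platforms could become visible.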

Also, the .NET Core process I run on Linux starts over a dozen more processes with the same command line. What are they doing?

.NET Core 2.1.5 on Windows 10 x64 and Ubuntu 16.04 x64.

The source code of the project is attached. It should run as-is. The tables directory must be next to the executable or the project, or you will get tons of error messages. The program will download a few files (3,000 to 4,000, around 120 MB in total) and then parse them. I measured the time from when the filter counter appears until it completes.

Labels: needs-more-info

All 17 comments

@ygoe do you have a small, isolated repro that demonstrates what is slower?
If you don't know which part of your app is slower, you can either experiment, or do perf analysis and compare where the time is spent.
Please make sure the perf difference shows up on more than one environment/machine, to rule out weird machine setup problems. Thanks!

@karelz Unfortunately the BUFR format is very complex and I cannot leave any part of it out or it will fail completely. The WMO did a great job of finding complicated solutions back in the 80s. So I cannot make it any smaller than it is.

The source code of the project is attached.

It doesn't seem to be attached anywhere?

Is it now?
bufr.zip

@ygoe What is WMO?
Did you reproduce the perf difference in two different environments, as I suggested above?
Did you get a chance to do some basic perf analysis and compare which part of your app is slower? That will increase the chance that we prioritize the investigation sooner.

Wikipedia says WMO is the World Meteorological Organization. It's their standard format for international data exchange that I'm parsing here.

Let's see when I find the time to track this down.

Alright, let us know what you find - it may help you find a smaller repro for the problem you're facing.

Hardly a scientific test, but a few data points:

MacBook Pro (dual-core i7, 16 GB RAM) - 85.5 seconds
Ubuntu 18.04 VM on the above machine - 416.5 seconds
Win10 VM with the same specs - 375.9 seconds

@adamsitnik @brianrob @billwert can you please help here?

Yes, I'm doing a bit of profiling and will reply back with what I find.

I did a quick comparison between Windows and Linux and found that Linux was much more CPU-bound and progressed much more slowly. I took a couple of 5-second traces to see what the differences were, and on Linux the big difference is that there is pretty heavy contention on a spin lock in the allocator:

[profiler trace screenshot showing spin-lock contention in the allocator]

This contention doesn't show up at all on Windows. It's also worth noting the total number of threads on each platform:

Linux: 176
Windows: 123

This could certainly have some impact - though I wouldn't have expected things to go from no contention to this much contention based solely on the thread count. The Parallel.ForEach that causes these threads to be injected runs over the files that were downloaded, so I think a good next step is for you to isolate this variable. Specifically: what kind of difference do you see if you run this workload against the same set of static files on Linux and Windows? Do you see an increase in the number of threads from Windows to Linux with the same static file set? Essentially, make the app as deterministic as possible and then let's see how things behave.

Any update @ygoe?

@karelz Sorry, I didn't get to it yet. I don't understand the details you found out anyway, so I could only provide other test code, not analysis at that level. I've already modified the program on my side, so I need to create something that works on a fixed (maybe smaller) set of files.

I can't provide any code other than what I already did. I tried to make a test case, but whatever I tried has the same performance on Windows and Linux. I tried with a small set of large input files, where each file was processed several times. I tried with more large files, each processed once. I tried with 500 tiny files. Nothing could reproduce the observed behaviour of the original program. So I'm sorry, I can't help with this. If you're going to find the bug in the CLR, the code already provided will have to do.

@ygoe one thing you might try is to run the app and confirm the behavior. Then make the app use the set of files that were already downloaded over and over, rather than downloading a new set of files. This should let you reproduce the problem without materially changing the amount of work that's happening. We're going to have a lot of difficulty tracking this down otherwise, as each time we run the app the behavior is different and seems to key off the number of files downloaded.
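A replay harness along these lines might look like the following. This is only a sketch: the temp-directory cache and dummy files are illustrative stand-ins for the real downloaded set, and `DecodeBufr` is a placeholder for the app's parser.

```csharp
using System;
using System.Diagnostics;
using System.IO;

class Program
{
    static void Main()
    {
        // For illustration, build a fixed "cache" of dummy files in a temp
        // directory; in the real app this would be the already-downloaded set.
        string cache = Path.Combine(Path.GetTempPath(), "bufr-cache");
        Directory.CreateDirectory(cache);
        for (int i = 0; i < 10; i++)
            File.WriteAllBytes(Path.Combine(cache, $"obs{i}.bufr"), new byte[256]);

        // Replay the same files several times so every run does identical work.
        string[] files = Directory.GetFiles(cache, "*.bufr");
        const int passes = 5;

        var sw = Stopwatch.StartNew();
        int decodes = 0;
        for (int pass = 0; pass < passes; pass++)
        {
            foreach (var file in files)
            {
                DecodeBufr(File.ReadAllBytes(file)); // stand-in for the parser
                decodes++;
            }
        }
        sw.Stop();

        Console.WriteLine($"{decodes} decodes in {sw.Elapsed}");
    }

    static void DecodeBufr(byte[] bytes) { /* the app's parser goes here */ }
}
```

With a fixed file set like this, a Windows run and a Linux run become directly comparable, which is what the investigation needs.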

Agreed, we need to compare the same amount of work on both systems. Otherwise it is not a bug, but likely a by-design difference in workload.

No response; closing. Feel free to reopen when there's more actionable info, as asked above. Thanks!
