Runtime: [Hostfxr] [Windows] High CPU usage on background thread after invoking `hostfxr_get_runtime_delegate` on non-main thread.

Created on 13 Jul 2020  路  7Comments  路  Source: dotnet/runtime

Description

In .NET 5, invoking hostfxr_get_runtime_delegate from a non-main thread appears to leave a background task running, consuming a full CPU thread.

Specifically, the issue can be reproduced in the following fashion on Windows:

int main(int argc, char* argv[])
{
    // load_runtime sets up the runtime, calling `hostfxr_get_runtime_delegate` down the road
    auto handle = CreateThread(nullptr, 0, &load_runtime, 0, 0, nullptr);
    Sleep(INFINITE); // Stall the program to observe the effect.
}

This does not occur if the runtime is initialised on the main thread.

Reproduction

Minimal reproduction is available at Reloaded.Core.Bootstrap/runtimebug.
Simply execute Reloaded.Core.Bootstrap.Example and observe CPU utilisation once the program begins to sleep.

The offending function call that triggers the issue can be found at Reloaded.Core.Bootstrap/CoreCLR.cpp#L71.

(Requires Visual Studio 16.7 preview or newer, with Desktop development with C++ workload as well as the latest .NET 5 SDK).

Reproduction Debugging

To trigger the correct, intended behaviour (no loop/bug) simply load the runtime on main thread. i.e. replace

initializeThreadHandle = CreateThread(nullptr, 0, &load_bootstrapper_async, 0, 0, nullptr);
WaitForSingleObject(initializeThreadHandle, INFINITE);

with

load_bootstrapper_async(0);

in the example's entry point.
I have set it up this way to make control flow comparisons easier between intended and unintended behaviour should that be helpful.

Configuration

Runtime: 5.0.0-preview.6
OS: Win 10 2004 (Build 19041.329)
Architecture: x86, x64

Regression?

This is a regression from .NET Core 3.X.
(i.e. Code runs as expected all the way from 3.0 Preview 6 when it became first available to the latest 3.1 release).

I have not yet identified the first .NET 5 Preview/build with the issue intact.

Images

.NET 5:
HLFar4wrQt

netcoreapp3.1:
HxklikQaVN

netcoreapp3.0:
WdN7XsRATY

area-Diagnostics-coreclr tenet-performance

Most helpful comment

I put this under a debugger this afternoon. The server is being put into an error state where the overlapped IO is continually returning ERROR_OPERATION_ABORTED. The description for that error states that

The I/O operation has been aborted because of either a thread exit or an application request.

In other words, the fact that you loaded the runtime on one thread but ended that thread broke the asynchronous IO. The abort is being recognized here at line 327:

https://github.com/dotnet/runtime/blob/50f82db26eaa743074fea538063abb4fec0adb02/src/coreclr/src/vm/ipcstreamfactory.cpp#L324-L330

which unfortunately doesn't set fSawError if the GetConnectedStream callback returns nullptr. In that case we never reset the asynchronous call to ConnectNamedPipe, so the pipe always comes back marked as signaled, and then fails in the same way.

This should be fixable by special casing the abort case and/or resetting the connection if the callback returns nullptr.

All 7 comments

Tagging subscribers to this area: @vitek-karas, @swaroop-sridhar, @agocke
Notify danmosemsft if you want to be subscribed.

Thanks a lot for the repro @Sewer56 . I debugged into it a little bit and it seems to be related to the diagnostic server thread. It loops in this callstack:

>   coreclr.dll!IpcStreamFactory::GetNextAvailableStream(void(*)(const char *, unsigned int) callback) Line 166 C++
    coreclr.dll!DiagnosticServer::DiagnosticsServerThread(void * __formal) Line 50  C++
    kernel32.dll!BaseThreadInitThunk()  Unknown
    ntdll.dll!RtlUserThreadStart()  Unknown

It calls the IpcStream::DiagnosticsIpc::Poll which "successfully" waits (seems to be no wait time actually) on the handle, then gets a stream which is always NULL - that causes the caller to loop and try again... and so on.

I don't know what's different running it from managed exe versus hosted via the hosting APIs - obviously a simple console app doesn't repro this.

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

I don't know what's different running it from managed exe versus hosted via the hosting APIs - obviously a simple console app doesn't repro this.

Indeed, would be interesting to know, even if at a high level.
I'd be interested digging into the runtime internals myself someday.
Thanks for reaching out so quickly.

On an unrelated note.
Seems there was something I forgot to mention in the repro; this should be fairly obvious and implicit from the issue description but I should mention just in case.

To trigger the correct, intended behaviour (no loop) simply load the runtime on main thread. i.e. replace

initializeThreadHandle = CreateThread(nullptr, 0, &load_bootstrapper_async, 0, 0, nullptr);
WaitForSingleObject(initializeThreadHandle, INFINITE);

with

load_bootstrapper_async(0);

in the example's entry point. I edited the opening post to reflect this.

Explicitly, I set it up this way in the repro so it's easier to test & compare control flow.

@dotnet/dotnet-diag @josalem

Interesting! I haven't had a chance to try the repro since I'm on my Mac right now.

@vitek-karas when you run this under the debugger, does a NamedPipe with a name like dotnet-diagnostic-<pid> show up for the application? IpcStreamFactory::GetNextAvailableStream returning nullptr indicates a failure state of the underlying NamedPipe (on Windows). Could you give some more details about what's happening inside the call to GetNextAvailableStream and Poll? Alternatively, could you turn on StressLog with the diagnostics log facility (0x00001000) and post the result? I'll try to repro this tomorrow.

I put this under a debugger this afternoon. The server is being put into an error state where the overlapped IO is continually returning ERROR_OPERATION_ABORTED. The description for that error states that

The I/O operation has been aborted because of either a thread exit or an application request.

In other words, the fact that you loaded the runtime on one thread but ended that thread broke the asynchronous IO. The abort is being recognized here at line 327:

https://github.com/dotnet/runtime/blob/50f82db26eaa743074fea538063abb4fec0adb02/src/coreclr/src/vm/ipcstreamfactory.cpp#L324-L330

which unfortunately doesn't set fSawError if the GetConnectedStream callback returns nullptr. In that case we never reset the asynchronous call to ConnectNamedPipe, so the pipe always comes back marked as signaled, and then fails in the same way.

This should be fixable by special casing the abort case and/or resetting the connection if the callback returns nullptr.

Was this page helpful?
0 / 5 - 0 ratings