Runtime: Socket Dispose during synchronous operation hangs on RedHat 7

Created on 24 Sep 2020 · 49 comments · Source: dotnet/runtime

Disposing a Socket that has a pending synchronous Connect/Accept operation intermittently hangs on RedHat7 and CentOS7 x64 in CI.

```c#
using var client = new Socket(SocketType.Stream, ProtocolType.Tcp);

Exception ex = await Assert.ThrowsAnyAsync<Exception>(async () =>
{
    Task connectTask = Task.Run(() => client.Connect(endPoint));
    await WaitForEventAsync(events, "ConnectStart"); // Wait until the operation starts
    Task disposeTask = Task.Run(() => client.Dispose());
    await new[] { connectTask, disposeTask }.WhenAllOrAnyFailed(); // Hangs here
});
```

Existing Connect and Accept tests are potentially masking this failure with the use of a timeout/retry combo.

Example CI Console log

area-System.Net.Sockets bug

Most helpful comment

The bugzilla issue is now public: https://bugzilla.redhat.com/show_bug.cgi?id=1886305.

All 49 comments

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

cc: @antonfirsov @tmds

@tmds you might have a good idea what's going on here -- any thoughts?
We discovered it as collateral failure during Telemetry changes for 5.0 -- if this is a real product bug (specific to RedHat7 & CentOS7), we might still want to fix it in 5.0 - maybe as servicing.

Existing Connect and Accept tests are potentially masking this failure with the use of a timeout/retry combo.

I think the failure can't be masked by the retry: both the operation Task and the disposing Task are awaited.

you might have a good idea what's going on here -- any thoughts?

There was an issue on selinux systems (like CentOS, RHEL) which was fixed here: https://github.com/torvalds/linux/commit/05174c95b83f8aca0c47b87115abb7a6387aafa5.
Are these CI machines running an up to date version of RHEL 7?
You can also try disabling selinux and see if that affects the test.

They're not awaited if the timeout task finishes first.

Ah, yes. When this times out, the test should fail immediately instead of being retried.
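
For illustration, a minimal self-contained sketch (my own, with placeholder tasks, not the actual test helpers) of how a timeout-plus-retry wrapper can swallow a hang: the hung task simply loses the `WhenAny` race and is abandoned while the iteration is retried.

```c#
using System;
using System.Threading;
using System.Threading.Tasks;

class RetryMaskingSketch
{
    static async Task Main()
    {
        for (int attempt = 1; attempt <= 3; attempt++)
        {
            // Stand-in for the hung Connect/Dispose pair; it never completes.
            Task operation = Task.Delay(Timeout.Infinite);

            Task winner = await Task.WhenAny(operation, Task.Delay(TimeSpan.FromSeconds(1)));
            if (winner == operation)
            {
                await operation; // propagate the real result
                return;
            }

            // The hang is silently converted into "transient failure, retry".
            Console.WriteLine($"Attempt {attempt} timed out; retrying...");
        }

        Console.WriteLine("Only after exhausting retries does the test surface a failure.");
    }
}
```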

Are these CI machines running an up to date version of RHEL 7?
You can also try disabling selinux and see if that affects the test.

Do you know if the RHEL7 machines are up to date?

I was able to produce several dumps from the hangs. It sits in:

```
00007F01FF7C70F8 00007f020781dde2 [HelperMethodFrame: 00007f01ff7c70f8] System.Threading.Thread.SleepInternal(Int32)
00007F01FF7C7240 00007F018C74376F System.Threading.SpinWait.SpinOnceCore(Int32)
00007F01FF7C72D0 00007F018C741D67 System.Net.Sockets.SafeSocketHandle.CloseAsIs(Boolean)
00007F01FF7C73C0 00007F018C74135E System.Net.Sockets.Socket.Dispose(Boolean)
00007F01FF7C7600 00007F018C740E27 System.Net.Sockets.Socket.Dispose()
```

and in:

```
00007F0200B11178 00007f0206b20d6f [InlinedCallFrame: 00007f0200b11178] Interop+Sys.Accept(System.Runtime.InteropServices.SafeHandle, Byte*, Int32*, IntPtr*)
00007F0200B11178 00007f018c73176a [InlinedCallFrame: 00007f0200b11178] Interop+Sys.Accept(System.Runtime.InteropServices.SafeHandle, Byte*, Int32*, IntPtr*)
00007F0200B11160 00007F018C73176A ILStubClass.IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle, Byte*, Int32*, IntPtr*)
00007F0200B11270 00007F018C73148E System.Net.Sockets.SocketPal.TryCompleteAccept(System.Net.Sockets.SafeSocketHandle, Byte[], Int32 ByRef, IntPtr ByRef, System.Net.Sockets.SocketError ByRef)
00007F0200B11350 00007F018C730C10 System.Net.Sockets.SocketAsyncContext.Accept(Byte[], Int32 ByRef, IntPtr ByRef)
00007F0200B11400 00007F018C632F23 System.Net.Sockets.SocketPal.Accept(System.Net.Sockets.SafeSocketHandle, Byte[], Int32 ByRef, System.Net.Sockets.SafeSocketHandle ByRef)
00007F0200B11480 00007F018C624B1D System.Net.Sockets.Socket.Accept()
```

I'm not well versed enough in sockets to draw any firm conclusions from them, so here are the dumps:

Kusto query

|JobName |Uri |FileName |
|------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
|4eab1239-6bc1-4177-bd43-5b7af0c2725b|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-4eab12396bc14177bd/System.Net.Sockets.Tests/core.1000.14431?sv=2019-07-07&se=2020-10-14T18%3A32%3A16Z&sr=c&sp=rl&sig=sbWH2zFVmPe9IwC4ohdePEIdTmdhmyYGQmP23tGsQUY%3D |core.1000.14431|
|4eab1239-6bc1-4177-bd43-5b7af0c2725b|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-4eab12396bc14177bd/System.Net.Sockets.Tests/core.1000.14499?sv=2019-07-07&se=2020-10-14T18%3A32%3A16Z&sr=c&sp=rl&sig=sbWH2zFVmPe9IwC4ohdePEIdTmdhmyYGQmP23tGsQUY%3D |core.1000.14499|
|4eab1239-6bc1-4177-bd43-5b7af0c2725b|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-4eab12396bc14177bd/System.Net.Sockets.Tests/core.1000.14512?sv=2019-07-07&se=2020-10-14T18%3A32%3A16Z&sr=c&sp=rl&sig=sbWH2zFVmPe9IwC4ohdePEIdTmdhmyYGQmP23tGsQUY%3D |core.1000.14512|
|4eab1239-6bc1-4177-bd43-5b7af0c2725b|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-4eab12396bc14177bd/System.Net.Sockets.Tests/core.1000.14526?sv=2019-07-07&se=2020-10-14T18%3A32%3A16Z&sr=c&sp=rl&sig=sbWH2zFVmPe9IwC4ohdePEIdTmdhmyYGQmP23tGsQUY%3D |core.1000.14526|
|4eab1239-6bc1-4177-bd43-5b7af0c2725b|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-4eab12396bc14177bd/System.Net.Sockets.Tests/core.1000.14539?sv=2019-07-07&se=2020-10-14T18%3A32%3A16Z&sr=c&sp=rl&sig=sbWH2zFVmPe9IwC4ohdePEIdTmdhmyYGQmP23tGsQUY%3D |core.1000.14539|
|3ac5ecc6-b0b1-4a35-b77e-bf8f30f80be0|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-3ac5ecc6b0b14a35b7/System.Net.Sockets.Tests/core.1000.15625?sv=2019-07-07&se=2020-10-14T18%3A32%3A17Z&sr=c&sp=rl&sig=AaWvOYz%2BNoC32l%2BvKRJ02OEWTlUIIM4d92MH7kYzZsM%3D|core.1000.15625|
|3ac5ecc6-b0b1-4a35-b77e-bf8f30f80be0|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-3ac5ecc6b0b14a35b7/System.Net.Sockets.Tests/core.1000.15749?sv=2019-07-07&se=2020-10-14T18%3A32%3A17Z&sr=c&sp=rl&sig=AaWvOYz%2BNoC32l%2BvKRJ02OEWTlUIIM4d92MH7kYzZsM%3D|core.1000.15749|
|3ac5ecc6-b0b1-4a35-b77e-bf8f30f80be0|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-3ac5ecc6b0b14a35b7/System.Net.Sockets.Tests/core.1000.15762?sv=2019-07-07&se=2020-10-14T18%3A32%3A17Z&sr=c&sp=rl&sig=AaWvOYz%2BNoC32l%2BvKRJ02OEWTlUIIM4d92MH7kYzZsM%3D|core.1000.15762|
|3ac5ecc6-b0b1-4a35-b77e-bf8f30f80be0|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-3ac5ecc6b0b14a35b7/System.Net.Sockets.Tests/core.1000.15776?sv=2019-07-07&se=2020-10-14T18%3A32%3A17Z&sr=c&sp=rl&sig=AaWvOYz%2BNoC32l%2BvKRJ02OEWTlUIIM4d92MH7kYzZsM%3D|core.1000.15776|
|3ac5ecc6-b0b1-4a35-b77e-bf8f30f80be0|https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42626-merge-3ac5ecc6b0b14a35b7/System.Net.Sockets.Tests/core.1000.15790?sv=2019-07-07&se=2020-10-14T18%3A32%3A17Z&sr=c&sp=rl&sig=AaWvOYz%2BNoC32l%2BvKRJ02OEWTlUIIM4d92MH7kYzZsM%3D|core.1000.15790|

I can provide assistance with analyzing the dumps (getting correct runtime, setting up dotnet dump etc.).

@ManickaP I mentioned a bug that was in the Linux kernel and that affects RHEL and CentOS. We should check the CI machines are using a recent version of RHEL.

@antonfirsov mentioned these machines are running the 3.10 kernel

It's expected to have an older kernel. It will include many fixes that are backported to it.
What version of RHEL is it running? The latest available is RHEL 7.8.

@tmds I will dig into the logs / ping the CI team to dig out this information. Kernel is 3.10.0.862

(The close was an accident; pressed the wrong button on my phone.)

According to https://helix.dot.net/

CentOS 7.5
RHEL 7.5

Ok @MihaZupan was faster 😄

RHEL 7.5

The commit I mentioned is from May 21, 2019, and RHEL 7.5 is from April 10, 2018.
Can we update to a later version of RHEL?

@tmds is connect(AF_UNSPEC) supposed to unblock pending accept and connect calls? Isn't it just a way of disconnecting an "already connected TCP socket" (sic. - PR description)?

Can we update to a later version of RHEL?

We can ask the engineering team if they can update (that will be a longer process, I suppose), but I think we first need to prove with 100% certainty that the hang is caused by this specific kernel bug. (By local testing, probably?)

5.0 supports RHEL 7+ (whatever is in support by RedHat) so before updating machines we need to make sure we maintain coverage. I do not know what the full matrix is: the repo infra team can help. But another possibility is to disable this test unless it is on a new enough version?
https://github.com/dotnet/core/blob/master/release-notes/5.0/5.0-supported-os.md

Is connect(AF_UNSPEC) supposed to unblock pending accept and connect calls? Isn't it just a way of disconnecting an "already connected TCP socket" (sic. - PR description)?

It causes operations to abort too.
If it didn't, the xxxCanceledByDispose tests would fail.

but I think first we need to prove for 100% that the hang is caused by this specific kernel bug.

If you run this code, the bug should cause it to print errno EAFNOSUPPORT (97) as mentioned in the fix.

```c#
using System;
using System.Net;
using System.Net.Sockets;
using System.Runtime.InteropServices;

namespace console
{
    unsafe class Program
    {
        static void Main(string[] args)
        {
            Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            s.Connect("www.microsoft.com", 443);

            // Connect to AF_UNSPEC
            const int AddressLength = 16;
            byte* address = stackalloc byte[AddressLength]; // note: AF_UNSPEC is zero.
            int rv = connect((int)s.Handle, address, AddressLength);
            if (rv == -1)
            {
                int errno = Marshal.GetLastWin32Error();
                Console.WriteLine($"fail, errno is {errno}");
            }
            else
            {
                Console.WriteLine("success");
            }
        }

        [DllImport("libc", SetLastError = true)]
        public static extern int connect(int socket, byte* address, uint address_len);
    }
}
```

5.0 supports RHEL 7+ (whatever is in support by RedHat)

The latest minor is the supported version.

@tmds thanks, I will try to compare behavior before/after 7.5.

5.0 supports RHEL 7+ (whatever is in support by RedHat)

The latest minor is the supported version.

@danmosemsft does this mean that if we can prove this is a kernel bug, we can close? Or shall we at least document a known issue here?

Something doesn't add up. Even with the timeout/retry, the CanceledByDispose tests would have to pass at least one iteration successfully to report success.
Though if it is caused by the Linux issue it should never succeed.

Let's first see if the CI machines have this Linux bug.

The hangs were intermittent on these systems.

that if we can prove this is a kernel bug, we can close? Or shall we at least document a known issue here?

That is for your team to decide. Wrt docs, I don't think we do that unless it's commonly hit by customers and it appears to be a bug in .NET?

@tmds I pushed a test case based on your repro code to #42741, but no tests failed in CI:
https://dev.azure.com/dnceng/public/_build/results?buildId=831247

Will be OOF 28-02, happy to help investigate when I'm back.

I've installed RHEL 7.5, and my kernel is 3.10.0-862.el7. Unfortunately the kernel doesn't have the bug I mentioned.

I wrote a small app that should reproduce the issue:

```c#
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Net;
using System.Net.Sockets;

namespace console
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true)
            {
                var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

                Task connectTask = client.ConnectAsync(new IPEndPoint(IPAddress.Parse("1.1.1.1"), 23));

                // Wait a little so the operation is started.
                Thread.Sleep(100);
                Task disposeTask = Task.Run(() => client.Dispose());

                try
                {
                    Task.WaitAll(connectTask, disposeTask);
                }
                catch
                { }

                Console.Write(".");
            }
        }
    }
}
```

So far, it is running without getting stuck.

I can provide assistance with analyzing the dumps (getting correct runtime, setting up dotnet dump etc.).

@ManickaP what are the proper steps to load these coredumps?

I'm using dotnet dump to analyze the dumps: https://github.com/dotnet/diagnostics/blob/master/documentation/dotnet-dump-instructions.md

When you download the coredump, you'll also need the runtime bits with which it failed:

  1. Download JobList.json https://helix.dot.net/api/jobs/<job-id>/details?api-version=2019-06-17. Replace job-id with yours, e.g.: https://helix.dot.net/api/jobs/4eab1239-6bc1-4177-bd43-5b7af0c2725b/details?api-version=2019-06-17, find jobList: and download it.
  2. Find your test bits in the job list: "WorkItemId": "System.Net.Sockets.Tests". There's CorrelationPayloadUrisWithDestinations containing runtime bits (.../test-runtime-net5.0-Linux-Debug-x64.zip?...) and "PayloadUri": containing test dll bits. Download them both.

Extract both into their own directories. The extracted files all have 0 permissions; you need to at least give them read access (chmod +r *). We're interested in the contents of test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0 and System.Net.Sockets.Tests.

  1. Load the dump: dotnet dump analyze core.1000.14431
  2. Load runtime (setclrpath <your-full-path-runtime-bits>): setclrpath /home/manicka/Downloads/test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
  3. Test that runtime is loaded: clrmodules --> should print list of dlls
  4. Load PDBs: setsymbolserver -directory /home/manicka/Downloads/test-runtime-net5.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
    setsymbolserver -directory /home/manicka/Downloads/System.Net.Sockets.Tests
    --> this should give you file names and line numbers in clrstack.
  5. Use clrthreads, clrstack, dumpasync etc. to analyze the problem

Some more docs: https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md

This is intentionally written in greater detail for a wider audience.

So far, it is running without getting stuck.

After 50k+ iterations, it still isn't blocked.
Does the issue happen often in CI?

I will take a look at the coredumps. Thank you @ManickaP for these clear instructions!

The dumps show what is reported.

```
00007F70B2503F10 00007f70b9213a8d [InlinedCallFrame: 00007f70b2503f10] Interop+Sys.Connect(System.Runtime.InteropServices.SafeHandle, Byte*, Int32)
00007F70B2503F10 00007f703e1073f3 [InlinedCallFrame: 00007f70b2503f10] Interop+Sys.Connect(System.Runtime.InteropServices.SafeHandle, Byte*, Int32)
00007F70B2503F00 00007F703E1073F3 ILStubClass.IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle, Byte*, Int32)
00007F70B2503FF0 00007F703E1071FF System.Net.Sockets.SocketPal.TryStartConnect(System.Net.Sockets.SafeSocketHandle, Byte[], Int32, System.Net.Sockets.SocketError ByRef) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs @ 601]
00007F70B2504080 00007F703E106DF8 System.Net.Sockets.SocketAsyncContext.Connect(Byte[], Int32) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1425]
00007F70B25040F0 00007F703E1065C5 System.Net.Sockets.SocketPal.Connect(System.Net.Sockets.SafeSocketHandle, Byte[], Int32) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs @ 1037]
00007F70B2504150 00007F703DFED371 System.Net.Sockets.Socket.DoConnect(System.Net.EndPoint, System.Net.Internals.SocketAddress) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 4297]
00007F70B2504280 00007F703DFE3A21 System.Net.Sockets.Socket.Connect(System.Net.EndPoint) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 905]
00007F70B2504380 00007F703DFE9979 System.Net.Sockets.Socket.Connect(System.Net.IPAddress[], Int32) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 1005]
00007F70B25044C0 00007F703DFE4A01 System.Net.Sockets.Socket.Connect(System.String, Int32) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 966]
00007F70B2504550 00007F703DFE39A8 System.Net.Sockets.Socket.Connect(System.Net.EndPoint) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 891]
00007F70B2504650 00007F703DFE34B9 System.Net.Sockets.Tests.SocketHelperArraySync+<>c__DisplayClass3_0.<ConnectAsync>b__0() [/_/src/libraries/System.Net.Sockets/tests/FunctionalTests/SocketTestHelper.cs @ 53]
00007F70B2504670 00007F703DFE332B System.Threading.Tasks.Task.InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2370]
00007F70B25046B0 00007F703DFE3211 System.Threading.Tasks.Task+<>c.<.cctor>b__277_0(System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2358]
00007F70B25046E0 00007F703DFE3056 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 274]
00007F70B2504750 00007F703DFE2A87 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2320]
00007F70B2504820 00007F703DFE2600 System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2252]
00007F70B2504850 00007F703DFE21CF System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2243]
00007F70B2504870 00007F703DFDECC1 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 641]
00007F70B25048E0 00007F703DFDEB49 System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() [/_/src/coreclr/src/System.Private.CoreLib/src/System/Threading/ThreadPool.CoreCLR.cs @ 29]
00007F70B2504C60 00007f70b7bc9fe7 [DebuggerU2MCatchHandlerFrame: 00007f70b2504c60] 
OS Thread Id: 0x3db8
        Child SP               IP Call Site
00007F70B11C10F8 00007f70b9210cf2 [HelperMethodFrame: 00007f70b11c10f8] System.Threading.Thread.SleepInternal(Int32)
00007F70B11C1240 00007F703E11696F System.Threading.SpinWait.SpinOnceCore(Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/SpinWait.cs @ 242]
00007F70B11C12D0 00007F703E1124F7 System.Net.Sockets.SafeSocketHandle.CloseAsIs(Boolean) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.cs @ 113]
00007F70B11C13C0 00007F703E111AEE System.Net.Sockets.Socket.Dispose(Boolean) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 4398]
00007F70B11C1600 00007F703E110EA7 System.Net.Sockets.Socket.Dispose() [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 4467]
00007F70B11C1650 00007F703E110D78 System.Net.Sockets.Tests.TelemetryTest+<>c__DisplayClass10_1.<EventSource_SocketConnectFailure_LogsConnectFailed>b__3() [/_/src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs @ 190]
00007F70B11C1670 00007F703DFE332B System.Threading.Tasks.Task.InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2370]
00007F70B11C16B0 00007F703DFE3211 System.Threading.Tasks.Task+<>c.<.cctor>b__277_0(System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2358]
00007F70B11C16E0 00007F703DFE3056 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 274]
00007F70B11C1750 00007F703DFE2A87 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2320]
00007F70B11C1820 00007F703DFE2600 System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2252]
00007F70B11C1850 00007F703DFE21CF System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2243]
00007F70B11C1870 00007F703DFDECC1 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 641]
00007F70B11C18E0 00007F703DFDEB49 System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() [/_/src/coreclr/src/System.Private.CoreLib/src/System/Threading/ThreadPool.CoreCLR.cs @ 29]
00007F70B11C1C60 00007f70b7bc9fe7 [DebuggerU2MCatchHandlerFrame: 00007f70b11c1c60] 
```

There is an ongoing operation (Interop+Sys.Connect in this dump), and another thread is trying to unblock that operation in the CloseAsIs loop:

https://github.com/dotnet/runtime/blob/98eee0587dfd0752c030fe49b3fd9b45f4b2e5c5/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.cs#L104-L114
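
For readers unfamiliar with that code path, here is a rough self-contained model (my own simplification, not the runtime source linked above) of the pattern visible in the dumps: Dispose issues an unblock attempt and then spin-waits for the blocked thread to return, so if the unblock has no effect the spin never terminates.

```c#
using System;
using System.Threading;

class CloseAsIsModel
{
    // Models "a synchronous Connect/Accept is still in flight".
    static volatile bool s_operationPending = true;

    static void Main()
    {
        var blockedThread = new Thread(() =>
        {
            // Models the blocked syscall. With a working kernel the unblock
            // (connect(AF_UNSPEC)) makes it return; with the RHEL7 bug it never does.
            Thread.Sleep(500);
            s_operationPending = false;
        });
        blockedThread.Start();

        // Models Dispose -> SafeSocketHandle.CloseAsIs.
        TryUnblockSocket(); // hypothetical stand-in for the AF_UNSPEC connect
        SpinWait sw = default;
        while (s_operationPending)
        {
            sw.SpinOnce(); // the SpinOnceCore/SleepInternal frames seen in the dumps
        }

        Console.WriteLine("Dispose completed.");
    }

    static void TryUnblockSocket()
    {
        // No-op in this model; the real code asks the kernel to abort the pending operation.
    }
}
```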

I think the dump won't tell us more. If we could reproduce this, it would be interesting to see what happens when TryUnblockSocket gets called. My attempts to reproduce it using the simple application I shared earlier were unsuccessful.

Because it is reported only on RHEL7 (and the derived CentOS 7), I think it makes sense to upgrade to the latest version and postpone further investigation until the issue occurs again. I'd like to avoid spending too much time on this when the root cause may be fixed already.

@tmds let me know if I'm missing something, but I think your repro code in https://github.com/dotnet/runtime/issues/42686#issuecomment-701964046 is not testing the right thing. We are facing this issue only with synchronous connect, async is fine.

Any chance you can retest the same logic with synchronous Socket.Connect on your 7.5 environment? Your kernel version seems to be identical to the CI one.

@tmds let me know if I'm missing something, but I think your repro code in #42686 (comment) is not testing the right thing. We are facing this issue only with synchronous connect, async is fine.

oops, thanks for catching this. I hope it will reproduce now.

I'm replacing

Task connectTask = client.ConnectAsync(new IPEndPoint(IPAddress.Parse("1.1.1.1"), 23));

by

Task connectTask = Task.Run(() => client.Connect(new IPEndPoint(IPAddress.Parse("1.1.1.1"), 23)));

I hope it will reproduce now.

It hasn't after 100k+ iterations.
What test hung in CI? I can try running that in a loop.

This is the test (disabled now for RH-family systems):
https://github.com/dotnet/runtime/blob/e17d818e25567d7fa9525ae5d32e2b8d2483ae96/src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs#L169-L211

I can try running that in a loop.

The failure is sporadic, but happens in more than half of the cases actually.

I have a simpler unit test repro in #42741. This thing fails quite consistently in CI:
https://github.com/dotnet/runtime/blob/829c484300a7fde0e3f3448d1b285a5f51cc5f07/src/libraries/System.Net.Sockets/tests/FunctionalTests/Connect.cs#L214-L252

@tmds looks like a slightly modified version of your original code will repro the hang consistently:
```C#
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Net;
using System.Net.Sockets;

namespace HangSockets
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true) // EDIT: no need for the loop actually
            {
                // Equivalent of:
                // var client = new Socket(AddressFamily.InterNetworkV6, SocketType.Stream, ProtocolType.Tcp) { DualMode = true };
                var client = new Socket(SocketType.Stream, ProtocolType.Tcp);

                Task connectTask = Task.Run(() => client.Connect(new IPEndPoint(IPAddress.Parse("1.1.1.1"), 23)));

                // Wait a little so the operation is started.
                Thread.Sleep(100);
                Task disposeTask = Task.Run(() => client.Dispose());

                try
                {
                    Task.WaitAll(connectTask, disposeTask);
                }
                catch { }

                Console.Write(".");
            }
        }
    }
}
```
The difference is that the socket has to be a dual-mode IPv6 socket. The hang also occurs on RHEL 7.9. It would be nice to verify whether this is also the case on RHEL 8.

According to the results in #42741, this hangs across various other distros with older kernels, not only Red Hat:
https://dev.azure.com/dnceng/public/_build/results?buildId=842777&view=ms.vss-test-web.build-test-results-tab

@karelz @geoffkizer this seems to be a real product issue occurring on systems we do officially support.

@tmds looks like a slightly modified version of your original code will repro the hang consistently:

Awesome! I've reported a bug in Red Hat Bugzilla for it: https://bugzilla.redhat.com/show_bug.cgi?id=1886305.

Now that we can reproduce the issue, I see that the connect to AF_UNSPEC is returning an error (EINVAL). We can use that to stop spinning in Dispose.
@antonfirsov do you want to make this change? Or shall I make a PR for it?

We can escape the loop in Dispose, but that still leaves the connect hanging. That will require a fix in the kernel.

@tmds thanks for investigating! I'm happy to work on the PR.

FYI on CI, this fails with the following distros / kernels:
| distro | uname -r |
|---|---|
|Ubuntu 16.04 | 4.15.0-1095-azure|
|sles-15-sp1 gen1 | 4.12.14-8.22-azure|
|RHEL 7.5 | 3.10.0-862.43.1.el7.x86_64|
|CentOS 7.5 | 3.10.0-862.11.6.el7.x86_64|

Can you check if the bug is also present on RHEL 8? I think there is a chance, since the kernel is 4.1* AFAIK.
Does it make sense to forward this info to the bugzilla ticket? (I can't access it.)

I'm happy to work on the PR.

I think the Dispose may still hang when we call close. We'll find out.

Can you check if the bug is also present on RHEL 8? I think there is a chance, since the kernel is 4.1* AFAIK.
Does it make sense to forward this info to the bugzilla ticket? (I can't access it.)

I've reported it against RHEL7. If it is needed, it will also be fixed for RHEL8. We don't need to check.

Thanks for the analysis @antonfirsov @tmds!
Should we fix it only in 6.0 and wait for customers to report it against 5.0 before porting it?
Or do we expect customers to run into the problem more often? (and therefore we should service it proactively in 5.0)

@karelz this isn't a regression. .NET 5 improved how Socket.Dispose works on non-Windows in https://github.com/dotnet/corefx/pull/38804. This improvement isn't working on RHEL7 due to the kernel bug. The tests we see failing in CI were not there for previous versions of .NET Core.

I think we can't fully work around the kernel issue. We'll see what is possible when Anton looks into it further.

The bugzilla issue is now public: https://bugzilla.redhat.com/show_bug.cgi?id=1886305.

@tmds looks like shutdown will abort the connection on RHEL7 in case connect(AF_UNSPEC) fails. I will raise a PR on Monday; I want a full understanding of the tests in #42725 first.

@karelz the issue also repros with 3.1, so this is definitely not a regression.

I don't understand why we aren't calling shutdown here anyway. That should abort any pending operations and fail any subsequent ones. @tmds do you know?

@antonfirsov Is that what you're proposing -- call shutdown here?

I don't understand why we aren't calling shutdown here anyway. That should abort any pending operations and fail any subsequent ones. @tmds do you know?

connect AF_UNSPEC causes a TCP RST close, while shutdown causes a TCP FIN close. The difference is observable by the peer. We prefer the first because it matches Windows behavior.

@antonfirsov Is that what you're proposing -- call shutdown here?

Anton and I discussed this, and the idea is to call shutdown as a fallback when connect AF_UNSPEC fails.

For RHEL7 (and derivatives like CentOS7), the connect AF_UNSPEC issue will be fixed in an update. For most systems, connect AF_UNSPEC works fine already.
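
A minimal sketch of that fallback idea (my own illustration using raw P/Invoke in the style of the earlier probe program, not the actual PR): prefer the AF_UNSPEC connect so the abort matches Windows (RST), and only fall back to shutdown when it fails, e.g. with EINVAL on the buggy RHEL7 kernel.

```c#
using System;
using System.Net.Sockets;
using System.Runtime.InteropServices;

static unsafe class UnblockFallbackSketch
{
    [DllImport("libc", SetLastError = true)]
    private static extern int connect(int socket, byte* address, uint address_len);

    [DllImport("libc", SetLastError = true)]
    private static extern int shutdown(int socket, int how);

    private const int SHUT_RDWR = 2;

    public static void TryUnblock(Socket socket)
    {
        // An all-zero sockaddr means sa_family == AF_UNSPEC.
        const int AddressLength = 16;
        byte* address = stackalloc byte[AddressLength];

        if (connect((int)socket.Handle, address, AddressLength) == -1)
        {
            int errno = Marshal.GetLastWin32Error();
            Console.WriteLine($"connect(AF_UNSPEC) failed (errno {errno}); falling back to shutdown()");
            shutdown((int)socket.Handle, SHUT_RDWR); // FIN-style close, but still unblocks pending calls
        }
    }
}
```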

It seems weird to me that aborting pending operations causes a RST. If the user wants to send a RST, they should call Close(0). If they don't, we shouldn't be forcing a RST here.

We prefer the first because it matches Windows behavior.

"Matches Windows behavior" in what sense? It looks to me like we are forcing the RST ourselves on Windows as well. See here: https://github.com/dotnet/corefx/commit/9b2f7a834691a3bbad2949181eec707dd4b5a13b#diff-290750fd8fd72a2c104c587fc11c3f2eR182

It seems weird to me that aborting pending operations causes a RST. If the user wants to send a RST, they should call Close(0). If they don't, we shouldn't be forcing a RST here.

If a send gets aborted, the peer observes this instead of wrongly assuming it received everything.

The user can call Shutdown explicitly if they want to do a FIN close.
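
For context, a short illustration (mine, summarizing the two explicit options mentioned in this exchange) of how a caller chooses between the two close styles today:

```c#
using System.Net.Sockets;

static class CloseStyles
{
    // Graceful FIN close: the peer sees an orderly end of stream.
    public static void CloseWithFin(Socket s)
    {
        s.Shutdown(SocketShutdown.Both);
        s.Close();
    }

    // Abortive RST close: a zero linger timeout makes close() reset the connection.
    public static void CloseWithRst(Socket s)
    {
        s.LingerState = new LingerOption(true, 0);
        s.Close();
    }
}
```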

"Matches Windows behavior" in what sense?

afaik this also behaves like this on .NET Framework.

I'm also unclear why we don't just call shutdown instead of doing the spin wait thing. Shutdown will stop new operations from starting. So why not just call that instead of spin-waiting?

If a send gets aborted the peer observes this instead of maybe wrongly assuming it received everything.

Maybe. But there's an inherent race here. If the close happens just before the send, then you won't see a RST. It's not clear to me why it's important to send a RST in this case but not in that case.

Regardless, I'm not sure I understand your argument here. Are you saying that sending RST is the right behavior, or are you saying we do this for historical reasons, to match some previous behavior?

In fact, isn't the issue below explicitly about not sending RST in this exact case?

https://github.com/dotnet/runtime/issues/29980

@wfurt do you have insight here?

@geoffkizer the goal of the changes was to match the behavior on Windows as closely as possible.

If you want it to behave differently, I don't have a strong opinion about it (as long as Windows and Linux behave as similarly as possible).

@geoffkizer even if we want to change the behavior, that's a breaking change, and definitely 6.0 only. Wouldn't it be wise to separate this discussion from the scope of this particular issue, and make a decision with more people involved?

In the meantime, we can provide a relatively small fix for this hang that can still be backported to 5.0.

even if we want to change the behavior, that's a breaking change, and definitely 6.0 only. Wouldn't it be wise to separate this discussion from the scope of this particular issue, and make a decision with more people involved?

Yes, agreed. I filed a new issue for this: https://github.com/dotnet/runtime/issues/43385

Thoughts welcome.

@karelz @antonfirsov @geoffkizer the RHEL7 kernel issue will be fixed in a patch for 7.9. It will probably be available in November (or January at the latest). I don't consider this a major issue because things are not working any worse than before.

Note that POSIX does not specify a way to abort ongoing synchronous calls. The implementation relies on OS specifics which work well on Linux, and less well on some other OSes. For the best cross-platform behaviour, applications should avoid using Dispose to abort ongoing operations.
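
As an illustration of that recommendation (assuming the cancellable ConnectAsync(EndPoint, CancellationToken) overload available since .NET 5), cancellation tokens give a portable way to abort a pending connect without relying on Dispose:

```c#
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

class CancellableConnect
{
    static async Task Main()
    {
        using var client = new Socket(SocketType.Stream, ProtocolType.Tcp);
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));

        try
        {
            // Cancels cleanly instead of blocking a thread and a Dispose call.
            await client.ConnectAsync(new IPEndPoint(IPAddress.Parse("1.1.1.1"), 23), cts.Token);
            Console.WriteLine("Connected.");
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Connect was canceled by the token.");
        }
        catch (SocketException e)
        {
            Console.WriteLine($"Connect failed: {e.SocketErrorCode}");
        }
    }
}
```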
