Runtime: Http Test hang/failure: MaxConnectionsPerServer_WaitingConnectionsAreCancelable

Created on 28 Aug 2018  路  22Comments  路  Source: dotnet/runtime

https://mc.dot.net/#/user/filipnavara/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/68ed63990b09ad9ceea911374da90acb1b042a89/workItem/System.Net.Http.Functional.Tests/wilogs currently show timeouts being hit on

  • RedHat.74.Amd64.Open-x64:Release
  • Centos.74.Amd64.Open-x64:Release

The failed runs all indicate that the xunit process made progress, then stopped printing to the console, and 20 minutes later the executor terminated the process. The logs show the automated reruns happened on different machines, and reproduced.

Snippets from one of the runs console.log:

2018-08-28 09:37:46,047: INFO: proc(54): run_and_log_output: Output: xUnit.net Console Runner v2.4.1-pre.build.4059 (64-bit .NET Core 4.6.26827.04)
2018-08-28 09:37:46,142: INFO: proc(54): run_and_log_output: Output:   Discovering: System.Net.Http.Functional.Tests
2018-08-28 09:37:46,402: INFO: proc(54): run_and_log_output: Output:   Discovered:  System.Net.Http.Functional.Tests
2018-08-28 09:37:46,403: INFO: proc(54): run_and_log_output: Output:   Starting:    System.Net.Http.Functional.Tests
2018-08-28 09:37:47,133: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClient_SelectedSites_Test.RetrieveSite_Succeeds [SKIP]
2018-08-28 09:37:47,133: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsSelectedSitesTestEnabled"
2018-08-28 09:37:47,133: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClient_SelectedSites_Test.RetrieveSite_Debug_Helper [SKIP]
2018-08-28 09:37:47,134: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsSelectedSitesTestEnabled"
2018-08-28 09:37:47,134: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClient_SelectedSites_Test.RetrieveSite_Succeeds [SKIP]
2018-08-28 09:37:47,134: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsSelectedSitesTestEnabled"
2018-08-28 09:37:47,134: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClient_SelectedSites_Test.RetrieveSite_Debug_Helper [SKIP]
2018-08-28 09:37:47,134: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsSelectedSitesTestEnabled"
2018-08-28 09:37:47,493: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.MakeAndFaultManyRequests [SKIP]
2018-08-28 09:37:47,493: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:47,494: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.UnreadResponseMessage_Collectible [SKIP]
2018-08-28 09:37:47,494: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:47,494: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.SingleClient_ManyGets_Async [SKIP]
2018-08-28 09:37:47,494: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:47,495: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.CreateAndDestroyManyClients [SKIP]
2018-08-28 09:37:47,495: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:47,495: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.ManyClients_ManyPosts_Async [SKIP]
2018-08-28 09:37:47,495: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:47,495: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.ManyClients_ManyGets [SKIP]
2018-08-28 09:37:47,496: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:47,496: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientMiniStress.SingleClient_ManyGets_Sync [SKIP]
2018-08-28 09:37:47,496: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:48,960: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientEKUTest.HttpClient_ClientEKUServerAuth_Fails [SKIP]
2018-08-28 09:37:48,960: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestCertificates"
2018-08-28 09:37:48,960: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientEKUTest.HttpClient_NoEKUServerAuth_Ok [SKIP]
2018-08-28 09:37:48,961: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestCertificates"
2018-08-28 09:37:48,961: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientEKUTest.HttpClient_NoEKUClientAuth_Ok [SKIP]
2018-08-28 09:37:48,961: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestClientCertificates"
2018-08-28 09:37:48,961: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientEKUTest.HttpClient_ServerEKUClientAuth_Fails [SKIP]
2018-08-28 09:37:48,961: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestClientCertificates"
2018-08-28 09:37:48,962: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientEKUTest.HttpClient_ServerEKUClientAuth_Fails [SKIP]
2018-08-28 09:37:48,962: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestClientCertificates"
2018-08-28 09:37:48,962: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientEKUTest.HttpClient_NoEKUClientAuth_Ok [SKIP]
2018-08-28 09:37:48,962: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestClientCertificates"
2018-08-28 09:37:48,962: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientEKUTest.HttpClient_NoEKUServerAuth_Ok [SKIP]
2018-08-28 09:37:48,963: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestCertificates"
2018-08-28 09:37:48,963: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientEKUTest.HttpClient_ClientEKUServerAuth_Fails [SKIP]
2018-08-28 09:37:48,963: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "CanTestCertificates"
2018-08-28 09:37:49,923: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.MakeAndFaultManyRequests [SKIP]
2018-08-28 09:37:49,924: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:49,924: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.ManyClients_ManyPosts_Async [SKIP]
2018-08-28 09:37:49,924: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:49,924: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.SingleClient_ManyGets_Sync [SKIP]
2018-08-28 09:37:49,924: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:49,924: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.UnreadResponseMessage_Collectible [SKIP]
2018-08-28 09:37:49,925: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:49,925: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.ManyClients_ManyGets [SKIP]
2018-08-28 09:37:49,925: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:49,925: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.SingleClient_ManyGets_Async [SKIP]
2018-08-28 09:37:49,926: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:49,926: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientMiniStress.CreateAndDestroyManyClients [SKIP]
2018-08-28 09:37:49,926: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsStressModeEnabled"
2018-08-28 09:37:53,811: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientHandlerTest.Ctor_ExpectedDefaultPropertyValues_UapPlatform [SKIP]
2018-08-28 09:37:53,811: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsUap"
2018-08-28 09:37:53,925: INFO: proc(54): run_and_log_output: Output:     System.Net.Http.Functional.Tests.PlatformHandler_HttpClientHandlerTest.Ctor_ExpectedDefaultPropertyValues_UapPlatform [SKIP]
2018-08-28 09:37:53,925: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: "IsUap"

(log terminates)

And run_client.py:

2018-08-28 09:37:44,283: INFO: executor(112): run: Successfully downloaded work item payload
2018-08-28 09:37:44,283: INFO: executor(463): _execute_command: running chmod +x $HELIX_WORKITEM_PAYLOAD/*.sh && $HELIX_PYTHONPATH $HELIX_CORRELATION_PAYLOAD/RunnerScripts/scriptrunner/scriptrunner.py --script RunTests.sh $HELIX_CORRELATION_PAYLOAD in /home/helixbot/dotnetbuild/work/a01a5d92-9983-471e-a012-334328feca8f/Work/95b27b09-8f38-4fae-b2a7-a3e5f646d4b6/Exec max 1200 seconds
2018-08-28 09:38:28,804: INFO: servicebusrepository(86): renew_workitem_lock: Entering renew_workitem_lock for https://nethelix.servicebus.windows.net/centos.74.amd64.open/messages/89187500/6b012b36-ef7b-43c3-b6c1-1951676536ab
2018-08-28 09:38:28,914: INFO: saferequests(90): request_with_retry: Response complete with status code '200'

<snip />

2018-08-28 09:57:16,851: INFO: servicebusrepository(86): renew_workitem_lock: Entering renew_workitem_lock for https://nethelix.servicebus.windows.net/centos.74.amd64.open/messages/89187500/6b012b36-ef7b-43c3-b6c1-1951676536ab
2018-08-28 09:57:16,990: INFO: saferequests(90): request_with_retry: Response complete with status code '200'
2018-08-28 09:57:16,990: INFO: servicebusrepository(99): renew_workitem_lock: renew_workitem_lock Status Code: 200
2018-08-28 09:57:44,286: ERROR: job(44): kill: Job running for too long. Killing...
2018-08-28 09:57:44,301: ERROR: executor(474): _execute_command: Executor timed out after 1200 seconds and was killed.
area-System.Net.Http disabled-test test bug test-run-core

All 22 comments

We have seen these type of failures in networking nightly runs for quite some time. Mostly on Win7, but occasionally on OSX (more) and Linux (rarely).
I expect there is some failure condition leading to deadlock during the test runs.

@rmkerr @wfurt this seems to hit our CI heavily in last few days - @filipnavara mentioned 4 occurrences on 1 PR and at least 1x on each other of his PRs in last few days.
Do we have history in Kusto or on Max's PowerBI dashboard? I want to confirm how widespread it is, on which OS, and when it started. We may need to prioritize it ...

We analyzed failure frequency for last month and something indeed changed:

  • Starting 8/22 we started getting more of these failures on Centos.74, RedHat.74 and OSX.1013 (25-65 per day). Other OSs stayed the same (OSX.1012, Windows.10, Windows.81, Ubuntu.1604 - just 1 hit, other OSs had no hits at all).
  • Starting 8/28 we started getting slightly more failures on Debian.87, OpenSuse.423, Ubuntu.1404 and Ubuntu.1604 (2 failures of each on 8/28 and 1 failure of each on 8/29 - it may be related to specific PR?). [UPDATE] They are by-design, specific to PR dotnet/corefx#32006

    • Note: These new failures are in different test categories: System.Net.Requests.Tests and System.Net.Security.Tests

Thanks @rmkerr and @wfurt for the data.
@rmkerr can you please look at the 1st regression on Centos.74/RedHat.74/OSX.1013? That looks like something we should go after ...
We might want to check if there was any update to the OS images around 8/22.

That aligns with what I observed on my PRs. All the failures were on Centos.74, RedHat.74 and OSX.1013.

It may or may not be related, but it started at around the same time (+/- time zone differences) when the new Xunit 2.4 test runner was merged. I was able to reproduce the hangs locally on OS X 10.14, but so far only when parallel runs are enabled.

The hanging test on my machine is System.Net.Http.Functional.Tests.PlatformHandler_HttpClientHandler_Cancellation_Test.MaxConnectionsPerServer_WaitingConnectionsAreCancelable.

Thanks for the additional info! I'll take a look at that test.

Thanks @filipnavara for your tip about the updated xunit test runner. That might be very much it - different timing, exposing our long-existing problem.
Were you able to identify the hanging test more than once by a chance?

We should try to repro locally too. Then maybe run the same test massively in parallel to get easier repro ...

Yes, can easily reproduce it locally now. Hangs roughly 20% of the time, more if the tests run in parallel. Managed to track it to this specific line with some reasonable confidence (stuck on that line 10 out of 10 hangs):

https://github.com/dotnet/corefx/blob/master/src/System.Net.Http/tests/FunctionalTests/HttpClientHandlerTest.Cancellation.cs#L340

@filipnavara @rmkerr can you please try to disable the offending test to see if it helps?
If it helps noticeably to reduce the frequency, we should check it in to unblock CI runs. Thanks!

8/22-8/27 all failures were in System.Net.Http.Functional.Tests category

Starting with 8/28, some failures were also in other test categories: [UPDATE] They are by-design, specific to PR dotnet/corefx#32006

  • System.Net.Requests.Tests - 10 (2x RedHat.74, 2x Centos.74, 2x Ubuntu.1404, 2x OpenSuse.423, 1x Ubuntu.1604, 1x Debian.87)
  • System.Net.Security.Tests - 8 (2x Ubuntu.1604, 2x Debian.87, 1x Centos.74, 1x RedHat.74, 1x Ubuntu.1404, 1x OpenSuse.423)

@karelz I'll disable that test now.

@karelz Are all of the System.Net.Requests.Tests and System.Net.Security.Tests failures from my OpenSSL hybridization PR (#32006?). I wasn't seeing those hangs locally until this morning, when I rebased on master (notably, I just jumped across the same xunit runner upgrade); but it's entirely possible that my change is exacerbating something.

@bartonjs you're right, we can ignore those.
Kusto doesn't have easy mapping to PR (I will follow up to improve that). I checked manually and all the listed System.Requests.Tests and System.Net.Security.Tests hangs are from your PR dotnet/corefx#32006 - these 3 commits:

I updated the posts above to strike-through info about these unrelated hangs to avoid further confusion.

@bartonjs if you can repro the new hangs locally, can you please check which tests are affected? It seems that the new xunit runner made the tests more suspectible to the hangs. Maybe even bug in xunit? ... We will see.
@rmkerr we will need to get to the bottom of this as high-pri - it seems more tests are affected than just the one we are disabling and it will block more and more devs over time.

Did the tests run at all before the xunit upgrade? One of the reasons why the upgrade was made is that a lot of tests were incorrectly ignored and not run at all.

Another thing to watch out for is if the affected tests are only "async" ones or if it's a mix.

I just checked in on the data this morning, and it looks like disabling the test worked. This weekend we saw less than ten test runs time out. That's in contrast to almost a hundred last weekend.

I'm having a bit of trouble setting up a machine to repro the issue, but once I get that together I'll try to fix the test itself.

If it turns out to be a problem to reproduce it on your machine I can offer remote access to one of my machines. I have two of them that hang in the test 60% of the time. Just drop me an email.

@wfurt can you please load balance the root-causing? It may be related to dotnet/corefx#32262.

OK. I'll try to get repro locally.

It seems like it is still problem. form CI Logs:
[Long Running Test] 'System.Net.Http.Functional.Tests.PlatformHandler_HttpClientHandler_Cancellation_Test.MaxConnectionsPerServer_WaitingConnectionsAreCancelable', Elapsed: 00:17:56

Was this page helpful?
0 / 5 - 0 ratings