Powershell: macOS is hanging during webcmdlets tests

Created on 15 Nov 2017  路  30Comments  路  Source: PowerShell/PowerShell

Consistent on my macBook and also affecting Travis-CI runs.

I believe @JamesWTruher was investigating if there is a race condition between the test listener being ready and the test case not receiving a response.

Replace HttpListener tests:

  • [x] response (#5540)
  • [x] echo (#5840)
  • [x] redirect (#5872)
  • [x] linkheader (#5806)
  • [x] Disabled HttpListener from starting
Resolution-Fixed WG-Quality-Test

Most helpful comment

My test showed that port 8079 works. I put in #5520 to switch the port. That will resolve the AppVeyor fails, but not necessarily the macOS fails.

All 30 comments

cc @markekraus don't know if you have a mac to repro this

@SteveL-MSFT is this an example of what you are seeing? or are there actual it block fails?
https://travis-ci.org/PowerShell/PowerShell/jobs/301748042

Description: Invoke-WebRequest tests/

Name:        Error occurred in Describe block

message:

WebListener did not start before the timeout was reached.

stack-trace:

At /Users/travis/build/PowerShell/PowerShell/test/tools/Modules/WebListener/WebListener.psm1:91 char:13

If so, we can try increasing the time out on the start up timeout here:

https://github.com/PowerShell/PowerShell/blob/2f8e691e055fc16411dc254f492aa07e9a638b57/test/tools/Modules/WebListener/WebListener.psm1#L58

It's possible the addition of the 2 new TLS listener ports takes more time on macOS.

If you are getting this consistently on your system. run

Publish-PSTestTools
Set-Location ./test/tools/WebListener/bin/
dotnet WebListener.dll ../../Modules/WebListener/ServerCert.pfx password 8083 8084 8085 8086

See how long it takes to start up.

I will try to repro this myself.

ok.. I see something completely different in my macOS system. 1 in 3 launches of the web cmdlet tests I'm getting this:

[-] Error occurred in Describe block 10.66s
   HttpListener failed to respond
   At /Users/admin/git/PowerShell/test/tools/Modules/HttpListener/HttpListener.psm1:362 char:17
 [-] Error occurred in test script './test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1' 257ms
   Cannot index into a null array.
   at Describe, /Users/admin/git/PowerShell/src/powershell-unix/bin/Linux/netcoreapp2.0/osx.10.12-x64/publish/Modules/Pester/Functions/Describe.ps1: line 117

which is the HttpListener and not WebListener.

Then 1 in 2 runs that get past that error I get hung up on the Validate Invoke-WebRequest returns empty RelationLink property if there is no Link Header test and have to kill the shell. That test is making a call to the HttpListener.

I tried increasing the time out on HttpListener to 25, and it reduce the initialization fail to 1 in 5.. 鈽癸笍

I added a -TimeoutSec 5 to the Validate Invoke-WebRequest returns empty RelationLink property if there is no Link Header` and now that test never completely hangs.

But none of this matches the fails I'm seeing in Travis CI.

@JamesWTruher is currently investigating

Looks like the hang is related to the old httplistener, I'm going to convert tests to use the new weblistener and get rid of httplistener.

I'm confused as to why that started failing suddenly. The actual code for HttpListener hasn't been touched for months and I don't see how changing the copyright string would have affected anything. I would blame the WebListener changes being too many threads for CI, but, I see the same thing every now and then in my mac, and HttpListener loads and fails before WebListener even gets the chance to clash.

@markekraus it's puzzling why it started failing as the code for both HttpListener and the corresponding tests haven't been touched. The other suspect is that we picked up dotnetcore 2.0.4 servicing preview, but would need a simpler repro if it's an issue in corefx.

I could try building a private with dotnetcore 2.0.0 to at least eliminate whether it's corefx

update: still hangs on 2.0.0, so I guess that's good news, but then something changed on our side

I reset to beta.8 and rebuilt PSCore6, then ran the RC webcmdlet tests and still hung. At this point, the only thing that changed is macOS.

On CI Appveyor failed tests is "alias" tests - I think we should remove its (we have separate test set for aliases).

@iSazonov can you create a separate issue for that? The alias test isn't really the problem, it's the HttpListener failing to start.

@markekraus actually, the HttpListener is starting. When the test is hung, you can still send requests to the HttpListener and it will respond. However, I think the test case is stuck where it sent a request and waiting on a response. Need to continue investigating this. At this time, since this code is hitting an issue, it doesn't make sense to change it to use WebListener en masse. It might make sense to convert most of it for consistency and keep a test case around for covering this scenario.

@SteveL-MSFT I think we have several HttpListener related issues here. One is the macOS tests hanging. Another is macOS tests not starting. Those two may be related. The third is AppVeyor tests not starting.

On my mac, you are right, when it hangs I can still access the HttpListener. However, we have instances in Travis CI and on my mac where HttpListener is not even starting at all. in that instance, HttpListener is not accessible at all.

The timing of when the AppVeyor fails started seems to coincide with the changes they made in AppVeyor: https://www.appveyor.com/updates/2017/11/18/ . I'm not sure the AppVeyor fails are related to the macOS fails. The ones on AppVeyor are now consistent and every build since the 18th has failed for the same reason that @iSazonov reported.

For the AppVeyor fails, it looks like 8082 can't be used on AppVeyor. I'm not sure why because I can't find anything running on the port. https://ci.appveyor.com/project/markekraus/powershell/build/6.0.0-rc-6.

Do we want to roll back to the previous appveyor image while we are investigating?

well... for the AppVeyor issue we could roll back to the previous image, or we can change the port in this describe block... or we can change the tests to simply ensure the alias iwr and irm aliases are being exported for the commands. @iSazonov says we are testing for that somwhere else, but I'm not sure we are. if we are testing for that alias export, then this entire describe block can just be removed.

My test showed that port 8079 works. I put in #5520 to switch the port. That will resolve the AppVeyor fails, but not necessarily the macOS fails.

@markekraus you're awesome. it seems changing the port also fixed the macOS issue (on Travis-CI). not sure how they are related.

@SteveL-MSFT I'm not so sure that actually did anything on macOS. I'm still seeing the same rate of fails and hangs on my macOS with current master. *shrugs

@markekraus maybe I'm just getting lucky on Travis-CI with my PRs? perhaps they have different systems with different versions of macOS?

Maybe ask CIs teams what is best practice to use TCP ports? It would be nice to have a list of reserved ports for customers.

@iSazonov we could ask, but they may not even know. The tools and products they bake into their images don't even properly document this kind of thing 馃檨 . I can't blame the CIs completely for that. The port 8082 issue is even more odd because I could not even find evidence that it was being bound to by anything. The only thing I know for sue is that I could not bind to the port.

@SteveL-MSFT I think it is just luck. The fails are not 100%, but something like 1 in 10 and the hangs are 1 in 2 of of runs that don't fail. I'm pretty sure the fails are a CoreFX bug. I need to make a simple repro. Basically, when binding to a port with HttpListener it just doesn't always work.

The hang issue looks like it only occurs in the tests. It is probably a combination of calls being made to the HttpListener that is causing it. I can make 10,000 calls the the URI in a simple pester test and nothing goes wrong, yet in the Web Cmdlets test it hangs 1 in 2 runs. I added -verbose to the line in questions and IWR is waiting for a response from the server. If I put -TimeoutSec on the test it always times out. While it is hung on that I can do an IWR to the same URI from another console and it responds properly.

@markekraus would be great if you can work out a simple repro and open the issue in corefx repo

Maybe again crossgen?

Just tried both crossgen and non-crossgen on my macBook running 10.13.1 and hangs in both cases

Make sense to add some code hooks to log a status of threads?

Ok, I give up on this. I cannot reproduce the hangs at all outside of these tests. It must be a combination of things that cause the hang, but it is definitely hanging inside HttpListener. It gets the request, starts processing it and then never sends the response. It claims that it has sent the response, but tcpdump begs to differ. This could be something to do with running a child process, then pester, and then a runspace. If I start HttpListener in the foreground in another console, this never hangs and the tests complete successfully every time.

I think we just need to move all of the tests to WebListener. I've spent way too much time on this already, when it was always my plan to eventually move these tests over. As much as I would love to know the root cause to this, finding it is out of my depth at this point.

If we want to keep these from false failing or hanging in CI, I suggest these 2 tests be set to skip on macOS:

https://github.com/PowerShell/PowerShell/blob/71e588d223c393cc5d2c918bdef91edef236bf9c/test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1#L789

https://github.com/PowerShell/PowerShell/blob/71e588d223c393cc5d2c918bdef91edef236bf9c/test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1#L797

They are the only 2 that hang with any kind of consistency.

On the issue of it failing to start. I saw some weird behavior on mac where the dotnet binary would occasionally use local ports including 8080 when it was building TestExe and WebListener. I have no clue why it is doing that, I assume it has some kind of local language engine it is launching or something. in any case, occasionally when it used port 8080, it was not giving up the local port after it finished compiling. I cannot find any reference to dotnet using port 8080, so this is a bit of a mystery to me. It doesn't use 8080 every time and I don't see the same behavior on linux or windows. The failed start issue might easily be averted by using port 8078 instead of 8080.

@markekraus for now, I would be fine if you wanted to convert those tests to use the weblistener, but leave the httplistener code so we can investigate root cause later

5540 for replacing response tests, since these are the ones I see hanging in macOS the most. I also updated your initial comment to track the progress of replacing the the tests. When HttpListener is no longer starting in the tests, we can consider this issue closed.

5540 is merged to RC.2

Was this page helpful?
0 / 5 - 0 ratings