Intermittently I'll get a Packer crash with the crash log below.
The error happens intermittently; overall I have a Jenkins job executing parallel steps that bake two AMIs.
Packer version: 1.4.2
The running environment is RHEL 7 using Packer 1.4.2. We are specifically using the Packer Ansible community plugin for 2.4.x, and I do not believe we'll be able to upgrade off this connector plugin anytime soon. Python 2.7.5 is in use, and we are on Ansible 2.4.0.0.
==> amazon-ebs: Waiting for auto-generated password for instance...
amazon-ebs: It is normal for this process to take up to 15 minutes,
amazon-ebs: but it usually takes around 5. Please wait.
amazon-ebs:
amazon-ebs: Password retrieved!
==> amazon-ebs: Using winrm communicator to connect: 52.61.247.118
==> amazon-ebs: Waiting for WinRM to become available...
Build 'amazon-ebs' errored: unexpected EOF
==> Some builds didn't complete successfully and had errors:
--> amazon-ebs: unexpected EOF
==> Builds finished but no artifacts were created.
panic: runtime error: invalid memory address or nil pointer dereference
2019/10/14 20:49:48 packer: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x6ab196]
2019/10/14 20:49:48 packer:
2019/10/14 20:49:48 packer: goroutine 974 [running]:
2019/10/14 20:49:48 packer: net/http.redirectBehavior(0x36aab2d, 0x4, 0x0, 0xc000100e00, 0x0, 0x0, 0x0)
2019/10/14 20:49:48 packer: /usr/local/go/src/net/http/client.go:422 +0x26
2019/10/14 20:49:48 packer: net/http.(*Client).do(0xc00060fad0, 0xc000100e00, 0x0, 0x0, 0x0)
2019/10/14 20:49:48 packer: /usr/local/go/src/net/http/client.go:655 +0x2f1
2019/10/14 20:49:48 packer: net/http.(*Client).Do(...)
2019/10/14 20:49:48 packer: /usr/local/go/src/net/http/client.go:509
2019/10/14 20:49:48 packer: github.com/masterzen/winrm.clientRequest.Post(0x3bfa020, 0xc0004980e0, 0xc0000f8080, 0xc0002154c0, 0x0, 0x0, 0x0, 0x0)
2019/10/14 20:49:48 packer: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/http.go:82 +0x3e8
2019/10/14 20:49:48 packer: github.com/masterzen/winrm.ClientNTLM.Post(...)
2019/10/14 20:49:48 packer: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/ntlm.go:22
2019/10/14 20:49:48 packer: github.com/masterzen/winrm.(*Client).sendRequest(...)
2019/10/14 20:49:48 packer: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/client.go:102
2019/10/14 20:49:48 packer: github.com/masterzen/winrm.(*Client).CreateShell(0xc0000f8080, 0x0, 0x0, 0x0)
2019/10/14 20:49:48 packer: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/client.go:81 +0xa8
2019/10/14 20:49:48 packer: github.com/hashicorp/packer/communicator/winrm.New(0xc00015fb30, 0xc00015fb30, 0x1, 0xd)
2019/10/14 20:49:48 packer: /Users/mmarsh/Projects/packer/communicator/winrm/communicator.go:59 +0x180
2019/10/14 20:49:48 packer: github.com/hashicorp/packer/helper/communicator.(*StepConnectWinRM).waitForWinRM(0xc0002148c0, 0x3c41da0, 0xc0005b4780, 0xc000237c80, 0xc000780e40, 0x0, 0xc000802720, 0xc000197f60)
2019/10/14 20:49:48 packer: /Users/mmarsh/Projects/packer/helper/communicator/step_connect_winrm.go:133 +0x332
2019/10/14 20:49:48 packer: github.com/hashicorp/packer/helper/communicator.(*StepConnectWinRM).Run.func1(0x3c732a0, 0xc0005a2ba0, 0xc0002148c0, 0x3c41da0, 0xc0005b4780, 0xc000237c80, 0xc00047c370, 0xc00047c380, 0xc0000513b0)
2019/10/14 20:49:48 packer: /Users/mmarsh/Projects/packer/helper/communicator/step_connect_winrm.go:46 +0x76
2019/10/14 20:49:48 packer: created by github.com/hashicorp/packer/helper/communicator.(*StepConnectWinRM).Run
2019/10/14 20:49:48 packer: /Users/mmarsh/Projects/packer/helper/communicator/step_connect_winrm.go:44 +0x1a1
2019/10/14 20:49:48 [INFO] (telemetry) ending amazon-ebs
2019/10/14 20:49:48 ui error: Build 'amazon-ebs' errored: unexpected EOF
2019/10/14 20:49:48 machine readable: error-count []string{"1"}
2019/10/14 20:49:48 ui error:
==> Some builds didn't complete successfully and had errors:
2019/10/14 20:49:48 machine readable: amazon-ebs,error []string{"unexpected EOF"}
2019/10/14 20:49:48 ui error: --> amazon-ebs: unexpected EOF
2019/10/14 20:49:48 ui:
==> Builds finished but no artifacts were created.
2019/10/14 20:49:48 [INFO] (telemetry) Finalizing.
2019/10/14 20:49:48 /usr/local/bin/packer: plugin process exited
2019/10/14 20:49:49 waiting for all plugin processes to complete...
2019/10/14 20:49:49 /usr/local/bin/packer: plugin process exited
2019/10/14 20:49:49 /usr/local/bin/packer: plugin process exited
I'm experiencing the same issue. See the details below.
Packer intermittently crashes with the following error when building Windows Server 2012 R2 AMIs:
Build 'amazon-ebs' errored: unexpected EOF
It's intermittent. I have this running in a number of accounts and it occurs around 60% of the time.
Packer version: 1.4.3
Packer is running on a Windows Server 2016 instance that resides in the same VPC as the templates being deployed. I currently have 11 templates that are deployed simultaneously via PowerShell runspace jobs. This appears to only affect Windows Server 2012 R2 builds.
Packer output:
2019-10-15T11:00:54-04:00: amazon-ebs: It is normal for this process to take up to 15 minutes,
amazon-ebs: but it usually takes around 5. Please wait.
2019-10-15T11:05:34-04:00: amazon-ebs:
amazon-ebs: Password retrieved!
2019-10-15T11:05:34-04:00: ==> amazon-ebs: Using winrm communicator to connect: 10.94.130.137
2019-10-15T11:05:34-04:00: ==> amazon-ebs: Waiting for WinRM to become available...
2019-10-15T11:05:38-04:00: Build 'amazon-ebs' errored: unexpected EOF
==> Some builds didn't complete successfully and had errors:
--> amazon-ebs: unexpected EOF
Crash.log:
2019/10/15 11:05:38 packer.exe: 2019/10/15 11:05:38 [INFO] Attempting WinRM connection...
2019/10/15 11:05:38 packer.exe: 2019/10/15 11:05:38 [DEBUG] connecting to remote shell using WinRM
2019/10/15 11:05:38 packer.exe: panic: runtime error: invalid memory address or nil pointer dereference
2019/10/15 11:05:38 packer.exe: [signal 0xc0000005 code=0x0 addr=0x10 pc=0x6a852d]
2019/10/15 11:05:38 packer.exe:
2019/10/15 11:05:38 packer.exe: goroutine 630 [running]:
2019/10/15 11:05:38 packer.exe: net/http.redirectBehavior(0x3a3205b, 0x4, 0x0, 0xc00041c900, 0x0, 0x0, 0x0)
2019/10/15 11:05:38 packer.exe: /usr/local/go/src/net/http/client.go:422 +0x2d
2019/10/15 11:05:38 packer.exe: net/http.(*Client).do(0xc0003bbad0, 0xc00041c900, 0x0, 0x0, 0x0)
2019/10/15 11:05:38 packer.exe: /usr/local/go/src/net/http/client.go:655 +0x2f8
2019/10/15 11:05:38 packer.exe: net/http.(*Client).Do(...)
2019/10/15 11:05:38 packer.exe: /usr/local/go/src/net/http/client.go:509
2019/10/15 11:05:38 packer.exe: github.com/masterzen/winrm.clientRequest.Post(0x3fdfec0, 0xc000297f50, 0xc00011b000, 0xc00039f480, 0x0, 0x0, 0x0, 0x0)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/http.go:82 +0x3ef
2019/10/15 11:05:38 packer.exe: github.com/masterzen/winrm.ClientNTLM.Post(...)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/ntlm.go:22
2019/10/15 11:05:38 packer.exe: github.com/masterzen/winrm.(*Client).sendRequest(...)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/client.go:102
2019/10/15 11:05:38 packer.exe: github.com/masterzen/winrm.(*Client).CreateShell(0xc00011b000, 0x0, 0x0, 0x0)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/go/pkg/mod/github.com/masterzen/[email protected]/client.go:81 +0xaf
2019/10/15 11:05:38 packer.exe: github.com/hashicorp/packer/communicator/winrm.New(0xc0006cc320, 0xc0006cc320, 0x1, 0xd)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/Projects/packer/communicator/winrm/communicator.go:59 +0x187
2019/10/15 11:05:38 packer.exe: github.com/hashicorp/packer/helper/communicator.(*StepConnectWinRM).waitForWinRM(0xc000610a40, 0x402d7c0, 0xc00048bbc0, 0x4047b40, 0xc0006ac600, 0x16, 0x0, 0x459301, 0xc00037cc40)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/Projects/packer/helper/communicator/step_connect_winrm.go:134 +0x291
2019/10/15 11:05:38 packer.exe: github.com/hashicorp/packer/helper/communicator.(*StepConnectWinRM).Run.func1(0x4063520, 0xc00039e8a0, 0xc000610a40, 0x402d7c0, 0xc00048bbc0, 0x4047b40, 0xc0006ac600, 0xc00059f6d0, 0xc00059f6e0, 0xc00059f6f0, ...)
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/Projects/packer/helper/communicator/step_connect_winrm.go:46 +0x8d
2019/10/15 11:05:38 packer.exe: created by github.com/hashicorp/packer/helper/communicator.(*StepConnectWinRM).Run
2019/10/15 11:05:38 packer.exe: /Users/mmarsh/Projects/packer/helper/communicator/step_connect_winrm.go:44 +0x1de
2019/10/15 11:05:38 [INFO] (telemetry) ending amazon-ebs
2019/10/15 11:05:38 ui error: 2019-10-15T11:05:38-04:00: Build 'amazon-ebs' errored: unexpected EOF
2019/10/15 11:05:38 machine readable: error-count []string{"1"}
2019/10/15 11:05:38 ui error:
==> Some builds didn't complete successfully and had errors:
2019/10/15 11:05:38 machine readable: amazon-ebs,error []string{"unexpected EOF"}
2019/10/15 11:05:38 ui error: --> amazon-ebs: unexpected EOF
Are you also running multiple Packer builds from the same machine?
> Are you also running multiple Packer builds from the same machine?
Yes I am. When this job runs, I spawn 11 concurrent Packer processes.
okay, awesome. I wonder if there's some kind of WinRM conflict that only occurs once you get to a certain concurrency. I'll try to reproduce when I get a chance.
@SwampDragons I had the same thought, but I've had this running for 6+ months and had no issues until around a month ago, and it oddly only occurs when deploying 2012 R2 AMIs. Either way, if you would like to test the concurrent WinRM theory I'll gladly provide the code I use to spawn the jobs.
I'm happy to provide more info if you need it as well; just let me know and I can dig. Currently I'm still doing iterations automating our things, and the intermittent failures aren't near 50%; they're somewhat low, I guess, but I do see a handful per day. We're building Windows Server 2016 AMIs. The way we spawn jobs is just from sh in Jenkins pipeline scripts. Our Packer bakery was fired up just in the past month. I've seen it fail with just 2 concurrent Packer processes.
I just saw mine fail back to back with the same error with just 1 AMI being baked, so no concurrent Packer invocations that I can see from sh (no parallel Jenkins steps) with the project I'm working on this morning.
I've simply been testing/debugging some Ansible scripts that are run from Packer, sending them through our build process. I got the panic: runtime error: invalid memory address or nil pointer dereference error on the 8th and 9th attempts; the 10th attempt proceeded through. I'm running these one by one.
@SwampDragons - Have you had any luck with troubleshooting? Is there anything I can do to assist? I just tested with the latest version of Packer (1.4.5) and I still have intermittent failures when building 2012 R2 AMIs.
I've still never been able to reproduce, which makes this really hard to investigate. But I haven't given up.
Are either of you executing from within the Packer docker container? https://hub.docker.com/r/hashicorp/packer
I'm not able to use containers, so no Docker in play here; I'm running Packer on an EC2 instance that is running RHEL 7.x. I execute Packer via Jenkins sh.
The Packer 1.4.2 binary is placed under /usr/local/bin/ on this build machine.
@SwampDragons - Not yet. I have Packer running on a Windows Server 2016 instance, executed via PowerShell. I'd be more than happy to prepare a sample template and a script to spawn the jobs so you can reproduce this issue. Would that help?
That would be helpful. As simple as you can get it while still generating the issue would be best :) Thanks.
Also, anything relevant about your environment -- using proxies, etc.
Interestingly, that nil pointer dereference is happening inside the golang http library because it's hitting a nil pointer to a response, and I'm not sure how that even happens. I'm reading golang source right now to try to figure out a repro case.
No proxy in play on my end, just an EC2 instance inside a VPC in AWS. Both the machine running Packer and the target EC2 instance being baked by Packer reside in the same VPC in private subnets, and both sides are currently using t2.mediums for my group.
On the target EC2 instance being provisioned with Packer, in addition to Windows firewall rules for WinRM, we also have an AWS security group that allows just the WinRM traffic, and this security group is assigned to the instance by Packer. I don't know exactly how AWS implements security groups behind the scenes or whether that would really count as a proxy.
I don't know enough Golang, but is it possible that the AWS network is flaky or dropping out intermittently, or that WinRM is dropping suddenly while sending back the response?
Knowing you're running Packer from within AWS is useful; I've been trying to reproduce on my local computer -- I'll see if I can set up a similar VPC situation with the same instance types and security group. You've been able to reproduce with a single concurrent run; is it the same amount of flakiness for a single run or is it happening more often when the instance running Packer is performing more than one parallel build?
> I don't know enough Golang, but is it possible that the AWS network is flaky or dropping out intermittently, or that WinRM is dropping suddenly while sending back the response?
I've been trying all kinds of wacky things locally to constrain my resources, including bandwidth and network availability, and haven't been able to reproduce yet. But doing it from inside Amazon makes sense -- the t2.medium instances are labelled as having "low to moderate" network performance, whatever that means. Maybe there's a timeout or dropped data being caused there.
> You've been able to reproduce with a single concurrent run; is it the same amount of flakiness for a single run or is it happening more often when the instance running Packer is performing more than one parallel build?
I think it's about the same; I've seen it fail on back-to-back single runs like that as well. Recently I was baking maybe 10 at a time in parallel and suffered maybe 1 crash like the one reported.
I was actually thinking of trying some network destabilization on my end to see if I could repro, using a tool like netem to cause packet loss, delays, etc., or an iptables trick to drop 1% of packets, and see if this specific crashing increases.
I'll see if I can find out some more info on our AWS VPC and the environment. Another thing is that these machines being baked are not fresh AMIs from AWS; they originated from AWS, but our base AMIs that we launch with Packer get patched each month, and they also went through a STIG security compliance process initially. Let me know if there's anything specific on the target Windows machine I can give info on.
Trying to read what I can to help, I came across this Ansible GitHub issue 23320 about WinRM being sensitive to network issues. I guess overall, without knowing Golang or why the HTTP client response is nil, the only thought I have is that the connection is just suddenly closing or dropping. Big assumption on my part as to whether that's what is actually going on.
It's not ideal, but I could also try to build a version of Packer with tons of extra logging in our vendored WinRM modules, send it to you, and see if we can track the bug that way. If I can't get a repro case today I'll probably try that route.
Sounds good; I don't mind switching the Packer binary for some testing and getting you some logs if you go that route. Our CI/CD isn't in production mode yet. Keep in mind our setup might not be ideal: we're stuck using the Ansible community plugin for 2.4.x, Python 2.7.5 is in use, and we are on Ansible 2.4.0.0. If that's not a problem, I'll try it.
I don't think the ansible version is to blame here, so that should be fine. I'll give myself another day of tinkering to get a repro and then will generate a binary for you to test if that doesn't work out.
Apologies for the delay in response.
I'm not using a proxy either. Like Eric, I have Packer running in AWS and in the same subnet as the builders. I'm building a repro case for you and almost have it wrapped up. I also have not tried to reproduce this with a single build, as that would likely take more time, but I can if needed.
My repro case builds 30 AMIs using 6 different versions of the latest Server 2012 R2 AMIs (created by AWS) in an effort to increase the possibility of encountering a crash. I know it sounds crazy, but I don't like waiting :P
The last few tests have had 1-3 crashes, so you should see the same result within a few attempts. I should have it uploaded within the hour.
@SwampDragons - Attached is my repro case. Let me know if I left anything out or if you have any questions.
@colereynolds what kind of instance (like, t2.medium, etc) is Packer running on?
> @colereynolds what kind of instance (like, t2.medium, etc) is Packer running on?
I have it running on a c5.xlarge.
👍
@colereynolds I've run your repro 3 times, no crashes. What region are you in? Are you using a custom security group? What version of Packer are you on?
OMG IT JUST REPROD. 4th time's the charm
Using Cole's setup, I managed to reproduce once in 5 hours of nearly constant reruns, but I haven't been able to trigger the bug since I added some logging to the NTLM library to try to figure out what path this code was taking. FWIW, here's the patched binary I'm using: https://circleci.com/gh/hashicorp/packer/21251#artifacts/containers/0 in case either of you has the ability to run it and acquire some logs for me (the debug logging will only appear if you have set the env var PACKER_LOG=1).
Thanks, I'll place this binary in the morning and see if I can get you some logs from it.
Thanks, I appreciate it. I have a pretty strong guess about generally _what_ is happening, but I'm not clear on _where_. Hopefully the logging will answer that part.
@SwampDragons I just got the first bomb and a log with the new binary; here's the output from the Jenkins step. Hopefully this helps clear things up :) Let me know if you need anything else.
Yes! Thanks! tl;dr is that your logs are precisely what I needed to confirm where the bug occurs and I think this build shouldn't crash anymore: https://circleci.com/gh/hashicorp/packer/21394#artifacts/containers/0
In case you're interested in the why:
From your logs:
[2019-11-15T16:48:08.856Z] 2019/11/15 16:48:08 packer: 2019/11/15 16:48:08 res is: (*http.Response)(nil) and err is:
That's what I suspected was happening. This logging is coming from inside a special RoundTripper that we use only when someone is using NTLM. This RoundTripper is returning both a nil response and a nil error, which is an antipattern that Golang tells you not to do but doesn't really enforce in code. As a result, the http client that calls the RoundTripper isn't handling this situation properly: it assumes that since err is nil, resp is populated, and then chokes on the nil pointer it didn't expect.
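For anyone following along, here's a minimal, self-contained sketch of what breaking that RoundTripper contract looks like (badTransport is made up for illustration; it is not the Packer or winrm code). In the Go release those Packer builds were compiled with, the nil response reached the client's redirect handling (client.go:422 in the stack traces above) and blew up there; newer Go releases may catch the nil response earlier and return an error instead.

```go
package main

import (
	"fmt"
	"net/http"
)

// badTransport is a made-up RoundTripper that breaks the contract by
// returning a nil *http.Response together with a nil error.
type badTransport struct{}

func (badTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Exactly one of (resp, err) should be non-nil; returning (nil, nil)
	// leaves the calling http.Client with nothing it can safely inspect.
	return nil, nil
}

func main() {
	client := &http.Client{Transport: badTransport{}}
	resp, err := client.Get("http://example.invalid/")
	// Depending on the Go release, this either panics inside the client's
	// redirect handling (as in the crash logs above) or surfaces an error
	// complaining that the RoundTripper returned a nil response.
	fmt.Println(resp, err)
}
```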
And I'm 99% sure I've found the culprit within the RoundTripper; an "err" got redeclared in the scope of an if statement, then returned outside the scope of that statement, so while err probably isn't actually nil, it's not being returned. I've pushed another change to that PR; binaries of the new and improved and _probably_ crash-free build can be retrieved from our CI at the link above.
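To make the shadowing failure mode concrete, here's a minimal sketch of that pattern; the function and stub names are hypothetical, and this is not the upstream library's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
)

// needsAuthRetry and authenticatedRequest are hypothetical stand-ins for
// the real NTLM negotiation steps.
func needsAuthRetry() bool { return true }

func authenticatedRequest() (*http.Response, error) {
	return nil, errors.New("connection reset during NTLM handshake")
}

// roundTrip sketches the shadowing bug: ":=" inside the if block declares
// brand-new resp/err variables, so the named return values are never set.
func roundTrip() (resp *http.Response, err error) {
	if needsAuthRetry() {
		// BUG: ":=" shadows the outer resp and err for this block only.
		resp, err := authenticatedRequest()
		if err != nil {
			log.Printf("authenticated request failed: %v", err)
		}
		_ = resp // never escapes this block
	}
	// The outer resp and err were never assigned, so the caller receives
	// (nil, nil) -- a nil *Response with a nil error, which is exactly
	// what the http client then trips over.
	return resp, err
}

func main() {
	resp, err := roundTrip()
	fmt.Printf("resp=%v err=%v\n", resp, err) // prints: resp=<nil> err=<nil>
}
```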
I'll make a PR against the upstream library that contains the RoundTripper if you can confirm that the patched binary above doesn't crash anymore.
Awesome! Thanks for the explanation. I'll give the latest binary a try next chance I get and keep an eye on it from my end. Will let you know any findings after I run our AMI bakery "hard" for a few days.
hah! I went to open the PR only to find someone's already opened one for the same issue a couple of days ago (linked above). 🤞 it gets merged soon. @paulmey I've seen that you have some commits on that library; any chance you can bump the broken test which seems unrelated to me?
Thank you, @SwampDragons! I'll do the same and let you know if I encounter any crashes.
Update: They merged the PR upstream, so I've changed my PR from an investigation PR to a module upgrade PR.
@SwampDragons - Is this scheduled to be part of 1.5.0 release?
@nbshetty Yes; it is also in the nightly build of Packer.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.