Hi! I ran into an interesting scenario where useful error output from a shell provisioner was truncated:
Here is a provisioner snippet, note the set -e
:
{
"type": "shell",
"inline": [
...<truncated>...
"sudo -u service_user bash <<'DEPLOY_BASH'",
"set -e",
...<truncated>...
"ln -nfs ${RELEASE_DIR} ${CURRENT_DIR}",
"cd ${CURRENT_DIR}",
"npm install",
...<truncated>...
"curl https://s3-us-west-2.amazonaws.com/static_assets/dist/dist.${SHA}.tar | tar -xv",
...<truncated>...
"DEPLOY_BASH"
]
}
This would yield a failure that was opaque. The output of the npm install
was truncated, leading us to believe the problem was with our NPM libraries. However the problem was later with the tar
command on a malformed tar file.
Here is the output with set -e
.
https://gist.github.com/benmullin333/fd84c941f1a81f0e3048c955e2f1585a
Here is the output with the same provisioner, _without_ the set -e
:
https://gist.github.com/benmullin333/bbe195dce4002cde47a7ac245206b1b5
I'm not sure if this falls under "user error" or not. However, it would have been useful if someone had shared this information to save time.
Please note that the behavior was identical using packer version 0.12.3
This was on aws linux.
Have same issue when trying to see the content of the directory on the AMI being built.
This will lead to a truncation at the end of the provisioners section
{
"type": "shell",
"inline": [
"set -x",
"sudo ls -la /path/to/a/directory/with/a/lot/of/entries"
]
}
Adding a sleep 1
at the end of the provisioner will output the full list
{
"type": "shell",
"inline": [
"set -x",
"sudo ls -la /path/to/a/directory/with/a/lot/of/entries",
"sleep 1"
]
}
Adding a sleep
at the end of the provisioner will output the full list
hmm, I suspect the program exits and causes packer to start shutting everything down while the output is still buffered and being sent to the client. A sleep at the end would definitely solve this for now.
We might be able to fix by blocking remotecmd from returning until we see an EOF in the output
I'm currently facing this issue and is becoming a big pain to debug some faulty bakes. I can't use the sleep 1
solution because we are provisioning via a script file which is failing somewhere - because I'm not getting the full logs it gets hard to debug.
I've compile the packer code with a small hack that somehow solved my issue. It's really an hack, which basically puts the sleep inside the packer code. It can be found here: https://github.com/PerGon/packer/commit/cda29b8cebadb23c880e3c25a7bb4f3d3c454e5c
I've already spent quite some time trying to figure out how to deal with the Go channels, the buffer and what not, but my limited knowledge of Go is holding me back (first time I compiled Go code :D ).
@mwhooker do you know if this bug will be in roadmap any time soon? Or, if you could give me some more direction on how to fix it, I can give it a try.
(I tried to find a way to check for the EOF on the output of remotecmd, but no luck :/ )
Similar symptoms in https://github.com/hashicorp/packer/issues/5050
@apogrebnyak thanks for the comment -- it allowed me to build a really reliable repro case so I could resolve this. @PerGon thanks for sharing your hack -- it helped me narrow down where the issue was. Can you verify for me whether https://github.com/hashicorp/packer/pull/5089 solves this for you?
Hi @SwampDragons - this is great news :heart_eyes_cat:
I'll test you patch right away and will be back with the results. Thanks
Hi @SwampDragons ,
I'm sad to inform that I'm pretty much sure that this doesn't fix our issues :( it took me a while to verify because it's not deterministic - sometimes I get all the logs, sometimes, the exact same bake doesn't get me all the logs.
Today I'm off, but tomorrow I'll try to get a simple enough use case that hopefully can be replicated on your side.
My Go foo is low, but I wonder if the SSH communicator inherits from the RPC communicator. Both have similar methods like Start
, Upload
, Download
, etc. Since I'm using the SSH communicator and I see the issue is not totally solved, I wonder if the fix in the RPC communicator shouldn't also be implemented in the SSH communicator.
Thanks!
@PerGon Yeah, the rpc communicator kind of wraps the ssh communicator, and the race I'm addressing in this patch is in the rpc communicator; I couldn't see any in the ssh communicator proper, and this patch does address the test case I generated that uses the ssh communicator.
Now I wonder if I placed the wait too late in the code and your case is activating this path here: https://github.com/hashicorp/packer/blob/9ee97aaa2a26e1401a59829b5104012aeac65bbe/packer/rpc/communicator.go#L105-L110
What happens if we move the wait from here: https://github.com/hashicorp/packer/blob/9ee97aaa2a26e1401a59829b5104012aeac65bbe/packer/rpc/communicator.go#L116 to between lines 104 and 105?
@PerGon I pushed the change described above... any improvement?
Hi @SwampDragons,
I think we got something :) I've run a bunch of bakes and got logs for all of them. So, on my side I believe I can call this fixed
Thanks a lot for your efforts 💪
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Most helpful comment
Hi @SwampDragons,
I think we got something :) I've run a bunch of bakes and got logs for all of them. So, on my side I believe I can call this fixed
Thanks a lot for your efforts 💪