Server: 0.6.4, Ubuntu 14.04, 64-bit
When consul lock gets a lock, it launches /bin/sh -c <child-process>, which then launches the actual child process. When it loses the lock, it kills /bin/sh but leaves <child-process> running.
Steps to reproduce:
1. Start consul
2. Run consul lock, e.g. consul lock locks/replicate consul-replicate -config /etc/consul-replicate.d
3. Stop consul to release the lock
4. Run ps aux | grep consul and see that the child process is still running.

Relevant log fragments:
ps aux | grep consul:
consul 2750 0.1 3.3 25220 16776 ? Ssl 20:52 0:04 /opt/consul/0.6.4/consul agent -config-file=/etc/consul/consul.json -config-dir=/etc/consul/conf.d
consul-+ 2868 0.0 2.6 19200 13028 ? Ssl 20:53 0:00 /opt/consul/0.6.4/consul lock -verbose locks/replicate consul-replicate -config /etc/consul-replicate.d
consul-+ 2875 0.0 0.1 4448 800 ? S 20:53 0:00 /bin/sh -c consul-replicate -config /etc/consul-replicate.d
consul-+ 2876 0.0 0.9 8296 4656 ? Sl 20:53 0:00 consul-replicate -config /etc/consul-replicate.d
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Starting handler 'consul-replicate -config /etc/consul-replicate.d'
Lock lost, killing child
Terminating child pid 2875
Error running handler: signal: terminated
signal: terminated
Child terminated
Lock release failed: failed to release lock: Put http://127.0.0.1:8500/v1/kv/locks/replicate/.lock?flags=3304740253564472344&release=c97cc39e-b5a7-0bd8-e8aa-baf9226a4ddb: dial tcp 127.0.0.1:8500: getsockopt: connection refused
Note: pid 2875 is the /bin/sh process
ps aux | grep consul afterwards:
consul-+ 2876 0.0 0.9 8296 4656 ? Sl 20:53 0:00 consul-replicate -config /etc/consul-replicate.d
consul 2972 0.4 2.8 23108 14236 ? Ssl 21:36 0:00 /opt/consul/0.6.4/consul agent -config-file=/etc/consul/consul.json -config-dir=/etc/consul/conf.d
The child process, pid 2876, is still running
Pid 2972 is a new consul agent process restarted by upstart.
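The behavior reported above can be reproduced without consul. What follows is a minimal Python sketch, not consul's code: it starts /bin/sh -c, lets the shell fork a child (sleep stands in for consul-replicate), sends SIGTERM to the shell only, and returns the pid of the child so it can be checked afterwards. The names leak_demo and is_running are made up for illustration, and the "&" / "wait" construct is there only so that the shell forks whether sh is dash or bash.

```python
import os
import subprocess


def leak_demo(seconds=30):
    """Kill only the shell started with `sh -c` and return the pid of its child."""
    # The shell forks `sleep`, prints the child's pid, then waits for it.
    script = f"sleep {seconds} & echo $!; wait"
    shell = subprocess.Popen(
        ["/bin/sh", "-c", script], stdout=subprocess.PIPE, text=True
    )
    child_pid = int(shell.stdout.readline())
    # SIGTERM goes to the shell pid only, as in "Terminating child pid 2875".
    shell.terminate()
    shell.wait()
    shell.stdout.close()
    return child_pid


def is_running(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering anything
    except ProcessLookupError:
        return False
    return True
```

Calling is_running(leak_demo()) is expected to report that the sleep process outlived the shell, which matches the ps output above. The leftover sleep has to be killed by hand.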
If you are seeing this on Ubuntu, it may be caused by "sh == dash". Switching from dash to bash as the sh interpreter fixed this behavior for me.
That looks like it did the trick.
It might be worthwhile to allow the shell to be overridden so that we don't have to change the default shell globally. https://wiki.ubuntu.com/DashAsBinSh claims that there are many speed improvements from using dash which are often incorrectly attributed to upstart.
Alternatively, you can ask the kernel to kill your process when its parent (shell) dies with prctl(PR_SET_PDEATHSIG, SIGKILL).
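As a rough illustration of the prctl suggestion, here is a Linux-only Python sketch that calls prctl through ctypes. The function names set_pdeathsig and exec_with_pdeathsig are hypothetical. The constant value 1 for PR_SET_PDEATHSIG comes from <linux/prctl.h>. The kernel clears this setting in children created by fork, so it has to be applied in the process that should die (or in a small wrapper that then execs it), not in the shell that sits in between.

```python
import ctypes
import os
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>


def set_pdeathsig(sig=signal.SIGKILL):
    """Ask the kernel to send `sig` to this process when its parent dies."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), 0, 0, 0)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def exec_with_pdeathsig(argv):
    """Hypothetical wrapper: set the flag on ourselves, then exec the real command."""
    parent = os.getppid()
    set_pdeathsig()
    if os.getppid() != parent:
        # The parent exited before prctl took effect, so no signal would arrive.
        sys.exit(1)
    # The setting survives execve for ordinary (non set-user-ID) binaries.
    os.execvp(argv[0], argv)
```

A wrapper script built on exec_with_pdeathsig would be what the shell launches in place of the real command, so that the real command is killed when the shell goes away.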
However, consul should not launch processes that it controls in a shell.
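To make that point concrete, here is a small Python sketch of launching a handler without an intermediate shell. It is an illustration only, not how consul is implemented, and start_handler and stop_handler are hypothetical names. Because the command is split into an argument list and executed directly, the pid that is tracked is the pid of the real child, so a signal sent on lock loss reaches it.

```python
import shlex
import subprocess


def start_handler(command):
    """Run the command directly, with no `sh -c` in between."""
    argv = shlex.split(command)
    return subprocess.Popen(argv)


def stop_handler(proc, timeout=5.0):
    """Send SIGTERM, escalate to SIGKILL if the child does not exit in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode
```

The trade-off is that shell features such as pipes, redirection, and variable expansion are no longer available in the command string.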
We experienced corruption today using consul-replicate due to this issue (multiple consul-replicates running due to leaks over time). We run consul-replicate as an Upstart job under Ubuntu with consul lock. I suspect that this is a common pattern.
Not explicitly setting SHELL in the Upstart job leads to this leak. The behavior is both obscure and dangerous. I think it warrants a special callout in the docs or something. I think that this is the real solution though: https://github.com/hashicorp/consul/issues/1692
I'll send a PR to update the docs if it will be accepted
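The comments above imply that consul lock takes its shell from the SHELL environment variable and falls back to /bin/sh when it is unset. Assuming that is the case, the lookup would behave roughly like the Python sketch below. The names pick_shell and run_in_shell are hypothetical and this is not consul's code. Setting SHELL to bash helps because bash usually replaces itself with a single simple command passed to -c instead of forking, so the pid that gets killed is then the real child.

```python
import os
import subprocess


def pick_shell(env=None):
    """Return $SHELL if it is set and non-empty, otherwise /bin/sh (assumed behavior)."""
    env = os.environ if env is None else env
    return env.get("SHELL") or "/bin/sh"


def run_in_shell(command, env=None):
    """Launch the command through whichever shell pick_shell selects."""
    return subprocess.Popen([pick_shell(env), "-c", command])
```

Under this assumption, an Upstart job that exports SHELL=/bin/bash would get the bash behavior without changing /bin/sh for the whole system.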
@evan2645 sorry about that - I'd definitely take a PR to update the docs and push that out while we work on the fix.
@slackpad no apology needed :) opened https://github.com/hashicorp/consul/issues/2090