Consul: Locks not being released when member disconnected from cluster

Created on 17 Mar 2016 · 9 comments · Source: hashicorp/consul

Hi - I'm running a three-node cluster (v0.6.3) and trying to acquire a single lock.

One of our failure test cases: the cluster is in a good state with a leader elected, one member holding the lock (member A), and the other two members continually checking whether the lock is available (members B & C). The network interface that Consul uses on member A is then taken down so that member A can no longer talk to B & C. This triggers a leader election; B & C form a new quorum, elect a leader, and after lock-delay is reached one of them takes the lock. Member A goes into a state of having no cluster leader, as expected, but its child process takes an indeterminate amount of time to be terminated; the shortest seems to be 30 seconds, but there have been times when 5 minutes have gone by and it's still running. I tried comparing starting member A as the leader vs. a follower and did not see a difference. The end result is two child processes running on two different hosts, which collide and wreak havoc on the overall system.

Is this expected behavior? And any suggestions on how to get around it?

theme/cli type/bug

Most helpful comment

Hi, add the "serfHealth" check to your session in addition to your custom ones, and the session will expire if the node dies, as determined by the Consul cluster. That's a built-in check provided by Consul on every node.


All 9 comments

Hi @bfloyd89 are you using the consul lock command, the Consul Go API client, or some custom code to do the locking?

Hi @slackpad, I'm using the consul lock command with no flags.
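
For reference, the invocation looks roughly like this (the KV prefix and child command below are placeholders, not the actual ones):

```sh
# Hold the lock under the given KV prefix while running the child command;
# the child is expected to be terminated when the lock is lost or released.
consul lock my-service/lock /usr/local/bin/run-my-service
```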

Hmm - this is definitely not the expected behavior. My guess is that the lock monitor isn't seeing anything because it has a long-polling query over a TCP connection that's not getting actively notified when the network interface goes down. We might need a more active monitor with a timeout to catch cases like this.

Are you running consul lock on the Consul server nodes, or on other nodes just running Consul agent? Since you mentioned a three node cluster it sounds like you might be running on the servers, which might behave differently I/O-wise.

Yeah, we are running all three as server nodes

Looks like we need the session goroutine to feed back when it's unable to renew the session, so that the lock gets given up.

I'm working on a project that uses Consul for leader election. App nodes watch the leader key for changes, and race to acquire the lock on it when it disappears. On startup each app node registers itself with Consul and creates a session. The app periodically extends the session by telling the agent it passed a health check.

If the Consul agent used by the leader app fails, the health check remains in the "passing" state. This is apparently expected, according to #1790. In this case we're also seeing the dead app's session doesn't expire, and the lock on the leader election key is not released :-(

Is this a bug? Any ideas for a workaround, short of having each app node monitor the service health and forcefully delete the leader election key when the leader disappears from the list of healthy nodes?

Encountered this in integration testing, where I'm trying to run a 3-node Consul cluster in Docker, with each of three app nodes talking to a different Consul container. Issue is triggered by stopping the Docker container running the agent used by the app holding the lock.

The app nodes talk to Consul using the HTTP API.
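
Roughly, the pattern looks like this over the HTTP API (key names, check IDs, and values below are placeholders, not the actual configuration):

```sh
# 1. Create a session tied to the app's own health check. Note that
#    specifying "Checks" replaces the default list, which normally
#    contains only "serfHealth".
curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
  -d '{"Name": "my-app-leader", "Checks": ["service:my-app"], "LockDelay": "15s"}'
# => {"ID": "<session-id>"}

# 2. Race to acquire the leader key with that session (returns true/false).
curl -s -X PUT "http://127.0.0.1:8500/v1/kv/my-app/leader?acquire=<session-id>" \
  -d 'candidate-node-name'

# 3. Watch the leader key with a blocking query, re-issuing it with the
#    latest X-Consul-Index to be notified when the key changes.
curl -s "http://127.0.0.1:8500/v1/kv/my-app/leader?index=<last-index>&wait=5m"
```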

UPDATE: Adding the "serfHealth" check to the session fixed this problem. Thank you, @slackpad !
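
Concretely, that means creating the session with "serfHealth" kept in the check list (again, the other names are placeholders):

```sh
# The session is invalidated when the Consul cluster marks the node as
# failed, because "serfHealth" is included alongside the app's own check.
curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
  -d '{"Name": "my-app-leader", "Checks": ["serfHealth", "service:my-app"], "LockDelay": "15s"}'
```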

Oops, how did I unassign @slackpad ? That was certainly unintentional.

Hi, add the "serfHealth" check to your session in addition to your custom ones, and the session will expire if the node dies, as determined by the Consul cluster. That's a built-in check provided by Consul on every node.


Based on this comment, I will close this issue.

UPDATE: Adding the "serfHealth" check to the session fixed this problem. Thank you, @slackpad !

