Consul: Reloading Consul re-runs all watch commands every time.

Created on 3 Jan 2015 · 54 comments · Source: hashicorp/consul

I'm reloading Consul to load in new config files as the containers are built - but it appears as if it's re-running each defined watch command every time Consul is reloaded.

Note all the newly created docker containers that are being stopped and started (from the watch command):

http://shared.froese.org/2014/Screen_Shot_2015-01-02_at_4.34.32_PM.png

Example watch command is here:

https://gist.github.com/darron/481604459ccfde4d401a

Is this expected behavior? I had thought that if the config had changed, it should obviously reload and probably re-run, but this is surprising to me.

I have a few watch commands:

https://gist.github.com/darron/38af49ad1352a913d360

Looking through the api, I don't see another way to register a watch command.

In the end, I was trying to launch approximately 50 containers; it was still going 4 hours later, launching and re-launching 2300+ containers and counting:

http://shared.froese.org/2014/v87sd-17-50.jpg

Am I "doing it wrong"?

Labels: theme/operator-usability, type/enhancement

Most helpful comment

Any update on by when this will be fixed?

All 54 comments

A watch command seems to be run:

  1. When a config file is loaded.
  2. When Consul is HUP'ed and it rechecks them all.

I don't think those commands should be run unless the watch is actually "triggered" - but that's what appears to be the behavior.

Just found issue #342, which describes the fire-on-create behavior.

The fire-on-create one is a weird semantic thing. To me it seemed that most use cases would need the initial data plus any deltas (e.g. an initial HAProxy config, plus updates), so we always fire on the first run. I guess there are cases where you may not care. I want to make that a flag on the watchers.

The re-fire is just caused by us doing the dumbest reload possible. e.g. drop everything and rebuild everything, instead of complex change detection logic. (Was the watcher added/removed/modified)

Assuming that the executable called by the watch is capable of making an idempotent change seems reasonable (or at the very worst, an identically generated config file + 1x SIGHUP should not be problematic in the common case). FWIW, we're designing the tools called by consul-template around the assumption that watches will fire haphazardly for many different and unknown reasons.

All of the commands I have been using are idempotent; the problem is that as you add watches, they get run and re-run and re-run, over and over and over.

If I have 50 containers with 2 watches per container (service and KV) then if ANY container is added or removed, all 50 stop and start each time there's a change to any Consul config file.

I have looked at the shell environment and the JSON that's passed to the handler - and there's no obvious difference that I can see - so there's no real way to know it's different outside of Consul.

I was attempting to start an AWS box that very comfortably runs 50+ small web containers - I was never able to finish and killed the box after 2300 container restarts. It couldn't really ever catch up after one container had been loaded and the next one finished.

@darron oh are you saying that the watchers compound over time? e.g. a new set of 50 watchers is added on each reload?

Here's how it is working with the current state of Consul:

  1. There's already 49 containers with a kv and service watch each.
  2. Add a new site to Docker.
  3. Add a KV watch. (Hup Consul - as a result, all 98 watches fire again - which start to reload 49 containers and rebuild 49 nginx config files).
  4. Add a service watch. (Hup Consul - as a result, all 99 watches fire again - which start to reload 50 containers and rebuild 49 nginx config files).

So - approximately 200 watch events fire because Consul is HUPed - not because they change or anything.

I could cut the reloads by 50% by only reloading Consul once - but that didn't really help much - containers that didn't need to be restarted were still being restarted over and over.

I ended up disabling the KV container restart watch - the first one described here:

http://blog.froese.org/2014/12/30/how-octohost-uses-consul-watches/

It's just not lightweight enough to justify including the way Consul fires them each time.

Hey guys. I also ran into this issue, and there are some potentially nasty behaviors if any of the handlers invoke consul reload, either directly or indirectly. Essentially the watch will continue to re-fire until all file descriptors are exhausted:

2015/05/27 18:55:16 [ERR] agent: Failed to invoke watch handler 'consul reload': fork/exec /bin/sh: too many open files

here is the given watch config:

{ "watches": [ { "handler": "consul reload", "type": "event", "name": "break-consul" } ] }

This was with consul 0.5.0

I think a possible solution is to use the LTime field in the events and store that in a file each time your handler runs. That way, if you receive the same event with an LTime which is less-than-or-equal to the one you have stored in the file you can ignore it.

Can anyone confirm that LTime is reliable for this?

@jsok event watches are a bit of a snowflake, as they are powered by the gossip layer. LTime should be safe to use in the way you describe for event watches since it is a monotonic value, but it will not apply to other watch types and will not be part of the payload returned for those watches.
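The LTime scheme suggested above can be sketched as a small handler-side filter. This is an illustrative sketch, not part of Consul itself: the state-file path and the `handle` function are hypothetical names, and the real work a handler does is elided.

```python
import os
import tempfile

# Hypothetical state file; a real handler would pick a stable, per-watch path.
STATE_FILE = os.path.join(tempfile.gettempdir(), "testevent.ltime")

def load_high_water():
    # Highest LTime this handler has already processed; 0 if it has never run.
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0

def handle(events):
    # Keep only events strictly newer than the stored LTime, then persist
    # the new high-water mark so a re-fired backlog becomes a no-op.
    high = load_high_water()
    fresh = [e for e in events if e.get("LTime", 0) > high]
    for event in fresh:
        pass  # real work (restart a service, rebuild a config, ...) goes here
    if fresh:
        with open(STATE_FILE, "w") as f:
            f.write(str(max(e["LTime"] for e in fresh)))
    return fresh
```

In practice the handler would parse the JSON event list that Consul pipes to it on stdin and pass it to `handle`. As noted above, this only works for event watches, where LTime is monotonic.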

Ok, so I find this all very weird...

Triggering an event that is being watched gives me the following output from the watch (a single event):

[{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

But reloading using consul reload also triggers this event (as this issue confirms), however, the output from the watch is a list of all events past:

[{"ID":"95a7f085-ac1d-d916-6ba4-749e8d102a5e","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":5},{"ID":"10681ffe-b7b6-a0a0-d3a3-fa802d997258","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":7},{"ID":"5b373801-54f2-2bfa-bcb0-a4bcd3aacacf","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":8},{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

Why is that? It makes very little sense to me.

@ryanuber what is LTime?

LTime is Serf's implementation of a Lamport clock. The docs go into more depth: https://www.serfdom.io/docs/internals/gossip.html

I believe the reason you get all the events on reload is that the other nodes do not know which gossiped event your local agent received last. So, to synchronize your local agent, the other nodes re-send all previous events.

ok, but right now I have only one node for testing.

Can I work around this by checking and remembering the CONSUL_INDEX env var in my handler (if I have a watch of type key or keyprefix)?
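For key/keyprefix watches, the workaround asked about here would boil down to remembering the last-seen index and skipping re-fires that don't advance it. A minimal sketch, with a hypothetical state-file path and helper name; whether CONSUL_INDEX actually stays the same across a reload-triggered re-fire is exactly the open question in this thread:

```python
import os
import tempfile

# Hypothetical state file; one per watch in practice.
STATE_FILE = os.path.join(tempfile.gettempdir(), "watch.consul_index")

def should_run(index: int) -> bool:
    # Run the real handler only when the index has advanced past
    # the last one we handled; otherwise treat it as a re-fire.
    try:
        with open(STATE_FILE) as f:
            last = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        last = 0
    if index <= last:
        return False
    with open(STATE_FILE, "w") as f:
        f.write(str(index))
    return True
```

A handler script would call something like `should_run(int(os.environ.get("CONSUL_INDEX", "0")))` before doing any work.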

@armon Do you still plan to add the flag to the watches (to turn off fire-on-create)? If so, any plans when this is going to be available?

@BjRo We are tracking it here, but no concrete plans. Lots of more pressing things.

I think I ran into this issue as well. This is stopping us from adopting Consul since it would fire a deploy if we have to restart Consul.

Any updates on this?

This is critical for us as part of our one to many plans. I can't have my events all firing anytime there is an edit or reload. I'd consider this a pressing issue.

I am also planning to use Consul watch event triggers, whether for a deployment or for an HTTP call fired by a monitoring system to fix a check (e.g. restart ntpd if the NTP peer check fails), along with a bunch of other event handlers. This is still a blocker for me; I am not happy with the workaround, as I had to put all the commands in a script handler instead. As a workaround, I am verifying the payload and restarting the service only if the payload matches.

https://gist.github.com/vkhatri/1c3d9b287338ed0288c0

Would be great to have a configuration parameter to prevent event trigger on service start/reload.

Because I was tired of launching processes with scripts that kept duplicating functionality, I built a small Go based tool to help with this limitation:

https://github.com/darron/sifter

It handles event and key watches - and doesn't allow the watch to fire if:

  1. Event: It's already seen that LTime value.
  2. Key: the hash of the payload is the same as before.

We've been running it in production to protect event watches for a few weeks now - key watches are less tested but have worked in my local testing.
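The key-watch half of that dedup idea (suppress the handler when the payload hasn't changed) can be sketched in a few lines. This is an illustrative sketch of the technique described above, not sifter's actual code; the state-file path and function name are made up:

```python
import hashlib
import os
import tempfile

# Hypothetical state file holding the hash of the last-seen payload.
STATE_FILE = os.path.join(tempfile.gettempdir(), "watch.payload_sha256")

def payload_changed(payload: bytes) -> bool:
    # Compare a hash of the watch payload against the stored hash;
    # report a change (and update the store) only when it differs.
    digest = hashlib.sha256(payload).hexdigest()
    try:
        with open(STATE_FILE) as f:
            if f.read().strip() == digest:
                return False  # identical payload: suppress the re-fire
    except FileNotFoundError:
        pass
    with open(STATE_FILE, "w") as f:
        f.write(digest)
    return True
```

A wrapper would read the payload from stdin and only exec the real command when `payload_changed` returns True.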

+1
We too need a way to disable "fire-on-create".

We have the same issue with event watches. We want to use events to manually initialize containers or to schedule tasks (like deploys). But the watches get executed at startup, which is not our desired behaviour. This is also a serious blocker for us in adopting Consul. We are thinking about using consul exec instead, but that has serious security issues, allowing arbitrary commands to be executed. +1

Any ETA on when a fix or enhancement would be scheduled for?

Also wondering if there is any update on this, running into this exact same issue.

I also am running into this problem and would appreciate a solution.

👍

I know this isn't a fix but we've been using Sifter and it's been great.

@ssenaria can you expand on how you use Sifter?

https://github.com/darron/sifter looks pretty straightforward, it just wraps your commands and does some pre-checking.

Has this been looked at recently? This is still a huge pain, and I would rather not have to keep a history of watches that have already been run.

This is sorely needed. We are saving the ModifyIndex inside Consul itself and checking against it; only when we see a change do we actually run the handler.

👍 to getting this looked at...

Consul 0.7.2 and still waiting...

I'm currently learning Consul, and thinking about this question, I believe it is the event handler's job to remember which events have actually been processed. I actually find it rather safe that all recent events are fired again: it lets you be sure the handler doesn't miss an event.

Any update on by when this will be fixed?

Same issue here. I will try sifter, but it does not seem to be well tested with key/keyprefix watches.

Any updates?

Cheers!

A way to remove the event from consul might be useful.

Got stuck on the same issue; there seems to be no update so far.

Want to use "event watch" to start/stop services, but get stuck because of the defect (fire on create). When can this issue be fixed? Thanks.

I am still confused how watches are even all that useful when this behavior is present. If you are requesting a very specific event but a watch fires also during other times (like a reload), what good is the watch?

Any update on fixing this? It has been a huge pain in manually keeping track of old watch event.

I'm trying to replicate this issue and work on a fix.
I created a watch on an event and fired it multiple times. Everything worked as expected.
I tried to do a consul reload but this didn't trigger the event, contrary to the behavior described in a previous comment.
Has this problem been fixed? Is there a PR I can look into?

I've been working on this issue for a while and I think I have a solution. I want to ask some clarifying questions before I send the PR.

  1. I identified that it's the watch plan's handler function being called whenever a new watch is created that produces the fire-on-create behavior. Am I correct here? To fix this, I've simply skipped the handler function the very first time it would be called.
  2. Would you like to give users the power to enable or disable this fire-on-create behavior? If so, I thought of having a global parameter to set/unset it.

@Sudhar287 I believe the consul reload problem still exists. I was able to reproduce this via:

$ consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }'

and then running consul reload in another shell, causing test to be printed out a second time:

$ cat /tmp/consulwatch
test
test

The solution of adding a new parameter onto the watch plan is sensible, but I think there are some subtle edge cases someone might run into with that behavior that we should think through.

The obvious edge case I can see is the following scenario:

  1. I have a watch on the K/V path /foo
  2. I initiate a reload of consul
  3. The consul agent cancels the current watch of /foo
  4. An update to the /foo key is committed to the Raft log
  5. The consul agent starts a new watch of /foo, not firing on create

In this scenario the watch misses an update, which is undesirable.

For event watches specifically, this might be fine because consul events are already a best-effort mechanism and not designed for guaranteed delivery, but for other types of watches this behavior could be a problem. Perhaps a solution can be designed such that a fire-on-create option only applies to event watches.

Yes @crhino, you are absolutely right in everything you wrote! Thank you for the detailed comment.
I had thought about the edge case too, but I figured it would be okay to skip invoking the handler for events that occurred in the past. Now I know that it's undesirable...

Please let me know the final proposal and I'll work on it. :)

@Sudhar287 I roughly reviewed the code you proposed, and the idea of the operator being able to state explicitly in the watch config whether it should immediately fire on create saves a lot of the complex state tracking that would otherwise be needed. However, I have a question: will a watch with "fireOnCreate": "no" set be executed on initial agent startup? If it will be executed on startup but not on reload, I suggest re-labeling it "fireOnReload": "no".

@crhino Great comment on this one and the edge cases. I think the current fire-any-time behaviour should remain, and should stay the default in future. However, I consider the relatively easy and clean implementation that @Sudhar287 proposed to give operators fine-grained control over watches. Experienced operators might want to fine-tune their watches and decide which should stick to the default behaviour and run any time, and which ones they can "sacrifice" and tweak with "fireOnReload": "no" while fully understanding the downsides of that decision. This is somewhat similar to the consul kv get and consul kv get -stale options. I am curious to hear what you think about that?
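If such a knob were added, a watch definition might look something like the following. This is purely hypothetical: the fireOnReload key is taken from the naming discussed in this thread and is not an actual Consul option.

```json
{
  "watches": [
    {
      "type": "event",
      "name": "deploy",
      "handler": "/usr/local/bin/deploy-handler",
      "fireOnReload": "no"
    }
  ]
}
```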

Thank you for your insights @vaLski. I have some findings/suggestions to present too:

  1. i. Starting the consul agent triggers the watch reload function, and
    ii. the watch reload function eventually fires the watches. Hence, I think a fire-on-reload option is essentially a fire-on-create option.

  2. The consul reload command reloads all the watches. This has to be skipped if we don't want the fire-on-reload/fire-on-create behavior.

  3. The function call stacks differ between watching with consul watch and a watch configured on the agent. E.g. consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }' and consul watch -type=key -key=/test echo test >> /tmp/consulwatch are more different than I thought they would be. If we want to modify the behaviour of the application, we have to make it consistent for both commands.

  4. consul reload doesn't seem to have any effect when watching with the consul watch command. Which is the reason for my comment earlier.

Please let me know if I've made a mistake, I'm a beginner wrt consul. Please share your opinion and thoughts too. :)

  1. The function call stacks differ between watching with consul watch and a watch configured on the agent. E.g. consul agent -dev -hcl 'watches {type = "key", key = "/test", handler_type = "script", args = ["sh", "-c", "echo test >> /tmp/consulwatch"] }' and consul watch -type=key -key=/test echo test >> /tmp/consulwatch are more different than I thought they would be. If we want to modify the behaviour of the application, we have to make it consistent for both commands.

The difference is essentially whether or not the watch is declared in the Consul agent's configuration. A consul reload only reloads the agent's configuration files, which means that any watches run via consul watch are not affected by a reload at all. The Watches docs page explains this as follows:

Watches can be configured as part of the agent's configuration, causing them to run once the agent is initialized. Reloading the agent configuration allows for adding or removing watches dynamically.

Alternatively, the watch command enables a watch to be started outside of the agent. This can be used by an operator to inspect data in Consul or to easily pipe data into processes without being tied to the agent lifecycle.

Hope that makes sense, let me know if there are any questions!

Thank you for the response @crhino!
FYI, I've updated the PR. Some tests are failing in CircleCI but they pass on my local machine. Any tips on how to proceed?

@Sudhar287 force-push several times until it works. Note that rebasing on HEAD might also help, as many tests are a bit more stable in the latest commits.

Thanks for the response @pierresouchay. Your suggestions were helpful and I've passed CI. Can one of the official maintainers please have a look at PR #7616?

What's the status of reviewing and merging this one? It's blocking some of our work that will implement new consul watches.
