Consul: Service registration idempotence regarding HealthChecks

Created on 6 Nov 2018  路  17Comments  路  Source: hashicorp/consul

Issue with Service registration and checks

When registering several times a service on a local Agent using /agent/service/register, if checks are removed, those checks are not removed from the service.

So, if at t0, we register service redis with checks:[ check0, check1] and then, at t1 we register checks:[check2], the service will have the following services:

redis: [check0, check1, check2]

Ideally, in a RESTfull way, it should be redis: [check2]

It means the PUT registration is not idempotent. This is a bit problematic sometimes.

However, it probably means that some people are relying on this behavior, so changing the behaviour might break clients.

We thus propose to add an optional parameter for /agent/service/register?idempotent=true to enable this idempotent behavior.

@Aestek will soon provide a patch.

Issue with checks registration

Another bug can be found in /agent/check/register: when registering several times (in a idempotent way) a check, the state will override the state of check. Consider this:

Register check1 on Service2

{
  "Name": "check1",
  "ServiceID": "service2",
  "HTTP": "localhost:5050",
  "Interval": "30s",
}

At first run, the check will be initialized => Ok
At second registration time, the check (which was passing) will be reset to Critical (default value from registration) and will flap.

@Aestek will also propose another patch to address this: if a check already exists and is not modified, do not change the current status of the check.

Fixing those 2 issues will ease deployment of services using idempotent systems such as chef.

@pearkes Would you consider this for inclusion in upstream?

Kind Regards

needs-discussion

All 17 comments

Seems reasonable, I've labelled it so we can discuss with team at next available opportunity.

I am not the smartest so I am an advocate for simplicity, when I read ?idempotent=true my mind is already saying 'what?' I believe one scenario is a bug the other the correct way to handle the checks. Please don't fragment behavior like this, just discus which one is idiomatic and if it breaks stuf version bump and move on, thx

@gertcuykens agree, but I do know that some people rely on this behavior (by inserting some checks outside service definition and applying service definition on each chef run). So not doing that way will break some users.

We could however recommend using this query parameter in the documentation

God built Earth in 7 days between there were no legacy ;)

Can we settle on a version bump and the reverse ?idempotent=false instead. Including a deprecated warning for does who rely on this non idempotent bug :P

and if it breaks stuf version bump and move on, thx

Sadly version bumps on our API are way to drastic a thing to get around minor breaking changes like this. A version bump that changes the URL basically breaks every Consul client in the world and causes massive upgrade pain.

We appreciate how ugly it is trying to maintain "legacy" behaviour - there is tons of it in Consul as something that's been around a while, but any change that breaks existing clients effectively prevents many of our largest users from being able to upgrade at all and leaves them stranded on older versions so we have to be extremely pragmatic with our approach.

If you mean version bump Consul that's even worse - even if we make breaking changes we still have to give loads of warning of the change and ensure that the upgrade is smooth so old clients can still use new servers etc without any semantics being confused etc. How would one roll out an upgrade of all clients if they have a breaking API change? You can't upgrade the clients first since then they will do the wrong thing against old API and you can't upgrade the consul clients first for same reason etc.

That said, we all agreed that this was unpleasant and that just "fixing" it might be an option - expecting Check append behaviour is pretty odd and not something we ever documented.

BUT there is a major use-case that is relied upon all the time by existing tooling that would be broken if we change the default behaviour here. In many cases people register services with all their checks etc either via Config or API, and then use this endpoint from some totally separate automation to do things like update the service tags or metadata. When they do that, they rely on the fact that when they PUT a change to the service with it's new tags, they don't need to perfectly redefine all it's Checks too. They just leave those blank and it doesn't update them. If we change the default then all such integrations would remove all their checks on next tag update!

We _could_ special case empty checks but then it still isn't really idempotent for Chef as there would be no way to remove all checks.

I do agree @gertcuykens that ?idempotent=true is not a great API choice, but I also agree with Pierre that we do need to make this opt-in somehow and it's not easy to come up with a better name than that! Versioning is hard!

Then I suggest we leave current api v1 behavior, document that part better and use github v2 tags to mark all legacy code for when it's time to clean things up on a api level so we can revisit this later.

I strongly believe we should follow docker and kubernetes version / config filosophie here as in /v2/agent/service/register It's the only acceptable way to change it.

As far as i can tell you will not find any ?... workaround in docker and kubernetes api right?

https://docs.docker.com/registry/spec/api/
https://kubernetes.io/docs/concepts/overview/kubernetes-api/

Also who said a tool can't mix v1 and v2 api? Kubernetes does it all the time. Maybe we can introduce a api level in our consul config files also where the default is v1 or do we have that already?

@gertcuykens thanks for the input, I agree those are potential options.

In general HashiCorp tools don't always make the same choices as others like Docker and Kubernetes when it comes to versioning. Partly because there is more than one way to organise, partly because some of them pre-dated other popular projects.

It's a hard problem in general because it is much more than a _technical_ decision -- it's super important that we evolve our software in ways that our current users _expect_ and that cause a minimum of disruption and extra work for them to do. Having HashiCorp tools be somewhat consistent with themselves historically in how they handle backwards incompatibility and consistent with each other is probably more important that doing the same thing as Kubernetes for example.

We could add a /v2/ but every time we "bump" version like that it requires a massive amount of work to rewrite all clients that are using it. For a minor "bug" like this that isn't affecting anyone much for the last 4 years it seems, it's hard to justify asking people to change stuff. Sure both can coexist a while and would have to but that increases maintenance burden. Imagine ending up in a situation where we have v1,2,3,4 because each one only changed one minor bug with backwards compatibility ramifications in one endpoint. That would be a nightmare to maintain and document as well as horrible for clients to keep up with.

In general URL versioning is a pretty heated topic (Google API versioning if you're interested in some viewpoints!) but most people agree it's pretty horrible to change the URL of every resource in your API to introduce a single breaking change somewhere.

That said, adding an query param might not be the best way either - there is a spectrum of possible solutions between that and making a whole new /v2/ and changing every single API endpoint and client that exists!

We'll consider the simplest way that will work least badly for users 馃槃.

We just had a chance to discuss this and aren't decided but I wanted to take some notes:

  • If we were to take the query param idea, replace_checks=1 would feel the best / most explicit, but we'd need to validate the assumption that tags/service meta wouldn't have similar problem, if you want to look into that it'd be welcome.
  • We talked about a new endpoint that didn't duplicate much other than modifying this behavior but aren't sold on that idea.
  • The 2nd PR here (#4904) we want to prioritize getting in and have triaged it as such. The query param PR needs more thought as mentioned above.

Thanks!

@pearkes thx for discussing this :)

Tags and meta are replaced when re-registering a service, only checks are left behind.

As for the param, another option would be to introduce a new config file param like replace_checks_on_service_register that would default to false. This new param could be removed in an upcoming non-backward compatible release. This would keep this kind of "bugfix" param centralized. What do you think ?

@Aestek I am not a fan of the agent config, or in that case, it should be possible to put: replace_checks=0 if default behavior is set to 1.

We currently have some agents (I mean, now, only a single case known to me) were one our beloved users did mix behaviors (service created by chef, additional healthcheck added later by a script)

I have a question.
Say the check is changed from "timeout 5s docker ps" to "timeout 10s docker ps" and then re-registered.
There is a slight change in the content of the health check here. Will it fail momentarily ?

If so, why not "fail" only if the re-registered health check actually fails ? ( retain the state of the old content until the result of the new content is available )

@glakshmeepathi no it wont fail momentarily, the state (passing, warning...) will stay the same until the check is re-ran with the new parameters.

Having read the discussion it seems to me that we want to support idempotent service registrations. I think a parameter is the best way to do it for now, since we don't want to start api v2 and we hesitate to create a new endpoint just for that. Which doesn't mean that we won't do that in the future. I can see how v2 of this endpoint will default to the idempotent behaviour.

I think the parameter name idempotent leaves lots of room for interpretation though, what do you think about replace_existing_checks which defaults to false?

Other than that I think https://github.com/hashicorp/consul/pull/4905 is good to go.

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation please reply with a comment here which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1:

Stale bot closed this but AFAICT it was actually fixed by #4905 and release in Consul 1.6.1.

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least _30_ days.

If you are still experiencing problems, or still have questions, feel free to open a new one :+1:.

Was this page helpful?
0 / 5 - 0 ratings