Consul: Support server selection priorities / weights

Created on 18 Jul 2014 · 19 Comments · Source: hashicorp/consul

If we support priorities, then we can support cases like a server in a remote region (for backup / quorum purposes). The priorities would keep clients from routing through such servers unless necessary (higher-priority servers have failed or are unreachable).

theme/service-metadata type/enhancement

armon

👍15

Most helpful comment

@sean-
Would love to look everything up as a query but it doesn't quite have the functionality at the moment to meet our use cases. For example:
We have a 3-node service with a robust health check. When a node becomes degraded, our health check puts it into the warning state (we could also do this with health checks based on load). We want to prefer nodes that are green, but if none are available we want to return the yellow (warning) nodes. We don't want to fail over to another datacenter; we still want to return the local nodes.
Something like what is suggested in https://groups.google.com/forum/#!topic/consul-tool/Rm4P7dSTsY0 would work for that use case.
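
A minimal sketch of the "green first, then yellow" selection described above, done on the client side. The entry shape (dicts with "node" and "status" keys) and the function name are assumptions for illustration, not a Consul API; only the status names passing/warning/critical come from Consul.

```python
def pick_nodes(entries):
    """Return passing nodes if any exist, otherwise warning nodes.

    `entries` is a list of dicts with the hypothetical keys "node" and
    "status", where status is "passing", "warning", or "critical".
    """
    passing = [e["node"] for e in entries if e["status"] == "passing"]
    if passing:
        return passing
    # No green node left: fall back to degraded (warning) nodes in the
    # same datacenter instead of failing over elsewhere.
    return [e["node"] for e in entries if e["status"] == "warning"]
```

Critical nodes are never returned, so an empty list means the caller has to decide what to do next.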

There is also the use case mentioned above about introducing a new server in a canary-style fashion, i.e. sending x% of random requests to the new server (or, in the case of load, reducing the number of clients sent to a node while still keeping it in service). Consul wouldn't need to track connections to a node in this case, just offer a priority field that can be set for a service on a node; the randomized order from a standard service lookup could then be weighted by that priority.
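
To make the "randomized order weighted by priority" idea concrete, here is a rough sketch. It is not how Consul orders results; it uses the Efraimidis-Spirakis keying trick as one possible stateless approach, and the `weights` mapping and function name are made up for the example.

```python
import random

def weighted_order(weights, rng=random):
    """Return node names in random order, biased toward higher weights.

    `weights` maps a node name to a positive number (hypothetical
    per-node priority field). Nodes with weight <= 0 are left out.
    """
    # Efraimidis-Spirakis: draw u in [0, 1) per node and sort by
    # u ** (1 / w); a node comes first with probability w / sum(w).
    keyed = [(rng.random() ** (1.0 / w), name)
             for name, w in weights.items() if w > 0]
    keyed.sort(reverse=True)
    return [name for _, name in keyed]
```

With weights such as {"old": 99, "canary": 1}, a client that always takes the first entry would hit the canary on roughly 1% of lookups, and no per-lookup state is needed on the server.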

dbason on 6 Sep 2017

👍4

All 19 comments

It'd also be useful for gradually scaling in new versions of a service.

morgante on 28 Aug 2014

I could see a use-case for weights as well. We have some different hardware profiles, and I would love to be able to weight nodes accordingly since we know performance for a particular piece of software is a certain percentage worse on certain hardware.

carlivar on 28 Aug 2014

This could also allow failover for services where round robin isn't preferable or possible, e.g. failing over a load balancer for a service that has stickiness.

kcd83 on 7 Sep 2014

#488 is basically this as well, but just inside a service rather than cross-datacenter. +1 for this.

highlyunavailable on 25 Mar 2015

Would these priorities be static or dynamic?

pepov on 18 Apr 2015

:+1: Being able to send <1% of traffic to a standby as a way of exercising DR paths would be great. It would also be a crude mechanism for handling different capabilities between hardware.

sean- on 8 Jun 2015

Can this be done with the new "network coordinates" or the "prepared queries"? I was hoping that we could add something like this to accomplish https://github.com/hashicorp/consul/issues/488

camerondavison on 13 Jan 2016

@a86c6f7964 prepared queries can help across datacenters for sure (using pre-configured fallbacks or network coordinates, or both). Within a datacenter, many HTTP endpoints now support the ?near= argument that lets you find the closest service, but this issue still stands for a more general weighting feature.
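
For anyone who wants to try the ?near= argument, here is a small sketch that builds such a request URL for the health endpoint. The helper name and the default base address (a local agent on port 8500) are assumptions for the example; `near=_agent` asks for results sorted by distance from the agent being queried.

```python
from urllib.parse import quote, urlencode

def health_url(service, near="_agent", base="http://127.0.0.1:8500"):
    """Build a /v1/health/service URL sorted by network distance.

    `base` assumes a local agent on the default HTTP port; adjust it
    for your own setup.
    """
    query = urlencode({"near": near})
    return f"{base}/v1/health/service/{quote(service)}?{query}"
```

For example, health_url("web") yields http://127.0.0.1:8500/v1/health/service/web?near=_agent, which can then be fetched with any HTTP client.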

slackpad on 13 Jan 2016

I guess I meant that the code that was added recently would help make it easy to add this feature. Sorry for the miscommunication.

camerondavison on 13 Jan 2016

While not exactly server priorities, many on this issue have referenced automatic DC failover as a reason for wanting server priorities. As @slackpad mentioned earlier, with Prepared Queries this is possible, and has been made easier with Prepared Query Templates. We held a webinar and covered this at around minute 31.

https://www.youtube.com/watch?v=FGbzS6ripXA&feature=youtu.be&t=1690

sean- on 27 Mar 2016

:+1:
I arrived here looking for the solution to #1229. Prepared Queries are really cool and would solve both #1229 and #488 if we had one special tag for "local".

With tag local we could define remote as a prepared query template:

{
  "Name": "remote-",
  "Template": {
    "Type": "name_prefix_match"
  },
  "Service": {
    "Service": "${name.suffix}",
    "Tags": ["!local"]
  }
}

This is actually already possible if we merge the local implementation in PR #1231.

Then if we add failover-to-query, we can do priorities:

{
  "Name": "local-first-",
  "Template": {
    "Type": "name_prefix_match"
  },
  "Service": {
    "Service": "${name.suffix}",
    "Tags": ["local"],
    "Failover": {
      "Query": "remote-${name.suffix}"
    }
  }
}

With failover-to-query it could even finally failover to another DC if local and remote aren't available.
(Although with anything recursive someone is bound to shoot themselves in the foot with it, we may want to guard against that.)
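
As a sketch of what such a guard could look like (failover-to-query is only a proposal here, so everything below is hypothetical): follow the failover chain until a query returns nodes, and refuse to visit the same query twice. The `queries` mapping and the `lookup` callback are stand-ins for illustration, not Consul data structures.

```python
def resolve(name, queries, lookup, seen=None):
    """Follow a chain of failover queries until one returns nodes.

    `queries` maps a query name to {"tags": [...], "failover": name or None};
    `lookup(tags)` returns the matching nodes. Both are hypothetical.
    """
    seen = set() if seen is None else seen
    if name in seen:
        # Guard against a query that (directly or indirectly) fails
        # over to itself.
        raise ValueError(f"failover cycle detected at {name!r}")
    seen.add(name)
    query = queries[name]
    nodes = lookup(query["tags"])
    if nodes:
        return nodes
    nxt = query.get("failover")
    return resolve(nxt, queries, lookup, seen) if nxt else []
```

A fixed maximum chain depth would be a simpler alternative to tracking visited names, at the cost of a less precise error.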

Weights are a bit tougher, because then Consul has to keep some kind of state about how many queries it has served for a certain lookup. I'm using ebay/Fabio (it does weighted routing with Consul tags and services) and there are a lot of other solutions out there for weighted routing.

However, I think adding a single "local" tag and fallback-to-query, together with existing tag and query functionality, would require the fewest code changes yet still allow flexibility in composing complex queries. Tag groups could fall back to any other tag groups, and nodes could be prioritized in any order.

EDIT: this fallback functionality is also specifically requested in #1159

doublerebel on 27 Mar 2016

Why not just look up everything as a query?

Sean Chittenden

sean- on 27 Mar 2016

Hey, this got me wondering, when an agent establishes a connection with a server, does it pick the nearest? I couldn't find that info in the docs.

cirocosta on 29 Aug 2017

@cirocosta agents send the query to a server, which may forward it to a leader depending on the consistency mode of the query.
"nearness" as per the coordinate subsystem is only used in the catalog/health endpoints if you specify a query param. See https://www.consul.io/docs/internals/coordinates.html for more info.

preetapan on 29 Aug 2017

👍1

Adding to what @preetapan said - agents pick a random server and use that for a while, and then periodically choose a new one at a frequency that's dependent on the size of the cluster. This gives users of stale queries the best chance of having their load spread across the cluster. Since many kinds of requests have to be forwarded to the leader internally by the servers, it doesn't give much of an advantage to choose the nearest one.

slackpad on 29 Aug 2017

👍1

Weighting would also be useful for load based routing to a service. Say we have a service that has 3 nodes but one of those nodes gets under medium load. We'd like to prefer the other 2, but still leave that node in the service in case the others go down.

dbason on 5 Sep 2017

We also need this feature. Are there any plans?

tmanninger on 6 Jun 2019

Thank you for reporting and helping with this issue/feature request!
This is something that we considered doing in 2014, but we didn't actually go down that route. This feature is supported by Consul these days: https://www.consul.io/docs/connect/l7-traffic-management.html which is why I am closing this issue.