I know this place is only for real issues, but I don't have access to Google's servers in my country (so I can't use IRC or the mailing list), and I have to leave my questions here. Thanks in advance for the reply.
Hi @hehailong5 please see below:
Regarding service registration: since the Consul client is the only contact point, I understand that more than one node should be configured to run as a Consul client to avoid a single point of failure.
In that case (having multiple Consul clients), how do I configure my Consul registration tool? It usually requires specifying only one endpoint. What if this endpoint is down? Will the registration request automatically be forwarded to a live Consul client in the cluster?
Normally you'll run a separate set of Consul _servers_ (usually 3 or 5) and then run the Consul client agent on every other machine in your infrastructure. Applications always talk to the local Consul client agent, which automatically forwards requests to and keeps track of the Consul servers. The Consul servers provide stable storage for the catalog and key/value store, and provide coordination for things like locks and semaphores. The Consul client agent on each machine handles registering that machine's local services and provides interfaces to Consul (HTTP or DNS) for applications on that machine. There's usually no need for redundancy at the Consul agent level on each machine, as the agent is part of the same failure domain as the machine itself.
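The "always talk to the local agent" pattern above can be sketched as follows. This is a minimal illustration, assuming the agent is listening on its default HTTP port 8500; the service name and port are made up for the example.

```python
import json
import urllib.request

# Applications talk only to the Consul client agent on the same machine;
# the agent itself forwards catalog updates to the Consul servers.
LOCAL_AGENT = "http://127.0.0.1:8500"  # agent's default HTTP port

def build_registration(name, port):
    """Build a payload for the agent's /v1/agent/service/register endpoint."""
    return {"Name": name, "Port": port}

def register(payload):
    """PUT the registration to the local agent (requires a running agent)."""
    req = urllib.request.Request(
        LOCAL_AGENT + "/v1/agent/service/register",
        data=json.dumps(payload).encode(),
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_registration("web", 8080)  # illustrative service
```

Because the endpoint is always `127.0.0.1`, the application never needs to know which Consul servers exist or which one is the leader.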
If I register an HTTP health check in Consul, is it the Consul server or the Consul client that does the actual checking?
The Consul client agent will run the health check locally and update the Consul servers with the results.
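A registration with such a check might look like the sketch below. The URL and interval are illustrative; the point is that the *local* agent executes the check on the schedule given and pushes results up to the servers, which never probe the service directly.

```python
# Service registration with an HTTP health check. The local client agent
# hits the HTTP URL every Interval and reports the result to the servers.
registration = {
    "Name": "web",
    "Port": 8080,
    "Check": {
        "HTTP": "http://127.0.0.1:8080/health",  # illustrative endpoint
        "Interval": "10s",
        "Timeout": "1s",
    },
}
```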
Is there a watchdog implemented in the Consul cluster? Say I initially have 3 Consul servers running in the cluster; if one of them goes down and then comes back online, will it automatically rejoin the cluster?
Consul has a built-in node health check called `serfHealth` that acts as a watchdog to make sure a node is alive and responding to network probes. It will automatically mark the node failed if it goes down, and will help the server rejoin the cluster if it comes back online.
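Rejoining is usually also helped along with `retry_join` in the agent configuration, so a restarted server keeps retrying the other servers until it gets back in. A sketch of such a configuration fragment (the addresses are placeholders, not from this thread):

```python
import json

# Agent configuration: on startup the agent retries joining these addresses
# until one succeeds, so a recovered node rejoins the cluster on its own.
config = {
    "retry_join": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # placeholder IPs
    "retry_interval": "30s",
}
print(json.dumps(config, indent=2))
```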
Hope that helps! I'll close this out but feel free to re-open if you need any more clarifications.
Hi,
Regarding #1: do you mean that if the Consul client crashes, or the network between them breaks down, the application cannot use Consul anymore until the Consul client comes back online? I thought there was a mechanism for Consul clients to prevent a single point of failure. In my case, the application connects to the Consul client remotely via HTTP on port 8500.
There's no mechanism on the client side to prevent a single point of failure, because the machine the client runs on is itself a single point of failure. Consul is highly available on the server side with >= 3 servers, so if one of them dies or has its network disrupted, another server will take up leadership and the cluster can continue to operate. It's not recommended to connect to the Consul client remotely, as you'd then need to manage a load balancer and a list of healthy clients to connect to. If you run the Consul agent on each machine, it manages all of this for you.
Thanks for the clarification! That makes it clear. Thank you!
Hi,
One more question about the watchdog in Consul:
if any node in the cluster fails, will serfHealth help bring it back online?
The serfHealth check is added to every node and reflects the cluster's low-level health status for that node (basically whether it responded to network probes). It is logically AND-ed with the service checks on that node as well, so if serfHealth fails, instances of the services on that node won't be returned over DNS, for example. Once the node recovers and starts responding to probes, the serfHealth check will become passing again on its own. Hope that helps!
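The AND-ing described above can be shown as a tiny sketch; the check IDs and statuses are illustrative, shaped like (simplified) entries from the health API.

```python
def instance_is_returned(checks):
    """serfHealth is AND-ed with the service's own checks: an instance is
    returned (e.g. over DNS) only if every one of its checks is passing."""
    return all(c["Status"] == "passing" for c in checks)

checks = [
    {"CheckID": "serfHealth", "Status": "passing"},   # node-level watchdog
    {"CheckID": "service:web", "Status": "passing"},  # service's own check
]
assert instance_is_returned(checks)

# If the node stops answering probes, serfHealth goes critical and the
# instance drops out even though its own service check still passes.
checks[0]["Status"] = "critical"
assert not instance_is_returned(checks)
```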
Thanks for the answers. They really clear things up.
Hi @slackpad,
If serfHealth fails for a node (N1) for 1 minute, and there are 1000 services registered on N1, does that mean I wouldn't get any of those services for that minute? If so, isn't that a bad situation, since many of my requests would be failing?
If I use the catalog endpoint to get the services, and a few of the 1000 services are down, I may end up sending requests to down services?
Can you please suggest how I can get healthy services even if their node is down/failing?
Can we not replicate health checks to all server nodes, so that if a node is down we would still be able to get healthy services?
Can we not have an HTTP API endpoint to get healthy services? The current API "/health/service/:service" returns the healthy nodes which have that service.
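On the last point, the health endpoint does accept a `passing` query parameter (`/v1/health/service/:service?passing`) that returns only instances whose checks all pass; the same filtering can also be done client-side. A sketch of the client-side version, with illustrative data shaped like the endpoint's response:

```python
def healthy_instances(entries):
    """Keep only instances whose checks (including the node-level
    serfHealth check) are all passing. Consul can do this server-side
    via the ?passing parameter on /v1/health/service/<name>."""
    return [e for e in entries
            if all(c["Status"] == "passing" for c in e["Checks"])]

entries = [  # illustrative data, not from this thread
    {"Service": {"ID": "web-1"},
     "Checks": [{"CheckID": "serfHealth", "Status": "passing"},
                {"CheckID": "service:web-1", "Status": "passing"}]},
    {"Service": {"ID": "web-2"},
     "Checks": [{"CheckID": "serfHealth", "Status": "critical"},
                {"CheckID": "service:web-2", "Status": "passing"}]},
]
assert [e["Service"]["ID"] for e in healthy_instances(entries)] == ["web-1"]
```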