Etcd: `cluster is healthy` returned by `etcdctl cluster-health` is sometimes confusing

Created on 11 Nov 2015 · 11 Comments · Source: etcd-io/etcd

etcdctl cluster-health
member 2eacd324a7820934 is healthy: got healthy result from http://172.17.8.102:2379
failed to check the health of member 65de4ea85ff20848 on http://172.17.8.103:2379: Get http://172.17.8.103:2379/health: dial tcp 172.17.8.103:2379: i/o timeout
member 65de4ea85ff20848 is unreachable: [http://172.17.8.103:2379] are all unreachable
member ce2a822cea30bfca is healthy: got healthy result from http://172.17.8.101:2379
cluster is healthy

It sounds weird that "cluster is healthy" is returned when there is an i/o error connecting to a down host. Maybe we could say the cluster is degraded in this case.

/cc @philips @vcaputo @ecnahc515

yichengq

👎1

Most helpful comment

etcdctl member-health

On Wed, Nov 11, 2015 at 3:28 PM, Brandon Philips wrote:

> We can write a new tool but I think it is reasonable to have a tool that
> returns non-zero if members are unhealthy or unreachable. Perhaps etcdctl
> cluster-ping or something?
>
> etcdctl cluster-ping --fail-one
> etcdctl cluster-ping --fail-quorum
>
> I dunno

jonboulle on 13 Nov 2015

❤1 🎉1

All 11 comments

I do not really think cluster-health can always give you an accurate view, by any means.

How do we know that the second member is actually down and the cluster is degraded just from seeing an i/o timeout? There might be a client-side connection issue to the second server. I feel that adding more states will just complicate the situation.

The only thing we can reliably tell the user is whether the cluster is healthy or unhealthy at the point it receives the request. And "cluster is healthy" means that a majority of the cluster is OK. Beyond that, we cannot tell anything reliably.

xiang90 on 11 Nov 2015
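
For readers who want the majority rule above spelled out, here is a minimal sketch. It is not etcdctl's actual implementation; the function name `has_quorum` and its parameters are hypothetical, and it only assumes that "majority" means a strict majority of the configured members.

```python
def has_quorum(healthy_members: int, total_members: int) -> bool:
    """Return True when a strict majority of members reported healthy.

    Hypothetical helper for illustration; not part of etcd or etcdctl.
    """
    if total_members <= 0:
        raise ValueError("total_members must be positive")
    quorum = total_members // 2 + 1  # strict majority, e.g. 2 of 3, 3 of 5
    return healthy_members >= quorum


# The three-member cluster from the report: two healthy, one unreachable.
print(has_quorum(healthy_members=2, total_members=3))  # True
print(has_quorum(healthy_members=1, total_members=3))  # False
```

Under this reading, the output in the original report is consistent: two of three members answered, so the majority condition holds even though one member timed out.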

I agree with @xiang90 on this. I think it's overloading this command for it to error out when the client can't reach one of the nodes. Can we just log warnings like "could not reach node xxx", but still return that the cluster is healthy? Then we can clearly define in the docs what "the cluster is healthy" actually means.

jonboulle on 11 Nov 2015

👎1

My thought is that "cluster is healthy" sounds really strong, and it could be reserved for when all members are healthy. "cluster is workable" would be printed when only some members are healthy, and "cluster is unhealthy" when the cluster cannot make progress.

I agree that whatever the result is, it is always the view seen by etcdctl cluster-health. Moreover, I think it needs to explain what it sees in detail.

yichengq on 11 Nov 2015
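
The three-state proposal above can be sketched as follows. This is only an illustration of the idea, not anything etcdctl does: `classify_cluster` is a hypothetical name, and the sketch assumes that "workable" means a strict majority of members is healthy but not all of them, and that "cannot make progress" means the majority is lost.

```python
def classify_cluster(healthy_members: int, total_members: int) -> str:
    """Map a healthy-member count to one of the three proposed states.

    Hypothetical helper; the thresholds are an assumed reading of the proposal.
    """
    if total_members <= 0:
        raise ValueError("total_members must be positive")
    if not 0 <= healthy_members <= total_members:
        raise ValueError("healthy_members must be between 0 and total_members")

    quorum = total_members // 2 + 1  # strict majority
    if healthy_members == total_members:
        return "healthy"    # every member answered healthy
    if healthy_members >= quorum:
        return "workable"   # can still make progress, but degraded
    return "unhealthy"      # majority lost, cannot make progress


# The cluster from the report: two of three members healthy.
print(classify_cluster(2, 3))  # workable
```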

> My thought is that cluster is healthy sounds really strong, and it could
> be used when all members are healthy.

But this could still be true even if the etcdctl client just happens to
have a network blip reaching one of the members...

jonboulle on 11 Nov 2015

👎1

@jonboulle On one hand, I think we have different definitions of the word "healthy". On the other hand, we need to admit that the result is just the view of etcdctl. If the cluster is healthy, and etcdctl cannot reach any of the members, it will say the cluster is unhealthy.

From the admin/user side, these are the three statuses they care about. If the cluster is fully healthy, they don't need to do anything. If the cluster is working but not that well, they will inspect the members. If the cluster doesn't work, they will dig into the problem immediately.

yichengq on 11 Nov 2015

On Wed, Nov 11, 2015 at 11:51 AM, Yicheng Qin wrote:

> If the cluster is healthy, and etcdctl cannot reach any of them, it will
> say the cluster is unhealthy.

See, I would disagree with this behaviour. I think it should rather say
"cannot reach any nodes in cluster". It doesn't have enough information to
judge the health of the cluster (as I'm defining it).

jonboulle on 11 Nov 2015

> If the cluster is healthy, and etcdctl cannot reach any of them, it will say the cluster is unhealthy.

@yichengq etcdctl should not.

> See, I would disagree with this behaviour. I think it should rather say
> "cannot reach any nodes in cluster". It doesn't have enough information to
> judge the health of the cluster (as I'm defining it).

@jonboulle I agree with you. etcd should simply say "unreachable" and that it cannot determine the status of the cluster.

xiang90 on 11 Nov 2015

> See, I would disagree with this behaviour. I think it should rather say
> "cannot reach any nodes in cluster". It doesn't have enough information to
> judge the health of the cluster (as I'm defining it).

That would be better, and it helps to distinguish the statuses clearly. It is really nice to explicitly differentiate the case where etcdctl cannot connect to a member from the case where the member is unhealthy.

yichengq on 11 Nov 2015
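
One way to picture the distinction between "unreachable" and "unhealthy" is to track a per-member status and treat unreachable members as unknown rather than as failed. The sketch below is an assumption-heavy illustration, not etcdctl behaviour: `MemberStatus`, `summarize`, and the "undetermined" verdict are all hypothetical names, and the rule for "unhealthy" (confirmed-unhealthy members make a majority impossible) goes slightly beyond what the thread spells out.

```python
from enum import Enum
from typing import Mapping


class MemberStatus(Enum):
    """Client-side view of one member (hypothetical, for illustration)."""
    HEALTHY = "healthy"          # member answered and reported healthy
    UNHEALTHY = "unhealthy"      # member answered and reported unhealthy
    UNREACHABLE = "unreachable"  # client could not connect; true state unknown


def summarize(statuses: Mapping[str, MemberStatus]) -> str:
    """Return 'healthy', 'unhealthy', or 'undetermined' for the cluster."""
    if not statuses:
        raise ValueError("at least one member is required")

    total = len(statuses)
    quorum = total // 2 + 1  # strict majority
    healthy = sum(1 for s in statuses.values() if s is MemberStatus.HEALTHY)
    unhealthy = sum(1 for s in statuses.values() if s is MemberStatus.UNHEALTHY)

    if healthy >= quorum:
        return "healthy"       # a majority was confirmed healthy
    if total - unhealthy < quorum:
        return "unhealthy"     # too many confirmed failures for a majority
    return "undetermined"      # not enough information from this client


# The report's cluster: member IDs taken from the output at the top.
view = {
    "2eacd324a7820934": MemberStatus.HEALTHY,
    "65de4ea85ff20848": MemberStatus.UNREACHABLE,
    "ce2a822cea30bfca": MemberStatus.HEALTHY,
}
print(summarize(view))  # healthy
```

With this shape, a client that cannot reach any member gets "undetermined" instead of "unhealthy", which matches the point that the client lacks the information to judge the cluster.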

We can write a new tool but I think it is reasonable to have a tool that returns non-zero if members are unhealthy or unreachable. Perhaps etcdctl cluster-ping or something?

etcdctl cluster-ping --fail-one
etcdctl cluster-ping --fail-quorum

I dunno

philips on 12 Nov 2015

😕1
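
The exit-code idea above could look roughly like the sketch below. Note that `cluster-ping`, `--fail-one`, and `--fail-quorum` are only proposals from this thread, not real etcdctl commands or flags, and the semantics here are an assumed reading: `fail_one` fails when any member is unhealthy or unreachable, `fail_quorum` fails only when a strict majority is lost. The function name `exit_code` is hypothetical.

```python
def exit_code(ok_members: int, total_members: int,
              fail_one: bool = False, fail_quorum: bool = False) -> int:
    """Return a process exit status for the proposed (hypothetical) tool.

    ok_members counts members that were both reachable and healthy.
    """
    if total_members <= 0:
        raise ValueError("total_members must be positive")
    if not 0 <= ok_members <= total_members:
        raise ValueError("ok_members must be between 0 and total_members")

    quorum = total_members // 2 + 1  # strict majority
    if fail_one and ok_members < total_members:
        return 1  # at least one member is unhealthy or unreachable
    if fail_quorum and ok_members < quorum:
        return 1  # the cluster has lost its majority
    return 0


# Two of three members OK, as in the report.
print(exit_code(2, 3, fail_one=True))     # 1
print(exit_code(2, 3, fail_quorum=True))  # 0
```

A monitoring script could then pick the stricter or looser mode depending on whether a single lost member should page someone.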

etcdctl member-health
