Elasticsearch typically doesn't work well in cross-datacentre architectures, but how do you define that? So long as there is a reliable and ample network connection between two sites, why not? If Elasticsearch had insight into the reliability of its connections to other nodes in the cluster, this could serve as a vital cluster health metric.
To that end, it would be great if each ES node could track metrics such as shard transfer rate, ping time, packet loss, and relative uptime against any or all other known nodes. Tracking how long minimum_masters has been stable from each node's perspective would be useful too.
These metrics could be used to diagnose or flag stability problems caused by network issues. The availability metrics would be skewed by node restarts and similar events, but they would still be highly useful. The transfer rate data would always be consistent.
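To make the request more concrete, here is a minimal sketch of the kind of per-peer bookkeeping described above. It is not Elasticsearch code: `PeerStats`, `PeerMetrics`, and their methods are hypothetical names used only for illustration, and the sketch assumes something else on the node reports ping results and transfer durations.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class PeerStats:
    """Link statistics one node keeps about one remote node (hypothetical)."""
    ping_ms: List[float] = field(default_factory=list)
    pings_sent: int = 0
    pings_lost: int = 0
    bytes_transferred: int = 0
    transfer_seconds: float = 0.0

    def avg_ping_ms(self) -> Optional[float]:
        # None until at least one ping has been answered.
        return mean(self.ping_ms) if self.ping_ms else None

    def packet_loss(self) -> float:
        # Fraction of pings that never got a reply.
        return self.pings_lost / self.pings_sent if self.pings_sent else 0.0

    def transfer_rate(self) -> Optional[float]:
        # Average shard transfer rate in bytes per second.
        if self.transfer_seconds == 0:
            return None
        return self.bytes_transferred / self.transfer_seconds


class PeerMetrics:
    """Per-node registry of PeerStats, keyed by remote node id (hypothetical)."""

    def __init__(self) -> None:
        self.peers = defaultdict(PeerStats)

    def record_ping(self, node_id: str, rtt_ms: Optional[float]) -> None:
        """Record one ping; rtt_ms=None means the ping was lost."""
        stats = self.peers[node_id]
        stats.pings_sent += 1
        if rtt_ms is None:
            stats.pings_lost += 1
        else:
            stats.ping_ms.append(rtt_ms)

    def record_transfer(self, node_id: str, num_bytes: int, seconds: float) -> None:
        """Record one completed shard transfer to or from node_id."""
        stats = self.peers[node_id]
        stats.bytes_transferred += num_bytes
        stats.transfer_seconds += seconds
```

A real implementation would probably keep a bounded or time-windowed history rather than an unbounded list, so that old measurements age out.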
Discussed in FixItFriday. Agreed that at least some of these metrics would be good to have, but it would be a time-consuming and tedious job to add these stats. Nice to have, but maybe not worth the effort?
I'll mark it as adoptme and high hanging fruit
A simpler way to get started here might be to log warnings on the node (similar to slow logs). If pinging takes longer than a (user-definable) threshold, we could for example log a warning. Same for slow shard transfer rates etc.
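As a rough illustration of that idea (again not Elasticsearch code), the sketch below times a ping and logs a warning when it exceeds a configurable threshold. The function `timed_ping`, the logger name, the `ping_fn` callback, and the 500 ms default are all assumptions made up for this example.

```python
import logging
import time

# Hypothetical logger name, chosen to echo the existing slow logs.
logger = logging.getLogger("slowlog.ping")


def timed_ping(node_id, ping_fn, warn_threshold_ms=500.0, clock=time.monotonic):
    """Run ping_fn(node_id), return the elapsed milliseconds, and log a
    warning if the ping took longer than warn_threshold_ms.

    ping_fn stands in for whatever actually pings the remote node; clock is
    injectable so the behaviour can be tested without real delays.
    """
    start = clock()
    ping_fn(node_id)
    elapsed_ms = (clock() - start) * 1000.0
    if elapsed_ms > warn_threshold_ms:
        logger.warning(
            "slow ping to [%s]: took [%.1fms], threshold [%.1fms]",
            node_id, elapsed_ms, warn_threshold_ms,
        )
    return elapsed_ms
```

The same wrapper shape would work for shard transfers by comparing a measured bytes-per-second rate against a minimum instead of a duration against a maximum.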
+1 @ywelsch
+1 @ywelsch
This issue has been open for a while, but not much has happened with it. I will close it for now: it is a high hanging fruit, there are currently no plans to work on this improvement, and the approach that @ywelsch suggested is an easier way to get started. Please feel free to leave feedback on the proposal (including +1s).