Elasticsearch: Improve Elasticsearch network monitoring

Created on 30 Nov 2018 · 17 Comments · Source: elastic/elasticsearch

Network Monitoring

TcpChannel

The TcpChannel represents a single TCP connection; a minimal counter sketch follows the list below.

  • [ ] Counter - Number of bytes sent
  • [ ] Counter - Number of bytes received
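
A minimal sketch of what such per-channel counters could look like (class and method names are illustrative, not actual Elasticsearch code); `LongAdder` keeps the increments on the hot read/write path cheap under contention:

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: per-channel byte counters that a TcpChannel
// implementation could bump on every write and read.
class ChannelByteCounters {
    private final LongAdder bytesSent = new LongAdder();
    private final LongAdder bytesReceived = new LongAdder();

    void onBytesSent(long bytes) {
        bytesSent.add(bytes);
    }

    void onBytesReceived(long bytes) {
        bytesReceived.add(bytes);
    }

    long bytesSent() {       // counter: cumulative bytes written on this channel
        return bytesSent.sum();
    }

    long bytesReceived() {   // counter: cumulative bytes read on this channel
        return bytesReceived.sum();
    }
}
```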

Transport.Connection

The Transport.Connection represents a collection of channels to another node.

  • [ ] Counter - Number of requests sent
  • [ ] Counter - Number of request timeouts

ConnectionManager

The ConnectionManager represents a manager for outgoing cluster connections. We have one for the local cluster and one for each remote cluster.

  • [ ] Gauge - Number of open Transport.Connection
  • [ ] Gauge - Number of open outbound TcpChannel
  • [ ] Counter - Number of failed requests due to flush failure
  • [ ] Counter - Number of request timeouts
  • [ ] Counter - Number of requests sent
  • [ ] Counter - Transport.Connections closed
  • [ ] Counter - TcpChannels closed

TcpTransport

The TcpTransport holds the state of all the incoming TCP connections.

  • [ ] Gauge - Number of open inbound TcpChannel
  • [ ] Counter - Number of requests received
  • [ ] Counter - Number of responses sent
  • [ ] Counter - Number of failed responses due to flush failure
  • [ ] Counter - Number of inbound TcpChannels closed
  • [ ] Counter - Keepalives sent
  • [ ] Counter - Keepalives received
  • [ ] Gauge - Recycled bytes consumed from PageCacheRecycler
  • [ ] Gauge - Un-recycled bytes consumed from PageCacheRecycler
  • [ ] Gauge - Byte size of netty's heap buffer pool
  • [ ] Gauge - Byte size of netty's direct buffer pool (see the sketch after this list)
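
For the last two gauges, a hedged sketch of how the values could be read from netty, assuming a Netty 4.1 allocator where `PooledByteBufAllocator.DEFAULT.metric()` exposes `usedHeapMemory()`/`usedDirectMemory()`:

```java
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

// Sketch only (assumes Netty 4.1+): read the pooled allocator's usage as gauges.
public class NettyPoolGauges {
    public static void main(String[] args) {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        long heapPoolBytes = metric.usedHeapMemory();     // gauge: netty heap buffer pool bytes
        long directPoolBytes = metric.usedDirectMemory(); // gauge: netty direct buffer pool bytes
        System.out.println("heap=" + heapPoolBytes + "B direct=" + directPoolBytes + "B");
    }
}
```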

All 17 comments

Pinging @elastic/es-distributed

@dliappis - @ywelsch told me that you might have some thoughts on this meta issue?

Thanks for the ping @tbrooks8. Thinking of this at a higher level, and aiming to use these metrics to reduce the time spent identifying real network issues, first of all I have a few general observations.

In most cases the source/dest IP:port pairs are important to help with diagnosis. With the suggested metric types (Gauge/Meter/EWMA), I am not sure how such things would work, e.g. would the metric name be something like `<SRC_IP>:<SRC_PORT>-<DST_IP>:<DST_PORT>`? If yes, this would also need to be periodically reaped from the list over time to reflect the lifecycle of connections.

At any rate, assuming we have solved the naming issue with sockets, I've added some stats that I think can be valuable. (Possibly not all of them are implementable, and some may adversely impact performance.)

  • Meter: throughput (bytes/s) generated by Elasticsearch, per network interface used.

ConnectionManager:

  • List of Metric with the top N connections based on total network volume (sketched below), e.g.:
    ["<SRC_IP1:SRC_PORT1-DST_IP1:PORT1>": "100MB", "<SRC_IP2:SRC_PORT2-DST_IP2:PORT2>": "93MB", ...] over a period of time, say 10s.
  • To help identify SSL misconfigurations: Gauge - # of requests where SSL negotiation failed? (This would be even more useful if sockets can be identified.)
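
A rough sketch of the top-N idea (the key format and names are illustrative): given per-connection byte totals, sort and keep the N largest for the stats response.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: pick the top N connections by total bytes transferred.
// Keys are assumed to be "SRC_IP:SRC_PORT-DST_IP:DST_PORT" strings.
class TopConnectionsByVolume {
    static List<Map.Entry<String, Long>> topN(Map<String, Long> bytesPerConnection, int n) {
        return bytesPerConnection.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .limit(n)
            .collect(Collectors.toList());
    }
}
```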

TcpTransport:

  • Gauge or Metric - number of open (established) connections per source port
  • Gauge or Metric - number of open connections per IP-port pair
  • Would it be possible to have a periodic TCP-based RTT check among nodes and store the result as a list of EWMA values (like uptime) over a period? (See the EWMA sketch after this list.)
  • Same as for ConnectionManager for SSL: Gauge # of requests where SSL negotiation failed
  • Meter - size of netty send queue per time period
  • Meter - size of netty recv queue per time period
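
For reference, the EWMA bookkeeping suggested in the RTT bullet could look roughly like this (the smoothing factor is illustrative; note the next comment argues for plain counters rather than rates/EWMA inside Elasticsearch):

```java
// Illustrative sketch of an exponentially weighted moving average of RTT samples.
class EwmaRtt {
    private final double alpha;            // smoothing factor, e.g. 0.3
    private double averageRttMillis = -1;  // -1 means "no sample yet"

    EwmaRtt(double alpha) {
        this.alpha = alpha;
    }

    synchronized void addSample(double rttMillis) {
        if (averageRttMillis < 0) {
            averageRttMillis = rttMillis; // first sample seeds the average
        } else {
            averageRttMillis = alpha * rttMillis + (1 - alpha) * averageRttMillis;
        }
    }

    synchronized double averageMillis() {
        return averageRttMillis;
    }
}
```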

We typically avoid rates (metric and EWMA) within Elasticsearch. Instead, we expose counters and expect that our monitoring products will expose the rates based on counters collected over a time period. How many connections were opened in the last time period? Do a derivative on the number of connections opened counter.
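
To make the counter-plus-derivative approach concrete, here is a small illustration (hypothetical names and numbers) of how a monitoring product, not Elasticsearch itself, would turn two counter samples into a rate:

```java
// Hypothetical monitoring-side calculation: rate = delta(counter) / delta(time).
class CounterRate {
    static double perSecond(long counterAtT1, long counterAtT2, long intervalSeconds) {
        return (counterAtT2 - counterAtT1) / (double) intervalSeconds;
    }

    public static void main(String[] args) {
        // e.g. 1,000,600 - 1,000,000 connections opened over a 10s poll interval = 60/s
        System.out.println(perSecond(1_000_000L, 1_000_600L, 10));
    }
}
```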

I'll make some adjustments based on the comments.

We typically avoid rates (metric and EWMA) within Elasticsearch

I noticed that. I'll make everything counters, and I guess our strategy here is that this monitoring has to be enabled for these metrics to provide the most value.

I also think that this removes most of the purpose of histogram/latency values. But I will think about it.

Would it be possible to meter response and request latency across all connections between 2 specific nodes (so the user can find out which network connection has lower throughput)?

Also relates to #19335

I think it might also be interesting to get some of these stats on a per-connection-type level (e.g. BULK vs RECOVERY vs REG vs STATE)
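
A sketch of what per-connection-type counters might look like; the enum here is illustrative and simply mirrors the channel profiles named above (BULK, RECOVERY, REG, STATE), not the actual Elasticsearch type:

```java
import java.util.EnumMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: one "requests sent" counter per channel profile.
class PerTypeRequestCounters {
    enum ChannelType { BULK, RECOVERY, REG, STATE }

    private final EnumMap<ChannelType, LongAdder> requestsSent = new EnumMap<>(ChannelType.class);

    PerTypeRequestCounters() {
        for (ChannelType type : ChannelType.values()) {
            requestsSent.put(type, new LongAdder());
        }
    }

    void onRequestSent(ChannelType type) {
        requestsSent.get(type).increment();
    }

    long requestsSent(ChannelType type) {
        return requestsSent.get(type).sum();
    }
}
```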

I noticed that. I'll make everything counters, and I guess our strategy here is that this monitoring has to be enabled for these metrics to provide the most value.

Thanks @tbrooks8.

I want to add, since it wasn't clear from my initial comment: I am fully in favor of the ideas here to expose more metrics about networking within Elasticsearch. My comment was only an iteration on what is presented here, to be consistent with how we expose metrics in general and to think about them in the context of exposing them through monitoring higher up in the stack. Thanks for taking this on.

I updated this ticket with some updated targets to monitor based on feedback.

I think it's a good thing to be able to drill down when viewing different metrics.
For example, consider the number of requests sent. You want to know the total number of requests sent (already reported), the number of requests sent over a particular connection, the number of requests sent over a particular connection for a particular channel type, and the number of requests sent through a particular channel. And there are multiple metrics that are useful at different levels of the hierarchy: requests sent, responses received, bytes sent, bytes received, request latency.
The main issue here is that the cluster is changing: new nodes are added to the cluster and old nodes are removed, and channels are even more transient - they might be re-established even if cluster membership has not changed.
So what do we do with those entities (think TCP channels) that no longer exist?

  1. We keep them in memory. So our response to node stats will contain information about all currently open and already closed connections. However, we cannot keep them forever and we need some mechanism for removing stale entries.
  2. We don't keep destroyed entities in memory. So our response to node stats will contain only currently established connections, and we rely on monitoring software to periodically collect node stats. However, it should be noted that you cannot deduce some metrics if you follow this second approach. For example, it's not possible to know the total number of connections opened since startup: if the first monitoring poll was at X and the second monitoring poll was at X+1 and some connection was opened and closed between X and X+1, there is no way to recover this information. If we follow this approach, we probably need to have more high-level stats, for example the total number of opened channels, the total number of closed channels, etc. (see the sketch below).
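
A rough sketch of option 2 (all names hypothetical): only live channels are reported individually, but a closing channel's totals are folded into node-level counters first, so a channel that opens and closes between two monitoring polls still shows up in the high-level totals.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: node-level totals survive channel churn, while
// per-channel entries are dropped as soon as the channel closes.
class NodeTransportTotals {
    private final Set<Object> liveChannels = ConcurrentHashMap.newKeySet(); // reported per channel
    private final LongAdder channelsOpened = new LongAdder();
    private final LongAdder channelsClosed = new LongAdder();
    private final LongAdder bytesSentOnClosedChannels = new LongAdder();

    void onChannelOpened(Object channel) {
        liveChannels.add(channel);
        channelsOpened.increment();
    }

    void onChannelClosed(Object channel, long bytesSentOnThatChannel) {
        liveChannels.remove(channel);
        channelsClosed.increment();
        bytesSentOnClosedChannels.add(bytesSentOnThatChannel); // keep what the gauge loses
    }

    int openChannels() {         // gauge: current value only
        return liveChannels.size();
    }

    long channelsClosedTotal() { // counter: never decreases
        return channelsClosed.sum();
    }
}
```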

@tbrooks8 @dliappis @jasontedor would be nice to get your opinion on this.

I think what you're describing in 2 makes sense. High level metrics (total channels, bytes, etc). And then specific channel level stats. It is entirely possible that a channel is opened and closed within the poll interval. And that might be missed. But the high level stats will still reflect that there was a channel opened and closed.

cc: @elastic/stack-monitoring for awareness

Would it also be possible to have a counter tracking each node's time connected to the cluster? It would be useful to compare this to the JVM uptime.

I think one valuable thing to have would be a histogram of time spent on the IO thread per request.

This would allow identifying whether or not a cluster/node is hit by some expensive operation on the IO thread, or if things like search throughput are bound by the IO threads (say because there's slow audit logging or similar happening on the IO thread).

We could implement that by just having counters for the various bins e.g. < 10ms, < 100ms, < 1s, > 1s.
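
A minimal sketch of that binned approach (the bucket boundaries are the illustrative ones above):

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: one counter per latency bucket (< 10ms, < 100ms, < 1s, >= 1s).
class IoThreadTimeHistogram {
    private static final long[] UPPER_BOUNDS_MILLIS = {10, 100, 1000};
    private final LongAdder[] bins = new LongAdder[UPPER_BOUNDS_MILLIS.length + 1];

    IoThreadTimeHistogram() {
        for (int i = 0; i < bins.length; i++) {
            bins[i] = new LongAdder();
        }
    }

    void record(long millisOnIoThread) {
        int i = 0;
        while (i < UPPER_BOUNDS_MILLIS.length && millisOnIoThread >= UPPER_BOUNDS_MILLIS[i]) {
            i++;
        }
        bins[i].increment();
    }

    long[] snapshotCounts() {
        long[] counts = new long[bins.length];
        for (int i = 0; i < bins.length; i++) {
            counts[i] = bins[i].sum();
        }
        return counts;
    }
}
```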

This is becoming more critical as enterprise (ECK/ECE) deployments are being designed across same-region datacentres with 5-10ms latency levels (e.g. AWS, Elastic Cloud).

These architectures are also being deployed in cross-zone private datacentres with increased probability of network issues.

Latency and/or throughput issues directly affect shard operation response times, and today we have no visibility into network overhead.

Visibility, and eventually self-diagnosis, will help mitigate this new risk.

Today we don't expose much information about connectivity failures. Disconnections that don't involve the master node are often completely silent, and those that _do_ involve the master will result in a node-left ... reason: disconnected log message without any further details. This doesn't obviously look like an unexpected connectivity issue to end-users: if we exposed the underlying Connection reset or Connection timed out (and the fact that it was unexpected) then this would IMO result in less confusion.

Perhaps the answer here is simply more/better logging, but perhaps it's also something for which statistics would be useful. Disconnections tend to be sporadic, so one must do some analysis to determine whether there's a pattern over time - maybe they're affecting a single faulty node, or maybe it's the connections between certain pairs of nodes (e.g. between AZs).
