Elasticsearch: Improve Elasticsearch network monitoring

Created on 30 Nov 2018 · 17 Comments · Source: elastic/elasticsearch

Network Monitoring

TcpChannel

The TcpChannel represents a single TCP connection; a minimal counter sketch follows the list below.

  • [ ] Counter - Number of bytes sent
  • [ ] Counter - Number of bytes received
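
A minimal sketch of what such per-channel counters could look like (class and method names are illustrative, not actual Elasticsearch code); `LongAdder` keeps the increments on the hot read/write path cheap under contention:

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: per-channel byte counters that a TcpChannel
// implementation could bump on every write and read.
class ChannelByteCounters {
    private final LongAdder bytesSent = new LongAdder();
    private final LongAdder bytesReceived = new LongAdder();

    void onBytesSent(long bytes) {
        bytesSent.add(bytes);
    }

    void onBytesReceived(long bytes) {
        bytesReceived.add(bytes);
    }

    long bytesSent() {       // counter: cumulative bytes written on this channel
        return bytesSent.sum();
    }

    long bytesReceived() {   // counter: cumulative bytes read on this channel
        return bytesReceived.sum();
    }
}
```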

Transport.Connection

The Transport.Connection represents a collection of channels to another node.

  • [ ] Counter - Number of requests sent
  • [ ] Counter - Number of request timeouts

ConnectionManager

The ConnectionManager represents a manager for outgoing cluster connections. We have one for the local cluster and one for each remote cluster.

  • [ ] Gauge - Number of open Transport.Connection
  • [ ] Gauge - Number of open outbound TcpChannel
  • [ ] Counter - Number of failed requests due to flush failure
  • [ ] Counter - Number of request timeouts
  • [ ] Counter - Number of requests sent
  • [ ] Counter - Transport.Connections closed
  • [ ] Counter - TcpChannels closed

TcpTransport

The TcpTransport holds the state of all the incoming TCP connections.

  • [ ] Gauge - Number of open inbound TcpChannel
  • [ ] Counter - Number of requests received
  • [ ] Counter - Number of responses sent
  • [ ] Counter - Number of failed responses due to flush failure
  • [ ] Counter - Number of inbound TcpChannels closed
  • [ ] Counter - Keepalives sent
  • [ ] Counter - Keepalives received
  • [ ] Gauge - Recycled bytes consumed from PageCacheRecycler
  • [ ] Gauge - Un-recycled bytes consumed from PageCacheRecycler
  • [ ] Gauge - Byte size of netty's heap buffer pool
  • [ ] Gauge - Byte size of netty's direct buffer pool (see the sketch after this list)
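
For the last two gauges, a hedged sketch of how the values could be read from netty, assuming a Netty 4.1 allocator where `PooledByteBufAllocator.DEFAULT.metric()` exposes `usedHeapMemory()`/`usedDirectMemory()`:

```java
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

// Sketch only (assumes Netty 4.1+): read the pooled allocator's usage as gauges.
public class NettyPoolGauges {
    public static void main(String[] args) {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        long heapPoolBytes = metric.usedHeapMemory();     // gauge: netty heap buffer pool bytes
        long directPoolBytes = metric.usedDirectMemory(); // gauge: netty direct buffer pool bytes
        System.out.println("heap=" + heapPoolBytes + "B direct=" + directPoolBytes + "B");
    }
}
```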

All 17 comments

Pinging @elastic/es-distributed

@dliappis - @ywelsch told me that you might have some thoughts on this meta issue?

Thanks for the ping @tbrooks8. Thinking of this at a higher level, and aiming to use these metrics to reduce the time spent identifying real network issues, first of all I have a few general observations.

In most cases the source/dest IP:port pairs are important to help with diagnosis. With the suggested metric types (Gauge/Meter/EWMA), I am not sure how such things would work, e.g. would the metric name be something like `<SRC_IP>:<SRC_PORT>-<DST_IP>:<DST_PORT>`? If yes, this would also need to be periodically reaped from the list over time to reflect the lifecycle of connections.

At any rate, assuming we have solved the naming issue with sockets, I've added some stats that I think can be valuable. (Possibly not all of them are implementable, and some may adversely impact performance.)

  • Meter: throughput (bytes/s) generated by Elasticsearch, per network interface used.

ConnectionManager:

  • List of Metric with the top N connections based on total network volume (sketched below), e.g.:
    ["<SRC_IP1:SRC_PORT1-DST_IP1:PORT1>": "100MB", "<SRC_IP2:SRC_PORT2-DST_IP2:PORT2>": "93MB", ...] over a period of time, say 10s.
  • To help identify SSL misconfigurations: Gauge - # of requests where SSL negotiation failed? (This would be even more useful if sockets can be identified.)
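
A rough sketch of the top-N idea (the key format and names are illustrative): given per-connection byte totals, sort and keep the N largest for the stats response.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: pick the top N connections by total bytes transferred.
// Keys are assumed to be "SRC_IP:SRC_PORT-DST_IP:DST_PORT" strings.
class TopConnectionsByVolume {
    static List<Map.Entry<String, Long>> topN(Map<String, Long> bytesPerConnection, int n) {
        return bytesPerConnection.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .limit(n)
            .collect(Collectors.toList());
    }
}
```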

TcpTransport:

  • Gauge or Metric - number of open (established) connections per source port
  • Gauge or Metric - number of open connections per IP-port pair
  • Would it be possible to have a periodic TCP-based RTT check among nodes and store the result as a list of EWMA values (like uptime) over a period? (See the EWMA sketch after this list.)
  • Same as for ConnectionManager for SSL: Gauge # of requests where SSL negotiation failed
  • Meter - size of netty send queue per time period
  • Meter - size of netty recv queue per time period
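
For reference, the EWMA bookkeeping suggested in the RTT bullet could look roughly like this (the smoothing factor is illustrative; note the next comment argues for plain counters rather than rates/EWMA inside Elasticsearch):

```java
// Illustrative sketch of an exponentially weighted moving average of RTT samples.
class EwmaRtt {
    private final double alpha;            // smoothing factor, e.g. 0.3
    private double averageRttMillis = -1;  // -1 means "no sample yet"

    EwmaRtt(double alpha) {
        this.alpha = alpha;
    }

    synchronized void addSample(double rttMillis) {
        if (averageRttMillis < 0) {
            averageRttMillis = rttMillis; // first sample seeds the average
        } else {
            averageRttMillis = alpha * rttMillis + (1 - alpha) * averageRttMillis;
        }
    }

    synchronized double averageMillis() {
        return averageRttMillis;
    }
}
```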

We typically avoid rates (metric and EWMA) within Elasticsearch. Instead, we expose counters and expect that our monitoring products will expose the rates based on counters collected over a time period. How many connections were opened in the last time period? Do a derivative on the number of connections opened counter.
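
To make the counter-plus-derivative approach concrete, here is a small illustration (hypothetical names and numbers) of how a monitoring product, not Elasticsearch itself, would turn two counter samples into a rate:

```java
// Hypothetical monitoring-side calculation: rate = delta(counter) / delta(time).
class CounterRate {
    static double perSecond(long counterAtT1, long counterAtT2, long intervalSeconds) {
        return (counterAtT2 - counterAtT1) / (double) intervalSeconds;
    }

    public static void main(String[] args) {
        // e.g. 1,000,600 - 1,000,000 connections opened over a 10s poll interval = 60/s
        System.out.println(perSecond(1_000_000L, 1_000_600L, 10));
    }
}
```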

I'll make some adjustments based on the comments.

We typically avoid rates (metric and EWMA) within Elasticsearch

I noticed that. I'll make everything counters, and I guess our strategy here is that this monitoring has to be enabled for these metrics to provide the most value.

I also think that this removes most of the purpose of histogram/latency values. But I will think about it.

Would it be possible to meter response and request latency across all connections between 2 specific nodes (so the user can find out which network connection has lower throughput)?

Also relates to #19335

I think it might also be interesting to get some of these stats on a per-connection-type level (e.g. BULK vs RECOVERY vs REG vs STATE)
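
A sketch of what per-connection-type counters might look like; the enum here is illustrative and simply mirrors the channel profiles named above (BULK, RECOVERY, REG, STATE), not the actual Elasticsearch type:

```java
import java.util.EnumMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: one "requests sent" counter per channel profile.
class PerTypeRequestCounters {
    enum ChannelType { BULK, RECOVERY, REG, STATE }

    private final EnumMap<ChannelType, LongAdder> requestsSent = new EnumMap<>(ChannelType.class);

    PerTypeRequestCounters() {
        for (ChannelType type : ChannelType.values()) {
            requestsSent.put(type, new LongAdder());
        }
    }

    void onRequestSent(ChannelType type) {
        requestsSent.get(type).increment();
    }

    long requestsSent(ChannelType type) {
        return requestsSent.get(type).sum();
    }
}
```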

I noticed that. I'll make everything counters, and I guess our strategy here is that this monitoring has to be enabled for these metrics to provide the most value.

Thanks @tbrooks8.

I want to add, since it wasn't clear from my initial comment: I am fully in favor of the ideas here to expose more metrics about networking within Elasticsearch. My comment was only an iteration on what is presented here, to be consistent with how we expose metrics in general and to think about them in the context of exposing them through monitoring higher up in the stack. Thanks for taking this on.

I updated this ticket with some updated targets to monitor based on feedback.

I think it's a good thing to be able to drill down when viewing different metrics.
For example, consider the number of requests sent. You want to know the total number of requests sent (already reported), the number of requests sent over a particular connection, the number of requests sent over a particular connection for a particular channel type, and the number of requests sent through a particular channel. And there are multiple metrics that are useful at different levels of the hierarchy: requests sent, responses received, bytes sent, bytes received, request latency.
The main issue here is that the cluster is changing: new nodes are added to the cluster and old nodes are removed, and channels are even more transient - they might be re-established even if cluster membership has not changed.
So what do we do with those entities (think TCP channels) that no longer exist?

  1. We keep them in memory. So our response to node stats will contain information about all currently open and already closed connections. However, we cannot keep them forever and we need some mechanism for removing stale entries.
  2. We don't keep destroyed entities in memory. So our response to node stats will contain only currently established connections, and we rely on monitoring software to periodically collect node stats. However, it should be noted that you cannot deduce some metrics if you follow this second approach. For example, it's not possible to know the total number of connections opened since startup: if the first monitoring poll was at X and the second monitoring poll was at X+1 and some connection was opened and closed between X and X+1, there is no way to recover this information. If we follow this approach, we probably need to have more high-level stats, for example the total number of opened channels, the total number of closed channels, etc. (see the sketch below).
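
A rough sketch of option 2 (all names hypothetical): only live channels are reported individually, but a closing channel's totals are folded into node-level counters first, so a channel that opens and closes between two monitoring polls still shows up in the high-level totals.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: node-level totals survive channel churn, while
// per-channel entries are dropped as soon as the channel closes.
class NodeTransportTotals {
    private final Set<Object> liveChannels = ConcurrentHashMap.newKeySet(); // reported per channel
    private final LongAdder channelsOpened = new LongAdder();
    private final LongAdder channelsClosed = new LongAdder();
    private final LongAdder bytesSentOnClosedChannels = new LongAdder();

    void onChannelOpened(Object channel) {
        liveChannels.add(channel);
        channelsOpened.increment();
    }

    void onChannelClosed(Object channel, long bytesSentOnThatChannel) {
        liveChannels.remove(channel);
        channelsClosed.increment();
        bytesSentOnClosedChannels.add(bytesSentOnThatChannel); // keep what the gauge loses
    }

    int openChannels() {         // gauge: current value only
        return liveChannels.size();
    }

    long channelsClosedTotal() { // counter: never decreases
        return channelsClosed.sum();
    }
}
```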

@tbrooks8 @dliappis @jasontedor would be nice to get your opinion on this.

I think what you're describing in 2 makes sense. High level metrics (total channels, bytes, etc). And then specific channel level stats. It is entirely possible that a channel is opened and closed within the poll interval. And that might be missed. But the high level stats will still reflect that there was a channel opened and closed.

cc: @elastic/stack-monitoring for awareness

Would it also be possible to have a counter tracking each node's time connected to the cluster? It would be useful to compare this to the JVM uptime.

I think one valuable thing to have would be a histogram of time spent on the IO thread per request.

This would allow identifying whether or not a cluster/node is hit by some expensive operation on the IO thread, or if things like search throughput are bound by the IO threads (say because there's slow audit logging or similar happening on the IO thread).

We could implement that by just having counters for the various bins e.g. < 10ms, < 100ms, < 1s, > 1s.
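
A minimal sketch of that binned approach (the bucket boundaries are the illustrative ones above):

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: one counter per latency bucket (< 10ms, < 100ms, < 1s, >= 1s).
class IoThreadTimeHistogram {
    private static final long[] UPPER_BOUNDS_MILLIS = {10, 100, 1000};
    private final LongAdder[] bins = new LongAdder[UPPER_BOUNDS_MILLIS.length + 1];

    IoThreadTimeHistogram() {
        for (int i = 0; i < bins.length; i++) {
            bins[i] = new LongAdder();
        }
    }

    void record(long millisOnIoThread) {
        int i = 0;
        while (i < UPPER_BOUNDS_MILLIS.length && millisOnIoThread >= UPPER_BOUNDS_MILLIS[i]) {
            i++;
        }
        bins[i].increment();
    }

    long[] snapshotCounts() {
        long[] counts = new long[bins.length];
        for (int i = 0; i < bins.length; i++) {
            counts[i] = bins[i].sum();
        }
        return counts;
    }
}
```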

This is becoming more critical as enterprise (ECK/ECE) deployments are being designed across same-region datacentres with 5-10ms latency levels (e.g. AWS, Elastic Cloud).

These architectures are also being deployed in cross-zone private datacentres with increased probability of network issues.

Latency and/or throughput issues directly affect shard operation response times, and today we have no visibility into network overhead.

Visibility, and eventually self-diagnosis, will help mitigate this new risk.

Today we don't expose much information about connectivity failures. Disconnections that don't involve the master node are often completely silent, and those that _do_ involve the master will result in a node-left ... reason: disconnected log message without any further details. This doesn't obviously look like an unexpected connectivity issue to end-users: if we exposed the underlying Connection reset or Connection timed out (and the fact that it was unexpected) then this would IMO result in less confusion.

Perhaps the answer here is simply more/better logging, but perhaps it's also something for which statistics would be useful. Disconnections tend to be sporadic, so one must do some analysis to determine whether there's a pattern over time - maybe they're affecting a single faulty node, or maybe it's the connections between certain pairs of nodes (e.g. between AZs).
