- **TcpChannel**: represents a tcp connection.
- **Transport.Connection**: represents a collection of channels to another node.
- **ConnectionManager**: a manager for outgoing cluster connections. We have one for the local cluster and one for each remote cluster. (Related: Transport.Connection, TcpChannel. Stats to track: Transport.Connections closed, TcpChannels closed.)
- **TcpTransport**: holds the state of all the incoming tcp connections. (Related: TcpChannel. Stats to track: TcpChannels closed.)
- **PageCacheRecycler**

Pinging @elastic/es-distributed
@dliappis - @ywelsch told me that you might have some thoughts on this meta issue?
Thanks for the ping @tbrooks8. Thinking of this at a higher level, and aiming to use these metrics to reduce the time spent identifying real network issues, first of all I have a few general observations.
In most cases the source/dest IP:port pairs are important to help with diagnosis. With the suggested metric types (Gauge/Meter/EWMA), I am not sure how this would work, e.g. would the metric name have to encode the source/destination IP:port pair?
At any rate, assuming we have solved the naming issue with sockets, I've added some stats that I think could be valuable. (Not all of them may be implementable, and some may adversely impact performance.)
- Meter: throughput (bytes/s) per interface used and generated by Elasticsearch.

**ConnectionManager:**

- Metric with the top N connections based on total network volume over a period of time, say 10s, e.g. `["<SRC_IP1:SRC_PORT1-DST_IP1:PORT1>": "100MB", "<SRC_IP2:SRC_PORT2-DST_IP2:PORT2>": "93MB", ...]`
- Gauge: # of requests where SSL negotiation failed? (This would be even more useful if sockets can be identified.)

**TcpTransport:**

- Gauge or Metric: number of open (established) connections per source port
- Gauge or Metric: number of open connections per IP-port pair
- EWMA (like uptime) over a period?

**ConnectionManager for SSL:**

- Gauge: # of requests where SSL negotiation failed
- Meter: size of the netty send queue per time period
- Meter: size of the netty recv queue per time period

We typically avoid rates (metric and EWMA) within Elasticsearch. Instead, we expose counters and expect that our monitoring products will derive the rates from counters collected over a time period. How many connections were opened in the last time period? Do a derivative on the connections-opened counter.
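To make the counter-only approach concrete, here is a minimal sketch, assuming hypothetical class and method names (this is not the actual Elasticsearch API): only monotonically increasing counters are kept, and the monitoring layer computes rates by diffing successive polls.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counter-only stats holder: no rates or EWMAs are kept here.
// A monitoring product derives rates (e.g. bytes/s) from successive snapshots.
public class TransportNetworkCounters {

    private final LongAdder channelsOpened = new LongAdder();
    private final LongAdder channelsClosed = new LongAdder();
    private final LongAdder bytesSent = new LongAdder();
    private final LongAdder bytesReceived = new LongAdder();
    private final LongAdder sslHandshakeFailures = new LongAdder();

    // Called from the transport layer as events happen.
    public void onChannelOpened() { channelsOpened.increment(); }
    public void onChannelClosed() { channelsClosed.increment(); }
    public void onBytesSent(long bytes) { bytesSent.add(bytes); }
    public void onBytesReceived(long bytes) { bytesReceived.add(bytes); }
    public void onSslHandshakeFailure() { sslHandshakeFailures.increment(); }

    // Snapshot accessors; the caller diffs two snapshots to get a rate.
    public long channelsOpened() { return channelsOpened.sum(); }
    public long channelsClosed() { return channelsClosed.sum(); }
    public long bytesSent() { return bytesSent.sum(); }
    public long bytesReceived() { return bytesReceived.sum(); }
    public long sslHandshakeFailures() { return sslHandshakeFailures.sum(); }
}
```

For example, connections opened per second over the last poll interval would be `(channelsOpened_now - channelsOpened_previous) / intervalSeconds`, computed by the monitoring layer rather than inside Elasticsearch.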
I'll make some adjustments based on the comments.
> We typically avoid rates (metric and EWMA) within Elasticsearch
I noticed that. I'll make everything counters, and I guess our strategy here is that this monitoring has to be enabled for these metrics to provide the most value.
I also think that this removes most of the purpose of histogram/latency values. But I will think about it.
Would it be possible to meter request and response latency across all connections between two specific nodes (so the user can find out which network connection has lower throughput)?
Also relates to #19335
I think it might also be interesting to get some of these stats on a per-connection-type level (e.g. BULK vs RECOVERY vs REG vs STATE)
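One way to get such a per-connection-type breakdown would be to keep a counter per channel type. A minimal sketch follows, using a hypothetical enum and class names (the real transport code has its own representation of these types):

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-connection-type request counters.
public class PerTypeRequestCounters {

    // Hypothetical enum standing in for the transport channel types mentioned above.
    public enum ChannelType { BULK, RECOVERY, REG, STATE }

    private final Map<ChannelType, LongAdder> requestsSent = new EnumMap<>(ChannelType.class);

    public PerTypeRequestCounters() {
        // The map is fully populated up front; only the LongAdders mutate afterwards.
        for (ChannelType type : ChannelType.values()) {
            requestsSent.put(type, new LongAdder());
        }
    }

    public void onRequestSent(ChannelType type) {
        requestsSent.get(type).increment();
    }

    public long requestsSent(ChannelType type) {
        return requestsSent.get(type).sum();
    }
}
```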
> I noticed that. I'll make everything counters, and I guess our strategy here is that this monitoring has to be enabled for these metrics to provide the most value.
Thanks @tbrooks8.
I want to add, since it wasn't clear from my initial comment: I am fully in favor of the ideas here to expose more metrics about networking within Elasticsearch. My comment was only an iteration on what is presented here, to be consistent with how we expose metrics in general and to think about them in the context of exposing them through monitoring higher up in the stack. Thanks for taking this on.
I updated this ticket with some updated targets to monitor based on feedback.
I think it's a good thing to be able to drill down when viewing different metrics.
For example, consider the number of requests sent. You want to know the total number of requests sent (already reported), the number of requests sent over a particular connection, the number sent over a particular connection for a particular channel type, and the number sent through a particular channel. And there are multiple metrics that are useful at different levels of the hierarchy: requests sent, responses received, bytes sent, bytes received, request latency.
The main issue here is that the cluster is changing: new nodes are added and old nodes are removed, and channels are even more transient, since they might be re-established even if cluster membership has not changed.
So what do we do with those entities (think TCP channels) that no longer exist?
@tbrooks8 @dliappis @jasontedor would be nice to get your opinion on this.
I think what you're describing in 2 makes sense. High level metrics (total channels, bytes, etc). And then specific channel level stats. It is entirely possible that a channel is opened and closed within the poll interval. And that might be missed. But the high level stats will still reflect that there was a channel opened and closed.
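One way to implement that behaviour, sketched below with hypothetical names rather than the real transport classes, is to fold a channel's counters into cumulative node-level totals when the channel closes, so even channels that live for less than one poll interval are reflected in the totals:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical tracker: per-channel counters while a channel is alive, folded into
// cumulative totals on close so short-lived channels still show up in node-level stats.
public class ChannelStatsTracker {

    public static final class ChannelStats {
        final LongAdder bytesSent = new LongAdder();
        final LongAdder bytesReceived = new LongAdder();
    }

    private final Set<ChannelStats> liveChannels = ConcurrentHashMap.newKeySet();

    private final LongAdder closedBytesSent = new LongAdder();
    private final LongAdder closedBytesReceived = new LongAdder();
    private final LongAdder channelsClosed = new LongAdder();

    public ChannelStats onChannelOpened() {
        ChannelStats stats = new ChannelStats();
        liveChannels.add(stats);
        return stats;
    }

    public void onChannelClosed(ChannelStats stats) {
        // Fold the per-channel counters into the cumulative totals before dropping them.
        liveChannels.remove(stats);
        closedBytesSent.add(stats.bytesSent.sum());
        closedBytesReceived.add(stats.bytesReceived.sum());
        channelsClosed.increment();
    }

    // Approximate snapshot: totals from closed channels plus whatever live channels have
    // accumulated so far. A channel closing concurrently may be counted slightly off.
    public long totalBytesSent() {
        long sum = closedBytesSent.sum();
        for (ChannelStats stats : liveChannels) {
            sum += stats.bytesSent.sum();
        }
        return sum;
    }
}
```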
cc: @elastic/stack-monitoring for awareness
Would it also be possible to have a counter tracking each node's time connected to the cluster? It would be useful to compare this to the JVM uptime.
I think one valuable thing to have would be a histogram of time spent on the IO thread per request.
This would allow identifying whether or not a cluster/node is hit by some expensive operation on the IO thread, or if things like search throughput are bound by the IO threads (say because there's slow audit logging or similar happening on the IO thread).
We could implement that by just having counters for the various bins, e.g. < 10ms, < 100ms, < 1s, > 1s.
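A minimal sketch of such a counter-based histogram, using the bin boundaries suggested above (class and method names are hypothetical):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Fixed-bin histogram of time spent handling a request on the IO thread,
// implemented purely with counters: < 10ms, < 100ms, < 1s, >= 1s.
public class IoThreadTimeHistogram {

    private final LongAdder under10Ms = new LongAdder();
    private final LongAdder under100Ms = new LongAdder();
    private final LongAdder under1S = new LongAdder();
    private final LongAdder over1S = new LongAdder();

    public void record(long elapsedNanos) {
        long millis = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        if (millis < 10) {
            under10Ms.increment();
        } else if (millis < 100) {
            under100Ms.increment();
        } else if (millis < 1000) {
            under1S.increment();
        } else {
            over1S.increment();
        }
    }

    // Bin counts in order: < 10ms, < 100ms, < 1s, >= 1s.
    public long[] snapshot() {
        return new long[] { under10Ms.sum(), under100Ms.sum(), under1S.sum(), over1S.sum() };
    }
}
```

The caller would wrap the work done on the IO thread with `long start = System.nanoTime();` and `histogram.record(System.nanoTime() - start);`, and the monitoring layer diffs bin counts between polls.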
This is becoming more critical as enterprise (ECK/ECE) deployments are being designed across same-region datacentres with 5-10ms latency levels (e.g. AWS, Elastic Cloud).
These architectures are also being deployed in cross-zone private datacentres with increased probability of network issues.
Latency and/or throughput issues directly affect shard operation response times, and today we have no visibility into network overhead.
Visibility, and eventually self-diagnosis, will help mitigate this new risk.
We don't currently expose much information about connectivity failures. Disconnections that don't involve the master node are often completely silent, and those that _do_ involve the master will result in a `node-left ... reason: disconnected` log message without any further details. This doesn't obviously look like an unexpected connectivity issue to end-users: if we exposed the underlying `Connection reset` or `Connection timed out` (and the fact that it was unexpected), then this would IMO result in less confusion.
Perhaps the answer here is simply more/better logging, but perhaps it's also something for which statistics would be useful. Disconnections tend to be sporadic, so one must do some analysis to determine whether there's a pattern over time - maybe they're affecting a single faulty node, or maybe it's the connections between certain pairs of nodes (e.g. between AZs).
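If statistics do turn out to be useful here, one possible sketch (hypothetical names, not an existing API) is to count unexpected disconnections per remote node and underlying cause, which would make patterns like "resets only on connections to a particular node or AZ" visible over time:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical: counts unexpected disconnections per (remote node, cause) pair so that
// sporadic resets/timeouts can be analysed for patterns over time.
public class DisconnectionStats {

    private final Map<String, LongAdder> disconnects = new ConcurrentHashMap<>();

    public void onUnexpectedDisconnect(String remoteNodeId, Throwable cause) {
        // Key by remote node plus the exception type (e.g. connection reset vs timeout).
        String key = remoteNodeId + "|" + cause.getClass().getSimpleName();
        disconnects.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public Map<String, Long> snapshot() {
        Map<String, Long> copy = new HashMap<>();
        disconnects.forEach((key, counter) -> copy.put(key, counter.sum()));
        return copy;
    }
}
```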