For binary protocols or services using HTTPS, it can be confusing to see zero stats in the CLI and GUI. TCP stats do exist, but right now they are only accessible in the Grafana dashboards.
Introduce a CLI and GUI panel for TCP metrics that shows whether the protocol has been detected or not and the general metrics. See #2223 for the other side of this.
$ linkerd tcp
Maybe as a flag to stat? The main reason for not going that route is that the headers will be totally different.
What is the list of TCP metrics that we have?
@grampelberg Here's what I see for TCP metrics in a pod running MySQL:
Full metrics output
# HELP tcp_open_total Total count of opened connections
# TYPE tcp_open_total counter
tcp_open_total{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http"} 24
tcp_open_total{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http"} 24
# HELP tcp_open_connections Number of currently-open connections
# TYPE tcp_open_connections gauge
tcp_open_connections{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http"} 21
tcp_open_connections{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http"} 21
# HELP tcp_read_bytes_total Total count of bytes read from peers
# TYPE tcp_read_bytes_total counter
tcp_read_bytes_total{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http"} 281477
tcp_read_bytes_total{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http"} 1198068
# HELP tcp_write_bytes_total Total count of bytes written to peers
# TYPE tcp_write_bytes_total counter
tcp_write_bytes_total{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http"} 1198068
tcp_write_bytes_total{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http"} 281477
# HELP tcp_close_total Total count of closed connections
# TYPE tcp_close_total counter
tcp_close_total{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno=""} 3
tcp_close_total{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno=""} 3
# HELP tcp_connection_duration_ms Connection lifetimes
# TYPE tcp_connection_duration_ms histogram
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="1"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="2"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="3"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="4"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="5"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="10"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="20"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="30"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="40"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="50"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="100"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="200"} 2
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="300"} 2
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="400"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="500"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="1000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="2000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="3000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="4000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="5000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="10000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="20000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="30000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="40000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="50000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno="",le="+Inf"} 3
tcp_connection_duration_ms_count{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno=""} 3
tcp_connection_duration_ms_sum{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http",errno=""} 663
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="1"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="2"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="3"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="4"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="5"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="10"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="20"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="30"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="40"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="50"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="100"} 0
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="200"} 2
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="300"} 2
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="400"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="500"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="1000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="2000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="3000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="4000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="5000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="10000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="20000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="30000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="40000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="50000"} 3
tcp_connection_duration_ms_bucket{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno="",le="+Inf"} 3
tcp_connection_duration_ms_count{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno=""} 3
tcp_connection_duration_ms_sum{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http",errno=""} 662
# HELP control_request_total Total count of HTTP requests.
# TYPE control_request_total counter
control_request_total{addr="linkerd-proxy-api.linkerd.svc.cluster.local:8086",tls="disabled"} 0
# HELP control_response_latency_ms Elapsed times between a request's headers being received and its response stream completing
# TYPE control_response_latency_ms histogram
# HELP control_response_total Total count of HTTP responses.
# TYPE control_response_total counter
# HELP control_retry_skipped_total Total count of retryable HTTP responses that were not retried.
# TYPE control_retry_skipped_total counter
# HELP process_start_time_seconds Time that the process started (in seconds since the UNIX epoch)
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1549650682
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 2
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 60
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1048576
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 96190464
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 3321856
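To turn a dump like the one above into a single number for a table cell, the relevant series have to be picked out by metric name and label. The following is a minimal sketch of that step, not how linkerd reads these values (the thread implies they come from Prometheus queries). The `sumMetric` helper and its naive line parsing are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up the samples of one metric in Prometheus text output,
// keeping only series whose label block contains the given label pair.
// Naive on purpose (hypothetical helper): it assumes the value is the last
// space-separated field (no trailing timestamp) and does no label unescaping.
func sumMetric(text, name, label string) (float64, error) {
	var total float64
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		series, value := line[:i], line[i+1:]
		// The metric name ends where the label block starts, if there is one.
		metric := series
		if j := strings.Index(series, "{"); j >= 0 {
			metric = series[:j]
		}
		if metric != name || !strings.Contains(series, label) {
			continue
		}
		v, err := strconv.ParseFloat(value, 64)
		if err != nil {
			return 0, fmt.Errorf("bad sample %q: %w", line, err)
		}
		total += v
	}
	return total, nil
}

func main() {
	// A few lines copied from the output above.
	const text = `# TYPE tcp_open_connections gauge
tcp_open_connections{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http"} 21
tcp_open_connections{direction="inbound",peer="dst",tls="no_identity",no_tls_reason="not_http"} 21
tcp_read_bytes_total{direction="inbound",peer="src",tls="no_identity",no_tls_reason="not_http"} 281477`

	open, err := sumMetric(text, "tcp_open_connections", `peer="src"`)
	if err != nil {
		panic(err)
	}
	fmt.Println(open) // 21
}
```

Filtering on `peer="src"` matters here: every inbound connection shows up once per peer, so summing both series would double-count.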
So, maybe bytes/second, connection duration, and current connections for the columns? Is there anything else useful we could get out of that?
Yep, those are the 3 that seem relevant to me. I could TIOLI on connection duration, but it could be useful in some scenarios.
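Since `tcp_read_bytes_total` and `tcp_write_bytes_total` are counters, a bytes/second column has to be derived from two observations of the counter over a time window. Here is a rough sketch of that arithmetic, assuming a hypothetical `bytesPerSecond` helper. In practice a Prometheus `rate()` query would likely do this work, and the second scrape value below is made up.

```go
package main

import (
	"fmt"
	"time"
)

// bytesPerSecond returns the average rate of a byte counter between two
// observations taken `elapsed` apart (hypothetical helper for illustration).
// If the counter went backwards, it is assumed to have restarted from zero
// (for example after a proxy restart), so the newer value is used as the
// whole increase.
func bytesPerSecond(prev, cur uint64, elapsed time.Duration) float64 {
	if elapsed <= 0 {
		return 0
	}
	delta := cur
	if cur >= prev {
		delta = cur - prev
	}
	return float64(delta) / elapsed.Seconds()
}

func main() {
	// 281477 is tcp_read_bytes_total for peer="src" in the output above;
	// 341477 is an invented value for a scrape taken 30 seconds later.
	rate := bytesPerSecond(281477, 341477, 30*time.Second)
	fmt.Printf("%.0f B/s\n", rate) // 2000 B/s
}
```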
Connection duration, as it's currently measured, is basically useless. See https://github.com/linkerd/linkerd2/issues/2207
I'm trying to work on a proto addition to the public api for these. Thoughts on the following, based on the comments above?
message TcpStatSummaryResponse {
  oneof response {
    Ok ok = 1;
    ResourceError error = 2;
  }
  message Ok {
    repeated TcpStatTable stat_tables = 1;
  }
}
message TcpStatTable {
  oneof table {
    PodGroup pod_group = 1;
  }
  message PodGroup {
    repeated Row rows = 1;
    message Row {
      Resource resource = 1;
      string time_window = 2;
      // number of currently open connections
      uint64 tcp_open_connections = 3;
      // total count of bytes read from peers
      uint64 tcp_read_bytes_total = 4;
      // total count of bytes written to peers
      uint64 tcp_write_bytes_total = 5;
      // connection lifetime
      uint64 tcp_connection_duration_ms = 6;
      // Stores a set of errors for each pod name. If a pod has no errors, it may be omitted.
      map<string, PodErrors> errors_by_pod = 7;
    }
  }
}
rpc TcpStatSummary(StatSummaryRequest) returns (TcpStatSummaryResponse) {}
Instead of providing a new endpoint, how about modifying the existing StatSummary endpoint to also provide TCP stats if requested? I'm thinking something along the lines of:
diff --git a/proto/public.proto b/proto/public.proto
index 329206cd..e8cba256 100644
--- a/proto/public.proto
+++ b/proto/public.proto
@@ -315,6 +315,7 @@ message StatSummaryRequest {
}
bool skip_stats = 6; // true if we want to skip stats from Prometheus
+ bool tcp_stats = 7;
}
message StatSummaryResponse {
@@ -339,6 +340,15 @@ message BasicStats {
uint64 actual_failure_count = 8;
}
+message TcpStats {
+ // number of currently open connections
+ uint64 open_connections = 1;
+ // total count of bytes read from peers
+ uint64 read_bytes_total = 2;
+ // total count of bytes written to peers
+ uint64 write_bytes_total = 3;
+}
+
message StatTable {
oneof table {
PodGroup pod_group = 1;
@@ -359,6 +369,7 @@ message StatTable {
uint64 failed_pod_count = 6;
BasicStats stats = 5;
+ TcpStats tcp_stats = 8;
// Stores a set of errors for each pod name. If a pod has no errors, it may be omitted.
map<string, PodErrors> errors_by_pod = 7;
Per @olix0r's comment, I think we should leave connection duration out of the TcpStats struct, but we can always add it later.
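To make the shape of that proposal concrete, here is a rough Go sketch of how a CLI might consume a row carrying the optional `TcpStats` message. The `TcpStats` and `Row` types are hand-written stand-ins, not the generated protobuf code, and the column headers and the `tcpCells` helper are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"text/tabwriter"
)

// TcpStats mirrors the fields of the proposed proto message.
// Hand-written for illustration; not the generated type.
type TcpStats struct {
	OpenConnections uint64
	ReadBytesTotal  uint64
	WriteBytesTotal uint64
}

// Row is a trimmed-down, hypothetical stand-in for StatTable.PodGroup.Row.
type Row struct {
	Name     string
	TcpStats *TcpStats // nil when the request did not ask for TCP stats
}

// tcpCells returns the table cells for one row, using "-" for every TCP
// column when no TCP stats are attached to the row.
func tcpCells(r Row) []string {
	if r.TcpStats == nil {
		return []string{r.Name, "-", "-", "-"}
	}
	return []string{
		r.Name,
		strconv.FormatUint(r.TcpStats.OpenConnections, 10),
		strconv.FormatUint(r.TcpStats.ReadBytesTotal, 10),
		strconv.FormatUint(r.TcpStats.WriteBytesTotal, 10),
	}
}

func main() {
	rows := []Row{
		// Values taken from the peer="src" series in the MySQL output above.
		{Name: "mysql", TcpStats: &TcpStats{OpenConnections: 21, ReadBytesTotal: 281477, WriteBytesTotal: 1198068}},
		{Name: "web"}, // no TCP stats on this row
	}

	tw := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
	// Hypothetical column headers; the real CLI may name them differently.
	fmt.Fprintln(tw, strings.Join([]string{"NAME", "CONN", "READ_BYTES", "WRITE_BYTES"}, "\t"))
	for _, r := range rows {
		fmt.Fprintln(tw, strings.Join(tcpCells(r), "\t"))
	}
	tw.Flush()
}
```

Keeping `TcpStats` as a separate, optional message means rows returned for an ordinary HTTP stat request simply leave it unset, so existing consumers of the response should not need to change.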
Totally! I went back and forth on whether to add this to the existing stat summary or make a new endpoint. Since the returned stats were completely different, I wasn't sure whether we were overcrowding an already complicated API. But since we'll use the same StatSummaryRequest and a lot of the same code for figuring out inbound/outbound prometheus queries, this is probably a good choice!
Going to close this issue now that tcp stats have been added to the web UI and CLI.
Opened https://github.com/linkerd/linkerd2/issues/2460 to track adding tcp_connection_duration.