I have been struggling with this alert, and I have come to the conclusion that the problem lies in the chosen threshold, at least in my environment. I would like to know whether my conclusions are correct, or be told to dig deeper into my issue.
My setup is:
A 3-node etcd 3.2.25 cluster running in Google Cloud across 3 zones. The nodes are custom 2-CPU, 4 GB RAM machines, and the average ICMP ping between nodes is about 0.5 ms.
Everything works fine; I have no errors or issues that I can see, other than this alert firing at 0.2 s (the threshold is 0.15 s).
When graphing the metric I can see it jump between 0.1 s and 0.2 s, almost in straight lines.
This is my conclusion, and why I think the problem lies in the threshold:
The metric used for this, etcd_network_peer_round_trip_time_seconds_bucket, has exponentially sized buckets (each upper bound double the previous one, e.g. 0.1024, 0.2048, 0.4096), and as described in the Prometheus documentation, quantile estimation from histogram buckets is error-prone.
To my understanding, the estimation error is greater when the number of data points in the interval is small.
In my case, and this might be my issue, I scrape the etcd metrics endpoint every 30 s, and in each 30 s interval etcd_network_peer_round_trip_time_seconds_count only goes up by 2.
That leaves very few data points for histogram_quantile's linear interpolation to estimate the value from, which is why using a threshold that sits in the middle of two bucket boundaries (0.15 s) is extremely prone to flapping and alerting in my case.
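To illustrate the flapping, here is a rough Python sketch of the linear interpolation histogram_quantile performs (a simplification of my own, not Prometheus code: it skips the rate() step, and the bucket boundaries are my reading of etcd's exponential layout):

```python
import math

def histogram_quantile(q, buckets):
    """Rough sketch of PromQL's histogram_quantile interpolation.
    buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the +Inf bucket
            # Linear interpolation inside the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# With only 2 samples per interval, both RTTs land in a single bucket,
# so the p99 estimate snaps to (almost) that bucket's upper bound.
both_in_0_1024_bucket = [(0.0512, 0), (0.1024, 2), (0.2048, 2), (math.inf, 2)]
both_in_0_2048_bucket = [(0.0512, 0), (0.1024, 0), (0.2048, 2), (math.inf, 2)]
print(histogram_quantile(0.99, both_in_0_1024_bucket))  # ~0.1019
print(histogram_quantile(0.99, both_in_0_2048_bucket))  # ~0.2038
```

If this sketch is right, the estimate flaps between roughly 0.10 s and 0.20 s depending on which single bucket the two RTT samples fell into, which would match the straight lines at 0.1 s and 0.2 s on my graph and explain why a 0.15 s threshold sits right in the flapping zone.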
Is it normal to only get one data point every 15 s on average for etcd_network_peer_round_trip_time_seconds_count and the related buckets, or is something wrong on my side?
I have currently raised my threshold to 0.2048 s so that I mostly monitor for events in the upper buckets, but I am struggling to understand whether I am just missing some obvious issue on my side.
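For reference, here is how I arrived at 0.2048: assuming the buckets start at 0.0001 s and double each time (this is my reading of the bucket layout; check the `le` labels on your own /metrics output to confirm), 0.2048 is exactly a bucket upper bound, while 0.15 falls mid-bucket:

```python
# Assumed bucket layout: upper bounds of 0.0001s doubling per bucket
# (an assumption; verify against the actual `le` labels etcd exposes).
bounds = [round(0.0001 * 2**k, 4) for k in range(14)]
print(bounds[-4:])       # [0.1024, 0.2048, 0.4096, 0.8192]
print(0.2048 in bounds)  # True  -> threshold sits on a bucket edge
print(0.15 in bounds)    # False -> 0.15 falls between 0.1024 and 0.2048
```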
thanks
> I have been struggling with this alert and I came to the conclusion that the problem lies in the chosen threshold, at least in my environment. I would like to understand if my conclusions are correct or be told to dig deeper into my issue.
@primeroz These metric definitions are a guideline and a place to start with defaults; as you noted, they can't cover every environment.
> Is it normal to only have 1 data point every 15s on average for etcd_network_peer_round_trip_time_seconds_count and the related bucket? Or is something wrong on my side?
Prior to the PR referenced below, we only took data points for this metric from snapshots, so your observation appears correct, but we are going to improve that going forward.
ref: https://github.com/etcd-io/etcd/pull/10155
The last point I would make: although things appear to be fine, review the time-parameters section of the tuning docs. Watch out for excessive leader elections and act accordingly.
@primeroz Did this answer your question? If not, do you have anything to add?
Yeah, this covers what I needed. Thanks!