fsync is the lifeblood of etcd performance, and it performs variably in complex cloud deployments. The current log message gives only a single readout: "Fsync time was X, slower than Y".
The case for higher-resolution metrics: when you see this problem, it means something catastrophic may be happening in your cluster, and downtime could be on the way. Most people don't monitor all disk performance metrics, and even those who do may not specifically be monitoring the drives that etcd is looking at. Hence it's easier for users to do fsync forensics directly from etcd's metrics, if they are available.
Counterpoint: etcd doesn't need to create metrics for warnings, nor provide fine-grained information on disk performance, so this should really stay a warning. Warnings shouldn't be turned into metrics.
Summary of existing metrics/ideas related to this:
rate(etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"}[2m])
/ ignoring(le)
rate(etcd_disk_wal_fsync_duration_seconds_count[2m])
Which gives the fraction of fsyncs completing within the bucket boundary over the last two minutes, so you have an instantaneous read on problems that might be creeping up; much more useful than a histogram of total fsync counts accumulated since the birth of Christ :).
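For example, one could alert on that ratio dropping below a threshold, meaning too few fsyncs complete within the bucket boundary. This is only a sketch: the 0.99 threshold is an arbitrary illustration, and the `le` value must match one of etcd's actual bucket boundaries:

```promql
rate(etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"}[2m])
  / ignoring(le)
rate(etcd_disk_wal_fsync_duration_seconds_count[2m])
  < 0.99
```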
# HELP etcd_disk_wal_fsync_duration_current The current WAL fsync time; if it's rising, disk slowness may decrease availability
# TYPE etcd_disk_wal_fsync_duration_current gauge
etcd_disk_wal_fsync_duration_current 0.010661352
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 3
etcd_disk_wal_fsync_duration_seconds_sum 0.11810550900000001
etcd_disk_wal_fsync_duration_seconds_count 3
"average sync duration of %v over the past 10 minutes is extremely slow; faster disk i/o is strongly suggested for highly available clusters."

Or something like:

"For the past X minutes, fsync durations have exceeded the maximum suggested duration by an average of X seconds; faster disk i/o is strongly suggested for highly available clusters."

For comparison, the current warning is:

plog.Warningf("sync duration of %v, expected less than %v", duration, warnSyncDuration)

/cc @fabxc any opinions? thanks in advance!
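The windowed-average warning proposed above could be sketched as follows. This is illustrative only: etcd is written in Go, and the class, threshold constant, and message text below are all hypothetical, not etcd's actual implementation.

```python
import time
from collections import deque

WARN_SYNC_DURATION = 1.0  # hypothetical threshold, seconds
WINDOW = 10 * 60          # 10-minute window, seconds


class FsyncMonitor:
    """Tracks recent fsync durations and reports a windowed average."""

    def __init__(self, window=WINDOW, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.samples = deque()  # (timestamp, duration) pairs

    def observe(self, duration):
        """Record one fsync duration and evict samples older than the window."""
        now = self.clock()
        self.samples.append((now, duration))
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def average(self):
        """Average fsync duration over the samples still in the window."""
        if not self.samples:
            return 0.0
        return sum(d for _, d in self.samples) / len(self.samples)

    def warning(self):
        """Return a warning string if the windowed average breaches the threshold."""
        avg = self.average()
        if avg > WARN_SYNC_DURATION:
            return ("average fsync duration of %.3fs over the past %d minutes; "
                    "faster disk i/o is strongly suggested for highly available clusters"
                    % (avg, self.window // 60))
        return None
```

Compared with the current one-shot `plog.Warningf`, a windowed average like this would not fire on a single slow outlier.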
+1 to this. In my experience, this is the single most important thing to monitor: slow/variable i/o will cause leader losses for sure. If I could give every etcd cluster NVMe- or SSD-backed storage I would, but cloud environments can have unpredictable storage performance. A gauge exposing the latest fsync duration, scrapable by a time-series metrics collector, would be valuable for real-time monitoring and health readouts.
there is already a pretty fine-grained bucketed fsync duration metric. i do not understand why you want a metric for just the "latest" one. what additional information would that give you? what additional actions could you take based on it?
i operate etcd on both fast and slow disks. i have never felt the bucketed metric is not enough.
Generally I wouldn't agree with warning conditions not being captured by metrics. If it is sufficiently relevant to the application health, as it obviously is here, it makes sense.
Can you elaborate on why you consider the histogram, which has well-chosen bucket sizes in general, useless?
It generally seems to capture this pretty well. When querying the histogram metric in your TSDB, you can pick a shorter or longer window depending on how fast you want to detect increases in latency.
Nothing speaking against logs of course. But the one you showed could also be expressed with the histogram metric you already have. A PromQL query here (but any capable query engine against a TSDB should be able to express this, really):
instance:etcd_disk_wal_duration_seconds:avg10m =
increase(etcd_disk_wal_fsync_duration_seconds_sum[10m]) /
increase(etcd_disk_wal_fsync_duration_seconds_count[10m])
Obviously this puts a lot more power into the users' hands than making this a static computation compiled into etcd itself. Choosing to calculate the average here versus a quantile won't make much of a difference in practice, quantiles being better at detecting slow outliers.
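The quantile variant mentioned above can be expressed with `histogram_quantile`; the 0.99 quantile here is an arbitrary choice:

```promql
histogram_quantile(0.99,
  sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])))
```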
To think this further through, you could use above result for your alerting like
predict_linear(instance:etcd_disk_wal_duration_seconds:avg10m[10m], 30 * 60) > 0.5
to get notified if the avg fsync latency has been ramping up continuously over the last 10m such that it will exceed 500ms within 30 minutes.
This seems rather powerful to me. Certainly more powerful than a single log line or a metric representing only the most recent latency observation. The latter is obviously prone to triggering alerts on a single outlier.
I might not have fully understood the use case. If so, can you describe an example of what you want to detect and how the histogram (plus a capable monitoring system obviously) does not suffice?
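Put together in the Prometheus 2.x rules-file format, the recording rule and `predict_linear` alert above might look like this; the group name, alert name, and `for` clause are made up for illustration:

```yaml
groups:
- name: etcd-fsync
  rules:
  # Recording rule: 10-minute average fsync latency per instance.
  - record: instance:etcd_disk_wal_duration_seconds:avg10m
    expr: |
      increase(etcd_disk_wal_fsync_duration_seconds_sum[10m])
        / increase(etcd_disk_wal_fsync_duration_seconds_count[10m])
  # Alert if the average is ramping up such that it would exceed 500ms in 30 minutes.
  - alert: EtcdFsyncLatencyRampingUp
    expr: predict_linear(instance:etcd_disk_wal_duration_seconds:avg10m[10m], 30 * 60) > 0.5
    for: 5m
```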
The difference users need to see, and alert against, may be finer-grained changes from .1 to 5 seconds.
Histogramming the ranges from 2, 4, 8, 16s and beyond doesn't help with this.
Although it's up to the user to decide; after all, i think pretty much everyone agrees that fsync > 1 second means you're in BIG trouble for any real cluster.
This made it sound like we want to alert on really anything > 1 second. So why is high resolution needed beyond that?
I'm just concerned that this is a theoretical problem. The histogram is pretty high-res as it is and should give pretty accurate estimates.
To be honest, if fsync usually takes more than 1s, etcd is almost unusable (we document this in the hardware recommendations). I do not think fsync buckets > 1 second with high resolution are ever needed from etcd's point of view. if you experience fsync > 1s, you need to continue debugging at the system level, not at the etcd level.
in short, etcd metrics are really for monitoring application performance and reporting issues when resources are not enough for etcd. do not use etcd metrics to debug os issues.
I do not think fsync bucket > 1second with high resolution is ever needed from etcd point of view.
Please note that there are many solutions that would be acceptable and useful to us; we're not coupled to anything other than improving the information content here, in some form that is more granular than what we have now.
do not use etcd metrics to debug os issues.
agree, there is a fine line between "monitoring fsync changes" and "monitoring filesystem performance", and admittedly a very high correlation... but the 'wal fsync' is, and always will be, an operation measured and executed by etcd, not by the OS. and given how important it is, more explicit metrics and logs here are really very important to us.
@heyitsanthony @xiang90 @fabxc We'd like to schedule a google hangout to discuss this and see if maybe we are shooting past each other. There might be an idea that we can all agree on here. How does tomorrow at noon EST sound? (that's 9AM PST) Maybe too early? Not sure where you guys are located.
@jayunit100 is the metrics output you're suggesting this:
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.1"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.2"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.4"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.8"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.6"} 3
instead of what's there now:
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 3
?
@heyitsanthony We're open to anything :), there are lots of ways to provide more/better fsync info.
.1,.2,.4,.8,1.6 is more useful than .1,.2..5,... in that it provides one more data point. It's super minimal, but a step forward. My thinking above was based on the generic 100-millisecond guidelines; the difference between a .001 and a .004 fsync is not useful to anyone, I assume. I think our ideal might be a gauge, since it is easy to monitor and doesn't reduce the resolution of the information, but I think you guys have better intuition about the best way forward. There are subtleties around what's better or worse, but I think we can all agree that there might be something better than the current situation. I hesitate to pick any of these as "what we want" because all we really want is more information, and a second look at how best to make fsync super easy to tune, monitor, and flag in complex situations.
i am going to close this one. based on our debugging experience, what we have today is enough for slow fsync alerting. and people feel comfortable debugging slow io with os-level tools (for example, tools built atop ebpf).