Etcd: Brainstorm ideas for better information for slow fsyncs

Created on 10 Apr 2017 · 13 comments · Source: etcd-io/etcd

Fsync is the lifeblood of etcd performance, and it performs variably in complex cloud deployments. The current log message gives only a single readout: 'Fsync time was X, slower than Y'.

  • Case for higher-resolution metrics: when you see this problem, it means something catastrophic may be happening in your cluster, and downtime could be on the way. Most people don't monitor every disk performance metric, and even if they do, they may not specifically be monitoring the drives that etcd is writing to; hence it's easier for users to do fsync forensics directly from etcd if the data is available.

  • Counterpoint: etcd doesn't need to create metrics for warnings, nor provide fine-grained information on disk performance, so this really should stay a warning. Warnings shouldn't be put into metrics.

#7671 introduced some ideas for better fsync debugging...

  1. Linear metrics. Advantage: fine-grained disk performance. Disadvantage: too many new metrics get added when we go linearly from 1 -> 8k.
  2. Gauges. Advantage: only one metric, and it gives continuous, instant feedback on fsync performance.
  3. A rolling average in the logs: report the 'last 10' fsyncs or the 'average fsync'. Of course, this is really just reproducing metrics inside the logs, but maybe it gets around reviewer concerns.

Summary of existing metrics/ideas related to this:

  • One possibility from discussing w/ @heyitsanthony is using rates. I noodled some on it and came up with this:
  rate(etcd_disk_wal_fsync_duration_seconds_bucket{le="1"}[2m])
/
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m])

This measures the rate of change, giving you an instantaneous read on problems that might be creeping up, which is much more useful than a histogram of total fsync buckets accumulated since the birth of Christ :).
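For illustration, here is a sketch (mine, not from the thread) of how that ratio could back an alert. It assumes Prometheus; the denominator uses the _count series with ignoring(le) so the division matches cleanly, since the left side carries a le label, and the 0.99 threshold is purely illustrative:

# Fire when fewer than 99% of fsyncs over the last 2m completed within 1s.
  rate(etcd_disk_wal_fsync_duration_seconds_bucket{le="1"}[2m])
/ ignoring(le)
  rate(etcd_disk_wal_fsync_duration_seconds_count[2m])
< 0.99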

  • One original option was to simply have a separate metric for all warnings; that way users could directly count the metrics they were interested in rather than customizing their filters. Although it's up to the user to decide, after all I think pretty much everyone agrees that fsync > 1 second means you're in BIG trouble for any real cluster.
  • If we want to avoid metrics for warnings, then I was hoping to have something like the following (see the first sketch after this list):
# HELP etcd_disk_wal_fsync_duration_current The current WAL fsync time; if it is rising, disk slowness may decrease availability
# TYPE etcd_disk_wal_fsync_duration_current gauge
etcd_disk_wal_fsync_duration_current 0.010661352
  • Alas, this is what we currently have. It's not really useful at all, since once fsync latency gets above a certain point your cluster is pretty much dead, so the 2-second-and-up buckets in the time series really amount to just drinking from a firehose.
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 3
etcd_disk_wal_fsync_duration_seconds_sum 0.11810550900000001
etcd_disk_wal_fsync_duration_seconds_count 3
  • Likewise, a more useful log would be something like: "average sync duration of %v over the past 10 minutes is extremely slow; faster disk i/o is strongly suggested for highly available clusters." Or something like: "For the past X minutes, fsync durations have breached the maximum suggested time by an average of X seconds; faster disk i/o is strongly suggested for highly available clusters." (See the second sketch after this list.)
  • The current log is: plog.Warningf("sync duration of %v, expected less than %v", duration, warnSyncDuration)
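First sketch: a minimal illustration of the proposed gauge in Go, assuming the prometheus/client_golang library. The package name and the observeFsync hook are hypothetical, not actual etcd code:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// walFsyncCurrent holds the most recent WAL fsync duration, matching
// the metric proposed above.
var walFsyncCurrent = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "etcd_disk_wal_fsync_duration_current",
	Help: "The current WAL fsync time; if it is rising, disk slowness may decrease availability",
})

func init() {
	prometheus.MustRegister(walFsyncCurrent)
}

// observeFsync would be called after each WAL fsync completes.
func observeFsync(took time.Duration) {
	walFsyncCurrent.Set(took.Seconds())
}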
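Second sketch: a minimal illustration of the rolling-average log, again hypothetical; fsyncWindow and its methods are invented names:

package wal

import "time"

// fsyncWindow keeps the last N fsync durations in a ring buffer.
type fsyncWindow struct {
	durations []time.Duration
	next      int
	filled    bool
}

func newFsyncWindow(n int) *fsyncWindow {
	return &fsyncWindow{durations: make([]time.Duration, n)}
}

// observe records one fsync duration, overwriting the oldest entry
// once the window is full.
func (w *fsyncWindow) observe(d time.Duration) {
	w.durations[w.next] = d
	w.next++
	if w.next == len(w.durations) {
		w.next = 0
		w.filled = true
	}
}

// average returns the mean duration over the entries observed so far.
func (w *fsyncWindow) average() time.Duration {
	n := w.next
	if w.filled {
		n = len(w.durations)
	}
	if n == 0 {
		return 0
	}
	var sum time.Duration
	for _, d := range w.durations[:n] {
		sum += d
	}
	return sum / time.Duration(n)
}

At the call site, something in the spirit of the existing plog.Warningf line above would then compare average() against warnSyncDuration and report the window average, rather than warning on every single slow observation.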

All 13 comments

/cc @fabxc any opinions? thanks in advance!

+1 to this. In my experience, this is the single most important thing to monitor: slow or variable i/o will cause leader losses for sure. If I could give every etcd cluster NVMe- or SSD-backed storage I would, but cloud environments can have unpredictable storage performance. A gauge providing the latest fsync duration, scrapable by a time-series metrics collector, would be valuable for real-time monitoring and health readouts.

there is already a pretty fine-grained bucketed fsync duration metric. i do not understand why you want a metric for just the "latest" one. what additional information would that give you? what additional actions could you take based on it?

i operate etcd on both fast and slow disks. i have never felt the bucketed metrics were not enough.

Generally I wouldn't agree that warning conditions shouldn't be captured by metrics. If something is sufficiently relevant to application health, as it obviously is here, a metric makes sense.

Can you elaborate on why you consider the histogram, which has well-chosen bucket sizes in general, useless?
It generally seems to capture this pretty well. When querying the histogram metric in your TSDB, you can pick a shorter or longer window depending on how fast you want to detect increases in latency.

Nothing speaking against logs, of course. But the one you showed could also be expressed with the histogram metric you already have. Here it is as a PromQL query (though any capable query engine against a TSDB should be able to express this):

instance:etcd_disk_wal_duration_seconds:avg10m =
    increase(etcd_disk_wal_fsync_duration_seconds_sum[10m]) /
    increase(etcd_disk_wal_fsync_duration_seconds_count[10m])

Obviously this puts a lot more power into the users' hands than a static computation compiled into etcd itself. Choosing to calculate the average here versus a quantile practically won't make much of a difference, though quantiles are better at detecting slow outliers.
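For example (my sketch, not from the thread), the outlier-sensitive variant of the same query would use the standard histogram_quantile function over the existing bucket series:

# Estimated 99th-percentile WAL fsync latency over the last 10 minutes.
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m]))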

To take this a step further, you could use the recording rule above for your alerting, like

predict_linear(instance:etcd_disk_wal_duration_seconds:avg10m[10m], 30 * 60) > 0.5

to get notified if the average fsync latency has been ramping up continuously over the last 10m such that it would exceed 500ms within 30 minutes.
This seems rather powerful to me. Certainly more powerful than a single log line or a metric representing only the most recent latency observation; the latter is obviously prone to triggering alerts on a single outlier.

I might not have fully understood the use case. If so, can you describe an example of what you want to detect and how the histogram (plus a capable monitoring system obviously) does not suffice?

The difference users need to see, and alert against, may be finer-grained changes in the .1 to 5 second range.

Histogramming the ranges from 2, 4, 8, 16s and beyond doesn't help with this.

Although it's up to the user to decide, after all I think pretty much everyone agrees that fsync > 1 second means you're in BIG trouble for any real cluster.

This made it sound like we want to alert on anything > 1 second, really. So why is high resolution needed beyond that?
I'm just concerned that this is a theoretical problem. The histogram is pretty high-res as it is and should give pretty accurate estimates.

  • low-resolution fsync alerting is valuable
  • high-resolution fsync diagnostics are more valuable
  • having both is priceless, and the second bullet gives you the first 'for free'.

To be honest, if fsync usually takes more than 1s, etcd is almost unusable (we document this in the hardware recommendations). I do not think an fsync bucket > 1 second with high resolution is ever needed from etcd's point of view. if you experience fsync > 1s, you need to continue debugging at the system level, not at the etcd level.

in short, etcd metrics are really for monitoring application performance and reporting issues when resources are not enough for etcd. do not use etcd metrics to debug os issues.

I do not think an fsync bucket > 1 second with high resolution is ever needed from etcd's point of view.

  • Absolutely; the resolution we care about here is from 0 to somewhere around 2-5s. Some stats on massive fsyncs might be interesting, but they aren't really necessary. More bins would be great, and we'd love to have them, but if we really want to minimize buckets and avoid gauges, let's pick smaller ones.
  • .1, .2, .4, .8, 1.6, 2.0, 4.0, 16.0, 32.0 would be more useful and would be a minimal change. It's even OK if you really want to prune everything > 1.6. (See the sketch after this list.)
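To make the proposal concrete, here is a sketch of that bucket layout with prometheus/client_golang; the variable name and help text are hypothetical. For comparison, the exponential 0.001 -> 8.192 doubling shown earlier corresponds to prometheus.ExponentialBuckets(0.001, 2, 14):

package wal

import "github.com/prometheus/client_golang/prometheus"

// walFsyncDuration sketches the WAL fsync histogram with the coarser,
// wider bucket bounds proposed in the bullet above.
var walFsyncDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "etcd",
	Subsystem: "disk",
	Name:      "wal_fsync_duration_seconds",
	Help:      "The latency distribution of WAL fsync calls.",
	Buckets:   []float64{0.1, 0.2, 0.4, 0.8, 1.6, 2.0, 4.0, 16.0, 32.0},
})

func init() {
	prometheus.MustRegister(walFsyncDuration)
}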

Please note that there are many solutions that would be acceptable and useful to us; we're not coupled to anything other than improving the information content here, in some form that is more granular than what we have now.

do not use etcd metrics to debug os issues.

agree, there is a fine line between "monitoring fsync changes" and "monitoring filesystem performance", and admittedly a very high correlation... but the 'wal fsync' is, and always will be, an operation which is measured and executed by etcd, not by the OS. and as important as it is, more explicit metrics and logs here are really very important to us.

@heyitsanthony @xiang90 @fabxc We'd like to schedule a Google Hangout to discuss this and see if maybe we are talking past each other. There might be an idea that we can all agree on here. How does tomorrow at noon EST sound? (That's 9AM PST.) Maybe too early? Not sure where you guys are located.

@jayunit100 is the metrics output you're suggesting something like this:

etcd_disk_wal_fsync_duration_seconds_bucket{le="0.1"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.2"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.4"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.8"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.6"} 3

instead of what's there now:

etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 2
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 3
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 3

?

@heyitsanthony We're open to anything :), there are lots of ways to provide more and better fsync info.

  • As you mention above, .1, .2, .4, .8, 1.6 is more useful than .1, .2, .5, ... in that it provides one more data point. It's super minimal, but it's a step forward. My thinking above was based on the generic 100-millisecond guidelines; the difference between a .001 and a .004 fsync is not useful to anyone, I assume.
  • A lower window, but the same or slightly more buckets, with higher precision for the critical ranges (.1, .2, .3, .4, ..., 1.6).
  • The same window, and more buckets (.1, .2, .3, ..., 4.0); not really the best idea, but it would be useful.
  • A gauge (probably ideal for us).
  • A log that reports average fsyncs over windows, with a more informative error message.

I think our ideal might be a gauge, since it is easy to monitor and doesn't reduce the resolution of the information, but I think you guys have better intuition about the best way forward. There are subtleties around what's better or worse, but I think we can all agree that there might be something better than the current situation. I hesitate to pick any of these as "what we want" because all we really want is more information, and a second look at how best to make fsync easy to tune, monitor, and flag in complex situations.

i am going to close this one. based on our debugging experience, what we have today is enough for slow fsync alerting, and people feel comfortable debugging slow io with os-level tools (for example, tools built atop eBPF).
