Please read https://github.com/etcd-io/etcd/blob/master/Documentation/reporting_bugs.md.
My etcd version is 3.3.10.
backend_commit_duration_seconds looks like this:
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.11571199999999937
{endpoint="metrics",instance="172.22.66.117:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.08104000000000033
{endpoint="metrics",instance="172.22.66.118:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.10784000000000088
wal_fsync_duration_seconds looks like this:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.0766933333333334
{endpoint="metrics",instance="172.22.66.117:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.04398222222222232
{endpoint="metrics",instance="172.22.66.118:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.04272000000000035
Is the disk too slow, causing this warning?
The warning log looks like this:
2019-06-25 08:47:33.191089 W | etcdserver: read-only range request "key:\"/registry/leases/kube-node-lease/m725-c115-26-k8s-pf1-elk01\" " with result "range_response_count:1 size:368" took too long (321.997639ms) to execute
2019-06-25 08:47:36.123405 W | etcdserver: read-only range request "key:\"/registry/priorityclasses\" range_end:\"/registry/priorityclasset\" count_only:true " with result "range_response_count:0 size:9" took too long (130.640623ms) to execute
2019-06-25 08:47:36.123845 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:457" took too long (114.454452ms) to execute
2019-06-25 08:47:41.924197 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:457" took too long (366.110214ms) to execute
2019-06-25 08:47:44.752364 W | etcdserver: read-only range request "key:\"/registry/configmaps/ingress-nginx/ingress-controller-leader-nginx\" " with result "range_response_count:1 size:463" took too long (130.214951ms) to execute
@jingyih
@boxuan666 Thanks for reporting!
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.11571199999999937
Does this mean the P99 duration is 115.7ms? If so, it is too slow.
I am not super familiar with the histogram_quantile expression; maybe we can double-check the P99 duration by searching for the keyword "backend_commit_duration_seconds" in the metrics endpoint output. This way you will find the raw data, something like this:
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 123
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 124
...
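If it helps, one way to dump that output (assuming the default client port and no TLS on the metrics endpoint; adjust the URL for your setup) is:
curl -s http://172.22.66.116:2379/metrics | grep etcd_disk_backend_commit_duration_seconds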
@jingyih
Hi bro
The information is as follows:
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 4
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 484209
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 1.0856757e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 1.1494004e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 1.1693697e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 1.1750741e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 1.179604e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 1.1844565e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 1.1864375e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 1.1867404e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 1.1867526e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 1.186753e+07
The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms. Backend commit duration is usually the most dominant part in the duration of apply.
From the metrics you just posted, it looks like the P99 is in the order of 32 ms. But anything in 128 ms bucket and above will definitely trigger a warning, which in your case there are 20k or so. So it is expected that you will see a lot of the warnings in the server log. I think you should use a faster disk.
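If you want to double-check both numbers yourself, here is a minimal Go sketch over the cumulative bucket counts you posted (the counts are hard-coded from your output above; nothing here is etcd-specific):

package main

import "fmt"

// bucket is one cumulative histogram bucket from the metrics output above.
type bucket struct {
	le    float64 // upper bound in seconds
	count float64 // observations taking <= le seconds
}

func main() {
	// Cumulative counts copied from etcd_disk_backend_commit_duration_seconds_bucket.
	buckets := []bucket{
		{0.001, 4}, {0.002, 484209}, {0.004, 10856757}, {0.008, 11494004},
		{0.016, 11693697}, {0.032, 11750741}, {0.064, 11796040},
		{0.128, 11844565}, {0.256, 11864375}, {0.512, 11867404},
		{1.024, 11867526}, {2.048, 11867530},
	}
	total := 11867530.0 // the +Inf bucket: all observations

	// P99: the first bucket whose cumulative count reaches 99% of all samples.
	rank := 0.99 * total // ~11,748,855
	for _, b := range buckets {
		if b.count >= rank {
			fmt.Printf("P99 is at most %vs\n", b.le) // prints: P99 is at most 0.032s
			break
		}
	}

	// Commits slower than 128 ms are guaranteed to exceed the 100 ms threshold.
	fmt.Printf("commits over 128 ms: %.0f\n", total-11844565) // prints: 22965
}

So roughly 23k commits were slow enough to be guaranteed over the threshold, which matches the volume of warnings in your log.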
Installing etcd 3.3.13 or later solves the problem; I have verified it.
It has no relationship with the disk or network.
Installing etcd 3.3.13 or later solves the problem; I have verified it.
It has no relationship with the disk or network.
I want to know why.
The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms
How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .
(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)
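For context, in 3.3 this value is a hard-coded constant in the server source rather than a flag. Paraphrasing from memory of etcdserver/util.go (treat the exact name and file as an assumption to verify against your release), it looks something like:

// etcdserver: apply/range operations slower than this are logged as warnings.
const warnApplyDuration = 100 * time.Millisecond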
The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms
How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are _full_ of these "warnings", hiding real errors. I have already increased
--heartbeat-intervaland--election-timeout.(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)
Starting from 3.4, the log level is configurable, which means you can choose to log only levels higher than warning.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--log-level
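For example (flag name and accepted values are per the configuration doc linked above; note this hides all warnings, not just this one):

etcd --log-level=error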
Starting from 3.4, the log level is configurable
Thanks for the workaround. To be clear: I think you're saying I need to disable all warnings because the 100ms warning threshold is hard-coded. The various examples above indicate that 100ms is not an outlier for some deployments, and generally it seems impossible to mandate a fixed response SLA on behalf of all users.
I suggest we turn this issue into a feature request: "too-long warning threshold should be configurable" (or perhaps even just remove this warning and rely on metrics?)
I suggest we turn this issue into a feature request: "too-long warning threshold should be configurable" (or perhaps even just remove this warning and rely on metrics?)
Opened a PR to introduce a configurable threshold.
I am having exactly the same issue with Rancher 2.4.3.
I am running Prometheus with lots of alert rules, which creates a large ConfigMap.
Every time I try to delete some Prometheus alert rules, etcd gets stuck.
I tried an EBS-optimized instance with provisioned IOPS; it did not help.