Etcd: Slow disk read/write speed causes "etcdserver: publish error: etcdserver: request timed out"

Created on 4 Sep 2018  ·  7 Comments  ·  Source: etcd-io/etcd

Please read https://github.com/etcd-io/etcd/blob/master/Documentation/reporting_bugs.md.
The etcd data is stored on an HDD (mechanical disk drive) instead of an SSD. Once the data size exceeds 1 GB, requests appear to time out because data synchronization is slow.
Relevant code:

etcdserver/server.go
req := pb.Request{
    Method: "PUT",
    Path:   membership.MemberAttributesStorePath(s.id),
    Val:    string(b),
}
ctx, cancel := context.WithTimeout(s.ctx, timeout)
_, err := s.Do(ctx, req)

There is a mechanism that triggers a client request timeout.
The timeout threshold is given by:

// ReqTimeout returns timeout for request to finish.
func (c *ServerConfig) ReqTimeout() time.Duration {
    // 5s for queue waiting, computation and disk IO delay
    // + 2 * election timeout for possible leader election
    return 5*time.Second + 2*time.Duration(c.ElectionTicks*int(c.TickMs))*time.Millisecond
}

Is this duration too short for operating etcd with HDD? Can we make it configurable?

All 7 comments

@Tutotu I translated your issue into English in the hope of resolving your problem quickly, and so that it helps others who may find this issue in the future. I hope you do not take offense.

To answer your question: per the FAQ, an SSD is highly recommended, and the HDD is most likely the root of your issue. Adjusting timeouts will only mask this problem. https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#system-requirements

To confirm, please post the output of the backend_commit_duration and wal_fsync_duration metrics.

curl -s http://127.0.0.1:2379/metrics | \
    grep -E backend_commit_duration\|wal_fsync_duration

I am sorry, my English is not good. Thank you for your help.
These are my backend_commit_duration and wal_fsync_duration metrics:

```
# curl -s http://127.0.0.1:4012/metrics | grep -E backend_commit_duration\|wal_fsync_duration
# HELP etcd_disk_backend_commit_duration_seconds The latency distributions of commit called by backend.
# TYPE etcd_disk_backend_commit_duration_seconds histogram
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 150
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 10489
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 180031
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 296334
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 297034
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 297073
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 297129
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 297238
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 297284
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 297284
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 297284
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 297284
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 297284
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 297284
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 297284
etcd_disk_backend_commit_duration_seconds_sum 1138.3594216329889
etcd_disk_backend_commit_duration_seconds_count 297284
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 840388
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 1.180019e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 1.38423e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1.387787e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 1.387899e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 1.387951e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 1.38803e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 1.388078e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1.388081e+06
etcd_disk_wal_fsync_duration_seconds_sum 1366.5636633669471
etcd_disk_wal_fsync_duration_seconds_count 1.388081e+06
```

@jingyih

@Tutotu I also edited your post a little, in the hope of making it clearer to other contributors in the community. Hope you don't mind :)

@jingyih thanks, that actually helped me a lot. My Google translation was probably quite sloppy.

Is this duration too short for operating etcd with HDD? Can we make it configurable?

The metrics you posted don't look that bad, but etcd is very I/O intensive and an HDD will eventually become a bottleneck. In general, I feel that allowing the timeouts to be adjusted to accommodate slower disks would undermine the stability of the cluster. As your data grows and I/O demands increase, the problems would become unmanageable, eventually leading to election failures and loss of quorum. Please upgrade to an SSD.

@jingyih thank you very much

@hexfusion
By increasing the timeout, I no longer see the "etcdserver: publish error: etcdserver: request timed out" error, so the cause must be slow disk reads and writes. If this timeout cannot be made configurable, my only option is to switch to an SSD.
