Etcd: ETCD data gets corrupted with error "read wal error (walpb: crc mismatch)

Created on 16 Aug 2016 · 10Comments · Source: etcd-io/etcd

We have a CoreOS VM running on ESXi with etcd running in stand-alone mode. The etcd is used locally on the node for communication between the docker containers. On more than one occasion we have noticed that etcd data gets corrupted with error "read wal error (walpb: crc mismatch) and cannot be repaired". After this, the only recourse is to discard the current etcd data dir and start afresh.

Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: etcd Version: 2.3.2
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: Git SHA: ce63f10
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: Go Version: go1.5.4
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: Go OS/Arch: linux/amd64
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: setting maximum number of CPUs to 12, total number of available CPUs is 12
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: the server is already initialized as member before, starting as etcd member...
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: listening for peers on http://localhost:2380
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: listening for peers on http://localhost:7001
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: listening for client requests on http://172.17.42.1:2379
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: listening for client requests on http://172.17.42.1:4001
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: listening for client requests on http://localhost:2379
Aug 11 07:26:55 hybridmedia2.ucdemolab.com etcd2[1203]: listening for client requests on http://localhost:4001
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: recovered store from snapshot at index 120012 
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: name = 7b8e1b72fe74459097fe70b2a3209aee
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: data dir = /var/lib/etcd2
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: member dir = /var/lib/etcd2/member
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: heartbeat = 100ms
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: election = 1000ms 
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: snapshot count = 10000
Aug 11 07:26:56 hybridmedia2.ucdemolab.com etcd2[1203]: advertise client URLs = http://localhost:2379
Aug 11 07:26:57 hybridmedia2.ucdemolab.com etcd2[1203]: read wal error (walpb: crc mismatch) and cannot be repaired

mfusion@hybridmedia2 ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1068.8.0
VERSION_ID=1068.8.0
BUILD_ID=2016-07-18-0616
PRETTY_NAME="CoreOS 1068.8.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

mfusion@hybridmedia2 ~ $ sudo systemctl cat etcd2
# /usr/lib64/systemd/system/etcd2.service
[Unit]
Description=etcd2
Conflicts=etcd.service

[Service]
User=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd2
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd2
Restart=always
RestartSec=10s
LimitNOFILE=40000
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

# /run/systemd/system/etcd2.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://localhost:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://localhost:2380,http://localhost:7001"
Environment="ETCD_LISTEN_CLIENT_URLS=http://localhost:2379,http://localhost:4001,http://172.17.42.1:2379,http://172.17.42.1:4001"
Environment="ETCD_LISTEN_PEER_URLS=http://localhost:2380,http://localhost:7001"

Source

utyagi24

Most helpful comment

Thanks everyone for looking into this. A few comments - I erroneously said the installation was bare-metal. Actually, the installations where we encountered this issue are VM running on ESXi 5.5. Also on further examination of journal log, I can verify that problem surfaced following a power down without shutdown.
Regarding etcd 3.0, we would definitely consider the upgrade once it is available on CoreOS stable release.

utyagi24 on 17 Aug 2016

❤2

All 10 comments

@utyagi24 So you suspect that the problem is not caused by corrupted data? Do you mind sending the wal files to xiang.[email protected]? I can take a look.

xiang90 on 16 Aug 2016

Sure @xiang90 let me email you the wal file.
Also, I would like to know under what circumstances the wal file gets corrupted ? What should we do to avoid such error in the future ?

utyagi24 on 16 Aug 2016

@utyagi24 CRC mismatch can happen if the there is a bit rot on disk or a file system corruption. It should not happen frequently. Or there might be a bug in etcd WAL package.

xiang90 on 16 Aug 2016

I do not know how to avoid it unless I find the root cause of your issue. We run failure injection cluster 7 * 24 (including hard kill which can cause some wal repairable corruptions). We do not have this problem.

xiang90 on 16 Aug 2016

@utyagi24 I analyzed your WAL files. The tail of the last file has a corrupted 37765 byte record. The record starts at byte offset 0x1988068 and should terminate at 0x19913ec. Running it through od gives:

1990ff0    [non-zero bytes]
1991000    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
*
199afc0    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
199afcf

The trailing 0's indicates there was a torn write. On 2.3 the WAL uses O_APPEND and will increase the file length on every write. Here, the data didn't hit disk but the file length was updated; that should never happen. The filesystem should commit the file length only after the data is persisted. I suspect this corruption is due to a file system bug related to the fallocate path. Is this on ext4? Did these CRC errors show up following a hard reboot?

On a related note, etcd 3.0 doesn't rely on the file length and instead can reason about torn writes, so it isn't affected (3.0's etcd-dump-logs can process the wal but not 2.3).