I am running a three-node etcd cluster. When I insert a new key-value pair into the store, I see the following sequence of system calls on the server:
```
1 creat("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
2 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
3 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
4 rename("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp", "/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
5 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
6 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
7 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
8 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
9 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
==========Data is durable here -- Ack the user====================
```
For the 4th operation (the rename of wal.tmp to wal) to be persisted to disk, the parent directory has to be fsync'd. Without that fsync, a crash just after acknowledging the user can result in data loss: on some file systems the rename is not issued to disk immediately and can be reordered with later operations. In that case, on recovery the server would see the file wal.tmp but not wal. On seeing this, I believe etcd simply unlinks the tmp file and can therefore lose the user's data. If this happens on two nodes of a three-node cluster, a global data loss is possible. We have reproduced this particular data-loss issue using our testing framework. As a fix, it would be safe to fsync the parent directory on creat or rename of files.
@heyitsanthony This seems like a bug. I marked it as P2 since it is unlikely to happen in practice, but we should fix it soon. @ramanala Can you say more about your testing framework? We are VERY interested in automated testing to ensure etcd works reliably.
@xiang90 -- we have a testing framework that can test distributed storage systems for problems like data loss and unavailability in the presence of correlated crashes (i.e., all servers crashing together, with file-system-level problems like the one above on some of them). We have a couple more issues in etcd where the cluster can become unavailable; I will file those issues soon.
Thanks for your interest! This is a research tool, and the related paper will appear at OSDI this year. We will also make the tool publicly available. I will update you with more information about it in a few days.
Thank you @heyitsanthony and @xiang90 !
Hi @ramanala, I am interested in knowing more about your testing framework. Is it available publicly now?