Etcd: "has already been bootstrapped" when re-provisioning one of the machines

Created on 20 May 2016 · 8 comments · Source: etcd-io/etcd

# /usr/lib64/systemd/system/etcd2.service
[Unit]
Description=etcd2
Conflicts=etcd.service

[Service]
User=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd2
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd2
Restart=always
RestartSec=10s
LimitNOFILE=40000
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/etcd2.service.d/40-etcd-cluster.conf
[Service]
Environment="ETCD_NAME=node1"
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://123.12.12.12:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://123.12.12.12:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://123.12.12.12:2380"
Environment="ETCD_INITIAL_CLUSTER=node1=http://123.12.12.12:2380,node2=http://172.15.0.22:2380,
Environment="ETCD_STRICT_RECONFIG_CHECK=true"
May 20 17:57:39 localhost etcd2[761]: listening for peers on http://123.12.12.12:2380
May 20 17:57:39 localhost etcd2[761]: listening for client requests on http://0.0.0.0:2379
May 20 17:57:39 localhost etcd2[761]: stopping listening for client requests on http://0.0.0.0
May 20 17:57:39 localhost etcd2[761]: stopping listening for peers on http://123.12.12.12:2380
May 20 17:57:39 localhost etcd2[761]: member 51eecfc041171d8f has already been bootstrapped
May 20 17:57:39 localhost systemd[1]: etcd2.service: Main process exited, code=exited, status=
May 20 17:57:39 localhost systemd[1]: Failed to start etcd2.

Possibly a duplicate of https://github.com/coreos/etcd/issues/2780.

All 8 comments

According to @dghubble

People often re-provision one of the nodes on bare metal with exactly the same configuration (same IP, same ports), and when the new machine boots, etcd fails to restart with the error member 51eecfc041171d8f has already been bootstrapped.

@xiang90 explained:

You cannot restart a member that was in the cluster without the previous data-dir

But this looks like a frequent use case that we might be able to support?

But this looks like a frequent use case that we might be able to support?

No. We cannot. This is unsafe. Lost data-dir == lost member forever.

I can reproduce this in coreos-baremetal with the etcd cluster example by provisioning a 3-node cluster (working), then re-provisioning a single node (can't join). The re-provisioning can be thought of as swapping in a fresh machine with a new disk and configuring it exactly the same as before.

The previous advice (manually removing the member via the dynamic configuration API) is not very practical for large cluster deployments (often where etcd is simply a subcomponent). I'm interested to know why the cluster token cannot be used to address this? If there really is no way to do this, then we'll have to write scripts to check cluster health and automate the dynamic configuration removal and re-addition.

After some discussion, this boils down to determining if an etcd node is being brought up for the first time or not, even if its configuration is identical. For now, etcd nodes must be provisioned together or manual reconfiguration (remove and re-add) will be needed if the operator deems the state safe. Other solutions require assumptions which etcd cannot safely make wrt consistency in the general case.
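
For concreteness, the manual remove-and-re-add flow looks roughly like the sketch below with the v2 etcdctl, reusing the member ID and URLs from the logs and config above (your values will differ). This is an illustrative sketch, not a supported recovery path:

# From any healthy member: confirm cluster state and drop the dead member.
etcdctl cluster-health
etcdctl member list
etcdctl member remove 51eecfc041171d8f

# Re-add it under its old name and peer URL; etcd assigns a fresh member ID.
etcdctl member add node1 http://123.12.12.12:2380

# On the re-provisioned machine: tell etcd it is joining an existing cluster
# (e.g. via another drop-in) before starting it.
# Environment="ETCD_INITIAL_CLUSTER_STATE=existing"
systemctl start etcd2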

@dghubble

I'm interested to know why the cluster token cannot be used to address this?

The cluster-token exists to keep a user from accidentally ending up with one node that "belongs" to two clusters at the same time. See https://github.com/coreos/etcd/blob/master/Documentation/op-guide/clustering.md#static.

Suppose a machine participates in etcd cluster A. Then the user removes the etcd data on that machine and creates a new etcd cluster B on the same machine. Since the user never removed that machine from cluster A, the machine now belongs to both clusters A and B. Members of cluster A can still send messages to the machine, which actually belongs to cluster B. This confuses raft and a lot of other things. Thus, we provide the cluster-token.

So the general rule is: for each NEW cluster, the user SHOULD assign a unique cluster-token. etcd can then use the cluster-token to reject messages from other clusters.
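
In the drop-in style used earlier in this issue, that rule amounts to a single line per newly created cluster; the file name and token value below are arbitrary examples:

# /etc/systemd/system/etcd2.service.d/41-cluster-token.conf
[Service]
Environment="ETCD_INITIAL_CLUSTER_TOKEN=my-cluster-token-2016-05"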

The case you described actually breaks this rule. A new etcd machine (no previous data, since you re-provisioned it) with the -initial flags set (since you still use the old configuration) would create a new cluster with the same cluster-token (again because you still use the old configuration). To keep that member from creating a new cluster and "belonging" to two clusters at once, etcd detects this situation and shuts the machine down. That is why you see has already been bootstrapped.

The re-provisioning can be thought of as swapping in a fresh machine with a new disk and configuring it exactly the same as before.

I think provisioning by Ignition should only happen once, when users set up the initial environment; it should be a one-shot thing. After that, users should use other tools like Ansible or Puppet to manage their application life cycle, so updating an application's configuration should not involve swapping disks. Users need to re-provision when the machine itself dies or the cluster environment changes. That should happen infrequently and should have a human involved.

In the use case you described to me, it seems that users want to use Ignition to update their applications' configuration and manage their life cycle. That might be OK for stateless applications, but it is a disaster for stateful applications like etcd, Ceph, or Postgres. For Ceph, if you lose the data of a monitor node, you have to recover the right configuration from the quorum; the new node won't simply start up without any data. The same is true for etcd: additional operations are required if users wipe all the data.

I think we should not encourage people to re-provision their machines as a way of managing applications (like k8s, etcd, or Ceph). If re-provisioning is needed, the machine should be treated as a completely new machine, and human intervention should be involved.

/cc @crawford

I think I have the answer to my original question, about why this isn't addressed by an existing mechanism. Thanks.

Building on your response: whether an etcd node should join an existing cluster or consider itself a new cluster is a question of information availability. Either the cluster has sufficient state (for an operator, an early initialization script, or an Ansible-type tool) to correctly determine whether a node should form a new cluster or join an existing one, or it does not. If an operator is expected to make the right choice manually, we can design tools to make it automatically via the same logic. If not, we should stop expecting operators to do this; instead, etcd nodes should be provisioned/re-provisioned _together_ without exception whenever a re-provision is needed.

The discussed recommendation that an external, centralized service decide whether to serve a new etcd node a configuration to form a new cluster or to join an existing one seems insufficient as well. The right choice depends on the _current_ cluster state. For example, if one node is re-provisioned it should be told to re-join, but if its peers are re-provisioned shortly after, the right choice might be for the nodes to establish a new cluster. Races would occur. The source of truth here is etcd itself. (Provisioning also takes time, so one would need to predict the state of the cluster at the moment etcd comes up.) A general-purpose provisioning service would need special knowledge about etcd, which is out of scope. Perhaps an intermediate component could embed such logic to help make these decisions; again, iff sufficient state is available, otherwise this is all moot.
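
To illustrate why this decision logic is racy rather than to recommend it, a hypothetical pre-start hook might look like the following; the peer client URLs and the drop-in path are assumptions for the sake of the example:

#!/bin/sh
# Hypothetical pre-start sketch: ask the surviving peers whether a cluster
# already exists, and pick ETCD_INITIAL_CLUSTER_STATE accordingly.
PEERS="http://123.12.12.12:2379 http://172.15.0.22:2379"
STATE=new
for peer in $PEERS; do
    # /health is served by etcd's client endpoint.
    if curl -fs "$peer/health" >/dev/null; then
        STATE=existing
        break
    fi
done
# Note the race: a peer that is itself being re-provisioned right now answers
# nothing, so two nodes could each independently choose "new".
mkdir -p /etc/systemd/system/etcd2.service.d
cat > /etc/systemd/system/etcd2.service.d/50-cluster-state.conf <<EOF
[Service]
Environment="ETCD_INITIAL_CLUSTER_STATE=$STATE"
EOF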

Concerning how users manage their machines during their lifecycle and when they decide to re-provision, that's up to them, their preferences, and their needs.

I'm fine with just saying that etcd nodes should be provisioned together as a unit or not at all (i.e. configure live instances manually or with some convenience tool). Thanks for the details.

I ran into a similar issue but found an elegant solution (recapped as a shell sketch after this list):

  • etcd version 3.3.0, deployed with 3 nodes in a cluster
  • all the certificates are present on each of the nodes
  • the data-dir folder is stored on an external volume, one per node
  • by mistake, two of the three volumes were deleted
  • even after we recreated the volumes, etcd did not start on the 2 nodes, with the message
    member 9dd0db80yyyyxxxx has already been bootstrapped
  • to fix the issue, we stopped the etcd service on all three nodes
  • we deleted the volumes and re-created them
  • we started the etcd service only on the two affected nodes
  • the service started normally on the 2 nodes:
    2020-04-10 13:57:53.792272 I | embed: listening for client requests on 127.0.0.1:4001
  • only afterwards did we start etcd on the node with the intact volume containing the data-dir
  • the cluster is now running fine
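
For the record, those steps recap roughly as the shell sketch below. The host names (node1 holds the intact volume; node2 and node3 lost theirs), the service name etcd, and the plain data-dir wipe standing in for volume re-creation are all assumptions:

# Stop etcd on all three members first.
for h in node1 node2 node3; do ssh "$h" sudo systemctl stop etcd; done

# On the two affected members only: recreate the (already lost) data volumes.
for h in node2 node3; do
    ssh "$h" 'sudo rm -rf /var/lib/etcd && sudo install -d -o etcd -g etcd /var/lib/etcd'
done

# Start the two affected members first...
ssh node2 sudo systemctl start etcd
ssh node3 sudo systemctl start etcd
# ...and only then the member with the intact data-dir.
ssh node1 sudo systemctl start etcd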