Etcd: "has already been bootstrapped" when re-provisioning one of the machines

Created on 20 May 2016 · 8 comments · Source: etcd-io/etcd

# /usr/lib64/systemd/system/etcd2.service
[Unit]
Description=etcd2
Conflicts=etcd.service

[Service]
User=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd2
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd2
Restart=always
RestartSec=10s
LimitNOFILE=40000
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/etcd2.service.d/40-etcd-cluster.conf
[Service]
Environment="ETCD_NAME=node1"
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://123.12.12.12:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://123.12.12.12:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://123.12.12.12:2380"
Environment="ETCD_INITIAL_CLUSTER=node1=http://123.12.12.12:2380,node2=http://172.15.0.22:2380,
Environment="ETCD_STRICT_RECONFIG_CHECK=true"
May 20 17:57:39 localhost etcd2[761]: listening for peers on http://123.12.12.12:2380
May 20 17:57:39 localhost etcd2[761]: listening for client requests on http://0.0.0.0:2379
May 20 17:57:39 localhost etcd2[761]: stopping listening for client requests on http://0.0.0.0
May 20 17:57:39 localhost etcd2[761]: stopping listening for peers on http://123.12.12.12:2380
May 20 17:57:39 localhost etcd2[761]: member 51eecfc041171d8f has already been bootstrapped
May 20 17:57:39 localhost systemd[1]: etcd2.service: Main process exited, code=exited, status=
May 20 17:57:39 localhost systemd[1]: Failed to start etcd2.

Possibly a duplicate of https://github.com/coreos/etcd/issues/2780.

All 8 comments

According to @dghubble

People often re-provision one of the nodes on bare metal with exactly the same configuration (same IP, same ports), and when the new machine boots, etcd fails to restart with the error member 51eecfc041171d8f has already been bootstrapped.

@xiang90 explained:

You cannot restart a member that was in the cluster without the previous data-dir

But this looks like a frequent use case that we might be able to support?

But this looks like a frequent use case that we might be able to support?

No. We cannot. This is unsafe. Lost data-dir == lost member forever.

I can reproduce this in coreos-baremetal with the etcd cluster example by provisioning a 3-node cluster (working), then re-provisioning a single node (can't join). The re-provisioning can be thought of as swapping in a fresh machine with a new disk and configuring it exactly the same as before.

The previous advice (manually removing the member via the dynamic configuration API) is not very practical for large cluster deployments (often where etcd is simply a subcomponent). I'm interested to know why the cluster token cannot be used to address this? If there really is no way to do this, then we'll have to write scripts to check cluster health and automate the dynamic configuration removal and re-addition.

After some discussion, this boils down to determining if an etcd node is being brought up for the first time or not, even if its configuration is identical. For now, etcd nodes must be provisioned together or manual reconfiguration (remove and re-add) will be needed if the operator deems the state safe. Other solutions require assumptions which etcd cannot safely make wrt consistency in the general case.
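
For concreteness, the manual remove-and-re-add flow looks roughly like the sketch below with the v2 etcdctl, reusing the member ID and URLs from the logs and config above (your values will differ). This is an illustrative sketch, not a supported recovery path:

# From any healthy member: confirm cluster state and drop the dead member.
etcdctl cluster-health
etcdctl member list
etcdctl member remove 51eecfc041171d8f

# Re-add it under its old name and peer URL; etcd assigns a fresh member ID.
etcdctl member add node1 http://123.12.12.12:2380

# On the re-provisioned machine: tell etcd it is joining an existing cluster
# (e.g. via another drop-in) before starting it.
# Environment="ETCD_INITIAL_CLUSTER_STATE=existing"
systemctl start etcd2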

@dghubble

I'm interested to know why the cluster token cannot be used to address this?

The cluster-token exists to keep a user from accidentally ending up with one node that "belongs" to two clusters at the same time. See https://github.com/coreos/etcd/blob/master/Documentation/op-guide/clustering.md#static.

Suppose a machine participates in etcd cluster A. Then the user removes the etcd data on that machine and creates a new etcd cluster B on the same machine. Since the user never removed that machine from cluster A, the machine now belongs to both clusters A and B. Members of cluster A can still send messages to the machine, which actually belongs to cluster B. This confuses raft and a lot of other things. Thus, we provide the cluster-token.

So the general rule is: for each NEW cluster, the user SHOULD assign a unique cluster-token. etcd can then use the cluster-token to reject messages from other clusters.
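
In the drop-in style used earlier in this issue, that rule amounts to a single line per newly created cluster; the file name and token value below are arbitrary examples:

# /etc/systemd/system/etcd2.service.d/41-cluster-token.conf
[Service]
Environment="ETCD_INITIAL_CLUSTER_TOKEN=my-cluster-token-2016-05"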

The case you described actually breaks this rule. A new etcd machine (no previous data, since you re-provisioned it) with the -initial flags set (since you still use the old configuration) would create a new cluster with the same cluster-token (again because you still use the old configuration). To keep that member from creating a new cluster and "belonging" to two clusters at once, etcd detects this situation and shuts the machine down. That is why you see has already been bootstrapped.

The re-provisioning can be thought of as swapping in a fresh machine with a new disk and configuring it exactly the same as before.

I think provisioning by Ignition should only happen once, when users set up the initial environment; it should be a one-shot thing. After that, users should use other tools like Ansible or Puppet to manage their application life cycle, so updating an application's configuration should not involve swapping disks. Users need to re-provision when the machine itself dies or the cluster environment changes. That should happen infrequently and should have a human involved.

In the use case you described to me, it seems that users want to use Ignition to update their applications' configuration and manage their life cycle. That might be OK for stateless applications, but it is a disaster for stateful applications like etcd, Ceph, or Postgres. For Ceph, if you lose the data of a monitor node, you have to recover the right configuration from the quorum; the new node won't simply start up without any data. The same is true for etcd: additional operations are required if users wipe all the data.

I think we should not encourage people to re-provision their machines as a way of managing applications (like k8s, etcd, or Ceph). If re-provisioning is needed, the machine should be treated as a completely new machine, and human intervention should be involved.

/cc @crawford

I think I have the answer to my original question, about why this isn't addressed by an existing mechanism. Thanks.

Building on your response: whether an etcd node should join an existing cluster or consider itself a new cluster is a question of information availability. Either the cluster has sufficient state (for an operator, an early initialization script, or an Ansible-type tool) to correctly determine whether a node should form a new cluster or join an existing one, or it does not. If an operator is expected to make the right choice manually, we can design tools to make it automatically via the same logic. If not, we should stop expecting operators to do this; instead, etcd nodes should be provisioned/re-provisioned _together_ without exception whenever a re-provision is needed.

The discussed recommendation that an external, centralized service decide whether to serve a new etcd node a configuration to form a new cluster or to join an existing one seems insufficient as well. The right choice depends on the _current_ cluster state. For example, if one node is re-provisioned it should be told to re-join, but if its peers are re-provisioned shortly after, the right choice might be for the nodes to establish a new cluster. Races would occur. The source of truth here is etcd itself. (Provisioning also takes time, so one would need to predict the state of the cluster at the moment etcd comes up.) A general-purpose provisioning service would need special knowledge about etcd, which is out of scope. Perhaps an intermediate component could embed such logic to help make these decisions; again, iff sufficient state is available, otherwise this is all moot.
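
To illustrate why this decision logic is racy rather than to recommend it, a hypothetical pre-start hook might look like the following; the peer client URLs and the drop-in path are assumptions for the sake of the example:

#!/bin/sh
# Hypothetical pre-start sketch: ask the surviving peers whether a cluster
# already exists, and pick ETCD_INITIAL_CLUSTER_STATE accordingly.
PEERS="http://123.12.12.12:2379 http://172.15.0.22:2379"
STATE=new
for peer in $PEERS; do
    # /health is served by etcd's client endpoint.
    if curl -fs "$peer/health" >/dev/null; then
        STATE=existing
        break
    fi
done
# Note the race: a peer that is itself being re-provisioned right now answers
# nothing, so two nodes could each independently choose "new".
mkdir -p /etc/systemd/system/etcd2.service.d
cat > /etc/systemd/system/etcd2.service.d/50-cluster-state.conf <<EOF
[Service]
Environment="ETCD_INITIAL_CLUSTER_STATE=$STATE"
EOF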

Concerning how users manage their machines during their lifecycle and when they decide to re-provision, that's up to them, their preferences, and their needs.

I'm fine with just saying that etcd nodes should be provisioned together as a unit or not at all (i.e. configure live instances manually or with some convenience tool). Thanks for the details.

I ran into a similar issue but found an elegant solution (recapped as a shell sketch after this list):

  • etcd version 3.3.0, deployed with 3 nodes in a cluster
  • all the certificates are present on each of the nodes
  • the data-dir folder is stored on an external volume, one per node
  • by mistake, two of the three volumes were deleted
  • even after we recreated the volumes, etcd did not start on the 2 nodes, with the message
    member 9dd0db80yyyyxxxx has already been bootstrapped
  • to fix the issue, we stopped the etcd service on all three nodes
  • we deleted the volumes and re-created them
  • we started the etcd service only on the two affected nodes
  • the service started normally on the 2 nodes:
    2020-04-10 13:57:53.792272 I | embed: listening for client requests on 127.0.0.1:4001
  • only afterwards did we start etcd on the node with the intact volume containing the data-dir
  • the cluster is now running fine
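
For the record, those steps recap roughly as the shell sketch below. The host names (node1 holds the intact volume; node2 and node3 lost theirs), the service name etcd, and the plain data-dir wipe standing in for volume re-creation are all assumptions:

# Stop etcd on all three members first.
for h in node1 node2 node3; do ssh "$h" sudo systemctl stop etcd; done

# On the two affected members only: recreate the (already lost) data volumes.
for h in node2 node3; do
    ssh "$h" 'sudo rm -rf /var/lib/etcd && sudo install -d -o etcd -g etcd /var/lib/etcd'
done

# Start the two affected members first...
ssh node2 sudo systemctl start etcd
ssh node3 sudo systemctl start etcd
# ...and only then the member with the intact data-dir.
ssh node1 sudo systemctl start etcd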