Charts: [stable/percona-xtradb-cluster] re-creation of a cluster with existing persistent volumes fails

Created on 26 Mar 2018 · 7 comments · Source: helm/charts

Is this a request for help?: No


Is this a BUG REPORT or FEATURE REQUEST? (choose one): FEATURE REQUEST

Version of Helm and Kubernetes:

Kubernetes v1.9.2
Helm v2.8.2

Which chart:

stable/percona-xtradb-cluster

What happened:

1. Install the chart, using persistent storage.
2. Write some data to the database.
3. Delete the StatefulSet.
4. Recreate the StatefulSet.
5. The first pod enters a CrashLoop due to Galera's safe-to-bootstrap protection.
6. Manually edit the grastate.dat file within the volume mounted by the first node, and set safe_to_bootstrap to 1.
7. The StatefulSet then gets created properly.
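For context, grastate.dat lives in the MySQL data directory of each node's volume (/var/lib/mysql/grastate.dat) and looks roughly like this (the uuid and seqno values here are illustrative):

```
# GALERA saved state
version: 2.1
uuid:    9acf4d34-1234-11ee-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
```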

What you expected to happen:
Given that Kubernetes StatefulSets provide ordering guarantees for pod deletion, I would expect that on a re-create, the StatefulSet should be able to start properly and bootstrap from the first pod.

How to reproduce it (as minimally and precisely as possible):
See above.

Anything else we need to know:



All 7 comments

cc @stephenlawrence

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

This issue is being automatically closed due to inactivity.

Hi @skriss, I'm facing the exact same issue.

  • I created a Percona cluster
  • wrote some data to it
  • deleted the cluster
  • recreated the cluster using the same settings
  • the first pod fails with: percona-cluster-pxc-0 2/3 CrashLoopBackOff

I didn't understand your solution. Where does the grastate.dat file exist, and how can I edit it?

I'm using this job to edit grastate.dat. The PXC cluster should be stopped first, and the claimName should be changed to match your release's PVC.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: safe-to-bootstrap
spec:
  template:
    spec:
      volumes:
      - name: mysql-data
        persistentVolumeClaim:
          claimName: mysql-data-pxc-0   # adjust to your release's PVC name
      containers:
      - name: safe-to-bootstrap
        image: busybox
        imagePullPolicy: IfNotPresent
        # Flip safe_to_bootstrap to 1 in the first node's grastate.dat
        command:
          - sed
          - -i
          - "s|safe_to_bootstrap.*:.*|safe_to_bootstrap:1|1"
          - /var/lib/mysql/grastate.dat
        volumeMounts:
        - mountPath: /var/lib/mysql
          name: mysql-data
      restartPolicy: OnFailure
```
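As a sanity check, the sed expression from a job like this can be tried locally against a sample grastate.dat (the file contents below are illustrative, not copied from a real cluster):

```shell
# Create a sample grastate.dat resembling what PXC writes on shutdown
cat > grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    9acf4d34-1234-11ee-0000-000000000000
seqno:   42
safe_to_bootstrap: 0
EOF

# The same substitution the job runs against the pod's volume
sed -i "s|safe_to_bootstrap.*:.*|safe_to_bootstrap:1|1" grastate.dat

grep safe_to_bootstrap grastate.dat   # -> safe_to_bootstrap:1
```

Note the trailing `1` flag tells sed to replace only the first match on the line; `-i` (in-place editing) is a GNU/busybox extension, which is fine inside the busybox image.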

I think this is still an issue. What if the first node's grastate.dat contains seqno: -1, and is thus not the best node to bootstrap from because it doesn't hold the latest version of the database? What if the third node in the StatefulSet holds the latest data? You might be able to bootstrap the cluster from the third node if you change the pod management policy to Parallel. I don't think just running the job above solves the issue, because it carries a risk of data loss.
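One way to avoid bootstrapping from a stale node is to compare the seqno recorded in each volume's grastate.dat and pick the highest. A rough sketch, where the local directories pxc-0/1/2 and their seqno values stand in for the three PVCs (which in practice would each need to be mounted, e.g. by a job like the one above):

```shell
# Sample grastate.dat files, one per node volume (illustrative values)
mkdir -p pxc-0 pxc-1 pxc-2
printf 'seqno:   -1\n' > pxc-0/grastate.dat
printf 'seqno:   17\n' > pxc-1/grastate.dat
printf 'seqno:   23\n' > pxc-2/grastate.dat

# Pick the node whose grastate.dat records the highest seqno
best=$(for d in pxc-0 pxc-1 pxc-2; do
  seqno=$(awk '/^seqno:/ {print $2}' "$d/grastate.dat")
  echo "$seqno $d"
done | sort -n | tail -1 | cut -d' ' -f2)

echo "bootstrap from: $best"   # -> bootstrap from: pxc-2
```

A node with seqno: -1 shut down uncleanly, so its position is unknown; a numeric comparison like this only tells you which clean shutdown is most recent.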

I have the same problem. It fails our failover test.
