After rebooting a node (power cycle), the OSD pod on that node failed to activate its OSD
Expected behavior:
OSD pods should recover and rejoin the Ceph cluster
How to reproduce it (minimal and precise):
Brought up a 4-node cluster; each node has one 100GB virtual disk for Rook/Ceph.
Brought up Rook; everything came up as expected and all 4 disks were joined to the Ceph cluster.
Power cycled one node
When the node came back up, the OSD pod failed to activate the OSD. The OSD pod logs are:
kubectl logs -n rook-ceph rook-ceph-osd-1-6b777b487c-896pv
2019-06-12 13:32:20.233762 I | rookcmd: starting Rook v1.0.2 with arguments '/rook/rook ceph osd start -- --foreground --id 1 --osd-uuid c61860d2-8879-4de9-ae4b-3cd8ebdbc9c4 --conf /var/lib/rook/osd1/rook-ceph.config --cluster ceph --default-log-to-file false'
2019-06-12 13:32:20.233896 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=INFO, --osd-id=1, --osd-store-type=bluestore, --osd-uuid=c61860d2-8879-4de9-ae4b-3cd8ebdbc9c4
2019-06-12 13:32:20.233901 I | op-mon: parsing mon endpoints:
2019-06-12 13:32:20.233905 W | op-mon: ignoring invalid monitor
2019-06-12 13:32:20.234090 I | exec: Running command: stdbuf -oL ceph-volume lvm activate --no-systemd --bluestore 1 c61860d2-8879-4de9-ae4b-3cd8ebdbc9c4
2019-06-12 13:32:20.965452 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
2019-06-12 13:32:21.541785 I | Running command: /usr/sbin/restorecon /var/lib/ceph/osd/ceph-1
2019-06-12 13:32:22.112579 I | Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
2019-06-12 13:32:22.659205 I | Running command: /bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074 --path /var/lib/ceph/osd/ceph-1 --no-mon-config
2019-06-12 13:32:23.269238 I | stderr: failed to read label for
2019-06-12 13:32:23.269257 I | stderr: /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074
2019-06-12 13:32:23.269558 I | stderr: :
2019-06-12 13:32:23.269696 I | stderr: (2) No such file or directory
2019-06-12 13:32:23.270109 I | stderr:
2019-06-12 13:32:23.270290 I | stderr: 2019-06-12 13:32:23.265 7fe110064f00 -1 bluestore(/dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074) _read_bdev_label failed to open /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074: (2) No such file or directory
2019-06-12 13:32:23.273615 I | --> RuntimeError: command returned non-zero exit status: 1
failed to activate osd. Failed to complete '': exit status 1.
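For what it's worth, the failing step can be checked by hand on the affected node: the device path in the error above should simply not exist, and device-mapper should show no mapping for the OSD logical volume. These are generic checks, not something Rook runs itself:
# path copied from the ceph-bluestore-tool error above
ls -l /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074
# list active device-mapper targets; on an affected node no ceph LV mapping is expected
dmsetup ls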
Looking at the lsblk output before rebooting the node:
sms-04:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 30G 0 disk
├─sda1 8:1 0 8M 0 part
└─sda2 8:2 0 30G 0 part /
sdb 8:16 0 100G 0 disk
└─ceph--6ddfdaeb--402b--4fc0--8be3--39378cf4baea-osd--data--a07888e5--1e91--4230--982d--4a29378df074
After rebooting:
sms-04:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 30G 0 disk
├─sda1 8:1 0 8M 0 part
└─sda2 8:2 0 30G 0 part /
sdb 8:16 0 100G 0 disk
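Note that the LVM metadata on sdb should still be intact after the reboot; what is missing is the activated logical volume, because nothing on the host brings the volume group up at boot. With the lvm2 tools available on the node, the volume group can be scanned and activated by hand and the LV should reappear under sdb; something along these lines (generic LVM commands, not Rook-specific):
pvs            # the physical volume on /dev/sdb should still be listed
vgs            # the ceph-6ddfdaeb-... volume group should still be known
vgchange -ay   # activate all volume groups, recreating the LV device nodes
lsblk          # the osd-data-... LV should reappear under sdb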
Environment:
The lvm2 package was not installed on my systems. This is documented in the pre-requisites section at https://rook.io/docs/rook/v1.0/k8s-pre-reqs.html#lvm-package. Once I installed lvm2, then the osd pods came back after a reboot.
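For reference, the prompts above (sms-04:~ #) suggest a SUSE-based host, where installing the package is a one-liner; other distributions differ only in the package manager:
# SUSE / openSUSE
zypper install -y lvm2
# Debian / Ubuntu
apt-get install -y lvm2
# RHEL / CentOS
yum install -y lvm2
After installing lvm2 on every node that hosts an OSD and rebooting, the OSD pods activated normally again, as noted above.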