After rebooting a node (power cycle), the OSD pod on that node failed to activate its OSD
Expected behavior:
OSD pods should recover and rejoin the Ceph cluster
How to reproduce it (minimal and precise):
Brought up a 4-node cluster; each node has one 100GB virtual disk for Rook/Ceph.
Brought up Rook; everything came up as expected and all 4 disks were joined to the Ceph cluster.
Power cycled one node
When the node came back up, the OSD pod failed to activate the OSD. The OSD pod logs are:
kubectl logs -n rook-ceph rook-ceph-osd-1-6b777b487c-896pv
2019-06-12 13:32:20.233762 I | rookcmd: starting Rook v1.0.2 with arguments '/rook/rook ceph osd start -- --foreground --id 1 --osd-uuid c61860d2-8879-4de9-ae4b-3cd8ebdbc9c4 --conf /var/lib/rook/osd1/rook-ceph.config --cluster ceph --default-log-to-file false'
2019-06-12 13:32:20.233896 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=INFO, --osd-id=1, --osd-store-type=bluestore, --osd-uuid=c61860d2-8879-4de9-ae4b-3cd8ebdbc9c4
2019-06-12 13:32:20.233901 I | op-mon: parsing mon endpoints:
2019-06-12 13:32:20.233905 W | op-mon: ignoring invalid monitor
2019-06-12 13:32:20.234090 I | exec: Running command: stdbuf -oL ceph-volume lvm activate --no-systemd --bluestore 1 c61860d2-8879-4de9-ae4b-3cd8ebdbc9c4
2019-06-12 13:32:20.965452 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
2019-06-12 13:32:21.541785 I | Running command: /usr/sbin/restorecon /var/lib/ceph/osd/ceph-1
2019-06-12 13:32:22.112579 I | Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
2019-06-12 13:32:22.659205 I | Running command: /bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074 --path /var/lib/ceph/osd/ceph-1 --no-mon-config
2019-06-12 13:32:23.269238 I | stderr: failed to read label for
2019-06-12 13:32:23.269257 I | stderr: /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074
2019-06-12 13:32:23.269558 I | stderr: :
2019-06-12 13:32:23.269696 I | stderr: (2) No such file or directory
2019-06-12 13:32:23.270109 I | stderr:
2019-06-12 13:32:23.270290 I | stderr: 2019-06-12 13:32:23.265 7fe110064f00 -1 bluestore(/dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074) _read_bdev_label failed to open /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074: (2) No such file or directory
2019-06-12 13:32:23.273615 I | --> RuntimeError: command returned non-zero exit status: 1
failed to activate osd. Failed to complete '': exit status 1.
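For what it's worth, the failing step can be checked by hand on the affected node: the device path in the error above should simply not exist, and device-mapper should show no mapping for the OSD logical volume. These are generic checks, not something Rook runs itself:
# path copied from the ceph-bluestore-tool error above
ls -l /dev/ceph-6ddfdaeb-402b-4fc0-8be3-39378cf4baea/osd-data-a07888e5-1e91-4230-982d-4a29378df074
# list active device-mapper targets; on an affected node no ceph LV mapping is expected
dmsetup ls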
Looking at the lsblk output before rebooting the node:
sms-04:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 30G 0 disk
├─sda1 8:1 0 8M 0 part
└─sda2 8:2 0 30G 0 part /
sdb 8:16 0 100G 0 disk
└─ceph--6ddfdaeb--402b--4fc0--8be3--39378cf4baea-osd--data--a07888e5--1e91--4230--982d--4a29378df074
After rebooting:
sms-04:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 30G 0 disk
├─sda1 8:1 0 8M 0 part
└─sda2 8:2 0 30G 0 part /
sdb 8:16 0 100G 0 disk
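Note that the LVM metadata on sdb should still be intact after the reboot; what is missing is the activated logical volume, because nothing on the host brings the volume group up at boot. With the lvm2 tools available on the node, the volume group can be scanned and activated by hand and the LV should reappear under sdb; something along these lines (generic LVM commands, not Rook-specific):
pvs            # the physical volume on /dev/sdb should still be listed
vgs            # the ceph-6ddfdaeb-... volume group should still be known
vgchange -ay   # activate all volume groups, recreating the LV device nodes
lsblk          # the osd-data-... LV should reappear under sdb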
Environment:
The lvm2 package was not installed on my systems. This is documented in the pre-requisites section at https://rook.io/docs/rook/v1.0/k8s-pre-reqs.html#lvm-package. Once I installed lvm2, then the osd pods came back after a reboot.
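For reference, the prompts above (sms-04:~ #) suggest a SUSE-based host, where installing the package is a one-liner; other distributions differ only in the package manager:
# SUSE / openSUSE
zypper install -y lvm2
# Debian / Ubuntu
apt-get install -y lvm2
# RHEL / CentOS
yum install -y lvm2
After installing lvm2 on every node that hosts an OSD and rebooting, the OSD pods activated normally again, as noted above.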