Is this a bug report or feature request?
Bug Report
Expected behavior:
New OSD Deployments replace the existing OSDs, which run as one ReplicaSet per node.
Deviation from expected behavior:
The new OSD Deployments extended the Ceph cluster with additional OSDs instead of replacing the existing OSDs running in the per-node ReplicaSets.
How to reproduce it (minimal and precise):
Have nodes with a long node name (e.g., ip-192-78-13-122.eu-central-1.compute.internal).
Additional information:
I'm not sure if the issue only happens when metadataDevice is used in the cluster. Confirmed by another user running into this.
Environment:
Rook version (use rook version inside of a Rook Pod): v0.7.x to v0.8.3
Kubernetes version (use kubectl version): seems to be unrelated to K8S version
@galexrt can you post the operator logs?
btw, does it also happen with 0.8.2?
@rootfs The update to v0.8.1 fails because creating the OSD prepare job fails due to the long hostnames.
I didn't test v0.8.2, only upgrading from v0.7.1 to v0.8.3.
The problem seems to be that, because of the change that "truncates" long hostnames, Rook looks up a different ConfigMap than the one that exists. The existing ConfigMap is what records that the disks are already in use; since Rook instead started from a new/"empty" ConfigMap, it wiped the disks.
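To see whether a given node is affected before touching anything, you can compare the old and new ConfigMap names by hand. This is just a sketch, assuming the cluster namespace is rook and that the truncated name is derived from an md5 hash of the node name, as in the workaround below:

# Check one node: does only the old ConfigMap name exist, or also the new one?
nodeName="ip-192-78-13-122.eu-central-1.compute.internal"
oldName="rook-ceph-osd-${nodeName}-config"
newName="rook-ceph-osd-$(echo -n "${nodeName}" | md5sum | cut -d' ' -f1)-config"
# If only ${oldName} exists, the operator would start from an empty config after the upgrade.
kubectl get -n rook cm "${oldName}" "${newName}"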
The fix right now, for existing clusters that have not yet updated (from v0.7 to v0.8.3), is to run a bash snippet:
for nodeName in $(kubectl get nodes -o custom-columns=NAME:.metadata.name --no-headers=true); do
  echo "NODE: ${nodeName}"
  # Copy the existing per-node OSD ConfigMap, renaming it to the new
  # md5-hashed name that v0.8.3 expects.
  kubectl get -n rook cm "rook-ceph-osd-${nodeName}-config" -o yaml \
    | sed "s/rook-ceph-osd-${nodeName}-config/rook-ceph-osd-$(echo -n "${nodeName}" | md5sum | cut -d' ' -f1)-config/g" \
    > "/tmp/${nodeName}.yaml"
  kubectl create -f "/tmp/${nodeName}.yaml"
done
This creates the ConfigMaps under the new truncated names; after that, updating to v0.8.3 seems to work fine in my tests.
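Before starting the upgrade you can double-check that the renamed copies exist next to the originals, e.g. (again assuming the rook namespace):

# Each node should now show up twice: once with the full node name, once with the md5 hash.
kubectl get -n rook cm | grep '^rook-ceph-osd-.*-config'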
I will verify with pera from Slack on Monday in their test cluster and see if we can implement a fix so that people don't need to run a bash snippet.
I already have a few adjustments ready which fix some usages of nodename for ConfigMaps; they don't seem strictly necessary but align them with the other code usages of nodename.
Please use this workaround first in a test cluster before upgrading your production cluster, and/or have backups ready.
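If you want a simple backup of the existing per-node OSD ConfigMaps before trying this, something along these lines should do (again assuming the rook namespace):

# Dump each node's existing OSD ConfigMap to a local file before making changes.
for nodeName in $(kubectl get nodes -o custom-columns=NAME:.metadata.name --no-headers=true); do
  kubectl get -n rook cm "rook-ceph-osd-${nodeName}-config" -o yaml > "/tmp/backup-rook-ceph-osd-${nodeName}-config.yaml"
done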
I can confirm that the workaround snippet worked for my cluster upgrade from v0.7.1 to v0.8.3 after a failed first attempt.