Rook: operator restarts the manager if the CRD is "applied" even if nothing has changed

Created on 28 Oct 2019 · 3 Comments · Source: rook/rook

When the CephCluster is 'applied' to the cluster, it causes the operator/manager to restart even when nothing has changed. This seems to be directly related to the use of the resources section in the CephCluster.

See this Slack thread for the same report from another user.

clusterChanged() was pointed out as a suspect in the Slack thread.

Deviation from expected behavior:

Upon 'applying' an unchanged CephCluster, the operator detects a change and restarts.

Expected behavior:

Upon 'applying' an unchanged CephCluster, the operator should not detect a change and should not restart.

How to reproduce it (minimal and precise):

Consider this CephCluster definition with the resources section defined. Subsequent 'applications' of the cluster.yaml file will cause the operator to restart.

Remove the resources section and the same repeated applications of the cluster.yaml file will not cause the operator to restart.

This recurring 'application' of the CephCluster will be common when using something like Flux for GitOps-based operation of the cluster.

File(s) to submit:

The line of interest is "op-cluster: The Cluster CR has changed. diff=".

2019-10-28 01:55:16.094198 I | op-k8sutil: finished waiting for updated deployment rook-ceph-osd-0
2019-10-28 01:55:16.094228 I | op-mon: this is not an upgrade, not performing upgrade checks
2019-10-28 01:55:16.094235 I | op-osd: started deployment for osd 0 (dir=false, type=)
2019-10-28 01:55:16.098893 I | op-osd: 4/4 node(s) completed osd provisioning
2019-10-28 01:55:16.099036 I | op-osd: checking if any nodes were removed
2019-10-28 01:55:16.107536 I | op-osd: processing 0 removed nodes
2019-10-28 01:55:16.107564 I | op-osd: done processing removed nodes
2019-10-28 01:55:16.107681 I | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/225580520
2019-10-28 01:55:17.491459 I | exec: Running command: ceph osd require-osd-release nautilus --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/956139559
2019-10-28 01:55:18.616953 I | cephclient: successfully disallowed pre-nautilus osds and enabled all new nautilus-only functionality
2019-10-28 01:55:18.616982 I | op-osd: completed running osds in namespace rook-ceph
2019-10-28 01:55:18.616994 I | rbd-mirror: configure rbd-mirroring with 0 workers
2019-10-28 01:55:18.619631 I | rbd-mirror: no extra daemons to remove
2019-10-28 01:55:18.619726 I | op-cluster: Done creating rook instance in namespace rook-ceph
2019-10-28 01:55:18.619754 I | op-cluster: CephCluster rook-ceph status: Created.
2019-10-28 01:55:18.624416 I | op-cluster: succeeded updating cluster in namespace rook-ceph
2019-10-28 01:55:18.625011 I | op-cluster: The Cluster CR has changed. diff=
2019-10-28 01:55:18.625107 I | op-cluster: update event for cluster rook-ceph is supported, orchestrating update now
2019-10-28 01:55:18.625276 I | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/901946714
2019-10-28 01:55:19.924101 I | op-cluster: ceph daemons running versions are: {Mon:map[ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable):3] Mgr:map[ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable):1] Osd:map[ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable):4] Rgw:map[] Mds:map[ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable):2] RbdMirror:map[] Overall:map[ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable):10]}
2019-10-28 01:55:19.924133 I | op-cluster: CephCluster rook-ceph status: Updating.
2019-10-28 01:55:19.932177 I | op-mon: start running mons
2019-10-28 01:55:19.935139 I | op-mon: parsing mon endpoints: a=10.43.123.157:6789,b=10.43.58.70:6789,c=10.43.82.142:6789

When the resources section is removed, the operator no longer restarts on every application of the CephCluster. The same part of the operation in that 'good' log looks like this:

2019-10-28 01:58:03.055308 I | op-k8sutil: finished waiting for updated deployment rook-ceph-osd-0
2019-10-28 01:58:03.055326 I | op-mon: this is not an upgrade, not performing upgrade checks
2019-10-28 01:58:03.055331 I | op-osd: started deployment for osd 0 (dir=false, type=)
2019-10-28 01:58:03.060282 I | op-osd: 4/4 node(s) completed osd provisioning
2019-10-28 01:58:03.060374 I | op-osd: checking if any nodes were removed
2019-10-28 01:58:03.067371 I | op-osd: processing 0 removed nodes
2019-10-28 01:58:03.067387 I | op-osd: done processing removed nodes
2019-10-28 01:58:03.067453 I | exec: Running command: ceph versions --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/145145053
2019-10-28 01:58:03.620637 I | exec: Running command: ceph osd require-osd-release nautilus --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/125949592
2019-10-28 01:58:04.207432 I | cephclient: successfully disallowed pre-nautilus osds and enabled all new nautilus-only functionality
2019-10-28 01:58:04.207460 I | op-osd: completed running osds in namespace rook-ceph
2019-10-28 01:58:04.207468 I | rbd-mirror: configure rbd-mirroring with 0 workers
2019-10-28 01:58:04.210097 I | rbd-mirror: no extra daemons to remove
2019-10-28 01:58:04.210113 I | op-cluster: Done creating rook instance in namespace rook-ceph
2019-10-28 01:58:04.210122 I | op-cluster: CephCluster rook-ceph status: Created.
2019-10-28 01:58:04.214446 I | op-pool: start watching clusters in all namespaces
2019-10-28 01:58:04.214476 I | op-object: start watching object store resources in namespace rook-ceph
2019-10-28 01:58:04.214484 I | op-object: start watching object store user resources in namespace rook-ceph
2019-10-28 01:58:04.214489 I | op-bucket-prov: Ceph Bucket Provisioner launched
2019-10-28 01:58:04.215380 I | op-file: start watching filesystem resource in namespace rook-ceph
2019-10-28 01:58:04.215389 I | op-nfs: start watching ceph nfs resource in namespace rook-ceph
2019-10-28 01:58:04.215399 I | op-cluster: ceph status check interval is 60s

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 19.10
  • Kernel (e.g. uname -a): 5.3.0-19
  • Cloud provider or hardware configuration: bare metal
  • Rook version (use rook version inside of a Rook Pod): v1.1.4
  • Storage backend version (e.g. for ceph do ceph -v): 14.2.4
  • Kubernetes version (use kubectl version): v1.16.2-k3s.1
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): k3s
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

All 3 comments

@travisn From the logs it's clear that even when diff= is empty, the clusterChanged() function still returns true there. This can be fixed by checking whether the diff produced by comparing the new and old cluster is non-empty, and only returning true in that case, right?

@nizamial09 Originally we always wanted to trigger the orchestration if the cluster changed event was raised. But since we added the check to see if there is a difference, it would certainly be better to only trigger the orchestration if something changed. Thanks for digging into this!

So can I make the changes and see if that works well? Add a condition so that orchestration gets triggered only when something has actually changed.
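
A minimal sketch of that condition in Go, using hypothetical simplified types rather than the actual Rook operator code: report a change, and therefore trigger orchestration (and the mgr restart), only when the computed diff between the old and new cluster specs is non-empty.

// Sketch of the proposed guard (hypothetical names and types, not the
// actual Rook source): an empty diff means the 'apply' was a no-op, so
// no orchestration and no restart should happen.
package main

import (
	"fmt"
	"reflect"
)

// clusterSpec is a trimmed-down stand-in for the CephCluster spec; the
// real type lives in Rook's API packages.
type clusterSpec struct {
	Resources map[string]string
}

// clusterChanged returns true only when the old and new specs differ,
// along with a human-readable diff for logging.
func clusterChanged(oldSpec, newSpec clusterSpec) (bool, string) {
	if reflect.DeepEqual(oldSpec, newSpec) {
		return false, ""
	}
	diff := fmt.Sprintf("old: %+v, new: %+v", oldSpec, newSpec)
	return true, diff
}

func main() {
	// Applying an identical spec (as Flux or kubectl apply would do)
	// yields an empty diff and no reported change.
	oldSpec := clusterSpec{Resources: map[string]string{"mgr": "1Gi"}}
	newSpec := clusterSpec{Resources: map[string]string{"mgr": "1Gi"}}
	changed, diff := clusterChanged(oldSpec, newSpec)
	fmt.Printf("changed=%v diff=%q\n", changed, diff) // changed=false diff=""
}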
