Cockroach: ui: should we hide or move the decommissioned nodes list?

Created on 10 Apr 2018 · 17 comments · Source: cockroachdb/cockroach

QUESTION

Hey there. I can only find topics about decommissioned nodes being hidden from the UI, but I never found any information about when dead nodes will be hidden. In this case one node died completely and was replaced with a completely new one, but it stays in the UI forever and just does not disappear, even though I marked it as decommissioned afterwards.

Is there any way to finalize the removal?

A-webui-general C-question O-community S-3-ux-surprise

All 17 comments

Hi @wzrdtales! We never remove dead nodes from the UI unless you decommission them. Are you having trouble decommissioning dead nodes? I can't quite tell from your report.
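
For reference, a hedged sketch of how a node that has already died can be marked as decommissioned from the command line; the node ID, security mode, and address below are placeholders for your own deployment, not values from this thread:

    # Mark already-dead node 4 as decommissioned, pointing the CLI at any live node.
    # Node ID, --insecure, and the address are placeholders for your own setup.
    cockroach node decommission 4 --insecure --host=<address-of-any-live-node>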

@benesch The nodes were dead before I decommissioned them. So now there is unnecessary clutter in the UI that will never come back to life, since the new node, registered under the same name as the old one, registered itself as a completely new node. So yeah, basically I do have a problem with getting rid of dead nodes: decommissioning them is not the problem, but they still don't disappear from the UI. They're not dead anymore, but they're still floating around in the UI, now in the form of decommissioned nodes.

Now it looks like this, and that is probably not going to get better.

[image: https://user-images.githubusercontent.com/1786821/38588739-c3b79726-3d27-11e8-8b49-56f0abc68ac4.png]

Yep, that's expected. We never forget about decommissioned nodes. You won't see them anywhere but that one list you screenshotted, though; we've hidden them from, e.g., the graph views, where those nodes have no data. At least, if you see them anywhere else, it's a bug.

Decommissioning is not entirely free, so showing those decommissioned nodes in the UI reminds you of the baggage your cluster will have to carry around forever. It also explains to future administrators why your nodes are numbered n1, n2, and n8.

That said, we might want to consider making the decommissioned nodes section collapsed by default. Would that adequately address your concern?

Having an archive would, I think, be more meaningful. What I expected, UX-wise, was that old decommissioned nodes would just become invisible after a certain amount of time (a day or so) unless you look them up in an archive. Having this on the dashboard is not the best UX decision IMHO, since it creates the impression that something is wrong when it is not. By the way, how wide is the integer storing the node numbers? Just curious.

Yeah, this view used to be better buried under the "View node list" link, but it appears the view was repurposed for the new 2.0 admin UI homepage, putting these decommissioned nodes front and center. /cc @couchand @vilterp

The limiting factor on the node ID is probably the int32 used in the NodeDescriptor proto: https://github.com/cockroachdb/cockroach/blob/bfa9931d495351f02b1681589ea60223a88393a4/pkg/roachpb/metadata.proto#L145

OK, then there is room for at least 2 billion nodes (or dead ones) :p, assuming that int32 is signed rather than unsigned here.

Displaying them for a certain amount of time wouldn't be that wrong, but hiding them afterwards would make sense IMHO.

We've actually talked about this quite a bit, mostly offline, but you can see some of the history in issues going way back to the first implementation (see the comments starting at https://github.com/cockroachdb/cockroach/pull/17553#issuecomment-321608844). I think @benesch's point is strong: there's a gap in the node ids which is useful to explain, and there is also a cost to the cluster associated with decommissioned nodes beyond just using the bits of the id field.

I can see the value in having that information be available if you seek it out, but not front and center. I'm not entirely sold on the argument that it makes it seem like something is wrong -- I think the meaning of "decommission" is pretty clear in this case, but it's not information you need daily, so it adds noise to the overview page.

As an aside, the best practice when a node dies unexpectedly is to bring it back up, not to replace it with a new one. All this requires is the data from the store directory. The decommissioning process is intended mostly for planned cluster changes, and the UX has been designed around that usage pattern.
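
To illustrate that aside with a hedged sketch (the store path, security mode, and join addresses below are placeholders, not taken from this thread): restarting a failed node with its original store directory is just the usual start invocation pointed at the surviving data, after which the node rejoins under its old identity.

    # Bring a crashed node back up with its existing store directory.
    # --insecure, the store path, and the join addresses are placeholders.
    cockroach start --insecure \
      --store=/path/to/original/store \
      --join=<address-of-node-1>,<address-of-node-2>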

As just one more aside, I was under the impression that removing decommissioned nodes from the liveness table eventually was still in the works (removing the nodes from the liveness table would mean the cluster actually forgets about them and the cost I alluded to above is recovered). At the end of #17553 that was punted to #15609, which I see was recently closed without any action being taken. It also looks like #20639 was opened to address this exact question, which was also closed in favor of #15609, so perhaps it was a mistake to close #15609 at all? @tschottdorf you commented on a few of these, what do you think? It sounds like your perspective is that we should just get rid of the "Decommissioned" list entirely, how does that square with @benesch's argument?

cc @piyush-singh

@couchand So these nodes died completely: no data is available anymore, so how are you supposed to bring them back up? In fact, they came back up with the same name but registered as completely new nodes. If I had had an option to re-register them as the same nodes as before, just without their data, I would have done that.

There's no concept in CockroachDB of restoring a node that has lost all data. If the data is gone, you simply can't bring it back up. The node's identity is completely independent of the interface it's listening on: a new node on the old host and port will still be a new node, and an old node on a new host and port will still be the old node.

OK, then I will have a lot of decommissioned nodes in the current scenario. Any node, along with all of its data, could be gone at any random point in time, but there are always enough nodes left to restore the new one.

Use case supporting a remove option to clear decommissioned nodes from the UI/DB (not dead nodes):
We are trying to use CockroachDB in the cloud; hosts will be replaced/refreshed every 30 days using Jenkins.
Running a cluster of 4 nodes in each of 3 regions (Oregon, NOVA, Ohio) means at least 12 nodes per month accumulating under Decommissioned in prod.
We don't care about those nodes; they are gone forever.
Dev/test, where we are building the rotation processes, is getting messy with all the decomms in the UI.
Recommendation: build an option to clear decommissioned nodes, with a time frame for how far back, and let the customer decide between info and clutter. Send us a query that can purge decomms from the DB and it will be added to our decomm code.

Unless something has changed that I'm unaware of, it's not accurate to say that decommissioned nodes are gone forever: there is still an overhead that will continue to accumulate as nodes are continually decommissioned. As long as this exists, it needs to be reportable on the front end.

Thus @Timbo000002, I don't believe your suggestion applies to the web UI, but rather to the decommissioning process of the cluster itself. This issue is limited to the suggestion to change the way that we display nodes which have been decommissioned but the cluster still remembers. Your suggestion is valid, so I've opened another issue to track it: #33542.
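
For anyone who wants to check which decommissioned nodes the cluster still remembers, a hedged sketch using the CLI (connection flags are placeholders for your own deployment, and the exact columns shown vary by version):

    # List per-node decommissioning status, including nodes the cluster still tracks.
    # --insecure and the address are placeholders for your own setup.
    cockroach node status --decommission --insecure --host=<address-of-any-live-node>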

Zendesk ticket #3096 has been linked to this issue.

"Repave" is becoming a common technique for applying OS patches: machines are immutable, so patching requires rolling out new nodes. Not cleaning up decommissioned nodes will be problematic.

Designs for updates to the decommissioned node list can be found here: https://zpl.io/boG0kEo

These illustrate the following user stories:

As an application developer I need to:

  • See when the decommissioning of a node is in progress in the Admin UI
  • Have the ability to clear decommissioned nodes from the overview page
  • Have the ability to view a history of all decommissioned nodes

A possible edge case user story:

  • Get alerted when a node cannot be decommissioned because there are only 3 nodes (meaning there are no other nodes available to transfer its replicas to)

Note: we are also looking to update the 'Node Status' numbers at the top of the overview page for accuracy. Currently, when a node is decommissioning, the UI displays that node as 'Suspect', which isn't accurate.

Instead, we will update the 'Live Nodes' count. For example, if you have 9 live nodes and decommission 1, the live nodes count updates to 8 and the suspect nodes count stays at 0.
