Cockroach: server: option to obliterate decommissioned nodes from cluster memory

Created on 7 Jan 2019 · 17 comments · Source: cockroachdb/cockroach

Use case: support an option to remove decommissioned nodes from the UI/DB (not dead nodes):
We are trying to use CockroachDB in the cloud; hosts are replaced/refreshed every 30 days using Jenkins.
Running a cluster of 4 nodes in each of 3 regions (Oregon, NOVA, Ohio) means at least 12 nodes per month accumulating under Decommissioned in prod.
We don't care about those nodes; they are gone forever.
Dev/test, where we build the rotation processes, is getting messy with all the decommissioned nodes in the UI.
Recommendation: build an option to clear them, with a configurable time window for how far back to keep, and let the customer decide between information and clutter. A query that can purge decommissioned nodes from the DB would also work; we would add it to our decommissioning code.

_Originally posted by @Timbo000002 in https://github.com/cockroachdb/cockroach/issues/24636#issuecomment-449008563_
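There is no purge query today; as a rough point of reference, the decommissioning state that a rotation or cleanup script could already inspect can be listed as below. This is a hedged sketch: the host and certificate flags are illustrative, assuming a secure cluster.

$> cockroach node status --decommission --certs-dir=certs --host=localhost:26257
# Lists each node's ID, address, and decommissioning/draining state,
# which a cleanup job could filter on before deciding what to act on.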

C-enhancement

All 17 comments

cc @tbg for triage

This is a frequent request/point of confusion; I got a request for this at an operations training last week. I think it's important that nodes that are both decommissioned and dead drop out of the UI quickly.

That seems reasonable to achieve. Tagging as a UI issue for now (though there may be a CLI component too).

Update: I refreshed our 4-node x 2-region cluster yesterday to upgrade to 2.1.3, and it now shows 11 dead nodes, 19 under-replicated/unavailable ranges, and 14 decommissioned nodes.

During the previous refresh we changed the default port from 26257 to 5432 due to corporate firewall rules, so that we can access the cluster with DBeaver.

None of the dead and decommissioned nodes exist anymore, so if CRDB is still doing something in the background that needs access to them, it will probably never finish; some nodes in the UI have been gone for 19 to 22 days.

To make sure we solve the right thing: this could partly come from our node decommission commands (a node-ID-based variant is sketched after the AWS commands below), which are:
$> /usr/local/bin/cockroach quit --decommission --host=${COCKROACH_INSTANCE_IP}
$> systemctl stop securecockroachdb.service
Then call AWS to drop the node:

If created without stack:

$> /bin/aws ec2 terminate-instances --region=$region --instance-ids ${COCKROACH_INSTANCE_ID}

If created with stack:

$> /bin/aws cloudformation delete-stack --region=$region --stack-name ${STACK_NAME}
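For reference, newer CockroachDB CLIs split this flow into an explicit node-ID-based decommission step. Below is a hedged sketch under the same rotation assumptions; the node-ID lookup via CSV output (ID in the first column, address in the second) is an assumption here, not a verified layout.

# Resolve the node ID from its address, then decommission that node.
$> NODE_ID=$(cockroach node status --host=${COCKROACH_INSTANCE_IP} --format=csv | awk -F, -v ip="${COCKROACH_INSTANCE_IP}" '$2 ~ ip {print $1}')
$> cockroach node decommission ${NODE_ID} --host=${COCKROACH_INSTANCE_IP}
$> systemctl stop securecockroachdb.service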

When the script runs I observe that the node drains as expected, which is why it concerns me that under-replicated/unavailable ranges remain when those nodes/ranges are now gone.

Please validate the above.
Ping me as needed for more details.

Hi @Timbo000002, after proper decommissioning of a node there shouldn't be any unavailable or under-replicated ranges. If that's not what you're seeing, please open a new issue (it is unrelated to this one). cc @tim-o for further assistance.
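One way to check this from a rotation script before terminating the instance, as a hedged sketch (host and certificate flags are illustrative, and exact column names may vary by version):

$> cockroach node status --ranges --host=localhost:26257 --certs-dir=certs
# The per-node under-replicated and unavailable range counts should be 0
# on the remaining nodes once decommissioning has fully completed.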

I opened: https://github.com/cockroachdb/cockroach/issues/35220. How do you cc someone in GitHub?

@Timbo000002 you just mention them. I did it in #35220 already.

Hey, I see this issue hasn't seen any activity in over a year.

On my cluster I had a disk failure which somehow ended up adding 4 nodes with the same IP address to the decommissioned node list. systemd kept restarting CockroachDB, which kept adding new node IDs to the cluster. Now every time I open the web UI I am reminded of that stressful event.

I would love it if the cluster could forget about those nodes forever, or at least not shove them in my face on the home page of the web UI.

Hi @Fornax96, you can remove these nodes from the list using the cockroach node decommission command. Can you check whether this alleviates the issue?
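For reference, a hedged sketch of that invocation; the node IDs and connection flags below are illustrative:

$> cockroach node decommission 4 5 6 7 --host=localhost:26257 --certs-dir=certs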

@knz The nodes are already decommissioned. Executing the decommission command again doesn't change anything.

OK, this is odd - we changed this logic (and added lots of tests) throughout the v19.2 release cycle. Can you remind us which version you are running?

I'm on 19.2.6

OK, we need to investigate this further. Could you coordinate with our tech support to get some debugging information for us to use? The output of cockroach debug zip is what we're looking for.
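A hedged sketch of gathering that bundle (host and certificate flags illustrative):

$> cockroach debug zip ./debug.zip --host=localhost:26257 --certs-dir=certs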

Hi @Fornax96 ,

Could you send the debug.zip via this link?

Thank you!
Matt

@knz the file is here.

@piyush-singh clicking through #24636 I see we recently did some work in https://github.com/cockroachdb/cockroach/pull/42817. Is the top-level issue still applicable? If not, let's close it out. If so, I think we can improve what the issue is asking for - I'm not sure what's actionable here. Leaving it to you to close it out/re-file if necessary.

I think this has been addressed right?
