When upgrading a cluster, a persistent setting in /_cluster/settings may disappear (or change and no longer be properly registered by a plugin). When that happens, it becomes impossible to _unset_ the persistent setting. Worse still, applying new settings also becomes a problem, because the update tries to merge in the old, now-unknown settings.
```
PUT /_cluster/settings
{
  "persistent": {
    "unknown.setting": null
  }
}
```
This request should be allowed to run even though the setting is unrecognized.
We have an alternative issue #28026 that says: "we should not archive unknown and broken cluster settings. Instead, we should fail to recover the cluster state. The solution for users in an upgrade case would be to rollback to the previous version, address the settings that would be unknown or broken in the next major version, and then proceed with the upgrade."
Implementing #28026 makes this issue unnecessary, as the broken settings should NOT exist anymore in the cluster state.
But the valid question is what to do with clusters that can't downgrade and have broken settings stuck in the cluster state.
In such a case, I think the ability to remove them is the only option?
We talked about it in Fix It Friday and we think this happens due to rolling restarts. Those currently don't validate anything, so if a 6.x node joins a 5.x cluster that has a setting 6.x doesn't understand, no error message is triggered. Once the rolling restart is finished, we end up with a 6.x cluster with broken settings, preventing any future change.
To deal with the above we decided we need to add a join validator that checks that the joining node understands all current settings in the cluster.
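For illustration only, here is a minimal Java sketch of what such a join validator could check; the names (`UnknownSettingsJoinValidator`, `knownSettingKeys`) are hypothetical stand-ins and this is not the actual Elasticsearch implementation.

```java
// Hedged sketch: reject a joining node unless it recognizes every persistent
// cluster setting, so the operator sees the problem during the rolling restart
// rather than after it. All names here are hypothetical placeholders.
import java.util.Map;
import java.util.Set;

final class UnknownSettingsJoinValidator {

    /** Setting keys the joining node knows how to parse (hypothetical input). */
    private final Set<String> knownSettingKeys;

    UnknownSettingsJoinValidator(Set<String> knownSettingKeys) {
        this.knownSettingKeys = knownSettingKeys;
    }

    /**
     * Called when a node asks to join: if the cluster state carries a persistent
     * setting this node does not register, the join is refused.
     */
    void validateJoin(Map<String, String> persistentClusterSettings) {
        for (String key : persistentClusterSettings.keySet()) {
            if (!knownSettingKeys.contains(key)) {
                throw new IllegalStateException(
                    "node cannot join: unknown persistent cluster setting [" + key + "]");
            }
        }
    }
}
```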
While that is good and will prevent this specific issue as described, it still has problems:
1) People who are in this state - i.e., a 6.x cluster with broken settings - need a solution.
2) It's still possible that, while the cluster is in mixed mode and the master is on 5.x, people will update their settings to include a 5.x-only setting.
To address the first point, we want a 6.x master to automatically archive any settings it doesn't understand on any settings update (regardless of which setting is being updated); a sketch of the idea follows.
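A minimal sketch of that archiving step, assuming unknown keys are simply moved under an `archived.` prefix; the helper shape and prefix handling are illustrative, not the real Elasticsearch code.

```java
// Hedged sketch of "archive what you don't understand" on a settings update:
// unknown keys are renamed under an archived prefix instead of blocking the
// update, so later changes (including unsetting) are no longer stuck on them.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class SettingsArchiver {

    static final String ARCHIVED_PREFIX = "archived.";

    /**
     * Returns a copy of the persistent settings in which every key the current
     * node does not recognize is moved under the archived prefix.
     */
    static Map<String, String> archiveUnknown(Map<String, String> persistent,
                                              Set<String> knownKeys) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : persistent.entrySet()) {
            if (knownKeys.contains(e.getKey()) || e.getKey().startsWith(ARCHIVED_PREFIX)) {
                result.put(e.getKey(), e.getValue());
            } else {
                result.put(ARCHIVED_PREFIX + e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```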
To address the second, we thought to add functionality to TransportClusterUpdateSettingsAction where it first looks at the versions of the current cluster nodes and, if there are nodes with a higher version than the master, it first reaches out to them to validate the setting. Once the setting change is validated on the node with the highest version, it will proceed to update the setting on the cluster state thread. While the setting update task runs on the cluster state thread, it will again check that the preflight check is still valid - i.e., no node with a newer version has joined. If it is not valid, the entire request will fail. A rough sketch of that flow is below.
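To make the two-step flow concrete, here is a hedged sketch using hypothetical placeholder types (`NodeView`, `SettingsValidator`); it is not the actual TransportClusterUpdateSettingsAction code.

```java
// Hedged sketch of the proposed flow: validate the change on the
// highest-version node first, then re-check on the cluster state thread that
// no newer node joined in between. All types are hypothetical placeholders.
import java.util.Comparator;
import java.util.List;

final class PreflightSettingsUpdate {

    interface NodeView { int majorVersion(); }
    interface SettingsValidator { void validateOn(NodeView node, String key, String value); }

    /** Highest node version observed when the preflight validation ran. */
    private int preflightMaxVersion;

    void preflight(List<NodeView> nodes, SettingsValidator validator, String key, String value) {
        NodeView newest = nodes.stream()
            .max(Comparator.comparingInt(NodeView::majorVersion))
            .orElseThrow(() -> new IllegalStateException("empty cluster"));
        validator.validateOn(newest, key, value);      // step 1: ask the newest node
        preflightMaxVersion = newest.majorVersion();   // remember what we validated against
    }

    /** Run inside the cluster state update task: fail if a newer node joined since preflight. */
    void recheckOnClusterStateThread(List<NodeView> currentNodes) {
        int currentMax = currentNodes.stream()
            .mapToInt(NodeView::majorVersion)
            .max()
            .orElseThrow(() -> new IllegalStateException("empty cluster"));
        if (currentMax > preflightMaxVersion) {
            throw new IllegalStateException(
                "a node with a newer version joined after preflight validation; rejecting update");
        }
    }
}
```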
@s1monw another option that occurred to me while writing is - maybe we should extend the Deprecated property with a required "remove in version x" parameter. Then the master can validate the settings on its own based on this information. WDYT?
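Purely as an illustration of that idea (not the real Setting.Property API), a deprecated setting could carry its removal version like this:

```java
// Hedged sketch: a deprecated setting tagged with the major version in which
// it is removed, so a master can reject or archive it from metadata alone.
final class DeprecatedSetting {

    final String key;
    final int removedInMajorVersion;  // e.g. 6 means "gone as of 6.0"

    DeprecatedSetting(String key, int removedInMajorVersion) {
        this.key = key;
        this.removedInMajorVersion = removedInMajorVersion;
    }

    /** True if a master on the given major version should no longer accept this setting. */
    boolean isRemovedIn(int clusterMajorVersion) {
        return clusterMajorVersion >= removedInMajorVersion;
    }
}
```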
I believe this issue has been addressed by https://github.com/elastic/elasticsearch/pull/28888, and so I'm going to close it. If you believe I am incorrect about that, please comment and/or reopen this issue.