Cockroach: storage: add the ability to set/adjust zone configs for system ranges

Created on 14 Nov 2016  ·  17 Comments  ·  Source: cockroachdb/cockroach

We need to be able to set the zone configs for all system ranges. Unlike other ranges, if any range that contains metadata is unavailable, the entire cluster can become unavailable.

In the future, it might be nice to automatically do so, based on a combination of locality and the size of the cluster.

All 17 comments

@BramGruneir, I'd like to take a look at this. May I?

@a6802739 Of course, go for it.

@BramGruneir, could you give me some references? I didn't quite get your point.

Forgive my poor English.

I think we need to hash the design out (in this issue) first. The initial message notes the desired goal: being able to set a zone config for system ranges (i.e. everything that isn't a table). How to accomplish that is an open question. The system.zones table has the following schema:

CREATE TABLE system.zones (
  id     INT PRIMARY KEY,
  config BYTES
);

The id column refers to a table or database ID. We use id 0 to refer to the "default" zone config that is used whenever another zone config does not apply to a range. That means that currently the "default" zone config is used for system ranges.

One thought is that we can extend the ZoneConfig proto to include a sub-config to be used for system ranges. Something like:

message ZoneConfig {
  ...
  optional ZoneConfig system = 7;
}

This feels a bit awkward. The other option is to reserve another ID from the system ID range in keys/constants.go. Something like systemConfigID = 15. This is probably more natural and might be an equivalent amount of work.

We'd want to extend the cockroach zone {get,set,ls} commands to recognize a special .system similar to the way they currently recognize .default. And you'll have to enhance SystemConfig.GetZoneConfigForKey to retrieve the correct zone config if the key is a "system" key.
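A minimal sketch of the reserved-ID idea, assuming a separate ID for system ranges that falls back to the default entry (id 0) when no row exists for it. The names zoneConfigs, lookup, defaultZoneID and systemZoneID are hypothetical, and 15 is only the illustrative value floated above; none of this is the actual SystemConfig.GetZoneConfigForKey code.

```go
package main

import "fmt"

// Hypothetical IDs: 0 is the existing "default" entry; 15 is the
// illustrative reserved value from the discussion, not a real constant.
const (
	defaultZoneID = 0
	systemZoneID  = 15
)

// zoneConfigs stands in for the rows of system.zones (id -> config bytes).
type zoneConfigs map[int][]byte

// lookup returns the config stored under id, falling back to the
// default entry when no row exists for that id.
func (z zoneConfigs) lookup(id int) ([]byte, bool) {
	if cfg, ok := z[id]; ok {
		return cfg, true
	}
	cfg, ok := z[defaultZoneID]
	return cfg, ok
}

func main() {
	zones := zoneConfigs{defaultZoneID: []byte("default-config")}

	// With no system entry, system ranges still use the default config.
	cfg, _ := zones.lookup(systemZoneID)
	fmt.Println(string(cfg))

	// Once a system entry is set, it takes precedence.
	zones[systemZoneID] = []byte("system-config")
	cfg, _ = zones.lookup(systemZoneID)
	fmt.Println(string(cfg))
}
```

The fallback keeps today's behavior intact for clusters that never set a system zone config.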

Open question: do we want to be able to specify different zone configs for different portions of the system key space? For example, we might want higher replication for the meta{1,2} ranges than timeseries data.

Cc @spencerkimball

We're going to need more control than just the system ranges as a lump concept. Here's my take on what we need to control (and whatever mechanism we use should be extensible):

  • meta1 / meta2 addressing records – these are the crown jewels. For a geographically dispersed cluster, we'd want copies of these everywhere for local inconsistent reads. I imagine the replication for these will be set to encompass every datacenter in the cluster.
  • NodeLiveness, DescIDGenerator, RangeIDGenerator, StoreIDGenerator – these records need to be available, although not widely distributed, so I'd imagine you'd use something like 5 replicas in the most central datacenter in a geographically dispersed cluster.
  • Status* records
  • TimeSeries* records

@a6802739 I should have said, please feel free to step in here. Contributions welcome!

@spencerkimball, sorry for the late response. I'll give it a try, thanks a lot.

@petermattis, how can I specify different zone configs for different portions of the system key space? Should I use a different systemConfigID for each portion?

id 0 is just RootNamespaceID, right?

I want to use SystemPrefix to decide whether a key is a "system" key, but it seems the meta{1,2} keys do not start with SystemPrefix.

So I think the main problem is: how can I specify different zone configs for different keys?

The meta{1,2} keys do not have SystemPrefix, but they have their own unique prefixes. Specifically, meta1 keys all have a prefix of \x02 and meta2 keys have a prefix of \x03. See pkg/keys/constants.go for more details.
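As a rough illustration of that prefix check, here is a sketch that classifies a raw key by its first byte. The classify helper and the variable names are hypothetical; the byte values are the ones quoted above, and pkg/keys/constants.go remains the authoritative source.

```go
package main

import (
	"bytes"
	"fmt"
)

// Prefix bytes as quoted in the comment above (assumed here, not
// imported from pkg/keys).
var (
	meta1Prefix = []byte{0x02}
	meta2Prefix = []byte{0x03}
)

// classify is a hypothetical helper, not a CockroachDB function. It
// reports which addressing level a key belongs to, or "other".
func classify(key []byte) string {
	switch {
	case bytes.HasPrefix(key, meta1Prefix):
		return "meta1"
	case bytes.HasPrefix(key, meta2Prefix):
		return "meta2"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify([]byte("\x02some-key")))
	fmt.Println(classify([]byte("\x03some-key")))
	fmt.Println(classify([]byte("some-key")))
}
```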

Do we still want to get this in for 1.0? It looks like @a6802739 has mostly abandoned #12335, which #12513 was meant to supplement.

I'd really like to get this straightened out. How much work is left?

I think we should get this in 1.0. The work was mostly done.

@BramGruneir, do you have bandwidth to drive this to the finish line or do you need me to?

I'm out of the office until Wednesday and I'll be concentrating on Windows for a bit. Feel free to take this over.

I'm not sure we can split the system ranges up as nicely as proposed in https://github.com/cockroachdb/cockroach/issues/10692#issuecomment-261261349 due to the ordering of the relevant keys.

Splitting off the meta1/meta2 records is easy, since they're well separated from the system prefix. But within the system prefix, the other records aren't really separable. If I'm not mistaken, the ordering is effectively:

\x00liveness- (node liveness keys)
desc-idgen
node-idgen
range-idgen
status- (status-related keys)
store-idgen
system-version/ (migration-related keys)
tsd- (timeseries data)
update- (usage reporting / update-checking data)

There isn't a sane way to split things up as originally proposed without creating some very small ranges (e.g. a range with just the status- keys, and one with just store-idgen and the migration keys).

It looks like we may want to split things up a little differently. One reasonable option would be to use meta1/meta2 as one predefined split, everything in [\x00liveness-, tsd-) as another, and [tsd-, systemMax) as the last.

What I think is the best option, though, would be to have one config/split for meta records, one for timeseries records, and one for all other system range data. It'd make for one extra manual split and a small "update-" range, but it's the least restrictive to future expansion of the system key space because it more properly boxes in the one thing that really needs to be isolated (tsd).

I'm going to move forward with the last option, but am happy to discuss further if there are other opinions.
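To make the chosen layout concrete, here is a sketch of how keys under the system prefix would fall into ranges if the timeseries span is boxed in on both sides. Keys are written relative to the system prefix, as in the listing above; rangeFor, prefixEnd and the zone labels are hypothetical names for illustration, and the meta records (handled by their own prefixes) are left out.

```go
package main

import "fmt"

// prefixEnd returns the smallest string greater than every string that
// starts with prefix. It assumes a non-empty prefix whose last byte is
// not 0xff.
func prefixEnd(prefix string) string {
	b := []byte(prefix)
	b[len(b)-1]++
	return string(b)
}

// rangeFor reports which range a system key lands in once splits are
// placed at the start and end of the "tsd-" span, along with the zone
// config that range would use.
func rangeFor(key string) (int, string) {
	tsdStart := "tsd-"
	tsdEnd := prefixEnd(tsdStart)
	switch {
	case key < tsdStart:
		return 0, "system" // liveness through system-version/
	case key < tsdEnd:
		return 1, "timeseries" // only tsd- keys
	default:
		return 2, "system" // the small trailing "update-" range
	}
}

func main() {
	keys := []string{
		"\x00liveness-", "desc-idgen", "node-idgen", "range-idgen",
		"status-", "store-idgen", "system-version/", "tsd-", "update-",
	}
	for _, k := range keys {
		r, zone := rangeFor(k)
		fmt.Printf("%q -> range %d (%s)\n", k, r, zone)
	}
}
```

Ranges 0 and 2 share one config, so the extra split costs a small range but no additional zone entry.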

Sounds fine to me. I don't think the very small update range will be problematic.

Sounds good to me too.
