Cockroach: Version mismatch errors during cluster upgrade

Created on 13 Nov 2017 · 23 comments · Source: cockroachdb/cockroach

Is this a question, feature request, or bug report?

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster. On most unix-based systems
    running with defaults, this boils down to the output of

    grep -F '[config]' cockroach-data/logs/cockroach.log

    When log files are not available, supply the output of cockroach version
    and all flags/environment variables passed to cockroach start instead.

/opt/mesosphere/active/cockroach/bin/cockroach start --logtostderr --cache=100MiB --store=/var/lib/dcos/cockroach --certs-dir=/run/dcos/pki/cockroach --advertise-host=172.17.0.3 --host=172.17.0.3 --port=26257 --http-host=127.0.0.1 --http-port=8090 --log-dir= --join=172.17.0.3,172.17.0.2,172.17.0.4
I171113 09:29:47.659301 1 cli/start.go:785  CockroachDB CCL v1.1.2 (linux amd64, built 2017/11/07 08:40:54, go1.9.2)
I171113 09:29:47.760595 1 server/config.go:311  available memory from cgroups (8.0 EiB) exceeds system memory 31 GiB, using system memory
I171113 09:29:47.760664 1 server/config.go:425  system total memory: 31 GiB
I171113 09:29:47.760779 1 server/config.go:427  server configuration:
max offset                500000000
cache size                100 MiB
SQL memory pool size      128 MiB
scan interval             10m0s
scan max idle time        200ms
metrics sample interval   10s
event log enabled         true
linearizable              false
I171113 09:29:47.760996 12 cli/start.go:503  starting cockroach node
I171113 09:29:47.761051 12 cli/start.go:505  using local environment variables: COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=true
W171113 09:29:47.761994 12 security/certificate_loader.go:252  error finding key for /run/dcos/pki/cockroach/client.root.crt: could not read key file /run/dcos/pki/cockroach/client.root.key: open /run/dcos/pki/cockroach/client.root.key: permission denied
I171113 09:29:47.768751 12 storage/engine/rocksdb.go:405  opening rocksdb instance at "/var/lib/dcos/cockroach/local"
W171113 09:29:47.804815 12 gossip/gossip.go:1241  [n?] no incoming or outgoing connections
I171113 09:29:47.805159 12 storage/engine/rocksdb.go:405  opening rocksdb instance at "/var/lib/dcos/cockroach"
I171113 09:29:47.817498 24 gossip/client.go:129  [n?] started gossip client to 172.17.0.2:26257
I171113 09:29:48.140959 12 server/config.go:527  [n?] 1 storage engine initialized
I171113 09:29:48.141005 12 server/config.go:530  [n?] RocksDB cache size: 100 MiB
I171113 09:29:48.141027 12 server/config.go:530  [n?] store 0: RocksDB, max size 0 B, max open file limit 11384
I171113 09:29:48.142335 12 server/server.go:819  [n?] sleeping for 161.979362ms to guarantee HLC monotonicity
I171113 09:29:48.324021 12 server/node.go:461  [n1] initialized store [n1,s1]: disk (capacity=788 GiB, available=203 GiB, used=2.4 MiB, logicalBytes=20 MiB), ranges=23, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=1987.00 p75=23656.00 p90=51983.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00}
I171113 09:29:48.324116 12 server/node.go:326  [n1] node ID 1 initialized
I171113 09:29:48.324432 12 gossip/gossip.go:327  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"172.17.0.3:26257" > attrs:<> locality:<> ServerVersion:<major_val:1 minor_val:1 patch:0 unstable:0 >
I171113 09:29:48.324706 12 storage/stores.go:303  [n1] read 2 node addresses from persistent storage
I171113 09:29:48.325142 12 server/node.go:606  [n1] connecting to gossip network to verify cluster ID...
I171113 09:29:48.325222 12 server/node.go:631  [n1] node connected via gossip and verified as part of cluster "3dfb4c5c-dd06-4ea7-96a2-5c670fe2514b"
I171113 09:29:48.325334 12 server/node.go:403  [n1] node=1: started with [<no-attributes>=/var/lib/dcos/cockroach] engine(s) and attributes []
I171113 09:29:48.325596 12 sql/executor.go:408  [n1] creating distSQLPlanner with address {tcp 172.17.0.3:26257}
I171113 09:29:48.336739 12 server/server.go:946  [n1] starting https server at 127.0.0.1:8090
I171113 09:29:48.336782 12 server/server.go:947  [n1] starting grpc/postgres server at 172.17.0.3:26257
I171113 09:29:48.336809 12 server/server.go:948  [n1] advertising CockroachDB node at 172.17.0.3:26257
I171113 09:29:48.351394 12 server/server.go:1090  [n1] done ensuring all necessary migrations have run
I171113 09:29:48.351424 12 server/server.go:1092  [n1] serving sql connections
I171113 09:29:48.351528 12 cli/start.go:582  node startup completed:
CockroachDB node starting at 2017-11-13 09:29:48.351458 +0000 UTC (took 0.7s)
build:      CCL v1.1.2 @ 2017/11/07 08:40:54 (go1.9.2)
admin:      https://127.0.0.1:8090
sql:        postgresql://[email protected]:26257?application_name=cockroach&sslmode=verify-full&sslrootcert=%2Frun%2Fdcos%2Fpki%2Fcockroach%2Fca.crt
logs:
store[0]:   path=/var/lib/dcos/cockroach
status:     restarted pre-existing node
clusterID:  3dfb4c5c-dd06-4ea7-96a2-5c670fe2514b
nodeID:     1
  2. Please describe the issue you observed:
  • What did you do?
    A cluster of 3 CockroachDB nodes running v1.0.6 was operating correctly.
    Two of the nodes [node 1, node 2] were individually upgraded to v1.1.2 by stopping CockroachDB, replacing the binary, and then starting it again with the same command line.

  • What did you expect to see?
    Queries should keep working on node 3, which is still running v1.0.6.

  • What did you see instead?
    Queries on node 3 begin to fail, reporting:

sqlalchemy.exc.InternalError: (psycopg2.InternalError) version mismatch in flow request: 3; this node accepts 6 through 6

All 23 comments

Thanks for the report, @gpaul! Assigning to @andreimatei who should know more about this.

cc @vivekmenezes related to #18955.

@gpaul is the error transient, i.e. does it go away quickly after nodes 1 and 2 have been upgraded?

Quick update: it reproduces with a single 1.1.2 upgraded node while the remaining two nodes are still running 1.0.6.

After upgrading a single node, the error happened once and then seemed to stop happening (similar queries started passing). After upgrading the second of three nodes the error immediately occurs again and keeps doing so.

I'll let it run for an hour and see if the issue resolves itself after some time. I'll report back.

It's been 30 minutes and the error is still occurring. I'm gaining confidence that this won't resolve itself.

@gpaul the error not resolving itself is very surprising. Thanks for investigating. I'll go to the lab.
FWIW, as you probably know, upgrading all the nodes should fix it.

Thanks @andreimatei - yes I noticed that upgrading all the nodes solves the problem. Unfortunately we need to remain available during our upgrade. : /

Let me know if you need any help!

@andreimatei any update? If you like I can see about reproducing it in a script or some other easily digestible format. It reproduces very reliably for me. I'd love for us to upgrade to the latest version (which requires zero downtime) so please let me know if I can help in any way.

Hey @gpaul. I tried reproducing it today, but it didn't repro easily. I have to try harder with more data in a table and by varying the distribution of that data across nodes.
If you can give me a repro, that'd be gold. The ideal one would be a dump of a database plus instructions for starting the mixed-version cluster (or a script) and example queries that fail. Next best thing would be instructions for starting the cluster plus a program that populates the database plus example queries that fail (or a client program that produces some errors). Next best thing is anything else :) I will also continue playing on my own if I don't hear back.

If it brings you joy, I found other unrelated problems with mixed-version clusters while playing around for the repro.

Sure, here is a tarball containing a file repro.sh; simply run it as sh repro.sh and it reproduces the issue reliably.

cockroachdb_issue_20003_repro.tar.gz

Once the script crashes, you can keep trying to run the query and it keeps failing every time.
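
The tarball is the authoritative repro; for readers who just want the shape of it, a minimal sketch along the same lines might look as follows. The binary names cockroach-v1.0.6/cockroach-v1.1.2, the insecure localhost cluster, the ports, and the test.kv table are all illustrative assumptions, not the contents of repro.sh:

    set -e
    # A 3-node v1.0.6 cluster on localhost. Node 1 bootstraps; nodes 2 and 3 join it.
    cockroach-v1.0.6 start --insecure --background \
      --store=node1 --port=26257 --http-port=8080
    for i in 2 3; do
      cockroach-v1.0.6 start --insecure --background \
        --store=node$i --port=$((26256 + i)) --http-port=$((8079 + i)) \
        --join=localhost:26257
    done

    # Create a table and load enough rows that queries become distributed scans.
    cockroach-v1.0.6 sql --insecure --port=26257 -e "
      CREATE DATABASE test;
      CREATE TABLE test.kv (k INT PRIMARY KEY, v STRING);"
    for batch in $(seq 0 9); do
      vals=""
      for k in $(seq 1 500); do
        id=$((batch * 500 + k))
        vals="$vals($id, 'value-$id'),"
      done
      cockroach-v1.0.6 sql --insecure --port=26257 -e "INSERT INTO test.kv VALUES ${vals%,};"
    done

    # Upgrade nodes 1 and 2 in place: stop, swap the binary, restart with the
    # same store (joining via node 3, which keeps running v1.0.6).
    for i in 1 2; do
      cockroach-v1.0.6 quit --insecure --port=$((26256 + i))
      cockroach-v1.1.2 start --insecure --background \
        --store=node$i --port=$((26256 + i)) --http-port=$((8079 + i)) \
        --join=localhost:26259
    done

    # A full table scan through the remaining v1.0.6 node (node 3) now fails
    # with "version mismatch in flow request".
    cockroach-v1.0.6 sql --insecure --port=26259 -e "SELECT count(*) FROM test.kv;"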

Thanks! Looking.

OK, what's going on unfortunately is not surprising and I should have thought of it from the beginning.
1.0.x nodes do not have the mechanism to recover from these version mismatch errors. It was introduced in 1.1. Generally, our forward compatibility story has become better in 1.1 on multiple fronts.
So what's happening is that 1.0 nodes are trying to access data owned by 1.1 nodes in a specific way, and failing. Were they 1.1 nodes, they would know better than to retry (or, more likely, would not even attempt that access in the first place).
I'm sorry about that. Unfortunately backporting all the necessary machinery to 1.0 is probably not feasible at this point.

Probably the best work-around for you is to temporarily disable one of our query execution
engines, DistSQL. This execution engine is the one with this version compatibility problem. If you do SET CLUSTER SETTING sql.defaults.distsql = 0;, then all queries will use our other engine (it's called "local"). You're not really supposed to disable an engine, and you'll lose its benefits (e.g. parallel processing for large scans - and the query you gave me is a full table scan), but it can get you unstuck. Changing this cluster setting only affects new connections, so after setting it you probably have to either restart the client apps or all the nodes one by one.
And then you should be able to upgrade to 1.1 and then you can revert the setting with SET CLUSTER SETTING sql.defaults.distsql = 1;.
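
To make that concrete, here is a minimal sketch of applying and later reverting the setting from the command line, assuming the same certs directory and address as in the log header above (adjust those, run it against one of the already-upgraded nodes, and use a user that can read the client certificate):

    # Disable DistSQL as the default for new connections (run once).
    /opt/mesosphere/active/cockroach/bin/cockroach sql \
      --certs-dir=/run/dcos/pki/cockroach --host=172.17.0.3 --port=26257 \
      -e "SET CLUSTER SETTING sql.defaults.distsql = 0;"

    # ... perform the rolling upgrade to 1.1, restarting client apps (or nodes)
    # so the new default takes effect for their connections ...

    # Once every node runs 1.1, revert to the default.
    /opt/mesosphere/active/cockroach/bin/cockroach sql \
      --certs-dir=/run/dcos/pki/cockroach --host=172.17.0.3 --port=26257 \
      -e "SET CLUSTER SETTING sql.defaults.distsql = 1;"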

Does this work for you? Sorry for the trouble.

Ah, i see. Thanks for the explanation, I'll see what we can do.

Unfortunately backporting all the necessary machinery to 1.0 is probably not feasible at this point.

Man, that's a pity. We may be able to get away with having performance temporarily degraded for the duration of a cluster upgrade (i.e., disabling DistSQL), but unavailability is not an option.

1.0.x nodes do not have the mechanism to recover from this version mismatch errors [...] So what's happening is that 1.0 nodes are trying to access data owned by 1.1 nodes in a specific way, and failing.

Thanks for looking into this with priority, @andreimatei. That conclusion is disappointing for us because we had planned for a smoother upgrade path. Did we miss any specific warning or pointer about this deficiency in the docs / release notes / upgrade notes?

I agree that it's unfortunate. I don't think you missed anything; I think we simply didn't do a good enough job here. We generally do think that being able to do rolling upgrades is very important, and historically we've done a lot of work to support mixed-version clusters. In 1.1 in particular we introduced a number of backwards- and forwards-compatibility mechanisms. What you're encountering is part of the growing pains: version 1.0 shipped when it shipped, without sufficient forward-looking mechanisms in it. When we introduced this particular version mismatch, I think we considered that, since version 1.0 didn't have the right level of support for it, we'd need to backport code to a 1.0.x release and then tell people to upgrade to that before going to 1.1; somewhere along the way the effort no longer seemed worthwhile. This thread proves otherwise.

Now, if you tell me that the sql.defaults.distsql workaround is not sufficient for you, I'll look again to see if backporting something and/or giving you a custom binary is at all feasible. I do hope it won't come to that, though. Depending on your exact queries and the volume of data, the temporary degraded performance may be a theoretical concern rather than a practical one. You can perhaps test things out by disabling DistSQL on a single session and running some of your bigger queries to see if there is a problem: run SET distsql = off; and DistSQL will not be used on that one connection.
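
For example (a sketch: the table name is a placeholder, and the certs directory and address are taken from the log header above):

    # Disable DistSQL for this one session and time a representative query.
    # iam.some_table is a placeholder for one of your bigger queries.
    /opt/mesosphere/active/cockroach/bin/cockroach sql \
      --certs-dir=/run/dcos/pki/cockroach --host=172.17.0.3 --port=26257 \
      -e "SET distsql = off; SELECT count(*) FROM iam.some_table;"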

Since we're talking, and since you guys obviously care about downtime a lot, I'm curious to learn about your mindset and practices: what exactly is acceptable for you, and what isn't? Is surfacing any errors to a client when a node fails or during a rolling update acceptable? Do you have any draining procedures for moving clients away from a node before you restart it, perhaps through a load balancer? Is it acceptable if the failure of a node other than the one a client is connected to causes errors for that client? And if you had to choose between surfacing errors in that last case versus blocking the respective query for a few seconds and then servicing it, which one would you choose?

We care deeply about downtime as well as unattended upgrades, yes.

Here's some info about our use case. As you can see we heavily rely on CockroachDB's architecture (gossip, rolling upgrade, TLS).

We run one CockroachDB instance per host.
There are typically 3 or 5 hosts per cluster.
For now, each CockroachDB has a single, colocated client process.
For security reasons the communication between the client and the database relies on your support for TLS peer certificate verification.
The client always dials to the CockroachDB instance colocated with it and establishes a persistent connection upon process startup.
Initial CockroachDB cluster discovery is performed by a python wrapper through ZooKeeper.
Hosts are completely uniform and automatically provisioned, with discovery and coordination through ZooKeeper.
The client provides Identity and Access Management to the other colocated services as well as clients external to the cluster.
As such, our cluster inherits its availability from CockroachDB.

Our rolling upgrade procedure must support the following invariants:

  • The upgrade may pause partway through for a long time, with different hosts running adjacent versions for hours or days.
  • The upgrade may back out at any stage through a rolling "upgrade" to the initial version.
  • Services may be down for the time it takes O(systemd to restart them) + O(ZooKeeper quorum to reform) + O(application-specific startup sequence).
  • Upgrades are performed one node at a time by running a single shell command on each.
  • Once a host is upgraded, the operator waits until all hosts in the cluster are healthy, then proceeds to upgrade the subsequent hosts.

CockroachDB supports these invariants wonderfully.

  1. Using gossip was brilliant. The first node discovers that it must bootstrap the cluster by checking ZooKeeper and finding that it is the first of its kind. Subsequent nodes discover that the cluster has been bootstrapped, register themselves in ZooKeeper, and --join= whichever nodes have thus far registered there (see the sketch after this list).
  2. Using TLS was brilliant. Our client is colocated with each CockroachDB instance and communicates over localhost. CockroachDB's TLS peer certificate verification allows us to expose its --port so that instances can communicate with one another without requiring firewall tricks (which, given our automatic installation and upgrade procedures, would have been a non-starter).
  3. Separating the --http-port and --http-host from the internal --port and --host was brilliant. We have an authenticating reverse proxy (nginx+openresty) colocated with each server that exposes the CockroachDB Admin UI, which itself is only available over localhost (i.e., --http-host=127.0.0.1), to authorized clients.
  4. Rolling upgrade was brilliant. ;)
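
As referenced in item 1, a rough shell sketch of that bootstrap-or-join decision (purely illustrative: zk-registered-nodes and zk-register-node are made-up stand-ins for the real ZooKeeper-backed Python wrapper, and most start flags are omitted):

    # Hypothetical sketch of the bootstrap-or-join decision from item 1.
    PEERS=$(zk-registered-nodes)     # comma-separated IPs already registered, empty if none
    zk-register-node "$MY_IP"        # announce ourselves either way
    if [ -z "$PEERS" ]; then
      # First node of its kind: bootstrap the cluster by starting without --join.
      exec cockroach start --certs-dir=/run/dcos/pki/cockroach \
        --host="$MY_IP" --advertise-host="$MY_IP" --port=26257
    else
      # Cluster already bootstrapped: join whichever nodes registered so far.
      exec cockroach start --certs-dir=/run/dcos/pki/cockroach \
        --host="$MY_IP" --advertise-host="$MY_IP" --port=26257 --join="$PEERS"
    fi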

Our ideal IAM schema is quite normalized, as you might imagine (users, groups, resources, and many-to-many relationships between them all). CockroachDB v1.0.x could not handle our performance requirements.

As such, we calculated the size of the Cartesian product of our normalized tables, considered our read vs. write load, and decided that instead of using PostgreSQL (operational agony) or rqlite (lacked node-to-node encryption at the time and felt like a local optimum compared to CockroachDB), we'd maintain a single denormalized table from which to serve IAM queries. This means that the answer to "how much data do you have" is on the order of ~1-10 million records (the source tables have ~10-100 records each, and in one case ~10,000). Given that single-table scans with indexes are really damn fast with CockroachDB, this works incredibly well. It helped us move forward with CockroachDB while we eagerly await the day when we might drop our denormalized table completely and rely on fast JOINs.

I'll run some tests using distsql=off and see how that goes.

Would you be interested in sharing your work email address so we can share more details? We may be reaching out to you guys to figure something out soon.

Thanks for the support so far, I appreciate your time and effort in helping us out.

Thanks for the info! My email is [email protected].

The next 1.1 release will have a --extra-1.0-compatibility flag which makes 1.1 nodes able to accept queries planned on 1.0 gateways. This should help with upgrades, with the caveat that the flag will not be present in 1.2.
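
For anyone following along, that presumably translates to appending the flag to the existing start invocation when rolling out the patched 1.1.x binary, along these lines (a sketch based on the command line at the top of this issue; treating it as a cockroach start flag is an assumption):

    # Same start command as in the issue header, plus the compatibility flag
    # (remove it again before moving to 1.2, where the flag will not exist).
    /opt/mesosphere/active/cockroach/bin/cockroach start --logtostderr --cache=100MiB \
      --store=/var/lib/dcos/cockroach --certs-dir=/run/dcos/pki/cockroach \
      --advertise-host=172.17.0.3 --host=172.17.0.3 --port=26257 \
      --http-host=127.0.0.1 --http-port=8090 --log-dir= \
      --join=172.17.0.3,172.17.0.2,172.17.0.4 \
      --extra-1.0-compatibility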

Thanks!

Hey! This is not mentioned in the release notes for 1.1.4: https://forum.cockroachlabs.com/t/release-notes-for-v1-1-4/1266/1. Is this patch included in the 1.1.4 release?

Edit: it is contained in the release, see https://github.com/cockroachdb/cockroach/compare/v1.1.3...v1.1.4.

It is.
