Elasticsearch: Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0

Created on 18 May 2018 · 29 comments · Source: elastic/elasticsearch

This is a meta-issue to track the work needed to enable smooth upgrades from the default distribution prior to 6.3.0 to the default distribution post 6.3.0. The sub-tasks are:

  • [x] add detection when we are communicating with a >= 6.3 transport client so that we can avoid sending it pieces of the cluster state that it would not be able to deserialize (the problem here is when an OSS transport client connects to a 6.3.0 default cluster) #31020 @jasontedor
  • [x] same thing for < 6.3 transport clients @ywelsch #30971
  • [x] add a node attribute, injected by X-Pack, so that we can detect when we are connected to a node that understands X-Pack metadata in the cluster state @ywelsch #30743 #30963
  • [x] reject cluster state updates that contain X-Pack metadata when the cluster is not prepared for it (i.e., if there is not already X-Pack metadata in the cluster state and there are nodes in the cluster state that cannot understand X-Pack metadata, based on the above attribute) @ywelsch #30743

    • [x] fix the watcher template that uses a custom index setting: an OSS master in a mixed cluster will not be able to add this template and will repeatedly fail the PUT template request initiated by the x-pack node, spamming the logs @ywelsch #30832

  • [x] ensure that all features (e.g., ML today) that rely on custom cluster state metadata can properly handle that metadata not being present (e.g., during a rolling upgrade scenario) @droberts195 #30751
  • [x] add a rolling upgrade test for 3 nodes (we have one for 2 nodes today) to trigger this scenario during testing @nik9000 #30728

    • [x] fix PersistentTaskParams so that it knows about the versions etc. Assume for example a mixed 6.2 / 6.3 x-pack cluster. If you start a rollup task, this will be put as persistent task into the cluster state. A 6.2 x-pack node cannot deserialize this task. @bleskes #31045

    • [x] add more tests (e.g. that rollups cannot be created in a mixed OSS/X-Pack cluster. Mixed-cluster X-pack tests) @nik9000 #31036
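The node-attribute gating described in the checklist above can be sketched as follows. This is an illustrative stand-in, not the actual Elasticsearch classes: `Node` here is a minimal model of a discovery node, and the attribute name mirrors the `xpack.installed` attribute discussed in this issue.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: before accepting a cluster-state update that introduces
// X-Pack custom metadata, check that every node in the cluster advertises the
// "xpack.installed" node attribute injected by X-Pack.
public class XPackMetadataGate {

    // Minimal stand-in for DiscoveryNode: just a name and its attributes.
    record Node(String name, Map<String, String> attributes) {}

    static final String XPACK_INSTALLED_ATTR = "xpack.installed";

    /** True only if every node claims to understand X-Pack metadata. */
    static boolean allNodesUnderstandXPack(List<Node> nodes) {
        return nodes.stream()
            .allMatch(n -> "true".equals(n.attributes().get(XPACK_INSTALLED_ATTR)));
    }

    public static void main(String[] args) {
        Node xpackNode = new Node("node-0", Map.of(XPACK_INSTALLED_ATTR, "true"));
        Node ossNode = new Node("node-1", Map.of()); // OSS node: no attribute

        // Mixed cluster: an update introducing X-Pack metadata must be rejected.
        System.out.println(allNodesUnderstandXPack(List.of(xpackNode, ossNode))); // false
        // Homogeneous X-Pack cluster: the update may proceed.
        System.out.println(allNodesUnderstandXPack(List.of(xpackNode)));          // true
    }
}
```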

:Distributed/Cluster Coordination :Distributed/Network :ml blocker v6.3.0

Most helpful comment

The weekend went well as far as the backwards compatibility builds go! I'm happy to say that the upgrades looked great. I think I'm done here.

All 29 comments

Pinging @elastic/es-core-infra

Pinging @elastic/es-distributed

Pinging @elastic/ml-core

Thanks for putting this list up. I think we should also deal with the TokenMetaData injected here when called from here.

I think we should also deal with the TokenMetaData injected here when called from here.

It looks like @ywelsch took care of it in https://github.com/elastic/elasticsearch/pull/30743/files

Now that #30743 is merged I wanted to test this. The 6.3 branch works perfectly for me. The 6.x branch is failing though. That probably isn't a 6.3 release blocker but it is weird. The failure comes during the 5.6.10 upgrade to 6.x. The failure is:

[2018-05-23T12:59:33,101][ERROR][o.e.x.w.s.WatcherIndexTemplateRegistry] [node-0] Error adding watcher template [.watch-history-8]
org.elasticsearch.transport.RemoteTransportException: [node-2][127.0.0.1:41731][indices:admin/template/put]
Caused by: java.lang.IllegalArgumentException: unknown setting [index.xpack.watcher.template.version] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:293) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:256) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:246) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction.masterOperation(TransportPutIndexTemplateAction.java:80) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction.masterOperation(TransportPutIndexTemplateAction.java:42) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]

The test that actually fails is "org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Test old multi type stuff}", but it only fails because one of its actions times out while the cluster is busy retrying the request above over and over again.

@nik9000 This is a real problem (the x-pack node tries to add a template with x-pack only settings, and the OSS master rejects it). I'm not sure why it's not triggered by the test on the 6.3 branch as the same templating behavior exists there (for Watcher, Security etc.) as well. I consider this a blocker for 6.3 and will work on a solution tomorrow.

I consider this a blocker for 6.3 and will work on a solution tomorrow.

:heart:

It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election.

It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this. I didn't want to have to use watcher in these tests because it has so much state, but I suspect I have no choice here.

I consider this a blocker for 6.3 and will work on a solution tomorrow.

+1, this is a blocker.

The issue with Watcher is that it uses a custom setting in its template. I've gone through the other X-Pack services to check whether they present the same issue:

  • SecurityIndexManager adds templates (security-index-template.json) but no custom settings in template
  • IndexAuditTrail adds templates (security_audit_log.json) but no custom settings in template
  • LocalExporter adds templates (monitoring-*.json) but no custom settings. It also adds an ingest pipeline, but again no customs.
  • WatcherIndexTemplateRegistry adds templates (triggered-watches.json, watch-history.json, watches.json). Only watch-history uses a custom setting (xpack.watcher.template.version).

I'll explore getting rid of the xpack.watcher.template.version setting, and using the same approach as has been used for the other templates (e.g. security-index-template or security_audit_log.json) where there's a custom _meta field in the mapping.
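The difference between the two approaches can be sketched like this (illustrative names and a tiny made-up settings whitelist, not the real `WatcherIndexTemplateRegistry` code): an OSS master validates every index-level setting in a template, so an X-Pack-only setting such as `index.xpack.watcher.template.version` is rejected, while a `_meta` field inside the mapping is opaque to the server and passes validation on any master.

```java
import java.util.Map;
import java.util.Set;

// Sketch: carrying a template version as a custom index setting vs. as a
// "_meta" field in the mapping. Only the settings are validated by the master.
public class TemplateVersioning {

    // Settings an OSS master knows about (tiny illustrative whitelist).
    static final Set<String> KNOWN_SETTINGS =
        Set.of("index.number_of_shards", "index.auto_expand_replicas");

    static boolean settingsValid(Map<String, String> settings) {
        return KNOWN_SETTINGS.containsAll(settings.keySet());
    }

    public static void main(String[] args) {
        // Old approach: version as a custom index setting -> rejected by an OSS master.
        Map<String, String> withCustomSetting = Map.of(
            "index.number_of_shards", "1",
            "index.xpack.watcher.template.version", "8");
        System.out.println(settingsValid(withCustomSetting)); // false

        // New approach: version in the mapping's _meta, settings stay vanilla.
        Map<String, String> plainSettings = Map.of("index.number_of_shards", "1");
        Map<String, Object> mapping = Map.of("_meta", Map.of("version", 8));
        System.out.println(settingsValid(plainSettings)); // true
        System.out.println(((Map<?, ?>) mapping.get("_meta")).get("version")); // 8
    }
}
```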

I've opened #30832 for the watcher issue.

It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election.

@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests.

@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests.

I figured as much. Earlier I'd said:

It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this.

and that is still my plan. I got distracted by other things and didn't end up writing the test.

So in my grand tradition of finding things, I believe the following is flaky:

git checkout 6.x
while ./gradlew -p qa/rolling-upgrade/ check -Dtests.distribution=zip; do echo ok; done

On my desktop about half of those runs fail with:

[2018-05-24T16:51:55,044][INFO ][o.e.c.s.MasterService    ] [node-0] zen-disco-elected-as-master ([2] nodes joined)[, ], reason: new_master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127
.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}
[2018-05-24T16:51:55,060][INFO ][o.e.c.s.ClusterApplierService] [node-0] new_master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.insta
lled=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_m
emory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true} committed version [751] source [zen-disco-elected-as-master ([2] nodes joined)[, ]]])
[2018-05-24T16:51:55,085][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] zen-disco-failed-to-publish, current nodes: nodes: 
   {node-0}{GsF5lxmmSQiemBzLM6Csbw}{lu_kNBYlROqRavg-7XxHSg}{127.0.0.1}{127.0.0.1:40615}{testattr=test, ml.machine_memory=33651564544, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, local, master
   {node-2}{ZOcWRGW1QNaBzGNXwigiBw}{xiHQaAk_ToS9Y52SnpY4Ww}{127.0.0.1}{127.0.0.1:42023}{testattr=test, gen=old, ml.machine_memory=33651564544, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

[2018-05-24T16:51:55,085][WARN ][o.e.c.s.MasterService    ] [node-0] failing [maybe generate license for cluster]: failed to commit cluster state version [752]
org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: failed to get enough masters to ack sent cluster state. [1] left
        at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:525) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:336) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:225) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:132) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
[2018-05-24T16:51:55,086][ERROR][o.e.l.StartupSelfGeneratedLicenseTask] [node-0] unexpected failure during [maybe generate license for cluster]
org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: failed to get enough masters to ack sent cluster state. [1] left
        at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:525) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:336) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:225) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:132) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]

Which looks pretty incriminating. There are definitely three available, master-eligible nodes; it just logged about them.

But this is 6.x testing the upgrade from 6.3. I've seen no trouble going from 5.6 to 6.x or from 6.2 to 6.x. Go figure.

The other nodes see a NotMasterException, but I think they are deserializing that from the master, which doesn't think it is the master.

The underlying issue is in the licensing component, which looks to be backwards-incompatible between 6.3 and 6.4. A 6.3 x-pack node cannot read the license metadata issued by a 6.4 x-pack node. This was caused by the recent PR https://github.com/elastic/elasticsearch/pull/30251, which changes the License.VERSION_CURRENT field. The last time this field was changed was for 5.0, where we required a full-cluster restart for upgrades. As this does not affect upgrades from 5.x or 6.x (< 6.3) to 6.3, it should not be a blocker for the 6.3 release. It will be a blocker for the 6.4 release and prevents any further testing of rolling upgrades from 6.3 to 6.4.

The full stack trace is:

[2018-05-25T10:03:27,213][DEBUG][o.e.d.z.PublishClusterStateAction] [node-0] failed to send cluster state to {node-2}{TElgN8JNRH6EXapvEUzXlA}{bR5b0qKpSqiioj0sWLpjMA}{127.0.0.1}{127.0.0.1:38939}{testattr=test, gen=old, ml.machine_memory=67423862784, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
org.elasticsearch.transport.RemoteTransportException: [node-2][127.0.0.1:38939][internal:discovery/zen/publish/send]
Caused by: org.elasticsearch.ElasticsearchException: Unknown license version found, please upgrade all nodes to the latest elasticsearch-license plugin
    at org.elasticsearch.license.License.readLicense(License.java:306) ~[?:?]
    at org.elasticsearch.license.LicensesMetaData.<init>(LicensesMetaData.java:175) ~[?:?]
    at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:46) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.AbstractNamedDiffable$CompleteNamedDiff.<init>(AbstractNamedDiffable.java:87) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.AbstractNamedDiffable.readDiffFrom(AbstractNamedDiffable.java:47) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.license.LicensesMetaData.readDiffFrom(LicensesMetaData.java:188) ~[?:?]
    at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:46) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.NamedDiffableValueSerializer.readDiff(NamedDiffableValueSerializer.java:56) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.NamedDiffableValueSerializer.readDiff(NamedDiffableValueSerializer.java:30) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.DiffableUtils$MapDiff.<init>(DiffableUtils.java:392) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.DiffableUtils$ImmutableOpenMapDiff.<init>(DiffableUtils.java:221) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.DiffableUtils.readImmutableOpenMapDiff(DiffableUtils.java:120) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.MetaData$MetaDataDiff.<init>(MetaData.java:726) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.MetaData.readDiffFrom(MetaData.java:682) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterState$ClusterStateDiff.<init>(ClusterState.java:743) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterState.readDiffFrom(ClusterState.java:660) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.discovery.zen.PublishClusterStateAction.handleIncomingClusterStateRequest(PublishClusterStateAction.java:385) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendClusterStateRequestHandler.messageReceived(PublishClusterStateAction.java:436) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendClusterStateRequestHandler.messageReceived(PublishClusterStateAction.java:432) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1542) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1501) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1371) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) [transport-netty4-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
    at java.lang.Thread.run(Thread.java:844) [?:?]

/cc: @jkakavas @jaymode @rjernst

I've created #30859 as a fix for the licensing serialization issue.
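The idea behind such a fix can be sketched as version-gated wire serialization (illustrative names and version numbers, not the actual `License`/`LicensesMetaData` code): when the receiver's version predates the current license format, serialize the license in the older format instead of one the receiver cannot read.

```java
// Hedged sketch of version-gated serialization of license metadata.
public class LicenseWireFormat {

    static final int VERSION_CURRENT = 4; // hypothetical bump analogous to #30251
    static final int VERSION_6_3 = 3;     // hypothetical format a 6.3 node understands

    /** Pick the license wire version based on what the destination node supports. */
    static int wireVersionFor(int receiverMaxLicenseVersion) {
        return Math.min(VERSION_CURRENT, receiverMaxLicenseVersion);
    }

    public static void main(String[] args) {
        // Sending to a 6.3 node: fall back to the format it can deserialize.
        System.out.println(wireVersionFor(VERSION_6_3));     // 3
        // Sending to another 6.4 node: use the current format.
        System.out.println(wireVersionFor(VERSION_CURRENT)); // 4
    }
}
```

This only helps once both sides negotiate versions, which matches the observation below that the mixed-cluster situation needs separate discussion.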

@nik9000 With my fix in #30859, I have this consistently passing now

The fix only helps in case of a rolling restart, so the mixed-cluster tests still fail with this fix on 6.x. I will have to discuss with @elastic/es-security what can be done about the mixed-cluster situation.

Reminder to self: We also need to fix PersistentTaskParams so that it knows about the versions etc. Assume for example a mixed 6.2 / 6.3 x-pack cluster. If you start a rollup task, this will be put as persistent task into the cluster state. A 6.2 x-pack node cannot deserialize this task.
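The PersistentTaskParams fix can be sketched like this (illustrative, not the real API from #31045): each task's params declare the oldest node version that can deserialize them, and tasks are filtered out of the cluster state sent to older nodes.

```java
import java.util.List;

// Sketch: filter persistent tasks by the receiving node's version so a 6.2
// node never receives a rollup task it cannot deserialize.
public class PersistentTaskFiltering {

    record TaskParams(String name, int minimalSupportedVersion) {}

    // Hypothetical numeric version ids.
    static final int V_6_2 = 6_02_00;
    static final int V_6_3 = 6_03_00;

    /** Keep only tasks whose params the receiving node can deserialize. */
    static List<TaskParams> tasksFor(int nodeVersion, List<TaskParams> tasks) {
        return tasks.stream()
            .filter(t -> nodeVersion >= t.minimalSupportedVersion())
            .toList();
    }

    public static void main(String[] args) {
        TaskParams rollup = new TaskParams("xpack/rollup/job", V_6_3); // new in 6.3
        TaskParams mlJob = new TaskParams("xpack/ml/job", V_6_2);

        // A 6.2 node must not receive the rollup task.
        System.out.println(tasksFor(V_6_2, List.of(rollup, mlJob)).size()); // 1
        System.out.println(tasksFor(V_6_3, List.of(rollup, mlJob)).size()); // 2
    }
}
```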

add more tests (e.g. that rollups cannot be created in a mixed OSS/X-Pack cluster. Mixed-cluster X-pack tests)

I'll take this.

We have another test failure that I believe belongs here as well (there are several tests failing but they appear to do so for the same reason).

The first one of them has the reproduction line:

./gradlew :qa:rolling-upgrade:v6.3.0-SNAPSHOT#twoThirdsUpgradedTestRunner -Dtests.seed=F070E6B561B268B2 -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade}" -Dtests.security.manager=true -Dtests.locale=nl-BE -Dtests.timezone=America/Cuiaba -Dtests.distribution=zip -Dtests.rest.suite=mixed_cluster



07:23:15 ERROR 67.1s | UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade} <<< FAILURES!
07:23:15 > Throwable #1: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable]
07:23:15 > {
07:23:15 > "name" : "node-0",
07:23:15 > "cluster_name" : "rolling-upgrade",
07:23:15 > "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A",
07:23:15 > "version" : {
07:23:15 > "number" : "6.4.0",
07:23:15 > "build_flavor" : "default",
07:23:15 > "build_type" : "zip",
07:23:15 > "build_hash" : "1eede11",
07:23:15 > "build_date" : "2018-06-04T06:30:39.454194Z",
07:23:15 > "build_snapshot" : true,
07:23:15 > "lucene_version" : "7.4.0",
07:23:15 > "minimum_wire_compatibility_version" : "5.6.0",
07:23:15 > "minimum_index_compatibility_version" : "5.0.0"
07:23:15 > },
07:23:15 > "tagline" : "You Know, for Search"
07:23:15 > }
07:23:15 > at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:821)
07:23:15 > at org.elasticsearch.client.RestClient.performRequest(RestClient.java:182)
07:23:15 > at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227)
07:23:15 > at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.readVersionsFromInfo(ESClientYamlSuiteTestCase.java:282)
07:23:15 > at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.initAndResetContext(ESClientYamlSuiteTestCase.java:106)
07:23:15 > at java.lang.Thread.run(Thread.java:748)
07:23:15 > Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable]
07:23:15 > {
07:23:15 > "name" : "node-0",
07:23:15 > "cluster_name" : "rolling-upgrade",
07:23:15 > "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A",
07:23:15 > "version" : {
07:23:15 > "number" : "6.4.0",
07:23:15 > "build_flavor" : "default",
07:23:15 > "build_type" : "zip",
07:23:15 > "build_hash" : "1eede11",
07:23:15 > "build_date" : "2018-06-04T06:30:39.454194Z",
07:23:15 > "build_snapshot" : true,
07:23:15 > "lucene_version" : "7.4.0",
07:23:15 > "minimum_wire_compatibility_version" : "5.6.0",
07:23:15 > "minimum_index_compatibility_version" : "5.0.0"
07:23:15 > },
07:23:15 > "tagline" : "You Know, for Search"
07:23:15 > }
07:23:15 > at org.elasticsearch.client.RestClient$1.completed(RestClient.java:495)
07:23:15 > at org.elasticsearch.client.RestClient$1.completed(RestClient.java:484)
07:23:15 > at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
07:23:15 > at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
07:23:15 > at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
07:23:15 > at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
07:23:15 > at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
07:23:15 > at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
07:23:15 > at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
07:23:15 > at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
07:23:15 > at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
07:23:15 > at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
07:23:15 > ... 1 more
07:23:15 > Throwable #2: java.lang.AssertionError: there are still running tasks:
07:23:15 > {time_in_queue=15ms, time_in_queue_millis=15, source=zen-disco-elected-as-master ([2] nodes joined), executing=true, priority=URGENT, insert_order=185}
07:23:15 > {time_in_queue=1ms, time_in_queue_millis=1, source=install-token-metadata, executing=false, priority=URGENT, insert_order=186}
07:23:15 > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=187}
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:767)
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.waitForClusterStateUpdatesToFinish(ESRestTestCase.java:338)
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:151)
07:23:15 > at java.lang.Thread.run(Thread.java:748)
07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15 > {time_in_queue=86ms, time_in_queue_millis=86, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15 > {time_in_queue=71ms, time_in_queue_millis=71, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15 > {time_in_queue=4ms, time_in_queue_millis=4, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15 > ... 37 more
07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15 > {time_in_queue=96ms, time_in_queue_millis=96, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15 > {time_in_queue=81ms, time_in_queue_millis=81, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15 > {time_in_queue=14ms, time_in_queue_millis=14, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15 > ... 37 more
07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15 > {time_in_queue=102ms, time_in_queue_millis=102, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15 > {time_in_queue=86ms, time_in_queue_millis=86, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15 > {time_in_queue=20ms, time_in_queue_millis=20, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15 > ... 37 more
07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15 > {time_in_queue=109ms, time_in_queue_millis=109, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15 > {time_in_queue=94ms, time_in_queue_millis=94, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15 > {time_in_queue=27ms, time_in_queue_millis=27, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15 > ... 37 more
07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15 > {time_in_queue=106ms, time_in_queue_millis=106, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15 > {time_in_queue=40ms, time_in_queue_millis=40, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15 > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=49}
07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15 > ... 37 more

Full cluster logs are available in rolling-upgrade-cluster-logs.zip
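For context on the log excerpt above: the repeated "there are still running tasks" errors come from the test cleanup in `ESRestTestCase`, which polls the cluster's pending-task queue inside `assertBusy` and retries with backoff until the queue drains or the timeout expires, attaching each failed attempt as a suppressed error. The sketch below is a hypothetical, self-contained reconstruction of that polling pattern (the `waitForNoPendingTasks` method and the stubbed task supplier are illustrative, not the actual Elasticsearch test-framework code):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of an assertBusy-style wait over the cluster's
// pending-task queue. Each unsuccessful poll is recorded as a suppressed
// AssertionError, which mirrors the chained "Suppressed:" entries in the log.
public class WaitForPendingTasks {

    static void waitForNoPendingTasks(Supplier<List<String>> pendingTasks,
                                      long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        AssertionError last = null;
        long sleepMillis = 1;
        while (System.currentTimeMillis() < deadline) {
            List<String> tasks = pendingTasks.get();
            if (tasks.isEmpty()) {
                return; // queue drained; cleanup can proceed
            }
            AssertionError attempt =
                new AssertionError("there are still running tasks: " + tasks);
            if (last != null) {
                attempt.addSuppressed(last); // keep the history of failed polls
            }
            last = attempt;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 500); // simple backoff
        }
        throw last != null ? last : new AssertionError("timed out with no polls");
    }

    public static void main(String[] args) {
        // Simulated queue that drains after three polls, standing in for the
        // "maybe generate license for cluster" tasks seen in the log.
        AtomicInteger polls = new AtomicInteger();
        waitForNoPendingTasks(
            () -> polls.incrementAndGet() < 3
                    ? List.of("maybe generate license for cluster")
                    : List.of(),
            10_000);
        System.out.println("cluster state update queue drained");
    }
}
```

In the failing runs above the queue never drains within the timeout, so the final `AssertionError` is thrown with the earlier attempts attached as suppressed exceptions.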

I had a look at @danielmitterdorfer's failure. A few things:

  1. This is an x-pack to x-pack upgrade from 6.3 to 6.x
  2. I read the logs to say that:
    a. Everything goes fine until we upgrade the second node.
    b. Once the second upgraded node comes online the first upgraded node is elected the master.
    c. It starts doing housekeeping like upgrading templates.
    d. It can't sync the cluster state.

I'd expect there to be a failure somewhere in the log describing how the cluster state sync failed. But I can't find one. All of the exceptions have to do with the restarts and the cluster not having a valid master after the incident.

@nik9000 @danielmitterdorfer this will be fixed by #30859. It's not blocking the 6.3 release, but the 6.4 release.

I've opened #31112 to make the x-pack upgrade tests (all three of them) use three nodes. It isn't perfect but it is about as complex as I'd like to get and still backport to 6.3.

So I merged #31112 to master and 6.x yesterday but that caused all kinds of problems. I'm trying to un-break them now. I'll merge to 6.3 once everything is calmer in the branches I've already merged to.

I've finished backporting #31112 to 6.3. We can see how it does over the weekend.

#31195 is still open to enable one of the tests after the backport, but it is an upgrade test from 5.6.10 to 6.3, so I think we're fairly ok. It is almost certainly a test bug.

The weekend went well as far as the backwards compatibility builds go! I'm happy to say that the upgrades looked great. I think I'm done here.

Thank you to everyone who contributed; this was a great effort all around.

Closed by the hard work of a lot of people

