Cloud-on-k8s: Cluster temporarily red during v6 -> v7 upgrade

Created on 23 Aug 2019 · 3 Comments · Source: elastic/cloud-on-k8s

Playing with the following E2E test (doesn't exist in the project yet):

// TestVersionUpgrade680To720 creates a cluster in version 6.8.0,
// and upgrades it to 7.2.0.
func TestVersionUpgrade680To720(t *testing.T) {
    // create an ES cluster with 3 x 6.8.0 nodes
    initial := elasticsearch.NewBuilder("test-version-up-680-to-720").
        WithVersion("6.8.0").
        WithESMasterDataNodes(3, elasticsearch.DefaultResources)
    // mutate it to 3 x 7.2.0 nodes
    mutated := initial.WithNoESTopology().
        WithVersion("7.2.0").
        WithESMasterDataNodes(3, elasticsearch.DefaultResources)

    test.RunMutation(t, initial, mutated)
}

It fails because the cluster temporarily gets a red health during the upgrade:

--- FAIL: TestVersionUpgrade680To720/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
        steps_mutation.go:72:
                Error Trace:    steps_mutation.go:72
                Error:          Not equal:
                                expected: 0
                                actual  : 40
                Test:           TestVersionUpgrade680To720/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:19.091763 +0200 CEST m=+63.835095547: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:20.341087 +0200 CEST m=+65.084415900: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:23.341108 +0200 CEST m=+68.084429416: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:26.344761 +0200 CEST m=+71.088074601: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:29.344415 +0200 CEST m=+74.087722373: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:32.353867 +0200 CEST m=+77.097168050: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:35.343774 +0200 CEST m=+80.087069068: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:38.334265 +0200 CEST m=+83.077555470: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:41.348929 +0200 CEST m=+86.092213747: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:44.338261 +0200 CEST m=+89.081540951: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:47.34297 +0200 CEST m=+92.086245091: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:50.340683 +0200 CEST m=+95.083954416: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:53.343611 +0200 CEST m=+98.086877803: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:56.340173 +0200 CEST m=+101.083434694: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:59.336863 +0200 CEST m=+104.080120836: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:02.339313 +0200 CEST m=+107.082567161: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:05.346571 +0200 CEST m=+110.089820498: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:14.894874 +0200 CEST m=+119.638111066: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:17.344099 +0200 CEST m=+122.087333043: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:20.340006 +0200 CEST m=+125.083235346: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:23.347806 +0200 CEST m=+128.091032397: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:26.345415 +0200 CEST m=+131.088637299: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:29.330129 +0200 CEST m=+134.073347173: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:32.338585 +0200 CEST m=+137.081799094: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:35.335134 +0200 CEST m=+140.078344710: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:38.345066 +0200 CEST m=+143.088272305: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:41.34504 +0200 CEST m=+146.088242669: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:50.34155 +0200 CEST m=+155.084740954: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:53.351347 +0200 CEST m=+158.094534273: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:56.342322 +0200 CEST m=+161.085505298: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:59.339899 +0200 CEST m=+164.083079044: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:02.344272 +0200 CEST m=+167.087447658: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:05.339829 +0200 CEST m=+170.083001228: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:08.33933 +0200 CEST m=+173.082498304: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:11.334353 +0200 CEST m=+176.077517335: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:14.341543 +0200 CEST m=+179.084703340: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:17.338709 +0200 CEST m=+182.081865536: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:20.332513 +0200 CEST m=+185.075665723: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:23.333402 +0200 CEST m=+188.076550530: cluster health red
        steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:26.336902 +0200 CEST m=+191.080047612: cluster health red

Manually checking the cluster health after the test shows green: all nodes are up and running the new version.

We may be missing something in the way we handle the zen1 -> zen2 transition. Or maybe this should be considered "normal" and we should adapt the E2E test accordingly. To investigate.
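One way to narrow this down, sketched under assumptions: the cluster health API accepts `level=indices`, which adds a per-index status to the response. The helper below lists the indices reported red. The names `healthResponse` and `redIndices` are hypothetical, and the sample body is illustrative rather than captured from the failing run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// healthResponse holds only the fields of
// GET /_cluster/health?level=indices that this sketch needs.
type healthResponse struct {
	Status  string `json:"status"`
	Indices map[string]struct {
		Status string `json:"status"`
	} `json:"indices"`
}

// redIndices returns the sorted names of the indices whose status is "red".
func redIndices(body []byte) ([]string, error) {
	var h healthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return nil, err
	}
	var red []string
	for name, idx := range h.Indices {
		if idx.Status == "red" {
			red = append(red, name)
		}
	}
	sort.Strings(red)
	return red, nil
}

func main() {
	// Illustrative response body, not output from the failing test run.
	sample := []byte(`{"status":"red","indices":{"index-a":{"status":"green"},"index-b":{"status":"red"}}}`)
	red, err := redIndices(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(red)
}
```

Calling this from the health watcher each time it observes red would show whether the same index is responsible throughout the upgrade.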

>bug

sebgl

Most helpful comment

I did some debugging:

The cluster goes red because some shards of the data-integrity-check index are unassigned.
This index has 0 replicas, so any node going down during a rolling upgrade leaves its shards unassigned.
The E2E test succeeds if I change the index to have 1 replica:

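func (dc *DataIntegrityCheck) Init() error {
    // The default is 0 replicas, to ensure we test that data migration works.
    // Changed to 1 here so that a copy of each shard stays assigned while a node restarts.
    indexSettings := `
{
    "settings" : {
        "index" : {
            "number_of_shards" : %d,
            "number_of_replicas" : 1
        }
    }
}
`
    // ... (remainder of Init not shown in the original comment)
}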

It looks like there is no bug in the operator; this is expected. Side benefit: proof that the data integrity check is useful :)
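The reasoning above can be modeled in a few lines. This is a deliberately simplified sketch, not operator or Elasticsearch code: a shard is reduced to the list of nodes holding a copy of it, and `health` applies the usual status rules (red if some shard has no copy left, yellow if some copy is missing, green otherwise). The `shard` type, the node names, and the shard placement are all assumptions made for illustration.

```go
package main

import "fmt"

// shard is a simplified model: the nodes holding a copy (primary or replica).
type shard struct {
	copies []string
}

// health returns the cluster status while the node named down is unavailable.
func health(shards []shard, down string) string {
	status := "green"
	for _, s := range shards {
		alive := 0
		for _, n := range s.copies {
			if n != down {
				alive++
			}
		}
		switch {
		case alive == 0:
			// No copy left: the primary cannot be assigned anywhere.
			return "red"
		case alive < len(s.copies):
			// A surviving copy can act as primary, but one copy is missing.
			status = "yellow"
		}
	}
	return status
}

func main() {
	nodes := []string{"node-0", "node-1", "node-2"}
	// 0 replicas: each shard lives on exactly one node.
	zeroReplicas := []shard{
		{copies: []string{"node-0"}},
		{copies: []string{"node-1"}},
		{copies: []string{"node-2"}},
	}
	// 1 replica: each shard has a second copy on another node.
	oneReplica := []shard{
		{copies: []string{"node-0", "node-1"}},
		{copies: []string{"node-1", "node-2"}},
		{copies: []string{"node-2", "node-0"}},
	}
	for _, n := range nodes {
		fmt.Printf("%s down: 0 replicas -> %s, 1 replica -> %s\n",
			n, health(zeroReplicas, n), health(oneReplica, n))
	}
}
```

In this model every single-node restart turns the 0-replica layout red and the 1-replica layout yellow, which matches what the E2E test observed.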

To move forward with this test, we probably need to make the replica count of this index configurable per test: at least 1 when testing rolling upgrades, 0 when testing data migration.
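A minimal sketch of what that could look like, assuming the settings body is still built with a format string as in the snippet above. The `scenario` type, `replicasFor`, and `settingsBody` are hypothetical names for illustration and do not exist in the project.

```go
package main

import "fmt"

// scenario names the kind of mutation an E2E test exercises (hypothetical type).
type scenario int

const (
	// dataMigration: keep 0 replicas so data loss is not hidden behind a replica.
	dataMigration scenario = iota
	// rollingUpgrade: nodes restart one by one, so at least 1 replica is needed.
	rollingUpgrade
)

// replicasFor picks the replica count of the data integrity index for a scenario.
func replicasFor(s scenario) int {
	if s == rollingUpgrade {
		return 1
	}
	return 0
}

// settingsTemplate is the settings body from the comment above,
// with the replica count turned into a second parameter.
const settingsTemplate = `
{
    "settings" : {
        "index" : {
            "number_of_shards" : %d,
            "number_of_replicas" : %d
        }
    }
}
`

// settingsBody renders the index settings for the given shard and replica counts.
func settingsBody(shards, replicas int) string {
	return fmt.Sprintf(settingsTemplate, shards, replicas)
}

func main() {
	fmt.Print(settingsBody(3, replicasFor(rollingUpgrade)))
}
```

Keeping 0 as the fallback preserves the current behaviour for every test that does not opt in to replicas.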

sebgl on 23 Aug 2019

👍2

All 3 comments

I think an HA cluster going red during a rolling upgrade should not be considered normal.

pebrc on 23 Aug 2019

👍1

Ah of course! 😞 I added that on purpose so as not to hide any data loss behind a replica! But in this case it is actually not what we want ...

pebrc on 23 Aug 2019
