Playing with the following E2E test (doesn't exist in the project yet):
// TestVersionUpgrade680To720 creates a cluster in version 6.8.0,
// and upgrades it to 7.2.0.
func TestVersionUpgrade680To720(t *testing.T) {
	// create an ES cluster with 3 x 6.8.0 nodes
	initial := elasticsearch.NewBuilder("test-version-up-680-to-720").
		WithVersion("6.8.0").
		WithESMasterDataNodes(3, elasticsearch.DefaultResources)

	// mutate it to 3 x 7.2.0 nodes
	mutated := initial.WithNoESTopology().
		WithVersion("7.2.0").
		WithESMasterDataNodes(3, elasticsearch.DefaultResources)

	test.RunMutation(t, initial, mutated)
}
It fails because the cluster temporarily gets a red health during the upgrade:
--- FAIL: TestVersionUpgrade680To720/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
steps_mutation.go:72:
Error Trace: steps_mutation.go:72
Error: Not equal:
expected: 0
actual : 40
Test: TestVersionUpgrade680To720/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:19.091763 +0200 CEST m=+63.835095547: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:20.341087 +0200 CEST m=+65.084415900: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:23.341108 +0200 CEST m=+68.084429416: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:26.344761 +0200 CEST m=+71.088074601: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:29.344415 +0200 CEST m=+74.087722373: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:32.353867 +0200 CEST m=+77.097168050: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:35.343774 +0200 CEST m=+80.087069068: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:38.334265 +0200 CEST m=+83.077555470: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:41.348929 +0200 CEST m=+86.092213747: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:44.338261 +0200 CEST m=+89.081540951: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:47.34297 +0200 CEST m=+92.086245091: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:50.340683 +0200 CEST m=+95.083954416: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:53.343611 +0200 CEST m=+98.086877803: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:56.340173 +0200 CEST m=+101.083434694: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:30:59.336863 +0200 CEST m=+104.080120836: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:02.339313 +0200 CEST m=+107.082567161: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:05.346571 +0200 CEST m=+110.089820498: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:14.894874 +0200 CEST m=+119.638111066: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:17.344099 +0200 CEST m=+122.087333043: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:20.340006 +0200 CEST m=+125.083235346: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:23.347806 +0200 CEST m=+128.091032397: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:26.345415 +0200 CEST m=+131.088637299: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:29.330129 +0200 CEST m=+134.073347173: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:32.338585 +0200 CEST m=+137.081799094: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:35.335134 +0200 CEST m=+140.078344710: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:38.345066 +0200 CEST m=+143.088272305: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:41.34504 +0200 CEST m=+146.088242669: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:50.34155 +0200 CEST m=+155.084740954: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:53.351347 +0200 CEST m=+158.094534273: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:56.342322 +0200 CEST m=+161.085505298: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:31:59.339899 +0200 CEST m=+164.083079044: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:02.344272 +0200 CEST m=+167.087447658: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:05.339829 +0200 CEST m=+170.083001228: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:08.33933 +0200 CEST m=+173.082498304: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:11.334353 +0200 CEST m=+176.077517335: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:14.341543 +0200 CEST m=+179.084703340: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:17.338709 +0200 CEST m=+182.081865536: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:20.332513 +0200 CEST m=+185.075665723: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:23.333402 +0200 CEST m=+188.076550530: cluster health red
steps_mutation.go:74: Elasticsearch cluster health check failure at 2019-08-23 13:32:26.336902 +0200 CEST m=+191.080047612: cluster health red
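Judging from the output above, the failing step appears to poll the cluster health throughout the mutation and fail if any poll returned red. A minimal sketch of that kind of check follows. The names `Health`, `HealthFailure` and `ObserveHealth` are hypothetical and are not the project's actual implementation; the health source is injected as a function so the sketch runs without a cluster.

```go
package main

import (
	"fmt"
	"time"
)

// Health is a simplified cluster health status ("green", "yellow" or "red").
type Health string

// HealthFailure records one poll during which the cluster was red.
// (Hypothetical type, for illustration only.)
type HealthFailure struct {
	At     time.Time
	Status Health
}

// ObserveHealth calls check n times, sleeping for interval between calls,
// and returns one HealthFailure per poll that reported red.
func ObserveHealth(check func() Health, n int, interval time.Duration) []HealthFailure {
	var failures []HealthFailure
	for i := 0; i < n; i++ {
		if status := check(); status == "red" {
			failures = append(failures, HealthFailure{At: time.Now(), Status: status})
		}
		if i < n-1 {
			time.Sleep(interval)
		}
	}
	return failures
}

func main() {
	// Fake health source: red on the 2nd and 3rd polls, green otherwise.
	statuses := []Health{"green", "red", "red", "green"}
	i := 0
	check := func() Health {
		s := statuses[i]
		i++
		return s
	}

	failures := ObserveHealth(check, len(statuses), time.Millisecond)
	fmt.Printf("red observations: %d\n", len(failures))
}
```

A test built this way would assert that the returned slice is empty, which matches the "expected: 0, actual: 40" assertion in the trace.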
Manually checking the cluster after the test shows a green health, with all nodes up and running the new version.
We may be missing something in the way we handle the zen1 -> zen2 transition. Or maybe this should be considered "normal" and we should adapt the E2E test accordingly. To investigate.
Related to https://github.com/elastic/cloud-on-k8s/issues/822.
I think an HA cluster going red during a rolling upgrade should not be considered normal.
I did some debugging:
The shards of the data-integrity-check index are unassigned. That index is created with 0 replicas:
func (dc *DataIntegrityCheck) Init() error {
	// default to 0 replicas to ensure we test data migration works
	indexSettings := `
{
    "settings" : {
        "index" : {
            "number_of_shards" : %d,
            "number_of_replicas" : 0
        }
    }
}
`
	// ...
}
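Looks like there is no bug in the operator; this is expected. With 0 replicas, each shard lives on a single node, so the index goes red whenever that node restarts during the rolling upgrade. Side benefit: proof that the data integrity check is useful :)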
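To illustrate why a replica-less index turns the cluster red while a node restarts, here is a small, simplified model of index health. It is only a sketch: `clusterHealth` and the shard layouts are made up for illustration and do not reproduce Elasticsearch's actual allocation logic.

```go
package main

import "fmt"

// clusterHealth returns a simplified health for one index:
//   - "red" if any shard has no copy on a running node,
//   - "yellow" if every shard is available but some copies are missing,
//   - "green" if all copies are available.
//
// copies[s] lists the nodes holding a copy (primary or replica) of shard s.
func clusterHealth(copies [][]string, down map[string]bool) string {
	health := "green"
	for _, nodes := range copies {
		available := 0
		for _, n := range nodes {
			if !down[n] {
				available++
			}
		}
		switch {
		case available == 0:
			return "red"
		case available < len(nodes):
			health = "yellow"
		}
	}
	return health
}

func main() {
	// node-0 is being restarted as part of the rolling upgrade.
	down := map[string]bool{"node-0": true}

	// 3 shards, 0 replicas: each shard lives on exactly one node.
	noReplica := [][]string{{"node-0"}, {"node-1"}, {"node-2"}}
	// 3 shards, 1 replica: each shard has copies on two different nodes.
	oneReplica := [][]string{
		{"node-0", "node-1"},
		{"node-1", "node-2"},
		{"node-2", "node-0"},
	}

	fmt.Println(clusterHealth(noReplica, down))  // red
	fmt.Println(clusterHealth(oneReplica, down)) // yellow
}
```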
To move forward with this test, we probably need to make the number of replicas of this index configurable per test: at least 1 when testing a rolling upgrade, 0 when testing data migration.
Ah of course! 😞 I added that on purpose so as not to hide any data loss behind a replica! But in this case it is actually not what we want ...