Cloudformation-coverage-roadmap: AWS::ElasticSearch::Domain - support in-place upgrades

Created on 15 Aug 2019 · 45 Comments · Source: aws-cloudformation/cloudformation-coverage-roadmap

  1. Title -> AWS::ElasticSearch::Domain
  2. Scope of request -> Support in-place version upgrades. Currently attempting to change the version causes replacement of the cluster.
  3. Expected behavior -> Apply an update - like in the console
  4. Category tag (optional) -> Analytics
analytics enhancement

All 45 comments

@danieljamesscott just confirming that this is an existing property of an existing resource that you'd like to behave differently?

Yes - like for RDS version upgrades. When I change the version in the definition for RDS, the instance is upgraded. When I change the Elasticsearch version, the instance is replaced.

When the ElasticsearchVersion parameter of AWS::ElasticSearch::Domain changes and the change is a supported upgrade (e.g. from 6.7 to 6.8), CloudFormation should upgrade the cluster with an UpgradeElasticsearchDomain API call instead of a CreateElasticsearchDomain API call.
A proper error message must be given in case of an unsupported upgrade.

This must also work with 'named' clusters.

Agreed, we will be stuck at this version until this is supported, as we do not want to recreate a cluster. The ES service fully supports this; CFN just needs to make the correct API call and apply the right logic.

I see this is being worked on, but an alternative approach could also be helpful: I don't mind manually upgrading ES in the console, and then setting the new version in my template. When a new version is found in an update, but the version matches the current domain version, do not take any action.

This would help with situations where domain upgrades can take multiple hours.
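
The behaviour requested in the comments above (upgrade in place when the path is supported, reject unsupported paths with a clear error, and take no action when the domain already runs the requested version) can be sketched as a small decision function. This is only an illustration, not how CloudFormation is implemented: `plan_version_change`, the `Action` names, and the hard-coded compatibility table are hypothetical, and a real implementation would ask the service which target versions are compatible.

```python
from enum import Enum


class Action(Enum):
    NO_OP = "no-op"      # domain already runs the requested version
    UPGRADE = "upgrade"  # supported in-place upgrade
    REJECT = "reject"    # unsupported path: fail with a clear message


# Hypothetical compatibility table, for illustration only.
COMPATIBLE_TARGETS = {
    "6.7": {"6.8"},
    "6.8": {"7.4"},
}


def plan_version_change(current, requested, compatible=COMPATIBLE_TARGETS):
    """Return (action, message) for a requested version change."""
    if requested == current:
        return Action.NO_OP, f"Domain is already at {current}; nothing to do."
    if requested in compatible.get(current, set()):
        return Action.UPGRADE, f"Upgrade in place from {current} to {requested}."
    return Action.REJECT, f"Upgrade from {current} to {requested} is not supported."


if __name__ == "__main__":
    for current, requested in [("6.7", "6.8"), ("6.8", "6.8"), ("6.8", "5.6")]:
        action, message = plan_version_change(current, requested)
        print(action.value, "-", message)
```

The point of the `NO_OP` branch is that a template edited after a manual console upgrade would be accepted without triggering a second upgrade attempt.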

For those looking for the documentation:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-upgradeelasticsearchdomain

I think there's an issue with the amount of time taken to update the cluster - I'm not sure how CF works, but the ES resource in CF needs to send a 'still going' notification back to CF. After an hour, the CF stack times out with "Updating Elasticsearch Domain did not stabilize". My cluster still shows "Upgrade processing", and now my stack is inconsistent with the state of the cluster. :(
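
To make the failure mode concrete, here is a minimal simulation of a "wait until stable or time out" loop. It is a model of the symptom described above, not CloudFormation's actual code: `wait_until_stable`, `FakeClock`, and `simulate` are hypothetical names, and the one-hour timeout and one-minute polling interval are assumptions taken from the reported behaviour.

```python
def wait_until_stable(poll, timeout_s, interval_s, sleep, now):
    """Poll until `poll()` returns True or `timeout_s` elapses.

    `sleep` and `now` are injected so the loop can be simulated.
    Returns True if the resource stabilized, False on timeout.
    """
    deadline = now() + timeout_s
    while now() < deadline:
        if poll():
            return True
        sleep(interval_s)
    return False


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


def simulate(upgrade_duration_s, timeout_s=3600, interval_s=60):
    """Return True if an upgrade of the given length finishes before the timeout."""
    clock = FakeClock()

    def upgrade_done():
        return clock.now() >= upgrade_duration_s

    return wait_until_stable(upgrade_done, timeout_s, interval_s, clock.sleep, clock.now)


if __name__ == "__main__":
    print(simulate(40 * 60))    # 40-minute upgrade
    print(simulate(4 * 3600))   # multi-hour upgrade
```

In this model, any upgrade longer than the timeout is reported as a failure even though the upgrade itself is still running, which is how the stack and the domain end up out of sync.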

@danieljamesscott This is exactly what I was worried about in my comment above; an upgrade can take 4 hours.
Edit: Thanks for being brave and testing for the rest of us.

urgh ...

Yes, but RDS seems to handle it just fine - although it's a slightly different mechanism, there is no UpdatePolicy. It may all work out OK eventually, if I re-apply the change, once the upgrade has finished. (As always, I applied this in our test account before trying it on anything important... ;) )

I don't think it's particularly important, but the cluster upgrade has just completed - took 20 hours.

Ahh, wonderful:

Failed to submit upgrade. Upgrade from 6.8 to 6.8 not supported. (Service: AWSElasticsearch; Status Code: 400; Error Code: BaseException; Request ID: 534b4933-10fb-11ea-821f-3fe3ace9e9ea)

Is there any update on this - I need the features of newer versions of ES.

As far as we've been able to investigate, the work from CloudFormation's side on this is complete. We reopened the issue to let people tell us if there are any show-stopping bugs. It doesn't appear that this is the case, other than the upgrades taking a long time. Let us know if we need to reopen this, but we'll close it for now.

@luiseduardocolon - I think the implementation is wrong, as timeouts will be very common unless you have a cluster of 1 or 2 nodes, and that's not a normal use case. I think the wait condition needs to be removed: when the version is bumped, CloudFormation should check that the upgrade has started, but not wait for it to report a final status.

I'm not sure whether it's CF or ES that needs to fix it, but something is badly wrong with this. Any cluster upgrade which takes longer than the CF timeout (1hr) results in a broken stack.

Also, it would be a nice workaround if CF could accept an update to the version which checks the actual cluster version, and accepts the change if the cluster is already upgraded to that version.

As it stands, I don't see how this can be seen as 'complete'. Any 'production' cluster will surely take longer than 1hr to upgrade and end up in a bad state.

@luiseduardocolon I and others have not used this feature yet because the thread made it obvious the work was not complete. Please give us strong assurances that the issues experienced by others have been resolved and then maybe I will attempt it on my own cluster.

Reopening to investigate further.

As it turns out, a bug on the ElasticSearch (ES) side was discovered that caused the slow operations. The bug was fixed by the ES team. We have received guidance that we don't need to increase our timeouts since these operations should not take this long in the first place, and that existing timeouts (1 hr) are appropriate for these upgrade operations. I strongly recommend that you retry your in-place upgrades again (if possible) so you can validate that this has been resolved.

I ran a cluster upgrade yesterday and I still saw this issue. The cluster upgrade took ~1h10m, barely exceeding the timeout limit for cfn. And this was for a dev cluster, which didn't contain many documents, nor was it scaled to the size of our prod fleet. I'm OK with ES taking a long time; I just want to be able to do a subsequent cfn stack deployment and not have it fail for not being able to upgrade to the same version.

I'm testing the upgrade today, and while it ran correctly on a first domain, on other larger domains I still ran into "Updating Elasticsearch Domain did not stabilize".

After the 1hr timeout, it triggered a CloudFormation rollback, which of course also fails, with: "Caution: version Downgrade is not supported during update rollback progress. Inconsistency could exist between stack template and the actual resource."

Also a word of warning for anyone who may be tempted to try: my CloudFormation stack is now stuck, as any subsequent update fails with: Failed to submit upgrade. Upgrade from 6.8 to 6.8 not supported. (Service: AWSElasticsearch; Status Code: 400; Error Code: BaseException; Request ID: ...... )

@imgaray , @axelpavageau - have you opened support tickets for these? Let me know if you can share more info - like which region you are seeing this in, for example. Just wondering if the deployment of the fix encountered any problems.

@luiseduardocolon I have. It's been assigned and escalated but I don't have an answer to share yet.
My issue is in eu-west-1.
Let me know if there's anything else I can share to help. (note : I'm also available on the AWS Developers' slack if needed)

Would it be possible to support updates when the current version is the same as the supplied version? I'm concerned about applying this to any of my production ES clusters in case it fails.

@luiseduardocolon I ran into the same problem in both our production and development environments. I opened support tickets for them. My issue is in eu-central-1. Also, let me know if any other information is needed; I will be very happy to help.
@axelpavageau Please share the answer when you make any progress. I have been stuck with this bug for almost 3 months now, thanks!

@akhiljain100 AWS support advised me to revert my CloudFormation change (in my case "downgrading" from 6.8 to 6.2).
This doesn't change the fact that my ES domains are no longer in sync with their cloudformation templates, nor the fact that for now I can't upgrade them anymore.
However it allows me to deploy other (non-ES) changes without encountering rollbacks.

Thanks @axelpavageau Does that mean that your resources are now no longer managed by Cloud Formation? What is the benefit of having downgraded?

@akhiljain100 My understanding is that Cloudformation only attempts to upgrade domain version if they differ between different revisions of your stack, so the main benefit of proceeding with this option would be to allow your stack to transition to "update_complete", hence unblocking the deployment of other resources that may be in the stack.

The negative side effect of this would be that there would be a drift between your ES domain and its cloudformation definition (i.e. the real version being higher than the one defined in the template), so I wouldn't risk updating any attribute on that domain until this problem gets fixed.

@imgaray is right.

Another word of caution : I'm still waiting to hear back from support but any subsequent update to my cloudformation stack takes 3 hours to complete. I'm fearing this is a side effect from the initial ES issue.

2020-03-04 10:20:53 UTC+0100 mymainstack UPDATE_IN_PROGRESS User Initiated
2020-03-04 10:26:16 UTC+0100 substack UPDATE_COMPLETE -
2020-03-04 13:26:18 UTC+0100 mymainstack UPDATE_COMPLETE_CLEANUP_IN_PROGRESS -
2020-03-04 13:27:16 UTC+0100 mymainstack UPDATE_COMPLETE -
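
As a quick check on where the time goes, the gaps between the events above can be computed directly. The sketch below simply copies the four timestamps from the log (all in the same UTC+0100 offset, so the offset is dropped); `EVENTS`, `gaps`, and `total` are names made up for this example.

```python
from datetime import datetime

# Timestamps copied from the stack events above (same UTC offset throughout).
EVENTS = [
    ("2020-03-04 10:20:53", "mymainstack", "UPDATE_IN_PROGRESS"),
    ("2020-03-04 10:26:16", "substack", "UPDATE_COMPLETE"),
    ("2020-03-04 13:26:18", "mymainstack", "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS"),
    ("2020-03-04 13:27:16", "mymainstack", "UPDATE_COMPLETE"),
]


def parse(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def gaps(events):
    """Return the time elapsed between each pair of consecutive events."""
    times = [parse(ts) for ts, _, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def total(events):
    """Return the time elapsed between the first and last event."""
    return parse(events[-1][0]) - parse(events[0][0])


if __name__ == "__main__":
    for gap in gaps(EVENTS):
        print(gap)
    print("total:", total(EVENTS))
```

Almost all of the elapsed time (just over three hours) sits between the substack completing and the main stack entering cleanup; the other two steps take about five minutes and one minute.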

By the way, we are still actively chasing this issue...although I don't have a time-to-resolve yet, I am escalating this to several stakeholders internally. Please update this thread with your latest observations.

Hi @luiseduardocolon . Thanks for the update.
Nothing new on my end. Support confirmed that the 3 hours completion time is a side effect of the ES issue.
Since this is happening on a production stack, I'm leaving it as it is (doing as few deployments as possible) while waiting for a fix.

Hi @luiseduardocolon. Nothing new on my end either. I left both my production and development stacks as they are. I am unable to do any deployments to my other resources as well.

same issue here

A new fix will be deployed in the next month. Stay tuned. Keeping this open until then.

thanks for the update @luiseduardocolon

Hi @luiseduardocolon, is the fix already deployed? Thanks

Ran into this issue as well. Upgrading from 6.8 to 7.4 and a stack rollback occurred afterwards. CloudFormation got into the weird state of trying to upgrade from 7.4 to 7.4. I have to update the stack with the previous ES version specified for stack updates to succeed.

Failed to submit upgrade. Upgrade from 7.4 to 7.4 not supported.

I did an upgrade on another project from 6.5 to 6.8 and had a rollback. CloudFormation handled this case gracefully though.

This appears to be fixed as of today.

I ran a stack update about 30 minutes ago with 7.4 specified as the version. After one second CloudFormation reported UPDATE_COMPLETE for the ElasticSearch Domain.

I confirm that upgrade checks no longer fail when CloudFormation tries to update the cluster to its actual current version (tested on 4 distinct clusters this morning).
So while the root issue is not fixed, at least we can now manually update the ES clusters, then update the cloudformation templates to reflect the change.
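
The manual half of that workaround could be scripted roughly as follows. This is a sketch under assumptions: it expects a boto3 `es` client (or anything with the same `describe_elasticsearch_domain` and `upgrade_elasticsearch_domain` methods) to be passed in, the helper names are hypothetical, and the domain name in the commented usage is a placeholder. Once the upgrade has finished, the template's version would be set to the same value.

```python
def current_version(es_client, domain_name):
    """Read the version the domain is actually running."""
    response = es_client.describe_elasticsearch_domain(DomainName=domain_name)
    return response["DomainStatus"]["ElasticsearchVersion"]


def upgrade_if_needed(es_client, domain_name, target_version):
    """Start an in-place upgrade unless the domain is already at the target.

    Returns True if an upgrade was requested, False if nothing was done.
    """
    if current_version(es_client, domain_name) == target_version:
        return False
    es_client.upgrade_elasticsearch_domain(
        DomainName=domain_name,
        TargetVersion=target_version,
    )
    return True


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# es = boto3.client("es")
# upgrade_if_needed(es, "my-domain", "7.4")  # "my-domain" is a placeholder
```

Passing the client in, rather than creating it inside the functions, keeps the sketch testable with a stub and avoids touching a real domain by accident.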

@axelpavageau what do you mean? I am able to update the version via CloudFormation by specifying EnableVersionUpgrade in the update policy.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-upgradeelasticsearchdomain

https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-elasticsearch-upgrade/
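
For readers who have not used the update policy mentioned above, here is a minimal sketch of where `EnableVersionUpgrade` sits in a template, built as a Python dictionary and printed as JSON. The logical ID `SearchDomain` and the domain name are hypothetical, and all other domain properties are omitted, so this is a fragment to adapt rather than a complete template.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "SearchDomain": {  # hypothetical logical ID
            "Type": "AWS::Elasticsearch::Domain",
            # Resource-level attribute, a sibling of Properties.
            "UpdatePolicy": {"EnableVersionUpgrade": True},
            "Properties": {
                "DomainName": "my-domain",  # placeholder name
                "ElasticsearchVersion": "7.4",  # change this value to upgrade
                # other domain properties omitted for brevity
            },
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(template, indent=2))
```

Note that `UpdatePolicy` is set on the resource itself, not inside `Properties`.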

Yes, but the root issue was that CloudFormation waits for the actual update to complete. Since the update usually takes more than 1 hour, CloudFormation fails the operation and rolls back. That's still happening, as far as I know.

I could see that happening for domains with large amounts of data.

Are you using a service linked role?

I think increasing the timeout could be a separate feature request. It should be possible for the service team to increase it (IIRC 12 hours is the current hard limit for resource providers).

Ran into the same issue upgrading from 6.0 to 6.8. CloudFormation failed the update with "Elasticsearch Domain did not stabilize" and is now stuck on "Caution: version Downgrade is not supported during update rollback progress. Inconsistency could exist between stack template and the actual resource."

I had the opportunity to test this feature again.

The good news is that CloudFormation no longer seems to wait for the actual update to complete. It triggers the update and reports the resource as "UPDATE COMPLETE" a second later.

The bad news is that in my case the domain update did not go ahead, because the domain was being snapshotted at the time:

Upgrade from 7.4 to 7.7 - 08/09/2020, 17:03:04 - Failed

    Checking upgrade eligibility - Failed
        Prior snapshot operation (with repository name - cs-automated) has not yet completed. Shard relocations as part of the upgrade process cannot be performed when a snapshot operation is running

This means that my cloudformation template is now no longer reflecting the actual resource state (cloudformation thinks the domain is running 7.7, while it's actually still 7.4).

I think this should be handled by a cloudformation update failure & rollback instead.
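
Until that is handled by CloudFormation itself, one possible stopgap is to check the domain after a stack update and flag the mismatch. The sketch below assumes a boto3 `es` client is passed in and relies on its `get_upgrade_status` and `describe_elasticsearch_domain` calls; the function name `upgrade_outcome` and its return labels are made up for this example.

```python
def upgrade_outcome(es_client, domain_name, expected_version):
    """Classify the state of a domain after a version change was requested.

    Returns one of: "failed", "in progress", "ok", "drifted".
    """
    status = es_client.get_upgrade_status(DomainName=domain_name)
    step_status = status.get("StepStatus")
    if step_status == "FAILED":
        return "failed"
    if step_status == "IN_PROGRESS":
        return "in progress"

    described = es_client.describe_elasticsearch_domain(DomainName=domain_name)
    actual = described["DomainStatus"]["ElasticsearchVersion"]
    # "drifted" means the template and the real domain disagree on the version.
    return "ok" if actual == expected_version else "drifted"


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# print(upgrade_outcome(boto3.client("es"), "my-domain", "7.7"))
```

A result of "failed" or "drifted" would correspond to the situation described above, where the template says 7.7 while the domain is still on 7.4.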
