Terraform-provider-aws: Terraform state gets out of sync with AWS CloudHSM v2 resources, and creates more CloudHSM v2 instances than defined in the code

Created on 15 May 2019 · 6 Comments · Source: hashicorp/terraform-provider-aws

Terraform Version

Terraform v0.11.8

Affected Resource(s)

aws_cloudhsm_v2_hsm

Terraform Configuration Files


resource "aws_cloudhsm_v2_cluster" "main" {
  hsm_type   = "hsm1.medium"
  subnet_ids = ["${var.subnet_ids}"]

  tags {
    Name        = "cloudhsm.${var.region}.${var.env}.${var.stack}"
    environment = "${var.env}"
    stack       = "${var.stack}"
  }
}

resource "aws_cloudhsm_v2_hsm" "hsm1" {
  subnet_id  = "${element(var.subnet_ids, 0)}"
  cluster_id = "${aws_cloudhsm_v2_cluster.main.cluster_id}"
}

resource "aws_cloudhsm_v2_hsm" "hsm2" {
  subnet_id  = "${element(var.subnet_ids, 1)}"
  cluster_id = "${aws_cloudhsm_v2_cluster.main.cluster_id}"
}

Expected Behavior

Once the two CloudHSM instances are created, subsequent Terraform runs should change nothing. In particular, Terraform should not create additional CloudHSM instances.

Actual Behavior

Some subsequent runs create a _new_ CloudHSM instance in addition to the existing ones. This happens unpredictably, and we end up with more than two instances. That is, the Terraform state says only two instances exist, while there are more than two in AWS.

Steps to Reproduce

We haven't been able to reproduce this yet. From what we've seen, Terraform can run for months before it starts showing this behavior. We haven't noticed any patterns.

Important Factoids

This is happening in two AWS accounts

bug service/cloudhsmv2

Source

chamindg

👍25

All 6 comments

Found the problem.

Please note the following excerpt from https://aws.amazon.com/cloudhsm/faqs:

Q: What happens in case of failure?

The CloudHSM service provides fully managed HSMs in the AWS cloud. The service handles all updates and failover for you. Replacements are transparent to your application, as the CloudHSM client automatically handles failover and load balancing. HSMs are replaced to the same ENI as the original HSM. You can see when an HSM has been replaced in your audit logs in CloudWatch. You will see the log stream for one HSM ID terminate, and a new HSM ID begin, when a replacement occurs. Refer to the Monitoring AWS CloudHSM Audit Logs in Amazon CloudWatch Logs documentation at https://docs.aws.amazon.com/cloudhsm/latest/userguide/get-hsm-audit-logs-using-cloudwatch.html

This means the Terraform state cannot rely on the CloudHSM instance ID as the unique identifier, because that ID changes whenever AWS replaces an HSM. The ENI ID should work instead, as it stays the same across replacements.
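
To make the presumed failure mode concrete, here is a minimal Go sketch. It is an in-memory model only and makes no AWS calls; the types `HSM` and `Cluster` and the functions `applyByHsmID` and `simulate` are hypothetical names invented for illustration, not provider code. It assumes that a refresh keyed on HSM ID treats a replaced HSM as deleted and recreates it.

```go
package main

import "fmt"

// HSM is a simplified, hypothetical stand-in for one HSM in a cluster.
type HSM struct {
	HsmID string
	EniID string
}

// Cluster is an in-memory model of a CloudHSM v2 cluster, not an AWS client.
type Cluster struct {
	HSMs []HSM
	next int // counter used to mint fresh IDs
}

// create adds an HSM with a fresh HSM ID and a fresh ENI ID.
func (c *Cluster) create() HSM {
	c.next++
	h := HSM{
		HsmID: fmt.Sprintf("hsm-%d", c.next),
		EniID: fmt.Sprintf("eni-%d", c.next),
	}
	c.HSMs = append(c.HSMs, h)
	return h
}

// replace models a service-side replacement: new HSM ID, same ENI ID.
func (c *Cluster) replace(oldID string) {
	for i := range c.HSMs {
		if c.HSMs[i].HsmID == oldID {
			c.next++
			c.HSMs[i].HsmID = fmt.Sprintf("hsm-%d", c.next)
			return
		}
	}
}

// hasHsmID reports whether any HSM in the cluster carries the given HSM ID.
func (c *Cluster) hasHsmID(id string) bool {
	for _, h := range c.HSMs {
		if h.HsmID == id {
			return true
		}
	}
	return false
}

// applyByHsmID models refresh-then-apply keyed on HSM ID (an assumption):
// a tracked ID that no longer exists is treated as deleted and recreated.
func applyByHsmID(c *Cluster, state map[string]string) {
	for name, id := range state {
		if !c.hasHsmID(id) {
			state[name] = c.create().HsmID
		}
	}
}

// simulate creates two HSMs, replaces one, runs an apply, and returns how
// many HSMs the state tracks versus how many the cluster really holds.
func simulate() (tracked, actual int) {
	c := &Cluster{}
	state := map[string]string{
		"hsm1": c.create().HsmID,
		"hsm2": c.create().HsmID,
	}
	c.replace(state["hsm1"])
	applyByHsmID(c, state)
	return len(state), len(c.HSMs)
}

func main() {
	tracked, actual := simulate()
	fmt.Printf("state tracks %d HSMs, cluster has %d\n", tracked, actual)
}
```

In this model the state still lists two HSMs after the apply while the cluster holds three, which matches the symptom described in the report.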

I hope this information helps with a fix.

chamindg on 16 May 2019

👍1

@bflad, is this assigned to any release yet? Thanks.

chamindg on 21 May 2019

Hi @chamindg 👋 This is currently not a focus of the maintainers, but we would be happy to look at a pull request for a fix. We do not generally assign items more than a week or two out at the moment.

bflad on 21 May 2019

@bflad - we've been hit by this bug a couple of times, and I'd like to help out by working on the PR. My initial concern is backwards compatibility, though. The fix will require tracking resources by a different unique identifier (ENI ID rather than HSM instance ID), and I'd hate for my naive implementation to throw everyone's HSMs away when they upgrade their provider.

Are there any examples where such tracking-id-migrations (for want of a better term) have been necessary? If so I'll gladly take a look at sorting this one out. Thanks!

mattburgess on 17 Mar 2020

I also encountered this bug and it's a good thing you're working on it. I'd like to point out that while not critical, the financial ramifications of the issue are rather significant. An extra rogue HSM in a cluster costs some $1500 per month. Yikes!

keksipurkki on 23 Apr 2020

😕2

@mattburgess I haven't found a precedent for changing IDs yet, though I might with a bit more digging. One idea would be to make the provider first try to find the HSM by eni_id, as suggested by @chamindg, and if that fails, match by id:
https://github.com/terraform-providers/terraform-provider-aws/blob/v3.3.0/aws/resource_aws_cloudhsm2_hsm.go#L93

That way, whatever ID happens to live in your state file, whether it is an eni_id or an hsm_id, the provider will still find the right HSM. We would then need to store the eni_id in place of the hsm_id for new resources.
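
A rough sketch of that lookup order in Go might look like the following. It is self-contained and does not use the AWS SDK; the `HSM` struct and the `findHSM` function are hypothetical names for illustration and are not the provider's actual code.

```go
package main

import "fmt"

// HSM is a simplified, hypothetical view of one HSM returned for a cluster.
type HSM struct {
	HsmID string
	EniID string
}

// findHSM looks for an HSM whose ENI ID equals id. If none matches, it falls
// back to matching on HSM ID, so state files written before the change
// (which hold an HSM ID) can still resolve.
func findHSM(hsms []HSM, id string) (HSM, bool) {
	for _, h := range hsms {
		if h.EniID == id {
			return h, true
		}
	}
	for _, h := range hsms {
		if h.HsmID == id {
			return h, true
		}
	}
	return HSM{}, false
}

func main() {
	// Example data only; these IDs are made up.
	hsms := []HSM{
		{HsmID: "hsm-new111", EniID: "eni-aaa"},
		{HsmID: "hsm-bbb222", EniID: "eni-bbb"},
	}
	for _, id := range []string{"eni-aaa", "hsm-bbb222", "hsm-old000"} {
		h, ok := findHSM(hsms, id)
		fmt.Printf("lookup %q: found=%v eni=%q\n", id, ok, h.EniID)
	}
}
```

Note that the fallback only helps while the HSM ID in the state is still current. A state entry holding the ID of an HSM that has already been replaced (the last lookup in the example) would still not be found.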

I'm going to try to independently confirm that when our hsm_id changes, the eni_id remains the same.