Terraform: All `vpc_security_group_ids` being removed when updating instance Security Groups

Created on 26 Sep 2016  ·  17Comments  ·  Source: hashicorp/terraform

When trying to modify AWS Security Groups on an already Terraformed EC2 instance, it seems to forget about all existing SGs (even though they appear in the statefile), and the plan output initially looks legit.

We originally found this issue, when trying to add a brand-new SG. Terraform removed all existing SGs (and caused some chaos!), but successfully added the new one. We've since reduced the problem down to just removing an existing SG, or somewhere parsing the existing SGs from the vpc_security_group_ids field.

In the following example:

  • One EC2 instance called ganglia-a
  • 4 existing Security Groups AAAAAA, BBBBBB, CCCCCC, & DDDDDD
  • Trying to remove BBBBBB.
  • Statefile shows the existing 4 SGs, (and all have their definitions too).
  • Terraform plan shows the reduction in count from 4=>3
  • Terraform plan shows the SG being removed
  • Terraform apply attempts to clear out all SGs. Only limited in this case by the AWS API complaining about lack of parameter payload.

Terraform Version

Terraform v0.7.4

Affected Resource(s)

  • aws_instance
  • vpc_security_group_ids

    Terraform Configuration Files

Existing Statefile showing existing 4 SGs:

                "aws_instance.ganglia-a": {
                    "type": "aws_instance",
                    "depends_on": [
                        "aws_security_group.CCCCCC",
                        "aws_security_group.DDDDDD",
                        "aws_security_group.AAAAAA",
                        "aws_subnet.blah-public",
                        "data.template_file.ganglia-cloudinit"
                    ],
                    "primary": {
                        "id": "i-INSTANCE-ID-HERE",
                        "attributes": {
                            "ami": "ami-AMI-ID",
                            "associate_public_ip_address": "false",
                            "availability_zone": "us-west-2a",
                            "disable_api_termination": "false",
                            "ebs_block_device.#": "0",
                            "ebs_optimized": "false",
                            "ephemeral_block_device.#": "0",
                            "iam_instance_profile": "",
                            "id": "i-INSTANCE-ID-HERE",
                            "instance_state": "running",
                            "instance_type": "m4.large",
                            "key_name": "dom",
                            "monitoring": "false",
                            "network_interface_id": "eni-xxxxxx",
                            "private_dns": "blah.compute.internal",
                            "private_ip": "SOME_IP",
                            "public_dns": "blah.compute.amazonaws.com",
                            "public_ip": "SOME_IP",
                            "root_block_device.#": "1",
                            "root_block_device.0.delete_on_termination": "true",
                            "root_block_device.0.iops": "192",
                            "root_block_device.0.volume_size": "64",
                            "root_block_device.0.volume_type": "gp2",
                            "security_groups.#": "0",
                            "source_dest_check": "true",
                            "subnet_id": "subnet-ID",
                            "tags.%": "2",
                            "tags.Cost Center": "Platform",
                            "tags.Name": "ganglia",
                            "tenancy": "default",
                            "user_data": "SOME_HASH_HERE",
                            "vpc_security_group_ids.#": "4",
                            "vpc_security_group_ids.1389827218": "sg-AAAAAA",
                            "vpc_security_group_ids.258711328": "sg-DDDDDD",
                            "vpc_security_group_ids.4034046821": "sg-CCCCCCC",
                            "vpc_security_group_ids.4249215419": "sg-BBBBBB"
                        },
                        "meta": {
                            "schema_version": "1"
                        },
                        "tainted": false
                    },
                    "deposed": [],
                    "provider": ""
                }

Existing Terraform file:

# ganglia
resource "aws_instance" "ganglia-a" {
  ami = "${var.ubuntu_ami_id}"
  availability_zone = "${var.vpc_region}a"
  instance_type = "m4.large"
  key_name = "dom"
  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.BBBBBB.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}" ]
  subnet_id = "${aws_subnet.blah-public.0.id}"
  user_data = "${data.template_file.SOMEDATA.rendered}"

  lifecycle {
    ignore_changes = ["user_data"]
  }

  root_block_device {
    volume_type = "gp2"
    volume_size = "${var.default_volume_size}"
  }

  provisioner "local-exec" {
    command = "../../../scripts/autosign ${var.foreman_proxy_url} ganglia-${replace(aws_instance.ganglia-a.id, "i-", "")}.${var.vpc_domain}"
  }

  tags {
    Name = "ganglia"
    "Cost Center" = "Platform"
  }
}

Actual change we're trying to apply (removal of BBBBBB)

-  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.BBBBBB.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}"]
+  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}"]

Terraform Plan

~ aws_instance.ganglia-a
    vpc_security_group_ids.#:          "4" => "3"
    vpc_security_group_ids.4249215419: "sg-BBBBBB" => ""

Debug Output

  • Debug output shows the API call to Amazon, having an empty payload. It should have the remaining 3 SGs?
    https://gist.github.com/scottgeary/67a7584bffdeeb672425cf6d1092c5c9

    Expected Behavior

  • A single SG aws_security_group.BBBBBB being removed

  • Terraform plan usually shows elements to be kept.

    Actual Behavior

  • Attempts to removes all SGs

  • Causes AWS API error, about lack of parameters.

    Steps to Reproduce

  1. terraform plan
  2. terraform apply

    Important Factoids

  • I've removed a lot of IDs from the various output.
  • If we try adding a brand new SG, it removes all existing SGs, and leaves the new one as the _only_ one. (This is how we discovered the bug)

    References

  • None that I could find that clear the existing SGs.

bug core

Most helpful comment

@scottgeary hi there! I am very sorry that you are affected by this!

We are having a look, as this is a very hard to catch issue, as I personally could not reproduce it successfully. I will keep you posted once we find out what might be creeping there!

All 17 comments

@scottgeary hi there! I am very sorry that you are affected by this!

We are having a look, as this is a very hard to catch issue, as I personally could not reproduce it successfully. I will keep you posted once we find out what might be creeping there!

I walked @stack72 through this on my machine, but we still couldn't figure out why. Does not appear to be an API-limit-related issue.

@scottgeary After two days of head banging I just realized that I'm having this exact same problem. Mine manifested when I tried to add 1 SG to an existing array of 4. Repeated applies flip-flop between _just_ the new 1 and _just_ the original 4. When doing a delete it's the same behavior except aws won't let you send it 0 SGs. Otherwise yours would flip-flop too between setting your instance to 4 then 0 then 4 then 0.

I believe it's yet another ignore_changes bug. I had just recently added ignore_changes['user_data'] to my resource just like above. I thought it was working now (to be fair, it does ignore the user_data now) :(

One can reliably reproduce the issue like below. First, create an instance:

resource "aws_instance" "mystery_test" {
  provider = "aws.mystery"
  instance_type = "t2.nano"
  user_data = "foo"
  lifecycle {
    ignore_changes = [ "user_data" ]
  }
  vpc_security_group_ids = ["sg-AAAAAAAA", "sg-BBBBBBBB"]
  ami       = "ami-a1288ec2"
  subnet_id = "${var.subnetid}"
}

Then, change user_data "foo" to "bar" and also add a new SG to the array.

user_data = "bar"
  lifecycle {
    ignore_changes = [ "user_data" ]
  }
  vpc_security_group_ids = ["sg-AAAAAAAA", "sg-BBBBBBBB", "sg-CCCCCCCC]

a plan will give you strange output:

~ module.mysterymod.aws_instance.mystery_test
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.3333333333: "" => "sg-CCCCCCCC"

Note how the plan doesn't show the expected

vpc_security_group_ids.1111111111: "sg-AAAAAAAA" => "sg-AAAAAAAA"
vpc_security_group_ids.222222222: "sg-BBBBBBBB" => "sg-BBBBBBBB"

entries. This is bad :)
In the trace log of the apply, you'll see it POST just

Action=ModifyInstanceAttribute&GroupId.1=sg-CCCCCCCC&InstanceId=i123455678&Version=2016-09-15

The action _should_ be all 3 groups (GroupId.1 and GroupId.2 and GroupId.3)
What _seems_ to be happening is that ignore_changes catches the user_data change (foo => bar) which then makes it ignore other diffs (or at least mess things up)

2016/10/26 14:35:37 [DEBUG] root.mystery: eval: *terraform.EvalDiff
2016/10/26 14:35:37 [DEBUG] Removing 'id' diff and setting Destroy to false because after ignore_changes, this diff no longer requires replacement
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: user_data
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: vpc_security_group_ids.11111111
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: security_groups.#
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: root_block_device.#
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: ebs_block_device.#
...

If you then remove the ignore_changes line and run a plan at this point, you'll get a plan that of course wants to recreate the instance, _but_ the sg change will at least look correct:

-/+ module.mystery.aws_instance.mystery
    ami:                               "ami-a1288ec2" => "ami-a1288ec2"
...
    user_data:                         "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33" => "62cdb7020ff920e5aa642c3d4066950dd1f01f4d" (forces new resource)
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.3333333333: "" => "sg-CCCCCCCC"
    vpc_security_group_ids.11111111111: "sg-AAAAAAAA" => "sg-AAAAAAAA"
    vpc_security_group_ids.2222222222: "sg-BBBBBBBB" => "sg-BBBBBBBB"

I of course added ignore_changes recently because I needed to update my user_data without reprovisioning lots of nodes, thus I'm now stuck in a state where I get to pick between adding a SG and reprovisioning lots of nodes :(

I will add that as a workaround, you can manually attach the new SG to your instance via console/cli and Terraform will not complain about it. Probably works for removing one also.

Anyway, seems ignore_changes needs to be examined further. Hope this helps!
@phinze

Just to add a bit more proof, I played a bit last night with eval_diff.go and tried (around line 240) :

   // If we didn't hit any of our early exit conditions, we can filter the diff.
    for k := range ignorableAttrKeys {
        if !(strings.Contains(k, "vpc_security_group_ids")) {
            log.Printf("[DEBUG] [EvalIgnoreChanges] %s - Ignoring diff attribute: %s",
                n.Resource.Id(), k)
            diff.DelAttribute(k)
        }
    }

running this against my earlier example resource results in the correct plan (didn't try applying). This suggests that one approach might be to make sure we're not deleting diff attributes that we care about in this particular run. Note that if I run a plan using the above hacked TF against other resources that have a pending recreate due to user data, are blocked from doing so by ignore_changes, but have no other changes (vpc_security_group_ids or otherwise), I get a useless diff in my plan for all of them, e.g.

"A" => "A"
"B" => "B"
"C" => <computed>

Looking further it seems the problem is a bit earlier where we do:

   if changeType == DiffDestroyCreate {
        for k, v := range diff.CopyAttributes() {
            if v.Empty() || v.NewComputed {
                ignorableAttrKeys[k] = true
            }
        }

problem being that this ends up deleting part of our diff, e.g. A => A, since that's matched by .Empty(). This is likely a problem for all list types of attributes. Seems that we should look for the existence of a parent counter key, e.g. "vpc_security_group_ids.#" and if it's changing, we should exclude all child entries from the ignorableAttrKeys map.

What we actually need to do is check if any siblings are changing. Potential fix in #9791

I'm seeing this bug too, I'm adding a new security group to the instance and the original 2 are disappearing and it everytime I run plan it flips flops back and forth.

~ aws_instance.management.0
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.1111564906: "" => "sg-aaaa4197"

then

~ aws_instance.management.0
    vpc_security_group_ids.#:          "1" => "3"
    vpc_security_group_ids.111142371:  "" => "sg-aaaa417a"
    vpc_security_group_ids.1111450837: "" => "sg-aaaa4184"

I'm seeing this as well. Earlier in the week it was that a vpc_security_group_ids entry was not being removed, and so I removed it via console. The plan output looked much like ekristen's comment. This was relatively benign and so I didn't bother to enter an issue on it.

Today however, this same issue had some destructive behavior. It removed all SG except the one I was adding; luckily I was doing this on a dev machine. But I must proceed with modification of SG across many more instances.

Early warning from plan:

~ aws_instance.compute_100_dev
    vpc_security_group_ids.#:         "3" => "4"
    vpc_security_group_ids.423628628: "" => "sg-asdfasdf"

Count goes to 4 so I'm sure everything will be OK and so I apply.

aws_instance.compute_100_dev: Modifying...
  vpc_security_group_ids.#:         "3" => "4"
  vpc_security_group_ids.423628628: "" => "sg-asdfasdf"
aws_instance.compute_100_dev: Modifications complete

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Close enough, right? Probably just some display error.

Then I validate in console and the only SG attached to the instance is sg-asdfasdf, so I begin to cry again. I wipe my face and then plan:

~ aws_instance.compute_100_dev
    vpc_security_group_ids.#:          "1" => "4"
    vpc_security_group_ids.1967715036: "" => "sg-12341234"
    vpc_security_group_ids.2402598217: "" => "sg-qwerqwer"
    vpc_security_group_ids.257549500:  "" => "sg-zxcvzxcv"


Plan: 0 to add, 1 to change, 0 to destroy.

Hooray. This looks more correct. I mean, it isn't showing sg-asdfasdf but surely this is just a display bug.
Apply:

aws_instance.compute_100_dev: Modifying...
  vpc_security_group_ids.#:          "1" => "4"
  vpc_security_group_ids.1967715036: "" => "sg-12341234"
  vpc_security_group_ids.2402598217: "" => "sg-qwerqwer"
  vpc_security_group_ids.257549500:  "" => "sg-zxcvzxcv"
aws_instance.compute_100_dev: Modifications complete

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Console reveals that I again have only 3 SG and the new one isn't being put into place. It is too early in the day to start drinking.

Maybe I can make a workaround. Advice appreciated.

@ekristen @Gary-Armstrong can you guys test my fix from #9791 ? Maybe it'll get more traction if it works for you.

Update:

  vpc_security_group_ids = [
    "${aws_security_group.wxmix_http_sg.id}",
    "${aws_security_group.wxmix_admin_sg.id}"
  ]

Using this HCL where the 2 SG are completely unrelated, when I do a plan/apply cycle it will switch to whichever SG isn't applied to the instance.

] aws ec2 describe-instances --instance-id i-xxxxxxxx|jq .Reservations[].Instances[].SecurityGroups[]
{
  "GroupName": "wxmix_http_sg",
  "GroupId": "sg-aaaaaaaa"
}
...TF APPLY...
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
...
] aws ec2 describe-instances --instance-id i-xxxxxxxx|jq .Reservations[].Instances[].SecurityGroups[]
{
  "GroupName": "wxmix_admin_sg",
  "GroupId": "sg-bbbbbbbb"
}

Just by doing an apply, it will flip back and forth between the two SG.

More experiments reveal the pattern (in my case) is to replace existing list items with new list items, rather than merge the list and replace existing list items with new merged list items.

@br0ch0n If it involves compiling a new terraform, I'll not participate. I have to manually change SG assignments on many instances today.

Still happens in 0.8.6

@stack72 is the prototype you mentioned in #1887 meant to work around this as well (i.e. you're planning on completely refactoring ignore_changes)? Otherwise can we move forward with #9791? Happy to tweak it if it's not an ideal solution.

just got bit by this bug as well today. Added a new security group to my vpc_security_group_ids list and it removed all existing security groups from my instance and only added the new security group.

Thanks for the research here. I've reproduced this issue with the following gist and instructions above:

I'm relabeling this as a core bug, the root of it seems to involve ignore_changes as mentioned.

Hey all – looks like this should be patched up with #12897 , will go out in the next release

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

Was this page helpful?
0 / 5 - 0 ratings