Terraform: Dangerous 0.12 change: strings and numbers no longer compare equal

Created on 4 Jul 2019  路  9Comments  路  Source: hashicorp/terraform

TL;DR

0 == "0" ? "foo" : "bar"

  • In Terraform 0.11, we get "foo"
  • In Terraform 0.12, we get "bar"

Terraform Version

Terraform v0.12.3
+ provider.google v2.10.0

Terraform Configuration Files

I have an existing terraform config.
The idea of the config is: I want to have two VMs, one should serve production traffic, the other should be hot standby. The software running on the VM selects which role the VM has by inspecting its GCE metadata (Terraform does not care about this, this is happening on the VM). In the following config, we create 2 VMs, the first VM gets the metadata server_status set to production, the second VM gets the metadata server_status set to standby.

resource "google_compute_instance" "app" {
  count        = "2"
  ...[skipping irrelevant parts]...

  metadata = {    
    foo           = "bar"
    server_status = "${count.index == local.prod_index ? "production" : "standby"}"
  }
}

locals {
  prod_index = "0"
}

Full file to reproduce the issue:

resource "google_compute_instance" "app" {
  count        = "2"
  project      = "foobar"
  name         = "foobar"
  machine_type = "n1-standard-1"
  zone         = "us-central1"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }
  network_interface {
    network = "default"
    access_config {}
  }

  metadata = {    
    foo           = "bar"
    server_status = "${count.index == local.prod_index ? "production" : "standby"}"
  }
}

locals {
  prod_index = "0"
}

Debug Output

Full output (no TF_LOG, since there is nothing interesting in there): https://pastebin.com/xz01cn0i

Crash Output

no crash

Expected Behavior

In Terraform 0.11, this worked as expected: the first VM gets the metadata server_status set to production, the second VM gets the metadata server_status set to standby.

Snippet of the plan output:

Terraform will perform the following actions:

  + google_compute_instance.app[0]
      ...
      metadata.%:                                          "2"
      metadata.foo:                                        "bar"
      metadata.server_status:                              "production"
     ...

  + google_compute_instance.app[1]
      ...
      metadata.%:                                          "2"
      metadata.foo:                                        "bar"
      metadata.server_status:                              "standby"
      ....


Plan: 2 to add, 0 to change, 0 to destroy.

I expect that this config continues to work in Terraform 0.12 or that the config gets rejected with a type error.

Actual Behavior

Terraform 0.12 accepts the config without any warning. Also, terraform 0.12upgrade does not issue a warning and simply upgrades the config without complaint. The issue exists with running 0.12upgrade before and without. For completeness, here is the config again after 0.12upgrade:

resource "google_compute_instance" "app" {
  count        = "2"
   ...
  metadata = {
    foo           = "bar"
    server_status = count.index == local.prod_index ? "production" : "standby"
  }
}

locals {
  prod_index = "0"
}

The issue: In Terraform 0.12, both VMs get the metadata server_status set to standby. Shortened plan output:

  # google_compute_instance.app[0] will be created
  + resource "google_compute_instance" "app" {
     ...
      + metadata             = {
          + "foo"           = "bar"
          + "server_status" = "standby"
        }
    }

  # google_compute_instance.app[1] will be created
  + resource "google_compute_instance" "app" {
     ...
      + metadata             = {
          + "foo"           = "bar"
          + "server_status" = "standby"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

The underlying problem is that strings and numbers no longer compare equal. When I replace

locals {
  prod_index = "0"
}

with

locals {
  prod_index = 0
}

the config does again what it should.

In all versions, terraform0.12.3 validate says Success! The configuration is valid.

Steps to Reproduce

  1. terraform0.11.14 init
  2. terraform0.11.14 plan
  3. rm -rf .terraform
  4. terraform0.12.3 init
  5. terraform0.12.3 plan

Additional Context

Since count is still initialized as the string "2", even after 0.12upgrade, this somehow feels inconsistent.

I somehow feel that the expression count.index == local.prod_index ? "production" : "standby" should either evaluate to true if count.index is 0 and local.prod_index is "0" (this is how Terraform 0.11 behaved) or throw a type error, rejecting the configuration.

I personally feel that the current behavior is dangerous, since in my use case, an existing terraform configuration file looks like it will continue to work, however, the 0.12 upgrade will lead to an outage in my project since my production VM will switch to standby and stop serving.

I acknowledge that the plan output shows this change. Yet, even in my simplified example, the plan output for creating just one single google_compute_instance is already over 60 lines, and this small peculiarity may easily slip through.

I personally feel that 0 == "0" ? "production" : "standby" should be a type error and my config (without requiring manual effort on my side) should stop working.

References

bug config v0.12

All 9 comments

Why is this tagged as "question"?
I feel that this is rather a bug, since this should probably be a type error instead of a dangerous change in behavior from 0.11->0.12.

No change with Terraform 0.12.4.

The documentation states "Any of the equality, comparison, and logical operators can be used to define the condition. The two result values may be of any type, but they must both be of the same type so that Terraform can determine what type the whole conditional expression will return without knowing the condition value."
Therefore, 0 == "0" should be a type error.

@diekmann The results in the docs refer to the right side of the conditional, that is ? "this" : "and this" being of the same type.

I'm not sure about 0 == "0" being a type error -- given the current docs on type conversion, I'm considering approaching it from this angle, such that 0 == "0" would be true, but that itself would be a fairly large change.

As such I'm interested to hear a bit more about why you think the comparison should be a type error, as comparing two items of different types "string" == 0 or even "string" == nil are both valid false responses, in my opinion.

@pselle Thanks for your reply.

My motivation stems from the 0.11->0.12 migration.

In 0.11, basically everything was a string and it was technically impossible to write 0 == "0", one could only write "0" == "0".
However, one could also write count.index == "0". I will now assume count.index is zero.
In 0.11, this translates to "0" == "0", i.e. true.
In 0.12, this translates to 0 == "0", i.e. false.

If I have a .tf file which is perfectly syntactically valid for 0.11 and 0.12, upgrading 0.11->0.12 may change the behavior of that file, since some conditionals may suddenly change from true to false. The motivation in my first post describes how this can easily lead to an outage (if one does not inspect the plan output carefully). This bug is based on a true story. :cry:

I also saw another dangerous situation where we almost added useless DNS entries due to the 0.11->0.12 upgrade. The config was similar to the following: A variable var.some_machine_count defines how many machines we want. If we have at least one machine, we need DNS entries. The interesting part (after running 0.12upgrade) is this:

resource "google_dns_record_set" "some_dns_records" {
  count        = var.some_machine_count != 0 ? 1 : 0
  [...]
}

In a different module, some_machine_count was initialized as some_machine_count = "0" instead of some_machine_count = 0, which effectively creates DNS records we don't need. Even after running 0.12upgrade, some_machine_count stays "0" and does not become 0.

Therefore, I argue that something like count.index == "0" should not silently (i.e. I don't get a warning or type error) change its semantics when upgrading 0.11->0.12. Hence, 0 == "0" should either evaluate to true (as it did in 0.11) or raise a type error and stop me from continuing until I have fixed my config.

Thanks for the context! This is very helpful. Since this is concerning upgrades, I'm thinking the place for a warning is in the upgrade command itself, when we encounter a conditional that has a value that can be cast to an integer.

I've done a little digging into 0.12upgrade and I'm not sure how possible this is[1], but would getting a Warning: after an upgrade, which (if successfully added) point out the possibly problematic areas, be something that would be helpful in this case? (and is that essentially what you are describing in your final paragraph?).

Thinking something like:

Warning: Possible comparison error

  on example.tf line 25:
  25:   count = var.some_machine_count != "0" ? 1 : 0

This comparison is against a number represented as a string, and the comparison
may no longer work as intended, that is, 0 == "0" will return false.

Very sorry to hear about the unhappy story relating to this, and thank you for sharing your experience so that (hopefully) we can save others the same pain!

[1] that is, I'm presenting an idea, but disclosing that I am unsure if it is possible to implement this.

Thanks for your consideration!

Could adding a warning to the 0.12upgrade tool catch the following case?
var.some_machine_count == count.index, where the type of some_machine_count is not specified. In general, could this catch var.foo == var.bar without false positives? Can there be other cases where the type of a value is only known at runtime and the static 0.112upgrade tool can (by design) not catch this?

In my original use case, where the 0.11 config was accepted without complaint by 0.12, I did not run 0.12upgrade, so a warning in 0.12upgrade would not have caught the issue.

Could this warning be added to terraform validate?

In general, wouldn't it be safer to raise a type error when two values of different type are compared? The language now provides many type conversion functions, so requiring that we can only compare types of equal type would feel morally correct. It would also require .tf file authors to be explicit about types, which is a good thing.

@diekmann I'm going to bring this up with the team as regards changing the comparison behavior -- thank you so much for your engagement here, it's been very helpful!

I'll keep looking into if we could warn when a file is comparing count.index in tooling; as well, I think the upgrade guide would benefit from calling this out -- I couldn't find the mention of this breaking change when I looked there, or the changelog (at least not when hunting for "comparison"), so I'll look at that as well.

Thanks so much Pam!

Side note: it is not just count.index. In my google_dns_record_set-example, after 0.12upgrade, a hard-coded 0 was now a number, and it can also happen that normal input variables or locals become numbers and get compared to a string.

Was this page helpful?
0 / 5 - 0 ratings