Terraform-provider-google: Replica Cloud SQL instances fail to be created

Created on 2 Mar 2018 · 12 Comments · Source: hashicorp/terraform-provider-google

I am running into this issue reliably. We are provisioning multiple replicas and a failover for a master database. At least one (usually more) of the replica or failover instances fails to be created. The failed instances end up in a failed state in the GCP console and cannot be removed.

It seems to be a timing issue, or Terraform not waiting for the master to actually be ready, because we can work around it by creating the master, waiting a couple of minutes, and then creating the replicas. For this example the workaround looks like:

terraform apply -target=google_sql_database_instance.master
# wait a couple minutes
terraform apply -target=google_sql_database_instance.failover
terraform apply -target=google_sql_database_instance.replica
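
The fixed "wait a couple minutes" step could be replaced with polling. This is a minimal sketch, assuming the gcloud CLI is installed and authenticated against the same project. `get_state` and `wait_for_runnable` are hypothetical helper names, not part of Terraform or gcloud. Whether a RUNNABLE master is actually enough for replica creation to succeed is an assumption; nothing here establishes what "ready" means for Cloud SQL.

```shell
#!/bin/sh
# Sketch: poll a Cloud SQL instance until it reports RUNNABLE,
# instead of sleeping for a fixed amount of time.

# Print the current state of instance $1 (e.g. PENDING_CREATE, RUNNABLE).
# Wrapped in a function so it can be swapped out.
get_state() {
  gcloud sql instances describe "$1" --format='value(state)'
}

# wait_for_runnable NAME [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 once the instance reports RUNNABLE, 1 if attempts run out.
wait_for_runnable() {
  local name="$1" max="${2:-30}" delay="${3:-10}" i=1 state
  while [ "$i" -le "$max" ]; do
    # Treat a failed lookup as "not ready yet".
    state="$(get_state "$name" 2>/dev/null)" || state=""
    if [ "$state" = "RUNNABLE" ]; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$max" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (the instance name is whatever random_id generated; placeholder here):
#   terraform apply -target=google_sql_database_instance.master
#   wait_for_runnable "database-xxxxxxxx" &&
#     terraform apply -target=google_sql_database_instance.failover &&
#     terraform apply -target=google_sql_database_instance.replica
```

The defaults (30 attempts, 10 seconds apart) are arbitrary and give roughly five minutes before giving up.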

Terraform Version

$ terraform -v
Terraform v0.11.3
+ provider.google v1.6.0
+ provider.random v1.1.0
+ provider.zerotier (unversioned)

Affected Resource(s)

google_sql_database_instance

Terraform Configuration Files

locals {
  region  = "us-east1"
  project = "staging-af5b7922"
}

provider "google" {
  version = "~> 1.6.0"
  region  = "us-east1"
  project = "${local.project}"
}

provider "random" {
  version = "~> 1.1.0"
}

resource "random_id" "database" {
  byte_length = 4
  prefix      = "database-"
}

resource "google_sql_database_instance" "master" {
  name             = "${random_id.database.hex}"
  database_version = "MYSQL_5_7"
  region           = "${local.region}"

  settings {
    tier             = "db-n1-standard-2"
    disk_size        = 20
    replication_type = "SYNCHRONOUS"

    backup_configuration {
      binary_log_enabled = true
      enabled            = true
    }
  }
}

resource "google_sql_database_instance" "failover" {
  name                 = "${random_id.database.hex}-failover"
  database_version     = "MYSQL_5_7"
  master_instance_name = "${google_sql_database_instance.master.name}"
  region               = "${local.region}"

  settings {
    tier                   = "db-n1-standard-2"
    replication_type       = "SYNCHRONOUS"
    crash_safe_replication = true
    disk_size              = 20
  }

  replica_configuration {
    failover_target = true
  }
}

resource "google_sql_database_instance" "replica" {
  name                 = "${random_id.database.hex}-replica-${count.index}"
  database_version     = "MYSQL_5_7"
  master_instance_name = "${google_sql_database_instance.master.name}"
  region               = "${local.region}"
  count                = 1

  settings {
    tier                   = "db-n1-standard-2"
    replication_type       = "SYNCHRONOUS"
    crash_safe_replication = true
    disk_size              = 100
  }
}

Debug Output

https://gist.github.com/andyshinn/93bd82100be2a77c080e94a64a111bf6

Panic Output

No panic.

Expected Behavior

Replica databases created without error.

Actual Behavior

When multiple replica Cloud SQL instances are created at once, at least one usually fails, and the failed instance cannot be re-created.

Steps to Reproduce

terraform apply

Important Factoids

Nothing out of the ordinary. Standard GCP project created in Terraform.

References

This is possibly the same as #1069 and #1083, but both are closed and I'm not sure it is the same problem, so I am opening a new issue. If it is the same, I'm happy to continue the conversation in one of them and close this.

upstream

Source

andyshinn

All 12 comments

I've seen something like this before, yeah. I think this will be a tricky one, I'll try to dig in.

ndmckinley on 2 Mar 2018

👍 Thanks for the quick response. Let me know if there is any way I can assist.

andyshinn on 2 Mar 2018

Thanks. I've got a consistent minimal repro and I hope that'll help ... but it may take a little while.

ndmckinley on 3 Mar 2018

👍1

Good news! An internal bug is open and there are people who work on the Cloud SQL systems working on root-causing it. :) I'll tag this "upstream", and keep it updated.

ndmckinley on 3 Mar 2018

🎉2 👍1

Wanted to let you all know they're still working it internally.

ndmckinley on 12 Mar 2018

👍4

Any updates on this, @ndmckinley?

zandersmith-wp on 30 Mar 2018

There's movement and people are working on it, but it's still going to be a while. This problem is not unique to Terraform; it's a general "creating multiple replicas at once" issue, which requires a general fix.

ndmckinley on 2 Apr 2018

👍1

Been a while, just wanted to reach out and see if there were any updates.

Bhuwan on 8 Feb 2019

Unfortunately, the upstream bug is not fixed yet. We did submit a fix a while ago which should cause terraform to retry this when it happens. I'll mark this as closed, and if people see the issue again, they can comment here, or else open a new issue referring to this one.

ndmckinley on 8 Feb 2019

👍1

Do you recall which version of Terraform and the Google provider has the fix?

andyshinn on 8 Feb 2019

Yeah, this was submitted in https://github.com/terraform-providers/terraform-provider-google/blob/master/CHANGELOG.md#1190-october-08-2018, under bugfixes. So if you're still seeing the issue in 1.19, let us know!
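
To check whether a working directory already uses a provider release with the retry fix, the version can be read from `terraform -v`. This is a rough sketch that assumes the output format shown in the issue description (`+ provider.google v1.6.0`); other Terraform releases may print the provider line differently. `google_provider_version` and `version_at_least` are hypothetical helper names. Note that the `version = "~> 1.6.0"` constraint in the original configuration only admits 1.6.x releases, so it would need to be relaxed before 1.19 can be installed.

```shell
#!/bin/sh
# Sketch: extract the google provider version from `terraform -v` output
# and compare it against the release that carries the retry fix (1.19.0).

# Reads `terraform -v` output on stdin, prints e.g. "1.6.0".
google_provider_version() {
  sed -n 's/^[+] provider\.google v\([0-9][0-9.]*\).*/\1/p'
}

# version_at_least HAVE WANT
# Succeeds if HAVE >= WANT, comparing three dot-separated numeric fields.
version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# Usage:
#   v="$(terraform -v | google_provider_version)"
#   version_at_least "$v" 1.19.0 ||
#     echo "provider.google $v predates the retry fix in 1.19.0"
```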

ndmckinley on 8 Feb 2019
