Terraform-provider-google: Error creating resources using Private IPs in parallel.

Created on 16 Feb 2019 · 12 Comments · Source: hashicorp/terraform-provider-google


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.11.11

  • provider.google v2.0.0
  • provider.google-beta v2.0.0
  • provider.random v2.0.0

Affected Resource(s)

  • google_sql_database_instance
  • google_container_cluster
    Probably all resources that support Private IP.

Terraform Configuration Files

provider "google" {
  region = "${var.region}"
}

provider "google-beta" {
  region = "${var.region}"
}

variable "region" {
  default = "us-central1"
}

variable "org_id" {
  default = "*****"
}

variable "billing_account" {
  default = "*******"
}

variable "count" {
  default = 2
}

resource "random_id" "project" {
  byte_length = 4
  prefix      = "test-tf-project-"
}

resource "google_project" "project" {
  name                = "Test Terraform Project"
  project_id          = "${random_id.project.hex}"
  org_id              = "${var.org_id}"
  auto_create_network = false
  billing_account     = "${var.billing_account}"
}

resource "google_project_service" "networking" {
  project                    = "${google_project.project.project_id}"
  service                    = "servicenetworking.googleapis.com"
  disable_on_destroy         = false
  disable_dependent_services = true
}

resource "google_compute_network" "network" {
  description             = "Network"
  name                    = "test-network"
  auto_create_subnetworks = "false"
  project                 = "${google_project.project.project_id}"
}

resource "google_compute_global_address" "private_ip_alloc" {
  provider      = "google-beta"
  name          = "private-ip-alloc"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = "${google_compute_network.network.self_link}"
  project       = "${google_project_service.networking.project}"
}

resource "google_service_networking_connection" "connection" {
  provider                = "google-beta"
  network                 = "${google_compute_network.network.self_link}"
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = ["${google_compute_global_address.private_ip_alloc.name}"]
}

resource "random_id" "master" {
  byte_length = 4
  prefix      = "master-"
}

resource "google_sql_database_instance" "master" {
  count            = "${var.count}"
  name             = "${random_id.master.hex}-${count.index}"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  settings {
    tier      = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}

Debug Output

https://gist.github.com/yuvaldrori/034fd15acff47edf83af77dea885fa36

Panic Output

Expected Behavior

All resources should be created successfully. If the count variable is changed to 1, the apply succeeds.

Actual Behavior

Only one of the Cloud SQL instances is created successfully.

Steps to Reproduce

  1. terraform apply

Important Factoids

I tried a similar configuration with one Cloud SQL instance and one GKE cluster, and with multiple GKE clusters using private IPs - same results.

References

  • #0000
Labels: bug, upstream

Most helpful comment

Update - I've been talking with the private networking team and they are working on a fix for this. They let me know that this is happening because there is an entry that gets set up the first time any private networking feature is turned on for a project/network. Creating the 2 instances at the same time causes a collision setting up this singleton, so if you are able to set up 1 resource that uses private networking before creating others in parallel you should be able to work around this issue.

All 12 comments

I have tried a couple of different times and have been unable to reproduce this failure. Also, based on the API responses in your debug log, I can't see what the exact failure was because the Operation polling just shows "code": "UNKNOWN". Are you still able to consistently reproduce this error, and if so, can you look in the Cloud Console UI and see if there is a more detailed reason that the SQL instance failed to create?

@chrisst I just ran the TF script again: one instance succeeded and the other failed with the same unknown error. The UI says exactly the same thing: "An unknown error occurred".
I did open a ticket with Google support and they asked me to test it with a gcloud command - I was not able to reproduce the error with gcloud. The bash script I used:
#!/bin/bash

for i in 774yhf5 59swec6
do
  gcloud beta sql instances create gcloud-test-$i --async --database-version MYSQL_5_7 --tier db-f1-micro --region us-central1 --network network --project=some-project-name &
done

I'm still unable to reproduce your error with terraform, but using the UI I was unable to modify or delete multiple peering routes because of the error: "There is a peering operation in progress on the local or peer network. Try again later." It sounds like this could be what is happening with your config. Can you check http://console.cloud.google.com/networking/peering/list and http://console.cloud.google.com/networking/routes/list to see if there is a similar error on any of those automatically generated resources?

If there is, we should be able to tweak the lock on SQL operations to account for the peering operations as well. It won't solve cross-resource contention (SQL + GKE), but it should fix the count-based SQL clusters.

Sorry for the late reply, I had only just set up notifications. Anyway, I ran the TF script again and saw the same errors, and when I check the routes and peering lists everything looks OK:

gcloud alpha services vpc-peerings list --project test-tf-project-e93f7cc2 --network test-network
---
network: projects/221637507821/global/networks/test-network
peering: servicenetworking-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com
---
network: projects/221637507821/global/networks/test-network
peering: cloudsql-mysql-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com

gcloud compute routes list --project test-tf-project-e93f7cc2
NAME                            NETWORK       DEST_RANGE       NEXT_HOP                       PRIORITY
default-route-28a3d65e45473cb3  test-network  172.20.181.0/24  test-network                   1000
default-route-b6db25e614d13793  test-network  0.0.0.0/0        default-internet-gateway       1000
peering-route-339f952832fab934  test-network  192.168.0.0/24   cloudsql-mysql-googleapis-com  1000

I don't understand how I can hit this error every time while you cannot - what could be different in our setups?

Ok, I was able to reproduce this on a consistent basis by tearing down and spinning up the project and network connections at the same time. I was hoping this was something that could be controlled by locking SQL instance operations based on the project name, but I don't think that's the case. At this point, since it's only reproducible when other non-SQL resources are being created, it's not possible to identify this situation from within Terraform, so I'm not sure there's a good way to guard against it. I'll try to get a bug filed against the SQL API to see if they can provide retries or a better error for this.

Update - I've been talking with the private networking team and they are working on a fix for this. They let me know that this is happening because there is an entry that gets set up the first time any private networking feature is turned on for a project/network. Creating the 2 instances at the same time causes a collision setting up this singleton, so if you are able to set up 1 resource that uses private networking before creating others in parallel you should be able to work around this issue.
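
For reference, here is a minimal sketch of that workaround applied to the configuration above. The split into a "first" instance plus the remaining instances, and the depends_on, are illustrative additions and not part of the original report:

resource "google_sql_database_instance" "first" {
  name             = "${random_id.master.hex}-first"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  settings {
    tier = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}

resource "google_sql_database_instance" "rest" {
  count            = "${var.count - 1}"
  name             = "${random_id.master.hex}-${count.index}"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  # Wait until the first instance has finished, so the shared private
  # networking entry already exists before these are created in parallel.
  depends_on = ["google_sql_database_instance.first"]

  settings {
    tier = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}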

Can we get an update on this? We are running into it regularly when setting up databases for multiple environments, and we have to do two separate Terraform runs to work around it. The delay workaround does not really work in our case because we are using a module for Cloud SQL, and you cannot have one module wait on another (at least not in a simple, non-hacky way).
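
For what it's worth, the usual (admittedly hacky) way to order two modules in Terraform 0.11 is to thread an output of the first module into a variable of the second, so the second module's resources cannot be created before the first has been applied. This is only a sketch with hypothetical module paths, variable names, and outputs, assuming the module exposes an output (here db_self_link) derived from its google_sql_database_instance resource:

module "db_first" {
  source          = "./modules/cloudsql"   # hypothetical module
  instance_count  = 1
  private_network = "${google_service_networking_connection.connection.network}"
}

module "db_rest" {
  source          = "./modules/cloudsql"
  instance_count  = "${var.count - 1}"
  private_network = "${google_service_networking_connection.connection.network}"

  # Hypothetical pass-through variable: inside the module it must be
  # interpolated somewhere in the google_sql_database_instance resource
  # (e.g. concatenated into a name or label), otherwise Terraform 0.11
  # will not record the dependency for that resource.
  wait_on = "${module.db_first.db_self_link}"
}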

Hi, any updates on this...?

Sorry no update at this point. The upstream team is still working on it and I'll update if I see that anything has been resolved.

+1 for looking for a fix for this.

+1 for looking for a fix for this.
