Terraform-provider-google: Error creating resources using Private IPs in parallel.

Created on 16 Feb 2019 · 12 Comments · Source: hashicorp/terraform-provider-google


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.11.11

  • provider.google v2.0.0
  • provider.google-beta v2.0.0
  • provider.random v2.0.0

Affected Resource(s)

  • google_sql_database_instance
  • google_container_cluster
    Probably all resources that support Private IP.

Terraform Configuration Files

provider "google" {
  region = "${var.region}"
}

provider "google-beta" {
  region = "${var.region}"
}

variable "region" {
  default = "us-central1"
}

variable "org_id" {
  default = "*****"
}

variable "billing_account" {
  default = "*******"
}

variable "count" {
  default = 2
}

resource "random_id" "project" {
  byte_length = 4
  prefix      = "test-tf-project-"
}

resource "google_project" "project" {
  name                = "Test Terraform Project"
  project_id          = "${random_id.project.hex}"
  org_id              = "${var.org_id}"
  auto_create_network = false
  billing_account     = "${var.billing_account}"
}

resource "google_project_service" "networking" {
  project                    = "${google_project.project.project_id}"
  service                    = "servicenetworking.googleapis.com"
  disable_on_destroy         = false
  disable_dependent_services = true
}

resource "google_compute_network" "network" {
  description             = "Network"
  name                    = "test-network"
  auto_create_subnetworks = "false"
  project                 = "${google_project.project.project_id}"
}

resource "google_compute_global_address" "private_ip_alloc" {
  provider      = "google-beta"
  name          = "private-ip-alloc"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = "${google_compute_network.network.self_link}"
  project       = "${google_project_service.networking.project}"
}

resource "google_service_networking_connection" "connection" {
  provider                = "google-beta"
  network                 = "${google_compute_network.network.self_link}"
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = ["${google_compute_global_address.private_ip_alloc.name}"]
}

resource "random_id" "master" {
  byte_length = 4
  prefix      = "master-"
}

resource "google_sql_database_instance" "master" {
  count            = "${var.count}"
  name             = "${random_id.master.hex}-${count.index}"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  settings {
    tier      = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}

Debug Output

https://gist.github.com/yuvaldrori/034fd15acff47edf83af77dea885fa36

Panic Output

Expected Behavior

All resources should be created successfully. If the count variable is changed to 1, the apply succeeds.

Actual Behavior

Only one of the Cloud SQL instances is created successfully.

Steps to Reproduce

  1. terraform apply

Important Factoids

I tried a similar configuration with one Cloud SQL instance and one GKE cluster, and with multiple GKE clusters using private IPs - same results.

References

  • #0000
Labels: bug, upstream

Most helpful comment

Update - I've been talking with the private networking team and they are working on a fix for this. They let me know that this is happening because there is an entry that gets set up the first time any private networking feature is turned on for a project/network. Creating the 2 instances at the same time causes a collision setting up this singleton, so if you are able to set up 1 resource that uses private networking before creating others in parallel you should be able to work around this issue.

All 12 comments

I have tried a couple of different times and have been unable to reproduce this failure. Also, based on the API responses in your debug log, I can't see what the exact failure was because the Operation polling just shows "code": "UNKNOWN". Are you still able to consistently reproduce this error, and if so, can you look in the Cloud Console UI and see if there is a more detailed reason that the SQL instance failed to create?

@chrisst I just ran the TF script again: one instance succeeded and the other failed with the same unknown error. The UI says exactly the same thing: "An unknown error occurred".
I did open a ticket with Google support and they asked me to test it with a gcloud command - I was not able to reproduce the error with gcloud. The bash script I used:
#!/bin/bash

for i in 774yhf5 59swec6
do
  gcloud beta sql instances create gcloud-test-$i --async --database-version MYSQL_5_7 --tier db-f1-micro --region us-central1 --network network --project=some-project-name &
done

I'm still unable to reproduce your error with terraform, but using the UI I was unable to modify or delete multiple peering routes because of the error: "There is a peering operation in progress on the local or peer network. Try again later." It sounds like this could be what is happening with your config. Can you check http://console.cloud.google.com/networking/peering/list and http://console.cloud.google.com/networking/routes/list to see if there is a similar error on any of those automatically generated resources?

If there is, we should be able to tweak the lock on SQL operations to account for the peering operations as well. It won't solve cross-resource contention (SQL + GKE), but it should fix the count-based SQL clusters.

Sorry for the late reply, I had only just set up notifications. Anyway, I ran the TF script again and saw the same errors, and when I check the routes and peering lists everything looks OK:

gcloud alpha services vpc-peerings list --project test-tf-project-e93f7cc2 --network test-network
---
network: projects/221637507821/global/networks/test-network
peering: servicenetworking-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com
---
network: projects/221637507821/global/networks/test-network
peering: cloudsql-mysql-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com

gcloud compute routes list --project test-tf-project-e93f7cc2
NAME                            NETWORK       DEST_RANGE       NEXT_HOP                       PRIORITY
default-route-28a3d65e45473cb3  test-network  172.20.181.0/24  test-network                   1000
default-route-b6db25e614d13793  test-network  0.0.0.0/0        default-internet-gateway       1000
peering-route-339f952832fab934  test-network  192.168.0.0/24   cloudsql-mysql-googleapis-com  1000

I don't understand how I can hit this error every time while you cannot - what could be different in our setups?

Ok, I was able to reproduce this on a consistent basis by tearing down and spinning up the project and network connections at the same time. I was hoping this was something that could be controlled by locking SQL instance operations based on the project name, but I don't think that's the case. At this point, since it's only reproducible when other non-SQL resources are being created, it's not possible to identify this situation from within Terraform, so I'm not sure there's a good way to guard against it. I'll try to get a bug filed against the SQL API to see if they can provide retries or a better error for this.

Update - I've been talking with the private networking team and they are working on a fix for this. They let me know that this is happening because there is an entry that gets set up the first time any private networking feature is turned on for a project/network. Creating the 2 instances at the same time causes a collision setting up this singleton, so if you are able to set up 1 resource that uses private networking before creating others in parallel you should be able to work around this issue.
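
For reference, here is a minimal sketch of that workaround applied to the configuration above. The split into a "first" instance plus the remaining instances, and the depends_on, are illustrative additions and not part of the original report:

resource "google_sql_database_instance" "first" {
  name             = "${random_id.master.hex}-first"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  settings {
    tier = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}

resource "google_sql_database_instance" "rest" {
  count            = "${var.count - 1}"
  name             = "${random_id.master.hex}-${count.index}"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  # Wait until the first instance has finished, so the shared private
  # networking entry already exists before these are created in parallel.
  depends_on = ["google_sql_database_instance.first"]

  settings {
    tier = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}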

Can we get an update on this? We are running into it regularly when setting up databases for multiple environments, and we have to do two separate Terraform runs to work around it. The delay workaround does not really work in our case because we are using a module for Cloud SQL, and you cannot have one module wait on another (at least not in a simple, non-hacky way).
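
For what it's worth, the usual (admittedly hacky) way to order two modules in Terraform 0.11 is to thread an output of the first module into a variable of the second, so the second module's resources cannot be created before the first has been applied. This is only a sketch with hypothetical module paths, variable names, and outputs, assuming the module exposes an output (here db_self_link) derived from its google_sql_database_instance resource:

module "db_first" {
  source          = "./modules/cloudsql"   # hypothetical module
  instance_count  = 1
  private_network = "${google_service_networking_connection.connection.network}"
}

module "db_rest" {
  source          = "./modules/cloudsql"
  instance_count  = "${var.count - 1}"
  private_network = "${google_service_networking_connection.connection.network}"

  # Hypothetical pass-through variable: inside the module it must be
  # interpolated somewhere in the google_sql_database_instance resource
  # (e.g. concatenated into a name or label), otherwise Terraform 0.11
  # will not record the dependency for that resource.
  wait_on = "${module.db_first.db_self_link}"
}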

Hi, any updates on this...?

Sorry no update at this point. The upstream team is still working on it and I'll update if I see that anything has been resolved.

+1 for looking for a fix for this.

+1 for looking for a fix for this.
