$ terraform -v
Terraform v0.11.10
+ provider.google v1.19.1
+ provider.google-beta v1.19.0
See terraform logs.
When applying a plan that involves creating a new GKE cluster, terraform-provider-google polls container.googleapis.com/v1beta1/projects/<project_id>/locations/<region>/operations/<operation_id> every 10 seconds.
While polling for the status of the "cluster create" operation, terraform apply got stuck for 15 minutes waiting for a response from container.googleapis.com/v1beta1/projects/<project_id>/locations/<region>/operations/<operation_id>. It then timed out, resulting in an error applying the plan.
There is no easy or reliable way to reproduce this behavior, since we don't control the responses of container.googleapis.com. Applying the same plan again _could_ succeed.
Here is the relevant code where polling for the completion of the "create cluster" operation is kicked off:
The core of the "wait for state to reach X" logic is in https://github.com/terraform-providers/terraform-provider-google/blob/a4b26062b47ea3836066586a0bdde504d0e78a3b/google/container_operation.go#L62-L80.
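Paraphrased, the shape of that loop is roughly the following (the function and names here are mine, not the provider's):

```go
package sketch

import (
	"fmt"
	"time"

	container "google.golang.org/api/container/v1beta1"
)

// waitForOp paraphrases the wait loop in google/container_operation.go.
// The names here are illustrative, not the provider's.
func waitForOp(svc *container.Service, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// This is the Get(...).Do() call discussed below. If the
		// underlying HTTP request never returns, neither does this
		// loop iteration, regardless of the deadline above.
		op, err := svc.Projects.Locations.Operations.Get(name).Do()
		if err != nil {
			return err
		}
		if op.Status == "DONE" {
			return nil
		}
		time.Sleep(10 * time.Second) // the provider polls every 10 seconds
	}
	return fmt.Errorf("timed out waiting for operation %q", name)
}
```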
The code for resp, err := w.Service.Projects.Locations.Operations.Get(name).Do() is provided by the package google.golang.org/api/container/v1beta1.
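Rather than paste the generated code, here is a minimal standalone example of driving that same generated API (credentials handling and the operation name are placeholders); note that nothing in it sets a request timeout unless the caller does:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"golang.org/x/oauth2/google"
	container "google.golang.org/api/container/v1beta1"
)

func main() {
	ctx := context.Background()
	// The provider builds its own *http.Client in google/config.go;
	// default credentials are used here for brevity.
	client, err := google.DefaultClient(ctx, container.CloudPlatformScope)
	if err != nil {
		log.Fatal(err)
	}
	svc, err := container.New(client)
	if err != nil {
		log.Fatal(err)
	}
	// Same call chain the provider uses; "name" is a placeholder.
	name := "projects/PROJECT/locations/us-central1/operations/OPERATION"
	op, err := svc.Projects.Locations.Operations.Get(name).Do()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(op.Status)
}
```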
Ah, sorry, that's where it lived in v1.19.0 - we released 2.0.0 a few days ago and things are a bit different now. Still close, though! The PR I sent out should do it.
Thanks @ndmckinley for the quick response! I think there is another bug or problem, at least in 1.19 (I have not had a chance to look at 2.0.0 or master yet). @vbhavsar and I work together, and I think the logs attached to the original issue might not have fully explained what we thought was the bug.
When we created the cluster, the verbose output from Terraform shows that it is polling the GKE API to get the status of the operation (to know when it has completed). We observed that the polling stopped at a certain point:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: 2019/02/25 18:36:52 [DEBUG] Google API Request Details:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: ---[ REQUEST ]---------------------------------------
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: GET /v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false HTTP/1.1
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: Host: container.googleapis.com
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: User-Agent: google-api-go-client/0.5 Terraform/0.11.7 (+https://www.terraform.io) terraform-provider-google-beta/1.20.0
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: Accept-Encoding: gzip
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: -----------------------------------------------------
For 15 minutes we saw output like the following, where Terraform is waiting for the operation to complete but no additional polling of the GKE API to check the operation status is happening:
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "meta.count-boundary (count boundary fixup)", waiting for: "module.us-central1-b04t.google_container_cluster.spotify_gke_cluster"
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "module.us-central1-b04t.provider.google-beta (close)", waiting for: "module.us-central1-b04t.google_container_cluster.spotify_gke_cluster"
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "provisioner.local-exec (close)", waiting for: "module.us-central1-b04t.google_container_cluster.spotify_gke_cluster"
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "root", waiting for: "provisioner.local-exec (close)"
13:37:12 module.us-central1-b04t.google_container_cluster.spotify_gke_cluster: Still creating... (4m30s elapsed)
Then, finally, the Terraform operation as a whole aborts when the HTTP connection to the GKE API times out:
13:52:23 * google_container_cluster.spotify_gke_cluster: Error waiting for creating GKE cluster: Get https://container.googleapis.com/v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false: read tcp 10.99.0.3:35006->108.177.15.95:443: read: connection timed out
13:52:23 2019/02/25 18:52:23 [ERROR] root.us-central1-b04t: eval: *terraform.EvalSequence, err: 1 error(s) occurred:
13:52:23
13:52:23 * google_container_cluster.spotify_gke_cluster: Error waiting for creating GKE cluster: Get https://container.googleapis.com/v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false: read tcp 10.99.0.3:35006->108.177.15.95:443: read: connection timed out
13:52:23 2019/02/25 18:52:23 [TRACE] [walkApply] Exiting eval tree: module.us-central1-b04t.google_container_cluster.spotify_gke_cluster
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "meta.count-boundary (count boundary fixup)"
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "provisioner.local-exec (close)"
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "module.us-central1-b04t.provider.google-beta (close)"
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "root"
13:52:23 2019/02/25 18:52:23 [TRACE] Preserving existing state lineage "c57ad482-66eb-0ec0-7b0a-c8e829bc7578"
13:52:23 2019/02/25 18:52:23 [TRACE] Preserving existing state lineage "c57ad482-66eb-0ec0-7b0a-c8e829bc7578"
13:52:23 2019/02/25 18:52:23 [TRACE] Preserving existing state lineage "c57ad482-66eb-0ec0-7b0a-c8e829bc7578"
13:52:35
13:52:35 2019/02/25 18:52:34 [DEBUG] plugin: waiting for all plugin processes to complete...
13:52:35 Error: Error applying plan:
13:52:35
13:52:35 1 error(s) occurred:
13:52:35
13:52:35 * module.us-central1-b04t.google_container_cluster.spotify_gke_cluster: 1 error(s) occurred:
13:52:35
13:52:35 * google_container_cluster.spotify_gke_cluster: Error waiting for creating GKE cluster: Get https://container.googleapis.com/v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false: read tcp 10.99.0.3:35006->108.177.15.95:443: read: connection timed out
From what I can tell (again, this is in 1.19; I am not sure about master or 2.0.0 yet), when w.Service.Projects.Locations.Operations.Get(name).Do() is called by google/ContainerBetaOperationWaiter, that call constructs a google.golang.org/api/container/v1beta1 ProjectsLocationsOperationsGetCall with a nil Context. When ProjectsLocationsOperationsGetCall.Do() is called, it eventually leads to google.golang.org/api/gensupport.SendRequest calling http.Client.Do(req) with what looks like no connection or request timeout.
So I think there is another issue here (beyond certain requests being safe to retry): the HTTP request that checks the status of the operation can take 15 minutes or longer to hit a connection timeout.
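To illustrate the difference, here is a sketch of the two call styles against the generated client (this is not the provider's code; the 30-second value is arbitrary):

```go
package sketch

import (
	"context"
	"time"

	container "google.golang.org/api/container/v1beta1"
)

// getOperation contrasts the two call styles; svc and name are assumed
// to come from a setup like the earlier example.
func getOperation(svc *container.Service, name string) (*container.Operation, error) {
	// As called by the waiter today: no context, so the request is
	// bounded only by the http.Client's Timeout (zero, i.e. none, by
	// default) and ultimately the OS-level TCP read timeout, which is
	// where the ~15 minute hang comes from:
	//   return svc.Projects.Locations.Operations.Get(name).Do()

	// With a per-call deadline, a hung connection fails fast instead:
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return svc.Projects.Locations.Operations.Get(name).Context(ctx).Do()
}
```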
I agree that that's a problem! Unfortunately it's not something we can fix - those libraries are autogenerated and all we can do is file a bug internally. I'll do that, though. The only problem we can fix right now, in this repo, is that TF fails the apply when a timeout happens.
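Roughly, the shape of the fix on our side is to treat a timeout on that poll request as retryable rather than fatal; a sketch (the retry policy below is illustrative, not the exact change in the PR):

```go
package sketch

import (
	"net"
	"time"

	container "google.golang.org/api/container/v1beta1"
)

// getWithRetry sketches the provider-side mitigation: treat a timeout
// on the poll request as retryable instead of failing the whole apply.
func getWithRetry(svc *container.Service, name string, attempts int) (*container.Operation, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		op, err := svc.Projects.Locations.Operations.Get(name).Do()
		if err == nil {
			return op, nil
		}
		if netErr, ok := err.(net.Error); !ok || !netErr.Timeout() {
			return nil, err // not a timeout: give up immediately
		}
		lastErr = err
		time.Sleep(10 * time.Second) // wait before re-polling
	}
	return nil, lastErr
}
```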
@ndmckinley thanks! should I create a new issue in this repo to track that?
I'd recommend creating it in https://github.com/googleapis/google-api-go-client instead!
Is there any room to configure the http.Client instance used in google/config.go? If I understand correctly, the http.Client used by google.golang.org/api/container/v1beta1 and google.golang.org/api/container/v1 (and therefore by calls from terraform-provider-google like w.Service.Projects.Locations.Operations.Get(name).Do()) is the one created by google/config.go, so those calls would respect that client's Timeout value.
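Concretely, what I have in mind is something like this (a sketch only; the 30-second value is arbitrary):

```go
package sketch

import (
	"context"
	"time"

	"golang.org/x/oauth2/google"
	container "google.golang.org/api/container/v1beta1"
)

// newContainerService sketches the idea: if the shared *http.Client
// built in google/config.go set a Timeout, every generated call made
// through services constructed from it would inherit that bound.
func newContainerService(ctx context.Context) (*container.Service, error) {
	client, err := google.DefaultClient(ctx, container.CloudPlatformScope)
	if err != nil {
		return nil, err
	}
	client.Timeout = 30 * time.Second // bounds each request end-to-end
	return container.New(client)
}
```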
I think that might work! I'll test it.
We are seeing exactly the same issue, I think, but while trying to create a PostgreSQL instance. We see lots of HTTP requests checking the status of the operation, each reporting the status as "running"; then things seem to hang, and we later get "read: operation timed out".
Understood - that's surprising; I would expect to see this very rarely, but this change should help most instances of this problem.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!