$ terraform -v
Terraform v0.11.10
+ provider.google v1.19.1
+ provider.google-beta v1.19.0
See terraform logs.
When applying a plan that involves creating a new GKE cluster, terraform-provider-google polls container.googleapis.com/v1beta1/projects/<project_id>/locations/<region>/operations/<operation_id> every 10 seconds.
While polling for the status of the "cluster create" operation, terraform apply got stuck for 15 minutes waiting for a response from container.googleapis.com/v1beta1/projects/<project_id>/locations/<region>/operations/<operation_id>. It then timed out, resulting in an error applying the plan.
There is no easy or reliable way to reproduce this behavior, since we don't control the responses of container.googleapis.com. Applying the same plan again _could_ succeed.
Here is the relevant code where polling for the completion of the "create cluster" operation is kicked off:
The core of the "wait for state to reach X" logic is in https://github.com/terraform-providers/terraform-provider-google/blob/a4b26062b47ea3836066586a0bdde504d0e78a3b/google/container_operation.go#L62-L80.
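Paraphrased, the shape of that loop is roughly the following (the function and names here are mine, not the provider's):

```go
package sketch

import (
	"fmt"
	"time"

	container "google.golang.org/api/container/v1beta1"
)

// waitForOp paraphrases the wait loop in google/container_operation.go.
// The names here are illustrative, not the provider's.
func waitForOp(svc *container.Service, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// This is the Get(...).Do() call discussed below. If the
		// underlying HTTP request never returns, neither does this
		// loop iteration, regardless of the deadline above.
		op, err := svc.Projects.Locations.Operations.Get(name).Do()
		if err != nil {
			return err
		}
		if op.Status == "DONE" {
			return nil
		}
		time.Sleep(10 * time.Second) // the provider polls every 10 seconds
	}
	return fmt.Errorf("timed out waiting for operation %q", name)
}
```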
The code for resp, err := w.Service.Projects.Locations.Operations.Get(name).Do() is provided by the package google.golang.org/api/container/v1beta1.
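Rather than paste the generated code, here is a minimal standalone example of driving that same generated API (credentials handling and the operation name are placeholders); note that nothing in it sets a request timeout unless the caller does:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"golang.org/x/oauth2/google"
	container "google.golang.org/api/container/v1beta1"
)

func main() {
	ctx := context.Background()
	// The provider builds its own *http.Client in google/config.go;
	// default credentials are used here for brevity.
	client, err := google.DefaultClient(ctx, container.CloudPlatformScope)
	if err != nil {
		log.Fatal(err)
	}
	svc, err := container.New(client)
	if err != nil {
		log.Fatal(err)
	}
	// Same call chain the provider uses; "name" is a placeholder.
	name := "projects/PROJECT/locations/us-central1/operations/OPERATION"
	op, err := svc.Projects.Locations.Operations.Get(name).Do()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(op.Status)
}
```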
Ah, sorry, that's where it lived in v1.19.0 - we released 2.0.0 a few days ago and things are a bit different now. Still close, though! The PR I sent out should do it.
Thanks @ndmckinley for the quick response! I think there is another bug or problem, at least in 1.19 (I have not had a chance to look at 2.0.0 or master yet). @vbhavsar and I work together, and I think the logs attached to the original issue might not have fully explained what we thought was the bug.
When we created the cluster, the verbose output from Terraform shows that it is polling the GKE API to get the status of the operation (to know when it has completed). We observed that the polling stopped at a certain point:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: 2019/02/25 18:36:52 [DEBUG] Google API Request Details:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: ---[ REQUEST ]---------------------------------------
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: GET /v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false HTTP/1.1
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: Host: container.googleapis.com
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: User-Agent: google-api-go-client/0.5 Terraform/0.11.7 (+https://www.terraform.io) terraform-provider-google-beta/1.20.0
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: Accept-Encoding: gzip
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4:
13:36:53 2019-02-25T18:36:52.019Z [DEBUG] plugin.terraform-provider-google-beta_v1.20.0_x4: -----------------------------------------------------
For 15 minutes we saw output like the following, where Terraform is waiting for the operation to complete but no additional polling of the GKE API to check the operation status is happening:
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "meta.count-boundary (count boundary fixup)", waiting for: "module.us-central1-b04t.google_container_cluster.spotify_gke_cluster"
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "module.us-central1-b04t.provider.google-beta (close)", waiting for: "module.us-central1-b04t.google_container_cluster.spotify_gke_cluster"
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "provisioner.local-exec (close)", waiting for: "module.us-central1-b04t.google_container_cluster.spotify_gke_cluster"
13:37:12 2019/02/25 18:37:12 [TRACE] dag/walk: vertex "root", waiting for: "provisioner.local-exec (close)"
13:37:12 module.us-central1-b04t.google_container_cluster.spotify_gke_cluster: Still creating... (4m30s elapsed)
Then, finally, the Terraform operation as a whole aborts when the HTTP connection to the GKE API times out:
13:52:23 * google_container_cluster.spotify_gke_cluster: Error waiting for creating GKE cluster: Get https://container.googleapis.com/v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false: read tcp 10.99.0.3:35006->108.177.15.95:443: read: connection timed out
13:52:23 2019/02/25 18:52:23 [ERROR] root.us-central1-b04t: eval: *terraform.EvalSequence, err: 1 error(s) occurred:
13:52:23
13:52:23 * google_container_cluster.spotify_gke_cluster: Error waiting for creating GKE cluster: Get https://container.googleapis.com/v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false: read tcp 10.99.0.3:35006->108.177.15.95:443: read: connection timed out
13:52:23 2019/02/25 18:52:23 [TRACE] [walkApply] Exiting eval tree: module.us-central1-b04t.google_container_cluster.spotify_gke_cluster
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "meta.count-boundary (count boundary fixup)"
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "provisioner.local-exec (close)"
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "module.us-central1-b04t.provider.google-beta (close)"
13:52:23 2019/02/25 18:52:23 [TRACE] dag/walk: upstream errored, not walking "root"
13:52:23 2019/02/25 18:52:23 [TRACE] Preserving existing state lineage "c57ad482-66eb-0ec0-7b0a-c8e829bc7578"
13:52:23 2019/02/25 18:52:23 [TRACE] Preserving existing state lineage "c57ad482-66eb-0ec0-7b0a-c8e829bc7578"
13:52:23 2019/02/25 18:52:23 [TRACE] Preserving existing state lineage "c57ad482-66eb-0ec0-7b0a-c8e829bc7578"
13:52:35
13:52:35 2019/02/25 18:52:34 [DEBUG] plugin: waiting for all plugin processes to complete...
13:52:35 Error: Error applying plan:
13:52:35
13:52:35 1 error(s) occurred:
13:52:35
13:52:35 * module.us-central1-b04t.google_container_cluster.spotify_gke_cluster: 1 error(s) occurred:
13:52:35
13:52:35 * google_container_cluster.spotify_gke_cluster: Error waiting for creating GKE cluster: Get https://container.googleapis.com/v1beta1/projects/PROJECT/locations/us-central1/operations/operation-1551119564913-7ab8cd44?alt=json&prettyPrint=false: read tcp 10.99.0.3:35006->108.177.15.95:443: read: connection timed out
From what I can tell (again, this is in 1.19; I am not sure about master or 2.0.0 yet), when w.Service.Projects.Locations.Operations.Get(name).Do() is called by google/ContainerBetaOperationWaiter, that call constructs a google.golang.org/api/container/v1beta1 ProjectsLocationsOperationsGetCall with a nil Context. When ProjectsLocationsOperationsGetCall.Do() is called, it eventually leads to google.golang.org/api/gensupport.SendRequest calling http.Client.Do(req) with what looks like no connection or request timeout.
So I think there is another issue here (beyond certain requests being safe to retry): the HTTP request that checks the status of the operation can take 15 minutes or longer to hit a connection timeout.
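To illustrate the difference, here is a sketch of the two call styles against the generated client (this is not the provider's code; the 30-second value is arbitrary):

```go
package sketch

import (
	"context"
	"time"

	container "google.golang.org/api/container/v1beta1"
)

// getOperation contrasts the two call styles; svc and name are assumed
// to come from a setup like the earlier example.
func getOperation(svc *container.Service, name string) (*container.Operation, error) {
	// As called by the waiter today: no context, so the request is
	// bounded only by the http.Client's Timeout (zero, i.e. none, by
	// default) and ultimately the OS-level TCP read timeout, which is
	// where the ~15 minute hang comes from:
	//   return svc.Projects.Locations.Operations.Get(name).Do()

	// With a per-call deadline, a hung connection fails fast instead:
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return svc.Projects.Locations.Operations.Get(name).Context(ctx).Do()
}
```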
I agree that that's a problem! Unfortunately it's not something we can fix - those libraries are autogenerated and all we can do is file a bug internally. I'll do that, though. The only problem we can fix right now, in this repo, is that TF fails the apply when a timeout happens.
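Roughly, the shape of the fix on our side is to treat a timeout on that poll request as retryable rather than fatal; a sketch (the retry policy below is illustrative, not the exact change in the PR):

```go
package sketch

import (
	"net"
	"time"

	container "google.golang.org/api/container/v1beta1"
)

// getWithRetry sketches the provider-side mitigation: treat a timeout
// on the poll request as retryable instead of failing the whole apply.
func getWithRetry(svc *container.Service, name string, attempts int) (*container.Operation, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		op, err := svc.Projects.Locations.Operations.Get(name).Do()
		if err == nil {
			return op, nil
		}
		if netErr, ok := err.(net.Error); !ok || !netErr.Timeout() {
			return nil, err // not a timeout: give up immediately
		}
		lastErr = err
		time.Sleep(10 * time.Second) // wait before re-polling
	}
	return nil, lastErr
}
```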
@ndmckinley thanks! should I create a new issue in this repo to track that?
I'd recommend creating it in https://github.com/googleapis/google-api-go-client instead!
Is there any room to configure the http.Client instance used in google/config.go? If I understand correctly, the http.Client used by google.golang.org/api/container/v1beta1 and google.golang.org/api/container/v1 (and therefore by calls from terraform-provider-google like w.Service.Projects.Locations.Operations.Get(name).Do()) is the one created by google/config.go, so those calls would respect that client's Timeout value.
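Concretely, what I have in mind is something like this (a sketch only; the 30-second value is arbitrary):

```go
package sketch

import (
	"context"
	"time"

	"golang.org/x/oauth2/google"
	container "google.golang.org/api/container/v1beta1"
)

// newContainerService sketches the idea: if the shared *http.Client
// built in google/config.go set a Timeout, every generated call made
// through services constructed from it would inherit that bound.
func newContainerService(ctx context.Context) (*container.Service, error) {
	client, err := google.DefaultClient(ctx, container.CloudPlatformScope)
	if err != nil {
		return nil, err
	}
	client.Timeout = 30 * time.Second // bounds each request end-to-end
	return container.New(client)
}
```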
I think that might work! I'll test it.
We are seeing exactly the same issue, I think, but while trying to create a PostgreSQL instance. We see lots of HTTP requests checking the status of the operation, each reporting the status as "running"; then things seem to hang, and we later get "read: operation timed out".
Understood - that's surprising; I would expect to see this very rarely, but this change should help most instances of this problem.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!