Terraform-provider-google: Impossible to reliably create a GKE cluster using terraform

Created on 11 Sep 2018  ·  26 Comments  ·  Source: hashicorp/terraform-provider-google


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

❯❯❯ terraform -v
Terraform v0.11.8
+ provider.google v1.17.1

Affected Resource(s)

As far as I've tested, at least the following resources are affected:

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files

provider "google" {
  credentials = "${file(".account.json")}"
  project     = "example-001"
  region      = "europe-west1"
}

resource "google_container_cluster" "production-001" {
  name               = "production-001"
  zone               = "europe-west1-c"
  initial_node_count = 3

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}

resource "google_container_node_pool" "webpool-001" {
  name    = "webpool-001"
  cluster = "${google_container_cluster.production-001.name}"
  zone    = "europe-west1-c"

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  node_count = 3
  autoscaling {
    min_node_count = 3
    max_node_count = 10
  }

  management {
    auto_repair  = false
    auto_upgrade = false
  }
}

Debug Output

https://gist.github.com/vncntvandriessche/84c404a4950eb35abe6b3099ef8cc435

Panic Output

Expected Behavior

I expected Terraform to build the GKE cluster and attach the matching node pool without failures due to API errors.

Actual Behavior

We end up with a broken Terraform state because the API reports an error.

Steps to Reproduce

  1. terraform init
  2. terraform apply

Important Factoids

  • If we run apply again after this failure, Terraform fails because the pool already exists, even though it was never registered in the state.

References

  • #0000
Labels: bug, upstream

Most helpful comment

Cool, I also filed an issue internally against the team, so hopefully between your issue and mine, we'll be able to get to the bottom of this.

Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling, and then add autoscaling in after.

All 26 comments

I think I'm hitting this, but with a slightly different set of actual behaviors.

  • On the first run, Terraform creates the cluster and the node pool, and then panics as described.
  • Apparently unknown to Terraform, the cluster and node pool are created anyway.
  • The next run destroys the cluster and node pool, attempts to create them anew, and triggers the same panic.

If I look at the GKE web UI, sometimes it tells me it's resizing the master, other times that it's creating the node pool. Outside of Terraform, I've found that changing node pools can result in long apiserver unavailability while the master resizes.

For me, it pretty consistently fails at 13 minutes, which makes it look like there's a timeout. But the underlying code appears to have a 30-minute timeout, so that's an interesting discrepancy.

Testing some more: the google_container_cluster seems fine; it's the addition of google_container_node_pool that causes errors.

If I comment out google_container_node_pool, the apply works fine and I get a GKE cluster. But if I add it back in, the apply bombs out at 13 minutes, even though the node pool is created anyway. Subsequent applies remove the prior node pool, time out at 13 minutes again, and repeat the cycle.

I have faced the same issue. After some troubleshooting, I noticed that this error appears when the node pool has the autoscaling parameter. As a temporary fix, if you remove that node pool and add a node pool without autoscaling enabled, it should work.

Yes, this is an unfortunate error being returned from GKE because the configuration you're pushing is causing it to be unavailable at the 10m mark (which I believe is the current timeout). If you believe that @directionless is correct and that the apiserver will become available again sometime after that, you can increase the timeout for create (or update, if you're hitting this on update) to a sufficiently long window. As a non-k8s expert, I unfortunately can't say for sure, but it certainly feels right. :)
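
For reference, a minimal sketch of what that timeout override could look like on the node pool resource from this issue's config; the 60m values are only an illustrative guess at a "sufficiently long window":

resource "google_container_node_pool" "webpool-001" {
  # ... existing arguments unchanged ...

  # Give GKE more time to bring the apiserver back before Terraform gives up.
  # Illustrative values; pick whatever window fits your cluster.
  timeouts {
    create = "60m"
    update = "60m"
  }
}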

Google's Terraform provider cannot validate your GKE config - there are too many possible configurations for us to be confident we are blocking the ones that will not work while allowing all valid configs. The only change we can really make is to make sure that the node pool _does_ end up in state. I'm happy to add that. I'll try to figure that out and send a PR.

So I don't think it's a timeout issue; create already defaults to 30 minutes (we might want to set update to match):

        Timeouts: &schema.ResourceTimeout{
            Create: schema.DefaultTimeout(30 * time.Minute),
            Update: schema.DefaultTimeout(10 * time.Minute),
            Delete: schema.DefaultTimeout(10 * time.Minute),
        },

The problem seems to be that the API returns DONE with an error. The logs start overwriting each other, so I got this last message from mitmproxy:

{
    "detail": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "endTime": "2018-09-13T01:36:18.837939633Z",
    "name": "operation-1536801745861-34ba47a8",
    "operationType": "CREATE_NODE_POOL",
    "selfLink": "https://container.googleapis.com/v1beta1/projects/1111111/zones/australia-southeast1-a/operations/operation-1536801745861-34ba47a8",
    "startTime": "2018-09-13T01:22:25.861642499Z",
    "status": "DONE",
    "statusMessage": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "targetLink": "https://container.googleapis.com/v1beta1/projects/111111/zones/australia-southeast1-a/clusters/jamie-test/nodePools/jamie-test-nodes",
    "zone": "australia-southeast1-a"
}

So DONE with a statusMessage is being passed back as a failure. Our choices would seem to be either to ignore the error and try to fetch the node pool information again from Google, or to figure out why the Google APIs changed to start returning a failure.

Note that this seems to happen regardless of the remove_default_node_pool setting, which I could otherwise see causing the API to not be ready yet.

I started looking at this again.

I ran terraform apply and, 12m 30s later, got the same error. This time I also noticed it in the web console, and the stack dump pretty clearly shows that the Kubernetes apiserver is failing its health check. (Y'all might have noticed that already.)

I opened a Google support case about it. Between that and the consistent 12m 30s, something seems fishy.

Google support says they can reproduce this, so that's positive. Meanwhile, I made a patch to ignore that error. I'll PR it if you want, but it's a bit ugly.

https://github.com/terraform-providers/terraform-provider-google/compare/master...directionless:workaround-2022

Though my apply now succeeds, I think I'm now running into https://github.com/terraform-providers/terraform-provider-google/issues/1712

Cool, I also filed an issue internally against the team, so hopefully between your issue and mine, we'll be able to get to the bottom of this.

Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling, and then add autoscaling in after.

@danawillow I just created a cluster with one node pool without autoscaling and it was successful. I then added autoscaling to the existing cluster and it updated in place successfully. No errors, and Terraform kept the node pool in state.

It's an annoying way around the error but a working one for now.
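
For anyone else hitting this, here's a minimal sketch of that two-step workaround, using a trimmed-down version of the node pool from the top of this issue: apply it with the autoscaling block commented out first, then uncomment the block and apply again.

resource "google_container_node_pool" "webpool-001" {
  name       = "webpool-001"
  cluster    = "${google_container_cluster.production-001.name}"
  zone       = "europe-west1-c"
  node_count = 3

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  # Step 1: apply with this block commented out and wait for the pool to come up.
  # Step 2: uncomment it and apply again; the pool should update in place.
  # autoscaling {
  #   min_node_count = 3
  #   max_node_count = 10
  # }
}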

This happens using the Google console to create a new cluster as well.

Just got this issue without using Terraform too...

finished with error: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-c......" is unhealthy

EDIT: PEBKAC here, just keeping this comment for conversation context.

I am having a different issue which is potentially related: Creating a GKE cluster with Terraform creates no default node pool.

Terraform v0.11.7
Google provider v1.16

resource "google_container_cluster" "y" {
  name               = "y"
  project            = "${google_project.project.project_id}"
  zone               = "us-east1-b"

  additional_zones = [
    "us-east1-c",
    "us-east1-d"
  ]

  initial_node_count = 2

  maintenance_policy {
    daily_maintenance_window {
      start_time = "11:00"
    }
  }

  remove_default_node_pool = true
  node_config {

    machine_type = "n1-standard-1"

    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute"
    ]

    labels {
      stack = "x"
    }

    tags = [ "x" ]
  }
}

@Legogris you're setting remove_default_node_pool = true. Remove that line if you want the default node pool.

@JackFazackerley: Derp, I somehow managed to gloss over that line every time I looked at my template even as I edited it for pasting. Thanks.

@wibobm Happens via the console? That's super interesting. ~Do you happen to have a screenshot, or the specifics of what you set for that?~ Found a gcloud reproduction. Will write up more tonight.

FYI to all- I'm tracking this issue internally and the GKE team is working very hard on it. I'm leaving this issue open since it's not resolved yet, but the issue is not Terraform-specific. I'll update again once I have more I can say.

@danawillow Cool. Sounds like y'all have enough of a reproduction. My support ticket has been less productive :)

From the gcloud command line, it definitely seems like the resizing you pointed at.

Google Cloud support have just got back with a solution for the issue:

Description:
We are investigating an issue with Google Kubernetes Engine. Customers may receive error like: "All cluster resources were brought up, but the cluster API is reporting that: component kube-apiserver from endpoint gke-HASH is unhealthy" when they are creating NodePool with Autoscaling enabled on 1.9.x clusters. We will provide more information by Thursday, 2018-09-20 10:45 US/Pacific.

Workaround:
Customers can work around this by:

  1. Creating a NodePool without Autoscaling, then enabling Autoscaling once that's complete.
  2. Upgrade to 1.10.

I'm creating a 1.10 cluster and also have this issue.

@edevil oh... I'll get back to them. Cheers for trying.

@JackFazackerley Creating the node pool without autoscaling and enabling it afterwards worked, though.

I also see this with 1.10.

In addition, when this error occurs I also see another, potentially related behaviour where pods scheduled on the first node created (the same node as kube-dns) can't resolve any DNS queries. Pinging other pods works fine, though. It's a bit random, but maybe it helps someone. (similar report)

Google Cloud support got back to me again with the following:

The issue with Google Kubernetes Engine NodePool has been resolved for all affected users as of Saturday, 2018-09-22 09:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

I have tested this myself and it is working fine.

@JackFazackerley That's great news!

A big thanks to everyone who was involved with this issue! Never expected this to be handled so quickly.

I'll close this issue as I'd say this is no longer an issue.

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!
