Terraform-provider-google: Impossible to reliably create a GKE cluster using terraform

Created on 11 Sep 2018  ·  26 Comments  ·  Source: hashicorp/terraform-provider-google


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

❯❯❯ terraform -v
Terraform v0.11.8
+ provider.google v1.17.1

Affected Resource(s)

As far as I've tested, at least the following resources are affected:

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files

provider "google" {
  credentials = "${file(".account.json")}"
  project     = "example-001"
  region      = "europe-west1"
}

resource "google_container_cluster" "production-001" {
  name               = "production-001"
  zone               = "europe-west1-c"
  initial_node_count = 3

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}

resource "google_container_node_pool" "webpool-001" {
  name    = "webpool-001"
  cluster = "${google_container_cluster.production-001.name}"
  zone    = "europe-west1-c"

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  node_count = 3
  autoscaling {
    min_node_count = 3
    max_node_count = 10
  }

  management {
    auto_repair  = false
    auto_upgrade = false
  }
}

Debug Output

https://gist.github.com/vncntvandriessche/84c404a4950eb35abe6b3099ef8cc435

Panic Output

Expected Behavior

I expected Terraform to build the GKE cluster and attach the matching node pool without failures due to API errors.

Actual Behavior

We end up with a broken Terraform state because the API reports an error.

Steps to Reproduce

  1. terraform init
  2. terraform apply

Important Factoids

  • If we run apply again after this failure, Terraform fails because the pool already exists, even though it was never registered in the state.

References

  • #0000
Labels: bug, upstream

Most helpful comment

Cool, I also filed an issue internally against the team, so hopefully between your issue and mine, we'll be able to get to the bottom of this.

Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling, and then add autoscaling in after.

All 26 comments

I think I'm hitting this, but with a slightly different set of actual behaviors.

  • On the first run, Terraform creates the cluster and the node pool, and then panics as described.
  • Apparently unknown to Terraform, the cluster and node pool are created anyway.
  • The next run destroys the cluster and node pool, attempts to create them anew, and triggers the same panic.

If I look at the GKE web UI, sometimes it tells me it's resizing the master, other times that it's creating the node pool. Outside of Terraform, I've found that changing node pools can result in long apiserver unavailability while the master resizes.

For me, it pretty consistently fails at 13 minutes, which makes it look like there's a timeout. But the underlying code appears to have a 30-minute timeout, so that's an interesting discrepancy.

Testing some more: the google_container_cluster seems fine; it's the addition of google_container_node_pool that causes errors.

If I comment out google_container_node_pool, the apply works fine and I get a GKE cluster. But if I add it back in, the apply bombs out at 13 minutes, even though the node pool is created anyway. Subsequent applies remove the prior node pool, time out at 13 minutes again, and repeat the cycle.

I have faced the same issue. After some troubleshooting, I noticed that this error appears when the node pool has the autoscaling parameter. As a temporary fix, if you remove that node pool and add a node pool without autoscaling enabled, it should work.

Yes, this is an unfortunate error being returned from GKE because the configuration you're pushing is causing it to be unavailable at the 10m mark (which I believe is the current timeout). If you believe that @directionless is correct and that the apiserver will become available again sometime after that, you can increase the timeout for create (or update, if you're hitting this on update) to a sufficiently long window. As a non-k8s expert, I unfortunately can't say for sure, but it certainly feels right. :)
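
For reference, a minimal sketch of what that timeout override could look like on the node pool resource from this issue's config; the 60m values are only an illustrative guess at a "sufficiently long window":

resource "google_container_node_pool" "webpool-001" {
  # ... existing arguments unchanged ...

  # Give GKE more time to bring the apiserver back before Terraform gives up.
  # Illustrative values; pick whatever window fits your cluster.
  timeouts {
    create = "60m"
    update = "60m"
  }
}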

Google's Terraform provider cannot validate your GKE config - there are too many possible configurations for us to be confident we are blocking the ones that will not work while allowing all valid configs. The only change we can really make is to make sure that the node pool _does_ end up in state. I'm happy to add that. I'll try to figure that out and send a PR.

So I don't think it's a timeout issue; create already defaults to 30 minutes (we might want to set update to match):

        Timeouts: &schema.ResourceTimeout{
            Create: schema.DefaultTimeout(30 * time.Minute),
            Update: schema.DefaultTimeout(10 * time.Minute),
            Delete: schema.DefaultTimeout(10 * time.Minute),
        },

The problem seems to be that the API returns DONE with an error. The logs start overwriting each other, so I got this last message from mitmproxy:

{
    "detail": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "endTime": "2018-09-13T01:36:18.837939633Z",
    "name": "operation-1536801745861-34ba47a8",
    "operationType": "CREATE_NODE_POOL",
    "selfLink": "https://container.googleapis.com/v1beta1/projects/1111111/zones/australia-southeast1-a/operations/operation-1536801745861-34ba47a8",
    "startTime": "2018-09-13T01:22:25.861642499Z",
    "status": "DONE",
    "statusMessage": "All cluster resources were brought up, but the cluster API is reporting that: component \"kube-apiserver\" from endpoint \"gke-a2ef596d3e557814a5cb-2e7e\" is unhealthy\ngoroutine 425382131 [running]:\nruntime/debug.Stack(0xc01b85d51b, 0x3, 0x2dc1b7a)\n\tthird_party/go/gc/src/runtime/debug/stack.go:24 +0xa7\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).createErr(0x55277e0, 0xc0004fa380)\n\tcloud/kubernetes/engine/common/error_desc.go:199 +0x26\ngoogle3/cloud/kubernetes/engine/common/errdesc.(*GKEErrorDescriptor).WithDetail(0x55277e0, 0x312a4a0, 0xc0087d54e0, 0xc0087d54e0, 0x3121ac0)\n\tcloud/kubernetes/engine/common/error_desc.go:166 +0x40\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1.1(0x0, 0xc00f5b75c0)\n\tcloud/kubernetes/engine/common/healthcheck.go:141 +0x7bb\ngoogle3/cloud/kubernetes/engine/common/call.WithTimeout(0x318d620, 0xc017877770, 0x77359400, 0x8bb2c97000, 0xc024bedd08, 0xc017877770, 0xc012577180)\n\tcloud/kubernetes/engine/common/call.go:36 +0x153\ngoogle3/cloud/kubernetes/engine/common/healthcheck.glob..func1(0x318d620, 0xc017877770, 0xc024cac000, 0xc0021b1500, 0xc005eedc70, 0x8bb2c97000, 0x0, 0x0)\n\tcloud/kubernetes/engine/common/healthcheck.go:137 +0x33b\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify.func3(0xc002f1e180, 0x318d560, 0xc0173b4040, 0x7fd96c0576f8, 0xc00ea74880, 0xc021551d80, 0x0, 0xc026875ef0, 0xc024cac000, 0xc0021b1500, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:969 +0x1b3\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.upgradeMasterAndVerify(0x318d560, 0xc0173b4040, 0xc002f1e180, 0x7fd96c0576f8, 0xc00ea74880, 0xc024cac000, 0xc021551d80, 0x0, 0x1, 0x0, ...)\n\tcloud/kubernetes/engine/server/deploy/update.go:975 +0x13f\ngoogle3/cloud/kubernetes/engine/server/deploy/deploy.(*Deployer).recreateMasterReplicas.func2(0x0, 0x0)\n\tcloud/kubernetes/engine/server/deploy/update.go:546 +0x23c\ngoogle3/cloud/kubernetes/engine/common/errors.CollectFns.func1(0xc00ed534a0, 0xc0087f6c80)\n\tcloud/kubernetes/engine/common/errors.go:162 +0x27\ncreated by google3/cloud/kubernetes/engine/common/errors.CollectFns\n\tcloud/kubernetes/engine/common/errors.go:162 +0x82\n.",
    "targetLink": "https://container.googleapis.com/v1beta1/projects/111111/zones/australia-southeast1-a/clusters/jamie-test/nodePools/jamie-test-nodes",
    "zone": "australia-southeast1-a"
}

So DONE with a statusMessage is being passed back as a failure. Our choices would seem to be either to ignore the error and try to fetch the node pool information again from Google, or to figure out why the Google APIs changed to start returning a failure.

Note that this seems to happen regardless of the remove_default_node_pool setting, which I could otherwise see causing the API to not be ready yet.

I started looking at this again.

I ran terraform apply and, 12m 30s later, got the same error. This time I also noticed it in the web console, and the stack dump pretty clearly shows that the Kubernetes apiserver is failing its health check. (Y'all might have noticed that already.)

I opened a Google support case about it. Between that and the consistent 12m 30s, something seems fishy.

Google support says they can reproduce this, so that's positive. Meanwhile, I made a patch to ignore that error. I'll PR it if you want, but it's a bit ugly.

https://github.com/terraform-providers/terraform-provider-google/compare/master...directionless:workaround-2022

Though my apply now succeeds, I think I'm now running into https://github.com/terraform-providers/terraform-provider-google/issues/1712

Cool, I also filed an issue internally against the team, so hopefully between your issue and mine, we'll be able to get to the bottom of this.

Just in case it was lost in the comments, @cepefernando pointed out that this seems to only happen when autoscaling is configured, so one other thing to try would be to create the node pool without autoscaling, and then add autoscaling in after.

@danawillow I just created a cluster with one node pool without autoscaling and it was successful. I then added autoscaling to the existing cluster and it updated in place successfully. No errors, and Terraform kept the node pool in state.

It's an annoying way around the error but a working one for now.
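
For anyone else hitting this, here's a minimal sketch of that two-step workaround, using a trimmed-down version of the node pool from the top of this issue: apply it with the autoscaling block commented out first, then uncomment the block and apply again.

resource "google_container_node_pool" "webpool-001" {
  name       = "webpool-001"
  cluster    = "${google_container_cluster.production-001.name}"
  zone       = "europe-west1-c"
  node_count = 3

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  # Step 1: apply with this block commented out and wait for the pool to come up.
  # Step 2: uncomment it and apply again; the pool should update in place.
  # autoscaling {
  #   min_node_count = 3
  #   max_node_count = 10
  # }
}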

This happens using the Google console to create a new cluster as well.

Just got this issue without using Terraform too...

finished with error: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-c......" is unhealthy

EDIT: PEBKAC here, just keeping this comment for conversation context.

I am having a different issue which is potentially related: Creating a GKE cluster with Terraform creates no default node pool.

Terraform v0.11.7
Google provider v1.16

resource "google_container_cluster" "y" {
  name               = "y"
  project            = "${google_project.project.project_id}"
  zone               = "us-east1-b"

  additional_zones = [
    "us-east1-c",
    "us-east1-d"
  ]

  initial_node_count = 2

  maintenance_policy {
    daily_maintenance_window {
      start_time = "11:00"
    }
  }

  remove_default_node_pool = true
  node_config {

    machine_type = "n1-standard-1"

    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute"
    ]

    labels {
      stack = "x"
    }

    tags = [ "x" ]
  }
}

@Legogris you're setting remove_default_node_pool = true. Remove that line if you want the default node pool.

@JackFazackerley: Derp, I somehow managed to gloss over that line every time I looked at my template even as I edited it for pasting. Thanks.

@wibobm Happens via the console? That's super interesting. ~Do you happen to have a screenshot, or the specifics of what you set for that?~ Found a gcloud reproduction. Will write up more tonight.

FYI to all- I'm tracking this issue internally and the GKE team is working very hard on it. I'm leaving this issue open since it's not resolved yet, but the issue is not Terraform-specific. I'll update again once I have more I can say.

@danawillow Cool. Sounds like y'all have enough of a reproduction. My support ticket has been less productive :)

From the gcloud command line, it definitely seems like the resizing you pointed at.

Google Cloud support have just got back with a solution for the issue:

Description:
We are investigating an issue with Google Kubernetes Engine. Customers may receive error like: "All cluster resources were brought up, but the cluster API is reporting that: component kube-apiserver from endpoint gke-HASH is unhealthy" when they are creating NodePool with Autoscaling enabled on 1.9.x clusters. We will provide more information by Thursday, 2018-09-20 10:45 US/Pacific.

Workaround:
Customers can work around this by:

  1. Creating a NodePool without Autoscaling, then enabling Autoscaling once that's complete.
  2. Upgrade to 1.10.

I'm creating a 1.10 cluster and also have this issue.

@edevil oh... I'll get back to them. Cheers for trying.

@JackFazackerley Creating the node pool without autoscaling and enabling it afterwards worked, though.

I also see this with 1.10.

In addition, when this error occurs I also see another, potentially related behaviour where pods scheduled on the first node created (the same node as kube-dns) can't resolve any DNS queries. Pinging other pods works fine, though. It's a bit random, but maybe it helps someone. (similar report)

Google Cloud support got back to me again with the following:

The issue with Google Kubernetes Engine NodePool has been resolved for all affected users as of Saturday, 2018-09-22 09:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

I have tested this myself and it is working fine.

@JackFazackerley That's great news!

A big thanks to everyone who was involved with this issue! Never expected this to be handled so quickly.

I'll close this issue as I'd say this is no longer an issue.

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!
