Terraform-provider-azurerm: AKS node pool k8s version not being updated

Created on 28 Jan 2020 · 19 Comments · Source: terraform-providers/terraform-provider-azurerm

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

  • terraform version: 0.12.8
  • azurerm provider version: 1.41

Affected Resource(s)

  • azurerm_kubernetes_cluster

Terraform Configuration Files

resource "azurerm_kubernetes_cluster" "k8s" {
  name                = var.cluster_info.name
  location            = azurerm_resource_group.k8s.location
  dns_prefix          = var.cluster_info.dns_prefix
  resource_group_name = azurerm_resource_group.k8s.name
  kubernetes_version  = var.kubernetes_version

  role_based_access_control {
    enabled = ! local.aks_aad_skip_rbac
    dynamic "azure_active_directory" {
      for_each = "${! local.aks_aad_skip_rbac ? list(local.aks_rbac_setting) : []}"
      content {
        client_app_id     = local.aks_rbac_setting.client_app_id
        server_app_id     = local.aks_rbac_setting.server_app_id
        server_app_secret = local.aks_rbac_setting.server_app_secret
        tenant_id         = local.aks_rbac_setting.tenant_id
      }
    }
  }

  default_node_pool {
    name               = var.agent_pool.name
    node_count         = var.agent_pool.count
    vm_size            = "Standard_DS2_v2"
    type               = "VirtualMachineScaleSets"
    os_disk_size_gb    = 30
    max_pods           = 30
    availability_zones = local.sanitized_availability_zones

    enable_auto_scaling = true
    min_count           = 3
    max_count           = 12
  }

  service_principal {
    client_id     = var.aks_login.client_id
    client_secret = var.aks_login.client_secret
  }

  addon_profile {
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = azurerm_log_analytics_workspace.k8s_logs.id
    }
  }

  tags = {
    CreatedBy   = var.tags.created_by != "" ? var.tags.created_by : null
    ChangedBy   = var.tags.changed_by != "" ? var.tags.changed_by : null
    Environment = var.tags.environment != "" ? var.tags.environment : null
  }

  lifecycle {
    ignore_changes = [
      role_based_access_control, 
      role_based_access_control["azure_active_directory"],
      agent_pool_profile.0.count]
  }

  network_profile {
    network_plugin    = "azure"
    load_balancer_sku = var.aks_load_balancer_sku
  }
}

Expected Behavior

With azurerm_provider == 1.39, the kubelet version in our cluster's node pool would be updated (in addition to the AKS k8s version) by changing var.kubernetes_version

Actual Behavior

With azurerm_provider == 1.41, the kubelet version in our cluster's node pool is not updated (in addition to the AKS k8s version) by changing var.kubernetes_version

Important Factoids

  • Switching back to azurerm_provider == 1.39 and doing terraform apply seems to perform the expected behavior, even if the exact same config was previously run with azurerm_provider == 1.41.
Labels: bug, service/kubernetes-cluster

Most helpful comment

Using azurerm provider at version "=2.1.0", I've upgraded an azurerm_kubernetes_cluster resource from 1.14 to 1.15. The Control Plane has seemed to upgrade to 1.15, but the VMSS node pool has stayed behind at 1.14.

What is the current expected behavior for Kubernetes upgrades and their node pools? I understand that (https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#validation-rules-for-upgrades) expresses certain validation conditions.

Will the node pool upgrade when it's obligated to, or can we control this with an input as @jstevans suggested?

All 19 comments

Just checked -- the issue also occurs in azurerm_provider == 1.40.

EDIT: I take it back. I've been git bisecting, and v1.40.0 doesn't seem to reproduce the issue when I build it myself. I'm starting over from v1.39.0..v1.41.0.

EDIT2: Double-takeback, there's a bug in my testbed. Re-starting.

A git bisect v1.39.0..v1.40.0 points to 87de8aec2f9eb9c23ff080bbcd6d68abc6fe82ea as the cause of the issue. The code changes in that commit seem unrelated, so I'd guess it has to do with the azure-sdk container-service version bump.

I'm having trouble finding canonical documentation on what changed between azure-sdk/container-service API versions 2019-06-01 and 2019-10-01, so I checked what the az CLI does on an az aks nodepool upgrade. It looks like OrchestratorVersion is settable when updating agent pools, so that's probably the fix.

If I were to guess, support for multiple agent pools is a new feature in AKS, so that probably drove separating the agent pool k8s upgrade logic from the AKS k8s upgrade logic.

@tombuildsstuff I'd like to contribute the PR, but would appreciate some guidance.

  • Would we rather

    • keep the old behavior for now, of coupling the agent pool k8s version to the cluster k8s version; or
    • support configuring the OrchestratorVersion field on agent pools individually?
  • Does terraform-provider-azurerm v1.x support multiple agent pools for AKS?

@jstevans

A git bisect v1.39.0..v1.40.0 points to 87de8ae as the cause of the issue. The code changes in that commit seem unrelated, so I'd guess it has to do with the azure-sdk container-service version bump.

Yeah this is likely a change in behaviour between the different versions of the Container Service API

@tombuildsstuff I'd like to contribute the PR, but would appreciate some guidance.
Would we rather

  • keep the old behavior for now, of coupling the agent pool k8s version to the cluster k8s version; or
  • support configuring the OrchestratorVersion field on agent pools individually?
  • Does terraform-provider-azurerm v1.x support multiple agent pools for AKS?

๐Ÿ‘ we support multiple node pools via the azurerm_kubernetes_cluster_node_pool resource - however there's a default node pool defined within the azurerm_kubernetes_cluster resource. As such I'm wondering if the behaviour should be:

  • Default Node Pool follows the same version of Kubernetes as the Cluster
  • Other node pools can be any version >= the cluster (which I believe is the API limitation)

Alternatively we can make them all configurable - but since the default node pool hosts system jobs, it's treated a little differently to the other node pools, so this probably wants some testing to confirm which way to go - maybe @jluk can confirm the expected behaviour here?
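
For context, additional node pools beyond the default one are declared through that separate resource, which references the cluster by ID. A minimal sketch, with placeholder names and values:

resource "azurerm_kubernetes_cluster_node_pool" "extra" {
  # hypothetical extra pool attached to the cluster defined above
  name                  = "extra"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3
}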

The ruleset between the control plane and agent pools is defined in this public document. There is a window of config drift you are allowed to have between the control plane and each agent pool.
https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#upgrade-a-cluster-control-plane-with-multiple-node-pools

Rules for valid versions to upgrade node pools:

  • The node pool version must have the same major version as the control plane.
  • The node pool minor version must be within two minor versions of the control plane version.
  • The node pool version cannot be greater than the control plane major.minor.patch version.

For example, a 1.17.x control plane allows node pools running 1.15.x through 1.17.x, but not 1.18.x.

Hope this helps.

Great, thanks @jluk :) @tombuildsstuff:

  • So that I can unblock my team's upgrade to [email protected] while we continue this discussion, can I follow your first option to have default_node_pool.OrchestratorVersion := kubernetes_cluster.KubernetesVersion?

  • For the longer-term, what if kubernetes_cluster.default_node_pool and kubernetes_cluster_node_pool both have an optional string argument orchestrator_version which defaults to kubernetes_cluster.kubernetes_version?

  • Do the version drift rules between control plane vs. agent pool need to be encoded on the client? Is it insufficient to make the REST call and let it fail?

@tombuildsstuff I'm going to work on this tomorrow according to my previous comment - please shout if that's the wrong thing to do 🙂

@jstevans

So that I can unblock my team's upgrade to [email protected] while we continue this discussion, can I follow your first option to have default_node_pool.OrchestratorVersion := kubernetes_cluster.KubernetesVersion?

That sounds fine for now - however the field shouldn't be exposed to users currently and instead wants to remain an internal behaviour, until...

For the longer-term, what if kubernetes_cluster.default_node_pool and kubernetes_cluster_node_pool both have an optional string argument orchestrator_version which defaults to kubernetes_cluster.kubernetes_version?
Do the version drift rules between control plane vs. agent pool need to be encoded on the client? Is it insufficient to make the REST call and let it fail?

Adding properties for kubernetes_version or similar to each node pool seems fine - however we may need to add locks into the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources to account for this. We'd need to test upgrading a major version and the node pools at the same time and confirm what happens; we may be able to get away with locking on the node pool names in each resource rather than the name of the kubernetes_cluster, since that would still allow creating/deleting each node pool in parallel. Realistically we'll only know which thing to lock on (and whether this is even needed) by testing it, however.

Hope that helps :)

Coupling by default could work, as long as it can be disabled. Upgrading control plane and default node pool with one terraform apply could be very impactful to a cluster. I'm commenting to vote on exposing OrchestratorVersion for default node pool as well as the node pool resource.

I would rather have OrchestratorVersion exposed ASAP, even without the proper locking, so that I can control it. My current plan is to use the REST API directly to upgrade node pools: https://docs.microsoft.com/en-us/rest/api/aks/agentpools/createorupdate

I'm available (with Azure resources as well) to test potential patches and I am able to write golang, but I lack terraform internals experience. Let me know how I can help.

Using azurerm provider at version "=2.1.0", I've upgraded an azurerm_kubernetes_cluster resource from 1.14 to 1.15. The Control Plane has seemed to upgrade to 1.15, but the VMSS node pool has stayed behind at 1.14.

What is the current expected behavior for Kubernetes upgrades and their node pools? I understand that (https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#validation-rules-for-upgrades) expresses certain validation conditions.

Will the node pool upgrade when it's obligated to, or can we control this with an input as @jstevans suggested?

I'd like to know this!

@tombuildsstuff 🆙
any info on the above?

This has been released in version 2.14.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 2.14.0"
}
# ... other configuration ...

Hello,
I am having the same issue. I tried provider 2.14.0 and 2.16.0, and even though my cluster upgraded correctly, the node pools are still on the old version (1.15.10)...

Have you set the new orchestrator_version property of the (default) nodepool(s)?
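
For reference, orchestrator_version is set per node pool and is typically kept in step with the cluster's kubernetes_version. A minimal sketch with placeholder values:

resource "azurerm_kubernetes_cluster" "k8s" {
  # ... other configuration ...
  kubernetes_version = "1.16.9"

  default_node_pool {
    name                 = "default"
    vm_size              = "Standard_DS2_v2"
    node_count           = 3
    orchestrator_version = "1.16.9" # upgrade the node pool alongside the control plane
  }
}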

@EPinci Hello, I haven't. Lemme try!
Thanks for the heads up.

@EPinci Hey man, really appreciate the quick comment!
I set orchestrator_version to my AKS version and it worked just fine!
Sorry for the dumb mistake.
Cheers,

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!
