We are using the TF Azure provider (tested with version 2.17) to create VMSSs used to back a Kubernetes deployment. After bootstrapping the VMSSs, and without any changes to the TF code, running plan sometimes results in the following proposed change:
```
~ storage_account_type = "Premium_LRS" -> "Standard_LRS" # forces replacement
```
This change is for the OS disk; a slew of other related changes follows it.
When examining this further, it appears that the initial apply can result in an inconsistent configuration of storage_account_type, where some VMSSs are created with Premium_LRS while others adhere to the Standard_LRS specified in the code.
The following snippet creates the scale sets:
resource "azurerm_linux_virtual_machine_scale_set" "dp_master_scale" {
count = length(var.zones)
name = "${var.cluster_name}-master-${var.zones[count.index]}-scale"
location = azurerm_resource_group.k8s.location
resource_group_name = azurerm_resource_group.k8s.name
zones = ["${var.zones[count.index]}"]
sku = var.master_instance_type
instances = 0
identity {
type = "SystemAssigned, UserAssigned"
identity_ids = [azurerm_user_assigned_identity.master_id.id]
}
lifecycle {
ignore_changes = [
instances,
]
}
source_image_id = var.here_dp_image_id
os_disk {
storage_account_type = "Standard_LRS"
caching = "ReadWrite"
}
data_disk {
lun = 0
caching = "ReadWrite"
disk_size_gb = 50
storage_account_type = "Standard_LRS"
}
Additionally, we have seen this appear with a 'null image reference', as well as plans proposing the data disks as new '+' additions. So it appears that there is a partial application or failure when creating the VMSS and configuring its storage profile.
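As a side note, what Terraform recorded at apply time can be compared against the Azure-side exports below using terraform state show (resource address taken from the snippet above):

```sh
# Show the attributes Terraform holds in state for the first zone's scale set
terraform state show 'azurerm_linux_virtual_machine_scale_set.dp_master_scale[0]'
```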
Azure template exports for an affected VMSS show the following:
"storageProfile": {
"osDisk": {
"createOption": "FromImage",
"caching": "ReadWrite",
"managedDisk": {
"storageAccountType": "Premium_LRS"
},
"diskSizeGB": 30
},
"imageReference": {
"id": "[concat(parameters('galleries_dpimages_externalid'), '/images/privatename/versions/privateversion')]"
}
},
as well as:

```json
"osDisk": {
  "osType": "Linux",
  "name": "some-private-prefix-eOS__1_dasdf43488347239289284",
  "createOption": "FromImage",
  "caching": "ReadWrite",
  "writeAcceleratorEnabled": false,
  "managedDisk": {
    "storageAccountType": "Standard_LRS",
    "id": "[parameters('disks_thesameprivateprefix_eOS__1_thesameuuidhashlikeasabove_externalid')]"
  },
  "diskSizeGB": 30
},
```
Note how the same disk appears with a different storage account type in the storageProfile vs. the osDisk serialization.
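For anyone wanting to check their own scale sets for this drift, the storage account type Azure currently reports can be queried directly with the az CLI (resource group and VMSS name below are placeholders):

```sh
# Print the storage account type Azure reports for the VMSS OS disk
az vmss show \
  --resource-group <resource-group> \
  --name <vmss-name> \
  --query "virtualMachineProfile.storageProfile.osDisk.managedDisk.storageAccountType"
```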
I think this is similar/related to other previously reported issues.
The impact is severe: the proposed replacement destroys and recreates the scale sets backing our Kubernetes deployment.
Please let me know if you have any further questions, suggestions for collecting more information, or workarounds to try.
@cloudcomplex thanks for opening this issue.
Based on the configuration above I'm unable to spot anything which would cause this issue; as such, would you be able to provide a reproducible Terraform configuration (including all of the values necessary to test this)? Could you also confirm whether there's anything in the Kubernetes cluster which would be changing these disks from Standard to Premium, and that the image being provisioned from is based on a Standard disk?
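For reference, one way to check what the gallery image version reports for its source disk (gallery and image names below are placeholders) would be:

```sh
# Inspect the storage profile recorded on the Shared Image Gallery version
az sig image-version show \
  --resource-group <resource-group> \
  --gallery-name <gallery> \
  --gallery-image-definition <image-definition> \
  --gallery-image-version <version> \
  --query "storageProfile"
```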
Thanks!
Hello,
Thank you for the comment. Below is the rest of the config for this scale set, with the caveat that it is still not complete. This is a fairly large and complex stack, so it may not be reproducible from this alone, but it is the best I can do right now:
```tf
  network_interface {
    name    = "${var.cluster_name}-worker-${var.zones[count.index]}-network-interface"
    primary = true

    ip_configuration {
      name      = "${var.cluster_name}-worker-${var.zones[count.index]}-ip-config"
      primary   = true
      subnet_id = azurerm_subnet.k8s_subnet[count.index].id

      application_gateway_backend_address_pool_ids = [
        "${azurerm_application_gateway.resource_name_private.id}/backendAddressPools/workers",
      ]

      application_security_group_ids = [
        azurerm_application_security_group.worker_asg.id,
        azurerm_application_security_group.k8s_asg.id,
        azurerm_application_security_group.vnet_asg.id,
      ]

      load_balancer_backend_address_pool_ids = [
        azurerm_lb_backend_address_pool.public_worker_backend_pool.id,
        azurerm_lb_backend_address_pool.private_worker_backend_pool.id,
      ]
    }
  }

  tags = {
    Name                       = var.cluster_fullname
    cluster                    = var.cluster_fullname
    env                        = var.environment
    min                        = var.worker_asg_min
    max                        = var.worker_asg_max
    cluster-autoscaler-enabled = "true"
  }

  depends_on = [
    azurerm_application_gateway.resource_name_private,
    module.module_name.some_resource_name_function_name,
  ]
}

resource "azurerm_virtual_machine_scale_set_extension" "worker-extension" {
  count                        = length(var.zones)
  name                         = "HealthExtension-${count.index}"
  virtual_machine_scale_set_id = azurerm_linux_virtual_machine_scale_set.dp_worker_default_scale_set[count.index].id
  publisher                    = "Microsoft.ManagedServices"
  type                         = "ApplicationHealthLinux"
  type_handler_version         = "1.0"
  auto_upgrade_minor_version   = true
  settings                     = "{\"port\": 8188, \"protocol\": \"http\", \"requestPath\": \"/\"}"
}
```
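As an aside, the escaped JSON string in settings can also be written with Terraform's built-in jsonencode(), which avoids the manual escaping; an equivalent sketch:

```tf
settings = jsonencode({
  port        = 8188
  protocol    = "http"
  requestPath = "/"
})
```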
We are upgrading to provider version 2.25.0. This issue is intermittent, and we have recently switched to StandardSSD_LRS as well. We have not seen it in the last week or so (the stack gets exercised multiple times a day).
I was speculating that this may be related to some storage_account_type or image-reference issue, as I had seen some of those errors co-occur when the disk configuration was botched. I understood there was some bug on the Azure API side of things.
I also looked through the provider code: the test cases around the data disks were a bit thin, and the OS disk had better, but still thin, coverage. Is this something I could help with?
@tombuildsstuff adding my consistent reproduction of this issue here.
Workflow to repro:
Example:
provider "azurerm" {
features {}
}
resource "azurerm_resource_group" "example" {
name = "example-resources"
location = "West US2"
}
resource "azurerm_virtual_network" "example" {
name = "example-network"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
address_space = ["10.0.0.0/16"]
}
resource "azurerm_subnet" "internal" {
name = "internal"
resource_group_name = azurerm_resource_group.example.name
virtual_network_name = azurerm_virtual_network.example.name
address_prefixes = ["10.0.2.0/24"]
}
resource "azurerm_linux_virtual_machine_scale_set" "example" {
name = "example-vmss"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
sku = "Standard_F2s_v2"
instances = 1
admin_username = "adminuser"
admin_ssh_key {
username = "adminuser"
public_key = file("~/.ssh/id_rsa.pub")
}
source_image_id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/forimage_group/providers/Microsoft.Compute/galleries/sig/images/test/versions/1.0.0"
os_disk {
storage_account_type = "Standard_LRS"
caching = "ReadWrite"
disk_size_gb = 100
}
network_interface {
name = "example"
primary = true
ip_configuration {
name = "internal"
primary = true
subnet_id = azurerm_subnet.internal.id
}
}
}
````
Run terraform plan / terraform apply. This applies the right resources.
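The commands, as reflected in the plan output further below:

```sh
terraform plan -out plan.out
terraform apply "plan.out"
```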
Then update the Terraform config, bumping the image version:
```tf
provider "azurerm" {
features {}
}
resource "azurerm_resource_group" "example" {
name = "example-resources"
location = "West US2"
}
resource "azurerm_virtual_network" "example" {
name = "example-network"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
address_space = ["10.0.0.0/16"]
}
resource "azurerm_subnet" "internal" {
name = "internal"
resource_group_name = azurerm_resource_group.example.name
virtual_network_name = azurerm_virtual_network.example.name
address_prefixes = ["10.0.2.0/24"]
}
resource "azurerm_linux_virtual_machine_scale_set" "example" {
name = "example-vmss"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
sku = "Standard_F2s_v2"
instances = 1
admin_username = "adminuser"
admin_ssh_key {
username = "adminuser"
public_key = file("~/.ssh/id_rsa.pub")
}
source_image_id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/forimage_group/providers/Microsoft.Compute/galleries/sig/images/test/versions/1.0.1"
os_disk {
storage_account_type = "Standard_LRS"
caching = "ReadWrite"
disk_size_gb = 100
}
network_interface {
name = "example"
primary = true
ip_configuration {
name = "internal"
primary = true
subnet_id = azurerm_subnet.internal.id
}
}
}
The terraform plan for this only proposes updating the image ID, not the disk SKU. Terraform plan output below:
```
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
azurerm_resource_group.example: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources]
azurerm_virtual_network.example: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Network/virtualNetworks/example-network]
azurerm_subnet.internal: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Network/virtualNetworks/example-network/subnets/internal]
azurerm_linux_virtual_machine_scale_set.example: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Compute/virtualMachineScaleSets/example-vmss]
------------------------------------------------------------------------
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
# azurerm_linux_virtual_machine_scale_set.example will be updated in-place
~ resource "azurerm_linux_virtual_machine_scale_set" "example" {
admin_username = "adminuser"
computer_name_prefix = "example-vmss"
disable_password_authentication = true
do_not_run_extensions_on_overprovisioned_machines = false
encryption_at_host_enabled = false
id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Compute/virtualMachineScaleSets/example-vmss"
instances = 1
location = "westus2"
max_bid_price = -1
name = "example-vmss"
overprovision = true
priority = "Regular"
provision_vm_agent = true
resource_group_name = "example-resources"
scale_in_policy = "Default"
single_placement_group = true
sku = "Standard_F2s_v2"
~ source_image_id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/forimage_group/providers/Microsoft.Compute/galleries/sig/images/test/versions/1.0.0" -> "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/forimage_group/providers/Microsoft.Compute/galleries/sig/images/test/versions/1.0.1"
tags = {}
unique_id = "f1692efe-996e-4d04-95ac-de405ebd638e"
upgrade_mode = "Manual"
zone_balance = false
zones = []
admin_ssh_key {
public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAslS5LnoCJlj8OE4VncUK2iP6YhVT/RmeNkvP3VTd/GbiZd384wrD0rzr3MwEgMm4ZkjUQno54x+bpRhIFDha4Kj89cs7LwuPHZSkXLF+aVydxy2nu464TmflnhVVW71wLE9E3bCUxmh5+IZ3sJ8is2XQMuC1IHiIoEMFc+buMTG+kVc3f+VaJ5ZT+bFPjqs816YBPTSZRmUjzfwRcLIRXvlVxlFsMckhSTa7xCCxunsGKITOnqmlk/vIWr/bKfev6RD+qV8DFquM0zxquwcSv5ERXE384m6ESJ/YJ4IN5P14CDWT3pdZtwM1jOaL/zPyMHbamk5iTPLfuPao740plQ=="
username = "adminuser"
}
automatic_instance_repair {
enabled = false
grace_period = "PT30M"
}
network_interface {
dns_servers = []
enable_accelerated_networking = false
enable_ip_forwarding = false
name = "example"
primary = true
ip_configuration {
application_gateway_backend_address_pool_ids = []
application_security_group_ids = []
load_balancer_backend_address_pool_ids = []
load_balancer_inbound_nat_rules_ids = []
name = "internal"
primary = true
subnet_id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Network/virtualNetworks/example-network/subnets/internal"
version = "IPv4"
}
}
os_disk {
caching = "ReadWrite"
disk_size_gb = 100
storage_account_type = "Standard_LRS"
write_accelerator_enabled = false
}
}
# azurerm_resource_group.example will be updated in-place
~ resource "azurerm_resource_group" "example" {
id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources"
location = "westus2"
name = "example-resources"
~ tags = {
- "CreatedOnDate" = "2020-10-28T21:53:27.0019093Z" -> null
- "deleteByDate" = "11/4/2020 9:53:27 PM" -> null
}
}
Plan: 0 to add, 2 to change, 0 to destroy.
------------------------------------------------------------------------
This plan was saved to: plan.out
To perform exactly these actions, run the following command to apply:
terraform apply "plan.out"
```
After applying this, the scale set OS disk ends up as Premium_LRS. A new terraform plan wants to change it back to Standard_LRS:
```
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
azurerm_resource_group.example: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources]
azurerm_virtual_network.example: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Network/virtualNetworks/example-network]
azurerm_subnet.internal: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Network/virtualNetworks/example-network/subnets/internal]
azurerm_linux_virtual_machine_scale_set.example: Refreshing state... [id=/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Compute/virtualMachineScaleSets/example-vmss]
------------------------------------------------------------------------
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
-/+ destroy and then create replacement
Terraform will perform the following actions:
# azurerm_linux_virtual_machine_scale_set.example must be replaced
-/+ resource "azurerm_linux_virtual_machine_scale_set" "example" {
admin_username = "adminuser"
~ computer_name_prefix = "example-vmss" -> (known after apply)
disable_password_authentication = true
do_not_run_extensions_on_overprovisioned_machines = false
- encryption_at_host_enabled = false -> null
~ id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Compute/virtualMachineScaleSets/example-vmss" -> (known after apply)
instances = 1
location = "westus2"
max_bid_price = -1
name = "example-vmss"
overprovision = true
priority = "Regular"
provision_vm_agent = true
resource_group_name = "example-resources"
scale_in_policy = "Default"
single_placement_group = true
sku = "Standard_F2s_v2"
source_image_id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/forimage_group/providers/Microsoft.Compute/galleries/sig/images/test/versions/1.0.1"
- tags = {} -> null
~ unique_id = "f1692efe-996e-4d04-95ac-de405ebd638e" -> (known after apply)
upgrade_mode = "Manual"
zone_balance = false
- zones = [] -> null
admin_ssh_key {
public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAslS5LnoCJlj8OE4VncUK2iP6YhVT/RmeNkvP3VTd/GbiZd384wrD0rzr3MwEgMm4ZkjUQno54x+bpRhIFDha4Kj89cs7LwuPHZSkXLF+aVydxy2nu464TmflnhVVW71wLE9E3bCUxmh5+IZ3sJ8is2XQMuC1IHiIoEMFc+buMTG+kVc3f+VaJ5ZT+bFPjqs816YBPTSZRmUjzfwRcLIRXvlVxlFsMckhSTa7xCCxunsGKITOnqmlk/vIWr/bKfev6RD+qV8DFquM0zxquwcSv5ERXE384m6ESJ/YJ4IN5P14CDWT3pdZtwM1jOaL/zPyMHbamk5iTPLfuPao740plQ=="
username = "adminuser"
}
~ automatic_instance_repair {
~ enabled = false -> (known after apply)
~ grace_period = "PT30M" -> (known after apply)
}
+ extension {
+ auto_upgrade_minor_version = (known after apply)
+ force_update_tag = (known after apply)
+ name = (known after apply)
+ protected_settings = (sensitive value)
+ provision_after_extensions = (known after apply)
+ publisher = (known after apply)
+ settings = (known after apply)
+ type = (known after apply)
+ type_handler_version = (known after apply)
}
~ network_interface {
- dns_servers = [] -> null
enable_accelerated_networking = false
enable_ip_forwarding = false
name = "example"
primary = true
~ ip_configuration {
- application_gateway_backend_address_pool_ids = [] -> null
- application_security_group_ids = [] -> null
- load_balancer_backend_address_pool_ids = [] -> null
- load_balancer_inbound_nat_rules_ids = [] -> null
name = "internal"
primary = true
subnet_id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources/providers/Microsoft.Network/virtualNetworks/example-network/subnets/internal"
version = "IPv4"
}
}
~ os_disk {
caching = "ReadWrite"
disk_size_gb = 100
~ storage_account_type = "Premium_LRS" -> "Standard_LRS" # forces replacement
write_accelerator_enabled = false
}
+ terminate_notification {
+ enabled = (known after apply)
+ timeout = (known after apply)
}
}
# azurerm_resource_group.example will be updated in-place
~ resource "azurerm_resource_group" "example" {
id = "/subscriptions/d19dddf3-9520-4226-a313-ae8ee08675e5/resourceGroups/example-resources"
location = "westus2"
name = "example-resources"
~ tags = {
- "CreatedOnDate" = "2020-10-28T22:07:13.6413687Z" -> null
- "deleteByDate" = "11/4/2020 10:07:13 PM" -> null
}
}
Plan: 1 to add, 1 to change, 1 to destroy.
------------------------------------------------------------------------
This plan was saved to: plan.out
To perform exactly these actions, run the following command to apply:
terraform apply "plan.out"
```
Confirmed that if you ignore changes on the image id by adding this to the vmss resource:
```tf
lifecycle {
  ignore_changes = [source_image_id]
}
```
Then use the az CLI to set the image ID:
```sh
az vmss update --name example-vmss \
  --resource-group example-resources \
  --set \
  virtualMachineProfile.storageProfile.imageReference.id={{ image2 }}
```
Nothing is destroyed.
A temporary workaround is to ignore storage changes like:
```tf
lifecycle {
  ignore_changes = [os_disk[0].storage_account_type]
}
```
I'm seeing evidence that the disk size is being changed too.
UPDATE: Confirmed. disk_size_gb changes along with storage_account_type.
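If so, the temporary workaround above would presumably need to cover both attributes, e.g.:

```tf
lifecycle {
  ignore_changes = [
    os_disk[0].storage_account_type,
    os_disk[0].disk_size_gb,
  ]
}
```

Note that this also masks genuine changes to those attributes until the lifecycle block is removed.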