Azure-cli: `az vm create` boots up VM without DNS-servers

Created on 9 Nov 2020  路  8Comments  路  Source: Azure/azure-cli

Describe the bug

Sometime betweeen:
Last working: 3. November 2020, 10:01 UTC
Bug first appeared: 6. November 2020, 5:10 UTC

the following behavior started happening reproducibly and consistently (100% of the time):

When creating a new Ubuntu 20.04 VM from the official azure Image via az vm create (exact command below) in a resource group that is empty except for a NSG, the command will return and the VM will boot but without any DNS-servers passed to it.

The NIC of the new VM is attached to a new VNET, and the new VNET has its DNS servers correctly set to "Standard (provided by Azure)" - but the VM boots to fast or something, because on first startup it has 0 DNS servers configured. A reboot fixes this.

This behavior started with azure-cli 2.14.0-1~bionic but continues today with the latest 2.14.2-1~bionic.

To Reproduce

  1. Create a resource group that is empty except a NSG with the default rules + one that allows you to ssh in
  2. Locally, create an ssh-keypair to log in to the azure VM with
  3. Create a new azure VM like this:
    VMSUFFIX="$(date +%s)_$(openssl rand --hex 3)"
    az vm create \
        --resource-group "myNewResourceGroup" \
        --name "builder_$VMSUFFIX" \
        --image "Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest" \
        --size Standard_D2_v3 \
        --nsg "pre-existing-NSG" \
        --storage-sku Standard_LRS \
        --data-disk-sizes-gb 90 \
        --admin-username packer \
        --authentication-type ssh \
        --ssh-key-values ~/.ssh/azurekey.pub \
        --tags id="$VMSUFFIX"
  1. Connect to the azure-vm via ssh and its public IP
  2. Try to do any DNS-resolving: ping a public host/website, try to do apt update, try to curl something ... it will all fail
  3. Check the DNS status of the interface with: systemd-resolve --interface eth0 --status.
    There will be no DNS servers assigned, thus all DNS queries fail.
  4. After a reboot, the DNS server(s) from Azure are assigned correctly

Expected behavior
Just like up until the problem started between November 3rd and November 6th, make sure the NIC/VNET are ready and asigned a public Azure DNS server themselves and can hand it down to the new VM before booting the new VM.

Environment summary
Install Method: None, AzureCLI is pre-installed in the GitHub Actions Runner VMs I use it from. GitHub Actions image: ubuntu-latest
CLI versions: azure-cli 2.14.0-1~bionic to 2.14.2-1~bionic.
OS Version: ubuntu-latest image for GitHub Actions Runners
Shell type: bash

Additional context
I am not sure what more I could add, please ask if you need to know anything.

Compute

Most helpful comment

I've done some tests, and narrowed down the broken-ness to somewhere between these images:

Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202008140 // confirmed working
Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202009030 // didn't test
Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202009070 // didn't test
Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202009211 // confirmed broken

This is once again using the latest Azure CLI ( 2.14.2-1~bionic ) and running from within the ubuntu-latest GitHub Actions Image.

All 8 comments

hi @qwordy could you pls check if this caused by @zhoxing-ms 's hotfix release? thanks

Hi @qwordy this is the my hotfix commit: https://github.com/Azure/azure-cli/commit/2a84a5b3b6a624d7cfc351a29f4d64e5dfefd839
It should be irrelevant to the question~

This behavior started with azure-cli 2.14.0-1\~bionic but continues today with the latest 2.14.2-1~bionic

In addition, this problem exists in version 2.14.0, but my hotfix version is 2.14.1

I don鈥檛 think this is an az cli issue.

I am not using az cli, but I am using Terraform to deploy the exact same Ubuntu image as @jantari into a VNET that has custom DNS configured.

When a newly created VM in this VNET boots for the first time it has no DNS servers configured within the guest OS, meaning pretty much nothing works. Reboot that VM, and it picks up the VNET configured DNS servers.

I think the issue might be a mixture of how the image is being built, and how the DNS sever IP鈥檚 are being pulled into the VM (by WAAgent?).

Previous VM鈥檚 I鈥檝e deployed have put the VNET configured DNS servers into /etc/resolv.conf. But now systemd-resolved is a thing.

When you first boot this Ubuntu 20.04 VM, resolv.conf is set to point at 127.0.0.53 - as in, send queries to systemd-resolved. If you run systemd-resolve 鈥攕tatus, you find that it has no DNS servers configured to forward the requests onto. After rebooting, /etc/resolv.confremains unchanged but running systemd-resolve 鈥攕tatus now shows all the VNET configured DNS servers.

In other words: the virtual machine (in particular, the OS) is expecting the DNS servers to be set in an appropriate location for systemd-resolved, but on first boot whatever component of the Azure contraption is supposed to do this simply doesn鈥檛, I鈥檓 assuming because of changes to the way all of this is strung together within the guest OS.

I think - but I鈥檓 not certain - this is WAAgents responsibility, so perhaps this needs routing to the appropriate team? In either case, please don鈥檛 just close this issue, this is causing some real pain and is quite clearly a non-insignificant bug given it breaks standard functionality - and unfortunately the workaround of rebooting is not sufficient (because it breaks our post-boot automation).

Oh one other thing, it seems a bit intermittent. Of around 20 destroy and recreate operations on the VM, 2 times it came up with working DNS 馃檭

So this is definitely something wron in the latest 20.04 images, I just tested by redeploying using version 20.04.202007080 and DNS works as expected from first boot.

Thanks for the testing, depending on how the DNS is assigned to the VM (I would have guessed the standard ol' DHCP option, but maybe it's the Azure Agent) this might really not be an AzureCLI problem.

However, it's also unlikely to be a systemd-resolve problem as that's been used in Ubuntu since 18.04 I think, or at the very least it was already used on previous images of 20.04 and those didn't have this problem.

If the problem is not present in earlier versions of the Ubuntu image - I'll try this out myself now - it likely really is a problem with the Azure Agent or something Canonical changed.

Either way this is a pretty big deal imo as it breaks just about any kind of automated deployment one can imagine!

From what I've read the very same problem existed in 18.10 and 18.04 and was fixed last year :( I suspect the change (as in, why it worked in previous versions of the same image) is as you suggest likely to just be a mismatch between how it's expecting to get DNS and how it IS getting DNS.

Indeed, cloud-init poops its pants on boot with no DNS.

I've done some tests, and narrowed down the broken-ness to somewhere between these images:

Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202008140 // confirmed working
Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202009030 // didn't test
Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202009070 // didn't test
Canonical:0001-com-ubuntu-server-focal:20_04-lts:20.04.202009211 // confirmed broken

This is once again using the latest Azure CLI ( 2.14.2-1~bionic ) and running from within the ubuntu-latest GitHub Actions Image.

Was this page helpful?
0 / 5 - 0 ratings