Hello,
I've created a new Azure VM scale set and a new Agent Pool in DevOps. I selected this new Agent Pool in my CD pipeline. The problem is that my job is stuck forever in the state
Waiting for an available agent
All eligible agents are disabled or offline
I have two instances of VM scale set in "Running" state.
One of these instances has thrown an error, but its state is still "Running":
Failed to update virtual machine scale set 'vmscalesetmigra'. Error: VM 'vmscalesetmigra_7' has not reported status for VM agent or extensions. Verify the VM has a running VM agent and that it can establish outbound connections to Azure storage. Please refer to https://aka.ms/vmextensionwindowstroubleshoot for additional VM agent troubleshooting information.
DevOps settings:
I was following the documentation and tried different VM sizes (currently trying B2ms).
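In case it's useful, this is roughly how I've been checking the extension and VM agent status for the failing instance from the CLI (the resource group name below is a placeholder; instance 7 is the one named in the error above):

# Instance view shows whether the guest agent and extensions ever reported status
az vmss get-instance-view --resource-group <my-resource-group> --name vmscalesetmigra --instance-id 7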
What could be the problem?
Thank you in advance for any help.
I am seeing the same issue on my end. No errors, but the VM instance never shows up in my agent pool.
Same issue as well. I set up a pool with a Server2019 image, and when the pool was initiated, it set up the first VM and it was in the running state, but the VM was never added to the pool. Then after scheduling a job, I can see it creating a new VM, but no agents have been added to the pool.
Same issue here using an Ubuntu image (following exactly what this doc says and a few other variations). VMSS instances are healthy but no agents showing up in the pool.
I reported the same issue earlier here but it was closed as a non-documentation issue.
I have the issue as well:
There is an error in the agent logs in the instance.
If you ssh (Linux) or RDP (Win) into the agent instance, one of the log errors is:
[2020-05-01 12:17:12Z ERR Terminal] WRITE ERROR: Access Denied: Microsoft.TeamFoundation.ServiceIdentity;
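If anyone else needs to dig those logs out, this is where I'd look on a Linux instance (these are the default locations; your image may differ):

sudo tail -n 100 /var/log/waagent.log                      # guest agent log
sudo ls /var/log/azure/                                    # per-extension logs live under here
sudo find / -maxdepth 4 -type d -name _diag 2>/dev/null    # Pipelines agent diagnostic logs folder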
@annajanicka Thank you for the question, assigning this to the author for further review.
Please note that for user-defined scenarios (outside the referenced example), you need to request support from one of the following:
Sorry for the trouble. This VMSS agent pool feature suffered an outage. We have a fix rolling out now that should reach everybody by Monday.
I'll keep this issue open now for visibility, and close it once the fix is rolled out.
Sorry for the trouble. This VMSS agent pool feature suffered an outage. We have a fix rolling out now that should reach everybody by Monday.
Fantastic. Will try it first thing!
@WillLennon - any update on this? Tried now (I know it's early on Monday) and it's still the same error / behavior.
Is there another area where this issue is being tracked, so that we can get updates? Thanks!
I'd like to give some more feedback too, just to help out and contribute to the product. It would be nice to have a place, close to development, to do that.
@frankzo I found something here - not sure if it's the right spot
@jeanfrancoislarente Understood; but it would be nice to have some kind of updates on preview features, for feedback and a heads-up.
I am still seeing this error. I created the vmssagentpool yesterday. The instances seem to run for a while then get deleted and new instances are recreated. This is happening continuously. And I do not see any agents in ADO.
The fix rolled out to everybody as of Wednesday May 7th. Is anyone still hitting the issues with agents failing to come online?
@WillLennon it is working again for us! Thanks for the effort. Really love the feature!
@WillLennon yes we are.
We created a scale set that worked fine for a few weeks (except for the fact that the environment variables needed for the Java build were not available inside the job).
Yesterday we noticed that no new agents were being made available to the agent pool. The scale set continuously creates and then destroys new VMs. No agents are available anymore and all the pipelines stall forever. We tried deleting and recreating the agent pool, with no benefit. We also deleted and recreated the entire scale set, with no benefit either.
No clear signs of failures in the azure portal activity log...
@WillLennon we are also seeing this. Please let me know how we can help debug further.
Just trying out this functionality and seeing the same thing on a newly created agent pool and scale set. A job triggers provisioning in the scale set. Once the instance is created and running, no agent registers. The scale set scales down and the cycle repeats. No agent ever registers in DevOps...
@lmvlmv Are you using a firewall? I have noticed this behavior when the scale set extension fails due to a Deny rule on the firewall. Make sure to add the right rules.
No firewall set and no NSG associated with the scale set. Works fine with our scale sets (manually managed) with agents pre-installed.
I am having the same issue as @lmvlmv, with no firewall or NSG associated with the scale set. It's almost as if the agent software isn't being installed and registered back with the associated pool. Is there something on the user end we need to do? I'd really like to try this feature.
I tried again today. Same scale set configuration, same image, same network. And it works now!!! I noticed that the agent is now 2.170.1
Still nope. Took 28 minutes for the scale set instance to reach the running state. Looks like the agent extension may have been retrying. Ultimately failed in the operations log with:
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"VMAgentStatusCommunicationError\",\"message\":\"VM 'POOLNAME-BL-SS_24' has not reported status for VM agent or extensions. Verify the VM has a running VM agent and that it can establish outbound connections to Azure storage. Please refer to https://aka.ms/vmextensionlinuxtroubleshoot for additional VM agent troubleshooting information.\"}]}}",
Hi All, chiming in with my experience so far.
From my testing, I noticed that if you get no agents in the pool despite having a non-zero number of "standby" agents, the extension script that installs the agent has probably failed in some manner which, at the moment, isn't clearly surfaced as an error anywhere in the Azure Portal or Azure DevOps. Extensions in Azure are run under the waagent service on Linux, which is written in and dependent upon Python. The provisioning script for the Azure DevOps agent is run as an Azure VM extension, which requires the waagent service to be installed and in working order.
From what I can tell, it uses /usr/bin/env python to determine the Python binary to run, and in our case we were installing our own Python binary (2.7.15) built from source, which was causing import problems for the waagent service when it tried to run. If you are changing your Python, you might end up breaking waagent; the service will fail but may still appear to be running. On CentOS, check /var/log/messages for a lot of ImportError problems related to waagent trying to restart itself.
We were lucky enough not to need our custom Python at the VM level, as we migrated to a container build that uses it a while ago, so I just took it out of the Ansible roles that run on the machine. After I fixed this, I could see agents appearing in our pool properly after a fresh start from scratch (new scale set, new agent pool, image without the Python change).
Whilst I only experimented slightly, you might be able to get away with editing the PATH environment variable the waagent service starts with if you want to deconflict the paths for it (i.e. you installed a Python under /usr/local/bin which is overriding the OS-level one that would work for it). This can be done by adding the following line under the [Service] block in /usr/lib/systemd/system/waagent.service:
Environment=PATH=/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
However, I believe the agent that is then bootstrapped under waagent inherits the environment from this service, so if you need /usr/local/bin in the future it could be troublesome... but I could be wrong. If someone goes down this path, be sure to comment to confirm yes or no on this.
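If someone does want to try that route, the rough sequence I'd follow is below (CentOS paths as in my setup; I haven't verified this on other distros, so treat it as a sketch):

/usr/bin/env python --version          # check which python waagent would pick up
command -v python
sudo systemctl edit waagent            # creates a drop-in override; add the two lines:
#   [Service]
#   Environment=PATH=/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
sudo systemctl daemon-reload
sudo systemctl restart waagent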
Has there been any update on this? I'm encountering the same issue with Win2019DataCenter and UbuntuLTS images. All the Azure setup works, but once the agent pool is associated with the scale set, the instances are constantly being provisioned and deleted, with only a little bit of time actually "running". They never show up as agents in Azure DevOps.
@kthayer424 Does the scale set have access to the needed URLs? Or is there something in between in the network config? I have seen the same behavior when the extension cannot complete successfully due to network issues.
@frankzo I'm not aware of the set of needed URLs; is there a reference for that? My setup is just following the steps here https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/scale-set-agents?view=azure-devops#create-a-virtual-machine-scale-set-agent-pool and changing the image (and related settings).
For example my script is
az vmss create --name vmssagentspool-scaleset --resource-group vmssagents-rg --image Win2019DataCenter --vm-sku Standard_D2_v3 --storage-sku StandardSSD_LRS --authentication-type Password --instance-count 2 --disable-overprovision --upgrade-policy-mode manual --single-placement-group false --platform-fault-domain-count 1 --load-balancer '""'
taken from that page and slightly modified to fit Windows. Then the agent pool is set up directly from that; there's nothing custom around the networking. If you could provide the URLs I need to test, that would be great.
Do the scale sets need public IPs or something in addition to the create script on that page?
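In the meantime I'm checking basic outbound reachability from inside one of the instances like this (dev.azure.com is the main endpoint the agent needs per the docs; the CDN hostname is taken from the agent download URL, so treat it as an assumption):

# Any HTTP status back (even 4xx) means the host is reachable and not blocked outbound
curl -s -o /dev/null -w "dev.azure.com -> %{http_code}\n" https://dev.azure.com
curl -s -o /dev/null -w "vstsagentpackage.azureedge.net -> %{http_code}\n" https://vstsagentpackage.azureedge.net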
Not sure if it helps anyone else, but I did need to include --public-ip-per-vm in my az vmss create script to get the agents to show up in Azure DevOps. Maybe that's just specific to me. Looks to be working now.
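For reference, that's just the create command from the doc with the extra flag appended (same names as in the earlier example):

az vmss create --name vmssagentspool-scaleset --resource-group vmssagents-rg --image Win2019DataCenter --vm-sku Standard_D2_v3 --storage-sku StandardSSD_LRS --authentication-type Password --instance-count 2 --disable-overprovision --upgrade-policy-mode manual --single-placement-group false --platform-fault-domain-count 1 --load-balancer '""' --public-ip-per-vm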