Hello,
I've created a new Azure VM scale set and a new Agent Pool in DevOps. I selected this new Agent Pool in my CD pipeline. The problem is that my job is stuck forever in the state
Waiting for an available agent
All eligible agents are disabled or offline
I have two instances of VM scale set in "Running" state.
One of these instances has thrown an error, but its state is still "Running":
Failed to update virtual machine scale set 'vmscalesetmigra'. Error: VM 'vmscalesetmigra_7' has not reported status for VM agent or extensions. Verify the VM has a running VM agent and that it can establish outbound connections to Azure storage. Please refer to https://aka.ms/vmextensionwindowstroubleshoot for additional VM agent troubleshooting information.
DevOps settings:
I was following the documentation and tried different VM sizes (currently trying B2ms).
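In case it's useful, this is roughly how I've been checking the extension and VM agent status for the failing instance from the CLI (the resource group name below is a placeholder; instance 7 is the one named in the error above):

# Instance view shows whether the guest agent and extensions ever reported status
az vmss get-instance-view --resource-group <my-resource-group> --name vmscalesetmigra --instance-id 7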
What could be the problem?
Thank you in advance for any help.
I am seeing the same issue on my end. No errors, but the VM instance never shows up in my agent pool.
Same issue as well. I set up a pool with a Server2019 image, and when the pool was initiated, it set up the first VM and it was in the running state, but the VM was never added to the pool. Then after scheduling a job, I can see it creating a new VM, but no agents have been added to the pool.
Same issue here using an Ubuntu image (following exactly what this doc says and a few other variations). VMSS instances are healthy but no agents showing up in the pool.
I reported the same issue earlier here but it was closed as a non-documentation issue.
I have the issue as well:
There is an error in the agent logs in the instance.
If you ssh (Linux) or RDP (Win) into the agent instance, one of the log errors is:
[2020-05-01 12:17:12Z ERR Terminal] WRITE ERROR: Access Denied: Microsoft.TeamFoundation.ServiceIdentity;
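If anyone else needs to dig those logs out, this is where I'd look on a Linux instance (these are the default locations; your image may differ):

sudo tail -n 100 /var/log/waagent.log                      # guest agent log
sudo ls /var/log/azure/                                    # per-extension logs live under here
sudo find / -maxdepth 4 -type d -name _diag 2>/dev/null    # Pipelines agent diagnostic logs folder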
@annajanicka Thank you for the question, assigning this to the author for further review.
Please note that for user-defined scenarios (outside the referenced example), you need to request support from one of the following:
Sorry for the trouble. This VMSS agent pool feature suffered an outage. We have a fix rolling out now that should reach everybody by Monday.
I'll keep this issue open now for visibility, and close it once the fix is rolled out.
Sorry for the trouble. This VMSS agent pool feature suffered an outage. We have a fix rolling out now that should reach everybody by Monday.
Fantastic. Will try it first thing!
@WillLennon - any update on this? Tried now (I know it's early on Monday) and it's still the same error / behavior.
Is there another area where this issue is being tracked, so that we can get updates? Thanks!
I'd like to give some more feedback too, just to help out and contribute to the product. It would be nice to have a place, close to development, to do that.
@frankzo I found something here - not sure if it's the right spot
@jeanfrancoislarente Understood; but it would be nice to have some kind of updates on preview features, for feedback and a heads-up.
I am still seeing this error. I created the vmssagentpool yesterday. The instances seem to run for a while then get deleted and new instances are recreated. This is happening continuously. And I do not see any agents in ADO.
The fix rolled out to everybody as of Wednesday May 7th. Is anyone still hitting the issues with agents failing to come online?
@WillLennon it is working again for us! Thanks for the effort. Really love the feature!
@WillLennon yes we are.
We created a scale set that worked fine for a few weeks (except for the fact that the environment variables needed for the Java build were not available inside the job).
Yesterday we noticed that no new agents were being made available to the agent pool. The scale set continuously creates and then destroys new VMs. No agents are available anymore and all the pipelines stall forever. We tried deleting and recreating the agent pool, with no benefit. We also deleted and recreated the entire scale set, with no benefit either.
No clear signs of failures in the azure portal activity log...
@WillLennon we are also seeing this. Please let me know how we can help debug further.
Just trying out this functionality and seeing the same thing on a newly created agent pool and scale set. A job triggers provisioning in the scale set. Once the instance is created and running, no agent registers. The scale set scales down and the cycle repeats. No agent ever registers in DevOps...
@lmvlmv Are you using a firewall? I have noticed this behavior when the scale set extension fails due to a Deny rule on the firewall. Make sure to add the right rules.
No firewall set and no NSG associated with the scale set. Works fine with our scale sets (manually managed) with agents pre-installed.
I am having the same issue as @lmvlmv, with no firewall or NSG associated with the scale set. It's almost as if the agent software isn't being installed and registered back with the associated pool. Is there something on the user end we need to do? I'd really like to try this feature.
I tried again today. Same scale set configuration, same image, same network. And it works now!!! I noticed that the agent is now 2.170.1
Still nope. Took 28 minutes for the scale set instance to reach the running state. Looks like the agent extension may have been retrying. Ultimately failed in the operations log with:
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"VMAgentStatusCommunicationError\",\"message\":\"VM 'POOLNAME-BL-SS_24' has not reported status for VM agent or extensions. Verify the VM has a running VM agent and that it can establish outbound connections to Azure storage. Please refer to https://aka.ms/vmextensionlinuxtroubleshoot for additional VM agent troubleshooting information.\"}]}}",
Hi All, chiming in with my experience so far.
From my testing, I noticed that if you get no agents in the pool despite having a non-zero number of "standby" agents, the extension script that installs the agent has probably failed in some manner which, at the moment, isn't clearly surfaced as an error anywhere in the Azure Portal or Azure DevOps. Extensions in Azure are run under the waagent service on Linux, which is written in and dependent upon Python. The provisioning script for the Azure DevOps agent is run as an Azure VM extension, which requires the waagent service to be installed and in working order.
From what I can tell, it uses /usr/bin/env python to determine the Python binary to run, and in our case we were installing our own Python binary (2.7.15) built from source, which was causing import problems for the waagent service when it tried to run. If you are changing your Python, you might end up breaking waagent; the service will fail but may still appear to be running. On CentOS, check /var/log/messages for a lot of ImportError problems related to waagent trying to restart itself.
We were lucky enough not to need our custom Python at the VM level, as we migrated to a container build that uses it a while ago, so I just took it out of the Ansible roles that run on the machine. After I fixed this, I could see agents appearing in our pool properly after a fresh start from scratch (new scale set, new agent pool, image without the Python change).
Whilst I only experimented slightly, you might be able to get away with editing the PATH environment variable the waagent service starts with if you want to deconflict the paths for it (i.e. you installed a Python under /usr/local/bin which is overriding the OS-level one that would work for it). This can be done by adding the following line under the [Service] block in /usr/lib/systemd/system/waagent.service:
Environment=PATH=/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
However, I believe the agent that is then bootstrapped under waagent inherits the environment from this service, so if you need /usr/local/bin in the future it could be troublesome... but I could be wrong. If someone goes down this path, be sure to comment to confirm yes or no on this.
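If someone does want to try that route, the rough sequence I'd follow is below (CentOS paths as in my setup; I haven't verified this on other distros, so treat it as a sketch):

/usr/bin/env python --version          # check which python waagent would pick up
command -v python
sudo systemctl edit waagent            # creates a drop-in override; add the two lines:
#   [Service]
#   Environment=PATH=/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
sudo systemctl daemon-reload
sudo systemctl restart waagent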
Has there been any update on this? I'm encountering the same issue with Win2019DataCenter and UbuntuLTS images. All the Azure setup works, but once the agent pool is associated with the scale set, the instances are constantly being provisioned and deleted, with only a little bit of time actually "running". They never show up as agents in Azure DevOps.
@kthayer424 Does the scale set have access to the needed URLs? Or is there something in between in the network config? I have seen the same behavior when the extension cannot complete successfully due to network issues.
@frankzo I'm not aware of the set of needed URLs; is there a reference for that? My setup is just following the steps here https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/scale-set-agents?view=azure-devops#create-a-virtual-machine-scale-set-agent-pool and changing the image (and related settings).
For example my script is
az vmss create --name vmssagentspool-scaleset --resource-group vmssagents-rg --image Win2019DataCenter --vm-sku Standard_D2_v3 --storage-sku StandardSSD_LRS --authentication-type Password --instance-count 2 --disable-overprovision --upgrade-policy-mode manual --single-placement-group false --platform-fault-domain-count 1 --load-balancer '""'
taken from that page and slightly modified to fit Windows. Then the agent pool is set up directly from that; there's nothing custom around the networking. If you could provide the URLs I need to test, that would be great.
Do the scale sets need public IPs or something in addition to the create script on that page?
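In the meantime I'm checking basic outbound reachability from inside one of the instances like this (dev.azure.com is the main endpoint the agent needs per the docs; the CDN hostname is taken from the agent download URL, so treat it as an assumption):

# Any HTTP status back (even 4xx) means the host is reachable and not blocked outbound
curl -s -o /dev/null -w "dev.azure.com -> %{http_code}\n" https://dev.azure.com
curl -s -o /dev/null -w "vstsagentpackage.azureedge.net -> %{http_code}\n" https://vstsagentpackage.azureedge.net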
Not sure if it helps anyone else, but I did need to include --public-ip-per-vm in my az vmss create script to get the agents to show up in Azure DevOps. Maybe that's just specific to me. Looks to be working now.
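For reference, that's just the create command from the doc with the extra flag appended (same names as in the earlier example):

az vmss create --name vmssagentspool-scaleset --resource-group vmssagents-rg --image Win2019DataCenter --vm-sku Standard_D2_v3 --storage-sku StandardSSD_LRS --authentication-type Password --instance-count 2 --disable-overprovision --upgrade-policy-mode manual --single-placement-group false --platform-fault-domain-count 1 --load-balancer '""' --public-ip-per-vm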