Azure-pipelines-agent: Run once agent sometimes accepts second job

Created on 2 Jul 2019 · 22 comments · Source: microsoft/azure-pipelines-agent

Agent Version and Platform

Agent: 2.153.2
OS: Windows Server core 2019

Azure DevOps Type and Version

dev.azure.com
I can provide the organisation name if required.

What's not working?

I have been creating containers that recycle themselves by using the --once flag that was added to the agent in the last few months.
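Our entrypoint is roughly the following (a sketch, shown as the Linux shell variant for brevity; on Windows Server Core the equivalent scripts are config.cmd and run.cmd, and the AZP_* environment variables are placeholders for our actual values):

    #!/bin/bash
    # Register the agent, run exactly one job, then exit so the
    # container dies and the orchestrator replaces it.
    ./config.sh --unattended \
      --url "$AZP_URL" \
      --auth pat --token "$AZP_TOKEN" \
      --pool "$AZP_POOL" \
      --agent "$(hostname)" \
      --replace

    # --once tells the listener to exit after completing a single job
    ./run.sh --once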

I have noticed that when there are jobs waiting in the queue for an available agent, the agent can be assigned a second job whilst it's in the process of shutting down, which means the second build is then cancelled.

Steps

1) Create an agent pool with run-once agents, let's say 3
2) Queue 6 jobs against the pool (this can be scripted; see the sketch after these steps)
3) Once an agent has finished a job, another job will be assigned whilst the agent is in the process of shutting itself down
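Step 2 can be scripted with the Azure DevOps CLI, for example (a sketch; it assumes the azure-devops extension is installed, organisation/project defaults are set via az devops configure, and "MyPipeline" is a placeholder pipeline name):

    # Queue 6 runs back to back to build up queue pressure
    for i in $(seq 1 6); do
      az pipelines run --name "MyPipeline"
    done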

In our setup, when the agent dies, the container dies with it, and docker swarm will spin up a new agent container after ~1 minute as it realises it is running n-1 replicas.
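The swarm side of this looks something like the following (a sketch; service and image names are placeholders):

    # Keep 3 agent replicas alive: when a run-once agent exits after
    # its single job, swarm sees n-1 replicas and starts a new one.
    docker service create \
      --name build-agents \
      --replicas 3 \
      --restart-condition any \
      myregistry/azp-agent:latest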

I'm not sure if this is actually a server issue, in the sense that the server doesn't know the agent is running with the --once flag and therefore assigns the next queued job to it once the first job has completed.

Most helpful comment

Re-opening until the new agent version with this fix has been released and rolled out.

All 22 comments

I have a customer who has the same issue. It seemed to work initially and then started accepting a second job. And, same as above, the second job fails because the agent is shutting down and the build is cancelled.

Any help on this would be great.

Hello @TingluoHuang can you please have a look into this?

Is it possible to try it with a newer version of the agent? I set up a test case and it seemed to work as expected in my dev environment.

I can confirm this behavior.
Checked with Agent version 2.155.1
This is what happened when queueing multiple builds in quick succession:

Scanning for tool capabilities.
Connecting to the server.
2019-09-03 19:23:46Z: Listening for Jobs
2019-09-03 19:23:56Z: Running job: Job
2019-09-03 19:24:05Z: Job Job completed with result: Succeeded
Removing agent from the server
Connecting to server ...
Failed: Removing agent from the server
Agent "TestAgent01" is running a job for pool "AZDO-Agent-Test"

I use run.sh to start the agent

It also behaves like this when starting with: ./bin/Agent.Listener run
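In both cases the single-job behaviour comes from passing the --once flag, i.e.:

    # via the wrapper script
    ./run.sh --once

    # or invoking the listener directly
    ./bin/Agent.Listener run --once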

That's very annoying behaviour. Is there a timeline for when it will be fixed?

I have a theory about the cause and will work with @vtbassmatt on scheduling a fix, as I think this will need an update to the server code.

Would like to chime in as well: we are experiencing this issue too. We have a pool of agents running as containers with the --once flag. It seems to happen when there is a backlog of queued jobs waiting on agents. Here is the error:

The agent: ED9603FB54F1 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

@croomshine that matches what @jtpetty found. We're hoping to address it shortly -- it's slightly more complex than your average bugfix.

How does the agent work in the hosted agent pool in AzDO? Those agents are also shut down after being used once. How is this achieved?

@jtpetty and @vtbassmatt any news on this?

@kirkone sorry for the delay. I'm hoping to get this scheduled soon; it's been a victim of higher priorities for the past few weeks.

@vtbassmatt thanks for the update.

It would be great to have this issue fixed, as we are trying to implement the concept of ephemeral agents using the --once option.
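Concretely, by ephemeral we mean the lifecycle sketched earlier in the thread: configure, run a single job, then unregister (a sketch; the token variable is a placeholder):

    # Run exactly one job, then remove the registration so the
    # pool does not accumulate offline agents.
    ./run.sh --once
    ./config.sh remove --unattended --auth pat --token "$AZP_TOKEN"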

We hear you and are working towards a fix. In the meantime, the feature works most of the time; the race condition is only encountered when there is a lot of queued-job pressure.

Any further update on this?

Fixed by #2728

Re-opening until the new agent version with this fix has been released and rolled out.

@alex-peck I think the new version is released, so this can be closed.

Thanks all for getting this sorted.

Is there something we have to do to get this new version pushed to our Azure DevOps instance? I just ran into this issue this morning with version 2.165.2.

I don't think so; that agent would be new enough, and the server-side change should have rolled out everywhere. @alex-peck, am I right about that?

Yes. It should not be possible for this to happen now. Can you send me the build link where you see this happening?

Looks like this is a false alarm. Appreciate the quick turnaround.
