Azure-pipelines-agent: Run once agent sometimes accepts second job

Created on 2 Jul 2019 · 22 comments · Source: microsoft/azure-pipelines-agent

Agent Version and Platform

Agent: 2.153.2
OS: Windows Server core 2019

Azure DevOps Type and Version

dev.azure.com
I can provide the organisation name if required.

What's not working?

I have been creating containers that recycle themselves by using the --once flag that was added to the agent in the last few months.
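Our entrypoint is roughly the following (a sketch, shown as the Linux shell variant for brevity; on Windows Server Core the equivalent scripts are config.cmd and run.cmd, and the AZP_* environment variables are placeholders for our actual values):

    #!/bin/bash
    # Register the agent, run exactly one job, then exit so the
    # container dies and the orchestrator replaces it.
    ./config.sh --unattended \
      --url "$AZP_URL" \
      --auth pat --token "$AZP_TOKEN" \
      --pool "$AZP_POOL" \
      --agent "$(hostname)" \
      --replace

    # --once tells the listener to exit after completing a single job
    ./run.sh --once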

I have noticed that when there are jobs waiting in the queue for an available agent, the agent can be assigned a second job whilst it's in the process of shutting down, which means the second build is then cancelled.

Steps

1) Create an agent pool with run-once agents, let's say 3
2) Queue 6 jobs against the pool (this can be scripted; see the sketch after these steps)
3) Once an agent has finished a job, another job will be assigned whilst the agent is in the process of shutting itself down
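Step 2 can be scripted with the Azure DevOps CLI, for example (a sketch; it assumes the azure-devops extension is installed, organisation/project defaults are set via az devops configure, and "MyPipeline" is a placeholder pipeline name):

    # Queue 6 runs back to back to build up queue pressure
    for i in $(seq 1 6); do
      az pipelines run --name "MyPipeline"
    done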

In our setup, when the agent dies, the container dies with it, and docker swarm will spin up a new agent container after ~1 minute as it realises it is running n-1 replicas.
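The swarm side of this looks something like the following (a sketch; service and image names are placeholders):

    # Keep 3 agent replicas alive: when a run-once agent exits after
    # its single job, swarm sees n-1 replicas and starts a new one.
    docker service create \
      --name build-agents \
      --replicas 3 \
      --restart-condition any \
      myregistry/azp-agent:latest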

I'm not sure if this is actually a server issue, in the sense that the server doesn't know the agent is running with the --once flag and therefore assigns the next queued job to it once the first job has completed.

Most helpful comment

Re-opening until the new agent version with this fix has been released and rolled out.

All 22 comments

I have a customer who has the same issue. It seemed to work initially and then started accepting a second job. And, same as above, the second job fails because the agent is shutting down and the build is cancelled.

Any help on this would be great.

Hello @TingluoHuang can you please have a look into this?

Is it possible to try it with a newer version of the agent? I set up a test case and it seemed to work as expected in my dev environment.

I can confirm this behavior.
Checked with Agent version 2.155.1
This is what happened when queueing multiple builds in quick succession:

Scanning for tool capabilities.
Connecting to the server.
2019-09-03 19:23:46Z: Listening for Jobs
2019-09-03 19:23:56Z: Running job: Job
2019-09-03 19:24:05Z: Job Job completed with result: Succeeded
Removing agent from the server
Connecting to server ...
Failed: Removing agent from the server
Agent "TestAgent01" is running a job for pool "AZDO-Agent-Test"

I use run.sh to start the agent

It also behaves like this when starting with: ./bin/Agent.Listener run
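In both cases the single-job behaviour comes from passing the --once flag, i.e.:

    # via the wrapper script
    ./run.sh --once

    # or invoking the listener directly
    ./bin/Agent.Listener run --once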

That's very annoying behaviour. Is there a timeline for when it will be fixed?

I have a theory about the cause and will work with @vtbassmatt on scheduling a fix, as I think this will need an update to the server code.

Would like to chime in as well: we are experiencing this issue too. We have a pool of agents running as containers with the --once flag. It seems to happen when there is a backlog of queued jobs waiting on agents. Here is the error:

The agent: ED9603FB54F1 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

@croomshine that matches what @jtpetty found. We're hoping to address it shortly -- it's slightly more complex than your average bugfix.

How does the agent work in the hosted agent pool in AzDO? Those agents are also shut down after being used once. How is this achieved?

@jtpetty and @vtbassmatt any news on this?

@kirkone sorry for the delay. I'm hoping to get this scheduled soon; it's been a victim of higher priorities for the past few weeks.

@vtbassmatt thanks for the update.

It would be great to have this issue fixed, as we are trying to implement the concept of ephemeral agents using the --once option.
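Concretely, by ephemeral we mean the lifecycle sketched earlier in the thread: configure, run a single job, then unregister (a sketch; the token variable is a placeholder):

    # Run exactly one job, then remove the registration so the
    # pool does not accumulate offline agents.
    ./run.sh --once
    ./config.sh remove --unattended --auth pat --token "$AZP_TOKEN"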

We hear you and are working towards a fix. In the meantime, the feature works most of the time; the race condition is only encountered when there is a lot of queued-job pressure.

Any further update on this?

Fixed by #2728

Re-opening until the new agent version with this fix has been released and rolled out.

@alex-peck I think the new version is released, so this can be closed.

Thanks all for getting this sorted.

Is there something we have to do to get this new version pushed to our Azure DevOps instance? I just ran into this issue this morning with version 2.165.2.

I don't think so; that agent would be new enough, and the server-side change should have rolled out everywhere. @alex-peck, am I right about that?

Yes. It should not be possible for this to happen now. Can you send me the build link where you see this happening?

Looks like this is a false alarm. Appreciate the quick turnaround.
