For Linux Docker containers, exec ./externals/node/bin/node ./bin/AgentService.js interactive becomes PID 1, but it does not reap zombie processes. Docker containers that run multiple pipeline jobs end up with a large number of zombie processes, which makes troubleshooting harder.
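For readers unfamiliar with zombies, here is a minimal Python sketch (Linux only, since it reads /proc) of how one appears when a parent does not wait on an exited child, and how it disappears once the child is reaped. The names process_state and demo_zombie are hypothetical and exist only for this illustration; this is not agent code.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may contain spaces, so split after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately

    # The parent has not waited yet, so the exited child lingers as a zombie ("Z").
    deadline = time.monotonic() + 5
    while process_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = process_state(pid)

    os.waitpid(pid, 0)  # reaping removes the process table entry
    after = process_state(pid)
    return before, after
```

Inside a container, orphaned grandchildren are reparented to PID 1, so if PID 1 never waits on them they stay in the "Z" state until the container exits.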
The quick fix is to use docker run --init, which makes /sbin/docker-init PID 1 so that zombie processes are reaped. It also forwards signals, so AgentService.js should still get the TERM and INT signals that it needs to restart and self-update.
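To make the reaping and signal forwarding concrete, here is a minimal Python sketch of what an init process does in principle. It is an illustration under stated assumptions, not the actual docker-init implementation: run_as_init is a hypothetical name, and os.waitstatus_to_exitcode requires Python 3.9 or later.

```python
import os
import signal
import sys


def run_as_init(argv):
    """Run argv as a child, forward TERM/INT to it, and reap children until it exits."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replaces the child process on success
        except OSError:
            os._exit(127)  # exec failed; do not fall through into the parent's code

    def forward(signum, _frame):
        try:
            os.kill(child, signum)  # pass the signal on to the main child
        except ProcessLookupError:
            pass  # the child has already exited

    for signum in (signal.SIGTERM, signal.SIGINT):
        signal.signal(signum, forward)

    while True:
        # wait() returns for any child, including orphans reparented to us
        # (reparenting only happens when this process is PID 1 or a subreaper).
        pid, status = os.wait()
        if pid == child:
            code = os.waitstatus_to_exitcode(status)
            # A negative value means "killed by signal N"; map it to the shell's 128+N.
            return code if code >= 0 else 128 - code
        # Any other pid was an orphaned process; calling wait() already reaped it.


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_as_init(sys.argv[1:]))
```

With docker run --init none of this has to be written by hand; the sketch only shows why having such a process as PID 1 keeps zombies from piling up.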
I think the documentation should be updated to tell the user to use docker run --init.
Alternatively, AgentService.js could be updated to reap zombie processes.
https://ahmet.im/blog/minimal-init-process-for-containers/
@egrubbs Thank you for the feedback, I am passing this along to the team.
We don't really expect containers to run for longer than the length of a single job. Can you share more about your use case?
Microsoft-hosted agents exist in containers that only run for a single job, but the documentation for setting up a docker self-hosted agent uses the default behavior of AgentService.js, which continually requests new jobs.
The documentation for self-hosted Linux (not specific to docker) mentions the ability to use ./run.sh once to only process a single job, but the documentation does not recommend this as the normal way to run an agent. It even has directions on using systemd to run the agent as a service.
https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/v2-linux?view=azure-devops#run-once
Processing only a single job per container is not mentioned at all in the documentation for setting up self-hosted Linux agents inside Docker.
https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux
Good point, I should update our recommendations there. For a normal, VM-based self-hosted agent, it's not common to want --once. For container-based agents running in some kind of outside orchestrator (Kubernetes, ACI, etc.), we do recommend --once so that the agent goes away and the orchestrator can reap the container.
A doc-based fix has been merged and should publish today. I'm closing this issue since I'm likely to forget to circle back to it.