Agent version 2.173.0 on centos-release-7-6.1810.2.el7.centos.x86_64
Edit: It's a release agent for a deployment pool.
dev.azure.com (cloud)
After the auto-update to 2.173.0, the agent can no longer connect to Azure DevOps.
# Running run once with agent version 2.160.1
./run.sh --once
Scanning for tool capabilities.
Connecting to the server.
2020-08-25 21:31:02Z: Listening for Jobs
Agent update in progress, do not shutdown agent.
Downloading 2.173.0 agent
Waiting for current job finish running.
Generate and execute update script.
Agent will exit shortly for update, should back online within 10 seconds.
‘/root/azagent/_diag/SelfUpdate-20200825-213148.log’ -> ‘/root/azagent/_diag/SelfUpdate-20200825-213148.log.succeed’
Scanning for tool capabilities.
Connecting to the server.
# this now runs indefinitely
Similar to issue #2824
Could someone please tell me how to stop the auto update? Multiple agents on production machines are offline and I have, as of now, no idea how to fix that.
Edit: After trying different approaches and versions for hours, I found a workaround: currently using version 2.164.8 (with the agent.disableupdate variable to disable auto-updates). However, this is not a long-term solution and could break at any moment, as soon as version 2.164.8 is no longer accepted by the server.
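For anyone needing the same stopgap, the setup is roughly the following (a sketch on my part; the CDN download URL pattern and the use of the agent's .env file to surface agent.disableupdate as a capability are assumptions, so verify before relying on them):
# Pin an older agent version and advertise the agent.disableupdate capability.
cd /root/azagent
curl -fLO https://vstsagentpackage.azureedge.net/agent/2.164.8/vsts-agent-linux-x64-2.164.8.tar.gz
tar xzf vsts-agent-linux-x64-2.164.8.tar.gz
# assumption: entries in .env are loaded into the agent's environment and show
# up as capabilities, which the server checks before pushing updates
echo 'agent.disableupdate=1' >> .env
./svc.sh stop && ./svc.sh start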
@michaelschufi In the Azure DevOps UI, navigate to the agent pool UI (organization settings -> pipelines -> agent pools). Select the "Settings" tab on the pool in question; there is a toggle option for "Allow agents in this pool to automatically update".
It's a release agent, and this option is not available for deployment groups.
@michaelschufi Is the process crashing or is it still running but no further output appears in the log? If you install a new instance of 2.173.0 and configure it fresh in a separate directory does it have the same behavior?
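That is, something like this (a sketch; the download URL follows the pattern from the releases page and may differ):
mkdir /root/newagent && cd /root/newagent
curl -fLO https://vstsagentpackage.azureedge.net/agent/2.173.0/vsts-agent-linux-x64-2.173.0.tar.gz
tar xzf vsts-agent-linux-x64-2.173.0.tar.gz
./config.sh   # configure fresh, separate from the existing agent directory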
It is still running, but no further output appears. As in #2824, the service does not stop when I run ./svc.sh stop and stays in an "exiting" (or similar) state for a long time until I kill it with kill [pid]:
~/azagent $ ./svc.sh status
/etc/systemd/system/vsts.agent.myredactedorg.Main.host\x2dname.service
● vsts.agent.myredactedorg.Main.host\x2dname.service - Azure Pipelines Agent (myredactedorg.Main.hostname)
Loaded: loaded (/etc/systemd/system/vsts.agent.myredactedorg.Main.host\x2dname.service; enabled; vendor preset: disabled)
Active: deactivating (stop-sigterm) since Wed 2020-08-26 18:35:39 UTC; 1min 34s ago
Main PID: 396 (runsvc.sh)
Tasks: 19
Memory: 424.2M
CGroup: /system.slice/vsts.agent.myredactedorg.Main.host\x2dname.service
├─ 396 /bin/bash /root/azagent/runsvc.sh
├─ 405 ./externals/node/bin/node ./bin/AgentService.js
└─31334 /root/azagent/bin/Agent.Listener run --startuptype service
Aug 25 17:05:25 hostname runsvc.sh[396]: Agent listener exited with error code 3
Aug 25 17:05:25 hostname runsvc.sh[396]: Agent listener exit because of updating, re-launch agent in 5 seconds.
Aug 25 17:05:30 hostname runsvc.sh[396]: Starting Agent listener with startup type: service
Aug 25 17:05:30 hostname runsvc.sh[396]: Started listener process
Aug 25 17:05:32 hostname runsvc.sh[396]: Scanning for tool capabilities.
Aug 25 17:05:32 hostname runsvc.sh[396]: Connecting to the server.
Aug 26 18:35:39 hostname systemd[1]: Stopping Azure Pipelines Agent (myredactedorg.Main.hostname)...
Aug 26 18:35:39 hostname runsvc.sh[396]: Shutting down agent listener
Aug 26 18:35:39 hostname runsvc.sh[396]: Sending SIGINT to agent listener to stop
Aug 26 18:35:39 hostname runsvc.sh[396]: Exiting...
Hint: Some lines were ellipsized, use -l to show in full.
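To actually get the service down I have to kill the hung listener by hand, roughly:
# systemd stays in "deactivating (stop-sigterm)", so kill the listener
# using the Agent.Listener PID from the status output above:
kill -9 31334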
and running the config gives:
>> End User License Agreements:
Building sources from a TFVC repository requires accepting the Team Explorer Everywhere End User License Agreement. This step is not required for building sources from Git repositories.
A copy of the Team Explorer Everywhere license agreement can be found at:
/root/newagent/externals/tee/license.html
>> Connect:
Connecting to server ...
>> Register Agent:
Scanning for tool capabilities.
Connecting to the server.
Enter replace? (Y/N) (press enter for N) > y
Enter deployment group tags for agent? (Y/N) (press enter for N) > n
Successfully replaced the agent
Testing agent connection.
# runs indefinitely without new output
@mjroghelia I am also experiencing the same issues mentioned in this comment on versions newer than 2.164.8.
Something I have not mentioned yet:
I am installing and running the agent as the root user, with AGENT_ALLOW_RUNASROOT="1".
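For reference, the install and configuration are roughly the following (a sketch; paths, names, and the org URL are placeholders):
export AGENT_ALLOW_RUNASROOT="1"   # required to configure/run the agent as root
cd /root/azagent
./config.sh --deploymentgroup --deploymentgroupname "All" --acceptteeeula \
  --url https://dev.azure.com/orgname --projectname "Test"
./svc.sh install && ./svc.sh start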
@mjroghelia The same is happening on an Ubuntu 18.04 installation. I am also getting an "Authentication failed with status code 401." message (version 2.174.1 this time). The process hangs rather than crashing, and no further log output appears. I've also enabled http tracing - these are the last few lines of the log:
[2020-09-01 14:44:12Z INFO RSAFileKeyManager] Loading RSA key parameters from file /root/azagent/.credentials_rsaparams
[2020-09-01 14:44:13Z INFO RSAFileKeyManager] Loading RSA key parameters from file /root/azagent/.credentials_rsaparams
[2020-09-01 14:44:13Z WARN VisualStudioServices] Authentication failed with status code 401.
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
WWW-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/0ba2168f-2571-49f4-851f-a59710df852f, Basic realm="https://tfsprodweu5.visualstudio.com/", TFS-Federated
X-TFS-ProcessId: 127fcabc-8462-4afc-99d4-3ec597a296ef
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId: f167f102-f872-46d0-8403-d33f4379ae4e
X-TFS-Session: dfcf33c5-ab0b-4cd0-b7df-cd4a43f2953a
X-VSS-E2EID: 1cd18c4e-8456-4aa1-9c74-fb644f248d38
X-Frame-Options: SAMEORIGIN
X-TFS-FedAuthRealm: https://tfsprodweu5.visualstudio.com/
X-TFS-FedAuthIssuer: https://dev.azure.com/orgname/
X-VSS-AuthorizationEndpoint: https://vssps.dev.azure.com/orgname/
X-VSS-ResourceTenant: 0ba2168f-2571-49f4-851f-a59710df852f
X-VSS-S2STargetService: 00000002-0000-8888-8000-000000000000/visualstudio.com
X-TFS-FedAuthRedirect: https://spsprodweu4.vssps.visualstudio.com/_signin?realm=dev.azure.com&reply_to=https%3A%2F%2Fdev.azure.com%2Forgname%2F_apis%2FconnectionData%3FconnectOptions%3D0%26lastChangeId%3D61616171%26lastChangeId64%3D61616171&redirect=1&hid=e1801778-422d-4abd-aac1-504d3ffbd27d&context=eyJodCI6MiwiaGlkIjoiMGYzMjg1NWYtZjc3OS00YjE5LTgwMDYtYTlhYjRlMzVhNDBkIiwicXMiOnt9LCJyciI6IiIsInZoIjoiIiwiY3YiOiIiLCJjcyI6IiJ90#ctx=eyJTaWduSW5Db29raWVEb21haW5zIjpbImh0dHBzOi8vbG9naW4ubWljcm9zb2Z0b25saW5lLmNvbSIsImh0dHBzOi8vbG9naW4ubWljcm9zb2Z0b25saW5lLmNvbSJdfQ2
Request-Context: appId=cid-v1:1d01aa00-5a22-493d-8bc0-acd0af8f839d
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-MSEdge-Ref: Ref A: 197D666C864645F28E6CA616D548D70A Ref B: ZRHEDGE0813 Ref C: 2020-09-01T14:44:12Z
Date: Tue, 01 Sep 2020 14:44:12 GMT
[2020-09-01 14:44:13Z WARN VisualStudioServices] Authentication failed with status code 401.
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
WWW-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/0ba2168f-2571-49f4-851f-a59710df852f, Basic realm="https://tfsprodweu5.visualstudio.com/", TFS-Federated
X-TFS-ProcessId: ab178765-4ec5-4710-b2b2-dde539ce7601
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId: 70a5fbe5-a57a-48dc-b0a5-519475ac42f3
X-TFS-Session: dfcf33c5-ab0b-4cd0-b7df-cd4a43f2953a
X-VSS-E2EID: 12a82770-7360-4f2e-8d20-ea480814827e
X-Frame-Options: SAMEORIGIN
X-TFS-FedAuthRealm: https://tfsprodweu5.visualstudio.com/
X-TFS-FedAuthIssuer: https://dev.azure.com/orgname/
X-VSS-AuthorizationEndpoint: https://vssps.dev.azure.com/energyschweiz/
X-VSS-ResourceTenant: 0ba2168f-2571-49f4-851f-a59710df852f
X-VSS-S2STargetService: 00000002-0000-8888-8000-000000000000/visualstudio.com
X-TFS-FedAuthRedirect: https://spsprodweu4.vssps.visualstudio.com/_signin?realm=dev.azure.com&reply_to=https%3A%2F%2Fdev.azure.com%2Forgname%2F_apis%2FconnectionData%3FconnectOptions%3D0%26lastChangeId%3D61616171%26lastChangeId64%3D61616171&redirect=1&hid=e1801778-422d-4abd-aac1-504d3ffbd27d&context=eyJodCI6MiwiaGlkIjoiMGYzMjg1NWYtZjc3OS00YjE5LTgwMDYtYTlhYjRlMzVhNDBkIiwicXMiOnt9LCJyciI6IiIsInZoIjoiIiwiY3YiOiIiLCJjcyI6IiJ90#ctx=eyJTaWduSW5Db29raWVEb21haW5zIjpbImh0dHBzOi8vbG9naW4ubWljcm9zb2Z0b25saW5lLmNvbSIsImh0dHBzOi8vbG9naW4ubWljcm9zb2Z0b25saW5lLmNvbSJdfQ2
Request-Context: appId=cid-v1:1d01aa00-5a22-493d-8bc0-acd0af8f839d
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-MSEdge-Ref: Ref A: B4F2F49A95B04D3B89C1584555AAF96E Ref B: ZRHEDGE0811 Ref C: 2020-09-01T14:44:12Z
Date: Tue, 01 Sep 2020 14:44:13 GMT
[2020-09-01 14:44:13Z INFO VisualStudioServices] AAD Correlation ID for this token request: Unknown
[2020-09-01 14:44:13Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2020-09-01 14:44:13Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2020-09-01 14:44:13Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
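For completeness, the http tracing above was enabled roughly like this before starting the agent (assuming the VSTS_AGENT_HTTPTRACE switch; the trace ends up in the _diag logs):
cd /root/azagent
export VSTS_AGENT_HTTPTRACE=true
./run.sh --once   # the HTTP trace is written into the _diag/Agent_*.log files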
@mjroghelia Any updates on this?
This is a high-severity issue and it's currently blocking our organization's release pipelines. Is there any update?
Using version 2.164.8 is the only way to add new agents, but they won't connect when started.
@anatolybolshakov Please have someone take a look at reproducing this one. The summary is that, for some users, agent versions since we upgraded to .NET Core 3.1 have been reported to hang or crash on startup. One detail that may be relevant is that this user is running as root. Also note that 401s appear in the logs when the agent first starts normally, so that is not likely related (see #3149). The real issue sounds like the process hangs and never starts polling for messages.
Hi @anatolybolshakov
I finally have some more information about this issue. Our hoster's partner has investigated the issue and this was their response:
Initially, the file does not exist in the ./azagent directory; it was created by the process (pid 444), one of the children of the main process pid 428 [1]. The process which forked pid 450 was killed before pid 450 exited, leading to a zombie state for the unfinished chmod.
The above scenario is considered to be out of scope, as it is related to the way the script works. In other words, the issue you have reported can also be seen on local systems (though not all of them). This is not a flaw in the product, but rather a race condition occurring during script execution.
I appreciate your patience and cooperation while we were processing the case.
[1]
233 pts/1 Ss+ 0:00 \_ /bin/sh
391 pts/1 S   0:00 /bin/bash ./config.sh --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node45241.example.ch
428 pts/1 Sl  0:08   \_ ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node45241-example.ch
450 pts/1 Z   0:00     \_ [chmod] <defunct>
strace results:
450 fchmodat(AT_FDCWD, "/azagent/.credentials_rsaparams", 0600 <unfinished ...>
450 <... fchmodat resumed> ) = 0
450 fchmodat(AT_FDCWD, "/azagent/.credentials_rsaparams", 0600 <unfinished ...>
450 <... fchmodat resumed> ) = 0
450 close(1<pipe:[104053172]> <unfinished ...>
450 <... close resumed> ) = 0
450 close(2<pipe:[104053175]> <unfinished ...>
450 <... close resumed> ) = 0
450 exit_group(0) = ?
450 +++ exited with 0 +++
I would really appreciate it if you could look into this issue. It's been an issue for a rather long time now. I was also writing in this VS Developer Community thread.
Please tell me if I can test something or help in some other way.
Has anyone found a workaround? I am kind of blocked in a go-live scenario. I even tried version 2.164.6, but it still doesn't work.
Please run the deployment group script in the Windows PowerShell console, not the Windows PowerShell ISE. This actually worked for me with 2.175.2.
Hi @michaelschufi
We are currently investigating this issue - do you have a chance to check whether it's reproducible for you on any other Linux OS (like Ubuntu), or does it seem to be CentOS-specific?
Hi @anatolybolshakov
It's reproducible in our environment on Ubuntu 18.04. (The logs posted above are from CentOS, afaik.)
Edit: To clarify: it is the same virtualization environment as the node that the posted logs came from.
@michaelschufi could you please check whether there are any other errors in the agent logs, apart from the 401 issues (those seem unrelated to the current issue)?
@michaelschufi could you also try removing the agent in Azure DevOps and reconfiguring the updated agent, by running ./config.sh remove and then ./config.sh sequentially?
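That is, roughly the following (flags as used earlier in this thread; the URL is a placeholder):
cd /root/azagent
./config.sh remove   # unregister the agent from the deployment group
./config.sh --deploymentgroup --deploymentgroupname "All" --acceptteeeula \
  --url https://dev.azure.com/orgname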
@anatolybolshakov
I'm not sure if you're able to access all the comments in this VS Developer Community thread, but it holds many logs and things I have already tried. I will try to share as much as possible publicly, but the HTTP traces should stay private, afaik.
(In my next comment I will try to summarize some other approaches my hoster and I have been trying over the last months.)
@michaelschufi could you also try removing the agent in Azure DevOps and reconfiguring the updated agent, by running ./config.sh remove and then ./config.sh sequentially?
Of course - I can and will get back to you. But if this is just to "clean up" the server side, I don't see how it helps: if I newly provision an Ubuntu 18.04 VM (with a different hostname, etc.), the problem persists. It is not limited to updating the agent.
Please quickly tell me how it is relevant.
could you please check whether there are any other errors in the agent logs, apart from the 401 issues (those seem unrelated to the current issue)?
There are not many errors in there. I think that is why most people report the 401 errors.
The most interesting one so far is:
[2020-09-23 08:43:13Z INFO DeploymentGroupAgentConfigProvider] Connect complete for deployment group
[2020-09-23 08:43:13Z INFO ConfigurationManager] Test Connection complete.
[2020-09-23 08:43:13Z INFO HostContext] Well known directory 'Bin': '/home/azure/azagent/bin'
[2020-09-23 08:43:13Z INFO HostContext] Well known directory 'Root': '/home/azure/azagent'
[2020-09-23 08:43:13Z INFO HostContext] Well known config file 'RSACredentials': '/home/azure/azagent/.credentials_rsaparams'
[2020-09-23 08:43:13Z INFO RSAFileKeyManager] Creating new RSA key using 2048-bit key length
[2020-09-23 08:43:13Z INFO RSAFileKeyManager] Successfully saved RSA key parameters to file /home/azure/azagent/.credentials_rsaparams
[2020-09-23 08:43:13Z INFO RSAFileKeyManager] Which: 'chmod'
[2020-09-23 08:43:13Z INFO RSAFileKeyManager] Location: '{matches.First()}'
[2020-09-23 08:43:13Z INFO HostContext] Well known directory 'Bin': '/home/azure/azagent/bin'
[2020-09-23 08:43:13Z INFO HostContext] Well known directory 'Root': '/home/azure/azagent'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Starting process:
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] File name: '/bin/chmod'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Arguments: '600 /home/azure/azagent/.credentials_rsaparams'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Working directory: '/home/azure/azagent'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Require exit code zero: 'False'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: ''
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'False'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Persist current code page: 'False'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] High priority process: 'False'
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 5183.
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] System.IO.IOException: Invalid argument
at System.IO.FileStream.WriteNative(ReadOnlySpan`1 source)
at System.IO.FileStream.FlushWriteBuffer()
at System.IO.FileStream.Dispose(Boolean disposing)
at System.IO.Stream.Close()
at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing)
at System.IO.StreamWriter.Dispose(Boolean disposing)
at System.IO.TextWriter.Dispose()
at System.IO.File.WriteAllText(String path, String contents)
at Microsoft.VisualStudio.Services.Agent.Util.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Process started with process id 5183, waiting for process exit.
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Exited process 5183 with exit code 0
[2020-09-23 08:43:13Z INFO ProcessInvokerWrapper] Finished process 5183 with exit code 0, and elapsed time 00:00:00.0422561.
[2020-09-23 08:43:13Z INFO RSAFileKeyManager] Successfully set permissions for RSA key parameters file /home/azure/azagent/.credentials_rsaparams
[2020-09-23 08:43:13Z INFO Terminal] WRITE LINE:
[2020-09-23 08:43:13Z INFO Terminal] WRITE LINE: >> Register Agent:
[2020-09-23 08:43:13Z INFO Terminal] WRITE LINE:
[2020-09-23 08:43:13Z INFO CommandSettings] Arg 'projectname': 'Test'
So it completed afterwards, even though there was an error. (As far as I can remember - this was more than a month ago.)
Running the agent afterwards does not work.
Note: All the issues are reproducible on newly provisioned Ubuntu 18.04 and CentOS 7.8 systems.
This is described above after updating to the newest version, and I think also in #2824.
The agent is "configurable" after running the same command multiple times. See the following log: config_retries.log
I experienced a hang in the following steps (this was on another system than the log above, sorry):
but it was "configured" afterwards.
This was done after running the configuration as described above.
I say "successful" because running the agent afterwards and giving it a job results in the following (./run.sh --once): run.log
After cancelling a hanging process:
Scanning for tool capabilities.
Connecting to the server.
^CExiting...
./config.sh: line 86: 2688 Segmentation fault ./bin/Agent.Listener configure "$@"
CT-54577 azagent# ./config.sh --deploymentgroup --deploymentgroupname "All" --acceptteeeula -
Seg faults seem to go "away" if I do chown -R root:root:
total 133944
drwxr-xr-x 4 1001 118 4096 Sep 25 19:27 .
drwx------ 6 root root 4096 Sep 30 06:59 ..
drwxr-xr-x 14 1001 118 20480 Sep 25 19:27 bin
-rwxr-xr-x 1 1001 118 2863 Sep 25 19:14 config.sh
-rwxr-xr-x 1 1001 118 734 Sep 25 19:14 env.sh
drwxr-xr-x 7 1001 118 4096 Sep 25 19:15 externals
-rwxr-xr-x 1 1001 118 2014 Sep 25 19:14 run.sh
-rw-r--r-- 1 root root 137109502 Sep 30 06:59 vstsagent.tar.gz
total 133960
drwxr-xr-x 5 1001 118 4096 Sep 30 07:06 .
drwx------ 6 root root 4096 Sep 30 07:07 ..
drwxr-xr-x 14 1001 118 20480 Sep 25 19:27 bin
-rwxr-xr-x 1 1001 118 2863 Sep 25 19:14 config.sh
-rw------- 1 root root 1667 Sep 30 07:06 .credentials_rsaparams
drwxr-xr-x 2 root root 4096 Sep 30 07:06 _diag
-rw-r--r-- 1 root root 17 Sep 30 07:06 .env
-rwxr-xr-x 1 1001 118 734 Sep 25 19:14 env.sh
drwxr-xr-x 7 1001 118 4096 Sep 25 19:15 externals
-rw-r--r-- 1 root root 131 Sep 30 07:06 .path
-rwxr-xr-x 1 1001 118 2014 Sep 25 19:14 run.sh
-rw-r--r-- 1 root root 137109502 Sep 30 06:59 vstsagent.tar.gz
total 133976
drwxr-xr-x 5 1001 118 4096 Sep 30 07:10 .
drwx------ 6 root root 4096 Sep 30 07:07 ..
-rw-r--r-- 1 root root 290 Sep 30 07:10 .agent
drwxr-xr-x 14 1001 118 20480 Sep 25 19:27 bin
-rwxr-xr-x 1 1001 118 2863 Sep 25 19:14 config.sh
-rw-r--r-- 1 root root 290 Sep 30 07:10 .credentials
-rw------- 1 root root 1667 Sep 30 07:06 .credentials_rsaparams
drwxr-xr-x 2 root root 4096 Sep 30 07:09 _diag
-rw-r--r-- 1 root root 17 Sep 30 07:06 .env
-rwxr-xr-x 1 1001 118 734 Sep 25 19:14 env.sh
drwxr-xr-x 7 1001 118 4096 Sep 25 19:15 externals
-rw-r--r-- 1 root root 131 Sep 30 07:09 .path
-rwxr-xr-x 1 1001 118 2014 Sep 25 19:14 run.sh
-rwxr-xr-x 1 root root 4627 Sep 30 07:10 svc.sh
-rw-r--r-- 1 root root 137109502 Sep 30 06:59 vstsagent.tar.gz
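For reference, the ownership fix mentioned above amounts to the following (the listings show the extracted files owned by uid 1001 / gid 118, the build user baked into the tarball; adjust the path to your agent directory):
# hand the extracted agent files to root, since the agent runs as root here
chown -R root:root /root/azagent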
@michaelschufi we tried to reproduce it on newly provisioned Ubuntu 18.04 and CentOS 7 with the latest agent version, but with no luck - so I'm currently trying to determine which process was killed and led to the zombie state of chmod.
@michaelschufi does this oom_score_adj error from the log above appear when the connection process is stuck?
@anatolybolshakov
This occurred during "Connecting to server", yes. Unfortunately, it is not logged every time :/
@michaelschufi it seems that the process created at this line is being killed for some reason, and that leads to the chmod zombie state... Could you please run 'dmesg | less' on your machine right after the agent gets stuck, to verify that no processes have been killed? (you would see lines in the output like 'Killed process ...')
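For example:
# look specifically for OOM-killer activity right after the hang:
dmesg | grep -iE 'killed process|out of memory'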
Hi @anatolybolshakov
Unfortunately, I cannot currently provide you with the kernel logs. But ps -aux shows the following when stuck on the first "Connecting to server":
root 14651 0.0 0.5 9908 2796 pts/1 S+ 11:14 0:00 /bin/bash ./config.sh --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch --url https://dev.azure.com/mi
root 14693 5.9 17.5 4142232 92016 pts/1 Sl+ 11:14 0:01 ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch --url https://dev.azu
root 14725 0.0 0.0 0 0 pts/1 Z+ 11:14 0:00 [chmod] <defunct>
Edit:
top process tree
0:00.47 `- sshd: root@pts/0
0:00.00 `- -bash
0:00.00 `- -bash
0:00.00 `- /bin/bash ./config.sh --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-server.com --url +
0:00.69 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-serv+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.44 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.55 `- ./bin/Agent.Listener configure --deploymentgroup --deploymentgroupname All --acceptteeeula --agent node.hoster.ch-+
0:00.00 `- [chmod] <defunct>
@michaelschufi it seems that the parent process of pid 14725 has been killed for some reason - it looks like there are some external factors leading to this; I can't determine them currently...
@michaelschufi is there any chance to check this, to exclude it as one of the external factors that could lead to dropping the parent process of chmod?
@anatolybolshakov Hmm, I am not sure what I should check. I could maybe send you a process list or similar.
There are of course some basic services running on the system. But we have not experienced similar problems with other software so far, and we let our hoster and their partner investigate the issue; they determined it was a problem with the Azure agent itself. Their feedback was quoted in one of my previous comments above. What is strange, though, is that we are not the only ones with this problem, which leads me to believe external factors are not the only cause - it also happens on both CentOS and Ubuntu, which is even stranger.
Maybe we can go as far as to give you access to a sandbox system, so you can do some tests. However, I cannot promise you anything yet. Would this be an option?
Either way, I would appreciate it if you could write me an email, since I want to avoid sharing potentially too much information publicly on GitHub (basic security reasons), which I hope you can understand.
@michaelschufi sure, here's my email - [email protected]
Could you please share the full agent logs for the case when it gets stuck, so we can investigate further?
It seems that there could be some issue on the .NET Core side - I have created a ticket.
I am running RHEL and seeing something related, where permission is denied on the .credentials_rsaparams file, yet none of the dot files/folders exist.
@TroySchmidt could you please share your agent logs also?
Hi @michaelschufi, do you have any updates regarding the last round of testing?
Yep. Unfortunately, the echo -1000 > /proc/$$/oom_score_adj scenario doesn't seem to help.
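For context, what was tried is roughly the following (a sketch; writing -1000 exempts the shell, and the children that inherit the value, from the OOM killer):
# exempt the current shell and its children (config.sh, Agent.Listener, chmod)
# from the OOM killer, then configure as usual:
echo -1000 > /proc/$$/oom_score_adj
./config.sh --deploymentgroup --deploymentgroupname "All" --acceptteeeula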