Argo-cd: ArgoCD authentication handshake failed

Created on 22 Dec 2020 · 20 Comments · Source: argoproj/argo-cd

Describe the bug

Hello ArgoCD Team!

We upgraded our ArgoCD instances recently and are now facing a bothersome issue. From time to time, ArgoCD starts a sync process and appears to hang somewhere. Then it works normally for a while before the problem hits again.

We also noticed that restarting argocd-repo-server helps for a short while.

We have one replica of argocd-repo-server and two replicas of argocd-server.

To Reproduce

N/A.

Expected behavior

ArgoCD does not hang during app sync.

Screenshots

Screenshot 2020-12-22 at 16 03 04

Version

argocd: v1.8.1+c2547dc
  BuildDate: 2020-12-10T02:57:57Z
  GitCommit: c2547dca95437fdbb4d1e984b0592e6b9110d37f
  GitTreeState: clean
  GoVersion: go1.14.12
  Compiler: gc
  Platform: linux/amd64

Logs

time="2020-12-22T12:43:44Z" level=info msg="Sync operation to  failed: ComparisonError: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: internal error\"" application=graph-db-graphdb-api-instances dest-namespace=graph-db dest-server="https://kubernetes.default.svc" reason=OperationCompleted type=Warning
Labels: bug, high, major


All 20 comments

@alexmt Probably similar issue as the one we recently talked about.

Yes, we are facing this internally as well. We suspect the upgrade of the gRPC libraries may have caused this, but we haven't confirmed it. In an internal build, we put in place a gRPC health check that kills the repo-server when this happens; that allowed us to recover, but it doesn't address the root cause.

As discussed on Slack already, it would be really valuable if someone could inspect the argocd-repo-server listener while it's failing, with another TLS-capable client that gives some more information about the handshake.

For example, the full output of openssl s_client -host <host_of_failing_reposerver_pod> -port <port_on_pod> (or the service instead of the pod, if there's only one repo-server replica) might give us some more indication of what happens at the TLS handshake level.

I have also taken a look at our gRPC stack, and it looks terrible :) We have updated some of the components but not others, and it's not easy to get this straight due to many legacy constructs in the code. I'm working on it, though.

I'm still trying to reproduce the issue locally, but without much luck.

As discussed on Slack already, it would be really valuable if someone could inspect the argocd-repo-server listener while it's failing, with another TLS-capable client that gives some more information about the handshake.

For example, the full output of openssl s_client -host <host_of_failing_reposerver_pod> -port <port_on_pod> (or the service instead of the pod, if there's only one repo-server replica) might give us some more indication of what happens at the TLS handshake level.

We will try to catch it next time ArgoCD hangs.

Hi,

We're currently facing the issue; here is the result of the command (run from inside a repo-server):


argocd@argocd-repo-server-c4d8c7f6b-cwhjj:~$ openssl s_client -host localhost -port 8081

CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 O = Argo CD
verify error:num=18:self signed certificate
verify return:1
depth=0 O = Argo CD
verify return:1
140146647491712:error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error:../ssl/record/rec_layer_s3.c:1544:SSL alert number 80
---
Certificate chain
 0 s:O = Argo CD
   i:O = Argo CD
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDETCCAfmgAwIBAgIRAKGoOTg9C6uM6wu2GonmFT4wDQYJKoZIhvcNAQELBQAw
EjEQMA4GA1UEChMHQXJnbyBDRDAeFw0yMDEyMjIxNjMyMjRaFw0yMTEyMjIxNjMy
MjRaMBIxEDAOBgNVBAoTB0FyZ28gQ0QwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAw
ggEKAoIBAQCsshiDMHQ61J55+0AOApeCY4xgmbQ+/+YqWKQDbRpBbHdDEgf4Bux8
Ij1XUBgC4iTlI6LNu1rXgt3wIr3peeiNsrjcHR73C0A5WVAPsbD+ueyAj9mdA1RM
Iq70WwvdJv7ITGSCfJ2di4gV3AeMVO2qBH5A5GOtIaVt4dbskvG4i7cvvr+Rvnrq
4xX1jMbjh1pkhCnXrNhnyxPrwHgYV0Lz4+eeirMsJD603OJgekYDHGT/v9AtnW/8
KN/6VGHy4qvjjlms7wXaeANK1wu9h0dfublFuC8jvr/BUdHdMqZ71vn/FluRVdqa
smAHQWO2DxdbR2CHXuTuLaKdZYTwOso3AgMBAAGjYjBgMA4GA1UdDwEB/wQEAwIC
pDATBgNVHSUEDDAKBggrBgEFBQcDATAPBgNVHRMBAf8EBTADAQH/MCgGA1UdEQQh
MB+CCWxvY2FsaG9zdIISYXJnb2NkLXJlcG8tc2VydmVyMA0GCSqGSIb3DQEBCwUA
A4IBAQCargXh/niqJcbKZGkhDp7SY72Fmy9wSjnfSALOJtiomHeAt2kuOmmcu8v6
B62xIHYHMIU/bVecV4CgdyoVOeNmA9Hs3UUuIMBWWuCPFnUJUIpijY34/xdYceXB
AHX8OGmjY/VdLQgRM5fQg+ufZiqNRRPnB9uxxzpqy1VxGKetoXdCzfATmIsNh32N
0otUE2PGEufM01ggJWD3sUoKewlBHPmyAocEzDDLFVsQdKfFFB4PsYPCDZvlH2mq
CoDvRcSQ2Y2D7U6DlzQOrDBhqVMmpC9GHp8wHtH+rMgo1ZJyAfZgIvYJTgdfn2HC
LqFFE09tUWyK8ZZSIhLtSE53QZsZ
-----END CERTIFICATE-----
subject=O = Argo CD

issuer=O = Argo CD

---
No client certificate CA names sent
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1009 bytes and written 283 bytes
Verification error: self signed certificate
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 18 (self signed certificate)
---

Hope this helps.

Edit: I solved the issue by deleting the repo-server. That was okay for that cluster, which only has a few apps, but our main cluster has more than 2500 applications and 10 repo-servers, so I can't use that as a mitigation there.

OK, this one is interesting

140146647491712:error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error:../ssl/record/rec_layer_s3.c:1544:SSL alert number 80

Apparently, for the openssl client this error is transient, while for our golang/gRPC client it is not.

Will investigate further.

Thank you for this additional information, @Issif
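For anyone who wants to poke at a failing listener from the Go side as well, a small crypto/tls probe along the following lines should show whether Go's TLS stack receives the same alert 80. This is only a diagnostic sketch; the host, port, and the InsecureSkipVerify setting are placeholders, not Argo CD code.

// Diagnostic sketch: attempt a TLS handshake against the repo-server
// listener the way a Go client would, and print the outcome.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", "localhost:8081", &tls.Config{
		// The repo server uses a self-signed certificate, so skip verification
		// here to focus on the handshake itself.
		InsecureSkipVerify: true,
	})
	if err != nil {
		// On a broken listener this is expected to report something like
		// "remote error: tls: internal error".
		fmt.Println("handshake failed:", err)
		return
	}
	defer conn.Close()

	state := conn.ConnectionState()
	fmt.Printf("handshake ok: version=0x%04x cipher=0x%04x\n", state.Version, state.CipherSuite)
}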

Having the same issue here.

@Issif, when you said you deleted the repo-server, can you clarify what exactly you did? Did you delete the pods of the argocd-repo-server deployment?

Having the same issue here.

@Issif, when you said you deleted the repo-server, can you clarify what exactly you did? Did you delete the pods of the argocd-repo-server deployment?

Exactly. I just deleted the pod and let the deployment recreate it.

Got it. I guess I'll use that as a workaround until this is fixed.

We've added a livenessProbe that restarts the repo server as a workaround. Here is the merge patch that introduces the probe:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  template:
    spec:
      containers:
      - name: argocd-repo-server
        livenessProbe:
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3
          exec:
            command: [ "/usr/local/bin/grpc-health-probe", "-addr=:8081", "-tls", "-tls-no-verify" ]
        volumeMounts:
        - mountPath: /usr/local/bin/grpc-health-probe
          name: custom-tools
          subPath: grpc-health-probe
      volumes:
      - name: custom-tools
        emptyDir: {}
      initContainers:
      - name: download-grpc-health-probe
        image: docker.intuit.com/oicp/alpine3.8:latest
        command: [sh, -c]
        args:
          - wget --no-check-certificate https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/v0.3.5/grpc_health_probe-linux-amd64 &&
            mv grpc_health_probe-linux-amd64 /custom-tools/grpc-health-probe &&
            chmod ugo+x /custom-tools/grpc-health-probe
        volumeMounts:
          - mountPath: /custom-tools
            name: custom-tools

@jannfis, @jessesuen, do you think we should add it by default until the issue is resolved?

Yes, I think we should. The consequence of not having this is that users' repo-servers will eventually be rendered inoperable.

I don't like the fork/exec and the additional init container, though. Instead of doing that, we could just change the existing HTTP health check implementation to call the gRPC health check on localhost. So the GET /healthz handler would call gRPC GetHealth on localhost:8080. What do you think?
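A minimal sketch of what such a handler could look like, assuming the gRPC listener is reachable on localhost (8081 in the openssl output above, 8080 in the suggestion). The ports, TLS settings, and the HTTP listen address are illustrative, not the actual Argo CD code:

// Sketch of an HTTP /healthz handler that proxies to the gRPC health
// service on the local listener.
package main

import (
	"context"
	"crypto/tls"
	"net/http"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func healthzHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
	defer cancel()

	// The repo server serves gRPC over a self-signed certificate, so skip
	// verification for the loopback check (adjust to the real setup).
	creds := credentials.NewTLS(&tls.Config{InsecureSkipVerify: true})
	conn, err := grpc.DialContext(ctx, "localhost:8081",
		grpc.WithTransportCredentials(creds), grpc.WithBlock())
	if err != nil {
		http.Error(w, "grpc dial failed: "+err.Error(), http.StatusServiceUnavailable)
		return
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil || resp.GetStatus() != healthpb.HealthCheckResponse_SERVING {
		http.Error(w, "grpc health check failed", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/healthz", healthzHandler)
	http.ListenAndServe(":8084", nil) // hypothetical HTTP health endpoint port
}

Kubernetes could then use a plain httpGet liveness probe against /healthz instead of the exec probe shown above.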

I don't like the fork/exec and the additional init container, though. Instead of doing that, we could just change the existing HTTP health check implementation to call the gRPC health check on localhost. So the GET /healthz handler would call gRPC GetHealth on localhost:8080. What do you think?

Is that a built-in function? Or do we need to implement a dummy RPC for that, as described in https://github.com/grpc/grpc/blob/master/doc/health-checking.md?
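For reference, grpc-go ships a stock implementation of that health-checking protocol in google.golang.org/grpc/health, so no dummy RPC should be required; registering it on a server looks roughly like this (a sketch under that assumption, not the actual repo-server code):

// Sketch: registering the standard gRPC health service on a server,
// using the stock implementation from google.golang.org/grpc/health.
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":8081") // repo-server gRPC port, per the openssl output above
	if err != nil {
		panic(err)
	}

	server := grpc.NewServer()
	healthSrv := health.NewServer()
	// SERVING is already the default for the empty (server-wide) service
	// name; set it explicitly for clarity.
	healthSrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(server, healthSrv)

	// ... the actual repo-server services would be registered here ...
	server.Serve(lis)
}

This is the same service that grpc-health-probe (used in the livenessProbe workaround above) queries.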

So, just to gather more information: this issue started for people after upgrading to 1.8, right?

May I ask which exact versions you were upgrading from that were not affected by this issue? We updated Go with v1.7.9, and it would be interesting to know whether that and later 1.7 versions are affected as well.

I don't like the fork/exec and the additional init container, though. Instead of doing that, we could just change the existing HTTP health check implementation to call the gRPC health check on localhost. So the GET /healthz handler would call gRPC GetHealth on localhost:8080. What do you think?

I'm also wary of fork/exec, so I like this suggestion, assuming it can catch the issue.

So, just to gather more information: this issue started for people after upgrading to 1.8, right?

Yes, we only started noticing this after upgrading Argo CD from v1.7 to v1.8.

So, just to gather more information: this issue started for people after upgrading to 1.8, right?

May I ask which exact versions you were upgrading from that were not affected by this issue? We updated Go with v1.7.9, and it would be interesting to know whether that and later 1.7 versions are affected as well.

We upgraded from 1.7 to 1.8, and I can confirm that 1.7 did not have this issue.

We saw this from 1.7 -> 1.8 as well

I'm still not able to reproduce this issue - it doesn't happen in my production environment (HA setup with 1 controller, 2 API servers and 2 repo server instances), nor with a local test setup where I'm hammering the repo server with forced refreshes of applications.

May I ask people for the following information:

  1. What Kubernetes version are you running on?
  2. What does your setup look like (in terms of scaling, and do you deviate from the standard settings in the installation manifests, e.g. changed command-line parameters, etc.)?
  3. How many apps is your Argo CD managing, and on how many clusters?

Also, if it is happening quite often in your environment, it might be worth trying to downgrade the repo server TLS endpoint from TLS 1.3 to TLS 1.2 by adding --tlsmaxversion 1.2 to the startup parameters of the Argo CD repo server. Could anyone please try that and see if it helps the situation?

The above can be done by changing the argocd-repo-server Deployment in .spec.template.spec from:

      containers:
      - command:
        - uid_entrypoint.sh
        - argocd-repo-server
        - --redis
        - argocd-redis:6379

to

      containers:
      - command:
        - uid_entrypoint.sh
        - argocd-repo-server
        - --redis
        - argocd-redis:6379
        - --tlsmaxversion
        - "1.2"

(the value for argocd-redis:6379 can differ depending on your setup)
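For context, capping the maximum TLS version on a Go listener comes down to setting MaxVersion in the tls.Config, which is presumably roughly what the --tlsmaxversion flag translates to under the hood. The following is a sketch of that mechanism only, with placeholder certificate paths, not Argo CD's actual flag handling:

// Sketch: limiting a Go TLS listener to TLS 1.2, which is roughly what a
// "--tlsmaxversion 1.2" style flag would translate to under the hood.
package main

import (
	"crypto/tls"
	"log"
)

func main() {
	// Placeholder certificate paths; the repo server normally generates its
	// own self-signed pair at startup.
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		log.Fatal(err)
	}

	cfg := &tls.Config{
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
		MaxVersion:   tls.VersionTLS12, // cap the handshake at TLS 1.2
	}

	lis, err := tls.Listen("tcp", ":8081", cfg)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("TLS listener capped at TLS 1.2 on", lis.Addr())
	// ... the gRPC server would then serve on lis as usual ...
}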

We haven't faced the issue again. Here are the details about the cluster that had the problem:

  • EKS 1.18
  • Argocd v1.8.1+c2547dc
  • 6 Nodes
  • 1 controller + 1 repo server + 1 server
  • ~16 apps
  • arguments (all defaults except for controller):
            - argocd-application-controller
            - --app-resync
            - "600"
            - --status-processors
            - "55"
            - --operation-processors
            - "30"
            - --kubectl-parallelism-limit
            - "70"

This is a very small cluster, used for QA of app versions.

Our main cluster has between 70-100 nodes and 2500-3500 apps, and we haven't faced TLS issues with it, even though it has 10 repo servers.

@jannfis, I think our upgrade was:

-  _ARGO_PROJECT: argocd@sha256:3e300c3c421ef15278daf662fc125d75050dc0816570cfbe08295a734a342314
+  _ARGO_PROJECT: argocd@sha256:26e1632e83e7f72ac7e6c6361dbd83ae52e8ad06cc1c266a1917054a0ae03f12

Sadly, I'm not sure I can actually convert those SHAs into anything useful.
