Linkerd2: Linkerd fails during node outage

Created on 26 Jun 2020  ·  23 Comments  ·  Source: linkerd/linkerd2

Bug Report

What is the issue?

During an outage of a node where some of the Linkerd components ("linkerd-destination" and "linkerd-identity") were running, Linkerd appears to keep sending traffic to pods on the failed node. In my cluster, Linkerd is installed in HA mode.

How can it be reproduced?

  1. Install Linkerd in HA mode.
  2. Mesh two apps, where app A talks to app B or vice versa (see the sketch after this list).
  3. Stop the node where one of the Linkerd components, "linkerd-destination", was running. (The goal is to simulate the node outage that happened to us and led us to this bug.) The pod will wait for 5 minutes, as per the eviction timeout, and get rescheduled onto a new node after that.
  4. Randomly, one or two of the app pods will fail to make calls to the other app.
  5. The issue happens only when nodes are shut down ungracefully.
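
A minimal sketch of steps 1 and 2, assuming the `linkerd` CLI is installed and two hypothetical deployments named `app-a` and `app-b` that call each other:

```
# Step 1: install Linkerd in HA mode
linkerd install --ha | kubectl apply -f -

# Step 2: mesh the two example apps by injecting the proxy into their pod templates
# ("app-a" and "app-b" are placeholder deployment names)
kubectl get deploy app-a app-b -o yaml | linkerd inject - | kubectl apply -f -
```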

linkerd-proxy logs

2020-06-18T20:48:30.790810742Z [ 14263.548533981s] WARN outbound:accept{peer.addr=172.21.14.116:54590}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
2020-06-18T20:48:30.798501272Z [ 14263.555872710s] WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
2020-06-18T20:48:31.317647919Z [ 14264.75155158s] WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:31.317687819Z [ 14264.75275658s] WARN outbound:accept{peer.addr=172.21.14.116:54590}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:32.322513181Z [ 14265.80191420s] WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast

```
2020-06-18T22:36:56.281636047Z [ 110.160896474s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:56.785530039Z [ 110.664675566s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.288315626Z [ 111.167593054s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.79501223Z [ 111.674226057s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
```

`linkerd check` output

```
kubernetes-api
--------------                                                                      
√ can initialize the client                                                         
√ can query the Kubernetes API                                                      

kubernetes-version                                                                  
------------------                                                                  
√ is running the minimum Kubernetes API version                                     
√ is running the minimum kubectl version                                            

linkerd-existence                                                                   
-----------------                                                                   
√ 'linkerd-config' config map exists                                                
√ heartbeat ServiceAccount exist                                                    
√ control plane replica sets are ready                                              
√ no unschedulable pods                                                             
√ controller pod is running                                                         
√ can initialize the client                                                         
√ can query the control plane API                                                   

linkerd-config                                                                      
--------------                                                                      
√ control plane Namespace exists                                                    
√ control plane ClusterRoles exist                                                  
√ control plane ClusterRoleBindings exist                                           
√ control plane ServiceAccounts exist                                               
√ control plane CustomResourceDefinitions exist                                     
√ control plane MutatingWebhookConfigurations exist                                 
√ control plane ValidatingWebhookConfigurations exist                               
√ control plane PodSecurityPolicies exist                                           

linkerd-identity                                                                    
----------------                                                                    
√ certificate config is valid                                                       
√ trust roots are using supported crypto algorithm                                  
√ trust roots are within their validity period                                      
√ trust roots are valid for at least 60 days                                        
√ issuer cert is using supported crypto algorithm                                   
√ issuer cert is within its validity period                                         
√ issuer cert is valid for at least 60 days                                         
√ issuer cert is issued by the trust root                                           

linkerd-api                                                                         
-----------                                                                         
√ control plane pods are ready                                                      
√ control plane self-check                                                          
√ [kubernetes] control plane can talk to Kubernetes                                 
√ [prometheus] control plane can talk to Prometheus                                 
√ tap api service is running                                                        

linkerd-version                                                                     
---------------                                                                     
√ can determine the latest version                                                  
‼ cli is up-to-date                                                                 
    is running version 2.7.1 but the latest stable version is 2.8.1                 
    see https://linkerd.io/checks/#l5d-version-cli for hints                        

control-plane-version                                                               
---------------------                                                               
‼ control plane is up-to-date                                                       
    is running version 2.7.1 but the latest stable version is 2.8.1                 
    see https://linkerd.io/checks/#l5d-version-control for hints                    
√ control plane and cli versions match                                              

linkerd-ha-checks                                                                   
-----------------                                                                   
√ pod injection disabled on kube-system                                             

Status check results are √
```

Environment

  • Kubernetes Version: 1.16.9
  • Cluster Environment: AKS
  • Linkerd version: stable-2.7.1

Possible solution

Additional context

There is a similar open GitHub issue; I am not 100 percent sure they are the same.


All 23 comments

@Abrishges IIRC from our slack conversation, some of your injected application pods were not able to communicate with each other, even though some destination and identity replicas were still up, right? IOW, a partial control plane outage caused some Linkerd proxies to stop working. What happened after all the destination and identity pods recovered from the outage? Did the injected application pods recover eventually, or not at all? Thanks!

@ihcsim Yes. Even though I had two destination and two identity replicas running, some of the application pods with proxies failed to communicate with each other. After all the destination and identity pods recovered from the outage, in most cases the application pods took 15 to 30 minutes to recover. In some cases I had to redeploy or restart the application to recover faster.

@Abrishges I tried reproducing this issue with the steps described in https://github.com/linkerd/linkerd2/issues/4624#issuecomment-651644201, except that I did an ungraceful shutdown of the node. I could not see this happening. Can you give us some tips on trying to reproduce this?

@zaharidichev From the steps described in your test, it looks like you drained the nodes gracefully. If you are using the Azure portal to stop the node, that is a graceful shutdown and might not cause errors, since the Linkerd pods will be drained properly and rescheduled onto a new node. To simulate a sudden node outage like the one that happened to us, we used the following command in AKS:
`Stop-AzVmss -ResourceGroupName rgname -VMScaleSetName vmssname -InstanceId id -Force -SkipShutdown -StayProvisioned`

@Abrishges this is very helpful. Thanks a bunch. Will try that out. I indeed shut down the node through the UI.

We are also seeing these issues in Azure when some nodes become "unschedulable" temporarily, and then linkerd fails and starts to return 502 errors with the following logs:

linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast

We encountered this issue again in PROD today (requests failing with 403 for about 30 minutes while a node was unreachable).

We are running Linkerd Edge 20.5.1 in HA mode (3 replicas) on Kubernetes 1.16.9 in Azure.

I want to explain our affected scenario in detail:

  1. Node X became unreachable for 30 minutes.
  2. 503s were being returned from 2 ingress controllers (out of 5), running on nodes A and B, for about 30 minutes.
  3. From the affected ingress controllers' linkerd-proxy logs, we found the following trace during the affected 30 minutes, until they recovered:
    [1384577.280085623s] INFO outbound:accept{peer.addr=10.248.8.133:46170}:source{target.addr=10.248.6.109:80}:logical{addr=MYSERVICE.svc.cluster.local:80}:profile: linkerd2_service_profiles::client: Using default service profile after timeout
    [1384579.925284208s] WARN outbound:accept{peer.addr=10.248.8.133:39868}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
    [1384579.761528070s] WARN outbound:accept{peer.addr=10.248.8.133:46170}:source{target.addr=10.248.6.109:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
    [1384604.237339863s] WARN outbound:accept{peer.addr=10.248.8.133:46504}:source{target.addr=10.248.6.109:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1384619.9701770s] WARN outbound:accept{peer.addr=10.248.8.133:38718}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1384625.46743122s] WARN outbound:accept{peer.addr=10.248.8.133:40464}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1384650.925862452s] WARN outbound:accept{peer.addr=10.248.8.133:47060}:source{target.addr=10.248.6.109:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1384658.511683637s] WARN outbound:accept{peer.addr=10.248.8.133:38718}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1384677.479735169s] WARN outbound:accept{peer.addr=10.248.8.133:40464}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1384786.372390312s] INFO outbound:accept{peer.addr=10.248.8.133:48802}:source{target.addr=10.248.6.109:80}:logical{addr=MYSERVICE.svc.cluster.local:80}:profile: linkerd2_service_profiles::client: Using default service profile after timeout
    [1384788.924501476s] WARN outbound:accept{peer.addr=10.248.8.133:42500}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
    [1384788.824801543s] WARN outbound:accept{peer.addr=10.248.8.133:48802}:source{target.addr=10.248.6.109:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
    [1385089.202606373s] INFO outbound:accept{peer.addr=10.248.8.133:44670}:source{target.addr=10.248.14.143:80}:logical{addr=MYSERVICE.svc.cluster.local:80}:profile: linkerd2_service_profiles::client: Using default service profile after timeout
    [1385091.677859631s] WARN outbound:accept{peer.addr=10.248.8.133:44670}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
    [1385330.475957221s] INFO outbound:accept{peer.addr=10.248.8.133:47650}:source{target.addr=10.248.14.143:80}:logical{addr=MYSERVICE.cluster.local:80}:profile: linkerd2_service_profiles::client: Using default service profile after timeout
    [1385332.953015592s] WARN outbound:accept{peer.addr=10.248.8.133:47650}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
    [1385333.846909781s] WARN outbound:accept{peer.addr=10.248.8.133:49356}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385366.140589852s] WARN outbound:accept{peer.addr=10.248.8.133:48116}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385368.655594396s] WARN outbound:accept{peer.addr=10.248.8.133:49792}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385401.205876500s] WARN outbound:accept{peer.addr=10.248.8.133:56524}:source{target.addr=10.248.6.109:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385403.464294439s] WARN outbound:accept{peer.addr=10.248.8.133:48612}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385436.127600293s] WARN outbound:accept{peer.addr=10.248.8.133:49016}:source{target.addr=10.248.14.143:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385471.628446455s] WARN outbound:accept{peer.addr=10.248.8.133:51120}:source{target.addr=10.248.13.243:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [1385528.903974442s] WARN dst: linkerd2_reconnect::service: Service failed error=channel closed
    [1385528.903891537s] WARN outbound:accept{peer.addr=10.248.8.133:47650}:source{target.addr=10.248.14.143:80}:logical{addr=MYSERVICE.svc.cluster.local:80}:profile:daemon:poll_profile: linkerd2_service_profiles::client: Could not fetch profile error=grpc-status: Unknown, grpc-message: "connection error: broken pipe"
  4. Target services were running on nodes A, B, and C (not on node X, which became unreachable).
  5. Out of all the Linkerd services, we found that only the linkerd-destination pod was running on the unreachable node X during this time (a sketch of one way to check this follows the list).
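
One way to see which Linkerd control-plane pods were scheduled on the unreachable node (assuming Linkerd is installed in the default `linkerd` namespace; `<node-x>` is a placeholder for the node name):

```
# List the control-plane pods scheduled on the failed node
kubectl get pods -n linkerd -o wide --field-selector spec.nodeName=<node-x>
```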

Can we please get some help on this issue? It randomly affects our SLA in a region where node outages seem to be frequent.

@ericsuhong We are working on this at the moment. There is an open PR for the proxy that hopefully resolves this issue. If you are keen to help out and test this fix, I can post some instructions on how to do it.

@zaharidichev Thanks a lot for taking the time to work on this issue. I am available to test the change; feel free to share the instructions with me.

@zaharidichev Can you upload the proxy image somewhere for testing?

@Abrishges Once the image is available, you can annotate your workload pod templates with the `config.linkerd.io/proxy-image` and `config.linkerd.io/proxy-version` annotations. Then, when restarted, the new pods will be injected with the new proxy.
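
For example, a hedged sketch of setting those annotations on a hypothetical deployment named `my-app` (the image and tag values below are placeholders for whatever test build gets published):

```
# Add the proxy override annotations to the deployment's pod template.
# "my-app", "REGISTRY/proxy" and "TEST_TAG" are placeholders.
kubectl patch deployment my-app -p '{
  "spec": {
    "template": {
      "metadata": {
        "annotations": {
          "config.linkerd.io/proxy-image": "REGISTRY/proxy",
          "config.linkerd.io/proxy-version": "TEST_TAG"
        }
      }
    }
  }
}'
# Changing the pod template triggers a rolling restart, so the new pods
# come up injected with the overridden proxy.
```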

@zaharidichev We are also interested in testing this fix, if it becomes available as an edge version.

Okay @ericsuhong, @Abrishges, in order to test this:

  1. Grab my control plane branch: https://github.com/linkerd/linkerd2/tree/zd/headless
  2. Run `bin/build-cli-bin`
  3. Run `bin/linkerd install --ha --proxy-version=cplb | kubectl apply -f -` (add whatever options you need for your use case)

All the images you need are pushed, so you will be good to go. Now you can manually fail selected nodes, etc., and verify that this fixes your problem. I myself have a server and a client application running, and I failed some of the other nodes that have linkerd-identity running on them. Before this fix, I would observe what you are describing. The changes fix this problem for me. As always, the more logs the better :)
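
If it helps, one way to confirm which proxy image a meshed pod actually came up with (the pod and namespace names are placeholders):

```
# Print the image of the linkerd-proxy sidecar container in a given pod
kubectl get pod <app-pod> -n <app-namespace> \
  -o jsonpath='{.spec.containers[?(@.name=="linkerd-proxy")].image}'
```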

@zaharidichev @ihcsim We tested by following your instructions and killed selected nodes where some of the Linkerd components ("linkerd-destination" and "linkerd-identity") were running. The change is good; we checked multiple services and haven't seen any issues.

Environment used for testing

Kubernetes Version: 1.16.10
Cluster Environment: AKS

@zaharidichev Thanks a bunch. This is a really critical change as we are approaching our GA.

@zaharidichev I wanted to follow up on this bug to see when it will be integrated into a release branch. @Abrishges and I are hoping it gets pulled in soon. Can you let us know what we should expect date wise?

@jawhites We're hoping to merge this in time for this week's edge release.

All of the proxy changes have merged, but we hit some issues while testing the required control plane changes (discussed in https://github.com/linkerd/linkerd2/issues/4912). I expect these changes to make it into next week's edge release.

@jawhites @Abrishges You can pick the latest edge release and try it out to verify that this problem has been fixed. Thanks a bunch for your help testing and diagnosing this problem.

Thanks @zaharidichev , we will start testing

@Abrishges Any updates on this one ?

Thank you @zaharidichev. We rolled back due to pods getting stuck in a crash loop after the upgrade. Opened up issue #4973

@Abrishges As far as I can tell from our conversations, this particular issue is now fixed. Is that correct?

@zaharidichev Yes, this issue is fixed. I simulated a node outage a couple of times and the issue didn't happen again.
