Linkerd2: 503 Service Unavailable error when trying to access an HTTP/gRPC headless service through a DNS SRV record

Created on 6 Dec 2019 · 3 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

I have a simple gRPC client/server setup:

kubectl -n dev-sherba get pods -o wide | grep grpc
grpc-client-5b4c588f84-dv42q   3/3     Running            15         15h   10.100.7.24      minion-2   <none>           <none>
grpc-server-6dfb8b468c-56sq7   2/2     Running            0          40h   10.100.7.136     minion-2   <none>           <none>
grpc-server-6dfb8b468c-92hqq   2/2     Running            0          15h   10.100.94.109   minion-1   <none>           <none>
grpc-server-6dfb8b468c-hsbfx   2/2     Running            0          15h   10.100.7.23      minion-2   <none>           <none>
grpc-server-6dfb8b468c-qdzts   2/2     Running            0          15h   10.100.238.21    minion-5   <none>           <none>
grpc-server-6dfb8b468c-xktdd   2/2     Running            0          15h   10.100.197.63    minion-3   <none>           <none>

kubectl get svc -o wide | grep grpc 
grpc-server         ClusterIP      None           <none>                                          8000/TCP             41s     app=grpc-server

Historically, our client application uses a DNS SRV record to form a service access string of the form "host:port", so neither a host nor a port variable has to be provided explicitly. This scheme works fine with a ClusterIP service, because its SRV record looks like:

;; ANSWER SECTION:
grpc-server.dev-sherba.svc.cluster.local. 5 IN SRV 0 100 8000 grpc-server.dev-sherba.svc.cluster.local.
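
Roughly, the address-forming logic looks like this Go sketch (a simplified stand-in for our actual client code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Look the name up directly (empty service/proto means no
	// _service._proto prefix). With a ClusterIP service this returns a
	// single SRV record whose target is the service's own FQDN; with a
	// headless service it returns one record per pod.
	_, srvs, err := net.LookupSRV("", "", "grpc-server.dev-sherba.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	// Form the "host:port" access string from the first record.
	addr := fmt.Sprintf("%s:%d", srvs[0].Target, srvs[0].Port)
	fmt.Println(addr)
}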

But some of our services are headless, so there is a separate DNS SRV record per pod, each looking like:

;; ANSWER SECTION:
10-100-197-63.grpc-server.dev-sherba.svc.cluster.local. 5 IN SRV 0 100 5300 10-100-197-63.grpc-server.dev-sherba.svc.cluster.local.

In that case our client picks just one SRV record at random to access the service, but the linkerd proxy seems to have trouble with this record format, and the following error appears:

/grpc-client test
FATAL: 2019/12/06 05:51:54 fail to dial: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field
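
A minimal Go client along these lines (a sketch, not our actual code) reproduces the failure when given the per-pod target from the record above:

package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Dialing the per-pod SRV target fails with the 503 through the
	// linkerd proxy; dialing grpc-server.dev-sherba.svc.cluster.local:8000
	// works fine.
	conn, err := grpc.Dial(
		"10-100-197-63.grpc-server.dev-sherba.svc.cluster.local:5300",
		grpc.WithInsecure(), // plaintext from app to sidecar
		grpc.WithBlock(),    // surface connection errors at dial time
	)
	if err != nil {
		log.Fatalf("fail to dial: %v", err)
	}
	defer conn.Close()
}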

At the same time, I observe the following logs in the linkerd-destination pod:

kubectl logs -n linkerd linkerd-destination-6c5fff7f56-mqxj2 destination -f --tail=10000 | grep grpc
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on service dev-sherba/grpc-server" addr=":8086" component=traffic-split-watcher
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on profile dev-sherba/grpc-server.dev-sherba.svc.cluster.local" addr=":8086" component=profile-watcher
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on profile dev-sherba/grpc-server.dev-sherba.svc.cluster.local" addr=":8086" component=profile-watcher
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on endpoint [10-100-197-63.dev-sherba/grpc-server:5300]" addr=":8086" component=endpoints-watcher

The last line looks weird to me: the namespace has somehow been transformed from dev-sherba to 10-100-197-63.dev-sherba, as if the per-pod hostname label were being folded into the namespace when the destination controller parses the FQDN.

How can it be reproduced?

  1. Create a namespace with a linkerd.io/inject: enabled annotation.
  2. Run a kuard pod (as an example web service)
kubectl run --restart=Never --image=gcr.io/kuar-demo/kuard-amd64:blue kuard
  3. Create a simple headless service for the kuard pod, e.g.:
apiVersion: v1
kind: Service
metadata:
  name: kuard-service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: kuard
  sessionAffinity: None
  type: ClusterIP
  clusterIP: None
  4. Run a curl pod:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
  5. Run nslookup kuard-service; you will get something similar to:
nslookup kuard-service
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kuard-service
Address 1: 172.17.0.18 172-17-0-18.kuard-service.default.svc.cluster.local
  6. Curl kuard-service through both http://kuard-service:8080 and
     http://172-17-0-18.kuard-service.default.svc.cluster.local:8080 (see the example commands after this list).

  7. Grep the linkerd-destination logs for "kuard-service".
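
For concreteness, steps 6 and 7 amount to the following (run the curls from inside the curl pod's shell; the pod hostname and the linkerd-destination pod name will differ in your cluster):

curl http://kuard-service:8080
curl http://172-17-0-18.kuard-service.default.svc.cluster.local:8080
kubectl logs -n linkerd <linkerd-destination-pod> destination | grep kuard-service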

Logs, error output, etc


linkerd check output

linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.14.1
  • Cluster Environment: bare-metal (VMware + Kubespray)
  • Host OS: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-50-generic x86_64)
  • Linkerd version: Client version: stable-2.6.0 / Server version: stable-2.6.0

Possible solution

Additional context

Labels: area/controller, bug, needs/repro, priority/P1, wontfix

All 3 comments

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

:(

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
