Linkerd2: 503 Service Unavailable error when trying to access an HTTP/gRPC headless service through a DNS SRV record

Created on 6 Dec 2019 · 3 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

I have a simple gRPC client/server setup:

kubectl -n dev-sherba get pods -o wide | grep grpc
grpc-client-5b4c588f84-dv42q   3/3     Running            15         15h   10.100.7.24      minion-2   <none>           <none>
grpc-server-6dfb8b468c-56sq7   2/2     Running            0          40h   10.100.7.136     minion-2   <none>           <none>
grpc-server-6dfb8b468c-92hqq   2/2     Running            0          15h   10.100.94.109   minion-1   <none>           <none>
grpc-server-6dfb8b468c-hsbfx   2/2     Running            0          15h   10.100.7.23      minion-2   <none>           <none>
grpc-server-6dfb8b468c-qdzts   2/2     Running            0          15h   10.100.238.21    minion-5   <none>           <none>
grpc-server-6dfb8b468c-xktdd   2/2     Running            0          15h   10.100.197.63    minion-3   <none>           <none>

kubectl get svc -o wide | grep grpc 
grpc-server         ClusterIP      None           <none>                                          8000/TCP             41s     app=grpc-server

Historically, our client application uses a DNS SRV record to form a service access string of the form "host:port", so neither a host nor a port variable has to be provided explicitly. This scheme works fine with a ClusterIP service, because its SRV record looks like:

;; ANSWER SECTION:
grpc-server.dev-sherba.svc.cluster.local. 5 IN SRV 0 100 8000 grpc-server.dev-sherba.svc.cluster.local.
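
Roughly, the address-forming logic looks like this Go sketch (a simplified stand-in for our actual client code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Look the name up directly (empty service/proto means no
	// _service._proto prefix). With a ClusterIP service this returns a
	// single SRV record whose target is the service's own FQDN; with a
	// headless service it returns one record per pod.
	_, srvs, err := net.LookupSRV("", "", "grpc-server.dev-sherba.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	// Form the "host:port" access string from the first record.
	addr := fmt.Sprintf("%s:%d", srvs[0].Target, srvs[0].Port)
	fmt.Println(addr)
}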

But some of our services are headless, so there is a separate DNS SRV record per pod, each looking like:

;; ANSWER SECTION:
10-100-197-63.grpc-server.dev-sherba.svc.cluster.local. 5 IN SRV 0 100 5300 10-100-197-63.grpc-server.dev-sherba.svc.cluster.local.

In that case our client picks just one SRV record at random to access the service, but the linkerd proxy seems to have trouble with this record format, and the following error appears:

/grpc-client test
FATAL: 2019/12/06 05:51:54 fail to dial: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field
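
A minimal Go client along these lines (a sketch, not our actual code) reproduces the failure when given the per-pod target from the record above:

package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Dialing the per-pod SRV target fails with the 503 through the
	// linkerd proxy; dialing grpc-server.dev-sherba.svc.cluster.local:8000
	// works fine.
	conn, err := grpc.Dial(
		"10-100-197-63.grpc-server.dev-sherba.svc.cluster.local:5300",
		grpc.WithInsecure(), // plaintext from app to sidecar
		grpc.WithBlock(),    // surface connection errors at dial time
	)
	if err != nil {
		log.Fatalf("fail to dial: %v", err)
	}
	defer conn.Close()
}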

At the same time, I observe the following logs in the linkerd-destination pod:

kubectl logs -n linkerd linkerd-destination-6c5fff7f56-mqxj2 destination -f --tail=10000 | grep grpc
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on service dev-sherba/grpc-server" addr=":8086" component=traffic-split-watcher
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on profile dev-sherba/grpc-server.dev-sherba.svc.cluster.local" addr=":8086" component=profile-watcher
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on profile dev-sherba/grpc-server.dev-sherba.svc.cluster.local" addr=":8086" component=profile-watcher
time="2019-12-06T05:11:51Z" level=info msg="Establishing watch on endpoint [10-100-197-63.dev-sherba/grpc-server:5300]" addr=":8086" component=endpoints-watcher

The last line looks weird to me: the namespace has somehow been transformed from dev-sherba to 10-100-197-63.dev-sherba, as if the per-pod hostname label were being folded into the namespace when the destination controller parses the FQDN.

How can it be reproduced?

  1. Create a namespace with a linkerd.io/inject: enabled annotation.
  2. Run a kuard pod (as an example web service)
kubectl run --restart=Never --image=gcr.io/kuar-demo/kuard-amd64:blue kuard
  3. Create a simple headless service for the kuard pod, e.g.:
apiVersion: v1
kind: Service
metadata:
  name: kuard-service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: kuard
  sessionAffinity: None
  type: ClusterIP
  clusterIP: None
  4. Run a curl pod:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
  5. Run nslookup kuard-service; you will get something similar to:
nslookup kuard-service
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kuard-service
Address 1: 172.17.0.18 172-17-0-18.kuard-service.default.svc.cluster.local
  6. Curl kuard-service through both http://kuard-service:8080 and
     http://172-17-0-18.kuard-service.default.svc.cluster.local:8080 (see the example commands after this list).

  7. Grep the linkerd-destination logs for "kuard-service".
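
For concreteness, steps 6 and 7 amount to the following (run the curls from inside the curl pod's shell; the pod hostname and the linkerd-destination pod name will differ in your cluster):

curl http://kuard-service:8080
curl http://172-17-0-18.kuard-service.default.svc.cluster.local:8080
kubectl logs -n linkerd <linkerd-destination-pod> destination | grep kuard-service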

Logs, error output, etc


linkerd check output

linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.14.1
  • Cluster Environment: bare-metal (VMware + Kubespray)
  • Host OS: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-50-generic x86_64)
  • Linkerd version: Client version: stable-2.6.0 / Server version: stable-2.6.0

Possible solution

Additional context

Labels: area/controller, bug, needs/repro, priority/P1, wontfix

All 3 comments

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

:(

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
