I am running the latest linkerd edge release, 19.1.2, and I am getting this error:
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
I just deployed the latest version, nothing more.
Output of linkerd logs --control-plane-component controller:
linkerd linkerd-controller-7bc49fd77f-lwt8q linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
Output of linkerd logs --control-plane-component controller -c proxy-api:
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T13:54:55Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 15:57:34.899318 1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3417120 (3420499)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:25:43Z" level=info msg="Establishing watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:32:18Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 17:49:54.531144 1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3437967 (3439015)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T21:32:21Z" level=info msg="Establishing watch on endpoint linkerd-prometheus.linkerd:9090"
linkerd check output:

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ control plane namespace exists
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ can query the control plane API
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus

linkerd-service-profile
-----------------------
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date

Status check results are √
Any chance that you have some minimal YAML manifests that we can use to reproduce the issue? I am not seeing the errors on EKS 1.11.5. Is this a new installation of edge 19.1.2 or an upgrade from an older version of Linkerd2? Also, do you know if this happens on other EKS clusters too, or just one particular cluster?
I am sorry I can't share our code. We have only one EKS cluster with Linkerd.
However, I can do testing on our side.
I have upgraded the linkerd stack at each new edge release. I can delete the linkerd stack and try a fresh installation. Let me know.
I just deleted the linkerd namespace, reinstalled, and I am getting the following errors.
I don't see this 172.20.0.1:443: connect: connection refused exception from other apps.
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:01Z" level=info msg="running version edge-19.1.2"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:01Z" level=info msg="waiting for caches to sync"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.556917 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://172.20.0.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.557836 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://172.20.0.1:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.557934 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.20.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.558011 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://172.20.0.1:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.558411 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://172.20.0.1:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="caches synced"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="starting gRPC server on 127.0.0.1:8088"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="starting admin server on :9998"
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error
So after letting my EKS cluster run overnight, my linkerd proxies in the linkerd and application namespaces are flooded with the WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("linkerd-grafana.linkerd.svc.cluster.local")), port: 3000 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" }) error. However, my application is still up and running.
FWIW, I was using t3.small workers and I did see an OOM error on one of my nodes, causing the application to be unavailable. But the node recovered. I assume you are using bigger machines than mine?
The connect: connection refused errors you saw might just be startup errors. I saw them on GKE and Minikube too when the control plane was first deployed. And yes, I am definitely seeing many error fetching profile warnings on other platforms too. But I don't think they are related to your problem.
Are all your meshed services unreachable?
Hi,
I was able to reach the service from the ingress, but the service was not able to call an external service. By external, I mean another service running on Kubernetes. However, I was able to call this external service from another service that was not meshed by linkerd.
I will retry later today or tomorrow with a fresh install of linkerd, so we can figure out why we are getting all these errors.
It seems at least possible that this issue is related to https://github.com/linkerd/linkerd2/issues/2111. I think it probably makes sense for us to spend some more time on #2111 first to see if that bug could explain the behavior reported in this issue.
Interesting. We have grpc service also. I will try to mesh only the services that are not using grpc to see if I am still getting the errors
1) I injected linkerd into a service and I am getting the same error in the linkerd-proxy, which is:
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
2) Inside the injected container, I am trying to call another service and I am getting the following error
webmaster@cs-ch-gateway-adapter:/app$ curl http://cs-ch-domain-manager-v1:8080/service/security/clients -v
* Trying 172.20.1.80...
* TCP_NODELAY set
* Connected to cs-ch-domain-manager-v1 (172.20.1.80) port 8080 (#0)
> GET /service/security/clients HTTP/1.1
> Host: cs-ch-domain-manager-v1:8080
> User-Agent: curl/7.62.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< content-length: 0
< date: Wed, 23 Jan 2019 00:07:22 GMT
<
* Connection #0 to host cs-ch-domain-manager-v1 left intact
The 500 error is not returned by the service; it is returned by linkerd-proxy.
3) If I call the same service through the external Route 53 DNS name, I get a response.
4) Other services can communicate with the linkerd-injected service, so only outbound connections are affected.
This does _not_ appear to be related to issue #2111; that issue was caused by a client sending requests that were not standards-compliant.
Ok, let me know if I can help on anything to fix this issue
@jmirc just to clarify, you only began seeing this error _after_ upgrading to edge-19.1.2? What version were you on previously?
Yes. I didn't have this issue with the previous version, which was edge-19.1.1.
I will try reinstalling edge-19.1.1 this evening to see whether I get the issue.
Thanks @jmirc! That narrows the search space considerably.
I can confirm that the warnings in the linkerd-proxy container started appearing as of the edge-19.1.2 release.
If I install the edge-19.1.1 release and open the dashboard, in the controller's linkerd-proxy logs, I see:
$ kubectl -n linkerd logs deploy/linkerd-controller linkerd-proxy --timestamps
2019-01-23T20:43:46.8655514Z INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("localhost.")), port: 8086 }))
2019-01-23T20:43:46.8656894Z INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
2019-01-23T20:43:46.8670232Z INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
2019-01-23T20:43:46.8670578Z INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
2019-01-23T20:43:46.8670856Z INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
2019-01-23T20:43:46.8672108Z INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
...
Whereas if I install the edge-19.1.2 release and open the dashboard, in the controller's linkerd-proxy logs, I see:
$ kubectl -n linkerd logs deploy/linkerd-controller linkerd-proxy --timestamps
2019-01-23T21:02:39.7247423Z INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("localhost.")), port: 8086 }))
2019-01-23T21:02:39.724812Z INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
2019-01-23T21:02:39.7248405Z INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
2019-01-23T21:02:39.7248669Z INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
2019-01-23T21:02:39.7248925Z INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
2019-01-23T21:02:39.7249283Z INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
...
2019-01-23T21:03:01.3305407Z WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:07.338784Z WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:11.4398961Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:11.4554226Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:12.0799139Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-controller-api.linkerd.svc.cluster.local:8085: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
I can confirm too. I just installed the previous version (edge-19.1.1) and everything works.
INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("linkerd-proxy-api.linkerd.svc.cluster.local")), port: 8086 }))
INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
Excellent, thanks again for confirming that for me. I have some theories about possible causes of this issue and I'm looking into them.
Interesting, I've been testing with a fresh linkerd install, and I'm thus far only seeing the
error fetching profile for ...: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
log line, not the
Destination.Get stream errored for NameAddr { ... }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
error. I believe the "error fetching profile" warning is relatively benign, and only the failure to get destinations should result in the proxy returning 500 errors.
@hawkw To date with the latest edge version, I've only seen the Destination.Get stream errored for NameAddr error in EKS, not GKE or Minikube, after leaving the app running for a few hours. But even with the errors, the emojivoto application seems to work just fine.
@ihcsim _Iiiiiiinteresting!_ I've been testing exclusively in GKE, but I note that @jmirc is also running in EKS. Issue #2115, which AFAICT is just the timeout error, was reported on Azure.
To add, I've also seen the following (in Azure, starting with edge-19.1.2; edge-19.1.1 was fine):
Destination.Get stream errored for NameAddr { ... }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
My thought is that the error fetching profile warning is benign, but the ingress controller starts returning 500s once Destination.Get stream errored shows up in the logs. The only way to recover is to restart the ingress controller pods.
To note: we only have HTTP REST services, no gRPC.
Thanks @jon-walton, that confirms what I've been seeing.
The issue isn't specific to gRPC services, as the proxy itself uses gRPC to talk to the control plane's service discovery API.
FWIW, I've installed edge 19.1.2 on an Azure AKS cluster and am seeing the same kinds of errors.
Some notes:
- Installed in the linkerd namespace about a week ago, using the linkerd install | kubectl apply -f - command
Highlights from the controller logs:
- linkerd-proxy: WARN linkerd2_proxy::app::profiles error fetching profile for some-service-endpoint
- public-api: E0124 17:53:25.043216 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod (and v1.AnythingElse)
Full result of linkerd logs --control-plane-component controller is in the attached file.
I did a cleanup as thorough as I could think of:
k delete ns linkerd
k delete clusterrolebinding linkerd-linkerd-controller
k delete clusterrolebinding linkerd-linkerd-prometheus
k delete clusterrole linkerd-linkerd-controller
k delete clusterrole linkerd-linkerd-prometheus
k delete crd serviceprofiles.linkerd.io
And installed edge-19.1.2 again (linkerd install | kubectl apply -f -)
Now I'm not seeing the "error fetching profile" errors anymore, just the "Failed to list *v1.Pod/Deployment/..." errors.
I've just tried with stable-2.1.0, and I see the same errors:
k logs linkerd-controller-55cf8bd769-pfbw6 -n linkerd -c public-api
time="2019-01-24T18:35:50Z" level=info msg="running version stable-2.1.0"
time="2019-01-24T18:35:50Z" level=info msg="starting admin server on :9995"
time="2019-01-24T18:35:50Z" level=info msg="starting HTTP server on :8085"
time="2019-01-24T18:35:50Z" level=info msg="waiting for caches to sync"
E0124 18:35:50.653890 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.653890 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.653958 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.654014 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.654092 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662692 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662692 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662747 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662792 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.663056 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669449 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669456 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669506 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669549 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669558 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677299 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677299 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677385 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677439 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677690 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:54.991160 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: EOF
time="2019-01-24T18:35:56Z" level=info msg="caches synced"
@bourquep Thanks for the additional info. I just wanted to chime in and say that those "connection refused" messages that appear prior to the "caches synced" message are (unfortunately) expected. They're a result of the public-api trying to query the kubernetes API before the linkerd-proxy container in the same pod is ready to serve requests. They eventually succeed if you see the "caches synced" message.
For more context, we use k8s.io/client-go to query the kubernetes API, and that package uses glog to log errors when the API is unreachable, before retrying. We would be better off suppressing all of the glog logs, but we have to redirect them to stderr, due to all of the reasons mentioned in kubernetes/kubernetes#61006. Kubernetes recently swapped out glog with its own fork (called klog) that is apparently more configurable. So it's possible that by updating to a more recent version of client-go we could suppress those messages, but we haven't gotten around to it yet.
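As a quick sanity check for this startup race, you can filter a captured controller log for the two relevant markers: if the last match is "caches synced", the earlier "connection refused" lines resolved themselves. This is a hedged sketch against a synthetic sample log (the file path and its contents are illustrative stand-ins for real controller output, not actual linkerd logs):

```shell
# Write a tiny sample log standing in for real controller output
# (contents are illustrative, modeled on the lines quoted above).
cat > /tmp/controller-sample.log <<'EOF'
E0124 18:35:50 reflector.go:205 Failed to list *v1.Pod: connect: connection refused
E0124 18:35:51 reflector.go:205 Failed to list *v1.Service: connect: connection refused
time="2019-01-24T18:35:56Z" level=info msg="caches synced"
EOF

# If the final matching line is "caches synced", the early "connection
# refused" errors were just the startup race and can be ignored.
grep -E 'connection refused|caches synced' /tmp/controller-sample.log | tail -n 1
```

In a real cluster you would run the same grep over the output of linkerd logs --control-plane-component controller instead of the sample file.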
Aahh, thanks for clarifying that. :)
@bourquep Similar to @klingerf's previous reply, those profiles error fetching profile for some-service-endpoint messages coming from linkerd-proxy are also benign. They indicate that a service profile has not been installed for a service, which is default behavior unless you have explicitly added service profiles. We're working on a fix to make those log lines look less like errors.
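For reference, a service profile is just a ServiceProfile custom resource named after the service's FQDN. The sketch below is a minimal, hedged example only: the apiVersion matches the serviceprofiles.linkerd.io CRD of this era, the service name is taken from the warnings quoted in this thread, and the empty routes list is illustrative (linkerd install-sp, used later in this thread, generates fuller profiles for the control-plane components):

```yaml
# Minimal ServiceProfile sketch; applying even an empty profile for a
# service is what stops the "error fetching profile" warning for it.
apiVersion: linkerd.io/v1alpha1
kind: ServiceProfile
metadata:
  # Name must be the service's fully qualified domain name.
  name: linkerd-prometheus.linkerd.svc.cluster.local
  namespace: linkerd
spec:
  routes: []
```

This would be applied with kubectl apply -f, the same way as the output of linkerd install-sp.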
A little more info on the error fetching profile errors. I observed these running locally with edge-19.1.2, and then they went away once I installed service profiles for the control plane...
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd install-sp | kubectl apply -f -
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
We've narrowed the _error fetching profile_ warning down to a recent controller change. My understanding is that this warning should not indicate any functional problem for traffic; however, I believe folks have reported seeing communication fail in this situation. Is that true?
The errors related to the Destination.Get stream errored for NameAddr are more concerning, as this implies an error is occurring in an unrelated API path, though.
The edge-19.1.3 release that just went out contains a few changes that should be relevant here: first, we've fixed the issue which was causing the (benign) error fetching profile ... log spam, and second (more importantly), we've improved the diagnostics for gRPC errors. It should hopefully be much easier now to track down where the Destination.Get error is coming from on the new version.
I have 45 minutes before my son's hockey game starts, installing now! :)
thanks @bourquep ❤️
Hey, I'm in the middle of nowhere, nothing else to do. :)
https://www.google.com/maps/@46.5172089,-72.7554464,12z
@bourquep not too far from me. I am in Montreal ;) I am starting to test this version
@jmirc I'm on the South Shore of Montreal, but at a hockey tournament in Shawi :)
I finished my test and everything works as expected. No more errors in the log of linkerd-proxy.
My service is able to connect to the external service.
This new version has fixed all the problems I had previously.
@hawkw sounds like we can close this one out?
@grampelberg I'd prefer to hear back from @bourquep and @jon-walton before closing this, to confirm that the issue has been resolved for everyone affected.
So far so good on my side!
FWIW, I spun up an AKS test cluster with edge-19.1.3 last night, and after 17 hours, I haven't seen any failures either.
Yep, 19.1.3 solved all the issues I was having with 19.1.2, awesome :+1:
That's great to hear! :D