When injecting a workload with many replicas, many of the replicas appear to be uninjected during the admission phase. We have reason to believe this is caused by the default 30-second timeout that the k8s API server imposes on the webhook call. At present, 30 seconds is the maximum allowed value. (See https://github.com/kubernetes/kubernetes/issues/60914.)
We need to investigate:
To reproduce:
Relevant API server logs:
W0830 17:53:24.316506 1 dispatcher.go:136] Failed calling webhook, failing open linkerd-proxy-injector.linkerd.io: failed calling webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0830 17:53:24.316536 1 dispatcher.go:137] failed calling webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Related: #3290.
An additional issue is that the kube-apiserver does not load-balance requests among webhook replicas, and instead gets pegged to just one. This was reported in kubernetes/kubernetes#75791 and fixed in kubernetes/kubernetes#82090. I verified that it is indeed fixed in the upcoming k8s 1.16 release: I could concurrently inject 1000 pod replicas using 10 injector replicas without hitting the 30s timeout.
In kubernetes/kubernetes#75791 they suggest a temporary workaround for earlier versions, to pass the flag --enable-aggregator-routing to the kube-apiserver, but that didn't do the trick for me.
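For reference, on clusters where the API server runs as a static pod (e.g. kubeadm), the flag goes on the kube-apiserver command line. This is a hypothetical excerpt, not taken from my cluster:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm-style static pod)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Route aggregated-API (and webhook) traffic to endpoint IPs instead of
    # the service cluster IP; this is the workaround suggested in the issue,
    # though it didn't help in my case:
    - --enable-aggregator-routing=true
```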
It turns out the proxy injector is indeed getting blocked under high load, as these log entries attest:
time="2019-09-02T18:17:49Z" level=info msg="I0902 18:17:49.451406 1 request.go:530] Throttling request took 6.550489319s, request: GET:https://10.96.0.1:443/apis/apps/v1beta2/namespaces/emojivoto/replicasets/emoji-7f57d64956"
time="2019-09-02T18:17:49Z" level=info msg="I0902 18:17:49.649595 1 request.go:530] Throttling request took 6.701759578s, request: GET:https://10.96.0.1:443/apis/apps/v1beta2/namespaces/emojivoto/replicasets/emoji-7f57d64956"
time="2019-09-02T18:17:49Z" level=info msg="I0902 18:17:49.848109 1 request.go:530] Throttling request took 6.851681464s, request: GET:https://10.96.0.1:443/apis/apps/v1beta2/namespaces/emojivoto/replicasets/emoji-7f57d64956"
This blocking happens because, some time ago, in order to retrieve a pod's deployment we started hitting the api-server directly instead of going through the cached sharedinformer API, since that cache might not have had enough time to learn about the existence of the pod's owner. Hitting the api-server directly means we use client-go's default maxQPS and maxBurst limits, which are 5 and 10 respectively and are quickly exhausted under heavy load. I tested using the sharedinformer again, and the timeout problem disappears. So we might want to use different defaults for those limits when using --ha, or add that info to the runbook; a sketch of overriding them is shown below.
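For illustration, here's a minimal sketch of raising those client-go limits when building the client. The function name and the 100/200 values are mine, placeholders rather than tuned recommendations:

```go
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newUnthrottledClient builds an in-cluster client with higher client-side
// rate limits than client-go's defaults (QPS 5, Burst 10), which are what
// the Throttling log lines above are hitting.
func newUnthrottledClient() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	config.QPS = 100   // placeholder, not a tuned recommendation
	config.Burst = 200 // placeholder, not a tuned recommendation
	return kubernetes.NewForConfig(config)
}
```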
Another interesting problem I observed during all these concurrent injections is that event generation is also getting limited. By default, the event queue is 25 items long, and the default behavior is to drop events when the queue gets full. I could indeed verify that: I only saw 25 injection events raised when injecting those 1000 pods. Another tweak we can document for prod.
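If we decide to make this tunable, client-go exposes correlator options on the event broadcaster. A hedged sketch, assuming the 25-event limit above corresponds to the correlator's burst size (the function name and value are mine):

```go
package client

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/record"
)

// newRecorder builds an event recorder whose correlator burst size is
// raised, so bursts of injection events are less likely to be dropped.
func newRecorder() record.EventRecorder {
	broadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
		BurstSize: 100, // illustrative; assumes the 25-event limit maps to this knob
	})
	return broadcaster.NewRecorder(scheme.Scheme,
		corev1.EventSource{Component: "linkerd-proxy-injector"})
}
```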
@alpeb Good find! Thanks for all the hard work on this one :+1:.
I don't think these configs should be HA-only or prod-only, because IMO the control plane installed by the linkerd install command should continue to work (even though it might be slow) when under heavy load. E.g., it's not uncommon to conduct stress and performance tests against a linkerd install control plane, replicating workloads to the hundreds, if not thousands.
For the first issue with the maxQPS and maxBurst defaults, will switching back to the sharedinformer, with retries if the pod's owner doesn't exist yet, be a viable solution? (I can't remember what we did before.) My concern with assigning them non-default values is that we might end up guessing at those values, without any guarantee that they will work for all users.
As for the event queue, how about logging only the skipped and rejected events (with reason)? IMO the successfully injected events aren't very interesting. The CLI and dashboard already expose that information.
> will switching back to the sharedinformer, with retries if the pod's owner doesn't exist yet, be a viable solution?
We did at some point use the sharedinformer, falling back to a direct API call if the parent workload hadn't been synced yet (#2737), but that got reverted in #2972. I believe this scenario gives us a reason to reimplement what we had in #2737; a sketch of that lookup pattern follows.
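For concreteness, a minimal sketch of that cache-first, API-fallback lookup, written against the current client-go API; the getReplicaSet helper and its exact shape are mine, not the code from #2737:

```go
package client

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	appslisters "k8s.io/client-go/listers/apps/v1"
)

// getReplicaSet looks up a pod's owner in the sharedinformer cache first
// (no API round-trip, no client-side throttling) and only falls back to a
// direct api-server GET when the cache hasn't seen the object yet.
func getReplicaSet(ctx context.Context, lister appslisters.ReplicaSetLister,
	api kubernetes.Interface, ns, name string) (*appsv1.ReplicaSet, error) {
	rs, err := lister.ReplicaSets(ns).Get(name)
	if err == nil {
		return rs, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// Cache miss: the informer may simply not have synced yet.
	return api.AppsV1().ReplicaSets(ns).Get(ctx, name, metav1.GetOptions{})
}
```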
> As for the event queue, how about logging only the skipped and rejected events (with reason)? IMO the successfully injected events aren't very interesting. The CLI and dashboard already expose that information.
If the namespace and the workload aren't configured for injection, we'll also get those failure events dropped. Let me dig a bit more here and see what the options are.
We can leave this issue just for tackling the timeout (rolling back the revert of #2737).
There's another issue I found while testing this: when the kube-apiserver call times out, the request on the proxy-injector keeps on running and then logs the injection even though it will be ignored by the kube-apiserver. I've created #3373 to track that bug.
Finally, it turns out the events are not really being dropped, but held hidden in a cache indefinitely. I've created #3374 to track that.
Looks like we faced this issue, and from the symptoms I see, we also reached the events limit.
Right now I'm looking at 85 ReplicaSets that failed to create pods during one night. Their DESIRED count is greater than 1 while CURRENT and RUNNING are zero, and kubectl describe replicaset shows Events: <none>. At the same time we got multiple alerts from the monitoring system, which say:
Events from the app_name-684564bb68 ReplicaSet 1 FailedCreate: Error creating: Internal error occurred: failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
From that I can assume the first events were pushed, but at some point the rest of the ReplicaSets just got stuck without any events. Since I don't really understand the internal details, maybe I'm wrong here and just found some buzzwords in this thread that made me believe my issue is the same.
My issue may also be related to a problem I had before (https://github.com/linkerd/linkerd2/issues/3414), when a PEM cert wasn't accepted, because some of the events within the alerts for specific ReplicaSets say:
x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "linkerd-proxy-injector.linkerd.svc")
Most of the replicas that failed had been created before I finally reinstalled Linkerd successfully using its Helm chart, and since the cert was changed, I can assume the old replicas were for some reason affected by this change. I will try to delete all the replicas and watch whether the new ones run into the same problem.
Update:
Several days later we are still getting the Client.Timeout exceeded while awaiting headers error.
I can't see any visible reason for it. At some point pods just began to disappear one by one, and in short order the vast majority of them were gone from the cluster. I would say the problem is very critical.
@alpeb , could you suggest any workarounds that can help to improve the situation?
The worst part is that not only can new pods not start, but the old ones disappear, which can lead to the death of the entire cluster at any random moment.
Error creating: Internal error occurred: failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The list of ReplicaSets looks like this (all of them were created several days ago):
NAMESPACE     NAME                            DESIRED   CURRENT   READY   AGE
kube-system   kube-state-metrics-645ffc58c9   1         0         0       3d4h
kube-system   tiller-deploy-5f8f5484cd        1         0         0       3d5h
kube-system   coredns-5494f6f4b9              2         1         1       3d4h
...           # a lot of them, including node-critical and cluster-critical
Is the ?timeout=30s configurable? Also, I found that Istio has the same problem. In this case L5d v1 looks like the only option for production usage.
@kivagant-ba The current issue tracks timeouts when trying to inject hundreds of pods at the same time. Even at that level, a portion of the pods should be injected without problems. You're seeing none, plus a certificate error, so it sounds to me like yours is a different issue. To answer your question: that timeout is hard-coded inside Kubernetes, so it can't be changed.
Would it be possible for you to jump in Slack #linkerd2 for a closer look?
From the Slack discussion with @ihcsim and @alpeb:
I am not 100% sure that what you are seeing is exactly the same issue. The issue we saw was that the deployment still succeeded; only a portion of the replicas didn't have the Linkerd proxy injected. But in your case, it looks like the deployment failed entirely, right?
Can you try changing the Linkerd mutating webhook configuration to set its failurePolicy to Ignore?
So I did
kubectl edit mutatingwebhookconfigurations linkerd-proxy-injector-webhook-config
kubectl rollout restart -n linkerd deploy/linkerd-proxy-injector
and changed the failurePolicy to Ignore. I'll come back with the results in a couple of days.
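For anyone following along, the edited webhook configuration ends up looking roughly like this (an illustrative fragment, not a dump from my cluster):

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config
webhooks:
- name: linkerd-proxy-injector.linkerd.io
  # With Ignore, an admission error (like the 30s timeout) no longer
  # blocks pod creation; the pod is simply created uninjected.
  failurePolicy: Ignore
```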
UP: I believe that in our case it must not be a Linkerd problem at all. It turned out that the Linkerd outages coincide with the AWS spot instance replacement chart. For some reason AWS periodically starts aggressively replacing spot instances, and does it so fast that our entire cluster can be rolled over more than twice in one day. It surprised us after we changed instance types from the m4 generation to m5, because with m4 we didn't feel the problem.
UP2:
To protect kube-system and other namespaces from the admission webhook:

1. Added a linkerd: enabled label to each namespace where I want to have Linkerd (a kubectl one-liner for this is shown after the list):

       kind: Namespace
       metadata:
         ...
         labels:
           linkerd: enabled

2. Added a namespaceSelector to the MutatingWebhookConfiguration to ignore namespaces that don't have the label:

       namespaceSelector:
         matchExpressions:
         - key: linkerd.io/is-control-plane
           operator: DoesNotExist
         - key: linkerd
           operator: In
           values:
           - enabled
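The label can also be applied without editing manifests, e.g. (the namespace name is a placeholder):

```sh
# Make an existing namespace match the webhook's namespaceSelector:
kubectl label namespace my-app linkerd=enabled
```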
I guess the linkerd.io/is-control-plane label could be used to achieve the same, but it isn't clear whether that label is involved (or will be involved) in some other logic in Linkerd.
I found that it's enough to restart one of the three linkerd pods, in parallel with some other pods in the cluster, to run into the issue:
linkerd-proxy-injector-775c97c8f5-nkv5l proxy-injector time="2019-09-17T14:54:17Z" level=info msg="patch generated for: pod/app-6b8cb67bb7-"
- linkerd-proxy-injector-775c97c8f5-nkv5l
- linkerd-proxy-injector-775c97c8f5-nkv5l
+ linkerd-proxy-injector-775c97c8f5-486fp › linkerd-proxy
+ linkerd-proxy-injector-775c97c8f5-486fp › proxy-injector
...
linkerd-proxy-injector-775c97c8f5-486fp proxy-injector time="2019-09-17T15:05:15Z" level=info msg="listening at :8443"
Events from web-service-8498cb775 ReplicaSet
1 FailedCreate: Error creating: Internal error occurred: failed calling admission webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
New events emitted by the replicaset-controller seen at 2019-09-17 15:05:43 +0000 UTC
Seems related to https://github.com/kubernetes/kubernetes/issues/75791
@kivagant-ba you might want to look into doing manual injection (inject --manual) for the time being. We've been learning that mutating webhook configurations (MWCs) are not nearly as reliable as we all thought...
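For example, manual injection patches the manifest client-side, so the webhook is out of the critical path entirely (the manifest file name here is a placeholder):

```sh
# Add the Linkerd proxy to the manifest itself, then apply it;
# no admission-time webhook call is needed for injection.
linkerd inject --manual app.yaml | kubectl apply -f -
```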