Hey
We have a customer running a Kubernetes cluster with version 1.10.11, and we tried to install Vault 1.4.0 via your vault-helm chart (0.5.0). Unfortunately we can't get it working, because the pods never get the labels documented here: https://www.vaultproject.io/docs/configuration/service-registration/kubernetes
It's not an RBAC issue, because we tested setting labels on pods manually with the service account that runs Vault, and there were no problems doing so. We even temporarily gave the service account admin rights in the namespace, which also didn't change a thing.
Unfortunately there is not much log information about why it fails (see below). The only symptom is that the labels for service registration are never patched onto the pods, so the Kubernetes service discovery doesn't work, which makes it impossible to initialize and use Vault with k8s 1.10.
We have tested the Chart 0.5.0 with Vault 1.4.0 on a microk8s 1.11 instance and it worked right away.
So I'm wondering: what might cause this to fail on Kubernetes 1.10? My only guess so far is that the Kubernetes Go client libraries handle some things differently and are no longer compatible with 1.10.
I would greatly appreciate any hints, ideas, or an explanation of why it doesn't work.
Environment:
Startup Log Output:
```text
[DEBUG] service_registration.kubernetes: "namespace": "vault-testing"
[DEBUG] service_registration.kubernetes: "pod_name": "vault-2"
[WARN] service_registration.kubernetes: unable to set initial state due to PATCH https://172.24.0.1:443/api/v1/namespaces/vault-testing/pods/vault-2 giving up after 11 attempts, will retry
```
**Logs when trying to initialize Vault:**
```text
Error initializing: context deadline exceeded
```
Expected Behavior:
Pods should get patched with the labels required for Kubernetes service discovery.
Actual Behavior:
No labels are set and therefore the service discovery mechanism isn't working.
Steps to Reproduce:
Use a k8s or microk8s cluster on version 1.10 and install the vault-helm chart in HA mode with Raft enabled.
Run `vault operator init`; it times out, and service discovery will not work. Verify that the labels were never applied: `kubectl get pod vault-0 -o jsonpath='{.metadata.labels}'`
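A minimal repro sketch, assuming Helm 3 syntax, a local checkout of the vault-helm 0.5.0 chart, and a release named `vault` (adjust names and namespace for your setup):

```shell
# Install the chart in HA mode with Raft, using the overrides from this report
helm install vault ./vault-helm \
  --namespace vault-testing \
  --set server.standalone.enabled=false \
  --set server.ha.enabled=true \
  --set server.ha.raft.enabled=true

# Initialization times out on k8s 1.10 because service registration never patches the pods
kubectl -n vault-testing exec -ti vault-0 -- vault operator init

# The service-registration labels are missing from the pod metadata
kubectl -n vault-testing get pod vault-0 -o jsonpath='{.metadata.labels}'
```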
Moving this to https://github.com/hashicorp/vault-helm/
I'm not sure if this is related to the chart itself, because there are no problems deploying it. All the resources get created correctly. It's only that, once everything is deployed, Vault can't start in HA mode because the service registration mechanism doesn't seem to be able to patch the pods. I would have guessed that this has something to do with PR #8249 in the vault repo. I guess the log entry I posted gets created here.
So either the requirements have changed and Kubernetes 1.11 is now a prerequisite, or there is something in the implementation of the Kubernetes service registration (see the PR above) that is no longer compatible with 1.10.
Excited to hear any news.
Thank you in advance
Oh my apologies. I misunderstood where the problem was. I'll transfer this back to the Vault repo.
Hi! Thanks for reporting this!
Can you give additional steps to reproduce? For instance, many folks edit the default values.yaml or provide values to override the defaults, as shown here. That will help us understand the config that's in use.
Thanks again!
I think this is a documentation bug. The documentation says to create a role allowing `verbs: ["get", "update"]`, but Vault actually uses the PATCH method. If you change your role to allow `verbs: ["get", "patch", "update"]` it should work; at least it did for me.
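For illustration, a role with all three verbs could look like the sketch below (the role name and namespace are placeholders, not taken from the chart):

```shell
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vault-service-discovery   # placeholder name
  namespace: vault-testing        # placeholder namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "patch", "update"]
EOF
```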
@tyrannosaurus-becks : Thanks a lot for your feedback.
The only things that we have changed in values.yaml are the following:
- `server.standalone.enabled=false`
- `server.ha.enabled=true`
- `server.ha.raft.enabled=true`

No additional configuration so far: the default 0.5.0 helm chart with the above changes, basically just Vault with Raft in HA mode.
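Expressed as a values override file, those changes correspond to something like this (a sketch; the filename and release name are arbitrary):

```shell
cat > overrides.yaml <<'EOF'
server:
  standalone:
    enabled: false
  ha:
    enabled: true
    raft:
      enabled: true
EOF
# then: helm install vault ./vault-helm -f overrides.yaml
```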
@bitfehler: Thanks for the input, but the role for the service discovery already has those verbs (https://github.com/hashicorp/vault-helm/blob/master/templates/server-discovery-role.yaml#L17) and it gets created correctly.
Hi there,
Getting the same error here: "service_registration.kubernetes: unable to set initial state due to PATCH https://172.20.0.1:443/api/v1/namespaces/vault/pods/vault-0 giving up after 11 attempts"
Running helm chart 0.5.0 with HA enabled (standalone is working fine) on EKS (Kubernetes 1.15 with auto scaling groups); the role on the nodes has the correct policies (https://www.vaultproject.io/docs/platform/k8s/helm/examples/enterprise-best-practice#walk-through).
The chart in HA mode is working fine on minikube locally.
Hi @lukpre
After some chats with my network team, it seems networking can be complicated with AWS and proxies, so we ended up adding hostNetwork: true to the StatefulSet for Vault. The label modification now seems to work, and I can unseal my pods.
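If anyone wants to try the same workaround: as far as I know the 0.5.0 chart doesn't expose a value for this, so one option is to patch the StatefulSet directly (the StatefulSet is assumed to be named `vault`; adjust the namespace to yours):

```shell
# Add hostNetwork to the Vault pod template; pods pick it up on the next rollout
kubectl -n vault patch statefulset vault \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
```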
@vvanghelle: Thank you for your feedback. We don't run k8s on AWS, and our version is 1.10, not 1.15, so I'm not sure that's the same problem. Also, as I mentioned, it works perfectly fine on k8s 1.11. But I will check whether that changes anything.
Just installed chart 0.6.0 with image vault:1.4.2 on a GKE cluster (1.14) and it's stalled forever on service_registration.kubernetes: unable to set initial state due to PATCH https://10.2.0.1:443/api/v1/namespaces/vault/pods/vault-0 giving up after 11 attempts, will retry.
All operator commands time out. I assume this is due to the missing labels.
Values:
```yaml
server:
  extraArgs: "-log-level=debug"
  standalone:
    enabled: false
  ha:
    enabled: true
    raft:
      enabled: true
```
The service account token inside the pods appears to work for GET /:namespace/pods/vault-0; I haven't tried PATCH yet. The debug log level doesn't seem to print anything additional from the service_registration logger.
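For reference, a PATCH can be attempted by hand from inside the pod with the mounted service account token, along these lines (assuming curl is available in the container; the label key/value in the merge-patch body are just examples):

```shell
# Standard in-cluster service account paths
SA=/var/run/secrets/kubernetes.io/serviceaccount

curl -sS --cacert "$SA/ca.crt" \
  -H "Authorization: Bearer $(cat $SA/token)" \
  -H "Content-Type: application/merge-patch+json" \
  -X PATCH \
  -d '{"metadata":{"labels":{"manual-test":"true"}}}' \
  https://kubernetes.default.svc/api/v1/namespaces/vault/pods/vault-0
```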
I tried deleting all pods so they get recreated in case there was a race condition and the default service account token was being used instead of the server-discovery-role bound token. This had no effect.
Kube server info:
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.36", GitCommit:"34a615f32e9a0c9e97cdb9f749adb392758349a6", GitTreeState:"clean", BuildDate:"2020-04-06T16:33:17Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Nodes are v1.13.11-gke.14.
I believe I've discovered the issue here, if only for GKE clusters.
It wasn't until increasing some logging that I was able to see _why_ the PATCH request was failing. The easiest way to do this was to provide the client.logger (an hclog.Logger) to the retryablehttp.Client, as seen in this diff:
```diff
diff --git a/serviceregistration/kubernetes/client/client.go b/serviceregistration/kubernetes/client/client.go
index 934d3bad9..44c3ca084 100644
--- a/serviceregistration/kubernetes/client/client.go
+++ b/serviceregistration/kubernetes/client/client.go
@@ -153,6 +153,7 @@ func (c *Client) do(req *http.Request, ptrToReturnObj interface{}) error {
     RetryWaitMin: RetryWaitMin,
     RetryWaitMax: RetryWaitMax,
     RetryMax:     RetryMax,
+    Logger:       c.logger,
     CheckRetry:   c.getCheckRetry(req),
     Backoff:      retryablehttp.DefaultBackoff,
   }
```
Instead of something a little cryptic, like
```text
[WARN] service_registration.kubernetes: unable to set initial state due to PATCH https://a.b.c.d:443/api/v1/namespaces/vault/pods/vault-0 giving up after 11 attempts, will retry
```
I was able to see
```text
[DEBUG] service_registration.kubernetes: retrying request: request="PATCH https://10.2.0.1:443/api/v1/namespaces/vault/pods/vault-0 (status: 504)" timeout=30s remaining=4
```
when vault's log level was set to debug.
Would a PR be accepted for this change?
It might appear that the client is timing out while connecting to the Kubernetes API, but it is actually the API that is returning a 504, with a JSON error response of:
```json
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Timeout: request did not complete within requested timeout 30s",
  "reason": "Timeout",
  "details": {},
  "code": 504
}
```
It turns out that _any_ PATCH to a pod in this namespace produces the same error, suggesting that some admission controller (or similar) was doing additional processing on the request and timing out while doing so.
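That is easy to confirm independently of Vault with a trivial label patch (the label key/value here are arbitrary); it should hang for the ~30s timeout and then fail with the 504 above:

```shell
kubectl -n vault patch pod vault-0 \
  -p '{"metadata":{"labels":{"webhook-test":"true"}}}'
```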
That led me to find these issues which had similar symptoms:
https://github.com/elastic/cloud-on-k8s/issues/1673
https://github.com/helm/charts/issues/16174
https://github.com/helm/charts/issues/16249#issuecomment-520795222
Common variable: a MutatingWebhookConfiguration.
The vault-helm chart does indeed install a MutatingWebhookConfiguration, as seen here, requesting a webhook for any UPDATE or CREATE on any pods in the namespace.
The Kubernetes API server is the one making the webhook request to a node (to the service within), and if a firewall rule does not allow traffic from the master to the node on tcp:8080, the master is unable to reach the service and times out. The PATCH request thus times out as a cascaded failure.
The solution for GKE Private Clusters is to add a firewall rule allowing this traffic (master CIDR -> nodes [by network tag] on tcp:8080) as described here.
I suspect a similar issue exists in other environments. Before adding a firewall rule, you can verify that this issue is affecting you by disabling the agent injector via injector.enabled=false in values, or by deleting the MutatingWebhookConfiguration named vault-agent-injector-cfg.
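Concretely, the check and the GKE fix look something like this; the firewall rule name, network, master CIDR, and node network tag are placeholders for your cluster's values, and the helm command assumes a local chart checkout:

```shell
# Quick check: take the injector webhook out of the request path and retry the PATCH
helm upgrade vault ./vault-helm --set injector.enabled=false
# ...or simply remove the webhook configuration
kubectl delete mutatingwebhookconfiguration vault-agent-injector-cfg

# GKE private cluster fix: let the master CIDR reach the injector port on the nodes
gcloud compute firewall-rules create allow-master-to-vault-injector \
  --network my-network \
  --source-ranges 172.16.0.0/28 \
  --target-tags my-node-tag \
  --allow tcp:8080
```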
> The solution for GKE Private Clusters is to add a firewall rule allowing this traffic (master CIDR -> nodes [by network tag] on tcp:9443) as described here.
I also had to add port 8080 to the firewall rule for the service_registration to work. Any ideas why this is necessary?
@mariusgiger Thanks for pointing that out. I think my original statement was incorrect, as the MutatingWebhook in question here is for the agent-injector-svc (the Vault Agent Injector service), whose target port is 8080.
So 8080 is the correct port to allow in the firewall rules. I'll update my comment. 9443 is used by some other helm chart services that accept mutating webhooks, so it may be useful to allow that as well.