The Problem:
The 'rabbitmq-ha' container in pods created by the StatefulSet after a helm upgrade cannot resolve localhost, so it fails its readiness check and never comes up. (Note that the initial deployment always works fine and those containers CAN resolve localhost.)
Here is a single replica deployment after changing the image from 3.8.1-alpine to 3.8.2-alpine in values.yaml, running a helm upgrade and deleting the old pod:
kubectl -n sharedservices-rabbitmq3 get all
NAME READY STATUS RESTARTS AGE
pod/rabbitmq-ha3-0 1/2 Running 5 15m
kubectl -n sharedservices-rabbitmq3 describe pod/rabbitmq-ha3-0
Warning Unhealthy 5m40s (x104 over 15m) kubelet, juju-9afcf0-11 Readiness probe failed: wget: can't connect to remote host: Connection refused
Warning Unhealthy 41s (x30 over 13m) kubelet, juju-9afcf0-11 Liveness probe failed: wget: can't connect to remote host: Connection refused
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -- nslookup localhost
Defaulting container name to rabbitmq-ha.
Use 'kubectl describe pod/rabbitmq-ha3-0 -n sharedservices-rabbitmq3' to see all of the containers in this pod.
Server: 10.152.183.124
Address: 10.152.183.124:53
* server can't find localhost.cluster.local: NXDOMAIN
* server can't find localhost.cluster.local: NXDOMAIN
* server can't find localhost.svc.cluster.local: NXDOMAIN
* server can't find localhost.sharedservices-rabbitmq3.svc.cluster.local: NXDOMAIN
* server can't find localhost.sharedservices-rabbitmq3.svc.cluster.local: NXDOMAIN
* server can't find localhost.svc.cluster.local: NXDOMAIN
command terminated with exit code 1
It appears that the new containers are not consulting their own /etc/hosts file, which has the correct entries (same as the original chart deployment):
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -- cat /etc/hosts
Defaulting container name to rabbitmq-ha.
Use 'kubectl describe pod/rabbitmq-ha3-0 -n sharedservices-rabbitmq3' to see all of the containers in this pod.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.1.9.46 rabbitmq-ha3-0.rabbitmq-ha3-discovery.sharedservices-rabbitmq3.svc.cluster.local rabbitmq-ha3-0
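Worth noting that busybox's nslookup queries the cluster DNS server directly rather than going through the libc resolver, so the NXDOMAIN output above won't reflect /etc/hosts either way; checking the pod's resolver configuration alongside it helps show where lookups are actually going. A quick check, using the container and namespace names from this report:
# Inspect the resolver configuration the container uses (nameserver, search domains, ndots)
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- cat /etc/resolv.conf
# Compare with a libc-level lookup; on Alpine/musl this should still honour /etc/hosts
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- ping -c 1 localhost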
The readiness probe is based on localhost, from values.yaml:
wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"
So this is the cause of the failure. I can bypass it by changing the probe command to use 127.0.0.1 instead of localhost, but since the container cannot resolve localhost, or even its own name from its hosts file, we cannot move forward with confidence that the chart will work with this bypass.
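For reference, a minimal sketch of that bypass as a values.yaml override. It mirrors the chart's default readinessProbe quoted later in this thread, swapping only localhost for 127.0.0.1 (the matching livenessProbe override is analogous):
readinessProbe:
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6
  exec:
    command:
      - /bin/sh
      - -c
      # identical to the chart default except for the probe host
      - 'timeout 3 wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://127.0.0.1:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"'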
Version of Helm and Kubernetes:
Tested on two versions of kube and helm:
Cluster1:
kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-12T13:43:46Z", GoVersion:"go1.13.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-12T13:42:10Z", GoVersion:"go1.13.7", Compiler:"gc", Platform:"linux/amd64"}
helm version
version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}
Cluster2:
kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-12T13:43:46Z", GoVersion:"go1.13.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.4", GitCommit:"224be7bdce5a9dd0c2fd0d46b83865648e2fe0ba", GitTreeState:"clean", BuildDate:"2019-12-16T16:32:41Z", GoVersion:"go1.12.14", Compiler:"gc", Platform:"linux/amd64"}
helm version
Client: &version.Version{SemVer:"v2.16.3", GitCommit:"1ee0254c86d4ed6887327dabed7aa7da29d7eb0d", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}
Which chart:
stable/rabbitmq-ha
What happened:
Deployed the chart successfully, then upgraded the chart to use a new alpine image. The upgrade always fails because the 'rabbitmq-ha' container in the pods cannot resolve localhost after the pod has been deleted and recreated.
What you expected to happen:
The new 'rabbitmq-ha' containers should be able to resolve localhost and pass their readiness probe/health checks and come up successfully.
How to reproduce it (as minimally and precisely as possible):
helm install --name rabbitmq-ha3 --namespace sharedservices-rabbitmq3 -f ./values.yaml stable/rabbitmq-ha
Wait for it to come up and stabilise.
Edit values.yaml with a new image and upgrade, e.g.
image:
  repository: rabbitmq
  tag: 3.8.2-alpine
  pullPolicy: IfNotPresent
helm upgrade rabbitmq-ha3 --namespace sharedservices-rabbitmq3 -f values.yaml stable/rabbitmq-ha
kubectl -n sharedservices-rabbitmq3 delete pod rabbitmq-ha3-0
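The failure then shows up on the recreated pod as in the output at the top of this report:
# The rabbitmq-ha container stays 1/2 ready and keeps restarting
kubectl -n sharedservices-rabbitmq3 get pods -w
# Events show the readiness/liveness probes failing with 'Connection refused'
kubectl -n sharedservices-rabbitmq3 describe pod rabbitmq-ha3-0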
Anything else we need to know:
I have tested other charts to see if it's something particular to my kube deployments, but I cannot recreate this problem with them. Also, I have only used the alpine images specified in the chart.
Try enabling forceBoot.
forceBoot makes no difference. Same result.
refer to this https://github.com/helm/charts/issues/14893, maybe it's helpful.
Are you setting the erlang cookie? We've found that explicitly setting it makes the system more stable for upgrades.
Yes, the Erlang cookie is set in the helm chart. I am going to try deleting the PVCs as part of the upgrade process, as suggested by yaphetsglhf referencing #14893, and see if that works.
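For anyone following along, this is roughly the procedure (a sketch only; the PVC name assumes the chart's data-<pod> volume claim template, so check kubectl get pvc first):
# List the claims created by the StatefulSet's volume claim template
kubectl -n sharedservices-rabbitmq3 get pvc
# Remove the claim for the pod being replaced (highest ordinal first), then delete the pod
kubectl -n sharedservices-rabbitmq3 delete pvc data-rabbitmq-ha3-0
kubectl -n sharedservices-rabbitmq3 delete pod rabbitmq-ha3-0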
Deleting the PVC for the pod to be updated (in reverse ordinal order) in the StatefulSet does not work in this case. The new pods still fail their readiness probe.
Having the same issue since 3.8.2 upgrade.
If we use 127.0.0.1 it works:
wget -O - -q --header "Authorization: Basic `echo -n "$RABBITMQ_DEFAULT_USER:$RABBITMQ_DEFAULT_PASS" | base64`" http://127.0.0.1:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"
As we can still ping localhost from the CLI, I tried a curl version and this works too!
curl -H "Authorization: Basicecho -n "$RABBITMQ_DEFAULT_USER:$RABBITMQ_DEFAULT_PASS"| base64" http://localhost:15672/api/healthchecks/node
Why wouldn't this work?
wget -O - --header "Authorization: Basic `echo -n "$RABBITMQ_DEFAULT_USER:$RABBITMQ_DEFAULT_PASS" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"
Connecting to localhost:15672 ([::1]:15672)
wget: can't connect to remote host: Connection refused
This is one out of eight clusters we have that is suffering from this issue. Note this is a fresh install of rabbit with new PVs.
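One possible explanation (a guess, not verified here): the wget output above shows localhost resolving to ::1 first, so if the management listener is only bound on IPv4 the IPv6 connection is refused, while curl falls back to 127.0.0.1. Checking the listeners inside the container would confirm it (pod and namespace names taken from the original report, adjust as needed):
# Show whether 15672 is listening on :: as well as 0.0.0.0 (busybox netstat)
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- netstat -tln | grep 15672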
I have the same issue. Any luck solving it without deleting the PVC?
Hello.
I installed RabbitMQ-ha from chart rabbitmq-ha-1.41.1, app version 3.8.0.
The rabbit cluster is failing its liveness probe.
The liveness probe goes to http://127.0.0.1:15672/api/healthchecks/node
When I exec into the container and run netstat -ntpl, I see that nothing is listening on port 15672:
/ $ netstat -ntpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:4369 0.0.0.0:* LISTEN 113/epmd
tcp 0 0 :::4369 :::* LISTEN 113/epmd
I think that the problem is here.
I solved my case.
The problem was too-small resource limits (100 millicores).
Once I raised the limit to 1 CPU, the rabbit cluster started.
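For anyone hitting the same symptom, that fix translates to something like this in values.yaml (assuming the chart's standard resources block; the 1 CPU figure is simply what the comment above reports as working):
resources:
  requests:
    cpu: 1
  limits:
    cpu: 1   # was 100m; the management listener never came up under that limit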
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Hello all, forceBoot solves most of the issues in my case too. However, when I do the helm upgrade after enabling the Prometheus exporter, the pods aren't recreated. Is this expected?
Found that this was caused by bad default health checks in the helm chart, which made it extremely unstable.
I re-wrote their health checks based on https://www.rabbitmq.com/monitoring.html#health-checks
Add this to your values.yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - rabbitmq-diagnostics -q check_port_connectivity
  failureThreshold: 6
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - rabbitmq-diagnostics -q check_virtual_hosts
  failureThreshold: 6
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
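If you want to sanity-check these commands before rolling them out, both can be run by hand inside an existing broker pod (pod, namespace and container names taken from the report above; adjust to yours):
# Dry-run the proposed liveness check
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- rabbitmq-diagnostics -q check_port_connectivity
# Dry-run the proposed readiness check
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- rabbitmq-diagnostics -q check_virtual_hosts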
@tschirmer You mean instead of the liveness & readiness checks below?
livenessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
  exec:
    command:
      - /bin/sh
      - -c
      - 'timeout 5 wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"'
readinessProbe:
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6
  exec:
    command:
      - /bin/sh
      - -c
      - 'timeout 3 wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"'
@rakeshnambiar The values.yaml allows changing the readiness and liveness probes, so you shouldn't need to change the main helm chart (unless someone wants to turn it into a pull request).
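With those overrides saved in your values.yaml, applying them is just another upgrade against the release (release and namespace names as used earlier in this thread):
helm upgrade rabbitmq-ha3 --namespace sharedservices-rabbitmq3 -f values.yaml stable/rabbitmq-ha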
@tschirmer
Thank you for sharing.
Now that about 3 months have passed, have your liveness & readiness checks turned out to be good in production?
We are considering changing them as well, as the included ones have big issues.
@AndrewBedscastle we haven't had issues with our checks over the last few months.