The Problem:
The 'rabbitmq-ha' container in pods created by the StatefulSet after a helm upgrade cannot resolve localhost, so it fails its readiness check and never comes up. (Note that the initial deployment always works fine and those containers CAN resolve localhost.)
Here is a single replica deployment after changing the image from 3.8.1-alpine to 3.8.2-alpine in values.yaml, running a helm upgrade and deleting the old pod:
kubectl -n sharedservices-rabbitmq3 get all
NAME READY STATUS RESTARTS AGE
pod/rabbitmq-ha3-0 1/2 Running 5 15m
kubectl -n sharedservices-rabbitmq3 describe pod/rabbitmq-ha3-0
Warning Unhealthy 5m40s (x104 over 15m) kubelet, juju-9afcf0-11 Readiness probe failed: wget: can't connect to remote host: Connection refused
Warning Unhealthy 41s (x30 over 13m) kubelet, juju-9afcf0-11 Liveness probe failed: wget: can't connect to remote host: Connection refused
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -- nslookup localhost
Defaulting container name to rabbitmq-ha.
Use 'kubectl describe pod/rabbitmq-ha3-0 -n sharedservices-rabbitmq3' to see all of the containers in this pod.
Server: 10.152.183.124
Address: 10.152.183.124:53
* server can't find localhost.cluster.local: NXDOMAIN
* server can't find localhost.cluster.local: NXDOMAIN
* server can't find localhost.svc.cluster.local: NXDOMAIN
* server can't find localhost.sharedservices-rabbitmq3.svc.cluster.local: NXDOMAIN
* server can't find localhost.sharedservices-rabbitmq3.svc.cluster.local: NXDOMAIN
* server can't find localhost.svc.cluster.local: NXDOMAIN
command terminated with exit code 1
It appears that the new containers are not consulting their own /etc/hosts file, which has the correct entries (same as the original chart deployment):
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -- cat /etc/hosts
Defaulting container name to rabbitmq-ha.
Use 'kubectl describe pod/rabbitmq-ha3-0 -n sharedservices-rabbitmq3' to see all of the containers in this pod.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.1.9.46 rabbitmq-ha3-0.rabbitmq-ha3-discovery.sharedservices-rabbitmq3.svc.cluster.local rabbitmq-ha3-0
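Worth noting that busybox's nslookup queries the cluster DNS server directly rather than going through the libc resolver, so the NXDOMAIN output above won't reflect /etc/hosts either way; checking the pod's resolver configuration alongside it helps show where lookups are actually going. A quick check, using the container and namespace names from this report:
# Inspect the resolver configuration the container uses (nameserver, search domains, ndots)
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- cat /etc/resolv.conf
# Compare with a libc-level lookup; on Alpine/musl this should still honour /etc/hosts
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- ping -c 1 localhost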
The readiness probe is based on localhost, from values.yaml:
wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"
So this is the cause of the failure. I can bypass it by changing the probe command to use 127.0.0.1 instead of localhost, but since the container cannot resolve localhost, or even its own name from its hosts file, we cannot move forward with confidence that the chart will work with this bypass.
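For reference, a minimal sketch of that bypass as a values.yaml override. It mirrors the chart's default readinessProbe quoted later in this thread, swapping only localhost for 127.0.0.1 (the matching livenessProbe override is analogous):
readinessProbe:
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6
  exec:
    command:
      - /bin/sh
      - -c
      # identical to the chart default except for the probe host
      - 'timeout 3 wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://127.0.0.1:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"'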
Version of Helm and Kubernetes:
Tested on two versions of kube and helm:
Cluster1:
kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-12T13:43:46Z", GoVersion:"go1.13.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-12T13:42:10Z", GoVersion:"go1.13.7", Compiler:"gc", Platform:"linux/amd64"}
helm version
version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}
Cluster2:
kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-12T13:43:46Z", GoVersion:"go1.13.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.4", GitCommit:"224be7bdce5a9dd0c2fd0d46b83865648e2fe0ba", GitTreeState:"clean", BuildDate:"2019-12-16T16:32:41Z", GoVersion:"go1.12.14", Compiler:"gc", Platform:"linux/amd64"}
helm version
Client: &version.Version{SemVer:"v2.16.3", GitCommit:"1ee0254c86d4ed6887327dabed7aa7da29d7eb0d", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}
Which chart:
stable/rabbitmq-ha
What happened:
Deployed the chart successfully, then upgraded the chart to use a new alpine image. The upgrade always fails because the 'rabbitmq-ha' container in the pods cannot resolve localhost after the pod has been deleted and recreated.
What you expected to happen:
The new 'rabbitmq-ha' containers should be able to resolve localhost and pass their readiness probe/health checks and come up successfully.
How to reproduce it (as minimally and precisely as possible):
helm install --name rabbitmq-ha3 --namespace sharedservices-rabbitmq3 -f ./values.yaml stable/rabbitmq-ha
Wait for it to come up and stabilise.
Edit values.yaml with a new image and upgrade, e.g.
image:
  repository: rabbitmq
  tag: 3.8.2-alpine
  pullPolicy: IfNotPresent
helm upgrade rabbitmq-ha3 --namespace sharedservices-rabbitmq3 -f values.yaml stable/rabbitmq-ha
kubectl -n sharedservices-rabbitmq3 delete pod rabbitmq-ha3-0
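The failure then shows up on the recreated pod as in the output at the top of this report:
# The rabbitmq-ha container stays 1/2 ready and keeps restarting
kubectl -n sharedservices-rabbitmq3 get pods -w
# Events show the readiness/liveness probes failing with 'Connection refused'
kubectl -n sharedservices-rabbitmq3 describe pod rabbitmq-ha3-0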
Anything else we need to know:
I have tested other charts to see if it's something particular to my kube deployments, but I cannot recreate this problem with them. Also, I have only used the alpine images specified in the chart.
Try enabling forceBoot.
forceBoot makes no difference. Same result.
refer to this https://github.com/helm/charts/issues/14893, maybe it's helpful.
Are you setting the erlang cookie? We've found that explicitly setting it makes the system more stable for upgrades.
Yes, the Erlang cookie is set in the helm chart. I am going to try deleting the PVCs as part of the upgrade process, as suggested by yaphetsglhf referencing #14893, and see if that works.
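For anyone following along, this is roughly the procedure (a sketch only; the PVC name assumes the chart's data-<pod> volume claim template, so check kubectl get pvc first):
# List the claims created by the StatefulSet's volume claim template
kubectl -n sharedservices-rabbitmq3 get pvc
# Remove the claim for the pod being replaced (highest ordinal first), then delete the pod
kubectl -n sharedservices-rabbitmq3 delete pvc data-rabbitmq-ha3-0
kubectl -n sharedservices-rabbitmq3 delete pod rabbitmq-ha3-0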
Deleting the PVC for the pod to be updated (in reverse ordinal order) in the StatefulSet does not work in this case. The new pods still fail their readiness probe.
Having the same issue since 3.8.2 upgrade.
If we use 127.0.0.1 it works:
wget -O - -q --header "Authorization: Basic `echo -n "$RABBITMQ_DEFAULT_USER:$RABBITMQ_DEFAULT_PASS" | base64`" http://127.0.0.1:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"
As we can still ping localhost from the CLI, I tried a curl version and this works too!
curl -H "Authorization: Basicecho -n "$RABBITMQ_DEFAULT_USER:$RABBITMQ_DEFAULT_PASS"| base64" http://localhost:15672/api/healthchecks/node
Why wouldn't this work?
wget -O - --header "Authorization: Basic `echo -n "$RABBITMQ_DEFAULT_USER:$RABBITMQ_DEFAULT_PASS" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"
Connecting to localhost:15672 ([::1]:15672)
wget: can't connect to remote host: Connection refused
This is one out of eight clusters we have that is suffering from this issue. Note this is a fresh install of rabbit with new PVs.
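One possible explanation (a guess, not verified here): the wget output above shows localhost resolving to ::1 first, so if the management listener is only bound on IPv4 the IPv6 connection is refused, while curl falls back to 127.0.0.1. Checking the listeners inside the container would confirm it (pod and namespace names taken from the original report, adjust as needed):
# Show whether 15672 is listening on :: as well as 0.0.0.0 (busybox netstat)
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- netstat -tln | grep 15672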
I have the same issue. Any luck solving it without deleting the PVC?
Hello.
I installed RabbitMQ-ha from chart rabbitmq-ha-1.41.1, app version 3.8.0.
The rabbit cluster is failing its liveness probe.
The liveness probe goes to http://127.0.0.1:15672/api/healthchecks/node
When I exec into the container and run netstat -ntpl, I see that nothing is listening on port 15672:
/ $ netstat -ntpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:4369 0.0.0.0:* LISTEN 113/epmd
tcp 0 0 :::4369 :::* LISTEN 113/epmd
I think that the problem is here.
I solved my case.
The problem was too-small resource limits (100 millicores).
Once I raised the limit to 1 CPU, the rabbit cluster started.
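For anyone hitting the same symptom, that fix translates to something like this in values.yaml (assuming the chart's standard resources block; the 1 CPU figure is simply what the comment above reports as working):
resources:
  requests:
    cpu: 1
  limits:
    cpu: 1   # was 100m; the management listener never came up under that limit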
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Hello all, forceBoot solves most of the issues in my case too. However, when I do the helm upgrade after enabling the Prometheus exporter, the pods aren't recreated. Is this expected?
Found that this was caused by bad default health checks in the helm chart, which made it extremely unstable.
I re-wrote their health checks based on https://www.rabbitmq.com/monitoring.html#health-checks
Add this to your values.yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - rabbitmq-diagnostics -q check_port_connectivity
  failureThreshold: 6
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - rabbitmq-diagnostics -q check_virtual_hosts
  failureThreshold: 6
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
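If you want to sanity-check these commands before rolling them out, both can be run by hand inside an existing broker pod (pod, namespace and container names taken from the report above; adjust to yours):
# Dry-run the proposed liveness check
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- rabbitmq-diagnostics -q check_port_connectivity
# Dry-run the proposed readiness check
kubectl -n sharedservices-rabbitmq3 exec rabbitmq-ha3-0 -c rabbitmq-ha -- rabbitmq-diagnostics -q check_virtual_hosts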
@tschirmer You mean instead of the liveness & readiness checks below?
livenessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
  exec:
    command:
      - /bin/sh
      - -c
      - 'timeout 5 wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"'
readinessProbe:
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6
  exec:
    command:
      - /bin/sh
      - -c
      - 'timeout 3 wget -O - -q --header "Authorization: Basic `echo -n \"$RABBIT_MANAGEMENT_USER:$RABBIT_MANAGEMENT_PASSWORD\" | base64`" http://localhost:15672/api/healthchecks/node | grep -qF "{\"status\":\"ok\"}"'
@rakeshnambiar The values.yaml allows changing the readiness and liveness probes, so you shouldn't need to change the main helm chart (unless someone wants to turn it into a pull request).
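With those overrides saved in your values.yaml, applying them is just another upgrade against the release (release and namespace names as used earlier in this thread):
helm upgrade rabbitmq-ha3 --namespace sharedservices-rabbitmq3 -f values.yaml stable/rabbitmq-ha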
@tschirmer
Thank you for sharing.
Now that about 3 months have passed, have your liveness & readiness checks turned out to be good in production?
We are considering changing them as well, as the included ones have big issues.
@AndrewBedscastle we haven't had issues with our checks over the last few months.