Charts: redis-ha slave fails to find master

Created on 6 Oct 2017 · 8 comments · Source: helm/charts

Is this a request for help?:

Yes

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Version of Helm and Kubernetes:
Helm Client: latest
Kubernetes: latest

Which chart:
redis-ha

What happened:
The Redis cluster runs for a little while, but then the slave dies with the following logs:

Failed to find master.
Could not connect to Redis at -p:6379: Name or service not known

logs from one of the sentinels:

[28] 05 Oct 18:32:57.427 # Sentinel runid is 947f23661df9e5d120760df6b3fb21217a3f742e
[28] 05 Oct 18:32:57.427 # +monitor master mymaster 10.24.5.23 6379 quorum 2
[28] 05 Oct 18:32:58.432 * +sentinel sentinel 10.24.5.23:26379 10.24.5.23 26379 @ mymaster 10.24.5.23 6379
[28] 05 Oct 18:32:59.300 * +sentinel sentinel 10.24.4.18:26379 10.24.4.18 26379 @ mymaster 10.24.5.23 6379
[28] 05 Oct 18:33:09.570 * +sentinel sentinel 10.24.10.19:26379 10.24.10.19 26379 @ mymaster 10.24.5.23 6379
[28] 05 Oct 18:33:47.592 * +slave slave 10.24.8.21:6379 10.24.8.21 6379 @ mymaster 10.24.5.23 6379
[28] 05 Oct 18:38:58.828 * +slave slave 10.24.4.14:6379 10.24.4.14 6379 @ mymaster 10.24.5.23 6379
[28] 05 Oct 20:16:07.067 # +sdown slave 10.24.4.14:6379 10.24.4.14 6379 @ mymaster 10.24.5.23 6379
[28] 05 Oct 22:49:30.231 # +sdown slave 10.24.8.21:6379 10.24.8.21 6379 @ mymaster 10.24.5.23 6379

Meanwhile, the master is just chugging along without error, and my application is able to use the Redis cache.

The service is up just fine, exposing the master as the one node in the service:

Name:              ip-redis-ha
Namespace:         default
Labels:            app=redis-ha
                   chart=redis-ha-0.2.0
                   heritage=Tiller
                   name=redis-ha
                   release=ip
                   role=service
Annotations:       <none>
Selector:          chart=redis-ha-0.2.0,heritage=Tiller,redis-node=true,release=ip
Type:              ClusterIP
IP:                10.27.248.103
Port:              <unset>  6379/TCP
Endpoints:         10.24.5.23:6379
Session Affinity:  None
Events:            <none>

What you expected to happen:
The slave finds the master and stays up.

How to reproduce it (as minimally and precisely as possible):
I just ran: helm install stable/redis-ha

Anything else we need to know:

All 8 comments

This error:

    Failed to find master.
    Could not connect to Redis at -p:6379: Name or service not known

It looks like the REDIS_SENTINEL_SERVICE_HOST and REDIS_SENTINEL_SERVICE_PORT env vars were not populated when the pod was created. The host in the error is the literal string -p, which is what you typically get when REDIS_SENTINEL_SERVICE_HOST is empty: redis-cli then parses the -p flag as the hostname and falls back to the default port 6379.

There is probably a race condition between the creation of the redis-sentinel service and the deployment. Per the Kubernetes docs, there is no ordering guarantee between services and pods, so if the service doesn't already exist when a pod is created, the env vars won't be set in that pod.

Since the entrypoint script for the container image depends on those service env vars, one workaround is to pin the env vars to the name of the service in all dependent pod templates, like this:

env:
- name: REDIS_SENTINEL_SERVICE_HOST
  value: redis-sentinel
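
For completeness, here is a minimal sketch of the same idea with both variables pinned; the service name matches the snippet above and the port is Sentinel's default, but both are assumptions rather than values taken from the chart:

env:
# Hard-code the Sentinel service coordinates so the entrypoint script no longer
# depends on Kubernetes having created the service before this pod started.
- name: REDIS_SENTINEL_SERVICE_HOST
  value: redis-sentinel   # assumed service name, as in the snippet above
- name: REDIS_SENTINEL_SERVICE_PORT
  value: "26379"          # assumed Sentinel default port; quoted because env values must be strings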

"It looks like the REDIS_SENTINEL_SERVICE_HOST and REDIS_SENTINEL_SERVICE_PORT env vars were not populated when the pod was created."

Are these environment variables supposed to be populated by the Redis sentinel service? Who/what is responsible for doing this? Thanks.

@timothytierney those variables are populated by Kubernetes if and only if the service already exists when the pod is created.
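
For reference, Kubernetes derives these variable names from the service name (uppercased, with dashes becoming underscores), so a pod created after a redis-sentinel service exists should see something like the following inside the container (values shown as placeholders, not taken from this cluster):

    REDIS_SENTINEL_SERVICE_HOST=<cluster IP of the redis-sentinel service>
    REDIS_SENTINEL_SERVICE_PORT=<service port, e.g. 26379>

A pod created before the service exists gets neither variable, which is exactly the failure mode described above.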

Can we close this issue?

@smileisak did you remove the dependency on the svc variables?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
