Charts: [incubator/druid] CrashLoopBackOff STATUS

Created on 15 Jun 2020 · 16 comments · Source: helm/charts

Describe the bug

$ kubectl get pods
NAME                                            READY   STATUS             RESTARTS   AGE
druid-1592218780-broker-6f45cf8c46-cxftz        0/1     Running            24         88m
druid-1592218780-coordinator-5ff65c5775-z9fpm   0/1     Running            24         88m
druid-1592218780-historical-0                   0/1     Pending            0          88m
druid-1592218780-middle-manager-0               0/1     Pending            0          88m
druid-1592218780-postgresql-0                   0/1     Pending            0          88m
druid-1592218780-router-7fd5957d84-h2glr        0/1     CrashLoopBackOff   25         88m
druid-1592218780-zookeeper-0                    0/1     Pending            0          88m

Version of Helm and Kubernetes:

$ helm version
version.BuildInfo{Version:"v3.2.3", GitCommit:"8f832046e258e2cb800894579b1b3b50c2d83492", GitTreeState:"clean", GoVersion:"go1.13.12"}

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Which chart:

https://hub.helm.sh/charts/incubator/druid/0.2.1

What happened:

CrashLoopBackOff STATUS

$ kubectl logs pod/druid-1592218780-broker-6f45cf8c46-cxftz

WARN [main-SendThread(druid-1592218780-zookeeper-headless:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server druid-1592218780-zookeeper-headless:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address druid-1592218780-zookeeper-headless:2181 because it's not resolvable
    at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) [zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]

What you expected to happen:

No errors appear and all pods start successfully.

How to reproduce it (as minimally and precisely as possible):

$ helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com
$ helm install incubator/druid --version 0.2.1 --generate-name

Anything else we need to know:

All 16 comments

@maver1ck @AWaterColorPen PTAL

@asdf2014
From my understanding, this is expected at first.
The broker is a Deployment workload with only one replica by default, while ZooKeeper is a StatefulSet workload with 3 replicas and PVCs, so ZooKeeper takes longer to become ready. Until ZooKeeper is ready, the Druid pods will keep failing their probes and restarting.

Could you provide the ZooKeeper logs? There should be 3 ZooKeeper pods.

@AWaterColorPen Thanks for taking a look. After ten hours, these pods are still in CrashLoopBackOff status. I tried to get logs from ZooKeeper with the kubectl logs pod/druid-1592218780-zookeeper-0 command, but nothing came out.

NAME                                                READY   STATUS             RESTARTS   AGE
pod/druid-1592218780-broker-6f45cf8c46-cxftz        0/1     Running            240        15h
pod/druid-1592218780-coordinator-5ff65c5775-z9fpm   0/1     CrashLoopBackOff   241        15h
pod/druid-1592218780-historical-0                   0/1     Pending            0          15h
pod/druid-1592218780-middle-manager-0               0/1     Pending            0          15h
pod/druid-1592218780-postgresql-0                   0/1     Pending            0          15h
pod/druid-1592218780-router-7fd5957d84-h2glr        0/1     CrashLoopBackOff   246        15h
pod/druid-1592218780-zookeeper-0                    0/1     Pending            0          15h

NAME                                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/druid-1592218780-broker                ClusterIP   10.108.103.170   <none>        8082/TCP                     15h
service/druid-1592218780-coordinator           ClusterIP   10.98.145.22     <none>        8081/TCP                     15h
service/druid-1592218780-historical            ClusterIP   10.102.117.61    <none>        8083/TCP                     15h
service/druid-1592218780-middle-manager        ClusterIP   10.110.130.0     <none>        8091/TCP                     15h
service/druid-1592218780-postgresql            ClusterIP   10.109.19.76     <none>        5432/TCP                     15h
service/druid-1592218780-postgresql-headless   ClusterIP   None             <none>        5432/TCP                     15h
service/druid-1592218780-router                ClusterIP   10.101.204.212   <none>        8888/TCP                     15h
service/druid-1592218780-zookeeper             ClusterIP   10.106.200.185   <none>        2181/TCP                     15h
service/druid-1592218780-zookeeper-headless    ClusterIP   None             <none>        2181/TCP,3888/TCP,2888/TCP   15h
service/kubernetes                             ClusterIP   10.96.0.1        <none>        443/TCP                      17h

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/druid-1592218780-broker        0/1     1            0           15h
deployment.apps/druid-1592218780-coordinator   0/1     1            0           15h
deployment.apps/druid-1592218780-router        0/1     1            0           15h

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/druid-1592218780-broker-6f45cf8c46        1         1         0       15h
replicaset.apps/druid-1592218780-coordinator-5ff65c5775   1         1         0       15h
replicaset.apps/druid-1592218780-router-7fd5957d84        1         1         0       15h

NAME                                               READY   AGE
statefulset.apps/druid-1592218780-historical       0/1     15h
statefulset.apps/druid-1592218780-middle-manager   0/1     15h
statefulset.apps/druid-1592218780-postgresql       0/1     15h
statefulset.apps/druid-1592218780-zookeeper        0/3     15h

I think the key point is that your ZooKeeper and PostgreSQL cannot become ready. You can try kubectl get event.

@AWaterColorPen The following is what I got. Any thoughts?

$ kubectl get event

LAST SEEN   TYPE      REASON             OBJECT                                                         MESSAGE
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-historical-0       no persistent volumes available for this claim and no storage class is set
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-middle-manager-0   no persistent volumes available for this claim and no storage class is set
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-postgresql-0       no persistent volumes available for this claim and no storage class is set
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-zookeeper-0        no persistent volumes available for this claim and no storage class is set
2m11s       Warning   Unhealthy          pod/druid-1592218780-broker-6f45cf8c46-cxftz                   Liveness probe failed: Get http://10.244.3.3:8082/status/health: dial tcp 10.244.3.3:8082: connect: connection refused
32m         Warning   Unhealthy          pod/druid-1592218780-broker-6f45cf8c46-cxftz                   Readiness probe failed: Get http://10.244.3.3:8082/status/health: dial tcp 10.244.3.3:8082: connect: connection refused
7m56s       Warning   BackOff            pod/druid-1592218780-broker-6f45cf8c46-cxftz                   Back-off restarting failed container
37m         Warning   Unhealthy          pod/druid-1592218780-coordinator-5ff65c5775-z9fpm              Liveness probe failed: Get http://10.244.5.4:8081/status/health: dial tcp 10.244.5.4:8081: connect: connection refused
12m         Warning   Unhealthy          pod/druid-1592218780-coordinator-5ff65c5775-z9fpm              Readiness probe failed: Get http://10.244.5.4:8081/status/health: dial tcp 10.244.5.4:8081: connect: connection refused
2m52s       Warning   BackOff            pod/druid-1592218780-coordinator-5ff65c5775-z9fpm              Back-off restarting failed container
2m52s       Warning   FailedScheduling   pod/druid-1592218780-historical-0                              running "VolumeBinding" filter plugin for pod "druid-1592218780-historical-0": pod has unbound immediate PersistentVolumeClaims
2m52s       Warning   FailedScheduling   pod/druid-1592218780-middle-manager-0                          running "VolumeBinding" filter plugin for pod "druid-1592218780-middle-manager-0": pod has unbound immediate PersistentVolumeClaims
2m52s       Warning   FailedScheduling   pod/druid-1592218780-postgresql-0                              running "VolumeBinding" filter plugin for pod "druid-1592218780-postgresql-0": pod has unbound immediate PersistentVolumeClaims
12m         Warning   Unhealthy          pod/druid-1592218780-router-7fd5957d84-h2glr                   Liveness probe failed: Get http://10.244.5.3:8888/status/health: dial tcp 10.244.5.3:8888: connect: connection refused
37m         Normal    Pulled             pod/druid-1592218780-router-7fd5957d84-h2glr                   Container image "apache/druid:0.18.0" already present on machine
2m58s       Warning   BackOff            pod/druid-1592218780-router-7fd5957d84-h2glr                   Back-off restarting failed container
2m52s       Warning   FailedScheduling   pod/druid-1592218780-zookeeper-0                               running "VolumeBinding" filter plugin for pod "druid-1592218780-zookeeper-0": pod has unbound immediate PersistentVolumeClaims

@asdf2014 As shown, it is a PVC issue. Please make sure your cluster has a default StorageClass configured.

2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-historical-0       no persistent volumes available for this claim and no storage class is set
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-middle-manager-0   no persistent volumes available for this claim and no storage class is set
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-postgresql-0       no persistent volumes available for this claim and no storage class is set
2m52s       Normal    FailedBinding      persistentvolumeclaim/data-druid-1592218780-zookeeper-0        no persistent volumes available for this claim and no storage class is set

You can check with kubectl get storageclass.
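For reference, an existing class can usually be promoted to the cluster default with an annotation (a minimal sketch; the class name "standard" is an assumption, substitute whatever kubectl get storageclass reports):

$ # mark an existing StorageClass as the default (class name is an assumption)
$ kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'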

@AWaterColorPen After creating a PV and re-creating the Druid PVCs, they are still in a Pending state because no persistent volumes are available. What did I miss? :sweat_smile:

$ kubectl get pv
NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                   STORAGECLASS   REASON   AGE
task-pv-volume   1Ti        RWO            Retain           Released   default/task-pv-claim   manual                  17m


$ kubectl get pvc
NAME                                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-druid-1592279879-historical-0       Pending                                                     20s
data-druid-1592279879-middle-manager-0   Pending                                                     20s
data-druid-1592279879-postgresql-0       Pending                                                     20s
data-druid-1592279879-zookeeper-0        Pending                                                     20s


$ kubectl describe pvc
Name:          data-druid-1592279879-historical-0
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=druid
               component=historical
               release=druid-1592279879
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    druid-1592279879-historical-0
Events:
  Type    Reason         Age               From                         Message
  ----    ------         ----              ----                         -------
  Normal  FailedBinding  8s (x3 over 27s)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
$ kubectl get storageclass
No resources found in default namespace.

$ kubectl apply -f storage.yaml
storageclass.storage.k8s.io/local-storage created

$ kubectl get storageclass
NAME            PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  4s
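For reference, storage.yaml above probably contains something along these lines (a minimal sketch reconstructed from the kubectl get storageclass output; the actual file was not posted):

# storage.yaml -- assumed contents, matching the fields shown above
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # static provisioning: PVs must be created by hand
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer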

@AWaterColorPen I think I might have figured out the problem. After I added storageClassName: manual to the Druid PVCs, they were all successfully bound. Should I create another PR to fix it?

$ kubectl get pvc
NAME                                     STATUS   VOLUME             CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-druid-1592280534-historical-0       Bound    task-pv-volume     1Ti        RWO            manual         12m
data-druid-1592280534-middle-manager-0   Bound    task-pv-volume-1   1Ti        RWO            manual         12m
data-druid-1592280534-postgresql-0       Bound    task-pv-volume-2   1Ti        RWO            manual         12m
data-druid-1592280534-zookeeper-0        Bound    task-pv-volume-3   1Ti        RWO            manual         12m
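For context, each of the task-pv-volume PVs above would have been created from a manifest roughly like the following (a minimal sketch; the hostPath location is an assumption, while the capacity, access mode, reclaim policy, and storageClassName match the output):

# sketch of one pre-created PersistentVolume with the "manual" class
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
spec:
  storageClassName: manual        # must match the storageClassName set on the PVC
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data               # assumed local directory on the node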

@asdf2014 It is not a real bug.
In the normal case, the cluster already has a default StorageClass, which shows up under kubectl get storageclass.
In your case, there was no default StorageClass in your cluster.

@AWaterColorPen Now my cluster has a default StorageClass. However, these pods are still in Pending status and report the message 0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.

$ kubectl get storageclass
NAME                      PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-storage (default)   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  5h31m
$ kubectl get pv
NAME               CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
task-pv-volume-1   1Ti        RWO            Retain           Available           manual                  164m
task-pv-volume-2   1Ti        RWO            Retain           Available           manual                  164m
task-pv-volume-3   1Ti        RWO            Retain           Available           manual                  164m
task-pv-volume-4   1Ti        RWO            Retain           Available           manual                  164m

$ kubectl get pvc
NAME                                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
data-druid-1592290737-historical-0       Pending                                      local-storage   163m
data-druid-1592290737-middle-manager-0   Pending                                      local-storage   163m
data-druid-1592290737-postgresql-0       Pending                                      local-storage   163m
data-druid-1592290737-zookeeper-0        Pending                                      local-storage   163m

$ kubectl describe pvc
Name:          data-druid-1592290737-historical-0
Namespace:     default
StorageClass:  local-storage
Status:        Pending
Volume:
Labels:        app=druid
               component=historical
               release=druid-1592290737
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    druid-1592290737-historical-0
Events:
  Type    Reason                Age                     From                         Message
  ----    ------                ----                    ----                         -------
  Normal  WaitForFirstConsumer  3m53s (x642 over 163m)  persistentvolume-controller  waiting for first consumer to be created before binding
$ kubectl get event
LAST SEEN   TYPE      REASON                 OBJECT                                                         MESSAGE
3s          Normal    WaitForFirstConsumer   persistentvolumeclaim/data-druid-1592290737-historical-0       waiting for first consumer to be created before binding
3s          Normal    WaitForFirstConsumer   persistentvolumeclaim/data-druid-1592290737-middle-manager-0   waiting for first consumer to be created before binding
3s          Normal    WaitForFirstConsumer   persistentvolumeclaim/data-druid-1592290737-postgresql-0       waiting for first consumer to be created before binding
3s          Normal    WaitForFirstConsumer   persistentvolumeclaim/data-druid-1592290737-zookeeper-0        waiting for first consumer to be created before binding
8m59s       Warning   Unhealthy              pod/druid-1592290737-broker-5d9d7599b4-27jhm                   Readiness probe failed: Get http://10.244.3.2:8082/status/health: dial tcp 10.244.3.2:8082: connect: connection refused
24m         Warning   Unhealthy              pod/druid-1592290737-broker-5d9d7599b4-27jhm                   Liveness probe failed: Get http://10.244.3.2:8082/status/health: dial tcp 10.244.3.2:8082: connect: connection refused
4m35s       Warning   BackOff                pod/druid-1592290737-broker-5d9d7599b4-27jhm                   Back-off restarting failed container
61s         Warning   Unhealthy              pod/druid-1592290737-coordinator-9cdd47fbf-hl84d               Liveness probe failed: Get http://10.244.0.4:8081/status/health: dial tcp 10.244.0.4:8081: connect: connection refused
10m         Warning   Unhealthy              pod/druid-1592290737-coordinator-9cdd47fbf-hl84d               Readiness probe failed: Get http://10.244.0.4:8081/status/health: dial tcp 10.244.0.4:8081: connect: connection refused
36m         Normal    Pulled                 pod/druid-1592290737-coordinator-9cdd47fbf-hl84d               Container image "apache/druid:0.18.0" already present on machine
5m59s       Warning   BackOff                pod/druid-1592290737-coordinator-9cdd47fbf-hl84d               Back-off restarting failed container
4m18s       Warning   FailedScheduling       pod/druid-1592290737-historical-0                              0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
4m18s       Warning   FailedScheduling       pod/druid-1592290737-middle-manager-0                          0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
4m18s       Warning   FailedScheduling       pod/druid-1592290737-postgresql-0                              0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
6m5s        Warning   Unhealthy              pod/druid-1592290737-router-6f9bc65854-tdnhb                   Liveness probe failed: Get http://10.244.0.5:8888/status/health: dial tcp 10.244.0.5:8888: connect: connection refused
60s         Warning   BackOff                pod/druid-1592290737-router-6f9bc65854-tdnhb                   Back-off restarting failed container
4m18s       Warning   FailedScheduling       pod/druid-1592290737-zookeeper-0                               0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
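One likely reason for the FailedScheduling events above: the PVCs now request the default local-storage class, but the existing PVs were created with storageClassName: manual, so the scheduler cannot match them. A PV for a no-provisioner class also needs a nodeAffinity term so it can be bound at scheduling time. A minimal sketch of a PV that could satisfy one of these claims (the PV name, node name, and path are assumptions):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-0
spec:
  storageClassName: local-storage   # must match the class requested by the PVC
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  local:
    path: /mnt/disks/druid          # assumed directory that already exists on the node
  nodeAffinity:                     # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1     # assumed node name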

@asdf2014 Maybe you want a StorageClass with dynamic provisioning.

The one below is a manual (no-provisioner) class; it requires you to create the PVs by hand.

$ kubectl get storageclass
NAME                      PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-storage (default)   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  5h31m
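For example, on a bare-metal cluster a lightweight dynamic provisioner such as Rancher's local-path-provisioner can be installed and marked as the default class (a sketch; the manifest URL reflects the project's documented install path at the time and may change):

$ # install the provisioner, then make its "local-path" class the default
$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

With a dynamic provisioner in place, the chart's PVCs get volumes created for them automatically and no manual PV bookkeeping is needed.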

@AWaterColorPen Good news: after replacing the Flannel CNI plugin with Weave, all pods now start successfully. Thank you very much. :+1::+1::+1:
