Describe the bug
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
druid-1592218780-broker-6f45cf8c46-cxftz 0/1 Running 24 88m
druid-1592218780-coordinator-5ff65c5775-z9fpm 0/1 Running 24 88m
druid-1592218780-historical-0 0/1 Pending 0 88m
druid-1592218780-middle-manager-0 0/1 Pending 0 88m
druid-1592218780-postgresql-0 0/1 Pending 0 88m
druid-1592218780-router-7fd5957d84-h2glr 0/1 CrashLoopBackOff 25 88m
druid-1592218780-zookeeper-0 0/1 Pending 0 88m
Version of Helm and Kubernetes:
$ helm version
version.BuildInfo{Version:"v3.2.3", GitCommit:"8f832046e258e2cb800894579b1b3b50c2d83492", GitTreeState:"clean", GoVersion:"go1.13.12"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Which chart:
https://hub.helm.sh/charts/incubator/druid/0.2.1
What happened:
Pods get stuck in CrashLoopBackOff / Pending status, and the broker logs show:
$ kubectl logs pod/druid-1592218780-broker-6f45cf8c46-cxftz
WARN [main-SendThread(druid-1592218780-zookeeper-headless:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server druid-1592218780-zookeeper-headless:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address druid-1592218780-zookeeper-headless:2181 because it's not resolvable
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) [zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
What you expected to happen:
No error appears
How to reproduce it (as minimally and precisely as possible):
$ helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com
$ helm install incubator/druid --version 0.2.1 --generate-name
Anything else we need to know:
@maver1ck @AWaterColorPen PTAL
@asdf2014
In my understanding, this is expected.
The broker is a Deployment workload with only one replica by default, while ZooKeeper is a StatefulSet workload with 3 replicas and PVCs, so ZooKeeper becomes ready more slowly. Until ZooKeeper is ready, the Druid pods will keep failing their readiness probes and restarting.
Could you provide the ZooKeeper logs? There should be 3 ZooKeeper pods.
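For reference, a few commands that should surface the ZooKeeper state (the StatefulSet name below assumes the chart's default naming):
$ kubectl logs druid-1592218780-zookeeper-0
$ kubectl describe pod druid-1592218780-zookeeper-0
$ kubectl describe statefulset druid-1592218780-zookeeper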
@AWaterColorPen Thanks for taking a look. After ten hours, these pods are still in CrashLoopBackOff status. I tried to get the ZooKeeper logs with the kubectl logs pod/druid-1592218780-zookeeper-0 command, but nothing came out.
NAME READY STATUS RESTARTS AGE
pod/druid-1592218780-broker-6f45cf8c46-cxftz 0/1 Running 240 15h
pod/druid-1592218780-coordinator-5ff65c5775-z9fpm 0/1 CrashLoopBackOff 241 15h
pod/druid-1592218780-historical-0 0/1 Pending 0 15h
pod/druid-1592218780-middle-manager-0 0/1 Pending 0 15h
pod/druid-1592218780-postgresql-0 0/1 Pending 0 15h
pod/druid-1592218780-router-7fd5957d84-h2glr 0/1 CrashLoopBackOff 246 15h
pod/druid-1592218780-zookeeper-0 0/1 Pending 0 15h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/druid-1592218780-broker ClusterIP 10.108.103.170 <none> 8082/TCP 15h
service/druid-1592218780-coordinator ClusterIP 10.98.145.22 <none> 8081/TCP 15h
service/druid-1592218780-historical ClusterIP 10.102.117.61 <none> 8083/TCP 15h
service/druid-1592218780-middle-manager ClusterIP 10.110.130.0 <none> 8091/TCP 15h
service/druid-1592218780-postgresql ClusterIP 10.109.19.76 <none> 5432/TCP 15h
service/druid-1592218780-postgresql-headless ClusterIP None <none> 5432/TCP 15h
service/druid-1592218780-router ClusterIP 10.101.204.212 <none> 8888/TCP 15h
service/druid-1592218780-zookeeper ClusterIP 10.106.200.185 <none> 2181/TCP 15h
service/druid-1592218780-zookeeper-headless ClusterIP None <none> 2181/TCP,3888/TCP,2888/TCP 15h
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 17h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/druid-1592218780-broker 0/1 1 0 15h
deployment.apps/druid-1592218780-coordinator 0/1 1 0 15h
deployment.apps/druid-1592218780-router 0/1 1 0 15h
NAME DESIRED CURRENT READY AGE
replicaset.apps/druid-1592218780-broker-6f45cf8c46 1 1 0 15h
replicaset.apps/druid-1592218780-coordinator-5ff65c5775 1 1 0 15h
replicaset.apps/druid-1592218780-router-7fd5957d84 1 1 0 15h
NAME READY AGE
statefulset.apps/druid-1592218780-historical 0/1 15h
statefulset.apps/druid-1592218780-middle-manager 0/1 15h
statefulset.apps/druid-1592218780-postgresql 0/1 15h
statefulset.apps/druid-1592218780-zookeeper 0/3 15h
I think the key point is that your ZooKeeper and PostgreSQL cannot become ready. You can try kubectl get event.
@AWaterColorPen The following is what I got. Any thoughts?
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-historical-0 no persistent volumes available for this claim and no storage class is set
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-middle-manager-0 no persistent volumes available for this claim and no storage class is set
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-postgresql-0 no persistent volumes available for this claim and no storage class is set
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-zookeeper-0 no persistent volumes available for this claim and no storage class is set
2m11s Warning Unhealthy pod/druid-1592218780-broker-6f45cf8c46-cxftz Liveness probe failed: Get http://10.244.3.3:8082/status/health: dial tcp 10.244.3.3:8082: connect: connection refused
32m Warning Unhealthy pod/druid-1592218780-broker-6f45cf8c46-cxftz Readiness probe failed: Get http://10.244.3.3:8082/status/health: dial tcp 10.244.3.3:8082: connect: connection refused
7m56s Warning BackOff pod/druid-1592218780-broker-6f45cf8c46-cxftz Back-off restarting failed container
37m Warning Unhealthy pod/druid-1592218780-coordinator-5ff65c5775-z9fpm Liveness probe failed: Get http://10.244.5.4:8081/status/health: dial tcp 10.244.5.4:8081: connect: connection refused
12m Warning Unhealthy pod/druid-1592218780-coordinator-5ff65c5775-z9fpm Readiness probe failed: Get http://10.244.5.4:8081/status/health: dial tcp 10.244.5.4:8081: connect: connection refused
2m52s Warning BackOff pod/druid-1592218780-coordinator-5ff65c5775-z9fpm Back-off restarting failed container
2m52s Warning FailedScheduling pod/druid-1592218780-historical-0 running "VolumeBinding" filter plugin for pod "druid-1592218780-historical-0": pod has unbound immediate PersistentVolumeClaims
2m52s Warning FailedScheduling pod/druid-1592218780-middle-manager-0 running "VolumeBinding" filter plugin for pod "druid-1592218780-middle-manager-0": pod has unbound immediate PersistentVolumeClaims
2m52s Warning FailedScheduling pod/druid-1592218780-postgresql-0 running "VolumeBinding" filter plugin for pod "druid-1592218780-postgresql-0": pod has unbound immediate PersistentVolumeClaims
12m Warning Unhealthy pod/druid-1592218780-router-7fd5957d84-h2glr Liveness probe failed: Get http://10.244.5.3:8888/status/health: dial tcp 10.244.5.3:8888: connect: connection refused
37m Normal Pulled pod/druid-1592218780-router-7fd5957d84-h2glr Container image "apache/druid:0.18.0" already present on machine
2m58s Warning BackOff pod/druid-1592218780-router-7fd5957d84-h2glr Back-off restarting failed container
2m52s Warning FailedScheduling pod/druid-1592218780-zookeeper-0 running "VolumeBinding" filter plugin for pod "druid-1592218780-zookeeper-0": pod has unbound immediate PersistentVolumeClaims
@asdf2014 As shown, it is a PVC issue. Please make sure your cluster has a basic default storageclass configured.
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-historical-0 no persistent volumes available for this claim and no storage class is set
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-middle-manager-0 no persistent volumes available for this claim and no storage class is set
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-postgresql-0 no persistent volumes available for this claim and no storage class is set
2m52s Normal FailedBinding persistentvolumeclaim/data-druid-1592218780-zookeeper-0 no persistent volumes available for this claim and no storage class is set
You can try kubectl get storageclass.
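For reference, an existing StorageClass can be marked as the cluster default like this ("standard" is only a placeholder; substitute the name of your StorageClass):
$ kubectl patch storageclass standard -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'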
@AWaterColorPen After creating a PV and re-creating the Druid PVCs, they are still in a Pending state because no persistent volumes are available. What did I miss? :sweat_smile:
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
task-pv-volume 1Ti RWO Retain Released default/task-pv-claim manual 17m
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-druid-1592279879-historical-0 Pending 20s
data-druid-1592279879-middle-manager-0 Pending 20s
data-druid-1592279879-postgresql-0 Pending 20s
data-druid-1592279879-zookeeper-0 Pending 20s
$ kubectl describe pvc
Name: data-druid-1592279879-historical-0
Namespace: default
StorageClass:
Status: Pending
Volume:
Labels: app=druid
component=historical
release=druid-1592279879
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Mounted By: druid-1592279879-historical-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal FailedBinding 8s (x3 over 27s) persistentvolume-controller no persistent volumes available for this claim and no storage class is set
$ kubectl get storageclass
No resources found in default namespace.
$ kubectl apply -f storage.yaml
storageclass.storage.k8s.io/local-storage created
$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-storage kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 4s
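For completeness, a storage.yaml that produces the class shown above would look roughly like this (reconstructed from the kubectl get storageclass columns, not the exact file that was applied):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer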
@AWaterColorPen I think I might have figured out the problem. After I added storageClassName: manual to the Druid PVCs, they were all successfully bound. Should I create another PR to fix it?
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-druid-1592280534-historical-0 Bound task-pv-volume 1Ti RWO manual 12m
data-druid-1592280534-middle-manager-0 Bound task-pv-volume-1 1Ti RWO manual 12m
data-druid-1592280534-postgresql-0 Bound task-pv-volume-2 1Ti RWO manual 12m
data-druid-1592280534-zookeeper-0 Bound task-pv-volume-3 1Ti RWO manual 12m
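For context, the PVs these claims bound to would look roughly like the sketch below; the hostPath source and path are assumptions, since the actual PV manifests were not shared in the thread:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
spec:
  storageClassName: manual        # must match the storageClassName set on the PVCs
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data               # assumed path; not shown in the thread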
@asdf2014 It is not a real bug.
In the normal case, the cluster already has a default storageclass listed by kubectl get storageclass.
In your case, there is no default storageclass in your cluster.
@AWaterColorPen Now my cluster has a default storageclass. However, these pods are still in Pending status and report the message 0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-storage (default) kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 5h31m
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
task-pv-volume-1 1Ti RWO Retain Available manual 164m
task-pv-volume-2 1Ti RWO Retain Available manual 164m
task-pv-volume-3 1Ti RWO Retain Available manual 164m
task-pv-volume-4 1Ti RWO Retain Available manual 164m
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-druid-1592290737-historical-0 Pending local-storage 163m
data-druid-1592290737-middle-manager-0 Pending local-storage 163m
data-druid-1592290737-postgresql-0 Pending local-storage 163m
data-druid-1592290737-zookeeper-0 Pending local-storage 163m
$ kubectl describe pvc
Name: data-druid-1592290737-historical-0
Namespace: default
StorageClass: local-storage
Status: Pending
Volume:
Labels: app=druid
component=historical
release=druid-1592290737
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Mounted By: druid-1592290737-historical-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 3m53s (x642 over 163m) persistentvolume-controller waiting for first consumer to be created before binding
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
3s Normal WaitForFirstConsumer persistentvolumeclaim/data-druid-1592290737-historical-0 waiting for first consumer to be created before binding
3s Normal WaitForFirstConsumer persistentvolumeclaim/data-druid-1592290737-middle-manager-0 waiting for first consumer to be created before binding
3s Normal WaitForFirstConsumer persistentvolumeclaim/data-druid-1592290737-postgresql-0 waiting for first consumer to be created before binding
3s Normal WaitForFirstConsumer persistentvolumeclaim/data-druid-1592290737-zookeeper-0 waiting for first consumer to be created before binding
8m59s Warning Unhealthy pod/druid-1592290737-broker-5d9d7599b4-27jhm Readiness probe failed: Get http://10.244.3.2:8082/status/health: dial tcp 10.244.3.2:8082: connect: connection refused
24m Warning Unhealthy pod/druid-1592290737-broker-5d9d7599b4-27jhm Liveness probe failed: Get http://10.244.3.2:8082/status/health: dial tcp 10.244.3.2:8082: connect: connection refused
4m35s Warning BackOff pod/druid-1592290737-broker-5d9d7599b4-27jhm Back-off restarting failed container
61s Warning Unhealthy pod/druid-1592290737-coordinator-9cdd47fbf-hl84d Liveness probe failed: Get http://10.244.0.4:8081/status/health: dial tcp 10.244.0.4:8081: connect: connection refused
10m Warning Unhealthy pod/druid-1592290737-coordinator-9cdd47fbf-hl84d Readiness probe failed: Get http://10.244.0.4:8081/status/health: dial tcp 10.244.0.4:8081: connect: connection refused
36m Normal Pulled pod/druid-1592290737-coordinator-9cdd47fbf-hl84d Container image "apache/druid:0.18.0" already present on machine
5m59s Warning BackOff pod/druid-1592290737-coordinator-9cdd47fbf-hl84d Back-off restarting failed container
4m18s Warning FailedScheduling pod/druid-1592290737-historical-0 0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
4m18s Warning FailedScheduling pod/druid-1592290737-middle-manager-0 0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
4m18s Warning FailedScheduling pod/druid-1592290737-postgresql-0 0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
6m5s Warning Unhealthy pod/druid-1592290737-router-6f9bc65854-tdnhb Liveness probe failed: Get http://10.244.0.5:8888/status/health: dial tcp 10.244.0.5:8888: connect: connection refused
60s Warning BackOff pod/druid-1592290737-router-6f9bc65854-tdnhb Back-off restarting failed container
4m18s Warning FailedScheduling pod/druid-1592290737-zookeeper-0 0/6 nodes are available: 6 node(s) didn't find available persistent volumes to bind.
@asdf2014 Maybe you want a storageclass with dynamic provisioning.
The following one is a manual (no-op) provisioner; it requires you to create PVs by hand.
$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-storage (default) kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 5h31m
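Note that with this kubernetes.io/no-provisioner class, each PV has to be created by hand in the same class and pinned to a node, and the existing task-pv-volume-* PVs are in the manual class, so they will not be matched against PVCs that request local-storage. A minimal local PV sketch (name, path, and node name are placeholders) would look like:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-0                  # placeholder name
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage   # must match the PVC's StorageClass
  local:
    path: /mnt/disks/vol0           # placeholder path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1     # placeholder node name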
@AWaterColorPen Good news. After replacing the Flannel plugin with Weave, all pods now run successfully. Thank you very much. :+1::+1::+1: