Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: Liveness/readiness probes are failing frequently, and the failures are inconsistent. Some pods in the same deployment are getting stuck in CrashLoopBackOff:
Back-off restarting failed container
Error syncing pod, skipping: failed to "StartContainer" for "web" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=web pod=web-controller-3987916996-qtfg7_micro-services(a913f25b-400a-11e8-8a2a-0252b8c4655e)"
Liveness probe failed: Get http://100.96.11.194:5000/: dial tcp 100.96.11.194:5000: getsockopt: connection refused
Failed to start container with id 0d47047e3e7a640ad6b4a6a8664bdb01a3601c95518c9e60bdb4966533fe7e6d with error: rpc error: code = 2 desc = failed to create symbolic link "/var/log/pods/a913f25b-400a-11e8-8a2a-0252b8c4655e/web_23.log" to the container log file "/var/lib/docker/containers/0d47047e3e7a640ad6b4a6a8664bdb01a3601c95518c9e60bdb4966533fe7e6d/0d47047e3e7a640ad6b4a6a8664bdb01a3601c95518c9e60bdb4966533fe7e6d-json.log" for container "0d47047e3e7a640ad6b4a6a8664bdb01a3601c95518c9e60bdb4966533fe7e6d": symlink /var/lib/docker/containers/0d47047e3e7a640ad6b4a6a8664bdb01a3601c95518c9e60bdb4966533fe7e6d/0d47047e3e7a640ad6b4a6a8664bdb01a3601c95518c9e60bdb4966533fe7e6d-json.log /var/log/pods/a913f25b-400a-11e8-8a2a-0252b8c4655e/web_23.log: file exists
And a few more errors from other pods:
Error syncing pod, skipping: failed to "CreatePodSandbox" for "work-1132229878-zk00f_micro-services(897dd216-41f4-11e8-8a2a-0252b8c4655e)"
with CreatePodSandboxError: "CreatePodSandbox for pod \"work-1132229878-zk00f_micro-services(897dd216-41f4-11e8-8a2a-0252b8c4655e)\"
failed: rpc error: code = 2 desc = NetworkPlugin kubenet failed to set up pod \"work-1132229878-zk00f_micro-services\"
network: Error adding container to network: failed to connect \"vethdaa54c24\" to bridge cbr0: exchange full"
Here is my livenessProbe config:
"livenessProbe": {
"httpGet": {
"path": "/",
"port": 5000,
"scheme": "HTTP"
},
"initialDelaySeconds": 60,
"timeoutSeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"failureThreshold": 3
}
What you expected to happen:
Health checks to pass if the app is running.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
It's a Node.js app listening on port 5000, and the port is also exposed in the Dockerfile.
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-09T07:26:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.7", GitCommit:"095136c3078ccf887b9034b7ce598a0a1faff769", GitTreeState:"clean", BuildDate:"2017-07-05T16:40:42Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
uname -a:
Linux ip-172-20-54-255 4.4.78-k8s #1 SMP Fri Jul 28 01:28:39 UTC 2017 x86_64 GNU/Linux
kops
/sig network
/sig aws
/kind bug
This is happening right now in my clusters with kube-dns/nginx, causing the rest of the pods to be terminated.
What is this issue related to?
Yeah, same issue on my side when I activate the readiness probe... I get "Readiness probe failed: Get getsockopt: connection refused", and then once the service is available everything is green.
My probe configuration:
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 120
  timeoutSeconds: 2
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 5
I am also having this issue; it's very intermittent.
This is my config.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 120
  timeoutSeconds: 30
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 120
  timeoutSeconds: 30
Same here!
Same, with a basic config:
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  timeoutSeconds: 10
+1 for similar simple config
I'm also seeing this. New pods get stuck in ContainerCreating as well.
My case was solved by changing the binding from 127.0.0.1 to 0.0.0.0.
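For anyone wondering what that looks like in practice, here is a minimal sketch of a Node.js server in TypeScript (a hypothetical example, assuming a plain http server on port 5000 like the app in the original report):

```ts
import * as http from 'http';

const server = http.createServer((_req, res) => {
  res.writeHead(200);
  res.end('ok');
});

// Binding to 0.0.0.0 listens on all interfaces, so the kubelet can reach the
// port via the pod IP. Binding to 127.0.0.1 only accepts connections from
// inside the container's own network namespace, which is why the probes fail
// with "connection refused" even though the app is up.
server.listen(5000, '0.0.0.0');
```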
I'm seeing this as well.
Found the same on kube 1.8 with the above basic config
We have the same issue on v1.7.5: intermittent probe failures caused by getsockopt: connection refused on a specific worker node. Pods on other nodes work perfectly, and going inside the container and running curl http://localhost:8080/health returns OK.
A snippet of the kubelet log:
Jul 17 20:31:39 192-168-0-1-B28 kubelet: I0717 20:31:39.569363 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:31:49 192-168-0-1-B28 kubelet: I0717 20:31:49.569312 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:31:59 192-168-0-1-B28 kubelet: I0717 20:31:59.569324 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:32:09 192-168-0-1-B28 kubelet: I0717 20:32:09.569293 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:32:20 192-168-0-1-B28 kubelet: I0717 20:32:20.808546 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:32:39 192-168-0-1-B28 kubelet: I0717 20:32:39.638384 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:32:49 192-168-0-1-B28 kubelet: I0717 20:32:49.569171 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:32:59 192-168-0-1-B28 kubelet: I0717 20:32:59.577038 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:33:09 192-168-0-1-B28 kubelet: I0717 20:33:09.569288 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:33:19 192-168-0-1-B28 kubelet: I0717 20:33:19.575965 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:38:49 192-168-0-1-B28 kubelet: I0717 20:38:49.569077 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:38:59 192-168-0-1-B28 kubelet: I0717 20:38:59.569407 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:39:09 192-168-0-1-B28 kubelet: I0717 20:39:09.569431 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:39:20 192-168-0-1-B28 kubelet: I0717 20:39:20.728373 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:39:39 192-168-0-1-B28 kubelet: I0717 20:39:39.577240 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:39:49 192-168-0-1-B28 kubelet: I0717 20:39:49.577932 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:39:59 192-168-0-1-B28 kubelet: I0717 20:39:59.578145 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:40:09 192-168-0-1-B28 kubelet: I0717 20:40:09.569164 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:40:19 192-168-0-1-B28 kubelet: I0717 20:40:19.569277 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:40:39 192-168-0-1-B28 kubelet: I0717 20:40:39.569681 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:40:49 192-168-0-1-B28 kubelet: I0717 20:40:49.576956 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:40:59 192-168-0-1-B28 kubelet: I0717 20:40:59.569375 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:41:09 192-168-0-1-B28 kubelet: I0717 20:41:09.569403 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:41:19 192-168-0-1-B28 kubelet: I0717 20:41:19.575196 23774 prober.go:113] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" succeeded
Jul 17 20:41:39 192-168-0-1-B28 kubelet: I0717 20:41:39.569339 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:41:49 192-168-0-1-B28 kubelet: I0717 20:41:49.569497 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:41:59 192-168-0-1-B28 kubelet: I0717 20:41:59.569198 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:42:09 192-168-0-1-B28 kubelet: I0717 20:42:09.569606 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:42:19 192-168-0-1-B28 kubelet: I0717 20:42:19.569110 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:42:39 192-168-0-1-B28 kubelet: I0717 20:42:39.576585 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:42:49 192-168-0-1-B28 kubelet: I0717 20:42:49.569490 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:42:59 192-168-0-1-B28 kubelet: I0717 20:42:59.569501 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Jul 17 20:43:09 192-168-0-1-B28 kubelet: I0717 20:43:09.569399 23774 prober.go:106] Readiness probe for "zp-crm-rule-1362938709-4900s_default(92d60c48-89b2-11e8-8d0c-7e635641b742):zp-crm-rule" failed (failure): Get http://11.11.75.4:8080/health: dial tcp 11.11.75.4:8080: getsockopt: connection refused
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
# cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
# uname -a
Linux 192-168-0-1-B28 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
# docker version
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.2
Git commit: 9ee9f40
Built: Thu Apr 26 07:12:25 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:03 2018
OS/Arch: linux/amd64
Experimental: false
# flanneld --version
v0.8.0
@Phanindra48 Hi there, do you have any updates about this issue?
@Shawyeok Actually I don't have a concrete solution yet, as there are several reasons for it to fail and it is quite inconsistent. Maybe they need to provide more information, or something more useful than what we currently get in the logs.
+1
+1
With the same configuration, the same pod started successfully once. Then I stopped it and ran minikube delete; now I am getting this readiness probe failure.
Running with vm-driver=none.
We are seeing the same issue. When deploying the same pod to multiple nodes, pods on some nodes can pass liveness probing, while pods on other nodes fail with the "connection refused" issue.
The pod was eventually killed and re-scheduled on the same node. However, this time the probing started working.
Hello all. We were experiencing this issue as well.
Using:
kubectl describe pod <pod-name>
We also found that the exit code was 137 and the reason was 'Error'.
This is the exit code for both liveness failures and memory issues, but we were fairly certain it was not a memory issue: we had more than enough memory allocated, and when memory issues do kill the pod we get the correct reason, 'OOMKilled'.
Anyway, we found that the issue occurred when we statically applied 250m of CPU per pod in lower environments in order to be resource efficient.
We run Spring Boot applications, which have a really heavy boot period. Because of this, the 0.25 cores we applied could not boot the app in time to start serving and pass the health checks, and we ultimately passed our failure deadline before the service was ready.
I suggest that anyone seeing this issue could maybe solve it by doing one of two things:
1) Meet the race condition
Allocate more CPU to your deployments so that the boot process is faster and the pod is up in time for the liveness and readiness checks (see the resources sketch after the probe example below).
2) Change the race condition
Set a longer initial wait on the liveness and readiness probes, and extend the failure threshold and probe interval. This should give your service plenty of time to boot up and become ready.
e.g
readinessProbe:
  httpGet:
    scheme: HTTP
    path: /health
    port: 8080
  initialDelaySeconds: 120
  timeoutSeconds: 3
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 5
livenessProbe:
  httpGet:
    scheme: HTTP
    path: /health
    port: 8080
  initialDelaySeconds: 120
  timeoutSeconds: 3
  periodSeconds: 30
  successThreshold: 1
  failureThreshold: 5
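For option 1, a hedged sketch of the kind of resource bump meant above (the request/limit values are illustrative assumptions, not a recommendation):

```yaml
resources:
  requests:
    cpu: "1"        # enough CPU for the app to boot before the probes kick in
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 1Gi
```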
Obviously there could be other reasons for the probe failing and ending up with this type of Error but this is what solved it for us.
Hope this can help.
I am having the same issue on GKE.
Not sure if there is an underlying issue; I am using a Helm chart (concourse).
I don't think it is a timeout issue; it seems to me that the probe is looking for a response on port 8080 and that port is blocked on the container.
Warning Unhealthy 40m (x3 over 40m) kubelet, gke-cluster-default-pool-bec82955-0rtc Liveness probe failed: Get http://10.28.0.19:8080/: dial tcp 10.28.0.19:8080: getsockopt: connection refused
Normal Killing 40m kubelet, gke-cluster-default-pool-bec82955-0rtc Killing container with id docker://concourse-web:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 40m kubelet, gke-cluster-default-pool-bec82955-0rtc Container image "concourse/concourse:4.2.1" already present on machine
Warning BackOff 8m (x45 over 26m) kubelet, gke-cluster-default-pool-bec82955-0rtc Back-off restarting failed container
Warning Unhealthy 3m (x150 over 42m) kubelet, gke-cluster-default-pool-bec82955-0rtc Readiness probe failed: Get http://10.28.0.19:8080/: dial tcp 10.28.0.19:8080: getsockopt: connection refused
After some troubleshooting, I found out that I was missing the targetPort directive, as it was different from the LB inbound port.
apiVersion: v1
kind: Service
metadata:
  name: app-server
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: <hidden>
spec:
  type: LoadBalancer
  ports:
    - port: 80
      name: http
      targetPort: 3000
      protocol: TCP
  selector:
    app: app-server
Maybe you have a server.context-path set up in your application.properties file. Try adjusting the path: of the liveness/readiness probe (httpGet.path) to include it. The spring.application.name is overridden by the server context path if one is specified in application.properties; I had assumed that spring.application.name only applies in the service-discovery mechanism.
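For example, a hedged sketch (the /api context path here is an assumed value for illustration):

```yaml
# Assuming application.properties contains: server.context-path=/api
livenessProbe:
  httpGet:
    path: /api/health   # the probe path must include the context path
    port: 8080
```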
Same issue here. I can keep the service running continuously by disabling the readiness probe. This indicates that it's the probe failure and the subsequent shutdown that cause the connection refusal, rather than the other way around.
Things I have tried that don't work: containerPort, targetPort.
@njgibbon, your suggestion to modify the initialDelaySeconds, periodSeconds and timeoutSeconds to bigger values (30, 30, 10 respectively) worked for me. Our problem pod contains an authentication proxy and a Jupyter-Lab deployment. After applying liveness probes we saw intermittent connection refused errors. Ideally, we'd like to play with these values to see what can be reduced, but in any case, I'd like to say thanks for the good and helpful comment. 👍
For whoever encounters this issue while using Spring services: don't forget, like I did, to set the appropriate args.
```
Args:
```
I encountered this issue too, and finally found that my server takes too much time to start up, longer than the initialDelaySeconds, so the deploy falls into an infinite loop.
Encountered the same issue on a single node. New pods were stuck in Waiting: ContainerCreating status. sudo reboot on the node helped.
I faced the same issue. Increasing the initialDelaySeconds for the liveness and readiness probes resolved it; the Spring Boot app took too long to start.
For a Spring Boot deployment, increasing initialDelaySeconds from 10 to 45 fixed the problem for me.
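A minimal sketch of that change (the /actuator/health path and port 8080 are assumed Spring Boot defaults, adjust to your service):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 45   # raised from 10 so the app has time to boot
  periodSeconds: 10
```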
Can you update your k8s cluster to a supported version?
As mentioned in https://github.com/kubernetes/kubernetes/issues/62594#issuecomment-420685737, we found this issue was caused by insufficient resource allocation combined with a too-low initialDelaySeconds for the probe. After increasing the resource allocation, the container started more quickly, which allowed reducing the initialDelaySeconds value. After that, the probes succeeded.
Guys, I am facing the same issue, but it happens after a time interval, like after 2 days. It works properly at the start and looks stable, but after a while the pods of the same service go into an infinite loop of restarting one by one.
> My case was solved by changing the binding from 127.0.0.1 to 0.0.0.0.

This worked for me. Why does this work?
I was receiving the same error while spinning up pods for a Jenkins container. I removed the limits on the container and it started working.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jenkins-dep
  namespace: jenkins-ns
spec:
  template:
    metadata:
      name: jenkins-tmplt
      labels:
        app: jenkins
    spec:
      containers:
        - name: jenkins
          image: jenkins:latest
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            successThreshold: 1
            timeoutSeconds: 1
            periodSeconds: 10
            initialDelaySeconds: 10
            tcpSocket:
              port: 8081
          readinessProbe:
            failureThreshold: 3
            successThreshold: 1
            timeoutSeconds: 1
            periodSeconds: 10
            initialDelaySeconds: 30
            httpGet:
              port: 8081
              path: /login
          #resources:
          #  limits:
          #    cpu: 100m
          #    memory: 512Mi
          ports:
            - containerPort: 8081
  replicas: 2
  strategy:
    rollingUpdate:
      maxUnavailable: 2
      maxSurge: 2
  selector:
    matchLabels:
      app: jenkins
> @njgibbon, Your suggestion to modify the initialDelaySeconds, periodSeconds and timeoutSeconds to bigger values (30,30,10) respectively worked for me. Our problem pod contains an authenticationProxy and a Jupyter-Lab deployment. After applying liveness probes we saw intermittent connection refused errors. Ideally, we'd like to play with these values to see what can be reduced but in any case, I'd like to say thanks for the good and helpful comment. 👍
It works for me. So I think we should increase the times a bit more, even when the start time is less than the defined value (my initialDelaySeconds was set to 20s but the start time took 19s). Here is my config:
readinessProbe:
  httpGet:
    path: actuator/health/readiness
    port: {{ .Values.service.targetPort }}
  initialDelaySeconds: 15
  timeoutSeconds: 10
  periodSeconds: 30
  successThreshold: 2
  failureThreshold: 5
livenessProbe:
  httpGet:
    path: actuator/health/liveness
    port: {{ .Values.service.targetPort }}
  initialDelaySeconds: 30
  timeoutSeconds: 10
  periodSeconds: 30
  failureThreshold: 5
Increasing the timeout doesn't work for us. We had to change the IP to 0.0.0.0. It looks like this thread contains two different issues.
For us, it looks like the readiness feature itself has an issue: our server starts immediately and we can curl the health endpoints without any problem.
That's the question https://github.com/kubernetes/kubernetes/issues/62594#issuecomment-605714141 we should try to answer.
When I try to reach the container via the pod IP I get the exact same error:
curl http://10.244.0.89:4000
curl: (7) Failed to connect to 10.244.0.89 port 4000: Connection refused
Is this the same host that the readinessProbe uses? That would explain why binding to 0.0.0.0 works.
Yes, according to the docs the readinessProbe uses the pod IP. 127.0.0.1 can't be accessed from outside the pod network, so either you bind to 0.0.0.0 or you specify an AAAA record.
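As a hedged illustration of that behaviour (path and port here are placeholders): when host is omitted from an httpGet probe, the kubelet connects to the pod IP, so the server inside the container has to be listening on that address:

```yaml
readinessProbe:
  httpGet:
    # host defaults to the pod IP when omitted, so a server bound only to
    # 127.0.0.1 inside the container will refuse these connections
    path: /health
    port: 5000
```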