Cloud-on-k8s: Kibana is never ready on K8S >= 1.16

Created on 19 Aug 2019 · 13 Comments · Source: elastic/cloud-on-k8s

While deploying Kibana on K8S >= 1.16:

Events:
  Type     Reason     Age                From                            Message
  ----     ------     ----               ----                            -------
  Normal   Scheduled  <unknown>          default-scheduler               Successfully assigned e2e-b5dae-venus/test-cross-ns-assoc-cc85-kb-7cdfb5f694-2klgl to eck-e2e-control-plane
  Normal   Pulled     85s                kubelet, eck-e2e-control-plane  Container image "docker.elastic.co/kibana/kibana:7.3.0" already present on machine
  Normal   Created    85s                kubelet, eck-e2e-control-plane  Created container kibana
  Normal   Started    85s                kubelet, eck-e2e-control-plane  Started container kibana
  Warning  Unhealthy  67s                kubelet, eck-e2e-control-plane  Readiness probe failed: Get https://10.244.0.10:5601/login: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  62s                kubelet, eck-e2e-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  2s (x6 over 51s)   kubelet, eck-e2e-control-plane  Readiness probe errored: the read limit is reached

The cause seems to be the amount of response data allowed in an HTTP probe, which is now limited to 10 KB: https://github.com/kubernetes/kubernetes/blob/acc57be085cf5414f924680c1c740378cb712915/pkg/probe/http/http.go#L36
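To make the failure mode concrete, here is a minimal Python sketch of a size-limited body read. It is a simplified model and not the kubelet's Go code: the function names, the exception name, and the use of an in-memory stream are all assumptions made for illustration. The strict variant errors once the body exceeds the limit, which matches the "read limit is reached" event above; the lenient variant simply truncates.

```python
import io

MAX_BODY_BYTES = 10 * 1024  # the 10 KB limit discussed above


class ReadLimitReached(Exception):
    """Hypothetical error raised when the body is larger than the limit."""


def read_at_most(stream, limit=MAX_BODY_BYTES):
    """Read up to `limit` bytes and report whether more data remained.

    Assumes `stream.read(n)` returns n bytes when they are available,
    which holds for io.BytesIO but not necessarily for a raw socket.
    """
    data = stream.read(limit + 1)  # one extra byte reveals an overflow
    if len(data) > limit:
        return data[:limit], True
    return data, False


def probe_body_strict(stream, limit=MAX_BODY_BYTES):
    """Fail the probe when the body exceeds the limit."""
    body, truncated = read_at_most(stream, limit)
    if truncated:
        raise ReadLimitReached("the read limit is reached")
    return body


def probe_body_lenient(stream, limit=MAX_BODY_BYTES):
    """Keep the first `limit` bytes and ignore the rest."""
    body, _ = read_at_most(stream, limit)
    return body
```

With a page larger than 10 KB, `probe_body_strict` raises while `probe_body_lenient` returns the first 10240 bytes, so only the strict variant turns a large but healthy response into a probe error.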

>bug

All 13 comments

We might need to update the docs for the released version 0.9 to note that it will not work.

In case we don't find the right endpoint to query, we could still move away from the built-in k8s HTTP health check to a custom command health check, doing the curl ourselves like we do with Elasticsearch.
But we had better ask the Kibana team first whether there's a better endpoint to request :)
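As a rough illustration of what such a custom command could do, here is a Python sketch of an HTTP check that reports its verdict as an exit code. It is not what ECK does for Elasticsearch (the thread mentions curl); `http_check` and its parameters are hypothetical names, and skipping TLS verification by default is an assumption that mirrors `curl -k`.

```python
import ssl
import urllib.error
import urllib.request


def http_check(url, timeout=3.0, verify_tls=False):
    """Return 0 if `url` answers with a 2xx/3xx status, otherwise 1.

    The 0/1 result is meant to be used as a process exit code, which is
    how a command-based probe signals success or failure.
    """
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: certificates are not verified, similar to `curl -k`.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return 0 if 200 <= resp.status < 400 else 1
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses (such as 503), refused connections
        # and timeouts.
        return 1
```

A wrapper script would call `sys.exit(http_check(url))`; because the command only reports an exit code, the size of the response body plays no role in the verdict.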

A TCP healthcheck might be an option as well?

Some feedback from the Kibana team:

/api/status should work

  • /api/status will respond with 503 until the server is ready and able to talk to Elasticsearch and run migrations
  • If Kibana loses communication with ES and status.allowAnonymous is not set to true, you will get a 401 from Kibana on this endpoint
  • If Kibana loses communication with ES and status.allowAnonymous is set to true, you will get a 200 from Kibana on this endpoint with the status.overall.state property set to red

The response from that endpoint came in at just 7.5 kB in a single test I ran.
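The status-code behavior listed above can be summarized in a small Python sketch. The function name `kibana_ready` is hypothetical, the JSON path follows the status.overall.state property named in the list, and treating every state other than red as ready is an assumption.

```python
import json


def kibana_ready(status_code, body=""):
    """Interpret an /api/status response as ready (True) or not (False)."""
    if status_code == 503:
        # Server not ready yet, or migrations still running.
        return False
    if status_code == 401:
        # Lost contact with ES and status.allowAnonymous is not true.
        return False
    if status_code == 200:
        try:
            state = json.loads(body)["status"]["overall"]["state"]
        except (ValueError, KeyError, TypeError):
            return False  # unexpected payload: be conservative
        # Assumption: anything other than "red" counts as ready.
        return state != "red"
    return False
```

Note that a plain HTTP probe only looks at the status code, so the 200-with-red case would still pass unless the body is inspected as in this sketch.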

I think requiring anonymous access sounds like it's complicating things unnecessarily.

I think I am 👍 on @charith-elastic's suggestion to use a simple TCP health check

Furthermore, the response of /api/status includes the status of each plugin. With more plugins, the payload of the response might reach 10 kB.

On my side, a single test using apm_es_kibana.yaml gives a response of about 8.5 kB.

> curl -u $user:$password https://$kibana_ip:5601/api/status -skI | grep length 
content-length: 8452

So I tend to +1 the simple TCP health check. https://github.com/tevino/tcp-shaker could be a good option (I have used it in the past and it works great).

👍 for the TCP check as described in https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe _(should be applicable to readinessProbe as well)_
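For readers unfamiliar with what a TCP check amounts to, here is a minimal Python sketch: the check passes as soon as a connection to the port can be opened, and nothing is sent or read. `tcp_check` is a hypothetical helper, not the kubelet's implementation and not tcp-shaker.

```python
import socket


def tcp_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the handshake; the socket is closed
        # again when the `with` block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False
```

Because only the handshake is checked, a call such as `tcp_check("10.244.0.10", 5601)` says that something is listening on the port and nothing about what the application behind it would answer.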

Does the TCP check fail until Kibana is ready and connected to Elasticsearch?

> Does the TCP check fail until Kibana is ready and connected to Elasticsearch?

IDK, but I doubt it. The current health check we use does not check for that either, though. You can have a Kibana that cannot talk to Elasticsearch and still serves up a login page (without a login form, but with HTML content indicating that it cannot talk to Elasticsearch).

If I'm understanding the release notes correctly, we should be okay doing an HTTP check against /api/status now that this PR has been merged:
https://github.com/kubernetes/kubernetes/pull/82669
With it, the probe should just truncate the response instead of erroring out.

The K8S project treated this as a regression and has fixed it.

kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-24T05:54:40Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
--- PASS: TestSmoke (148.15s)
    --- PASS: TestSmoke/K8S_should_be_accessible (0.02s)
    --- PASS: TestSmoke/Elasticsearch_CRDs_should_exist (0.06s)
    --- PASS: TestSmoke/Remove_Elasticsearch_if_it_already_exists (0.01s)
    --- PASS: TestSmoke/K8S_should_be_accessible#01 (0.01s)
    --- PASS: TestSmoke/Kibana_CRDs_should_exist (0.01s)
    --- PASS: TestSmoke/Remove_Kibana_if_it_already_exists (0.01s)
    --- PASS: TestSmoke/K8S_should_be_accessible#02 (0.01s)
    --- PASS: TestSmoke/APM_Server_CRDs_should_exist (0.01s)
    --- PASS: TestSmoke/Remove_the_resources_if_they_already_exist (0.01s)
    --- PASS: TestSmoke/Creating_an_Elasticsearch_cluster_should_succeed (0.03s)
    --- PASS: TestSmoke/Elasticsearch_cluster_should_be_created (0.00s)
    --- PASS: TestSmoke/Creating_Kibana_should_succeed (0.01s)
    --- PASS: TestSmoke/Kibana_should_be_created (0.00s)
    --- PASS: TestSmoke/Creating_APM_Server_should_succeed (0.01s)
    --- PASS: TestSmoke/APM_Server_should_be_created (0.00s)
    --- PASS: TestSmoke/ES_certificate_authority_should_be_set_and_deployed (6.02s)
    --- PASS: TestSmoke/ES_version_should_be_the_expected_one (3.01s)
    --- PASS: TestSmoke/ES_pods_should_eventually_be_running (47.36s)
    --- PASS: TestSmoke/ES_services_should_be_created (0.01s)
    --- PASS: TestSmoke/ES_pods_should_eventually_be_ready (24.38s)
    --- PASS: TestSmoke/ES_pods_should_eventually_have_a_certificate (0.02s)
    --- PASS: TestSmoke/ES_services_should_have_endpoints (9.02s)
    --- PASS: TestSmoke/ES_cluster_health_should_eventually_be_green (12.02s)
    --- PASS: TestSmoke/ES_cluster_UUID_should_eventually_appear_in_the_ES_status (0.00s)
    --- PASS: TestSmoke/Elastic_password_should_be_available (0.00s)
    --- PASS: TestSmoke/Elasticsearch_data_volumes_should_be_of_the_specified_type (0.01s)
    --- PASS: TestSmoke/ES_cluster_health_endpoint_should_eventually_be_reachable (0.16s)
    --- PASS: TestSmoke/ES_version_should_be_the_expected_one#01 (0.03s)
    --- PASS: TestSmoke/ES_endpoint_should_eventually_be_reachable (0.03s)
    --- PASS: TestSmoke/ES_nodes_topology_should_eventually_be_the_expected_one (0.06s)
    --- PASS: TestSmoke/Kibana_deployment_should_be_set (0.01s)
    --- PASS: TestSmoke/Kibana_pods_count_should_match_the_expected_one (0.00s)
    --- PASS: TestSmoke/Kibana_pods_should_eventually_be_running (0.00s)
    --- PASS: TestSmoke/Kibana_services_should_be_created (0.00s)
    --- PASS: TestSmoke/Kibana_services_should_have_endpoints (0.00s)
    --- PASS: TestSmoke/Create_Kibana_client (0.04s)
    --- PASS: TestSmoke/Kibana_should_be_able_to_connect_to_Elasticsearch (0.08s)
    --- PASS: TestSmoke/ApmServer_deployment_should_be_created (0.00s)
    --- PASS: TestSmoke/ApmServer_pods_count_should_match_the_expected_one (0.00s)
    --- PASS: TestSmoke/ApmServer_pods_should_eventually_be_running (0.00s)
    --- PASS: TestSmoke/ApmServer_services_should_be_created (0.00s)
    --- PASS: TestSmoke/ApmServer_services_should_have_endpoints (0.00s)
    --- PASS: TestSmoke/Every_secret_should_be_set_so_that_we_can_build_an_APM_client (0.16s)
    --- PASS: TestSmoke/ApmServer_endpoint_should_eventually_be_reachable (0.01s)
    --- PASS: TestSmoke/ApmServer_version_should_be_the_expected_one (0.00s)
    --- PASS: TestSmoke/Events_should_be_accepted (0.00s)
    --- PASS: TestSmoke/Events_should_eventually_show_up_in_Elasticsearch (12.33s)
    --- PASS: TestSmoke/Deleting_Elasticsearch_should_return_no_error (0.01s)
    --- PASS: TestSmoke/Elasticsearch_should_not_be_there_anymore (0.00s)
    --- PASS: TestSmoke/Elasticsearch_pods_should_be_eventually_be_removed (15.04s)
    --- PASS: TestSmoke/PVCs_should_eventually_be_removed (0.00s)
    --- PASS: TestSmoke/Deleting_Kibana_should_return_no_error (0.01s)
    --- PASS: TestSmoke/Kibana_should_not_be_there_anymore (0.00s)
    --- PASS: TestSmoke/Kibana_pods_should_be_eventually_be_removed (9.02s)
    --- PASS: TestSmoke/Deleting_the_resources_should_return_no_error (0.01s)
    --- PASS: TestSmoke/The_resources_should_not_be_there_anymore (0.00s)
    --- PASS: TestSmoke/APM_Server_pods_should_be_eventually_be_removed (9.03s)
PASS
ok      github.com/elastic/cloud-on-k8s/test/e2e        148.171s

I just got this exact error on 7.7.1

K8s version:
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

What should I do?

@maxisam I am unable to reproduce it with minikube 1.18.3 and ES+Kibana 7.7.1 with ECK 1.1.2. Can you share the details of your environment, the manifests, and the specific logs and behavior you're seeing?

It works after I changed the readinessProbe. Thanks!
