What did you do?
For some reason, 3 days ago, our 3 masters were replaced by AWS. All good, everything still works on Kubernetes. We got alerts for "K8SKubeletDown", as expected.
I tried restarting the prometheus operator, and then deleting and recreating the kubelet servicemonitor, but nothing changed.
What did you expect to see?
The new masters scraped and added automatically by the prometheus operator kubelet servicemonitor. I wasn't sure what would happen to the old ones; I guess at a certain point they should be removed?
What did you see instead? Under which circumstances?
I still see the old master nodes scraped and shown as "down", and of course we keep getting the alerts. And I don't see the new masters.
Environment
Kubernetes 1.7.2 running on AWS and set up using kops.
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.1", GitCommit:"1dc5c66f5dd61da08412a74221ecc79208c2165b", GitTreeState:"clean", BuildDate:"2017-07-14T02:00:46Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.2", GitCommit:"922a86cfcd65915a9b2f69f3f193b8907d741d9c", GitTreeState:"clean", BuildDate:"2017-07-21T08:08:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes cluster kind:
kops
Manifests:
Default setup of prometheus operator 0.13.0. Everything was working properly before the masters replacement.
ts=2017-09-26T11:15:20Z caller=operator.go:306 component=prometheusoperator msg="syncing nodes into Endpoints object failed" err="synchronizing kubelet endpoints object failed: updating kubelet endpoints object failed: Endpoints \"kubelet\" is invalid: [subsets[0].addresses[0].nodeName: Forbidden: Cannot change NodeName for 172.25.111.117 to ip-172-25-96-238.eu-west-1.compute.internal, subsets[0].addresses[1].nodeName: Forbidden: Cannot change NodeName for 172.25.119.81 to ip-172-25-96-238.eu-west-1.compute.internal, subsets[0].addresses[2].nodeName: Forbidden: Cannot change NodeName for 172.25.121.60 to ip-172-25-96-238.eu-west-1.compute.internal, subsets[0].addresses[4].nodeName: Forbidden: Cannot change NodeName for 172.25.55.18 to ip-172-25-96-238.eu-west-1.compute.internal, subsets[0].addresses[5].nodeName: Forbidden: Cannot change NodeName for 172.25.65.86 to ip-172-25-96-238.eu-west-1.compute.internal, subsets[0].addresses[6].nodeName: Forbidden: Cannot change NodeName for 172.25.76.1 to ip-172-25-96-238.eu-west-1.compute.internal]"
I did open the exact same issue today #643
I found the solution (not related to prometheus-operator):
kubectl get endpoints kubelet -o yaml > kubelet.yaml
Manually remove the dead nodes from kubelet.yaml
kubectl replace -f kubelet.yaml
@mbugeia @emas80 we recently changed how this is handled: we had problems updating the kubelet Endpoints object, so we now sync the state every 3 minutes. So this is actually a problem with the Prometheus Operator.
Thanks @mbugeia , replacing manually the nodes worked.
@brancz let me know if I can be of any help.
The Prometheus Operator should be syncing the nodes into the object every 3 minutes, so it seems there is a bug in this sync.
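Roughly, the intended behaviour is something like the following sketch (the function and channel names are mine, not the actual operator code): a periodic resync of the kubelet Endpoints object on top of the event-driven updates, so stale addresses age out within a few minutes.

```go
package main

import "time"

// syncNodeEndpoints is a stand-in for the operator's routine that rebuilds
// the kubelet Endpoints object from the current node list.
func syncNodeEndpoints() error { return nil }

// runKubeletSync retries the sync on a fixed interval in addition to any
// event-driven updates, so a failed or missed update is retried later.
func runKubeletSync(stopCh <-chan struct{}) {
	ticker := time.NewTicker(3 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = syncNodeEndpoints() // errors surface in the operator's logs
		case <-stopCh:
			return
		}
	}
}

func main() {
	stopCh := make(chan struct{})
	go runKubeletSync(stopCh)
	time.Sleep(100 * time.Millisecond) // keep the example short-lived
	close(stopCh)
}
```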
@emas80 from the log line, it seems that the problematic thing is that we're trying to set the node name: https://github.com/coreos/prometheus-operator/blob/d372f28b01f5fac4bce946ed99544d1c32486a47/pkg/prometheus/operator.go#L380
Could you try removing that line? It doesn't really serve a purpose; we only added it because the information happened to be there.
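To make that concrete, the code path looks roughly like this with the NodeName assignment dropped (a sketch only, not the exact operator code; the import path also depends on the client-go version in use):

```go
package main

import v1 "k8s.io/api/core/v1" // older releases used k8s.io/client-go/pkg/api/v1

// endpointAddresses builds the address list for the kubelet Endpoints object
// from the cluster's nodes, using each node's internal IP.
func endpointAddresses(nodes *v1.NodeList) []v1.EndpointAddress {
	addresses := make([]v1.EndpointAddress, 0, len(nodes.Items))
	for _, node := range nodes.Items {
		var ip string
		for _, addr := range node.Status.Addresses {
			if addr.Type == v1.NodeInternalIP {
				ip = addr.Address
				break
			}
		}
		if ip == "" {
			continue
		}
		addresses = append(addresses, v1.EndpointAddress{
			IP: ip,
			// NodeName: &node.Name, // the line suggested for removal
		})
	}
	return addresses
}

func main() {
	_ = endpointAddresses(&v1.NodeList{})
}
```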
Hi @brancz , I tried removing that line, and the results are interesting.
Apparently the server got removed after some time, and the new one was added correctly, but the endpoints have been sort of duplicated.
I can see every server reported twice under prometheus for the kubelet service "target": one on port 10255, the second on port 4194 under the label endpoint: "cadvisor".
I have never seen port 4194 before, even on version 0.10.1. I don't know if it's normal or not.
I then see in the logs every once in a while:
```
ts=2017-09-27T16:38:27Z caller=operator.go:306 component=prometheusoperator msg="syncing nodes into Endpoints object failed" err=null
```
I then tried killing another server in the cluster, and once again the old server was removed automatically after a few minutes and the new one appeared correctly.
I still see that error in the logs.
If I then set version 0.13.0 again, on the same cluster where I tried the "new" version, I keep seeing in the logs:
ts=2017-09-27T16:38:27Z caller=operator.go:306 component=prometheusoperator msg="syncing nodes into Endpoints object failed" err=null
The target with the new port is expected: that is the port cAdvisor natively listens on; its metric output is just incorporated into the kubelet's metric output. A user can choose which one to scrape, but be aware that starting with Kubernetes 1.7.3 the cAdvisor metrics have been removed from the /metrics endpoint of the kubelet and moved to /metrics/cadvisor. The /metrics endpoint on port 4194 only exports cAdvisor metrics throughout all versions, so I would recommend scraping the metrics from there.
I just looked into the code and there is actually a silly mistake here:
We forgot to guard the logging with an `if err != nil { ... }`. The good news, however, is that you're not actually erroring :slightly_smiling_face: . Do you want to create a pull request to fix these two issues?
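For reference, the missing guard is just something along these lines (illustrative only, not the actual operator code), which is why you currently see `err=null` even on successful syncs:

```go
package main

import (
	"errors"
	"log"
)

// syncNodeEndpoints stands in for the operator's sync routine; it returns an
// error only when something actually went wrong.
func syncNodeEndpoints() error {
	return errors.New("example failure")
}

func main() {
	// Guard the log statement so a nil error is not reported as a failure.
	if err := syncNodeEndpoints(); err != nil {
		log.Printf("syncing nodes into Endpoints object failed: %v", err)
	}
}
```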
Sure, I will do it!
Thanks!
Do you plan to make a release soon with this fix? It's kind of critical for us, since the bug also appears for new nodes in the cluster: the kubelet endpoint is never updated.
Also, is there a workaround to regenerate the kubelet endpoint?
Yes, a release should be coming up soon! :slightly_smiling_face:
Thanks !
@brancz Is there any roadmap/milestone we can follow for the next release? I'm just trying to get some sort of ETA on "soon".
Thanks :)
We're actually just finishing up some work, and the next release will be out within the next few days.
I just bumped into this bug as well. As a side note: trying to reverse engineer where the kube-system/kubelet service and endpoint came from was a bit of a pain. I think it would be preferable to bias towards explicit configuration (that's easily modifiable by an operator) over hidden magic; my kubelets were exposing cAdvisor on a different port, and there was no clear way to reconfigure prom-operator to scrape the appropriate port aside from changing what port the kubelets expose.
Eagerly awaiting a fix!
Thanks for the release! 👍 my kubelet endpoint looks OK now.
@brancz Thanks for the release. Everything looks fine on our end now !
Thanks for the feedback! Happy it did the trick! Closing here.
Hi @brancz
Can you make a new release for v0.11.2 with this fix?
We are using v0.11.2 on Kubernetes v1.6.7.
I’m currently at ossummit eu and won’t get to it anytime this week; please open an issue so I’ll remember to backport the fix.
Was there a release for v0.13.0?
I _think_ I am hitting this issue:
ts=2017-10-27T15:16:41Z caller=operator.go:306 component=prometheusoperator msg="syncing nodes into Endpoints object failed" err="synchronizing kubelet endpoints object failed: updating kubelet endpoints object failed: Endpoints \"kubelet\" is invalid: [subsets[0].addresses[0].nodeName: Forbidden: Cannot change NodeName for 10.0.50.171 to test-k8s-worker04.cluster2, subsets[0].addresses[1].nodeName: Forbidden: Cannot change NodeName for 10.0.50.172 to test-k8s-worker04.cluster2, subsets[0].addresses[2].nodeName: Forbidden: Cannot change NodeName for 10.0.50.110 to test-k8s-worker04.cluster2, subsets[0].addresses[3].nodeName: Forbidden: Cannot change NodeName for 10.0.50.112 to test-k8s-worker04.cluster2, subsets[0].addresses[4].nodeName: Forbidden: Cannot change NodeName for 10.0.50.161 to test-k8s-worker04.cluster2, subsets[0].addresses[5].nodeName: Forbidden: Cannot change NodeName for 10.0.50.162 to test-k8s-worker04.cluster2, subsets[0].addresses[6].nodeName: Forbidden: Cannot change NodeName for 10.0.50.163 to test-k8s-worker04.cluster2]"
But I can't see any release fixing this other than v0.11.3, which mentions this issue under BUGFIX.
The fix landed in the v0.14.0 release. Sorry for missing it in the release notes.
@brancz Got it
Can I upgrade to v0.14.0 but keep prometheus and alertmanager at their current versions? I don't want to go from 1.7.1 to 2.0 right now, but I would still like this fix.
I tried a hacky update on one of my test clusters by editing the deployment/prometheus-operator and updating the prometheus-operator image to 0.14.0, but I ended up with an error when it tried to update the prometheus statefulset, so I guess I will have to update the whole thing.
Thanks!
If your Prometheus object has the version pinned, then it's no problem to just upgrade the Prometheus Operator version. The Prometheus Operator supports all versions of Prometheus as described in https://github.com/coreos/prometheus-operator/blob/master/Documentation/compatibility.md#prometheus .
I tried again to just change the prometheus operator version to 0.14.0 while leaving my prometheus and Alertmanager CRDs untouched.
I do specify the version (1.7.1), since I am basically running the v0.13.0 version of both those CRDs.
I get the following error though; I'm not sure what field it is trying to update. I'll see if I can find out.
ts=2017-10-30T12:15:00Z caller=operator.go:670 component=prometheusoperator msg="sync prometheus" key=monitoring/k8s
ts=2017-10-30T12:15:00Z caller=operator.go:979 component=prometheusoperator msg="updating config skipped, no configuration change"
E1030 12:15:00.456938 1 operator.go:577] Sync "monitoring/k8s" failed: updating statefulset failed: StatefulSet.apps "prometheus-k8s" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden.
EDIT: Deleting the StatefulSet and letting the new operator create it with exactly the same settings seems to have fixed the error; I was not able to find out what changes it was trying to apply.
@brancz This issue still persists; we are using 2.15.2. Let me know if there is anything we can do to enable this 3-minute polling.
@brancz still happens on v0.38.1
I am also facing this issue with Prometheus-operator version v0.38.1. How can I resolve this?
Please update to the latest version; you're more than 6 minor versions behind.
@brancz , I have used the latest prometheus-operator image quay.io/prometheus-operator/prometheus-operator:v0.44.0, and the issue still persists. I can see the old node getting scraped for kubelet and showing as a target down.

These are the logs of the prometheus operator pod:
➜ ~ kubectl logs -f kube-prometheus-stack-operator-7f977f6b86-f7znc
level=info ts=2020-12-17T06:40:51.344872356Z caller=main.go:235 msg="Starting Prometheus Operator" version="(version=0.44.0, branch=refs/tags/pkg/apis/monitoring/v0.44.0, revision=35c9101c332b9371172e1d6cc5a57c065f14eddf)"
level=info ts=2020-12-17T06:40:51.344915254Z caller=main.go:236 build_context="(go=go1.14.12, user=paulfantom, date=20201202-15:42:46)"
level=warn ts=2020-12-17T06:40:51.344923645Z caller=main.go:239 msg="'--config-reloader-image' flag is ignored, only '--prometheus-config-reloader' is used" config-reloader-image=docker.io/jimmidyson/configmap-reload:v0.4.0 prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.44.0
ts=2020-12-17T06:40:51.349291525Z caller=main.go:107 msg="Starting insecure server on [::]:8080"
level=info ts=2020-12-17T06:40:51.357157809Z caller=operator.go:436 component=alertmanageroperator msg="connection established" cluster-version=v1.18.8
level=info ts=2020-12-17T06:40:51.357189124Z caller=operator.go:445 component=alertmanageroperator msg="CRD API endpoints ready"
level=info ts=2020-12-17T06:40:51.357236946Z caller=operator.go:300 component=thanosoperator msg="connection established" cluster-version=v1.18.8
level=info ts=2020-12-17T06:40:51.357258584Z caller=operator.go:309 component=thanosoperator msg="CRD API endpoints ready"
level=info ts=2020-12-17T06:40:51.358168723Z caller=operator.go:420 component=prometheusoperator msg="connection established" cluster-version=v1.18.8
level=info ts=2020-12-17T06:40:51.35822798Z caller=operator.go:429 component=prometheusoperator msg="CRD API endpoints ready"
level=info ts=2020-12-17T06:40:52.5434552Z caller=operator.go:261 component=thanosoperator msg="successfully synced all caches"
level=info ts=2020-12-17T06:40:53.143358973Z caller=operator.go:277 component=alertmanageroperator msg="successfully synced all caches"
level=warn ts=2020-12-17T06:40:53.14344985Z caller=operator.go:1345 component=alertmanageroperator msg="alertmanager key=monitoring-stack/kube-prometheus-stack-alertmanager, field spec.baseImage is deprecated, 'spec.image' field should be used instead"
level=info ts=2020-12-17T06:40:53.143508338Z caller=operator.go:661 component=alertmanageroperator msg="sync alertmanager" key=monitoring-stack/kube-prometheus-stack-alertmanager
level=info ts=2020-12-17T06:40:53.259671523Z caller=operator.go:359 component=prometheusoperator msg="successfully synced all caches"
level=warn ts=2020-12-17T06:40:53.259754378Z caller=operator.go:1276 component=prometheusoperator msg="prometheus key=monitoring-stack/kube-prometheus-stack-prometheus, field spec.baseImage is deprecated, 'spec.image' field should be used instead"
level=info ts=2020-12-17T06:40:53.259813417Z caller=operator.go:1163 component=prometheusoperator msg="sync prometheus" key=monitoring-stack/kube-prometheus-stack-prometheus
level=info ts=2020-12-17T06:40:53.344039927Z caller=operator.go:661 component=alertmanageroperator msg="sync alertmanager" key=monitoring-stack/kube-prometheus-stack-alertmanager
level=info ts=2020-12-17T06:40:53.575422112Z caller=operator.go:661 component=alertmanageroperator msg="sync alertmanager" key=monitoring-stack/kube-prometheus-stack-alertmanager
@Khemdevi can you run the operator with debug log level?