The pushgateway doesn't work particularly well with gauges that are provided by kubernetes deployments because the instance field (the pod name) changes whenever the pod restarts, but old metrics for completed pods are never pruned and are continually scraped by the upstream Prometheus.
This is problematic because prometheus doesn't attack timestamps when serving metrics so all instances are considered to be equally 'recent' and we can't easily tell which instance is the current instance.
I think our best bet might be to use something like: https://github.com/Haufe-Lexware/pushgateway-pruner#pushgateway-grouping-pruner
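For context, this is roughly how those stale groups accumulate. A minimal sketch (not Prow's actual code) of a component pushing a gauge with its pod name as the instance grouping label, assuming the standard prometheus/client_golang push package and a hypothetical metric name and pushgateway address:

```go
package main

import (
	"log"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical gauge reporting some current state.
	queueSize := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "example_queue_size",
		Help: "Number of items currently queued.",
	})
	queueSize.Set(42)

	// Each pod pushes under its own name, so every restart creates a new
	// group on the pushgateway. The old groups are never deleted and keep
	// getting scraped, which is the problem described above.
	if err := push.New("http://pushgateway:9091", "example_component").
		Grouping("instance", os.Getenv("POD_NAME")).
		Collector(queueSize).
		Push(); err != nil {
		log.Fatal(err)
	}
}
```

The pruner linked above works around this by periodically deleting groups that haven't been pushed to recently, if I'm reading its README correctly.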
/area prow
/kind bug
/cc @BenTheElder @rmmh @fejta @krzyzacy
Attack? 🙃
/me envisions a Prometheus pod viciously deleting timestamps from logs
Well it doesn't attack timestamps either, but I meant to say 'attach' 😛
@smarterclayton did we have a solution for this at a higher level? Do we have similar issues on DeploymentConfig?
Generally we don't use the push gateway and just report from the process, letting Prometheus scrape it continuously and reporting metrics on current state from a cache (that's how kube-state-metrics performs a similar reporting action).
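For reference, a minimal sketch of that scrape-based approach, assuming the standard client_golang promhttp handler (the metric name and port here are illustrative, not anything Prow actually exposes):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var queueSize = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "example_queue_size", // hypothetical metric name
	Help: "Number of items currently queued.",
})

func main() {
	prometheus.MustRegister(queueSize)

	// Recompute the gauge from current state on whatever cadence we like;
	// Prometheus scrapes this endpoint directly, so when the pod goes away
	// its series disappear instead of lingering on a pushgateway.
	queueSize.Set(42)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```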
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This will become more of an issue now that we are redeploying more often.
/reopen
/remove-lifecycle rotten
@cjwagner: Reopened this issue.
In response to this:
This will become more of an issue now that we are redeploying more often.
/reopen
/remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Annnnnnnd now it is biting us.
Velodrome is failing to render graphs with "Request Failed" errors now (almost certainly timeouts). e.g. http://velodrome.k8s.io/dashboard/db/monitoring?orgId=1
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
We should rename this issue to "Prow misuses the pushgateway" -- we should be fine to not use it at all
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.