Test-infra: Pushgateway serves metrics for dead instances forever.

Created on 4 Jul 2018 · 16 comments · Source: kubernetes/test-infra

The pushgateway doesn't work particularly well with gauges that are provided by kubernetes deployments because the instance field (the pod name) changes whenever the pod restarts, but old metrics for completed pods are never pruned and are continually scraped by upstream.
This is problematic because prometheus doesn't attack timestamps when serving metrics so all instances are considered to be equally 'recent' and we can't easily tell which instance is the current instance.

I think our best bet might be to use something like: https://github.com/Haufe-Lexware/pushgateway-pruner#pushgateway-grouping-pruner
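
For illustration, the pruning idea could be sketched roughly as below. This is not the linked pruner's actual code. It assumes the Pushgateway exposes a `push_time_seconds` series per group on its `/metrics` page and accepts `DELETE` on `/metrics/job/<job>/instance/<instance>`. The function names, the hostname, and the simplified label parsing (no escaped quotes or braces inside label values) are all hypothetical.

```python
import re
import time
from urllib.parse import quote

# Matches lines such as:
#   push_time_seconds{instance="plank-abc",job="plank"} 1.530657e+09
# Simplification: label values containing '}' or escaped quotes are not handled.
_PUSH_TIME = re.compile(r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def stale_groups(metrics_text, max_age_seconds, now=None):
    """Return (job, instance) pairs whose last push is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for line in metrics_text.splitlines():
        match = _PUSH_TIME.match(line.strip())
        if not match:
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        job, instance = labels.get("job"), labels.get("instance")
        # Skip groups that are not keyed by both job and instance.
        if not job or not instance:
            continue
        if now - float(match.group("value")) > max_age_seconds:
            stale.append((job, instance))
    return stale


def delete_url(base_url, job, instance):
    """Build the URL a DELETE request would target to drop one group."""
    return "%s/metrics/job/%s/instance/%s" % (
        base_url.rstrip("/"), quote(job, safe=""), quote(instance, safe=""))
```

A periodic job could fetch the Pushgateway's `/metrics` page, pass the text to `stale_groups`, and send a `DELETE` to each `delete_url(...)` result, for example with `urllib.request.Request(url, method="DELETE")`.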

/area prow
/kind bug
/cc @BenTheElder @rmmh @fejta @krzyzacy

area/prow kind/bug lifecycle/rotten

All 16 comments

Attack? 🙃
/me envisions a Prometheus pod viciously deleting timestamps from logs

Well it doesn't attack timestamps either, but I meant to say 'attach' 😛

@smarterclayton did we have a solution for this at a higher level? Do we have similar issues on DeploymentConfig?

Generally we don't use the push gateway and just report from the process, letting it scrape continuously and reporting metrics on current state from a cache (that's how kube-state-metrics performs a similar reporting action).
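
As a rough sketch of that approach (not how kube-state-metrics or Prow actually implement it), a long-running process can serve its own `/metrics` endpoint from an in-memory cache and let Prometheus scrape it directly. The metric names, the `CACHE` contents, and the port are hypothetical, and a real component would more likely use a Prometheus client library than hand-rolled text output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical cache of current state, refreshed elsewhere in the process.
CACHE = {"prowjobs_pending": 3.0, "prowjobs_running": 7.0}


def render_metrics(cache):
    """Render cached values as gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(cache):
        lines.append("# TYPE %s gauge" % name)
        lines.append("%s %s" % (name, cache[name]))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(CACHE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, answering scrapes with the current cache contents."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Because the endpoint disappears when the pod does, a restarted pod leaves no pushed series behind to prune.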

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

This will become more of an issue now that we are redeploying more often.

/reopen
/remove-lifecycle rotten

@cjwagner: Reopened this issue.

Annnnnnnd now it is biting us.
Velodrome is failing to render graphs with "Request Failed" errors now (almost certainly timeouts). e.g. http://velodrome.k8s.io/dashboard/db/monitoring?orgId=1

We should rename this issue to "Prow misuses the pushgateway" -- we should be fine to not use it at all

@fejta-bot: Closing this issue.
