The pushgateway doesn't work particularly well with gauges that are provided by kubernetes deployments because the instance field (the pod name) changes whenever the pod restarts, but old metrics for completed pods are never pruned and are continually scraped by the upstream Prometheus.
This is problematic because prometheus doesn't attack timestamps when serving metrics so all instances are considered to be equally 'recent' and we can't easily tell which instance is the current instance.
I think our best bet might be to use something like: https://github.com/Haufe-Lexware/pushgateway-pruner#pushgateway-grouping-pruner
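For context, this is roughly how those stale groups accumulate. A minimal sketch (not Prow's actual code) of a component pushing a gauge with its pod name as the instance grouping label, assuming the standard prometheus/client_golang push package and a hypothetical metric name and pushgateway address:

```go
package main

import (
	"log"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical gauge reporting some current state.
	queueSize := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "example_queue_size",
		Help: "Number of items currently queued.",
	})
	queueSize.Set(42)

	// Each pod pushes under its own name, so every restart creates a new
	// group on the pushgateway. The old groups are never deleted and keep
	// getting scraped, which is the problem described above.
	if err := push.New("http://pushgateway:9091", "example_component").
		Grouping("instance", os.Getenv("POD_NAME")).
		Collector(queueSize).
		Push(); err != nil {
		log.Fatal(err)
	}
}
```

The pruner linked above works around this by periodically deleting groups that haven't been pushed to recently, if I'm reading its README correctly.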
/area prow
/kind bug
/cc @BenTheElder @rmmh @fejta @krzyzacy
Attack? 🙃
/me envisions a Prometheus pod viciously deleting timestamps from logs
Well it doesn't attack timestamps either, but I meant to say 'attach' 😛
@smarterclayton did we have a solution for this at a higher level? Do we have similar issues on DeploymentConfig?
Generally we don't use the push gateway and just report from the process, letting Prometheus scrape it continuously and reporting metrics on current state from a cache (that's how kube-state-metrics performs a similar reporting action).
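For reference, a minimal sketch of that scrape-based approach, assuming the standard client_golang promhttp handler (the metric name and port here are illustrative, not anything Prow actually exposes):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var queueSize = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "example_queue_size", // hypothetical metric name
	Help: "Number of items currently queued.",
})

func main() {
	prometheus.MustRegister(queueSize)

	// Recompute the gauge from current state on whatever cadence we like;
	// Prometheus scrapes this endpoint directly, so when the pod goes away
	// its series disappear instead of lingering on a pushgateway.
	queueSize.Set(42)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```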
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This will become more of an issue now that we are redeploying more often.
/reopen
/remove-lifecycle rotten
@cjwagner: Reopened this issue.
In response to this:
This will become more of an issue now that we are redeploying more often.
/reopen
/remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Annnnnnnd now it is biting us.
Velodrome is failing to render graphs with "Request Failed" errors now (almost certainly timeouts). e.g. http://velodrome.k8s.io/dashboard/db/monitoring?orgId=1
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
We should rename this issue to "Prow misuses the pushgateway" -- we should be fine to not use it at all
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.