As a Prow administrator, I need to:
- understand what hook did with a given event
- see what actions hook plugins took on that event
- detect when a component (e.g. jenkins-operator) is having anomalous behavior

/cc @kargakis
Since we've got structured logs, maybe we can stand up ELK and use that to make this easier?
/cc @spxtr
I've been using stackdriver logging for this. It lets you do queries based on the JSON logs, such as jsonPayload.pr=12345 to filter for all events related to PR 12345.
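For example, a filter like the one below narrows things to hook's events for a single PR (the component field is an assumption about how prow's structured fields land in jsonPayload; lines in the Stackdriver filter box are ANDed together):

```
jsonPayload.component="hook"
jsonPayload.pr=12345
```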
That's available if you're on GKE only, right?
I'm not sure TBH. I was assuming other cloud providers have their own logging systems that can do queries such as the one above. I don't think we should implement our own in prow.
StackDriver itself is supposed to be usable even with AWS, but I'm not sure about integrating it with k8s/prow. It looks like there are docs for this at https://kubernetes.io/docs/tasks/debug-application-cluster/logging-stackdriver/
I don't think we should implement our own in prow.
+1, Prow should use native logging in Kubernetes.
We shouldn't implement something ourselves, which is why I suggested ELK, but in general we also shouldn't say "Prow is great! Stand it up! Oh, you want an interface that makes it easy to administer the cluster? You're on your own!"
Perhaps some doc that describes basic maintenance and debugging patterns?
@spxtr yeah, such a doc would be nice. prow should be giving us the necessary logs (e.g. https://github.com/kubernetes/test-infra/pull/4885); a logging stack is orthogonal to the system.
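As a concrete illustration of what those structured logs look like on the prow side, here's a minimal logrus sketch (the field names and event message are illustrative, not lifted from hook's actual code):

```go
package main

import (
	"github.com/sirupsen/logrus"
)

func main() {
	// Emit JSON so whatever logging backend is in use (Stackdriver, ELK, ...)
	// can index the individual fields.
	logrus.SetFormatter(&logrus.JSONFormatter{})

	// Attach structured fields once; every log line carries them, so the
	// backend can filter on e.g. jsonPayload.pr or jsonPayload.component.
	log := logrus.WithFields(logrus.Fields{
		"component": "hook",
		"org":       "kubernetes",
		"repo":      "test-infra",
		"pr":        12345,
	})
	log.Info("handling issue_comment event")
}
```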
Also, we need to start exposing metrics from the controllers if we want to detect anomalous behavior properly. For example, we should expose a metric that records failed API calls to the underlying agent. Then we could set up a Prometheus alert that triggers when failed API calls exceed a given percentage of total calls over a period of time. If the jenkins-operator has been failing to trigger builds for the past hour, that's a problem we need to be notified about.
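A rough sketch of what that could look like with client_golang; the metric names, labels, and alert expression here are hypothetical, not what the controllers expose today:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counters; the real jenkins-operator metrics may differ.
var (
	apiRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "jenkins_requests_total",
			Help: "Total requests made to the Jenkins API.",
		},
		[]string{"verb"},
	)
	apiRequestErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "jenkins_request_errors_total",
			Help: "Total failed requests to the Jenkins API.",
		},
		[]string{"verb"},
	)
)

func main() {
	prometheus.MustRegister(apiRequests, apiRequestErrors)

	// The controller would bump these around every API call, e.g.:
	//   apiRequests.WithLabelValues("POST").Inc()
	//   if err != nil { apiRequestErrors.WithLabelValues("POST").Inc() }
	//
	// A Prometheus alert could then fire on the error ratio, e.g.:
	//   rate(jenkins_request_errors_total[1h]) / rate(jenkins_requests_total[1h]) > 0.1

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```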
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with a /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
A lot of these are being addressed by the tracer, and some are exposed via metrics; we should mostly be OK here, I think.
/close