As a Prow administrator, I need to:
- understand what hook did with a given event
- see what actions hook plugins took on that event
- detect when a component (e.g. jenkins-operator) is having anomalous behavior

/cc @kargakis
Since we've got structured logs, maybe we can stand up ELK and use that to make this easier?
/cc @spxtr
I've been using stackdriver logging for this. It lets you do queries based on the JSON logs, such as jsonPayload.pr=12345 to filter for all events related to PR 12345.
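For example, a filter like the one below narrows things to hook's events for a single PR (the component field is an assumption about how prow's structured fields land in jsonPayload; lines in the Stackdriver filter box are ANDed together):

```
jsonPayload.component="hook"
jsonPayload.pr=12345
```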
That's available if you're on GKE only, right?
I'm not sure TBH. I was assuming other cloud providers have their own logging systems that can do queries such as the one above. I don't think we should implement our own in prow.
StackDriver itself is supposed to be usable even with AWS, but I'm not sure about integrating it with k8s/prow. It looks like there are docs for this at https://kubernetes.io/docs/tasks/debug-application-cluster/logging-stackdriver/
I don't think we should implement our own in prow.
+1, Prow should use native logging in Kubernetes.
We shouldn't implement something ourselves, which is why I suggested ELK, but in general we also shouldn't say "Prow is great! Stand it up! Oh, you want an interface that makes it easy to administer the cluster? You're on your own!"
Perhaps some doc that describes basic maintenance and debugging patterns?
@spxtr yeah, such a doc would be nice. prow should be giving us the necessary logs (e.g. https://github.com/kubernetes/test-infra/pull/4885); a logging stack is orthogonal to the system.
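As a concrete illustration of what those structured logs look like on the prow side, here's a minimal logrus sketch (the field names and event message are illustrative, not lifted from hook's actual code):

```go
package main

import (
	"github.com/sirupsen/logrus"
)

func main() {
	// Emit JSON so whatever logging backend is in use (Stackdriver, ELK, ...)
	// can index the individual fields.
	logrus.SetFormatter(&logrus.JSONFormatter{})

	// Attach structured fields once; every log line carries them, so the
	// backend can filter on e.g. jsonPayload.pr or jsonPayload.component.
	log := logrus.WithFields(logrus.Fields{
		"component": "hook",
		"org":       "kubernetes",
		"repo":      "test-infra",
		"pr":        12345,
	})
	log.Info("handling issue_comment event")
}
```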
Also, we need to start exposing metrics from the controllers if we want to detect anomalous behavior properly. For example, we should expose a metric that records failed API calls to the underlying agent. Then we could set up a Prometheus alert that triggers when failed API calls exceed a given percentage of total calls over a period of time. If the jenkins-operator has been failing to trigger builds for the past hour, that's a problem we need to be notified about.
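A rough sketch of what that could look like with client_golang; the metric names, labels, and alert expression here are hypothetical, not what the controllers expose today:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counters; the real jenkins-operator metrics may differ.
var (
	apiRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "jenkins_requests_total",
			Help: "Total requests made to the Jenkins API.",
		},
		[]string{"verb"},
	)
	apiRequestErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "jenkins_request_errors_total",
			Help: "Total failed requests to the Jenkins API.",
		},
		[]string{"verb"},
	)
)

func main() {
	prometheus.MustRegister(apiRequests, apiRequestErrors)

	// The controller would bump these around every API call, e.g.:
	//   apiRequests.WithLabelValues("POST").Inc()
	//   if err != nil { apiRequestErrors.WithLabelValues("POST").Inc() }
	//
	// A Prometheus alert could then fire on the error ratio, e.g.:
	//   rate(jenkins_request_errors_total[1h]) / rate(jenkins_requests_total[1h]) > 0.1

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```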
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with a /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
A lot of these are being addressed by the tracer, and some are exposed via metrics; we should mostly be OK here, I think.
/close