What would you like to be added:
Make it possible to use the inrepoconfig feature with the prow.k8s.io instance
Why is this needed:
Today, writing presubmit jobs has two major hurdles:
> Today as an oncall member I can trivially:
How do I do those things with inrepoconfig?
> Audit all prowjobs by inspecting the source of truth in test-infra
One option could be to upload the job definition to GCS, which would also allow us to easily answer the question "What config was this job run with?" after the ProwJob CR is cleaned up.
> File PRs (with approval) to change or remove any prowjob for any sort of misbehavior (due to the OWNERS hierarchy)
Maybe make the approval plugin approve all PRs that touch .prow.yaml from people who are approvers in test-infra?
> Regenerate the config maps easily when they are broken (with the bootstrapper)
This problem does not exist, as jobs from inrepoconfig are never persisted to a ConfigMap; they are fetched from the repo on demand and only exist in memory and as a ProwJob CR.
> Enforce that presubmits are run against prowjob configs for basic sanity checks (via test-infra presubmits)
inrepoconfig already executes jobs from both the central config and the .prow.yaml file. It is strongly suggested to set up checkconfig in the central repo to vet the config in order to avoid issues. One problem remains here, though: there is no way to validate the jobs before they get executed (except for the allowed_clusters setting).
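For reference, a minimal sketch of the central-config side, assuming the standard in_repo_config stanza (exact field names can vary with the Prow version, and the repo key is only an example):

```yaml
# Central Prow config (sketch): opt a repo into inrepoconfig and restrict
# which build clusters its .prow.yaml jobs are allowed to target.
in_repo_config:
  enabled:
    # Keys may be "*", an org, or an org/repo.
    "kubernetes/test-infra": true
  allowed_clusters:
    # Jobs defined in this repo's .prow.yaml may only run in these clusters.
    "kubernetes/test-infra":
      - default
```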
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
I am strongly opposed to adopting in-repo config for prow.k8s.io:
It makes it much harder for us to centrally manage test jobs. It is frequently valuable to observe and make sweeping, automated changes to all jobs. This would become much harder without all the jobs being in one place for observation. Losing the ability to make bulk updates, disable/enable batches of poorly-behaved jobs on a whim, or use our batch tooling (e.g. the job fork and rotation tools) is a non-starter for me.
It further lowers the trust bar for running new jobs: now we don't even need jobs to be committed; you can simply create a PR and your arbitrary job will be run. This might make it harder to make future changes like, for instance, ensuring only jobs in certain repos can have a particular service account.
We can try to preemptively handle every case that might come up, but I don't think the result will be understandable or effective.
Additionally, it is not terribly clear to me what the real benefit here is — being able to test your jobs without committing any record of what that job was is potentially nice (if maybe risky), but I'm not sold on the real value of keeping jobs in their own repos. Editing multiple repos with major changes is a fact of life in a multi-repo world regardless.
> Additionally, it is not terribly clear to me what the real benefit here is — being able to test your jobs without committing any record of what that job was is potentially nice (if maybe risky), but I'm not sold on the real value of keeping jobs in their own repos. Editing multiple repos with major changes is a fact of life in a multi-repo world regardless.
It's not just being able to test jobs, it's also:
> It further lowers the trust bar for running new jobs: now we don't even need jobs to be committed; you can simply create a PR and your arbitrary job will be run. This might make it harder to make future changes like, for instance, ensuring only jobs in certain repos can have a particular service account.
If you are a trusted user, yes. Most jobs are already based on executing some script in the target repo, so being able to manipulate the job itself doesn't change much about the trust model. Additionally, the GCS reporter will upload the job to GCS.
Generally, it would be quite easy to extend the validation to include whatever is relevant. If you want to easily add prow-instance-specific validations, the validation could be moved into a webservice that is called prior to creating jobs.
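One possible shape for such a webservice (my illustration, not an existing Prow component): since every job ultimately becomes a ProwJob custom resource, the check could be wired as a validating admission webhook that runs on ProwJob creation. All names below are hypothetical:

```yaml
# Sketch of a validating admission webhook on ProwJob creation
# (hypothetical names; not something Prow ships today).
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: prowjob-policy
webhooks:
  - name: prowjob-policy.example.com
    rules:
      - apiGroups: ["prow.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["prowjobs"]
    clientConfig:
      service:
        namespace: prow
        name: prowjob-policy   # hypothetical validation service
        path: /validate
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail        # reject the job if the check fails or is unreachable
```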
> It makes it much harder for us to centrally manage test jobs. It is frequently valuable to observe and make sweeping, automated changes to all jobs. This would become much harder without all the jobs being in one place for observation. Losing the ability to make bulk updates, disable/enable batches of poorly-behaved jobs on a whim, or use our batch tooling (e.g. the job fork and rotation tools) is a non-starter for me.
Some of the problems here only exist because of the central repo, like the config forking. Others, like sweeping changes, could be solved by using presets. Disabling jobs based on any kind of property could be part of the validation; this would also be a better way to communicate it back than just having the jobs vanish.
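For context on the preset point: presets live in the central config and are merged into any job, central or in-repo, that carries the matching label, so sweeping changes to things like credentials or mounts can still be made in one place. A rough sketch with made-up names:

```yaml
# Central config (sketch): a preset any job can opt into via a label;
# editing it here changes every matching job, including in-repo ones.
presets:
  - labels:
      preset-example-credentials: "true"   # made-up label
    env:
      - name: EXAMPLE_CREDENTIALS          # made-up variable
        value: /etc/example/creds.json
    volumes:
      - name: example-creds
        secret:
          secretName: example-creds        # made-up secret
    volumeMounts:
      - name: example-creds
        mountPath: /etc/example
        readOnly: true
```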
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
> Most jobs are already based on executing some script in the target repo, so being able to manipulate the job itself doesn't change much about the trust model. Additionally, the GCS reporter will upload the job to GCS.
It substantially does; you can completely alter the podspec, etc.
I don't care much about the script being run in most cases (we do in some), but I nearly always care about the definition of the job, including resource requests, mounts, k8s service account, etc. Not-intentionally-malicious resource misuse and abuse is a bigger concern.
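To make that concrete: everything under `spec` in a .prow.yaml presubmit is an ordinary Kubernetes PodSpec, so a PR author could set fields like the ones below (all names are made up for illustration):

```yaml
# Hypothetical .prow.yaml presubmit showing the fields at issue:
# resource requests, mounts, and the service account the pod runs as.
presubmits:
  - name: pull-example-test
    always_run: true
    decorate: true
    spec:
      serviceAccountName: some-privileged-sa   # hypothetical
      containers:
        - image: gcr.io/example/test-image     # hypothetical
          command: ["./hack/run-tests.sh"]
          resources:
            requests:
              cpu: "8"
              memory: 32Gi
          volumeMounts:
            - name: build-secrets
              mountPath: /etc/secrets
      volumes:
        - name: build-secrets
          secret:
            secretName: some-secret            # hypothetical
```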
That said, we do have clusters in which we do NOT allow a single job to be configured with a script from another repo; users will need to continue to use our tools for those clusters for security reasons, since things there are in fact relatively locked down.
That is enforced via the test-infra configuration validation presubmit and review today.
Most projects also rely on images from this repo, for which we control rollout via automated config changes.
In any case, if anyone wants this change, I strongly suggest you contribute to the wg-k8s-infra effort to move prow.k8s.io to community infrastructure, at which time the group of active infrastructure maintainers can decide this.
Currently this work is being done nearly exclusively by @spiffxp, with reviews by myself.
The existing oncall rotation maintaining the existing infrastructure previously discussed this issue and was against it for the time being.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten