Test-infra: Expose git data to prow plugins

Created on 16 Jun 2017 · 12 comments · Source: kubernetes/test-infra

Some research I've been doing. I'm interested in any comments by parties interested in prow. @apelisse was interested at one point.

Motivating use-cases:

  1. The approval handler is currently done in mungegithub. It needs to list all changed files in a PR and use nearby OWNERS files to decide whether the PR is approved or not.
  2. A gofmt/vet/lint plugin that comments on the diff in a PR. It needs exact diff information along with the full files, so that it can see the entire package but only comment on added lines.
  3. The trigger plugin should use configuration checked into each repo under test so that changes can be tested in PRs.

Solution 1: GitHub API alone

The GitHub API exposes low-level git data. To use this, we will need to reimplement much of git's friendly UX on top of it. To solve (2) above, we must do the following when a PR event comes in:

  1. Compare the PR head and merge base. This gives us a list of files that are changed in the PR, along with the patches.
  2. Get each .go file and decode. If we want to get the entire package instead of just changed files then we need to make several more API calls.
  3. Run them through golint. Because we only have individual files and not the whole package, we might get a spurious "set a package-level doc comment" if we set the confidence too low.
  4. Parse the patches from (1) and make a comment for any golint problems on added lines.

This is rather painful for plugin authors to understand. However, solving (3) above would be very easy with this method. It would just be a constant couple of API calls per PR update.
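
For concreteness, here is a minimal sketch of step 1 in Go, hitting the GitHub compare endpoint directly. The unauthenticated request, the hard-coded field subset, and the `changedFiles` helper are illustrative assumptions, not existing prow code.

```go
// Sketch: list the files changed in a PR by comparing the merge base with
// the PR head through the GitHub compare endpoint.
package compare

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// changedFile is the subset of the compare response a plugin would need:
// the file name plus its unified-diff patch.
type changedFile struct {
	Filename string `json:"filename"`
	Patch    string `json:"patch"`
}

type compareResponse struct {
	Files []changedFile `json:"files"`
}

// changedFiles fetches the files changed between base and head. Real code
// would authenticate, paginate, and check the response status.
func changedFiles(owner, repo, base, head string) ([]changedFile, error) {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/compare/%s...%s", owner, repo, base, head)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var cmp compareResponse
	if err := json.NewDecoder(resp.Body).Decode(&cmp); err != nil {
		return nil, err
	}
	return cmp.Files, nil
}
```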

Solution 2: Keep a local git clone

Solving (2) now looks like this when a PR event comes in:

  1. If we already have the repo cloned, do a fetch and make sure we have the HEAD commit from the PR.
  2. Checkout a working tree, try to merge the PR.
  3. Run the changed files through golint. Use standard git and unix tools to handle patches and so on.
  4. Make comments.

This is tricky because we now need to keep a mirror git repository. All of a sudden we need to be careful about corrupt repos, garbage collection, and the other nuances of git. We need to very carefully control which operations we allow, since plugins run in parallel and so the git commands need to be thread-safe. This is simpler and more flexible for plugins, but more complicated to implement.
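
A rough sketch of steps 1 and 2 of this approach, assuming a git binary on the PATH and a per-repo mutex to serialize commands. The `Repo` type, ref names, and scratch branch are hypothetical, not prow's actual implementation.

```go
// Sketch: keep one local clone per repo and serialize git commands with a
// mutex so parallel plugins don't corrupt it.
package repo

import (
	"fmt"
	"os/exec"
	"sync"
)

// Repo wraps a local clone of a single GitHub repository.
type Repo struct {
	mu  sync.Mutex // serializes git commands against this clone
	dir string     // path to the local clone, e.g. /var/cache/prow/test-infra
	url string     // remote URL, e.g. https://github.com/kubernetes/test-infra
}

// git runs one git command inside the clone directory.
func (r *Repo) git(args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = r.dir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("git %v: %v: %s", args, err, out)
	}
	return nil
}

// CheckoutMergedPR fetches the base branch and the PR head, checks out the
// base on a scratch branch, and tries to merge the PR (steps 1 and 2 above).
// It assumes user.name and user.email are configured in the clone.
func (r *Repo) CheckoutMergedPR(baseBranch string, prNumber int) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	// GitHub exposes every PR head as refs/pull/<number>/head.
	if err := r.git("fetch", r.url,
		fmt.Sprintf("+%s:refs/prow/base", baseBranch),
		fmt.Sprintf("+pull/%d/head:refs/prow/head", prNumber)); err != nil {
		return err
	}
	if err := r.git("checkout", "-B", "pr-merge", "refs/prow/base"); err != nil {
		return err
	}
	return r.git("merge", "--no-ff", "-m", "PR merge", "refs/prow/head")
}
```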

area/prow

All 12 comments

We also have to consider future use-cases, beyond the current set of motivating use-cases.
Are we going to have more and more of these use-cases, and if so, how are these two solutions
going to scale? What other types of operations might be needed in the future?
Even if the number of use-cases stays low, a single use-case that needs all the files in the
repository could exhaust the token's API rate limit in one pass, even with proper caching of all
operations. Beyond the rate limit, it would also most likely be extremely slow (n calls to the API
rather than a single optimized git fetch).

Checkout a working tree, try to merge the PR.

I'm not sure why we need to try to merge the PR here. Do we need write access to the repo? Because if we do, we have to discard the first solution.

It's interesting to note that if a plugin wants to make local changes to the repository, we will indeed have to use this second solution. This would of course create a bunch of new thread-safety problems (more on that below).

This is tricky because we now need to keep a mirror git repository. All of a sudden we need to be careful about corrupt repos, garbage collection, and the other nuances of git. We need to very carefully control which operations we allow, since plugins run in parallel and so the git commands need to be thread-safe. This is simpler and more flexible for plugins, but more complicated to implement.

Can we have the repository path be mounted read-only for plugins? That would prevent all sorts of thread-safety issues.

If we want plugins to be able to change the repository locally, we could have each plugin make a git-clone copy of the repository. Or if that's too expensive, we could have the GIT_DIR shared amongst plugins, and each plugin have their own GIT_WORK_TREE (potentially on a tmpfs for maximal speed).
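
A sketch of what the shared-GIT_DIR idea could look like, also giving each plugin a private index file so parallel checkouts never touch the same state. The package, paths, and `checkoutForPlugin` helper are made up for illustration.

```go
// Sketch: one shared GIT_DIR, with a private work tree and index per plugin
// so concurrent checkouts never write to the same files.
package worktree

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// checkoutForPlugin materializes commit sha into a working tree owned by a
// single plugin, reading objects from the shared bare clone at sharedGitDir.
func checkoutForPlugin(sharedGitDir, pluginName, sha string) (string, error) {
	base := filepath.Join(os.TempDir(), "prow-worktrees") // could be a tmpfs
	workTree := filepath.Join(base, pluginName)
	if err := os.MkdirAll(workTree, 0o755); err != nil {
		return "", err
	}
	env := append(os.Environ(),
		"GIT_DIR="+sharedGitDir,                                    // shared objects and refs
		"GIT_WORK_TREE="+workTree,                                  // per-plugin checkout
		"GIT_INDEX_FILE="+filepath.Join(base, pluginName+".index"), // per-plugin index
	)
	for _, args := range [][]string{
		{"read-tree", sha},             // load the commit's tree into the private index
		{"checkout-index", "-a", "-f"}, // write every indexed file into the work tree
	} {
		cmd := exec.Command("git", args...)
		cmd.Dir = workTree
		cmd.Env = env
		if out, err := cmd.CombinedOutput(); err != nil {
			return "", fmt.Errorf("git %v: %v: %s", args, err, out)
		}
	}
	return workTree, nil
}
```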

PS: This comment has been edited.

Agreed that solution 1 will probably not suffice.

I'm not sure why we need to try to merge the PR here. Do we need write access to the repo? Because if we do, we have to discard the first solution.

We do need to merge locally so that we're testing the whole file as it is in the PR. You can't run go vet on a patch alone!

If we want plugins to be able to change the repository locally, we could have each plugin make a git-clone copy of the repository.

I think that idea makes the most sense. It's not too expensive, especially if we can get away with a shallow clone and some automated cleanup.
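
Something like the following, perhaps; the helper name, the depth of 1, and the temp-dir handling are assumptions, not an existing prow utility.

```go
// Sketch: give a plugin its own throwaway shallow clone and always clean up.
package clone

import (
	"fmt"
	"os"
	"os/exec"
)

// withShallowClone clones url at depth 1 into a temp directory, runs fn
// against the checkout, and removes the directory afterwards.
func withShallowClone(url string, fn func(dir string) error) error {
	dir, err := os.MkdirTemp("", "plugin-clone-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir) // automated cleanup
	if out, err := exec.Command("git", "clone", "--depth", "1", url, dir).CombinedOutput(); err != nil {
		return fmt.Errorf("clone: %v: %s", err, out)
	}
	return fn(dir)
}
```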

/joke

@rmmh: What do you call cheese by itself? Provolone..

In response to this:

/joke

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Cloning locally makes sense to me. We are doing git commit message validation on every PR and I could see that being moved into a prow plugin. Do prow plugins run in parallel? From a quick look I had, it seemed they didn't.

cc: @stevekuznetsov

Interesting idea -- would be extremely cool to factor in some of the other ideas that had been floating around in issues like https://github.com/kubernetes/test-infra/issues/3023 -- any tools based on parsing Go code could plug into this and make comments.

Re: go vet and go lint -- they both require $GOPATH-level scope (for instance, checks like unkeyed_composite_literals need the definition of the struct to work) so I think you _need_ a full checkout with dependencies and $GOPATH to get the best output from vet and lint.

We are doing git commit message validation on every PR and I could see that being moved into a prow plugin.

Sounds like a good motivating case number 4.

Do prow plugins run in parallel? From a quick look I had, it seemed they didn't.

No, but we should be safe to do so. At least, I put a comment saying that it should work.

No, but we should be safe to do so. At least, I put a comment saying that it should work.

Now they do: https://github.com/kubernetes/test-infra/pull/3159 :)

We are doing git commit message validation on every PR and I could see that being moved into a prow plugin.

This seems like it would be easiest with the GitHub API, not a local clone. I think you would just use this API call to get the commit messages for the PR.
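
For reference, a minimal sketch of that approach against the pulls/{number}/commits endpoint; the unauthenticated request and the helper name are placeholders, not existing plugin code.

```go
// Sketch: read the commit messages of a PR from the GitHub API.
package commits

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type prCommit struct {
	Commit struct {
		Message string `json:"message"`
	} `json:"commit"`
}

// prCommitMessages returns the message of every commit in a pull request.
// Real code would authenticate, paginate, and check the response status.
func prCommitMessages(owner, repo string, number int) ([]string, error) {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/pulls/%d/commits", owner, repo, number)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var commits []prCommit
	if err := json.NewDecoder(resp.Body).Decode(&commits); err != nil {
		return nil, err
	}
	msgs := make([]string, 0, len(commits))
	for _, c := range commits {
		msgs = append(msgs, c.Commit.Message)
	}
	return msgs, nil
}
```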

@spxtr our validation touches both the commit messages and their contents -- an example validation is that a commit touching vendored code for repo X declares itself as such in the message. Looks like you can grab commits, blobs, and trees, but it seems to me that a validator a dev can run quickly and effectively on their local checkout is more useful than one doing so over the web and relying on GitHub.

Oh I see. Yes, it makes sense to do a local clone for that case.

See this doc for some more thoughts.
