Test-infra: Pod-utils, bootstrap and testgrid should work with the same set of fields

Created on 2 Feb 2019  ·  29 Comments  ·  Source: kubernetes/test-infra

/area testgrid
/area prow
/area prow/pod-utilities

prow/pod-utils/gcs makes a large pile of dubious claims about deprecation. We need to update this library (as well as initupload and sidecar) to output the same fields as bootstrap.py

https://github.com/kubernetes/test-infra/blob/191f27fef4ef1dd388f9a8e1c08c0fce84de7b0b/prow/pod-utils/gcs/metadata.go#L19-L54

Basically all of this is wrong, most likely because we didn't understand the previous format when we created this structure.
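
For reference, the struct in question looks roughly like this (a paraphrased sketch for context, not a verbatim copy of metadata.go; see the link above for the real definition):

    package sketch

    // Started is an approximate sketch of the struct being criticized;
    // field names and comments are paraphrased for illustration only.
    type Started struct {
        // Timestamp is the epoch second the job started.
        Timestamp int64 `json:"timestamp"`
        // Node is marked deprecated in the current struct.
        Node string `json:"node,omitempty"`
        // Repos is marked deprecated in the current struct.
        Repos map[string]string `json:"repos,omitempty"`
        // Pull is marked deprecated in the current struct.
        Pull string `json:"pull,omitempty"`
        // Metadata holds arbitrary key/value pairs.
        Metadata map[string]interface{} `json:"metadata,omitempty"`
    }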

area/prow area/prow/pod-utilities area/testgrid kind/bug lifecycle/rotten

Most helpful comment

I was hoping we could simply upload prowjob.yaml and get rid of ALL of the metadata in started/finished.json.

(aka, imho moving towards producing the same fields as bootstrap is a step backwards)

All 29 comments

I was hoping we could simply upload prowjob.yaml and get rid of ALL of the metadata in started/finished.json.

(aka, imho moving towards producing the same fields as bootstrap is a step backwards)

Can we instead, please update tools to expect a full-featured:

  • ProwJob CRD YAML
  • Pod YAML

in the artifacts folder? I can work next on making artifact-uploader push those up
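
A minimal sketch of what pushing those up could look like, assuming the usual flow where files written to the local artifacts directory are uploaded by initupload/sidecar; the function name, the artifactsDir parameter, and the exact import paths here are my assumptions, not the actual artifact-uploader code:

    package sketch

    import (
        "io/ioutil"
        "path/filepath"

        "sigs.k8s.io/yaml"

        prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
    )

    // writeProwJobYAML serializes the ProwJob into the job's artifacts
    // directory so the pod-utils upload it alongside everything else.
    func writeProwJobYAML(pj *prowapi.ProwJob, artifactsDir string) error {
        raw, err := yaml.Marshal(pj)
        if err != nil {
            return err
        }
        return ioutil.WriteFile(filepath.Join(artifactsDir, "prowjob.yaml"), raw, 0644)
    }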

/assign

https://github.com/kubernetes/test-infra/issues/10544#issuecomment-454562161 would also be nice either way ... 99% of the time when triaging a job the only thing I really need besides the actual job artifacts is the commit(s) and maybe the start / end time.

Right now we lack the commits, and the YAMLs are going to show e.g. "master" as the ref for periodics, etc.

Today we use an ugly band-aid for a few jobs, writing out a minimal metadata.json with at least the primary repo commit to get around this.
https://github.com/kubernetes/test-infra/blob/ed6a7c4f28a2f87de404f8844a14a512e4c8f3d6/images/bootstrap/runner.sh#L106-L111
https://testgrid.k8s.io/sig-release-master-blocking#verify-master

No one uses artifact-uploader. We run 20,000 builds a day, most of them still use bootstrap, and we have at least a year and a half of this historical data. We should continue uploading the fields that bootstrap.py outputs where they make sense, rather than arbitrarily renaming them for no good reason (to revision, for example).

And while it is certainly the most verbose solution, I am unconvinced that uploading the pod and prowjob yaml is correct and/or most useful.

We should 1000000% resolve https://github.com/kubernetes/test-infra/issues/10544 - whenever someone has some spare cycles.

We run 20,000 builds a day, most of them still use bootstrap

We will migrate them to podutils :-)

most likely because we didn't understand the previous format when creating this structure.

Yep - that was also the problem when we were making podutils, wasn't it? 😏

We should continue uploading the fields that bootstrap.py does that make sense, rather than arbitrarily changing them for no good reason

IMHO version and job-version are bad names, and the bootstrap fields are not very well documented to begin with. (Yes, we might not fully understand the meaning of each field, but all of them can be found in prowjob.yaml, right?) We control the downstream applications; if we have a clean definition, like the struct we created today, we can easily make things like gubernator/spyglass support extracting more info from prowjob.yaml (while still supporting all the old fields for a while, sure).

Also:

    // Node is deprecated.
    Node string `json:"node,omitempty"`
    // Repos is deprecated.
    Repos map[string]string `json:"repos,omitempty"`
    // Pull is deprecated.
    Pull string `json:"pull,omitempty"`

I don't think anyone is actually consuming these fields today?

bootstrap fields are not very well documented

We should do better than this. The code that outputs these fields is open source, in this repo, and created by the same people who work on prow.

It is fine to migrate fields, but we should be more intentional about the reasons why we should do this, rather just because we failed to read the python file or ask someone how to generate the fields correctly.

version and job-version are bad names

These are different things that we intentionally decoupled:

  • job-version - the kubernetes version we tested
  • repo-version - the git commit of the primary repo
  • version - a long-deprecated value from before bootstrap.py which conflated these two
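
As a concrete illustration of the split, a started.json might carry both values roughly like this (all values below are invented for illustration; this is not what any real job uploaded):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        started := map[string]interface{}{
            "timestamp": 1549100000,
            // What the job tested; only k/k jobs go to the trouble of computing this.
            "job-version": "v1.14.0-alpha.2.219+30566b990aa139",
            // The git commit of the primary repo checkout; works for any repo.
            "repo-version": "<primary-repo-git-sha>",
        }
        out, _ := json.MarshalIndent(started, "", "  ")
        fmt.Println(string(out))
    }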

Originally we waited a long time to upload started.json, and this was the responsibility of the job itself (along with cloning any repos). Eventually we migrated this out of the jobs and into bootstrap, which became more general purpose (for any repo) rather than just k/k. Eventually we migrated this out of the test container entirely thanks to pod-utils.

Once this became a general purpose checkout+metadata utility, computing the kubernetes/kubernetes job version:

  • Makes no sense for a test-infra job, and even less sense for a kubeflow and/or rules_k8s job.
  • Requires reading k/k files.
  • Requires extracting GCS files for a CI job.

Basically this value does not belong in a general purpose utility; however, since it is deeply useful to the k/k community, we kept it for our CI and k/k pull jobs.

We created repo-version for the general purpose bootstrap and pod-utils to upload.

Notice how the job-version is v1.13.3 for https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-gce-conformance-stable-1-13/345.

or for a dev version it will be something like v1.14.0-alpha.2.219+30566b990aa139 - https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/34487

This is the wrong value to add into the revision field.

Yes we might not fully understand the meaning of each field, but all of them can be found in prowjob.yaml right?

We should be intentional about the metadata we expose and write these values correctly, not just throw up a pod and hope that our tooling can figure it out.

I don't think anyone is actually consuming these fields today?

Pull is the deprecated version of repos when we only supported cloning one repo. Repos was intended to specify what version of which repos were cloned, which pod-utils stuck in a separate file.

Node we should fix to report the physical node on which testing occurred, as this is important debugging information. It reported this value back when we ran on VMs, but jobs started reporting the container name instead as we migrated them to prow.
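
One possible way to fix that (a sketch only, not how pod-utils work today): expose spec.nodeName to the container via the Kubernetes downward API and have the metadata writer read it from the environment. The NODE_NAME variable name is hypothetical:

    package sketch

    import "os"

    // nodeName returns the physical node the test pod ran on, assuming the
    // pod spec exposes spec.nodeName via the downward API as NODE_NAME
    // (hypothetical). It falls back to the hostname, which is effectively
    // the container name that jobs report today.
    func nodeName() string {
        if n := os.Getenv("NODE_NAME"); n != "" {
            return n
        }
        host, _ := os.Hostname()
        return host
    }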

Semi-related note: for TestGrid at least, other tools produce some of this metadata too, including kubetest and the conformance uploader used to federate results. For the latter, uploading a prowjob spec would be ... odd.

I also don't see the specs covering things like finish time...?

Also note that public testgrid has its own variant of these fields:

https://github.com/kubernetes/test-infra/blob/ed6a7c4f28a2f87de404f8844a14a512e4c8f3d6/testgrid/cmd/updater/main.go#L116-L133

Google's internal testgrid has its own interpretation of things too

Also note that public testgrid has its own variant of these fields:

afaik the OSS updater is not being used?

Google's internal testgrid has its own interpretation of things too

I don't think that uses these fields from started.json either?

I'd still want to keep the existing started.json and finished.json - but I also want to have the additional prowjob.yaml to have more debug information.

Fields like Node - which is computed at https://github.com/kubernetes/test-infra/blob/master/jenkins/bootstrap.py#L577-L593 and has not been working for a while... - can be found in pod.yaml.

Pull and Repos can be derived from Refs in prowjob.yaml.
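
For example, a sketch of deriving a bootstrap-style repos map from the prowjob Refs; the prowapi field names are as I understand them, and the "ref:sha,pr:sha" string format is an approximation of what bootstrap.py writes rather than a verified spec:

    package sketch

    import (
        "fmt"
        "strings"

        prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
    )

    // reposFromRefs builds a bootstrap-style repos map ("org/repo" ->
    // "base_ref:base_sha,pr_number:pull_sha") from the prowjob Refs.
    func reposFromRefs(refs []prowapi.Refs) map[string]string {
        repos := map[string]string{}
        for _, r := range refs {
            parts := []string{fmt.Sprintf("%s:%s", r.BaseRef, r.BaseSHA)}
            for _, p := range r.Pulls {
                parts = append(parts, fmt.Sprintf("%d:%s", p.Number, p.SHA))
            }
            repos[fmt.Sprintf("%s/%s", r.Org, r.Repo)] = strings.Join(parts, ",")
        }
        return repos
    }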

Aka - I agree that we should provide as much information as possible for info & debugging purposes... If you feel strongly that these fields in started.json should not be deprecated, feel free to PR, even re-design, and maintain them :-)

In the meantime I also want to see we start to upload pod.yaml and prowjob.yaml, which is orthogonal.

Sure, sidecar and initupload (or artifact-uploader) should absolutely read these fields from the pod and prowjob to populate them, but the pod and prowjob are the wrong abstraction for a build result.

This does not mean that we should require testgrid, resultstore, gubernator, etc to understand the internals of prow and kubernetes in order to do their work. There should be a (Result) layer of abstraction between the job runner (whether prow, circleci, travis, etc) and what displays these results to the end user.
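
As one possible shape of that layer, a hypothetical runner-agnostic Result type might look like the following; none of these field names are settled or taken from any existing tool:

    package sketch

    import "time"

    // Result is a hypothetical runner-agnostic summary of a build, the kind
    // of abstraction being argued for here.
    type Result struct {
        Started  time.Time
        Finished time.Time
        Passed   bool
        // Repos maps "org/repo" to the revision(s) that were tested.
        Repos map[string]string
        // JobVersion is whatever version string the job itself chose to report.
        JobVersion string
        // Metadata carries arbitrary runner- or job-specific key/value pairs.
        Metadata map[string]string
    }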

I'm mostly sad that every time I feel like I understand the fields from bootstrap, it turns out I don't :-) (for example, apparently repo-version only works for k/k)

Having a unified result object from a job sounds reasonable - but I feel that's different from trying to back-fill the old fields in started.json? (For example, Pull currently only reveals the SHA of the main repo; for a proper result object it needs to show all repos?) - Maybe there are other fields we are interested in?

(I'm just not convinced these bespoke fields suddenly become the source of truth for all our jobs :-) Especially since they have gone through the Jenkins -> Bootstrap -> Podutil migrations)

for a proper result object it needs to show all repos

Repos is where we put info about all the repos we checked out. Pull specifies which PR it is testing:
https://github.com/kubernetes/test-infra/blob/ed6a7c4f28a2f87de404f8844a14a512e4c8f3d6/jenkins/bootstrap.py#L327-L328

these bespoke fields suddenly become the source of truth for all our jobs :-) Especially since they have gone through the Jenkins -> Bootstrap -> Podutil migrations

We went through the jenkins -> containerized jenkins -> bootstrap containerized jenkins -> bootstrap prow while maintaining backwards compatibility.

We should do the same for the ongoing podutil migration.

So my request is that:

  • we are intentional about the field choices we make and why
  • fields we change should require updating all our tooling, which includes bootstrap and internal testgrid
  • all things being equal, we should continue using existing fields (although things often need to change too)

See also:
https://github.com/istio/istio/blob/master/bin/ci2gubernator.sh
https://github.com/istio/test-infra/tree/master/toolbox/ci2gubernator

Which appear to be used to populate gubernator + testgrid from circleCI, alongside Prow based results.

I think Kubeflow is also populating GCS:
https://github.com/kubeflow/kubeflow/blob/76107ff3aa7f9fde6fd7f15f2ec0a69a35874efa/testing/run.sh

kubeflow has a homebrewed bash script which ends up cloning test-infra and running bootstrap.py, and I believe it's the same case for istio... (aka they are not hand-crafting started/finished.json)

See above: istio is not using bootstrap to do this on CircleCI; they have their own Go program to write our metadata format.

ok, I'll agree with the issue as long as all the fields are well documented, we know what each field means, and they will be maintained in the future.

Why shouldn't tools just read the ProwJob or Pod? I guess for computed fields like duration, if you don't want to do ProwJob.Status.FinishedTimestamp - ProwJob.Status.StartedTimestamp on every client, maybe, but Refs should be presented identically IMO - no "job version" or otherwise.

The actual k8s objects are stable and well documented, they've got helpers and serialization targets that you can import in go, it's clear what fields may or may not be set...
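
For example, the duration computation every client would have to repeat might look like this; the ProwJob status field names here are from memory and may not match the API exactly:

    package sketch

    import (
        "time"

        prowapi "k8s.io/test-infra/prow/apis/prowjobs/v1"
    )

    // duration does the "finished minus started" arithmetic on the client
    // side, the inconvenience mentioned above. Returns 0 for a job that has
    // not completed yet.
    func duration(pj prowapi.ProwJob) time.Duration {
        if pj.Status.CompletionTime == nil {
            return 0
        }
        return pj.Status.CompletionTime.Time.Sub(pj.Status.StartTime.Time)
    }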

Should everything uploading to testgrid be coupled to prow jobs..? We have, e.g., OpenStack's CI uploading CI signal for their cloud provider integration against Kubernetes.

Right now it's easy to target the minimal metadata fields without even depending on test-infra.


Hmm okay, that's valid; I didn't know we had non-prow uploads. Let's create a reference implementation of the metadata digestion from ProwJobs and Pods and use that to keep us documented, etc.

Refs should be presented identically IMO - no "job version" or otherwise.

The reason we split version into job-version and repo-version is so that we can do exactly this for repo-version and repos, which is owned by the checkout tool, not something the job (aka test container) has the right to control.

On the other hand job-version is owned by the job internals, which can set it to whatever it wants. Our k/k jobs have some complicated shenanigans that involve bash, reading extracted files and/or git tags to produce v1.14.0-alpha.2.219+30566b990aa139, and we expect to find this value after the job completes in testgrid, gubernator, etc.

Sure, but repo-version should be the Refs struct and _only_ the Refs struct as derived from the ProwJobSpec, yeah?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
