Test-infra: crier failed to report job after pod pending timeout, no commit found

Created on 14 Aug 2019  路  29Comments  路  Source: kubernetes/test-infra

/area prow/crier
/area prow
/assign @krzyzacy

There are a bunch of these messages on crier startup. Is this an expected state of affairs and if so can we make the message less noisy?

jsonPayload: {
  component:  "crier"   
  error:  "error setting status: status code 422 not one of [201], body: {"message":"No commit found for SHA: 987e7a399c7b270cf22ed0f3c8b948df8b4f5400","documentation_url":"https://developer.github.com/v3/repos/statuses/#create-a-status"}"   
  file:  "prow/crier/controller.go:274"   
  func:  "k8s.io/test-infra/prow/crier.(*Controller).processNextItem"   
  jobName:  "7442d7ab-bdc1-11e9-a69c-de35c67e51d2"   
  jobStatus: {
   build_id:  "1161245346127417346"    
   completionTime:  "2019-08-13T12:57:20Z"    
   description:  "Pod pending timeout."    
   pod_name:  "7442d7ab-bdc1-11e9-a69c-de35c67e51d2"    
   startTime:  "2019-08-13T11:56:58Z"    
   state:  "error"    
   url:  "https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-bazel-test/1161245346127417346"    
  }
  level:  "error"   
  msg:  "failed to report job"   
  prowjob:  "default/7442d7ab-bdc1-11e9-a69c-de35c67e51d2"   
 }
areprow areprocrier help wanted kincleanup lifecyclrotten sitesting

Most helpful comment

Also see the relevant slack conversation.

To summarise my take:

We have advertised the behaviour of triggering on postsubmits as a feature, as well as the specific feature of the branches option matching on tags as well. We have users who depend on this both in the kubernetes project and downstream, for purposes such as publishing tags and ensuring correctly tagged builds are generated.

The current behaviour aroung tag postsubmits is poor - it is surprising that your postsubmits are triggered on tags, and you cannot distringuish between branches and tags in configuration except by naming convention. While we cannot remove this functionality without replacement, I am in favour of moving to a configuration that allows specifically configuring tags separately - it would allow for more configurable and less surprising behaviour.

All 29 comments

/assign @cjwagner

I remember we've talked about this - is it a github thing?

/milestone v1.16

/sig testing

It's possible we will try to report to a commit that's no longer present due to a force-push or rebase. We should detect that case and stop trying to report, it shouldn't be an error.

/unassign

this is a github thing, not specific to crier.

/kind oncall-hotlist

Error:

{
 insertId: "14syhmxfk1r2g6"  
 jsonPayload: {
  component: "crier"   
  error: "error setting status: status code 422 not one of [201], body: {"message":"No commit found for SHA: 7fb7b93714ea8779c197a5fc0aed814bab5c236e","documentation_url":"https://developer.github.com/v3/repos/statuses/#create-a-status"}"   
  file: "prow/crier/controller.go:279"   
  func: "k8s.io/test-infra/prow/crier.(*Controller).processNextItem"   
  jobName: "4860f7dc-ef71-11e9-877f-5ee225b8289a"   
  jobStatus: {
   startTime: "2019-10-15T17:29:03Z"    
   state: "triggered"    
  }
  level: "error"   
  msg: "failed to report job"   
  prowjob: "default/4860f7dc-ef71-11e9-877f-5ee225b8289a"   
 }
 labels: {
  compute.googleapis.com/resource_name: "gke-prow-n1-standard-8-24b71378-l0s8"   
  container.googleapis.com/namespace_name: "default"   
  container.googleapis.com/pod_name: "crier-5f5f445486-wdrnd"   
  container.googleapis.com/stream: "stderr"   
 }
 logName: "projects/k8s-prow/logs/crier"  
 receiveTimestamp: "2019-10-15T17:29:11.459709639Z"  
 resource: {
  labels: {鈥   
  type: "container"   
 }
 severity: "ERROR"  
 timestamp: "2019-10-15T17:29:05Z"  
}

/area prow
/area prow/crier
/kind oncall-hotlist

kubectl get prowjobs

apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  annotations:
    description: 'OWNER: sig-testing; Runs kubernetes unit tests at each commit using
      bazel'
    fork-per-release: "true"
    prow.k8s.io/job: post-kubernetes-bazel-test
    testgrid-alert-email: [email protected], [email protected]
    testgrid-dashboards: sig-release-master-informing, google-unit
  creationTimestamp: "2019-10-15T17:29:03Z"
  generation: 3
  labels:
    created-by-prow: "true"
    event-GUID: 47f01a98-ef71-11e9-97b6-048451a54df3
    preset-bazel-scratch-dir: "true"
    preset-service-account: "true"
    prow.k8s.io/job: post-kubernetes-bazel-test
    prow.k8s.io/refs.org: kubernetes
    prow.k8s.io/refs.repo: kubernetes
    prow.k8s.io/type: postsubmit
  name: 4860f7dc-ef71-11e9-877f-5ee225b8289a
  namespace: default
  resourceVersion: "294449698"
  selfLink: /apis/prow.k8s.io/v1/namespaces/default/prowjobs/4860f7dc-ef71-11e9-877f-5ee225b8289a
  uid: 486903b3-ef71-11e9-90b9-42010a8001da
spec:
  agent: kubernetes
  cluster: default
  context: post-kubernetes-bazel-test
  decoration_config:
    gcs_configuration:
      bucket: kubernetes-jenkins
      default_org: kubernetes
      default_repo: kubernetes
      path_strategy: legacy
    gcs_credentials_secret: service-account
    grace_period: 15s
    timeout: 2h0m0s
    utility_images:
      clonerefs: gcr.io/k8s-prow/clonerefs:v20191014-5b6eb77e5
      entrypoint: gcr.io/k8s-prow/entrypoint:v20191014-5b6eb77e5
      initupload: gcr.io/k8s-prow/initupload:v20191014-5b6eb77e5
      sidecar: gcr.io/k8s-prow/sidecar:v20191014-5b6eb77e5
  extra_refs:
  - base_ref: master
    org: kubernetes
    repo: test-infra
  job: post-kubernetes-bazel-test
  namespace: test-pods
  pod_spec:
    containers:
    - args:
      - test
      - --config=unit
      - --config=remote
      - --remote_instance_name=projects/k8s-prow-builds/instances/default_instance
      - //...
      - //hack:verify-all
      - --
      - -//build/...
      - -//vendor/...
      command:
      - ../test-infra/hack/bazel.sh
      env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: E2E_GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: TEST_TMPDIR
        value: /bazel-scratch/.cache/bazel
      - name: GOPROXY
        value: https://proxy.golang.org
      image: launcher.gcr.io/google/bazel:0.25.2
      name: ""
      resources: {}
      volumeMounts:
      - mountPath: /etc/service-account
        name: service
        readOnly: true
      - mountPath: /bazel-scratch/.cache
        name: bazel-scratch
    volumes:
    - name: service
      secret:
        secretName: service-account
    - emptyDir: {}
      name: bazel-scratch
  refs:
    base_link: https://github.com/kubernetes/kubernetes/commit/551f03f1b346
    base_ref: v1.13.13-beta.0
    base_sha: 7fb7b93714ea8779c197a5fc0aed814bab5c236e
    org: kubernetes
    repo: kubernetes
  report: true
  rerun_auth_config: {}
  type: postsubmit
status:
  build_id: "1184159447686057985"
  completionTime: "2019-10-15T17:31:46Z"
  description: Job failed.
  pendingTime: "2019-10-15T17:29:47Z"
  pod_name: 4860f7dc-ef71-11e9-877f-5ee225b8289a
  startTime: "2019-10-15T17:29:03Z"
  state: failure
  url: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/post-kubernetes-bazel-test/1184159447686057985

/assign @fejta @krzyzacy

This is the problem:

error:

"No commit found for SHA: 7fb7b93714ea8779c197a5fc0aed814bab5c236e"

prowjob

  refs:
    base_link: https://github.com/kubernetes/kubernetes/commit/551f03f1b346
    base_ref: v1.13.13-beta.0
    base_sha: 7fb7b93714ea8779c197a5fc0aed814bab5c236e

https://github.com/kubernetes/kubernetes/commits/551f03f1b346 does exist, where does the 7fb7b937 come from??

We should just detect that this specific error was given to us from the GitHub API and ignore it -- it is totally normal if someone had a PR and force-pushed their commits and overwrote things.

This is a postsubmit

Or wait, on a postsubmit this is some other bug we've seen in the past ...

I think that's the SHA of the tag instead of the commit?

I'm suspecting a race condition in crier? Oh interesting...

Yep...

$ git show 7fb7b93714ea8779c197a5fc0aed814bab5c236e
tag v1.13.13-beta.0
Tagger: Anago GCB <[email protected]>
Date:   Tue Oct 15 11:44:23 2019 +0000

Kubernetes beta release v1.13.13-beta.0

commit 551f03f1b34655a7fef8629a8f445c6e78576b54 (tag: v1.13.13-beta.0)
Author: Anago GCB <[email protected]>
Date:   Tue Oct 15 11:44:23 2019 +0000

    Kubernetes version v1.13.13-beta.0 openapi-spec file updates

diff --git a/api/openapi-spec/swagger.json b/api/openapi-spec/swagger.json
index 5e65ebcdbb..e746d5517e 100644
--- a/api/openapi-spec/swagger.json
+++ b/api/openapi-spec/swagger.json
@@ -2,7 +2,7 @@
   "swagger": "2.0",
   "info": {
    "title": "Kubernetes",
-   "version": "v1.13.12"
+   "version": "v1.13.13"
   },
   "paths": {
    "/api/": {

Also will look into this one

Any updates?

Basically this is the push event for a tag, not pushing commits to the branch.

https://developer.github.com/v3/activity/events/types/#webhook-event-name-36

https://github.com/kubernetes/test-infra/blob/02f11f95f77dd130965ada69ff022984efe8e851/prow/github/types.go#L733-L737

https://github.com/kubernetes/test-infra/blob/b25c3c2748570fc04d52808ec2710ac32168eae3/prow/plugins/trigger/push.go#L48-L56

If it is a tag we could either:
1) not trigger a postsubmit (do we have anyone relying on this behavior?)
2) update handlePE() to get the tag data and trigger this instead: https://developer.github.com/v3/git/tags/#get-a-tag

https://github.com/kubernetes/test-infra/blob/b25c3c2748570fc04d52808ec2710ac32168eae3/prow/plugins/trigger/push.go#L64

I'd be inclined to do (1) and then if we want to do tag based triggering we can support more explicitly than via implicitly treating refs/heads/foo and refs/tags/foo as both a foo branch in https://github.com/kubernetes/test-infra/blob/cb1899b6e99705c68705749a9311a7310e73c671/prow/config/jobs.go#L203-L214

IMO no postsubmits on tags is fine. You push a tag when you like what you have in the repo and it's tested and you want to publish it. That happens after postsubmits run on the commits.

@cjwagner @Katharine @krzyzacy @munnerz @alvaroaleman WDYT?

not trigger a postsubmit (do we have anyone relying on this behavior?)

Yes, we do have publishing postsubmits that rely on getting triggered on tags. Others do as well, e.G. Jetstack

Also see the relevant slack conversation.

To summarise my take:

We have advertised the behaviour of triggering on postsubmits as a feature, as well as the specific feature of the branches option matching on tags as well. We have users who depend on this both in the kubernetes project and downstream, for purposes such as publishing tags and ensuring correctly tagged builds are generated.

The current behaviour aroung tag postsubmits is poor - it is surprising that your postsubmits are triggered on tags, and you cannot distringuish between branches and tags in configuration except by naming convention. While we cannot remove this functionality without replacement, I am in favour of moving to a configuration that allows specifically configuring tags separately - it would allow for more configurable and less surprising behaviour.

How about we add tags and skip_tags to just postsubmits? (you cannot PR a tag change, so presubmits are irrelevant). Once that is in then a flag to turn off tags on branches

Sounds good to me

$ grep -R '\.Branch()'  prow
prow/plugins/trigger/push.go:       BaseRef:  pe.Branch(),
prow/plugins/trigger/push.go:       if shouldRun, err := j.ShouldRun(pe.Branch(), listPushEventChanges(pe)); err != nil {
prow/plugins/slackevents/slackevents.go:                message = fmt.Sprintf("*Warning:* %s (<@%s>) pushed a new branch (%s): %s", pe.Sender.Login, pe.Sender.Login, pe.Branch(), pe.Compare)
prow/plugins/slackevents/slackevents.go:                message = fmt.Sprintf("*Warning:* %s (<@%s>) deleted a branch (%s): %s", pe.Sender.Login, pe.Sender.Login, pe.Branch(), pe.Compare)
prow/plugins/slackevents/slackevents.go:                message = fmt.Sprintf("*Warning:* %s (<@%s>) *force* merged %d commit(s) into %s: %s", pe.Sender.Login, pe.Sender.Login, len(pe.Commits), pe.Branch(), pe.Compare)
prow/plugins/slackevents/slackevents.go:                message = fmt.Sprintf("*Warning:* %s (<@%s>) manually merged %d commit(s) into %s: %s", pe.Sender.Login, pe.Sender.Login, len(pe.Commits), pe.Branch(), pe.Compare)
prow/plugins/slackevents/slackevents.go:    for _, login := range append(mw.WhiteList, mw.BranchWhiteList[pe.Branch()]...) {

Have some work on this, but I got stuck trying to choose the best way to scope down the change

Still an issue, haven't found time to work on it

Same

/unassign

Is this still "oncall-hostlist"-worthy?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Was this page helpful?
0 / 5 - 0 ratings