e.g. https://storage.googleapis.com/k8s-gubernator/triage/index.html#6028a5ad633695a738fc
Since the results are getting picked up by triage and gubernator, we can conclude they're landing in GCS.
The summary page https://k8s-testgrid.appspot.com/sig-node-containerd#Summary shows the following for "e2e-gci". The "Tests last ran on" value is indicative of our problem:
Last green run: 7706
Tests last ran on: 09-26 10:26
Last update: 09-27 09:21
/priority critical-urgent
/kind bug
/sig testing
/area testgrid
FYI @cjwagner as oncall, @michelle192837 as testgrid
This only seems to be affecting some jobs. For example, the following has up-to-date results: https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#pull-kubernetes-bazel-build
TestGrid logs don't indicate any failures to update, or failures to contact GCS (for viewing results or uploading the state). Investigating further.
Logs have:
Invalid version in /kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1044652869598318592/finished.json(key:job-version): {u'metadata': {u'version': u'unknown', u'infra-commit': u'1ccb6478e', u'job-version': u'unknown', u'pod': u'3479ce5d-c0e9-11e8-898a-0a580a6c0606'}, u'version': u'unknown', u'timestamp': 1537900058, u'job-version': u'unknown', u'passed': False, u'result': u'FAILURE'}
Invalid version in /kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1044652869598318592/finished.json(key:version): {u'metadata': {u'version': u'unknown', u'infra-commit': u'1ccb6478e', u'job-version': u'unknown', u'pod': u'3479ce5d-c0e9-11e8-898a-0a580a6c0606'}, u'version': u'unknown', u'timestamp': 1537900058, u'job-version': u'unknown', u'passed': False, u'result': u'FAILURE'}
Hmm, though I don't know that that's actually stopping any updates; the update proceeds afterwards. Going to try debugging this locally with more logging.
Update:
During Tuesday's outage, tot was down for a while, so we saw snowflake IDs. When tot was restored, the build IDs picked up again, but we sort everything in descending order, so the much larger snowflake IDs sort ahead of the newer builds.
@michelle192837 is doing some clean up
and, we should kill tot :upside_down_face:
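To make the ordering problem concrete, here is a minimal sketch. It assumes build IDs are compared numerically and sorted in descending order; the real TestGrid sorting code may differ. The snowflake ID is copied from the log above, while the short sequential IDs and the labels are made up for illustration.

```python
# Hypothetical build list: short sequential (tot-style) IDs plus one
# snowflake-style ID. Only the snowflake ID comes from the log in this thread;
# the other IDs and all labels are illustrative.
builds = [
    ("7701", "before outage"),
    ("7702", "before outage"),
    ("1044652869598318592", "during outage"),
    ("7703", "after tot restored"),
    ("7704", "after tot restored"),
]


def newest_first(builds):
    """Sort builds by numeric ID, largest first (assumed ordering)."""
    return sorted(builds, key=lambda b: int(b[0]), reverse=True)


ordered = newest_first(builds)
for build_id, when in ordered:
    print(build_id, when)
```

Under that assumption the snowflake-ID build always lands in the first position, so the builds created after tot came back never look like the most recent ones.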
I'm going to be removing recently-generated builds that have snowflake IDs (i.e. within the last 3 days) from kubernetes-jenkins/logs; ETA for identifying and removing all of them about an hour. Once that's done, TestGrid should go update builds it missed within the last day. If you're missing results and really need to check up on them, feel free to regenerate your dashboard by bumping up 'days_of_results' in your test group config by 1 (which will force TestGrid to re-collect those results).
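For anyone checking their own job's builds, a rough way to spot snowflake-style IDs in a list of build directory names is sketched below. The digit-count threshold and the function names are assumptions for illustration, not what the cleanup actually used, and the sketch only classifies names; it does not check build age or delete anything.

```python
# Assumption: sequential tot IDs are far shorter than snowflake IDs, so a
# digit-count threshold is enough to tell them apart. The threshold is a guess.
SNOWFLAKE_MIN_DIGITS = 15


def looks_like_snowflake(build_id: str) -> bool:
    """Return True if the name is all digits and long enough to be a snowflake ID."""
    return build_id.isdigit() and len(build_id) >= SNOWFLAKE_MIN_DIGITS


def split_builds(build_ids):
    """Partition names into (snowflake-style, everything else), preserving order."""
    snowflake = [b for b in build_ids if looks_like_snowflake(b)]
    other = [b for b in build_ids if not looks_like_snowflake(b)]
    return snowflake, other


# Example input: one short ID, the snowflake ID from the log, one non-numeric name.
candidates, rest = split_builds(["7706", "1044652869598318592", "latest-build.txt"])
print("snowflake-style:", candidates)
print("other:", rest)
```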
Alright, weird builds should be cleaned up!
thanks @michelle192837 :-)
/woof
And having double-checked some dashboards, looks like they're properly showing recent results from today. Think we can consider this closed, and probably work on a more long-term item to prevent snowflake vs. tot IDs from doing this again.
/close
@michelle192837: Closing this issue.
Thanks Michelle! (and everyone who helped!)
Should we file a follow-up to discuss tot vs snowflake IDs?
we don't have an open issue ?!
/shrug
I don't think we do.
Yeah, a follow-up issue for that sounds great ^^