Test-infra: Testgrid not reporting any job status since 11-18-2020

Created on 23 Nov 2020 · 17 Comments · Source: kubernetes/test-infra

What happened:

No jobs are reporting status of runs on testgrid, though I can see they are running successfully in Prow.

Please provide links to example occurrences, if any:

All jobs on sig-release-master-blocking: https://testgrid.k8s.io/sig-release-master-blocking

/cc @kubernetes/ci-signal

kind/bug

All 17 comments

/cc

The cause is known (Updater is unable to run), but unsure how to fix atm. Will continue working on this today, apologies for the trouble folks!

Related to https://github.com/GoogleCloudPlatform/testgrid/issues/211 (though this is not a code issue; it is an issue with actually running the updater).

Until the issue is resolved, I am going to make grid=old results the default results on testgrid.k8s.io. (This will take a few hours; ETA 3-4 PM PT?)

Tabs should be showing more recent results now! However, the summary tab will still show old summaries for now. (I'll look at a mitigation if we can't get things running properly today).

Thanks @michelle192837!

Summary issue is mitigated, so recent summaries should also be showing up! (I'll keep this issue open since the underlying cause isn't _fixed_, but users broadly shouldn't have issues with stale summaries or tabs anymore.)

Alright, underlying issue appears to be fixed, so I'm going to remove the mitigations and go back to the official way of running stuff. (That should be live tomorrow morning, around 9-11 AM PT?)

@michelle192837 Looks like the summaries are updating correctly:
image

But the tabs are still showing old data:
image

The &grid=old workaround will still show the new data though.

The updater _is_ running this time, so that's good. It appears to log that it's updated states (and I've confirmed that the states in storage _are_ recently updated, so the timestamps on the summary check out), but all the dashboards I've checked don't have recent results, so something seems suspicious here.

We have the same problem for some of our jobs, e.g.:
https://testgrid.k8s.io/sig-scalability-gce#gce-cos-master-scalability-100

But surprisingly, this one looks fine as an example:
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance

Ah, thanks for pointing this out!

Pinging @fejta in case you have ideas!

Yeah, correlating some of the update messages with their tabs more closely, it does look like some of these are perfectly fine and have updated pretty recently, but not a high percentage of groups.

From the logs, it looks like we're making progress such that logs report up to 25% of groups updated...but then they stop? (Main updater is on the 10-20 image I believe, canary is on 12-03 but seems to have the same issue, though it looks like it got to 31% sometimes).

We also aren't seeing the message Update completed in <time>. The last time we saw this was on 12-03, Update completed in 2h15m31.458828297s, so it seems like we aren't actually updating all groups. (I believe update order is consistent, so that would explain why some are up to date even though a number aren't).

Two issues are 1) concurrency/node size is too high, resulting in evictions and 2) groups are always updated in the same order

I have a solution for (2) and am trying to mitigate (1) without just using a bigger node

Looks like Erick's fixes have resolved this issue, and everything is updating consistently. Thanks Erick! 🎉

/assign @fejta

Test-grid is not reporting job status for "ppc64le unit tests" since 11-12-2020
https://k8s-testgrid.appspot.com/sig-node-ppc64le#unit-tests is showing results only till 11th December (11-12-2020)
As per the workaround mentioned in https://github.com/kubernetes/test-infra/issues/20010, I am able to view all the latest results via https://k8s-testgrid.appspot.com/sig-node-ppc64le#unit-tests&grid=old
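
If it helps anyone scripting around this, the workaround link above can be built from a regular tab link with a small Go helper. This is only a sketch: `withOldGrid` is a hypothetical name, and it assumes the link already carries a tab name after `#`, as in the URL above, and simply appends `&grid=old` to it.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withOldGrid appends "&grid=old" to a TestGrid tab link of the form
// https://host/dashboard#tab. It assumes the tab name is in the URL
// fragment, as in the links quoted in this thread.
func withOldGrid(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Fragment == "" {
		return "", fmt.Errorf("no tab fragment in %q", raw)
	}
	// Leave the link alone if the workaround is already present.
	for _, part := range strings.Split(u.Fragment, "&") {
		if part == "grid=old" {
			return raw, nil
		}
	}
	return raw + "&grid=old", nil
}

func main() {
	link, err := withOldGrid("https://k8s-testgrid.appspot.com/sig-node-ppc64le#unit-tests")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(link)
}
```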

https://github.com/GoogleCloudPlatform/testgrid/pull/303 should have mitigated this @Rajalakshmi-Girish and is now in production

This is also caused by the 27k stale tests, which I can look into
