What happened:
W1004 01:31:15.149] Creating router [e2e-51036-95a39-nat-router]...
W1004 01:31:18.991] ....................failed.
W1004 01:31:19.173] ERROR: (gcloud.compute.routers.create) Quota 'ROUTERS' exceeded. Limit: 10.0 globally.
Please provide links to example occurrences, if any:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/51036/pull-kubernetes-e2e-gce-100-performance/1179926617770692608/
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/51036/pull-kubernetes-kubemark-e2e-gce-big/1179926617825218560/
Anything else we need to know?:
Potential boskos cleaning issue
from boskos janitor logs:
jsonPayload: {
error: "exit status 1"
level: "error"
msg: "failed to clean up project k8s-jkns-e2e-gke-ci-canary, error info: Activated service account credentials for: [[email protected]]
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global
To search the help text of gcloud commands, run:
gcloud help -- SEARCH_TERMS
Error try to delete resources disks: CalledProcessError()
[=== Start Janitor on project 'k8s-jkns-e2e-gke-ci-canary' ===]
[=== Activating service_account /etc/service-account/service-account.json ===]
[=== Finish Janitor on project 'k8s-jkns-e2e-gke-ci-canary' with status 1 ===]
"
}
cc @krzyzacy
It looks like the image was last updated a month ago: https://github.com/kubernetes/test-infra/commit/0fd634d70423b71b85bba2bc0687c0fbb732c31e
gcloud compute disks delete --help
NAME
gcloud compute disks delete - delete Google Compute Engine persistent disks
SYNOPSIS
gcloud compute disks delete DISK_NAME [DISK_NAME ...] [--zone=ZONE]
[GCLOUD_WIDE_FLAG ...]
DESCRIPTION
gcloud compute disks delete deletes one or more Google Compute Engine
persistent disks. Disks can be deleted only if they are not being used by
any virtual machine instances.
POSITIONAL ARGUMENTS
DISK_NAME [DISK_NAME ...]
Names of the disks to delete.
FLAGS
--zone=ZONE
Zone of the disks to delete. If not specified and the compute/zone
property isn't set, you may be prompted to select a zone.
To avoid prompting when this flag is omitted, you can set the
compute/zone property:
$ gcloud config set compute/zone ZONE
# gcloud compute disks delete --help | tail
--flags-file, --flatten, --format, --help, --log-http, --project, --quiet,
--trace-token, --user-output-enabled, --verbosity. Run $ gcloud help for
details.
NOTES
These variants are also available:
$ gcloud alpha compute disks delete
$ gcloud beta compute disks delete
# gcloud compute disks delete --global
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global
To search the help text of gcloud commands, run:
gcloud help -- SEARCH_TERMS
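For the record, the invocation that the help text above actually supports is per-zone; a minimal sketch, where DISK_NAME and ZONE are placeholders rather than the janitor's real variables:
# Delete a zonal persistent disk; --quiet skips the confirmation prompt.
gcloud compute disks delete "${DISK_NAME}" --zone="${ZONE}" --quiet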
I can't tell yet what actually broke, or when. AFAICT we've been running an image from August since then and haven't been having issues; also, the previous image has the same missing --global flag ...
$ kubectl get po -n=test-pods -l=app=boskos-janitor-nongke
NAME READY STATUS RESTARTS AGE
boskos-janitor-nongke-7c78646b5d-8rwjm 1/1 Running 0 6d6h
boskos-janitor-nongke-7c78646b5d-wm2fn 1/1 Running 0 6d9h
boskos-janitor-nongke-7c78646b5d-xj5sl 1/1 Running 0 6d8h
boskos-janitor-nongke-7c78646b5d-xl2bw 1/1 Running 0 6d3h
I exec'd into the pods and, unsurprisingly, they do seem to be running the janitor script from when the image was updated, so I don't think any terribly recent changes were actually deployed.
@krzyzacy feel free to punt this back, but I don't have the context on what happened here.
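(The check was roughly of the shape below; the pod name comes from the listing above, but the script path inside the image is an assumption:)
# Compare the janitor script baked into a running pod against the repo copy.
# The in-container path is a guess; adjust to wherever the image installs it.
kubectl exec -n test-pods boskos-janitor-nongke-7c78646b5d-8rwjm -- \
  cat /janitor/gcp_janitor.py | diff - boskos/janitor/gcp_janitor.py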
@dims can you fill us in on the router issue with cluster-api-provider-gcp?
We are obviously not cleaning up routers in https://github.com/kubernetes/test-infra/blob/master/boskos/janitor/gcp_janitor.py#L35-L67
It also seems gcloud deprecated some flags (that --global one), but that should be unrelated.
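For illustration, a router cleanup pass would boil down to something like the commands below; this is only a sketch of the gcloud calls involved, not the actual gcp_janitor.py change, and PROJECT is a placeholder for the leased boskos project:
# Sketch: enumerate and delete every Cloud Router left in the project.
gcloud compute routers list --project="${PROJECT}" \
    --format='value(name,region.basename())' | while read -r name region; do
  gcloud compute routers delete "${name}" \
      --project="${PROJECT}" --region="${region}" --quiet
done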
Thanks @krzyzacy Sen!
@BenTheElder the new CAPG job uses boskos to acquire a project to create the actual cluster (it uses kind to bootstrap and then GCP to run the actual cluster), and it seems to have ended up with some problems. I do try to clean that up here, but some runs may have run into trouble and ended up leaking:
https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/master/hack/ci/e2e-conformance.sh#L101-L105
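For what it's worth, one way a script like this can avoid leaking the NAT router when a run dies part-way is an EXIT trap; a minimal sketch, where CLUSTER_NAME, GCP_PROJECT, and GCP_REGION are assumed variable names rather than whatever e2e-conformance.sh actually uses (the -nat-router suffix matches the leaked router name in the log above):
# Sketch: best-effort router cleanup on any exit; variable names are assumptions.
cleanup_router() {
  gcloud compute routers delete "${CLUSTER_NAME}-nat-router" \
      --project="${GCP_PROJECT}" --region="${GCP_REGION}" --quiet || true
}
trap cleanup_router EXIT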
@BenTheElder @krzyzacy Here's a fix for one more thing that could leak:
https://github.com/kubernetes/test-infra/pull/14617
Waiting for https://github.com/kubernetes/test-infra/pull/14617 to merge, and then we need to update the deployment.
In case anyone can see https://github.com/kubernetes/test-infra/compare/master...BenTheElder:github-compare-is-broken-ugh?expand=1 or https://github.com/kubernetes/test-infra/compare/master...BenTheElder:upgrade-gcloud-bazel?expand=1: I can't file the PR because GitHub is erroring.
... after a few minutes of server errors: https://github.com/kubernetes/test-infra/pull/14622, https://github.com/kubernetes/test-infra/issues/14623
OK, so we have the gcloud bump in; running a new https://prow.k8s.io/?job=ci-test-infra-autobump-prow and then we'll let prow bump / deploy.
see https://github.com/kubernetes/kubernetes/issues/83493 for the real root cause 🤦‍♂️
TL;DR: these scale presubmits are using a fixed GCP project. I've bumped the quota 3x, from 10 -> 30, but I have no idea if that's sufficient.
So far, through manual polling, I've observed a max of 16/30.
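(For anyone who wants to poll it themselves, the ROUTERS usage shows up in the project quota info; a sketch, assuming jq is available and <scale-project> stands in for the fixed project these presubmits use:)
# Check current usage vs. limit for the project-wide ROUTERS quota.
gcloud compute project-info describe --project=<scale-project> --format=json \
  | jq '.quotas[] | select(.metric == "ROUTERS")'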
AFAICT this is fixed.