I think we want to kill the pod and abort the prowjob if the pod is stuck in pending for longer than podMaxAge, or stuck in running for longer than PJMaxAge, regardless... (I remember opening a similar issue but cannot locate it)
thoughts?
cc @stevekuznetsov @cjwagner @BenTheElder
Don't we already do this? @cjwagner just changed it so when we kill a Pod in Pending we set the ProwJobState correctly
@cjwagner @stevekuznetsov that's not the case in our cluster? See https://github.com/kubernetes/test-infra/issues/11594#issuecomment-469484840
Who's deleting the pending pods? Plank? I don't see the logic in sinker..
https://github.com/kubernetes/test-infra/blob/master/prow/config.yaml#L5
which is.. not really WAI?
We are also seeing a lot of stale "running" jobs, on top of the pending jobs:
The last line of the log is the following, which seems correct, but the pod didn't exit after that (https://github.com/kubernetes/test-infra/issues/11631):
{"component":"entrypoint","error":"wrapped process failed: exit status 1","level":"error","msg":"Error executing test process","time":"2019-03-02T18:07:45Z"}
Let's investigate this as a bug
(I also want to propose that we add a pod running timeout here, on top of the pending timeout bug)
So these prowjobs are in the error state properly, but the pod didn't get deleted - need to double-check whether we have https://github.com/kubernetes/test-infra/pull/11579 deployed or not; it might also be worth adding some logging.
Isn't a pod running timeout #7152
Just to be clear, pending has real significance here: it does not refer to any pod that hasn't completed yet (that is what the running state is for). Pending specifically refers to pods with State: pending.
> Isn't a pod running timeout #7152
hummmm I guess the podutil timeout didn't work...
Sounds like there was some confusion over vocabulary on our end and none of these pods are actually in the pending state. Can you confirm this @krzyzacy or @amwat ?
@cjwagner yeah, just try: kubectl get po -n=test-pods | grep ImagePullBackOff
@cjwagner we want to deal with two problems here actually:
@krzyzacy instead of showing the kubectl get output, can you paste the kubectl describe output please? I don't have access...
You said on the call that you saw the entrypoint timeout fire and kill the process, but the pod did not exit? Can you please post kubectl describe? What is happening? Is the sidecar still running?
Let's literally put the Pod YAML or describe output here so we can be very specific about what the pods are doing
I just checked one of the pods stuck in ImagePullBackOff and its corresponding PJ. The PJ has been marked as errored, but the pod is still pending. Plank should try to delete the pod here: https://github.com/kubernetes/test-infra/blob/4a2426fc38ba45bacc40c5b49724114e8272b529/prow/plank/controller.go#L427
But clearly the pod wasn't deleted and I don't see the error logged.
Here is one of the pending pods.
Let's handle the problem with running pods not timing out in a separate issue? Do we have any reason to believe these problems are related?
Name:               fdee574b-3ed1-11e9-ab39-0a580a6c0f43
Namespace:          test-pods
Priority:           0
PriorityClassName:  <none>
Node:               gke-prow-containerd-pool-99179761-4p9n/10.128.0.18
Start Time:         Mon, 04 Mar 2019 15:05:46 -0800
Labels:             created-by-prow=true
                    event-GUID=fd821b00-3ed1-11e9-9e1e-7b7edb5b5699
                    preset-service-account=true
                    prow.k8s.io/id=fdee574b-3ed1-11e9-ab39-0a580a6c0f43
                    prow.k8s.io/job=pull-test-infra-verify-file-perms
                    prow.k8s.io/refs.org=kubernetes
                    prow.k8s.io/refs.pull=11606
                    prow.k8s.io/refs.repo=test-infra
                    prow.k8s.io/type=presubmit
Annotations:        kubernetes.io/limit-ranger=LimitRanger plugin set: memory request for container test; memory request for container sidecar; memory request for init container clonerefs; memory request for init contain...
                    prow.k8s.io/job=pull-test-infra-verify-file-perms
Status:             Pending
IP:                 10.60.247.60
Init Containers:
  clonerefs:
    Container ID:  containerd://ed928cf5b56f8b0cbfae444e38a04554d2009b4cb4f9795767d9325a43b8a5dd
    Image:         gcr.io/k8s-prow/clonerefs:v20190304-b7b76c8
    Image ID:      gcr.io/k8s-prow/clonerefs@sha256:a0e1557cd9ab88af979d665d9da5dd15120c598fa960bc027e0244d729a02623
    Port:          <none>
    Host Port:     <none>
    Command:
      /clonerefs
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Mar 2019 15:05:51 -0800
      Finished:     Mon, 04 Mar 2019 15:06:19 -0800
    Ready:          True
    Restart Count:  0
    Requests:
      memory:  1Gi
    Environment:
      CLONEREFS_OPTIONS:  {"src_root":"/home/prow/go","log":"/logs/clone.json","git_user_name":"ci-robot","git_user_email":"[email protected]","refs":[{"org":"kubernetes","repo":"test-infra","repo_link":"https://github.com/kubernetes/test-infra","base_ref":"master","base_sha":"fec65a32d1fde010684bfc40834b77d124e3898e","base_link":"https://github.com/kubernetes/test-infra/commit/fec65a32d1fde010684bfc40834b77d124e3898e","pulls":[{"number":11606,"author":"krzyzacy","sha":"cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","link":"https://github.com/kubernetes/test-infra/pull/11606","commit_link":"https://github.com/kubernetes/test-infra/pull/11606/commits/cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","author_link":"https://github.com/krzyzacy"}]}]}
    Mounts:
      /home/prow/go from code (rw)
      /logs from logs (rw)
  initupload:
    Container ID:  containerd://52a15fd1a73948a329311bf31e8d4c729b5e23e8858793bf52cb4eaefa43b67c
    Image:         gcr.io/k8s-prow/initupload:v20190304-b7b76c8
    Image ID:      gcr.io/k8s-prow/initupload@sha256:1c6a57bb5c325c793f172105ca46f43eb25fae1b513f9968730f52fde81da497
    Port:          <none>
    Host Port:     <none>
    Command:
      /initupload
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Mar 2019 15:06:24 -0800
      Finished:     Mon, 04 Mar 2019 15:06:27 -0800
    Ready:          True
    Restart Count:  0
    Requests:
      memory:  1Gi
    Environment:
      INITUPLOAD_OPTIONS:  {"bucket":"kubernetes-jenkins","path_strategy":"legacy","default_org":"kubernetes","default_repo":"kubernetes","gcs_credentials_file":"/secrets/gcs/service-account.json","dry_run":false,"log":"/logs/clone.json"}
      JOB_SPEC:            {"type":"presubmit","job":"pull-test-infra-verify-file-perms","buildid":"633","prowjobid":"fdee574b-3ed1-11e9-ab39-0a580a6c0f43","refs":{"org":"kubernetes","repo":"test-infra","repo_link":"https://github.com/kubernetes/test-infra","base_ref":"master","base_sha":"fec65a32d1fde010684bfc40834b77d124e3898e","base_link":"https://github.com/kubernetes/test-infra/commit/fec65a32d1fde010684bfc40834b77d124e3898e","pulls":[{"number":11606,"author":"krzyzacy","sha":"cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","link":"https://github.com/kubernetes/test-infra/pull/11606","commit_link":"https://github.com/kubernetes/test-infra/pull/11606/commits/cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","author_link":"https://github.com/krzyzacy"}]}}
    Mounts:
      /logs from logs (rw)
      /secrets/gcs from gcs-credentials (rw)
  place-entrypoint:
    Container ID:  containerd://fa7debd709b9bd2561c7140bdd46ccb539da162304de49eac904bcd3eb2ae229
    Image:         gcr.io/k8s-prow/entrypoint:v20190304-b7b76c8
    Image ID:      gcr.io/k8s-prow/entrypoint@sha256:318cb424770b13bd2bcc2e8aba5ad9347d110d9ce511991e0432feee83309db2
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/cp
    Args:
      /entrypoint
      /tools/entrypoint
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Mar 2019 15:06:30 -0800
      Finished:     Mon, 04 Mar 2019 15:06:30 -0800
    Ready:          True
    Restart Count:  0
    Requests:
      memory:  1Gi
    Environment:  <none>
    Mounts:
      /tools from tools (rw)
Containers:
  test:
    Container ID:  containerd://36fb391e3bf5a19c16a9d2e80bd1761fdff8e4a74c36978a2c27de5b4c15db57
    Image:         gcr.io/k8s-testimages/kubekins-e2e:v20190301-76bc03340-experimental
    Image ID:      gcr.io/k8s-testimages/kubekins-e2e@sha256:a635300fce400134f6c0c3c47da228f8540adf3388a12e4dd6c9ca5e8a602a2a
    Port:          <none>
    Host Port:     <none>
    Command:
      /tools/entrypoint
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Mar 2019 15:06:33 -0800
      Finished:     Mon, 04 Mar 2019 15:06:33 -0800
    Ready:          False
    Restart Count:  0
    Requests:
      memory:  1Gi
    Environment:
      GOOGLE_APPLICATION_CREDENTIALS:      /etc/service-account/service-account.json
      E2E_GOOGLE_APPLICATION_CREDENTIALS:  /etc/service-account/service-account.json
      ARTIFACTS:                           /logs/artifacts
      BUILD_ID:                            633
      BUILD_NUMBER:                        633
      GOPATH:                              /home/prow/go
      JOB_NAME:                            pull-test-infra-verify-file-perms
      JOB_SPEC:                            {"type":"presubmit","job":"pull-test-infra-verify-file-perms","buildid":"633","prowjobid":"fdee574b-3ed1-11e9-ab39-0a580a6c0f43","refs":{"org":"kubernetes","repo":"test-infra","repo_link":"https://github.com/kubernetes/test-infra","base_ref":"master","base_sha":"fec65a32d1fde010684bfc40834b77d124e3898e","base_link":"https://github.com/kubernetes/test-infra/commit/fec65a32d1fde010684bfc40834b77d124e3898e","pulls":[{"number":11606,"author":"krzyzacy","sha":"cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","link":"https://github.com/kubernetes/test-infra/pull/11606","commit_link":"https://github.com/kubernetes/test-infra/pull/11606/commits/cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","author_link":"https://github.com/krzyzacy"}]}}
      JOB_TYPE:                            presubmit
      PROW_JOB_ID:                         fdee574b-3ed1-11e9-ab39-0a580a6c0f43
      PULL_BASE_REF:                       master
      PULL_BASE_SHA:                       fec65a32d1fde010684bfc40834b77d124e3898e
      PULL_NUMBER:                         11606
      PULL_PULL_SHA:                       cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c
      PULL_REFS:                           master:fec65a32d1fde010684bfc40834b77d124e3898e,11606:cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c
      REPO_NAME:                           test-infra
      REPO_OWNER:                          kubernetes
      ENTRYPOINT_OPTIONS:                  {"timeout":7200000000000,"grace_period":15000000000,"artifact_dir":"/logs/artifacts","args":["./hack/verify-file-perms.sh"],"process_log":"/logs/process-log.txt","marker_file":"/logs/marker-file.txt","metadata_file":"/logs/artifacts/metadata.json"}
    Mounts:
      /etc/service-account from service (ro)
      /home/prow/go from code (rw)
      /logs from logs (rw)
      /tools from tools (rw)
  sidecar:
    Container ID:
    Image:          gcr.io/k8s-prow/sidecar:v20190304-b7b76c8
    Image ID:
    Port:           <none>
    Host Port:      <none>
    Command:
      /sidecar
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      memory:  1Gi
    Environment:
      JOB_SPEC:         {"type":"presubmit","job":"pull-test-infra-verify-file-perms","buildid":"633","prowjobid":"fdee574b-3ed1-11e9-ab39-0a580a6c0f43","refs":{"org":"kubernetes","repo":"test-infra","repo_link":"https://github.com/kubernetes/test-infra","base_ref":"master","base_sha":"fec65a32d1fde010684bfc40834b77d124e3898e","base_link":"https://github.com/kubernetes/test-infra/commit/fec65a32d1fde010684bfc40834b77d124e3898e","pulls":[{"number":11606,"author":"krzyzacy","sha":"cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","link":"https://github.com/kubernetes/test-infra/pull/11606","commit_link":"https://github.com/kubernetes/test-infra/pull/11606/commits/cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c","author_link":"https://github.com/krzyzacy"}]}}
      SIDECAR_OPTIONS:  {"gcs_options":{"items":["/logs/artifacts"],"bucket":"kubernetes-jenkins","path_strategy":"legacy","default_org":"kubernetes","default_repo":"kubernetes","gcs_credentials_file":"/secrets/gcs/service-account.json","dry_run":false},"entries":[{"args":["./hack/verify-file-perms.sh"],"process_log":"/logs/process-log.txt","marker_file":"/logs/marker-file.txt","metadata_file":"/logs/artifacts/metadata.json"}]}
    Mounts:
      /logs from logs (rw)
      /secrets/gcs from gcs-credentials (rw)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  service:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-account
    Optional:    false
  logs:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  tools:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  gcs-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-account
    Optional:    false
  code:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                  From                                             Message
  ----     ------   ----                 ----                                             -------
  Normal   BackOff  6m (x6078 over 23h)  kubelet, gke-prow-containerd-pool-99179761-4p9n  Back-off pulling image "gcr.io/k8s-prow/sidecar:v20190304-b7b76c8"
  Warning  Failed   1m (x6100 over 23h)  kubelet, gke-prow-containerd-pool-99179761-4p9n  Error: ImagePullBackOff
I bet that's just because we didn't push a podutil image when we bumped prow... /shrug
(We probably want to add a presubmit to prow/cluster to validate all the images are pushed...)
Reading the plank code, it should notice the Pod is pending in every loop and try to delete, right?
@cjwagner yes let's make a separate issue for pod utilities not timing out -- @krzyzacy do you have an example of that
> Reading the plank code, it should notice the Pod is pending in every loop and try to delete, right?
Yeah thats what I was pointing out here: https://github.com/kubernetes/test-infra/issues/11617#issuecomment-469893078
I don't see any logs in the past week with "unknown cluster alias" or "failed to delete pod" though. So I'm not sure what is going on.
> @krzyzacy do you have an example of that
@stevekuznetsov ehhhh I killed the only podutil pod... the other hanging ones are all bootstrap ones... I'll open a new issue when I see it again
Any logs for "error listing pods"?
Here is the yaml for the associated PJ:
apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  annotations:
    prow.k8s.io/job: pull-test-infra-verify-file-perms
  creationTimestamp: 2019-03-04T23:05:24Z
  generation: 1
  labels:
    created-by-prow: "true"
    event-GUID: fd821b00-3ed1-11e9-9e1e-7b7edb5b5699
    preset-service-account: "true"
    prow.k8s.io/id: fdee574b-3ed1-11e9-ab39-0a580a6c0f43
    prow.k8s.io/job: pull-test-infra-verify-file-perms
    prow.k8s.io/refs.org: kubernetes
    prow.k8s.io/refs.pull: "11606"
    prow.k8s.io/refs.repo: test-infra
    prow.k8s.io/type: presubmit
  name: fdee574b-3ed1-11e9-ab39-0a580a6c0f43
  namespace: default
  resourceVersion: "189808236"
  selfLink: /apis/prow.k8s.io/v1/namespaces/default/prowjobs/fdee574b-3ed1-11e9-ab39-0a580a6c0f43
  uid: fdeeab67-3ed1-11e9-898b-42010a80003a
spec:
  agent: kubernetes
  cluster: default
  context: pull-test-infra-verify-file-perms
  decoration_config:
    gcs_configuration:
      bucket: kubernetes-jenkins
      default_org: kubernetes
      default_repo: kubernetes
      path_strategy: legacy
    gcs_credentials_secret: service-account
    grace_period: 15000000000
    timeout: 7200000000000
    utility_images:
      clonerefs: gcr.io/k8s-prow/clonerefs:v20190304-b7b76c8
      entrypoint: gcr.io/k8s-prow/entrypoint:v20190304-b7b76c8
      initupload: gcr.io/k8s-prow/initupload:v20190304-b7b76c8
      sidecar: gcr.io/k8s-prow/sidecar:v20190304-b7b76c8
  job: pull-test-infra-verify-file-perms
  namespace: test-pods
  pod_spec:
    containers:
    - command:
      - ./hack/verify-file-perms.sh
      env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: E2E_GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      image: gcr.io/k8s-testimages/kubekins-e2e:v20190301-76bc03340-experimental
      name: ""
      resources: {}
      volumeMounts:
      - mountPath: /etc/service-account
        name: service
        readOnly: true
    volumes:
    - name: service
      secret:
        secretName: service-account
  refs:
    base_link: https://github.com/kubernetes/test-infra/commit/fec65a32d1fde010684bfc40834b77d124e3898e
    base_ref: master
    base_sha: fec65a32d1fde010684bfc40834b77d124e3898e
    org: kubernetes
    pulls:
    - author: krzyzacy
      author_link: https://github.com/krzyzacy
      commit_link: https://github.com/kubernetes/test-infra/pull/11606/commits/cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c
      link: https://github.com/kubernetes/test-infra/pull/11606
      number: 11606
      sha: cabb8b877ccaef956fecbcdbcfbb8fb0a08d171c
    repo: test-infra
    repo_link: https://github.com/kubernetes/test-infra
  report: true
  rerun_command: /test pull-test-infra-verify-file-perms
  type: presubmit
status:
  build_id: "633"
  completionTime: 2019-03-05T00:06:13Z
  description: Pod pending timeout.
  pod_name: fdee574b-3ed1-11e9-ab39-0a580a6c0f43
  prev_report_states:
    github-reporter: error
  startTime: 2019-03-04T23:05:24Z
  state: error
  url: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/test-infra/11606/pull-test-infra-verify-file-perms/633
> Any logs for "error listing pods"?
Nope.
hummmm I guess just because v20190301-2d35634 doesn't have that logic yet
I take it back - it should be one after https://github.com/kubernetes/test-infra/commit/c3922de8126fb0fec61cb40b3f6288d1ac267929
now I'm really confused :joy:
edit again: nvm - we don't have that commit, this should be solved when we bump prow again
Ok, no real issue then :)
/close
@stevekuznetsov: Closing this issue.
In response to this:
> Ok, no real issue then :)
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Note that we still need to clean up these old stale pending pods as a one-off shot until we deploy the new change... will take a stab tomorrow