I kicked off a batch of roughly 200 workflows and expected them all to run to completion (success or failure). The last 20 or so hung after completing the first 5 of 7 steps successfully; the next step was never triggered. The earlier workflows all completed in ~20 minutes, but these last 20 sat hung for over 5 hours, so I deleted them all.
In the Argo Server UI the workflow showed as running (spinning), but some containers had finished and no subsequent containers/steps were kicked off.
What version of Argo Workflows are you running?
2.10.0
(no YAML, since I deleted the workflows :( )
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
Can you please attach the controller logs?
I do still have logs from then, but I don't know what I'd be looking for. After manually filtering and scanning for about half an hour, everything looks OK (to me). Unfortunately I don't have the names of the workflows that were hanging, so I was just looking for anything that seemed "weird". Any pointers on what to look for?
Without either reproduction steps or more diagnostics, it is not possible to determine the cause of this problem.
When this reoccurs, can you please capture the logs?
Then can you try restarting the controller?
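For reference, a minimal sketch of capturing the controller logs and filtering them for one workflow (the namespace and deployment names here are assumptions; adjust for your install):

```shell
# On a real cluster you would capture the logs first (names assumed):
#   kubectl -n argo-events logs deploy/workflow-controller > controller.log
# To make the filter step demonstrable without a cluster, this sketch
# writes two sample lines (taken from this issue) into controller.log:
cat > controller.log <<'EOF'
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" namespace=argo-events workflow=ys-play-8846w
time="2020-09-22T03:56:33Z" level=info msg="Processing workflow" namespace=argo-events workflow=ys-play-4tbmx
EOF

# Keep only the lines mentioning the hung workflow:
grep 'workflow=ys-play-8846w' controller.log
```

To restart the controller, `kubectl -n argo-events rollout restart deployment/workflow-controller` should work (again, the names are assumptions about your install).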
OK, I had this issue happen again and I think I found some relevant logs.
I was tracing the most recently hung pod (from ~40 minutes earlier): ys-play-8846w.
This location in the workflow controller's logs is the last time that workflow is mentioned.
Restarting the workflow controller does pick the workflows up from where they left off, though 👍
time="2020-09-22T03:55:22Z" level=info msg="Alloc=177500 TotalAlloc=57750984 Sys=836015 NumGC=3600 Goroutines=166"
E0922 03:56:33.870098 1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ys-play-msfdd.1636fe90550fbea0", GenerateName:"", Namespace:"argo-events", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"argo-events", Name:"ys-play-msfdd", UID:"1b54a65a-7c0d-44aa-b316-49d2bd973955", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"23549548", FieldPath:""}, Reason:"WorkflowTimedOut", Message:"ys-play-msfdd error in entry template execution: Deadline exceeded\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/workflow/controller.init\n\t/go/src/github.com/argoproj/argo/workflow/controller/operator.go:103\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5222\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5217\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfd27afd59c96ca0, ext:285891151377828, loc:(*time.Location)(0x29a41c0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfd27afd59c96ca0, ext:285891151377828, loc:(*time.Location)(0x29a41c0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'the server was unable to return a response in the time allotted, but may still be processing the request (post events)' (will not retry!)
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-game-packaging-575nn
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-h62ck
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-hqnt9
time="2020-09-22T03:56:33Z" level=info msg="Processing workflow" namespace=argo-events workflow=ys-play-h62ck
time="2020-09-22T03:56:33Z" level=info msg="Processing workflow" namespace=argo-events workflow=ys-play-hqnt9
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-66cgc
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-hl9sj
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-llgfl
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-4tbmx
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-r6l6w
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-6skb2
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-mmw2g
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-zwzps
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-ll7hj
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-z42sk
time="2020-09-22T03:56:33Z" level=error msg="workflow timeout" error="the server was unable to return a response in the time allotted, but may still be processing the request (get pods)" namespace=argo-events workflow=ys-play-8846w
time="2020-09-22T03:56:33Z" level=info msg="Processing workflow" namespace=argo-events workflow=ys-play-4tbmx
time="2020-09-22T03:56:33Z" level=info msg="Processing workflow" namespace=argo-events workflow=ys-play-ll7hj
time="2020-09-22T03:56:34Z" level=info msg="insignificant pod change" key=argo-events/ys-play-6skb2-1526641406
time="2020-09-22T03:56:34Z" level=info msg="Workflow update successful" namespace=argo-events phase=Running resourceVersion=23553495 workflow=ys-game-packaging-cbz7v
time="2020-09-22T03:56:34Z" level=info msg="Workflow update successful" namespace=argo-events phase=Running resourceVersion=23553496 workflow=ys-game-packaging-rcjcz
E0922 03:56:34.984364 1 event.go:272] Unable to write event: 'Patch https://172.20.0.1:443/api/v1/namespaces/argo-events/events/ys-game-packaging-z64r7.1636fe5c383b29a3: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=""' (may retry after sleeping)
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-game-packaging-z64r7: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-game-packaging-z64r7
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-play-msfdd: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-play-msfdd
E0922 03:56:34.984466 1 ttlcontroller.go:114] error deleting 'argo-events/ys-game-packaging-bkn24': Delete https://172.20.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo-events/workflows/ys-game-packaging-bkn24: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=""
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-play-ll7hj: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-play-ll7hj
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-play-h62ck: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-play-h62ck
time="2020-09-22T03:56:34Z" level=error msg="Failed to labeled pod completed" err="Patch https://172.20.0.1:443/api/v1/namespaces/argo-events/pods/ys-play-msfdd-2648834506: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events pod=ys-play-msfdd-2648834506
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-play-hqnt9: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-play-hqnt9
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-play-5ckzt: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-play-5ckzt
time="2020-09-22T03:56:34Z" level=error msg="workflow timeout" error="Get https://172.20.0.1:443/api/v1/namespaces/argo-events/pods?labelSelector=workflows.argoproj.io%2Fworkflow%3Dys-play-4tbmx: http2: server sent GOAWAY and closed the connection; LastStreamID=2693, ErrCode=NO_ERROR, debug=\"\"" namespace=argo-events workflow=ys-play-4tbmx
Maybe fixed by #4103
Will be fixed in v2.11.1
Fixed.