When using an ingress controller with an idle timeout configured, it's possible that no events occur within that period. The ingress then closes the workflow-events stream, and the UI throws an error. In the case of my cluster, I use ingress-nginx, which has a default idle timeout of 60s.


Since these streams are expected to be very long-lived connections, we should consider one of the following:
- Send a piece of data periodically if none has been sent. This is not optimal imo, since we'd need to filter it out on the client, and it still may not solve the problem if the user configures an idle timeout shorter than the interval at which we send data.
- Retry the connection on the front-end at least once. If the connection is successfully re-established, it's a candidate to be retried again if/when the error recurs. If the connection fails to be re-established, throw our existing error, since that might indicate a loss of network connectivity or a problem with the argo-server pod.

In both cases, we should also provide a nicer way to retry when this occurs, rather than reloading the page.
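The second option could look something like this. This is only a sketch: `connect` stands in for opening the workflow-events stream, and the function name and signature are illustrative, not Argo's actual API.

```typescript
// Attempt to open the events stream; if it drops or fails, retry up to
// `maxRetries` times before surfacing the existing error to the user.
type Connect = () => Promise<void>;

async function connectWithRetry(connect: Connect, maxRetries = 1): Promise<boolean> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await connect();
      // Connection (re-)established; a future drop is again a retry candidate.
      return true;
    } catch {
      // Fall through and retry; give up after the last attempt.
    }
  }
  // All attempts failed: likely network loss or an argo-server problem,
  // so the caller should show the existing error.
  return false;
}
```

The key property is that a single idle-timeout disconnect is absorbed silently, while repeated failures still surface an error.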
What Kubernetes provider are you using?
EKS 1.17 with Ingress NGINX
What version of Argo Workflows are you running?
v2.11.1
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
I think we have addressed some of this in v2.11.7 and v2.12. Could you try latest?
Yep, looks like you're right - v2.12.0-rc2 retries automatically after 10s and provides a way to explicitly reload.

I wonder if we can improve the UX there, though: for low-traffic instances, this might be a relatively common error to run into. Maybe we can retry seamlessly without interrupting the UI, or make the error not completely replace the workflow listing?
Do you want to suggest something?
Sure - I think a pretty straightforward change would be to show, when a disconnect occurs:

> Disconnected from workflow streaming. Reconnecting in 10s

which gives a better idea of what is currently non-functional and what action will be taken when the timer completes.

This makes sense. Each page is different and needs different disconnect logic.
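A minimal sketch of that banner and countdown, assuming hypothetical helper names (`disconnectBanner`, `startReconnectCountdown` are illustrative, not existing UI code):

```typescript
// Build the suggested user-facing message for a given number of seconds left.
function disconnectBanner(secondsRemaining: number): string {
  return `Disconnected from workflow streaming. Reconnecting in ${secondsRemaining}s`;
}

// Tick once per second, updating the banner, then trigger the reconnect
// attempt when the countdown reaches zero.
function startReconnectCountdown(
  seconds: number,
  onTick: (message: string) => void,
  onReconnect: () => void,
): void {
  onTick(disconnectBanner(seconds));
  const timer = setInterval(() => {
    seconds--;
    if (seconds <= 0) {
      clearInterval(timer);
      onReconnect();
    } else {
      onTick(disconnectBanner(seconds));
    }
  }, 1000);
}
```

Keeping the countdown outside the workflow listing component would let the list stay rendered while the banner shows, rather than the error replacing it.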
I'm fixing this in the v3 UI.