Argo: Handle idle timeouts more gracefully

Created on 13 Nov 2020 · 6 Comments · Source: argoproj/argo

Summary

When you're using an ingress controller that has an idle timeout configured, it's possible that no events occur within that period, in which case the workflow-events stream is closed and the UI throws an error. In the case of my cluster, I use ingress-nginx, which has a default idle timeout of 60s.

Since it's expected that these streams are very long lived connections, we should consider one of the following:

  1. Send a piece of data periodically if none has been sent. This is not optimal imo since we'd need to filter it out on the client, and it still may not solve the problem if the user configures an idle timeout shorter than the interval at which we send data.

  2. Retry the connection on the front-end at least once. If the connection is successfully re-established, then it's a candidate to retry again if/when the error occurs. If the connection fails to be re-established, throw our existing error since that might indicate a loss of network connectivity or problems with the argo-server pod.

In both cases, we should also provide a nicer way to retry when this occurs rather than reloading the page.
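The decision logic in option 2 can be sketched as a small state machine. This is only an illustration under assumed names (`ReconnectPolicy`, `StreamEvent`, `Action` are hypothetical and not taken from the Argo UI code): an error is retried only if the stream had opened successfully since the previous error, otherwise the existing error is shown.

```typescript
// Hypothetical sketch of option 2; none of these names come from the Argo UI.
type StreamEvent = 'open' | 'error';
type Action = 'none' | 'reconnect' | 'show-error';

class ReconnectPolicy {
    // True once the stream has opened successfully since the last error.
    private healthySinceLastError = false;

    public next(event: StreamEvent): Action {
        if (event === 'open') {
            // A successful (re)connection makes the stream a candidate for retry again.
            this.healthySinceLastError = true;
            return 'none';
        }
        if (this.healthySinceLastError) {
            // The stream was working before this error, so it is probably an
            // idle timeout: retry once without bothering the user.
            this.healthySinceLastError = false;
            return 'reconnect';
        }
        // The retry itself failed (or the stream never opened): surface the
        // existing error, since this may be a network or argo-server problem.
        return 'show-error';
    }
}
```

The caller would feed it the stream's open and error notifications and act on the returned value, for example by re-creating the connection on `'reconnect'`.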

Diagnostics

What Kubernetes provider are you using?

EKS 1.17 with Ingress NGINX

What version of Argo Workflows are you running?

v2.11.1



Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

bug

All 6 comments

I think we have addressed some of this in v2.11.7 and v2.12. Could you try latest?

Yep, looks like you're right - v2.12.0-rc2 retries automatically after 10s and provides a way to explicitly reload.

I wonder if we can improve the UX on that though, since for low-traffic instances this might be a relatively common error to run into. Maybe we can retry seamlessly without interrupting the UI, or make the error not completely replace the workflow listing?

Do you want to suggest something?

Sure - I think a pretty straightforward change to make would be when a disconnect occurs:

  • Don't remove the existing workflow listing, since the existing data is all still valid. Users who don't necessarily understand the error reach out thinking that Argo is broken, when it's just a normally occurring situation caused by the idle timeout on the ingress controller when there's low traffic.
  • Re-word the error message from "Unknown error" to something like "Disconnected from workflow streaming. Reconnecting in 10s", which gives a better idea of what is currently non-functional and what action will be taken when the timer completes.
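A minimal sketch of how the two points above might look as a state update, assuming a reducer-style list page. The names `ListState`, `ListEvent` and `reduce` are hypothetical and the real UI components are structured differently; the point is only that a disconnect sets a notice and leaves the listing untouched.

```typescript
// Hypothetical sketch; not the actual Argo UI state or component code.
interface ListState {
    workflows: string[]; // workflow names only, for brevity
    banner?: string;     // non-blocking notice shown above the listing
}

type ListEvent =
    | {type: 'update'; workflows: string[]}
    | {type: 'disconnected'; retryInSeconds: number}
    | {type: 'reconnected'};

function reduce(state: ListState, event: ListEvent): ListState {
    switch (event.type) {
        case 'update':
            // Fresh data from the stream replaces the listing and clears any notice.
            return {workflows: event.workflows};
        case 'disconnected':
            // Keep the existing listing; only add a notice describing what happens next.
            return {
                workflows: state.workflows,
                banner: `Disconnected from workflow streaming. Reconnecting in ${event.retryInSeconds}s`
            };
        case 'reconnected':
            // Stream is back: drop the notice, keep the listing until new events arrive.
            return {workflows: state.workflows};
    }
}
```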

This makes sense. Each page is different and needs different disconnect logic.

I'm fixing this in the v3 UI.
