When viewing a workflow in the argo server web UI at the /workflows/wf/[WORKFLOW-NAME] route, the UI appears to watch for updates using a streaming connection to the /api/v1/workflow-events/wf?listOptions.fieldSelector=metadata.name=[WORKFLOW-NAME] route. If the connection is interrupted for any reason, the UI reestablishes the connection, but it adds the listOptions.resourceVersion parameter to the request. The resource version it uses is the current resource version on the workflow being viewed.
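For reference, here is a minimal TypeScript sketch of what this watch-and-reconnect appears to look like from the browser side. The endpoint path and the two listOptions parameters are taken from the request URLs above; the newline-delimited JSON framing (typical of gRPC-gateway server streams) and the handleEvent helper are my assumptions, not the actual UI code:

    // Hypothetical stand-in for the UI's real event handler.
    declare function handleEvent(event: unknown): void;

    async function watchWorkflow(namespace: string, name: string, resourceVersion?: string): Promise<void> {
      const params = new URLSearchParams({ 'listOptions.fieldSelector': `metadata.name=${name}` });
      if (resourceVersion) {
        // Added only on reconnect: the workflow's current metadata.resourceVersion.
        params.set('listOptions.resourceVersion', resourceVersion);
      }
      const res = await fetch(`/api/v1/workflow-events/${namespace}?${params}`);
      if (!res.ok || !res.body) {
        throw new Error(`watch failed: ${res.status} ${await res.text()}`);
      }
      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      for (;;) {
        const { done, value } = await reader.read();
        if (done) return; // stream ended; the caller decides how to reconnect
        buffer += decoder.decode(value, { stream: true });
        let newline: number;
        while ((newline = buffer.indexOf('\n')) >= 0) {
          const line = buffer.slice(0, newline).trim();
          buffer = buffer.slice(newline + 1);
          if (line) handleEvent(JSON.parse(line)); // assumed one JSON object per line
        }
      }
    }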
If the workflow is old enough, then the reconnect attempt (with the listOptions.resourceVersion parameter added) fails with this error in the argo server log, despite the fact that the resource version is the current resource version on the workflow:
time="2020-11-20T22:56:19Z" level=error msg="finished streaming call with code Unknown" error="too old resource version: 1217 (1323)" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-11-20T22:56:19Z" grpc.time_ms=14.491 span.kind=server system=grpc
The UI retries infinitely as fast as it can, flooding the server with requests that will always fail with this error.
I first noticed this issue in our Amazon EKS cluster, though I was able to easily reproduce it in minikube using the hello-world workflow example from https://argoproj.github.io/argo/examples/#hello-world. Here is how I reproduced it:
1. minikube start --kubernetes-version='v1.19.2' --cpus=4
2. argo submit the hello-world workflow. I needed about 25 instances before the oldest workflow in the list would start producing this error. In our real cluster, with larger workflows that produce more resource versions due to more updates, I saw it happen with fewer.
3. kubectl -n argo port-forward deployment/argo-server 2746:2746
4. Browse to the oldest workflow in the UI.
5. Kill the kubectl port-forward. This will interrupt the connection to the server, triggering the UI to begin trying to reconnect.
6. Restart the kubectl port-forward so that the webpage can reach the server again. The server starts replying with 500 errors, and the argo server log is flooded with too old resource version as shown above.

Killing a port-forward is of course a contrived scenario. A real-life example could be as simple as a user putting their laptop to sleep while sitting on a workflow page. When it wakes up and tries to reconnect later, the user will unknowingly be flooding the server with these requests. We use the nginx ingress controller in our EKS cluster. In my case, I first discovered this because our ingress controller times out the streaming connection after 1 minute, and we have alerts set up to tell us if a high number of requests to the controller are failing; this triggered an alert.
It seems like there are 2 issues:

1. The UI retries infinitely, as fast as it can, with no delay between attempts.
2. The too old resource version error is not handled properly.

What Kubernetes provider are you using?

Amazon EKS (also reproduced in minikube)
What version of Argo Workflows are you running?
v2.11.7
Paste a workflow that reproduces the bug, including status:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2020-11-20T22:41:19Z"
  generateName: hello-world-
  generation: 5
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
      f:spec:
        .: {}
        f:arguments: {}
        f:entrypoint: {}
        f:serviceAccountName: {}
        f:templates: {}
      f:status: {}
    manager: argo
    operation: Update
    time: "2020-11-20T22:41:19Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:workflows.argoproj.io/completed: {}
          f:workflows.argoproj.io/phase: {}
      f:status:
        f:conditions: {}
        f:finishedAt: {}
        f:nodes:
          .: {}
          f:hello-world-v5rs5:
            .: {}
            f:displayName: {}
            f:finishedAt: {}
            f:hostNodeName: {}
            f:id: {}
            f:name: {}
            f:outputs:
              .: {}
              f:exitCode: {}
            f:phase: {}
            f:resourcesDuration:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:startedAt: {}
            f:templateName: {}
            f:templateScope: {}
            f:type: {}
        f:phase: {}
        f:resourcesDuration:
          .: {}
          f:cpu: {}
          f:memory: {}
        f:startedAt: {}
    manager: workflow-controller
    operation: Update
    time: "2020-11-20T22:42:39Z"
  name: hello-world-v5rs5
  namespace: wf
  resourceVersion: "1217"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/wf/workflows/hello-world-v5rs5
  uid: 6eda94da-c7cc-4fb2-9ae4-80f28386fccb
spec:
  arguments: {}
  entrypoint: whalesay
  serviceAccountName: argo-workflow
  templates:
  - arguments: {}
    container:
      args:
      - hello world
      command:
      - cowsay
      image: docker/whalesay
      name: ""
      resources:
        limits:
          cpu: 100m
          memory: 32Mi
    inputs: {}
    metadata: {}
    name: whalesay
    outputs: {}
status:
  conditions:
  - status: "True"
    type: Completed
  finishedAt: "2020-11-20T22:42:39Z"
  nodes:
    hello-world-v5rs5:
      displayName: hello-world-v5rs5
      finishedAt: "2020-11-20T22:42:38Z"
      hostNodeName: minikube
      id: hello-world-v5rs5
      name: hello-world-v5rs5
      outputs:
        exitCode: "0"
      phase: Succeeded
      resourcesDuration:
        cpu: 41
        memory: 41
      startedAt: "2020-11-20T22:41:19Z"
      templateName: whalesay
      templateScope: local/hello-world-v5rs5
      type: Pod
  phase: Succeeded
  resourcesDuration:
    cpu: 41
    memory: 41
  startedAt: "2020-11-20T22:41:19Z"
Paste the logs from the workflow controller:
time="2020-11-20T22:41:19Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Updated phase -> Running" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Pod node hello-world-v5rs5 initialized Pending" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Created pod: hello-world-v5rs5 (hello-world-v5rs5)" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Workflow update successful" namespace=wf phase=Running resourceVersion=1054 workflow=hello-world-v5rs5
time="2020-11-20T22:41:20Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:20Z" level=info msg="Updating node hello-world-v5rs5 message: ContainerCreating"
time="2020-11-20T22:41:20Z" level=info msg="Skipped pod hello-world-v5rs5 (hello-world-v5rs5) creation: already exists" namespace=wf podPhase=Pending workflow=hello-world-v5rs5
time="2020-11-20T22:41:21Z" level=info msg="Workflow update successful" namespace=wf phase=Running resourceVersion=1057 workflow=hello-world-v5rs5
time="2020-11-20T22:41:22Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:22Z" level=info msg="Skipped pod hello-world-v5rs5 (hello-world-v5rs5) creation: already exists" namespace=wf podPhase=Pending workflow=hello-world-v5rs5
time="2020-11-20T22:42:38Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:38Z" level=info msg="Setting node hello-world-v5rs5 outputs"
time="2020-11-20T22:42:38Z" level=info msg="Updating node hello-world-v5rs5 status Pending -> Running"
time="2020-11-20T22:42:38Z" level=info msg="Workflow update successful" namespace=wf phase=Running resourceVersion=1206 workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Updating node hello-world-v5rs5 status Running -> Succeeded"
time="2020-11-20T22:42:39Z" level=info msg="Updated phase Running -> Succeeded" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Marking workflow completed" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Checking daemoned children of " namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Workflow update successful" namespace=wf phase=Succeeded resourceVersion=1217 workflow=hello-world-v5rs5
time="2020-11-20T22:42:40Z" level=info msg="Labeled pod completed" namespace=wf pod=hello-world-v5rs5
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
A work-around is to close the browser so it stops making requests.
With the suggested workaround of "close the browser", if one or more users' browsers start flooding the server with these requests, we would have to somehow track those people down and ask them to close their browsers, or make network policy or security group changes to block traffic from their machines and then eventually unblock them once they realize they can no longer reach the server and complain to us. This is unfortunately not really a viable workaround for us.
See fix(ui): Reconnect to DAG. Fixes #4301 (#4378)
@kashook I've just pushed a dev image for you to test argoproj/argocli:ui-err.
Thanks @alexec! I tested the ui-err dev image following the same reproduction steps I originally posted. The server is no longer flooded with requests; instead, the UI attempts to reconnect every 10 seconds, still adding the resourceVersion parameter to the URL using the workflow's actual current resource version. The server still returns a 500, and the log still shows the too old resource version message. So the reconnect will still never succeed unless the page is completely refreshed, because a Kubernetes watch using that resource version can never succeed. The load on the server is significantly reduced, but this is presumably still not the intended behavior, and the repeated 500 errors will still trigger our alert system.
Once the too old resource version error occurs, I believe that the resourceVersion value being used needs to be updated. Based on the "Efficient detection of changes" heading of this page in the Kubernetes documentation:
A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.
Once the version on the workflow starts failing, I believe the resourceVersion being used should be reset to the one returned by a fresh list operation, not the one on the workflow itself. The resource version on the workflow is no longer in the server's change history and can never be watched again.
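To illustrate, here is a minimal sketch of the recovery the docs describe, reusing the hypothetical watchWorkflow helper from my first comment. The use of the /api/v1/workflows/{namespace} list endpoint and the error matching are my assumptions about how this could be wired into the UI, not the actual fix:

    // On "too old resource version", re-list to obtain a fresh resourceVersion
    // and restart the watch from there, as the Kubernetes docs recommend.
    async function watchWithRecovery(namespace: string, name: string): Promise<void> {
      let resourceVersion: string | undefined; // undefined = start from the latest state
      for (;;) {
        try {
          await watchWorkflow(namespace, name, resourceVersion);
        } catch (err) {
          // Assumed error shape; the server currently surfaces the 410 Gone as a 500.
          if (String(err).includes('too old resource version')) {
            const params = new URLSearchParams({ 'listOptions.fieldSelector': `metadata.name=${name}` });
            const res = await fetch(`/api/v1/workflows/${namespace}?${params}`);
            const list = await res.json();
            // Reset to the list's resourceVersion, not the workflow's stale one.
            resourceVersion = list.metadata?.resourceVersion;
          }
        }
        await new Promise((resolve) => setTimeout(resolve, 10_000)); // keep the 10s retry pacing
      }
    }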
All that being said, is passing the resource version necessary? When I first browse to the workflow it isn't passed. The code seems to be trying to get the latest version by using the current resource version from the workflow. Doing a watch in Kubernetes without passing the resource version gets you the latest, which seems like what you want in this case? (Ref https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter). I certainly may be missing some detail that makes this incorrect.
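If so, the reconnect could simply look the same as the initial connect. A sketch of that alternative, again using the hypothetical helper from above:

    // Alternative: never pin a resourceVersion on reconnect. An unset version
    // means "start from the current state", so the watch can always be established.
    async function reconnectWithoutVersion(namespace: string, name: string): Promise<void> {
      for (;;) {
        try {
          await watchWorkflow(namespace, name); // no resourceVersion parameter
        } catch {
          // connection dropped or server error; just wait and retry
        }
        await new Promise((resolve) => setTimeout(resolve, 10_000));
      }
    }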
@alexec I noticed this issue was closed. There really were two issues here, one being the rapid reconnect loop, and the other being that the reconnect can never work as currently implemented once the resource version on the workflow becomes too old. The rapid reconnect is fixed, but not the too old resource version. Should I log a new issue for that?
I think we can re-open this issue.
@kashook I'm trying to decide if we should fix this. It will be fixed in v3, but I'll have to create a fix for this too. I don't want to do that if I can help it.
Given that the UI v3 should come in January - do you think we should fix it on v2.12? Would you like to try and fix it?
@alexec if it's fixed in v3 in January we can wait until then, especially since the rapid reconnect loop is fixed. I certainly could try to fix it, but will probably hold off since a fix is coming. Thanks!
No problem. I'll leave open as it is still a bug.