When viewing a workflow in the argo server web UI at the /workflows/wf/[WORKFLOW-NAME] route, the UI appears to watch for updates using a streaming connection to the /api/v1/workflow-events/wf?listOptions.fieldSelector=metadata.name=[WORKFLOW-NAME] route. If the connection is interrupted for any reason, the UI reestablishes the connection, but it adds the listOptions.resourceVersion parameter to the request. The resource version it uses is the current resource version on the workflow being viewed.
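For reference, here is a minimal TypeScript sketch of what this watch-and-reconnect appears to look like from the browser side. The endpoint path and the two listOptions parameters are taken from the request URLs above; the newline-delimited JSON framing (typical of gRPC-gateway server streams) and the handleEvent helper are my assumptions, not the actual UI code:

    // Hypothetical stand-in for the UI's real event handler.
    declare function handleEvent(event: unknown): void;

    async function watchWorkflow(namespace: string, name: string, resourceVersion?: string): Promise<void> {
      const params = new URLSearchParams({ 'listOptions.fieldSelector': `metadata.name=${name}` });
      if (resourceVersion) {
        // Added only on reconnect: the workflow's current metadata.resourceVersion.
        params.set('listOptions.resourceVersion', resourceVersion);
      }
      const res = await fetch(`/api/v1/workflow-events/${namespace}?${params}`);
      if (!res.ok || !res.body) {
        throw new Error(`watch failed: ${res.status} ${await res.text()}`);
      }
      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      for (;;) {
        const { done, value } = await reader.read();
        if (done) return; // stream ended; the caller decides how to reconnect
        buffer += decoder.decode(value, { stream: true });
        let newline: number;
        while ((newline = buffer.indexOf('\n')) >= 0) {
          const line = buffer.slice(0, newline).trim();
          buffer = buffer.slice(newline + 1);
          if (line) handleEvent(JSON.parse(line)); // assumed one JSON object per line
        }
      }
    }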
If the workflow is old enough, then the reconnect attempt (with the listOptions.resourceVersion parameter added) fails with this error in the argo server log, despite the fact that the resource version is the current resource version on the workflow:
time="2020-11-20T22:56:19Z" level=error msg="finished streaming call with code Unknown" error="too old resource version: 1217 (1323)" grpc.code=Unknown grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2020-11-20T22:56:19Z" grpc.time_ms=14.491 span.kind=server system=grpc
The UI retries infinitely as fast as it can, flooding the server with requests that will always fail with this error.
I first noticed this issue in our Amazon EKS cluster, though I was able to easily reproduce it in minikube using the hello-world workflow example from https://argoproj.github.io/argo/examples/#hello-world. Here is how I reproduced it:
1. minikube start --kubernetes-version='v1.19.2' --cpus=4
2. argo submit the hello-world workflow. I needed about 25 instances before the oldest workflow in the list would start producing this error. In our real cluster, with larger workflows that produce more resource versions due to more updates, I saw it happen with fewer.
3. kubectl -n argo port-forward deployment/argo-server 2746:2746
4. Browse to the oldest workflow in the UI.
5. Kill the kubectl port-forward. This will interrupt the connection to the server, triggering the UI to begin trying to reconnect.
6. Restart the kubectl port-forward so that the webpage can reach the server again. The server starts replying with 500 errors, and the argo server log is flooded with too old resource version as shown above.

Killing a port-forward is of course a contrived scenario. A real-life example could be as simple as a user putting their laptop to sleep while sitting on a workflow page. When it wakes up and tries to reconnect later, the user will unknowingly be flooding the server with these requests. We use the nginx ingress controller in our EKS cluster. In my case, I first discovered this because our ingress controller times out the streaming connection after 1 minute, and we have alerts set up to tell us if a high number of requests to the controller are failing; this triggered an alert.
It seems like there are 2 issues:

1. The UI retries infinitely, as fast as it can, with no delay between attempts.
2. The too old resource version error is not handled properly.

What Kubernetes provider are you using?

Amazon EKS (also reproduced in minikube)
What version of Argo Workflows are you running?
v2.11.7
Paste a workflow that reproduces the bug, including status:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2020-11-20T22:41:19Z"
  generateName: hello-world-
  generation: 5
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
      f:spec:
        .: {}
        f:arguments: {}
        f:entrypoint: {}
        f:serviceAccountName: {}
        f:templates: {}
      f:status: {}
    manager: argo
    operation: Update
    time: "2020-11-20T22:41:19Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:workflows.argoproj.io/completed: {}
          f:workflows.argoproj.io/phase: {}
      f:status:
        f:conditions: {}
        f:finishedAt: {}
        f:nodes:
          .: {}
          f:hello-world-v5rs5:
            .: {}
            f:displayName: {}
            f:finishedAt: {}
            f:hostNodeName: {}
            f:id: {}
            f:name: {}
            f:outputs:
              .: {}
              f:exitCode: {}
            f:phase: {}
            f:resourcesDuration:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:startedAt: {}
            f:templateName: {}
            f:templateScope: {}
            f:type: {}
        f:phase: {}
        f:resourcesDuration:
          .: {}
          f:cpu: {}
          f:memory: {}
        f:startedAt: {}
    manager: workflow-controller
    operation: Update
    time: "2020-11-20T22:42:39Z"
  name: hello-world-v5rs5
  namespace: wf
  resourceVersion: "1217"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/wf/workflows/hello-world-v5rs5
  uid: 6eda94da-c7cc-4fb2-9ae4-80f28386fccb
spec:
  arguments: {}
  entrypoint: whalesay
  serviceAccountName: argo-workflow
  templates:
  - arguments: {}
    container:
      args:
      - hello world
      command:
      - cowsay
      image: docker/whalesay
      name: ""
      resources:
        limits:
          cpu: 100m
          memory: 32Mi
    inputs: {}
    metadata: {}
    name: whalesay
    outputs: {}
status:
  conditions:
  - status: "True"
    type: Completed
  finishedAt: "2020-11-20T22:42:39Z"
  nodes:
    hello-world-v5rs5:
      displayName: hello-world-v5rs5
      finishedAt: "2020-11-20T22:42:38Z"
      hostNodeName: minikube
      id: hello-world-v5rs5
      name: hello-world-v5rs5
      outputs:
        exitCode: "0"
      phase: Succeeded
      resourcesDuration:
        cpu: 41
        memory: 41
      startedAt: "2020-11-20T22:41:19Z"
      templateName: whalesay
      templateScope: local/hello-world-v5rs5
      type: Pod
  phase: Succeeded
  resourcesDuration:
    cpu: 41
    memory: 41
  startedAt: "2020-11-20T22:41:19Z"
Paste the logs from the workflow controller:
time="2020-11-20T22:41:19Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Updated phase -> Running" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Pod node hello-world-v5rs5 initialized Pending" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Created pod: hello-world-v5rs5 (hello-world-v5rs5)" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:19Z" level=info msg="Workflow update successful" namespace=wf phase=Running resourceVersion=1054 workflow=hello-world-v5rs5
time="2020-11-20T22:41:20Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:20Z" level=info msg="Updating node hello-world-v5rs5 message: ContainerCreating"
time="2020-11-20T22:41:20Z" level=info msg="Skipped pod hello-world-v5rs5 (hello-world-v5rs5) creation: already exists" namespace=wf podPhase=Pending workflow=hello-world-v5rs5
time="2020-11-20T22:41:21Z" level=info msg="Workflow update successful" namespace=wf phase=Running resourceVersion=1057 workflow=hello-world-v5rs5
time="2020-11-20T22:41:22Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:41:22Z" level=info msg="Skipped pod hello-world-v5rs5 (hello-world-v5rs5) creation: already exists" namespace=wf podPhase=Pending workflow=hello-world-v5rs5
time="2020-11-20T22:42:38Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:38Z" level=info msg="Setting node hello-world-v5rs5 outputs"
time="2020-11-20T22:42:38Z" level=info msg="Updating node hello-world-v5rs5 status Pending -> Running"
time="2020-11-20T22:42:38Z" level=info msg="Workflow update successful" namespace=wf phase=Running resourceVersion=1206 workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Processing workflow" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Updating node hello-world-v5rs5 status Running -> Succeeded"
time="2020-11-20T22:42:39Z" level=info msg="Updated phase Running -> Succeeded" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Marking workflow completed" namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Checking daemoned children of " namespace=wf workflow=hello-world-v5rs5
time="2020-11-20T22:42:39Z" level=info msg="Workflow update successful" namespace=wf phase=Succeeded resourceVersion=1217 workflow=hello-world-v5rs5
time="2020-11-20T22:42:40Z" level=info msg="Labeled pod completed" namespace=wf pod=hello-world-v5rs5
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
A work-around is to close the browser so it stops making requests.
With the suggested workaround of "close the browser", if one or more users' browsers start flooding the server with these requests, we would have to somehow track those people down and ask them to close their browsers, or make network policy or security group changes to block traffic from their machines and then eventually unblock them once they realize they can no longer reach the server and complain to us. This is unfortunately not really a viable workaround for us.
See fix(ui): Reconnect to DAG. Fixes #4301 (#4378)
@kashook I've just pushed a dev image for you to test argoproj/argocli:ui-err.
Thanks @alexec! I tested the ui-err dev image following the same reproduction steps I originally posted. The server is no longer flooded with requests; instead, the UI attempts to reconnect every 10 seconds, still adding the resourceVersion parameter to the URL using the workflow's actual current resource version. The server still returns a 500, and the log still shows the too old resource version message. So the reconnect will still never succeed unless the page is completely refreshed, because a Kubernetes watch using that resource version can never succeed. The load on the server is significantly reduced, but this is presumably still not the intended behavior, and the repeated 500 errors will still trigger our alert system.
Once the too old resource version error occurs, I believe that the resourceVersion value being used needs to be updated. Based on the "Efficient detection of changes" heading of this page in the Kubernetes documentation:
A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.
Once the version on the workflow starts failing, I believe the resourceVersion being used should be reset to the one returned by a fresh list operation, not the one on the workflow itself. The resource version on the workflow is no longer in the server's change history and can never be watched again.
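To illustrate, here is a minimal sketch of the recovery the docs describe, reusing the hypothetical watchWorkflow helper from my first comment. The use of the /api/v1/workflows/{namespace} list endpoint and the error matching are my assumptions about how this could be wired into the UI, not the actual fix:

    // On "too old resource version", re-list to obtain a fresh resourceVersion
    // and restart the watch from there, as the Kubernetes docs recommend.
    async function watchWithRecovery(namespace: string, name: string): Promise<void> {
      let resourceVersion: string | undefined; // undefined = start from the latest state
      for (;;) {
        try {
          await watchWorkflow(namespace, name, resourceVersion);
        } catch (err) {
          // Assumed error shape; the server currently surfaces the 410 Gone as a 500.
          if (String(err).includes('too old resource version')) {
            const params = new URLSearchParams({ 'listOptions.fieldSelector': `metadata.name=${name}` });
            const res = await fetch(`/api/v1/workflows/${namespace}?${params}`);
            const list = await res.json();
            // Reset to the list's resourceVersion, not the workflow's stale one.
            resourceVersion = list.metadata?.resourceVersion;
          }
        }
        await new Promise((resolve) => setTimeout(resolve, 10_000)); // keep the 10s retry pacing
      }
    }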
All that being said, is passing the resource version necessary? When I first browse to the workflow it isn't passed. The code seems to be trying to get the latest version by using the current resource version from the workflow. Doing a watch in Kubernetes without passing the resource version gets you the latest, which seems like what you want in this case? (Ref https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter). I certainly may be missing some detail that makes this incorrect.
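If so, the reconnect could simply look the same as the initial connect. A sketch of that alternative, again using the hypothetical helper from above:

    // Alternative: never pin a resourceVersion on reconnect. An unset version
    // means "start from the current state", so the watch can always be established.
    async function reconnectWithoutVersion(namespace: string, name: string): Promise<void> {
      for (;;) {
        try {
          await watchWorkflow(namespace, name); // no resourceVersion parameter
        } catch {
          // connection dropped or server error; just wait and retry
        }
        await new Promise((resolve) => setTimeout(resolve, 10_000));
      }
    }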
@alexec I noticed this issue was closed. There really were two issues here, one being the rapid reconnect loop, and the other being that the reconnect can never work as currently implemented once the resource version on the workflow becomes too old. The rapid reconnect is fixed, but not the too old resource version. Should I log a new issue for that?
I think we can re-open this issue.
@kashook I'm trying to decide if we should fix this. It will be fixed in v3, but I'll have to create a fix for this too. I don't want to do that if I can help it.
Given that the UI v3 should come in January - do you think we should fix it on v2.12? Would you like to try and fix it?
@alexec if it's fixed in v3 in January we can wait until then, especially since the rapid reconnect loop is fixed. I certainly could try to fix it, but will probably hold off since a fix is coming. Thanks!
No problem. I'll leave open as it is still a bug.