Argo: View Workflow Page Not Receiving Progress Updates

Created on 16 Oct 2020  ·  20Comments  ·  Source: argoproj/argo

Summary

It seems like the view workflow page is no longer continuously receiving events. On 2.10.2 this did work where progress of the workflow could be viewed without refreshing. The SSE endpoint used to retrieve these events /workflow-events seems to be called once then after the request times out, a new watch request is not started. This may not be the only affected SSE endpoint.

Update:
This also appears to occur on the workflow list page. Where the SSE request eventually terminates with a 504
Gateway Timeout.

Diagnostics

GKE 1.15

On 2.11.4 i've noticed that the workflow graph isn't updating anymore. I upgraded to 2.11.4 from 2.10.2 where this was working before.

This has occurred when viewing any in progress workflow i've tried.

Paste the logs from the workflow controller:
these don't seem relevant


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

bug

All 20 comments

Would you like to see if you can identify the offending change?

Sure I'll have a look

I have a feeling it has something to do with the changes here: https://github.com/argoproj/argo/commit/6e3c5bef5c2bbfbef4a74b4c9c91e288b8e94735#diff-884cea1f0a354998c12cefc2b8097a881fe6d621fa91c1c549d241d6023fa5b5L42 .

In my browser i added a breakpoint here ui/src/app/shared/services/requests.ts and here ui/src/app/shared/services/requests.ts . In 2.10 it seems like the error (where the error is actually undefined) is caught and received by the chained then but in 2.11 its passed to the onerror handler. But i can't tell what would be causing the EventSource to not be reestablished.

What is the UI behaviour? Does it show an error?

No error in the ui i can see. It just stays static in the state it was when loading the page (workflow-details).

Locally i reverted just these changes https://github.com/argoproj/argo/commit/6e3c5bef5c2bbfbef4a74b4c9c91e288b8e94735#diff-0670537605e4212d6dada675c6498eabe38d45eb0d3594d59c0779905a83f18bR319-R322 and after the first request times out another one is established.

i have a nginx proxy set up locally in order to mimic the ingress in my deployment with configuration like:
```nginx.conf
events { }

http {
server {
proxy_http_version 1.1;

 listen 2345;

 set $argo_upstream http://127.0.0.1:2746;
 location / {
     proxy_set_header Connection '';
     proxy_buffering off;
     proxy_cache off;
     chunked_transfer_encoding off;
     proxy_pass $argo_upstream;
 }

}
}

`nginx -c ./nginx.conf`
And i modified the webpack dev config to proxy through the nginx proxy
```js
 devServer: {
   historyApiFallback: {
      disableDotRule: true
    },
    proxy: {
      "/api": {
        "target": isProd ? "" : "http://localhost:2345",
        "secure": false
      },

Can you test with argocli:latest please?

@alexec I had to modify my proxy config since the cli only works with a grpc connection:
```nginx.conf
events { }

error_log /Users/trevorfoster/tempproxy/error.log info;

http {
access_log /Users/trevorfoster/tempproxy/access.log;
server {
resolver 127.0.0.1;

 server_name localhost;
 listen 2345 http2;

 set $argo_upstream grpc://127.0.0.1:2746;
 location / {
     grpc_pass $argo_upstream;
 }

}
}


However the watch still times out after one minute because of nginx's default keep alive setting of 1 minute. This keepalive setting is also the reason why the ui request times out. If the request times out the client should re start the watch in my opinion.

Name: fantastic-whale
Namespace: argo
ServiceAccount: default
Status: Running
Created: Wed Oct 21 17:44:05 -0400 (1 minute ago)
Started: Wed Oct 21 17:44:05 -0400 (1 minute ago)
Duration: 1 minute 5 seconds
Progress: 0/1
Parameters:
message: hello argo

STEP TEMPLATE PODNAME DURATION MESSAGE
● fantastic-whale argosay fantastic-whale 1m
FATA[0060] rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR
```

@alexmt any thoughts on what does Argo CD does about this? Does it send some kind of keep-alive?

The way the UI is coded is that it should present an error notice if there is an error, and then automatically reconnect.

that's what i thought as well since the eventsource handles that but it doesn't seem to work.

In the event of an upstream timeout it does appear the request ends with a response code of 200. So it's possible the logic reestablishing the connection isn't triggered in this case. However the same response is returned on 2.10 and it does establish a new request.

Screen Shot 2020-10-22 at 10 03 32 PM

So we think that the request ends with 200, but we do not reconnect like we should.

Can you please test argoproj/argocli:ui-prog and see if this fixes your issue?

I checked out your changes locally and it seems like a new call still isn't being established in the details ui. I added a breakpoint around the code called on subscribe finish:

Screen Shot 2020-10-26 at 4 20 39 PM

It seems like the error handler is being called instead of the third handler where you added the recursive call.

I've made another change. Can you pull the image again and try again?

That seems to work! Very neat, should we make this change for the other eventsource subscribers?

@terev would you be interested in taking over on this fix?

I think we should maybe determine if the error is a disconnect (so reconnecting is a good idea) rather than 403 or 500 (which would result in going into a hot-loop).

Sweet that makes sense

Was this page helpful?
0 / 5 - 0 ratings

Related issues

iterion picture iterion  ·  3Comments

kounoike picture kounoike  ·  4Comments

basanthjenuhb picture basanthjenuhb  ·  3Comments

ddseapy picture ddseapy  ·  3Comments

alexlatchford picture alexlatchford  ·  3Comments