Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
It would be helpful if it were possible to see how much progress is being made by a long-running task
What you expected to happen:
A progress indication of some kind - ideally something that can be watched at the CLI and rendered in argo-ui.
For example, I have a job that runs in the middle of a workflow over a few million lines, and it takes a while. I'm trying to tune k8s autoscaling to scale out a service it uses, and it would be helpful to know how fast it is going and how far it is through the work. Rather than implementing my own solution to log this or produce metrics, it would be neat if there were a way to publish this information straight into Argo.
How to reproduce it (as minimally and precisely as possible):
N/A from here down...
We are also writing a custom solution for this, where workflows report progress to a REST endpoint. If this were built into argo, that would be nice.
See #3557
We've an ask for this in the user interface, both at a node and workflow level.
This could be done in the UI using existing backend code by getting the last successful execution of the workflow (determined by the workflows.argoproj.io/workflow-template label, i.e. it only works for templates and only if there was a last successful run). It is a straightforward lookup on the workflow.
This is not a _very_ popular issue, so warrants a low cost solution.
I've linked the PR, which provides a coarse way to track your workflow. It estimates how long it will take to complete and displays progress towards that time. This is purely time-based.
This is not as fine-grained or nuanced as some of the ideas in this PR. For example, if a workflow had 10 steps, another way to do this would be to base progress on the number of steps completed.
@brabster @dseapy please take a look and make suggestions or comments.
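To make the difference between the two approaches concrete, here is a minimal sketch in Python. It is not the code from the linked PR; the function names and the idea of passing in an estimated duration (for example, taken from the last successful run) are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def time_based_progress(started_at, now, estimated_duration):
    """Fraction of the estimated duration that has elapsed, clamped to [0, 1]."""
    estimate_seconds = estimated_duration.total_seconds()
    if estimate_seconds <= 0:
        return 0.0  # no usable estimate, e.g. no previous successful run
    elapsed = (now - started_at).total_seconds()
    return max(0.0, min(1.0, elapsed / estimate_seconds))


def step_based_progress(completed_steps, total_steps):
    """Fraction of steps that have completed, clamped to [0, 1]."""
    if total_steps <= 0:
        return 0.0
    return max(0.0, min(1.0, completed_steps / total_steps))


# Hypothetical values: the estimate stands in for the last successful run's duration.
start = datetime(2020, 9, 1, 12, 0, 0)
estimate = timedelta(minutes=10)
print(time_based_progress(start, start + timedelta(minutes=4), estimate))
print(step_based_progress(3, 10))
```

The time-based figure only needs a previous duration, whereas the step-based figure needs the total number of steps to be known up front.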
Unfortunately, most of our current workflow templates yield workflows that vary drastically from a couple minutes to several hours, so, as the PR mentions in the docs, it's not ideal for our current use case, but we will definitely keep this in mind for future wftmpls.
> vary drastically from a couple minutes
Any thought on other ways to estimate this that would work for you?
Currently, for most workflows we have a specific parameter (a hardcoded name we look for), whose value is the amount by which workflow progress should be incremented once that node is complete. This parameter is passed from a previous step that knows how many items/nodes there will be. Progress is computed by looking at the workflow in a shared informer and summing the parameter values across all nodes.
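A rough sketch of that summation, in Python rather than the Go informer code described above. The parameter name `progress-increment` is hypothetical, and the node dictionaries are a simplified stand-in for the `status.nodes` map of a workflow; only completed (succeeded) nodes are counted here, which is an assumption about how "complete" is interpreted.

```python
PROGRESS_PARAM = "progress-increment"  # hypothetical hardcoded parameter name


def workflow_progress(nodes):
    """Sum the increment parameter over all nodes that have succeeded."""
    total = 0
    for node in nodes.values():
        if node.get("phase") != "Succeeded":
            continue  # only completed nodes contribute
        for param in node.get("inputs", {}).get("parameters", []):
            if param.get("name") == PROGRESS_PARAM:
                total += int(param["value"])
    return total


# Simplified example data; real node statuses carry many more fields.
example_nodes = {
    "wf-1": {"phase": "Succeeded",
             "inputs": {"parameters": [{"name": PROGRESS_PARAM, "value": "40"}]}},
    "wf-2": {"phase": "Running",
             "inputs": {"parameters": [{"name": PROGRESS_PARAM, "value": "60"}]}},
    "wf-3": {"phase": "Succeeded",
             "inputs": {"parameters": [{"name": "other", "value": "x"}]}},
}
print(workflow_progress(example_nodes))
```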
For workflows with just a few nodes, we also have a rest service that allows nodes to increment the progress themselves. This progress is stored in a separate postgres table.
Both of these have the downside that the workflow needs to know about and report its own progress. What is in the PR clearly doesn't have that restriction, so I'm not sure I have a better generalized suggestion. At least not a performant one that doesn't involve periodically analysing lots of workflows from the archive.
@dseapy that's really interesting. Let me play that back so I can be sure I understand. In effect, you have a method for nodes to report their progress? That is just another way to report progress.
Proposal:
Currently, nodes report status by annotating their pod. What if there were an annotation that we recognise as progress? Your nodes could update it, and we could report it back via the CLI and UI.
If the annotation were absent, we would default to a computed metric.
Yeah, I believe that would work for our wftmpls.
@dseapy I've tweaked my POC. Setting annotations works today, but (a) it requires the workflow role to have pod patch permission, which is a security issue we want to remove, and (b) it exposes implementation details. Instead, what about just recognising a log line:
#argo progress=25/100
Better still:
#argo progress=25/100 message=custom message
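For illustration, a minimal Python sketch of how such a line could be recognised. This is not the POC's implementation; the regular expression, the function name, and the choice to reject lines where the total is zero or the count exceeds the total are all assumptions.

```python
import re

# Matches "#argo progress=N/M" with an optional trailing " message=...".
PROGRESS_LINE = re.compile(
    r"^#argo progress=(?P<done>\d+)/(?P<total>\d+)(?: message=(?P<message>.*))?$"
)


def parse_progress_line(line):
    """Return (done, total, message) for a progress line, or None otherwise."""
    match = PROGRESS_LINE.match(line.strip())
    if match is None:
        return None
    done, total = int(match["done"]), int(match["total"])
    if total == 0 or done > total:
        return None  # assumption: treat nonsensical values as "not a progress line"
    return done, total, match["message"]  # message is None when absent


print(parse_progress_line("#argo progress=25/100"))
print(parse_progress_line("#argo progress=25/100 message=custom message"))
print(parse_progress_line("an ordinary log line"))
```

On the container side, reporting progress would then amount to printing a line in this format to stdout.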
That does indeed sound much better for security and friendlier to the container. I'm guessing there is not too much of a performance hit to do the log parsing/matching for each pod?
I think this approach works for me too. I had long running tasks within a workflow that I tracked via logging,
I really like the progress stuff, definitely solves some UX issues I have! One thought is: The ability to have multiple independent progress meters would be nice (think things like monitoring rollouts of multiple different kubernetes workloads). Obviously can be handled currently by just breaking those out into distinct leaf nodes, so not really an issue.
Another thought would be to have the node config include a regex that would yield the progress information?
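As a sketch of what a configurable regex might look like, assuming the node config supplied a pattern with named groups `done` and `total` (both names are hypothetical, and nothing like this exists in Argo as far as this thread shows):

```python
import re


def progress_from_regex(pattern, line):
    """Apply a user-supplied pattern; return (done, total) or None if it does not match."""
    match = re.search(pattern, line)
    if match is None:
        return None
    groups = match.groupdict()
    if groups.get("done") is None or groups.get("total") is None:
        return None  # pattern lacks the expected named groups
    try:
        return int(groups["done"]), int(groups["total"])
    except ValueError:
        return None  # groups captured something that is not an integer


# Hypothetical pattern for a job that already logs its own counters.
pattern = r"processed (?P<done>\d+) of (?P<total>\d+) lines"
print(progress_from_regex(pattern, "processed 1200000 of 3000000 lines"))
```

This would let existing log output be reused without changing the container, at the cost of each template having to carry its own pattern.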
I think this way of parsing the stdout log works for me too 👍
@brabster @andyleap @dseapy I think we'll be including basic N/M progress in v2.12, but not workflows self-reporting their progress. I think this is a cool feature, and I want to know if one of you would like to take it on based on the design in #4015?
The implementation in https://github.com/argoproj/argo/pull/4015, reporting progress via #argo progress=N/M message="m" would work for our use cases!
@salanki thank you. I think what I'm saying is that I don't plan to work on this anymore. But if someone wants to take it on - that'd be great!