What happened/what you expected to happen?
I have enabled Argo metrics and scrape them with Prometheus, but the default metric argo_workflows_count{status="Pending"} increases continuously while argo_workflows_count{status="Running"} decreases continuously.
What Kubernetes provider are you using?
v1.14
What version of Argo Workflows are you running?
Paste a workflow that reproduces the bug, including status:
kubectl get wf -o yaml ${workflow}
Any workflow spec will do, since this is a bug in the Argo metrics.
Paste the logs from the workflow controller:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name) | grep ${workflow}
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
The Argo metrics in workflow-controller v2.11.7 are correct, so the bug was introduced somewhere between v2.11.7 and v2.12.0-rc2.
I added logs in WorkflowUpdated and found that no workflow was ever updated from Pending to Running: every call returned early because fromPhase == toPhase.
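To illustrate why that early return matters, here is a minimal sketch of an event-driven counter. It is not the actual Argo implementation: the `Counter` type, its methods, and the phase constants are hypothetical, and the sequence of events in `main` is only an assumption about how the symptoms could arise. If the "old" object handed to the update handler already carries the new phase, the Pending-to-Running transition is never recorded, so Pending stays inflated while later transitions push Running below its true value.

```go
package main

import "fmt"

// Phase is a simplified stand-in for a workflow phase (hypothetical type).
type Phase string

const (
	Pending   Phase = "Pending"
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
)

// Counter tracks workflows per phase purely from add/update events.
type Counter struct {
	counts map[Phase]int
}

func NewCounter() *Counter { return &Counter{counts: map[Phase]int{}} }

// WorkflowAdded records a newly observed workflow in phase p.
func (c *Counter) WorkflowAdded(p Phase) { c.counts[p]++ }

// WorkflowUpdated moves one workflow from one phase to another.
// It returns early when the phases are equal, as described in the thread.
func (c *Counter) WorkflowUpdated(from, to Phase) {
	if from == to {
		return
	}
	c.counts[from]--
	c.counts[to]++
}

// Count returns the current value for phase p.
func (c *Counter) Count(p Phase) int { return c.counts[p] }

func main() {
	c := NewCounter()
	c.WorkflowAdded(Pending)

	// Assumed failure mode: the cached "old" object was already changed to
	// Running, so the handler sees Running -> Running and does nothing.
	c.WorkflowUpdated(Running, Running)

	// The next transition is reported correctly, but Running was never
	// incremented, so it drops below zero.
	c.WorkflowUpdated(Running, Succeeded)

	fmt.Println(c.Count(Pending), c.Count(Running), c.Count(Succeeded))
}
```

In this sketch the single workflow has finished, yet the counter still reports one Pending workflow and a negative Running count, which has the same shape as the drift reported above.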
Thanks, will take a look
Offending commit is #4025
For an immediate workaround, set the environment variable INFORMER_WRITE_BACK=false in the workflow-controller environment.
@simster7 any way we could redesign this? This is not the first time it's had problems. That change was far away from the place of impact. What about using the workflow informer and querying it? Maybe using an indexer?
Yes, I am considering a "polling" redesign similar to that of the workflow controller, where the metrics code reads from the workflow lister every 10 seconds and updates the metrics as needed.
However I am reluctant to redesign this off the bat. The design of this was fairly robust before #4025. Although the change is far from the metrics code, this part of the code is dependent on the informers pattern working correctly. I'm still investigating whether this is a bug/issue with the informer code (and not with the metrics code), which seems likely based on the contents of #4025.
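The polling idea described above could be sketched roughly as follows. This is an assumption-laden illustration, not the controller's code: `Workflow`, `Lister`, `staticLister`, `CountByPhase`, and `Poll` are hypothetical names, and a real implementation would read from the controller's workflow lister and write to Prometheus gauges. The point is that counts are recomputed from a full snapshot on every tick, so a missed or malformed update event cannot cause permanent drift.

```go
package main

import (
	"fmt"
	"time"
)

// Workflow is a minimal stand-in for a workflow object (hypothetical type).
type Workflow struct {
	Name  string
	Phase string
}

// Lister is a hypothetical stand-in for the controller's workflow lister.
type Lister interface {
	List() []Workflow
}

// staticLister returns a fixed snapshot; used here only for demonstration.
type staticLister []Workflow

func (s staticLister) List() []Workflow { return s }

// CountByPhase recomputes the per-phase counts from scratch.
func CountByPhase(wfs []Workflow) map[string]int {
	counts := make(map[string]int)
	for _, wf := range wfs {
		counts[wf.Phase]++
	}
	return counts
}

// Poll publishes fresh counts on every tick until stop is closed.
func Poll(l Lister, interval time.Duration, stop <-chan struct{}, publish func(map[string]int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			publish(CountByPhase(l.List()))
		case <-stop:
			return
		}
	}
}

func main() {
	l := staticLister{
		{Name: "a", Phase: "Pending"},
		{Name: "b", Phase: "Running"},
		{Name: "c", Phase: "Running"},
	}
	stop := make(chan struct{})
	results := make(chan map[string]int, 1)

	// A short interval keeps the demo fast; the thread mentions 10 seconds.
	go Poll(l, 10*time.Millisecond, stop, func(m map[string]int) {
		select {
		case results <- m: // hand the first result to main
		default: // drop later results if nobody is reading
		}
	})

	counts := <-results
	close(stop)
	fmt.Println(counts["Pending"], counts["Running"])
}
```

The trade-off of this design is staleness: the metrics are only as fresh as the last tick, and whether they are accurate still depends on the lister returning correct objects.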
I get it, but my take is that if you read from the informer - you'll get accurate values - and it is hard for someone else to break this with another change.
I'm happy to take ownership of a fix for this if you like - I'm keen for you to get your hands dirty with the new UI.
POC:
diff.txt
> I get it, but my take is that if you read from the informer - you'll get accurate values - and it is hard for someone else to break this with another change.
This is exactly the issue. I have pinpointed the bug to the informer not producing correct information.
> I'm happy to take ownership of a fix for this if you like
No thanks, I'd like to own this issue
The question now is whether we should investigate why the informer is producing incorrect information, or change tracks and redesign.
The Running count looks correct now, and Succeeded is not just growing as before, however, Pending is consistently at 0 which is false.
@simster7 ☝️
The new model polls workflows every 15 seconds. Could I ask you to verify whether workflows stay in the Pending state for more than 15 seconds? It is possible that workflows simply don't spend enough time in Pending to be picked up by the polling model.
Just confirmed this is not the case... working on a fix
Fix is done: https://github.com/argoproj/argo/pull/4628