Users have been requesting better observability and error reporting in Flux for a while.
We should improve that situation, letting users monitor the state of Flux and diagnose any problems easily.
Right now we provide a few alternatives which aren't 100% satisfactory and should be improved:
- An event API, which was originally aimed at getting notifications on Weave Cloud's Deploy UI. This API isn't properly documented and no _official_ integrations exist. There is FluxCloud, which is great, but it isn't maintained by us and only really covers the Slack notifications use case. We should consider revamping the API to make integrations easier (e.g. using WebSub) or, at the very least, document it. Related issues: https://github.com/fluxcd/flux/issues/2695
- Logs. Very often users end up needing to grep logs in order to know what's happening. This is hairy and often confusing due to the poor quality of the error messages. Ideally, users shouldn't need to grep logs, and if they do end up needing to, the error messages should be clear (right now, in many situations, they aren't). Related issues: https://github.com/fluxcd/flux/issues/874
- Metrics. Flux does provide some Prometheus metrics, as documented at https://docs.fluxcd.io/en/1.17.1/references/monitoring.html. However, those metrics are not sufficient to diagnose problems and create alarms, and we don't provide a dashboard for them. Related issues: #2792 #2793 #2199
On top of that:
- The errors reported by fluxctl are sometimes not very intuitive (this happens transitively, since in many cases it gets the same errors fluxd prints in the logs). Related issues: #2839
- We should give better errors and warnings when users write inconsistent update annotations. This is particularly confusing for HelmRelease workloads. Related: https://github.com/fluxcd/flux/issues/2354#issuecomment-573014706
Hi @2opremio, I am looking forward to contributing to Flux and would like to participate in GSoC 2020. How can I get started on this issue?
Hi @supra08! Thanks a lot for the interest. I would start by reading the issues and the text above (the issue list is not exhaustive so you may want to dive through the other flux issues), using Flux, reproducing the problems mentioned and thinking about a strategy to improve things.
I don't know the logistic details of GSoC. I would think that the project needs to be accepted first? @dholbach will probably know better.
@hiddeco @2opremio
I am using fluxcd+helm-operator with fluxcloud. I deploy a container and the HelmRelease goes into a failed state with reason HelmUpgradeFailed, even though the HelmRelease YAML file has the attribute wait: true.
Fluxcd always reports to fluxcloud "result":{"default:helmrelease/myapp":{"Status":"success"
It's almost impossible to rely on failure notifications this way.
fluxcd executes kubectl -f release/myapp.yaml without checking the result of the deployed HelmRelease.
Regarding events: Kubernetes events now have a pretty decent ecosystem of tools that can monitor them, such as kubewatch and Argo Events.
Another option could be a webhook setting that shoots off a CloudEvent to a specified endpoint.
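To make the webhook idea a bit more concrete, here is a minimal sketch of what such a notification could look like, assuming the CloudEvents 1.0 structured JSON format. Nothing here is an existing Flux feature: the event type, the source value, the payload fields and both function names are hypothetical and chosen only for illustration.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone


def build_cloudevent(event_type, source, data):
    """Return a CloudEvents 1.0 envelope (structured JSON mode) as a dict."""
    return {
        # Required CloudEvents attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        # Optional attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


def post_cloudevent(url, event, timeout=5.0):
    """POST the event to a webhook endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/cloudevents+json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical event type, source and payload; not an actual Flux schema.
event = build_cloudevent(
    "io.fluxcd.example.release.failed",
    "/flux/example-cluster",
    {
        "workload": "default:helmrelease/myapp",
        "status": "failed",
        "reason": "HelmUpgradeFailed",
    },
)
```

Calling `post_cloudevent("https://example.invalid/hook", event)` would deliver it to a receiver (the URL is a placeholder). A standard envelope like this would let existing event tooling consume Flux notifications without a Flux-specific integration.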
I was really looking forward to implementing flux in our clusters, but the inability to know what's going on or what went wrong is a real deal breaker.
`info` field, some have `error` field, some have both `err` and `output`, even with logstash+elasticsearch parsing the logs is a nightmare. :(
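As an illustration of the workaround this inconsistency forces on users, here is a rough sketch of a normalization step. It assumes each log line has already been parsed into a key/value dict; the field names `info`, `error`, `err` and `output` come from the comment above, while the `normalize` function and the `level`/`message` output shape are hypothetical.

```python
def normalize(record):
    """Map a parsed log record onto a uniform {level, message[, output]} shape."""
    # Treat either spelling of the error field as an error-level entry.
    error = record.get("err") or record.get("error")
    if error:
        normalized = {"level": "error", "message": str(error)}
    else:
        normalized = {"level": "info", "message": str(record.get("info", ""))}
    # Keep command output, when present, as a separate field.
    if record.get("output"):
        normalized["output"] = str(record["output"])
    return normalized


# Made-up sample records using the field names mentioned above.
samples = [
    {"info": "sync complete"},
    {"error": "git repo not ready"},
    {"err": "exit status 1", "output": "error: unable to recognize manifest"},
]
normalized_samples = [normalize(sample) for sample in samples]
```

If the logs used one consistent set of field names, a step like this would not be needed in the first place.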
Hello everyone.
As shown at the link above, I've asked for some guidance and best practices when implementing new metrics with the CNCF-SIG Observability team.
I'm trying to write a proposal for new metrics, following the guidance given on that issue, which you can see here.
Any feedback on the proposed metrics, from users or maintainers, would be awesome. If anyone would like to add anything (especially use cases), that would be great too!