Flux: Improve observability and error reporting in Flux

Created on 3 Feb 2020 · 8 Comments · Source: fluxcd/flux

Users have been requesting better observability and error reporting in Flux for a while.

We should improve that situation, letting users monitor the state of Flux and diagnose any problems easily.

Right now we provide a few alternatives which aren't 100% satisfactory and should be improved:

An event API. It was originally aimed at sending notifications to Weave Cloud's Deploy UI, it isn't properly documented, and no _official_ integrations exist. There is FluxCloud, which is great, but it isn't maintained by us and only really covers the Slack notifications use case. We should consider revamping the API to make integrations easier (e.g. using WebSub) or, at the very least, document it. Related issues: https://github.com/fluxcd/flux/issues/2695
Logs. Very often users end up needing to grep logs in order to know what's happening. This is hairy and often confusing due to the quality of the errors. Ideally, users shouldn't need to grep logs at all. And, if they do end up needing to, the error messages should be clear (right now they aren't in many situations). Related issues: https://github.com/fluxcd/flux/issues/874
Metrics. Flux does provide some Prometheus metrics, as documented at https://docs.fluxcd.io/en/1.17.1/references/monitoring.html. However, those metrics are not sufficient to diagnose problems and create alarms, and we don't provide a dashboard for them. Related issues: #2792 #2793 #2199
On top of that:
The errors reported by fluxctl are sometimes not very intuitive (this happens transitively, since in many cases it gets the same errors fluxd prints in the logs). Related issues: #2839
We should give better errors and warnings when users write inconsistent update annotations. This is particularly confusing for HelmRelease workloads. Related: https://github.com/fluxcd/flux/issues/2354#issuecomment-573014706

enhancement help wanted ☂️ umbrella issue

2opremio

👍31

All 8 comments

Hi @2opremio, I am looking forward to contributing to Flux and would like to participate in GSoC 2020. How can I get started on this issue?

supra08 on 12 Feb 2020

Hi @supra08! Thanks a lot for the interest. I would start by reading the issues and the text above (the issue list is not exhaustive so you may want to dive through the other flux issues), using Flux, reproducing the problems mentioned and thinking about a strategy to improve things.

I don't know the logistic details of GSoC. I would think that the project needs to be accepted first? @dholbach will probably know better.

2opremio on 12 Feb 2020

Yes: https://summerofcode.withgoogle.com/how-it-works/#timeline

dholbach on 12 Feb 2020

@hiddeco @2opremio

For logging, would a central log be a good option, e.g. using fluentd to collect the logs, storing them in Elasticsearch and providing an interface through Kibana?
A Grafana dashboard with Prometheus would allow displaying metrics. Which other metrics would need to be captured to indicate a problem?
To look into improving error reporting, is there a list of possible errors that is already defined?
For the event API, is it about broadcasting events like when a commit is pushed and sync starts etc.?

omkarprabhu-98 on 12 Mar 2020

I am using fluxcd + helm-operator with fluxcloud. I deploy a container and the HelmRelease goes into a failed state with reason: HelmUpgradeFailed, even though the HelmRelease YAML file sets the attribute wait: true.

Fluxcd always reports to fluxcloud "result":{"default:helmrelease/myapp":{"Status":"success"

It's almost impossible to work with failure notifications this way.

fluxcd executes kubectl -f release/myapp.yaml without checking the result of the deployed HelmRelease.

c4m4 on 28 Mar 2020

Regarding events, Kube events now also have a pretty decent ecosystem of tools that can monitor them, like Kube watch and Argo Events.

Another option could be a webhook setting that shoots off a CloudEvent to a specified endpoint.
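To make the webhook idea a bit more concrete, here is a minimal sketch of what building and posting such a CloudEvent could look like. It is not something Flux provides: the function names, the event type `io.fluxcd.sync.failed`, the source string and the payload fields are all hypothetical. Only the attribute names (`specversion`, `id`, `source`, `type`, `time`, `datacontenttype`, `data`) and the `application/cloudevents+json` content type come from the CloudEvents 1.0 specification.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request


def build_cloud_event(event_type: str, source: str, data: dict) -> dict:
    """Build a CloudEvents 1.0 envelope (structured JSON mode)."""
    return {
        # Required CloudEvents attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        # Optional attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


def post_cloud_event(endpoint: str, event: dict, timeout: float = 5.0) -> int:
    """POST the event to a webhook endpoint and return the HTTP status code."""
    req = request.Request(
        endpoint,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/cloudevents+json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical event type, source and payload, for illustration only.
    event = build_cloud_event(
        "io.fluxcd.sync.failed",
        "/flux/example-cluster",
        {"workload": "default:helmrelease/myapp", "status": "failed"},
    )
    print(json.dumps(event, indent=2))
```

A receiver would then only need to understand one envelope format, and `post_cloud_event("https://example.invalid/hook", event)` (placeholder URL) would deliver it to whatever endpoint the user configured.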

RichiCoder1 on 29 Mar 2020

I was really looking forward to implementing flux in our clusters, but the inability to know what's going on or what went wrong is a real deal breaker.

Log structure is terrible: every log message seems to have a different structure. Some have an info field, some have an error field, some have both err and output. Even with logstash+elasticsearch, parsing the logs is a nightmare.
The errors in the logs aren't very informative. Even when there is an error message, it doesn't always specify which file caused the error.
Available metrics don't help a lot, I still need to search the logs to find what's wrong.
No support for notifications of any kind (not even a simple webhook).
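
As a rough illustration of the log-structure complaint, this is the kind of normalization step a user currently has to bolt on before the logs become searchable. It is only a sketch under assumptions: it takes records that have already been parsed into key/value pairs, the field names `info`, `error`, `err` and `output` are the ones mentioned above, and the target schema (`level`, `message`, `context`, `detail`, `extra`) as well as the sample records are made up for the example.

```python
from typing import Any, Dict

# The two spellings of the error key mentioned in the comment above.
ERROR_KEYS = ("error", "err")


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map a parsed log record with inconsistent keys onto one schema."""
    # Take the first non-empty error-like field, if any.
    error = next((record[k] for k in ERROR_KEYS if record.get(k)), None)
    info = record.get("info")

    normalized: Dict[str, Any] = {
        "level": "error" if error else "info",
        "message": str(error or info or ""),
    }
    # When both are present, the error wins and the info text becomes context.
    if error and info:
        normalized["context"] = str(info)
    if record.get("output"):
        normalized["detail"] = str(record["output"])

    # Keep every other field so that nothing is silently dropped.
    extra = {
        k: v for k, v in record.items()
        if k not in (*ERROR_KEYS, "info", "output")
    }
    if extra:
        normalized["extra"] = extra
    return normalized


if __name__ == "__main__":
    # Made-up sample records, not real Flux output.
    samples = [
        {"info": "sync started"},
        {"err": "apply failed", "output": "exit status 1", "caller": "loop.go:101"},
        {"error": "bad manifest", "info": "syncing"},
    ]
    for sample in samples:
        print(normalize_record(sample))
```

If every log line used a single error key and always carried the offending file, a shim like this would not be needed at all.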

azelezni on 6 Apr 2020

👍7

Hello everyone.
As shown at the link above, I've asked the CNCF SIG Observability team for some guidance and best practices on implementing new metrics.

I'm trying to write a proposal for new metrics, following the advice given on that issue, which you can see here.

Any feedback on the proposed metrics, from users or maintainers, would be awesome. If anyone would like to add anything (especially use cases), that would be great too!