Is there an API for creating alert rules and target servers?
Right now I have to create the .rules and .json files manually.
There's no such API. This has often been discussed in the past, and we have firmly decided not to provide such a feature.
There are many ways users might want to configure targets and rules, and files are our canonical interface to build tooling around.
Closing here but let us know if you have any further questions.
@fabxc @brian-brazil @yan0s It would be really nice if there was a link to another thread where this was discussed so it could be understood why it came to this.
This is highly inconvenient--not being able to create and manage alerting rules via an API.
alert.rules means a rebuild and redeploy of a container image. Any further information as to why the alert.rules file was decided on as the only interface, with no plans for API exposure, would be very helpful. Thank you for your hard work on this project.
It basically boils down to the fact that alerting rules should be checked into source code management. They should be reviewed when changes are made as they are critical parts of your infrastructure.
Config management can easily roll them out to your Prometheus from there.
What inevitably happens when making such things part of the local DB is that you will need to replicate and back up that data, and have procedures to restore from it. Setting these up is significantly more trouble than having a hook in git that rolls out changes to rule files.
Sooner or later people realize they actually want versioning, reviewability, etc. for those reasons. Over time, we see tools like Grafana gradually adding features that git already solves back into their tools, most recently versioning for dashboards. Another example is probably Jenkins, where you can configure everything via UI/API but as people get more serious about it, they tend to switch to declarative jobs checked into git.
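For illustration, here is a minimal sketch of what such a rollout step could look like when run from a git hook or CD job. It is not an official tool. The function names, the `*.rules` pattern, and the `http://localhost:9090` base URL are assumptions, and the `/-/reload` endpoint has to be reachable (and enabled, depending on the Prometheus version).

```python
import filecmp
import shutil
import urllib.request
from pathlib import Path


def sync_rules(src_dir, dest_dir, pattern="*.rules"):
    """Copy new or changed rule files from src_dir to dest_dir.

    Returns the names of the files that were copied.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    changed = []
    for rule_file in sorted(src.glob(pattern)):
        target = dest / rule_file.name
        # Skip files whose content is already identical on the target side.
        if target.exists() and filecmp.cmp(rule_file, target, shallow=False):
            continue
        shutil.copyfile(rule_file, target)
        changed.append(rule_file.name)
    return changed


def reload_prometheus(base_url="http://localhost:9090"):
    """POST to the reload endpoint. The base URL is an assumption."""
    req = urllib.request.Request(base_url + "/-/reload", data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def deploy(src_dir, dest_dir, reload=reload_prometheus):
    """Sync rule files and trigger a reload only if something changed."""
    changed = sync_rules(src_dir, dest_dir)
    if changed:
        reload()
    return changed
```

Calling `deploy("checkout/rules", "/etc/prometheus/rules")` (both paths hypothetical) from the hook keeps git as the source of truth and avoids reloading when nothing changed.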
This becomes particularly important once you want people to handle their own monitoring with just occasional advice and sanity checks from operations teams.
As for your specific case: baking rule and config files into your container images is understandably annoying, and it is generally strongly discouraged. Kubernetes provides config maps and volume mounts for exactly that purpose. With those you can simply reload the Prometheus process without restarting it.
With config maps, Kubernetes solves the rollout that config management would usually have to handle. In a way, it provides you with a REST API for rule files. I'd still recommend updating those configmaps via a CD pipeline from git.
TL;DR an API seems very appealing at first but does not have the properties you'll need as you get serious about rolling out automated and robust monitoring.
Since you are using Kubernetes already, you might want to give the prometheus-operator a look. It solves a lot of common tasks and makes it easy to follow Prometheus best practices while sticking to Kubernetes idioms. As you would expect, it takes a somewhat opinionated approach to do so.
Thanks very much for the explanation. I understand that there are varying opinions on how alerts should be created and managed, and all the options you summarize are good options. The Kubernetes config map option is, indeed, the best one, but would still be considered inelegant by some. Especially for consumers of Prometheus services that manage and configure all supporting services via API, it is just not a good experience. Some could also reasonably assert that this operation belongs with Prometheus and that the onus is not on Kubernetes. Our goal is to provide a way for developer/delivery teams to manage their own alerts. Depending on how many services are being delivered by our toolchain, the GitHub-Kubernetes solution could mean many process reloads of Prometheus on a daily basis, which ops folks may not be so comfortable with.
I'm all for devs managing their alerts in source code in their repos, but triggering a reload of a mission critical system process multiple times a day is something not so welcome.
As I researched this just within the last hour, I saw several others asking the same question, so there is definitely a lot of interest in this feature. It would be great if, at some point, the project folks would reconsider the position taken here. Thanks for listening.
Prometheus reloads do not have any notable costs. You can reload Prometheus several times an hour without any issues. It does not act like an in-process restart but just updates config parameters where necessary.
Feedback is definitely noted. But without wanting to give any false hopes, this particular issue is one we are pretty settled on.
OK, thanks. What if Kubernetes is not being used? Then what is the best solution in that case? Of course, something like a Chef databag (configuration management) could be used as a database to store aggregated alert rules or something similar, but it also does not seem like an efficient and resilient solution. Thanks again.
Yes, databags or simple plain/templated files are the right fit. Ultimately, we consider it part of the configuration. So whatever you are using to deploy the config file is a good fit for rule files as well.
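As a rough illustration of the templated-file approach, the sketch below renders one rule file per service from a template using only the standard library. It assumes the YAML rule format of Prometheus 2.x (older versions use a different rule syntax). The metric name `http_requests_errors_total`, the alert name, and the `render_rules` helper are all hypothetical.

```python
from string import Template

# Assumption: Prometheus 2.x YAML rule file layout. The metric name and
# alert name are made up for the example.
RULE_TEMPLATE = Template("""\
groups:
- name: ${service}
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_errors_total{job="${service}"}[5m]) > ${threshold}
    for: ${duration}
    labels:
      severity: ${severity}
      service: ${service}
""")


def render_rules(service, threshold, duration="10m", severity="page"):
    """Render the rule file content for one service."""
    return RULE_TEMPLATE.substitute(
        service=service,
        threshold=threshold,
        duration=duration,
        severity=severity,
    )


if __name__ == "__main__":
    # Whatever deploys the main config can write this text to a rule file.
    print(render_rules("checkout", 0.05))
```

The per-service values (here passed as arguments) would come from a databag or whatever your configuration management uses as its data source.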
Hi @brian-brazil @fabxc it appears that /reload causes Prometheus to cancel current in-flight scrapes and rule evaluations. I often see on the targets page that targets go down or unknown for some time after a /reload. If this is the case, then multiple reloads could cause issues with scrapes and alert evaluation. Can you please help me understand the reload behavior?
I am thinking from the point where several hundred or several thousand developers modify alerts several times a day due to code/config push.
I agree that the alert definitions belong in source control, but I would argue they belong in the repositories of the services being monitored. Having an API would enable tools to use service discovery to pass alerts configuration on to Prometheus based on the service definition. Having the filesystem being the API makes this approach/world-view very inconvenient ...
@antonagestam have you found a way to keep the alert definitions in source control together with the services ?
We have the same issue. Our services define our alerts since they also define certain service-specific metrics and those change over time. An API would significantly help with this and make the most sense logically.
We're considering adding a stand-alone service that provides a REST API to put configs on a shared volume and use the file_sd_config + trigger the /-/reload endpoint.
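A minimal sketch of the file-writing half of such a service, assuming the JSON layout that file_sd_config reads (a list of objects with `targets` and `labels`). The helper name `write_file_sd` and the example paths are hypothetical. The file is written to a temporary name and then renamed so that Prometheus never reads a half-written file. As far as I know, file_sd files are watched and picked up without a reload, while rule file changes still need one.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, labels=None):
    """Atomically write one target group in file_sd JSON format.

    Returns the data structure that was written.
    """
    groups = [{"targets": sorted(targets), "labels": labels or {}}]
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so os.replace stays on
    # one filesystem and is therefore atomic. The ".tmp" suffix keeps it
    # out of a "*.json" glob in file_sd_config.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        # mkstemp creates the file as 0600; make it readable for the
        # Prometheus user (assumed to differ from the writer).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return groups
```

A REST handler would then only need to call something like `write_file_sd("/shared/targets/checkout.json", ["10.0.0.5:9100"], {"job": "checkout"})` with values taken from the request.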
@tgeens Sorry for the lack of an answer. No, I didn't find a way at the time and I'm no longer working much with Prometheus (unfortunately, I'd really like to work with it more!). I think the way you're proposing is what I considered at the time. We were heavily using service discovery with Consul, so if you are too, it might be worth making the integration that way: the integration fetches configuration from Consul and writes it to the filesystem.