For those who, like me, support multiple teams on a single Prometheus infrastructure, it would be useful to be able to spread the Alertmanager config across multiple files.
I would like different teams to be able to control their own alert destinations and routing rules. It is simple enough to allow them to specify receivers and routes in their own files and combine those into a single file, but then a syntax error in any one file results in a config which Alertmanager will not load. That allows any single customer to break alerting for all customers.
Prometheus already does this by allowing us to specify multiple files for alert rules (and recording rules). So I have given control of the alert rules files to each customer. Any syntax error causes Prometheus to refuse to load that individual file. Therefore customers can only break their own alerts.
Typically we recommend that this be managed via configuration management (Ansible, Chef, etc.). After stitching all the files together, the resultant config can be checked with amtool check-config config.yml.
That's what I'm currently doing (stitching the files together; I'm not running amtool yet).
However, that really just tells me whether the resulting config is broken, and makes it far harder to determine which team caused the failure.
I doubt amtool check-config is going to help me with individual team config files, since all of those will be missing the global configuration options.
amtool gives an OK, but not excellent, description of the failure for a human reader:
~$ amtool check-config ./sandbox/alertmanager.conf -v
Checking './sandbox/alertmanager.conf' FAILED: unknown fields in route: hippopotamus
Error: failed to validate 1 file(s)
But this is largely useless to the automation I'm using to stitch the config together. I can't take action on this and then re-run without the offending team's file. I can't tell which route has this extra field or where in a route that field is.
So I can stitch them together, run amtool, and not push that config to the server if it fails. That prevents a team from breaking the existing alerting for all teams, but it still allows any team to put the global config into a state where it is no longer being updated to include the changes made by other teams. (Which still pretty much fits my definition of 'breaking alerting' for other teams.)
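One way to work around the "which team broke it" problem is to validate incrementally: add one team's fragment at a time on top of the global config and drop any fragment that makes the combined config fail. The following is only a minimal sketch under stated assumptions. The names `merge`, `stitch` and `amtool_validate` are hypothetical, team fragments are assumed to be already parsed into dicts with `receivers` and `routes` keys, and the candidate config is written as JSON on the assumption that Alertmanager's YAML parser accepts it.

```python
import copy
import json
import os
import subprocess
import tempfile
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]


def merge(base: Config, fragments: Dict[str, Config]) -> Config:
    """Return a copy of base with each fragment's receivers and routes appended."""
    merged = copy.deepcopy(base)
    receivers = merged.setdefault("receivers", [])
    routes = merged.setdefault("route", {}).setdefault("routes", [])
    for name in sorted(fragments):
        receivers.extend(fragments[name].get("receivers", []))
        routes.extend(fragments[name].get("routes", []))
    return merged


def amtool_validate(config: Config) -> bool:
    """Run `amtool check-config` on the config; assumes amtool is on PATH."""
    with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as fh:
        json.dump(config, fh)  # assumption: JSON is accepted as YAML here
        path = fh.name
    try:
        result = subprocess.run(["amtool", "check-config", path],
                                capture_output=True)
        return result.returncode == 0
    finally:
        os.remove(path)


def stitch(base: Config,
           fragments: Dict[str, Config],
           validate: Callable[[Config], bool] = amtool_validate,
           ) -> Tuple[Config, List[str]]:
    """Add team fragments one by one; return the config and the rejected teams."""
    accepted: Dict[str, Config] = {}
    rejected: List[str] = []
    for name in sorted(fragments):
        candidate = dict(accepted)
        candidate[name] = fragments[name]
        if validate(merge(base, candidate)):
            accepted = candidate
        else:
            rejected.append(name)  # this team's file is left out
    return merge(base, accepted), rejected
```

This costs one validation run per team, and a fragment that only fails in combination with another team's (for example a duplicate receiver name) is blamed on whichever team sorts later. The rejected list can then be used to notify the owning teams while everyone else's changes still go out.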
+1
This is something that needs to be handled outside the Alertmanager; you can use your existing configuration management systems (plus likely some tooling) to stitch things together. Pre-commit checks that the config is valid will also help.
Maybe this helps:
https://github.com/sysincz/k8s-sidecar-cm-to-file
Without something like this, it becomes exponentially more difficult to use Alertmanager in a multi-tenant K8s cluster. If each team could specify their routes in a dedicated config, it would make the automation in an operator so much simpler. Please support this.
If you really have that kind of constraint, I would recommend that you run multiple Alertmanagers.
@andrewrynhard We have written an alertmanager-config-controller. It watches for new/updated/deleted ConfigMaps, and if they define the specified annotations as true, it saves each resource (e.g. receivers, routes) from the ConfigMap to Alertmanager's local storage and reloads the Alertmanager. It will only reload the Alertmanager if the config is valid, so it will be "harder" for one single customer to break alerting for all customers. It also supports multiple AMs in one cluster. We're in the internal process of releasing it as open source :-)
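The "only reload if the config is valid" idea can be sketched roughly as follows. This is not the controller's actual code, just an illustration under assumptions: `deploy_if_valid` and `post_reload` are hypothetical names, the validator is any callable that takes a file path and returns a bool (for example a wrapper around amtool check-config), and the reload URL assumes an Alertmanager listening on localhost:9093.

```python
import os
import tempfile
import urllib.request
from typing import Callable


def post_reload(url: str = "http://localhost:9093/-/reload") -> None:
    """Ask Alertmanager to reload its config (the URL is an assumption)."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=10):
        pass


def deploy_if_valid(candidate_text: str,
                    target_path: str,
                    validate: Callable[[str], bool],
                    reload: Callable[[], None] = post_reload) -> bool:
    """Replace target_path and reload only if the candidate validates."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".yml")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(candidate_text)
        if not validate(tmp_path):
            return False  # keep the last known-good config in place
        os.replace(tmp_path, target_path)  # atomic swap on the same filesystem
        reload()
        return True
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Writing the candidate next to the live file and swapping it in with `os.replace` means the running Alertmanager never sees a half-written or invalid config.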
@dbluxo Please let me know when you have open sourced it. That is _exactly_ what I am writing as we speak :D.
@andrewrynhard https://github.com/dbsystel/alertmanager-config-controller
@dbluxo awesome. Thanks!
Any news about this?
90% of users are deploying Alertmanager on Kubernetes... isn't there some tool for this purpose? :weary: