Alertmanager: Silences are not propagated in a ha/mesh configuration (v0.15.0-rc1)

Created on 5 Apr 2018 · 13 Comments · Source: prometheus/alertmanager

What did you do?
Create a 2-replica Alertmanager setup.
Create a silence in Alertmanager from the exposed UI (I silenced the default "DeadMansSwitch" alert).

What did you expect to see?
All alertmanagers in the mesh should have the silence set

What did you see instead? Under which circumstances?
Only one of the 2 replicas seems to have the silence set.

Environment
prometheus-operator: v0.18.0
alertmanager: v0.15.0-rc.1
prometheus: v2.2.1

Notes
Reverted back to alertmanager v0.14.0 and it was working properly.
Sorry in advance if this is already on the radar.

Thanks guys.

All 13 comments

Can you provide any log information? Have you checked the status page in the web ui to confirm that the mesh has been formed?

Here is an example of the cluster when running the example HA setup, which is successfully gossiping silences:
[image: status page of the example HA cluster]

I sure can, here is the log from one of the Alertmanager pods:

level=info ts=2018-04-05T14:30:58.590730272Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-04-05T14:30:58.590839906Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-04-05T14:30:58.973752176Z caller=cluster.go:85 component=cluster err="couldn't deduce an advertise address: failed to parse bind addr ''"
level=warn ts=2018-04-05T14:30:58.99164516Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to resolve alertmanager-main-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-0.alertmanager-operated.monitoring.svc on 192.168.51.2:53: no such host\n* Failed to resolve alertmanager-main-1.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-1.alertmanager-operated.monitoring.svc on 192.168.51.2:53: no such host"
level=info ts=2018-04-05T14:30:58.992091208Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2018-04-05T14:30:58.992683838Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-04-05T14:30:58.996979028Z caller=main.go:346 msg=Listening address=:9093
level=info ts=2018-04-05T14:31:00.993007787Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000159597s
level=info ts=2018-04-05T14:31:04.993966965Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=2 before=1 now=2 elapsed=6.001219133s
level=info ts=2018-04-05T14:31:12.994686524Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=14.001941182s

Here is the view from Alertmanager:
[image: Alertmanager status page]

And here is the result from the api for each pod after creating a silence:
```
k port-forward alertmanager-main-0 9093:9093 -n monitoring
curl localhost:9093/api/v1/silences

{"status":"success","data":[{"id":"76857ac4-e656-4446-a15e-3726389bf809","matchers":[{"name":"severity","value":"none","isRegex":false},{"name":"alertname","value":"DeadMansSwitch","isRegex":false}],"startsAt":"2018-04-05T14:36:29.821772974Z","endsAt":"2018-04-05T16:36:25.737Z","updatedAt":"2018-04-05T14:36:29.821795651Z","createdBy":"Gael Mauleon","comment":"Test","status":{"state":"active"}}]}


k port-forward alertmanager-main-1 9093:9093 -n monitoring
curl localhost:9093/api/v1/silences

{"status":"success","data":[]}
```

level=warn ts=2018-04-05T14:30:58.973752176Z caller=cluster.go:85 component=cluster err="couldn't deduce an advertise address: failed to parse bind addr ''"

What are you setting as your --cluster.listen-address? Is it set to the port :6783? That would seem to be a possible error that the current code isn't accounting for.

There's also a lookup error, so it could be related to https://github.com/prometheus/alertmanager/issues/1307

Hmm, it might be related indeed; I don't have that error with 0.14.0.
Although I mentioned it in my environment section, to be extra clear: I'm using the prometheus-operator, so maybe 0.15 has some modifications that are not yet supported in the operator?

Looking at the generated args from the prometheus-operator, they are indeed different:

v0.14.0

  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --mesh.listen-address=:6783
    - --storage.path=/alertmanager
    - --web.listen-address=:9093
    - --web.external-url=http://my-private-external-adress-here/alertmanager
    - --web.route-prefix=/
    - --mesh.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc
    - --mesh.peer=alertmanager-main-1.alertmanager-operated.monitoring.svc

v0.15.0-rc.1

  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --cluster.listen-address=:6783
    - --storage.path=/alertmanager
    - --web.listen-address=:9093
    - --web.external-url=http://my-private-external-adress-here/alertmanager
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc:6783
    - --cluster.peer=alertmanager-main-1.alertmanager-operated.monitoring.svc:6783
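
To make the difference between the two argument lists easier to see, here is a minimal Python sketch that rewrites the v0.14.0 `--mesh.*` flags into the v0.15.0-rc.1 `--cluster.*` form quoted above. The helper name `mesh_to_cluster_args` and its `port` parameter are assumptions for illustration only; this is not the operator's actual code.

```python
def mesh_to_cluster_args(args, port=6783):
    """Rewrite v0.14-style --mesh.* flags into v0.15-style --cluster.* flags.

    Hypothetical helper: it only mirrors the two argument lists quoted in
    this thread, not the prometheus-operator implementation.
    """
    converted = []
    for arg in args:
        if arg.startswith("--mesh.listen-address="):
            converted.append(arg.replace("--mesh.", "--cluster.", 1))
        elif arg.startswith("--mesh.peer="):
            peer = arg.split("=", 1)[1]
            # The v0.15.0-rc.1 args above carry an explicit port on each peer.
            if ":" not in peer:
                peer = f"{peer}:{port}"
            converted.append(f"--cluster.peer={peer}")
        else:
            # Flags that are identical in both lists pass through unchanged.
            converted.append(arg)
    return converted


old_args = [
    "--mesh.listen-address=:6783",
    "--web.listen-address=:9093",
    "--mesh.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc",
]
new_args = mesh_to_cluster_args(old_args)
for arg in new_args:
    print(arg)
```

Apart from the flag prefix, the only visible change between the two lists is the explicit `:6783` port on each peer.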

@gmauleon your status page shows that the cluster is up and running so it is weird that the silences aren't propagated. I've tested in my local env (without the Prometheus operator but very similar setup with Statefulsets) and I can't reproduce it. Maybe you could share the statefulset definition which is generated by the operator?

@brancz @fabxc have you encountered anything like this using the Prometheus operator? Maybe one of you has a chance to take a look; I don't have access to a k8s cluster using this.

@gmauleon based on the status page, it does look like it's connected to a peer ... can you specify 0.0.0.0:6783 as the cluster.listen-address? I'm really not sure what that would do, but it would eliminate one error on startup.
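
As a rough illustration of why the suggestion might matter, the sketch below splits a listen address into host and port. It is plain string handling for simple `host:port` values, not Alertmanager's code, and the function name `split_listen_address` is hypothetical. It only shows that `:6783` has an empty host part, which would be consistent with the empty bind address in the startup warning.

```python
def split_listen_address(addr):
    """Split 'host:port' on the last colon and return (host, port).

    Illustrative only: handles simple host:port strings, not bracketed IPv6.
    """
    host, sep, port = addr.rpartition(":")
    if not sep:
        raise ValueError(f"missing port in address: {addr!r}")
    return host, int(port)


for addr in (":6783", "0.0.0.0:6783"):
    host, port = split_listen_address(addr)
    print(f"{addr!r}: host={host!r} port={port}")
```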

I ran into this same issue last night when running v0.18.0 of prometheus-operator/kube-prometheus (on a K8s cluster in AWS) - with my own modified copy of manifests/alertmanager.yaml to change the version of alertmanager being used to v0.15.0-rc.0.

But then I switched to using v0.15.0-rc.1 and everything worked. So perhaps a change was made in v0.15.0-rc.1 that resolves this issue? Though I do see the initial comment on this issue indicates rc1 was being used.

(Reference info: the "Support Alertmanager v0.15.0" PR was merged to master on 3/22, so it was included when "Cut 0.18.0" happened on 4/4.)

Sorry guys, I couldn't find the time to test further today (Stuart's suggestion). I will look into it by Monday evening at the latest.

And in my case I was indeed testing with rc1

I am able to reproduce this issue with:

  • Minikube: v0.24.1
  • K8s: v1.9.0
  • PO: v0.18.0
  • AM: v0.15.0-rc.1

I will look further into this. Eventually we should add a test _AddingSilenceCheckIfPropagated_ to the Prometheus operator e2e test suite.
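
A minimal sketch of what such a propagation check could look like, written in Python rather than in the operator's own test framework. It compares the silence IDs returned by `/api/v1/silences` on each replica, using trimmed copies of the two responses quoted earlier in this thread. The function names are hypothetical, and fetching the responses (for example through a port-forward) is left out.

```python
import json


def silence_ids(response_body):
    """Return the set of silence IDs in a /api/v1/silences JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    return {silence["id"] for silence in payload["data"]}


def missing_silences(responses):
    """Map each replica name to the silence IDs it lacks.

    `responses` maps a replica name to its raw response body. A silence
    counts as propagated only if every replica reports its ID.
    """
    ids_by_replica = {name: silence_ids(body) for name, body in responses.items()}
    all_ids = set().union(*ids_by_replica.values())
    return {name: all_ids - ids for name, ids in ids_by_replica.items()}


# Trimmed versions of the two responses quoted earlier in this thread.
responses = {
    "alertmanager-main-0": '{"status":"success","data":[{"id":"76857ac4-e656-4446-a15e-3726389bf809"}]}',
    "alertmanager-main-1": '{"status":"success","data":[]}',
}
missing = missing_silences(responses)
print(missing)
```

An empty set for every replica would mean the silence reached all of them; here `alertmanager-main-1` is missing the one silence that `alertmanager-main-0` holds.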

@gmauleon Thanks a lot for reporting this.

Another thing that was interesting for me is I didn't encounter this issue at all with minikube - not even when I was using rc0.
The only scenario where I ran into this issue was using rc0 on AWS.

@gmauleon https://github.com/coreos/prometheus-operator/pull/1193 should fix the issue.


I didn't encounter this issue at all with minikube - not even when I was using rc0.

@jolson490 That is very surprising to me. This should have never worked, even on minikube.

Thanks!

I will look further into this. Eventually we should add a test AddingSilenceCheckIfPropagated to the Prometheus operator e2e test suite.

@mxinden all for this, let's make it happen.
