This is a continuation from my findings on https://github.com/prometheus/prometheus/issues/1871
What did you do?
Sent the following request to the AM, where the startTime is zero and the endTime is an arbitrary time in the past:
Endpoint: http://127.0.0.1:9093/api/v1/alerts
[
{
"labels": {
"alertname": "InstanceUp",
"instance": "127.0.0.1:9115",
"job": "node",
"monitor": "test_prod",
"severity": "page"
},
"annotations": {
"description": "127.0.0.1:9115 of job node has been down for more than 5 minutes.",
"summary": "Instance 127.0.0.1:9115 down"
},
"startsAt": "0001-01-01T00:00:00Z",
"endsAt": "2018-01-10T12:42:30.584249Z",
"generatorURL": "http://Conors-MacBook-Pro-3.local:9090/graph?g0.expr=up+%3D%3D+0\\u0026g0.tab=1"
}
]
What seems to be happening:
The endTime is set to the current time and the startTime set to zero in Prometheus.
The Alertmanager then sets the received alert's startTime to the current time which will be ahead of the endTime and result in the "start time must be before end time" error.
Looking at the problem, it doesn't make sense to alter the start or end times when this occurs. Perhaps we should only set a default start time when the start time is zero and the alert is not resolved?
What did you expect to see?
{
"status": "success"
}
What did you see instead? Under which circumstances?
{
"status": "error",
"errorType": "bad_data",
"error": "start time must be before end time"
}
Environment
System information:
Darwin 17.3.0 x86_64
Alertmanager version:
HEAD
Prometheus version:
HEAD
Alertmanager configuration file:
N/A
Prometheus configuration file:
N/A
Logs:
level=error ts=2018-01-10T17:35:57.233839Z caller=api.go:803 msg="API error" err="bad_data: start time must be before end time"
Looking at the problem, it doesn't make sense to alter the start or end times when this occurs. Perhaps we should only set a default start time when the start time is zero and the alert is not resolved?
I think this makes sense. I'm a bit unsure about setting the start time when both values are empty, I would have to check how this would affect AM if a client is submitting alerts but never setting their start time.
When a client sends alerts to AM without a startTime, AM responds with a validation error that the start time is missing.
This issue may require some rethinking of the protocol for sending alerts.
Any possible fixes or workarounds for this? I'm seeing this with prometheus 2.1.0 and alertmanager 0.13.0.
I've merged a fix on head for Prometheus which should resolve this issue.
@Conorbro Do you plan to make a patch release for this, please?
I've dug into the code and found that the AM API delegates the validation of the incoming alerts to model.Alert from prometheus/common.
As @Conorbro noted previously, the current validation enforces both of the following:
- StartsAt isn't empty/zero.
- EndsAt is after StartsAt.
If we decide, as suggested above, that one of StartsAt and EndsAt (but not both) can be zero, then changes have to be made to prometheus/common (with the risk of breaking other packages' users?). It also requires changes to the way Alertmanager merges alerts; otherwise valid timestamps can be overwritten by zero. FWIW I've got a local branch with the changes. Quick tests seem to be OK, but there are probably still corner cases to be found.
Both fields are documented as optional, so it's more about making sure the defaults are okay.
I'm still getting this error message with prometheus 2.1.0 and alertmanager 0.14.0.
Happy to provide debug info if directed.
@aioue you would need to test with prometheus 2.2.0 rc.0 and alertmanager from master.