Describe the bug
Test loki instance seemed to lose its labels after a while (30+ minutes of inactivity)
To Reproduce
Steps initially used to reproduce the behavior:
EDIT: A better way to reproduce this is shown below. The above behaviour depends on an Ubuntu container running cron.hourly at 17 minutes past the hour, and otherwise not logging anything. From 47 minutes past the hour until 17 minutes past the next hour, labels had vanished.
Expected behavior
At startup I was able to read labels, and query logs via the API:
curl 'localhost:3100/api/prom/label'
curl 'localhost:3100/api/prom/label/job/values'
curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
... saw a bunch of logs
But after going away and coming back again, no labels were shown:
# curl 'localhost:3100/api/prom/label'
{}
# curl 'localhost:3100/api/prom/label/job/values'
{}
And yet the chunks are still there:
# ls /tmp/loki/chunks/
ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzMTZiODE2OjE2OWYzMTZiODYyOjdkNzMyNGQy
ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzMzk2ZWExOjE2OWYzMzk2ZWExOjE1MGVkZDA2
ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzNzA1Y2YzOjE2OWYzNzA1Y2YzOmY3YzNhNjk3
ZmFrZS83ZTAyYmFmNDA1NmYyN2Q0OjE2OWYzMTZiODE5OjE2OWYzMWE2MjU0OjE0ZDliNGE4
ZmFrZS83ZTAyYmFmNDA1NmYyN2Q0OjE2OWYzMjExYmEyOjE2OWYzMjI5ZmU0OjNkMzM2M2I3
ZmFrZS84ZTk0MDgzMzdiZDZlYzA0OjE2OWYzMTZiODFhOjE2OWYzMTZiODJmOjIyYmFhYWI2
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjM3MDVjZjM6MTY5ZjM3MDVjZjM6OGNiNmIwMjY='
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjMxNmI4MTc6MTY5ZjMxNmI4MjI6YWM3NWFkZmI='
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjMzOTZlYTE6MTY5ZjMzOTZlYTE6YTc3MjQ0MDA='
ZmFrZS85NDY0MWIwNjQ0MGRkNGE0OjE2OWYzMTZiODFiOjE2OWYzMTZiOGVmOjI0ODBjYWVh
ZmFrZS85ZjY0NjJiM2YxZmM5ZmZlOjE2OWYzMTZiODE5OjE2OWYzMTZiODFmOjE3NjA0ODFk
After a few minutes, some logs reappeared:
# curl 'localhost:3100/api/prom/label'
{"values":["__filename__","job"]}
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1084 100 1084 0 0 132k 0 --:--:-- --:--:-- --:--:-- 132k
{
"streams": [
{
"labels": "{__filename__=\"/var/log/syslog\", job=\"varlogs\"}",
"entries": [
{
"ts": "2019-04-06T17:02:01.225904931Z",
"line": "Apr 6 17:02:01 loki CRON[13227]: (root) CMD ( test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond)"
},
{
"ts": "2019-04-06T16:17:01.17174724Z",
"line": "Apr 6 16:17:01 loki CRON[13204]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)"
}
]
},
{
"labels": "{__filename__=\"/var/log/auth.log\", job=\"varlogs\"}",
"entries": [
{
"ts": "2019-04-06T17:02:01.22562594Z",
"line": "Apr 6 17:02:01 loki CRON[13226]: pam_unix(cron:session): session closed for user root"
},
{
"ts": "2019-04-06T17:02:01.22561571Z",
"line": "Apr 6 17:02:01 loki CRON[13226]: pam_unix(cron:session): session opened for user root by (uid=0)"
},
{
"ts": "2019-04-06T16:17:01.171625148Z",
"line": "Apr 6 16:17:01 loki CRON[13203]: pam_unix(cron:session): session closed for user root"
},
{
"ts": "2019-04-06T16:17:01.171614933Z",
"line": "Apr 6 16:17:01 loki CRON[13203]: pam_unix(cron:session): session opened for user root by (uid=0)"
}
]
}
]
}
It's as if the API doesn't always return all available labels and log records. However the API documentation doesn't mention this.
Actually, the API documentation doesn't give any defaults for the query parameters. I would expect 'end' to default to the current time. It's unclear if there's some default for 'limit', and/or 'start' defaults to some fixed offset before the end time.
Environment:
Additional
During all this time, no messages were logged to the screen where loki was running, except for the messages which were output when it started up.
# cd ~/go/src/github.com/grafana/loki; ./loki -config.file=./cmd/loki/loki-local-config.yaml
level=info ts=2019-04-06T14:44:43.408338242Z caller=loki.go:122 msg=initialising module=server
level=info ts=2019-04-06T14:44:43.408965936Z caller=gokit.go:36 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2019-04-06T14:44:43.409629113Z caller=loki.go:122 msg=initialising module=overrides
level=info ts=2019-04-06T14:44:43.409891026Z caller=override.go:33 msg="per-tenant overides disabled"
level=info ts=2019-04-06T14:44:43.410067021Z caller=loki.go:122 msg=initialising module=store
level=info ts=2019-04-06T14:44:43.410367709Z caller=loki.go:122 msg=initialising module=ingester
level=info ts=2019-04-06T14:44:43.410842862Z caller=loki.go:122 msg=initialising module=ring
level=info ts=2019-04-06T14:44:43.411310724Z caller=loki.go:122 msg=initialising module=distributor
level=info ts=2019-04-06T14:44:43.411437991Z caller=loki.go:122 msg=initialising module=querier
level=info ts=2019-04-06T14:44:43.413546771Z caller=loki.go:122 msg=initialising module=all
level=info ts=2019-04-06T14:44:43.413740112Z caller=main.go:56 msg="Starting Loki" version="(version=, branch=, revision=)"
level=info ts=2019-04-06T14:44:43.416685505Z caller=lifecycler.go:366 msg="entry not found in ring, adding with no tokens"
level=error ts=2019-04-06T14:44:43.416927349Z caller=consul_client.go:182 msg="error getting path" key=ring err=null
level=info ts=2019-04-06T14:44:43.417484986Z caller=lifecycler.go:296 msg="auto-joining cluster after timeout"
OK, it does seem to be something to do with start time; I am able to get all the logs I saw previously, by using:
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d&start=1554556352000000000' | jq .
where that start time is date +%s - 14400, multiplied up to nanoseconds.
(Aside: it's a bit annoying that the JSON output has "ts" in ISO datetime format, while the query string uses Unix epoch nanoseconds. It would be nice to stick with one or the other)
But I still don't understand why the labels vanished. If I query even an empty time range, with start=end, I now get labels:
# curl 'localhost:3100/api/prom/label?start=1554567417000000000&end=1554567417000000000'
{"values":["__filename__","job"]}
# curl 'localhost:3100/api/prom/label?start=0&end=1'
{"values":["__filename__","job"]}
But I definitely wasn't getting any earlier, when writing up this ticket.
I was able to reproduce, just leaving the system alone.
# while [ 1 ]; do curl 'localhost:3100/api/prom/label'; sleep 300; done
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
Suggests that maybe labels are only shown for log entries in the past half hour? Is this documented?
# date
Sat Apr 6 21:32:29 UTC 2019
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 606 100 606 0 0 60600 0 --:--:-- --:--:-- --:--:-- 60600
{
"streams": [
{
"labels": "{__filename__=\"/var/log/auth.log\", job=\"varlogs\"}",
"entries": [
{
"ts": "2019-04-06T21:17:01.407876399Z",
"line": "Apr 6 21:17:01 loki CRON[13472]: pam_unix(cron:session): session closed for user root"
},
{
"ts": "2019-04-06T21:17:01.407867827Z",
"line": "Apr 6 21:17:01 loki CRON[13472]: pam_unix(cron:session): session opened for user root by (uid=0)"
}
]
},
{
"labels": "{__filename__=\"/var/log/syslog\", job=\"varlogs\"}",
"entries": [
{
"ts": "2019-04-06T21:17:01.401544901Z",
"line": "Apr 6 21:17:01 loki CRON[13473]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)"
}
]
}
]
}
The label API endpoint doesn't appear to honour start and end. Testing during one of the times when the labels are missing:
# curl "localhost:3100/api/prom/label"
{}
# echo "localhost:3100/api/prom/label?start=$(($(date +%s)-3600))000000000&end=$(($(date +%s)))000000000"
localhost:3100/api/prom/label?start=1554621389000000000&end=1554624989000000000
# curl "localhost:3100/api/prom/label?start=$(($(date +%s)-3600))000000000&end=$(($(date +%s)))000000000"
{}
# curl "localhost:3100/api/prom/label?start=$(($(date +%s)-14400))000000000&end=$(($(date +%s)))000000000"
{}
The last query covers the time up to 4 hours ago, but still no labels are seen.
I also notice that logcli labels doesn't appear to send a start or end time in the request it sends, whereas logcli query does.
It's now pretty clear that the labels are visible for 30 minutes after HH:17, which is the time that cron.hourly runs.
# grep hourly /etc/crontab
17 * * * * root cd / && run-parts --report /etc/cron.hourly
now logcli can not query with timestamp.
Do you restart loki, now we do not support WAL(no write-ahead log implemented).
now
logclican not query with timestamp.
I'm using curl, not logcli, and I'm following the API documentation. It says that the /api/prom/label endpoint is "For retrieving the names of the labels one can query on". It does not say that it only returns labels which have been active in the last 30 minutes. Hence I was surprised that labels vanished from the list, and later reappeared.
This could be a question of fixing the documentation (if this is intentional behaviour), or it could be a question of fixing the the implementation.
One thing is for sure: I would be surprised to go into Grafana and query for available labels, and not find any. (This is what actually happened when I first tried Grafana+Loki; I switched to using the API directly to try to narrow down why logs were randomly disappearing).
Do you restart loki, now we do not support WAL(no write-ahead log implemented).
No - Loki was left running continuously for the whole test.
https://github.com/grafana/loki/blob/master/pkg/querier/http.go#L20
it is a default setting.
@tomwilkie hi, is there any thought about this setting?
I'm not sure that's the same.
defaulSince (sic) is one hour before current time, but as you can see from my sleep 300 loop, labels disappear after 30 minutes./api/prom/label with explicit start and end times, but it didn't make any difference. Therefore I think start/end are only used for log query, not label requests.I did find this 30 minute constant:
./pkg/ingester/ingester.go: f.DurationVar(&cfg.MaxChunkIdle, "ingester.chunks-idle-period", 30*time.Minute, "")
Could it be something here? When chunks are flushed, the labels are forgotten?
I will turn this into a reproducer which doesn't require promtail and doesn't depend on Ubuntu's cron.hourly behaviour.
/tmp/loki doesn't existcd ~/go/src/github.com/grafana/loki; ./loki -config.file=./cmd/loki/loki-local-config.yaml# curl -H Content-Type:application/json -d '{"streams":[{"labels":"{foo=\"bar\"}","entries":[{ "ts": "2018-12-18T08:28:06.801064-04:00", "line": "baz" }]}]}}' localhost:3100/api/prom/push
# curl localhost:3100/api/prom/label
{"values":["foo"]}
# sleep 1860; curl localhost:3100/api/prom/label
{}
# curl 'localhost:3100/api/prom/label?start=1545049686000000000&end=1545222486000000000'
{}
(Aside: for consistency I would have expected {"values":[]} here?)
But the main thing is, labels have vanished. Even if you give an explicit time range. The time range specified above is from 24 hours before to 24 hours after the given ts value.
irb(main):003:0> require 'time'
=> true
irb(main):004:0> t = Time.parse("2018-12-18T08:28:06.801064-04:00")
=> 2018-12-18 08:28:06 -0400
irb(main):005:0> t.to_i
=> 1545136086
irb(main):006:0> t.to_i - 86400
=> 1545049686
irb(main):007:0> t.to_i + 86400
=> 1545222486
Based on what I learned from the source code the /api/prom/label only queries the labels from the ingestion instances, which means that you are getting labels which are currently "active" (or have just recently been) (see https://github.com/grafana/loki/blob/master/pkg/querier/querier.go#L130)
When you query actual log data the query is sent to both ingestion nodes and the index backend. But for this you need to specify the "start" option as the label indices are spread along the time axis. I just created a PR https://github.com/grafana/loki/pull/461 which describes this behaviour, but not the first (active labels).
Does this explain what you are getting? It's another question should this be improved.
Does this explain what you are getting?
Sounds like it could. The 30 minute ingester idle time is not documented anywhere.
This behaviour isn't good. When I'm in the Grafana UI, I cannot see any logs (including historical logs) until I've selected at least one label. But if that particular label hasn't been active for >30 minutes, it's not shown. This makes loki pretty useless.
This issue is being worked on over the next couple of days. Thanks for the thorough reporting.
we are hitting the same issue. Any ETA on when the fix will be available?
Fix is in review !
Thank you.
Any more update on this?
Is there a fix yet? I'm having the same problem with latest docker image.
This particular issue was fixed with #521 where labels would be queried from the store.
If you are missing labels in explore you should expand out the timeframe of your query and reload the page (the reload is only required to work around a temporary change made in explore to accommodate slow label query performance which is fixed in the v11 schema update that will be made the default as soon as we figure out how to best adapt this in the helm chart)
Most helpful comment
Any more update on this?