Loki: Labels vanishing after 30 minutes of inactivity

Created on 6 Apr 2019  路  17Comments  路  Source: grafana/loki

Describe the bug
Test loki instance seemed to lose its labels after a while (30+ minutes of inactivity)

To Reproduce
Steps initially used to reproduce the behavior:

  1. Started loki (c9948233) in a screen session
  2. Started promtail (c9948233) to tail '/var/logs' using default config, in another screen session
  3. Show labels and log entries
  4. Go away for a while
  5. Show labels and log entries

EDIT: A better way to reproduce this is shown below. The above behaviour depends on an Ubuntu container running cron.hourly at 17 minutes past the hour, and otherwise not logging anything. From 47 minutes past the hour until 17 minutes past the next hour, labels had vanished.

Expected behavior

At startup I was able to read labels, and query logs via the API:

curl 'localhost:3100/api/prom/label'
curl 'localhost:3100/api/prom/label/job/values'
curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
... saw a bunch of logs

But after going away and coming back again, no labels were shown:

# curl 'localhost:3100/api/prom/label'
{}
# curl 'localhost:3100/api/prom/label/job/values'
{}

And yet the chunks are still there:

# ls /tmp/loki/chunks/
 ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzMTZiODE2OjE2OWYzMTZiODYyOjdkNzMyNGQy
 ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzMzk2ZWExOjE2OWYzMzk2ZWExOjE1MGVkZDA2
 ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzNzA1Y2YzOjE2OWYzNzA1Y2YzOmY3YzNhNjk3
 ZmFrZS83ZTAyYmFmNDA1NmYyN2Q0OjE2OWYzMTZiODE5OjE2OWYzMWE2MjU0OjE0ZDliNGE4
 ZmFrZS83ZTAyYmFmNDA1NmYyN2Q0OjE2OWYzMjExYmEyOjE2OWYzMjI5ZmU0OjNkMzM2M2I3
 ZmFrZS84ZTk0MDgzMzdiZDZlYzA0OjE2OWYzMTZiODFhOjE2OWYzMTZiODJmOjIyYmFhYWI2
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjM3MDVjZjM6MTY5ZjM3MDVjZjM6OGNiNmIwMjY='
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjMxNmI4MTc6MTY5ZjMxNmI4MjI6YWM3NWFkZmI='
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjMzOTZlYTE6MTY5ZjMzOTZlYTE6YTc3MjQ0MDA='
 ZmFrZS85NDY0MWIwNjQ0MGRkNGE0OjE2OWYzMTZiODFiOjE2OWYzMTZiOGVmOjI0ODBjYWVh
 ZmFrZS85ZjY0NjJiM2YxZmM5ZmZlOjE2OWYzMTZiODE5OjE2OWYzMTZiODFmOjE3NjA0ODFk

After a few minutes, some logs reappeared:

# curl 'localhost:3100/api/prom/label'
{"values":["__filename__","job"]}
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1084  100  1084    0     0   132k      0 --:--:-- --:--:-- --:--:--  132k
{
  "streams": [
    {
      "labels": "{__filename__=\"/var/log/syslog\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T17:02:01.225904931Z",
          "line": "Apr  6 17:02:01 loki CRON[13227]: (root) CMD (   test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond)"
        },
        {
          "ts": "2019-04-06T16:17:01.17174724Z",
          "line": "Apr  6 16:17:01 loki CRON[13204]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)"
        }
      ]
    },
    {
      "labels": "{__filename__=\"/var/log/auth.log\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T17:02:01.22562594Z",
          "line": "Apr  6 17:02:01 loki CRON[13226]: pam_unix(cron:session): session closed for user root"
        },
        {
          "ts": "2019-04-06T17:02:01.22561571Z",
          "line": "Apr  6 17:02:01 loki CRON[13226]: pam_unix(cron:session): session opened for user root by (uid=0)"
        },
        {
          "ts": "2019-04-06T16:17:01.171625148Z",
          "line": "Apr  6 16:17:01 loki CRON[13203]: pam_unix(cron:session): session closed for user root"
        },
        {
          "ts": "2019-04-06T16:17:01.171614933Z",
          "line": "Apr  6 16:17:01 loki CRON[13203]: pam_unix(cron:session): session opened for user root by (uid=0)"
        }
      ]
    }
  ]
}

It's as if the API doesn't always return all available labels and log records. However the API documentation doesn't mention this.

Actually, the API documentation doesn't give any defaults for the query parameters. I would expect 'end' to default to the current time. It's unclear if there's some default for 'limit', and/or 'start' defaults to some fixed offset before the end time.

Environment:

  • Ubuntu 18.04 lxd container running on Ubuntu 18.04 host
  • Loki built from source with go 1.12.1

Additional
During all this time, no messages were logged to the screen where loki was running, except for the messages which were output when it started up.

# cd ~/go/src/github.com/grafana/loki; ./loki -config.file=./cmd/loki/loki-local-config.yaml
level=info ts=2019-04-06T14:44:43.408338242Z caller=loki.go:122 msg=initialising module=server
level=info ts=2019-04-06T14:44:43.408965936Z caller=gokit.go:36 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2019-04-06T14:44:43.409629113Z caller=loki.go:122 msg=initialising module=overrides
level=info ts=2019-04-06T14:44:43.409891026Z caller=override.go:33 msg="per-tenant overides disabled"
level=info ts=2019-04-06T14:44:43.410067021Z caller=loki.go:122 msg=initialising module=store
level=info ts=2019-04-06T14:44:43.410367709Z caller=loki.go:122 msg=initialising module=ingester
level=info ts=2019-04-06T14:44:43.410842862Z caller=loki.go:122 msg=initialising module=ring
level=info ts=2019-04-06T14:44:43.411310724Z caller=loki.go:122 msg=initialising module=distributor
level=info ts=2019-04-06T14:44:43.411437991Z caller=loki.go:122 msg=initialising module=querier
level=info ts=2019-04-06T14:44:43.413546771Z caller=loki.go:122 msg=initialising module=all
level=info ts=2019-04-06T14:44:43.413740112Z caller=main.go:56 msg="Starting Loki" version="(version=, branch=, revision=)"
level=info ts=2019-04-06T14:44:43.416685505Z caller=lifecycler.go:366 msg="entry not found in ring, adding with no tokens"
level=error ts=2019-04-06T14:44:43.416927349Z caller=consul_client.go:182 msg="error getting path" key=ring err=null
level=info ts=2019-04-06T14:44:43.417484986Z caller=lifecycler.go:296 msg="auto-joining cluster after timeout"
componenloki kinbug

Most helpful comment

Any more update on this?

All 17 comments

OK, it does seem to be something to do with start time; I am able to get all the logs I saw previously, by using:

# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d&start=1554556352000000000' | jq .

where that start time is date +%s - 14400, multiplied up to nanoseconds.

(Aside: it's a bit annoying that the JSON output has "ts" in ISO datetime format, while the query string uses Unix epoch nanoseconds. It would be nice to stick with one or the other)

But I still don't understand why the labels vanished. If I query even an empty time range, with start=end, I now get labels:

# curl 'localhost:3100/api/prom/label?start=1554567417000000000&end=1554567417000000000'
{"values":["__filename__","job"]}
# curl 'localhost:3100/api/prom/label?start=0&end=1'
{"values":["__filename__","job"]}

But I definitely wasn't getting any earlier, when writing up this ticket.

I was able to reproduce, just leaving the system alone.

# while [ 1 ]; do curl 'localhost:3100/api/prom/label'; sleep 300; done
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}

Suggests that maybe labels are only shown for log entries in the past half hour? Is this documented?

# date
Sat Apr  6 21:32:29 UTC 2019
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   606  100   606    0     0  60600      0 --:--:-- --:--:-- --:--:-- 60600
{
  "streams": [
    {
      "labels": "{__filename__=\"/var/log/auth.log\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T21:17:01.407876399Z",
          "line": "Apr  6 21:17:01 loki CRON[13472]: pam_unix(cron:session): session closed for user root"
        },
        {
          "ts": "2019-04-06T21:17:01.407867827Z",
          "line": "Apr  6 21:17:01 loki CRON[13472]: pam_unix(cron:session): session opened for user root by (uid=0)"
        }
      ]
    },
    {
      "labels": "{__filename__=\"/var/log/syslog\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T21:17:01.401544901Z",
          "line": "Apr  6 21:17:01 loki CRON[13473]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)"
        }
      ]
    }
  ]
}

The label API endpoint doesn't appear to honour start and end. Testing during one of the times when the labels are missing:

# curl "localhost:3100/api/prom/label"
{}
# echo "localhost:3100/api/prom/label?start=$(($(date +%s)-3600))000000000&end=$(($(date +%s)))000000000"
localhost:3100/api/prom/label?start=1554621389000000000&end=1554624989000000000
# curl "localhost:3100/api/prom/label?start=$(($(date +%s)-3600))000000000&end=$(($(date +%s)))000000000"
{}
# curl "localhost:3100/api/prom/label?start=$(($(date +%s)-14400))000000000&end=$(($(date +%s)))000000000"
{}

The last query covers the time up to 4 hours ago, but still no labels are seen.

I also notice that logcli labels doesn't appear to send a start or end time in the request it sends, whereas logcli query does.

It's now pretty clear that the labels are visible for 30 minutes after HH:17, which is the time that cron.hourly runs.

# grep hourly /etc/crontab
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly

now logcli can not query with timestamp.
Do you restart loki, now we do not support WAL(no write-ahead log implemented).

now logcli can not query with timestamp.

I'm using curl, not logcli, and I'm following the API documentation. It says that the /api/prom/label endpoint is "For retrieving the names of the labels one can query on". It does not say that it only returns labels which have been active in the last 30 minutes. Hence I was surprised that labels vanished from the list, and later reappeared.

This could be a question of fixing the documentation (if this is intentional behaviour), or it could be a question of fixing the the implementation.

One thing is for sure: I would be surprised to go into Grafana and query for available labels, and not find any. (This is what actually happened when I first tried Grafana+Loki; I switched to using the API directly to try to narrow down why logs were randomly disappearing).

Do you restart loki, now we do not support WAL(no write-ahead log implemented).

No - Loki was left running continuously for the whole test.

https://github.com/grafana/loki/blob/master/pkg/querier/http.go#L20
it is a default setting.
@tomwilkie hi, is there any thought about this setting?

I'm not sure that's the same.

  • defaulSince (sic) is one hour before current time, but as you can see from my sleep 300 loop, labels disappear after 30 minutes.
  • I tried making calls to /api/prom/label with explicit start and end times, but it didn't make any difference. Therefore I think start/end are only used for log query, not label requests.

I did find this 30 minute constant:

./pkg/ingester/ingester.go: f.DurationVar(&cfg.MaxChunkIdle, "ingester.chunks-idle-period", 30*time.Minute, "")

Could it be something here? When chunks are flushed, the labels are forgotten?

I will turn this into a reproducer which doesn't require promtail and doesn't depend on Ubuntu's cron.hourly behaviour.

  1. Install loki
  2. Ensure /tmp/loki doesn't exist
  3. Start loki in a separate terminal or screen session:
    cd ~/go/src/github.com/grafana/loki; ./loki -config.file=./cmd/loki/loki-local-config.yaml
  4. Add a log entry, check labels, wait >30 minutes, check labels again. Like this:
# curl -H Content-Type:application/json -d '{"streams":[{"labels":"{foo=\"bar\"}","entries":[{ "ts": "2018-12-18T08:28:06.801064-04:00", "line": "baz" }]}]}}' localhost:3100/api/prom/push
# curl localhost:3100/api/prom/label
{"values":["foo"]}
# sleep 1860; curl localhost:3100/api/prom/label
{}
# curl 'localhost:3100/api/prom/label?start=1545049686000000000&end=1545222486000000000'
{}

(Aside: for consistency I would have expected {"values":[]} here?)

But the main thing is, labels have vanished. Even if you give an explicit time range. The time range specified above is from 24 hours before to 24 hours after the given ts value.

irb(main):003:0> require 'time'
=> true
irb(main):004:0> t = Time.parse("2018-12-18T08:28:06.801064-04:00")
=> 2018-12-18 08:28:06 -0400
irb(main):005:0> t.to_i
=> 1545136086
irb(main):006:0> t.to_i - 86400
=> 1545049686
irb(main):007:0> t.to_i + 86400
=> 1545222486

Based on what I learned from the source code the /api/prom/label only queries the labels from the ingestion instances, which means that you are getting labels which are currently "active" (or have just recently been) (see https://github.com/grafana/loki/blob/master/pkg/querier/querier.go#L130)

When you query actual log data the query is sent to both ingestion nodes and the index backend. But for this you need to specify the "start" option as the label indices are spread along the time axis. I just created a PR https://github.com/grafana/loki/pull/461 which describes this behaviour, but not the first (active labels).

Does this explain what you are getting? It's another question should this be improved.

Does this explain what you are getting?

Sounds like it could. The 30 minute ingester idle time is not documented anywhere.

This behaviour isn't good. When I'm in the Grafana UI, I cannot see any logs (including historical logs) until I've selected at least one label. But if that particular label hasn't been active for >30 minutes, it's not shown. This makes loki pretty useless.

This issue is being worked on over the next couple of days. Thanks for the thorough reporting.

we are hitting the same issue. Any ETA on when the fix will be available?

Fix is in review !

Thank you.

Any more update on this?

Is there a fix yet? I'm having the same problem with latest docker image.

This particular issue was fixed with #521 where labels would be queried from the store.

If you are missing labels in explore you should expand out the timeframe of your query and reload the page (the reload is only required to work around a temporary change made in explore to accommodate slow label query performance which is fixed in the v11 schema update that will be made the default as soon as we figure out how to best adapt this in the helm chart)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

cyriltovena picture cyriltovena  路  4Comments

naughtyGitCat picture naughtyGitCat  路  3Comments

arnitolog picture arnitolog  路  6Comments

steven-sheehy picture steven-sheehy  路  4Comments

adityacs picture adityacs  路  5Comments