Loki: Broad search misses entries

Created on 24 May 2020  路  7Comments  路  Source: grafana/loki

Describe the bug
Querying logs with a broad stream selector return no entries.

To Reproduce
Loki version

loki, version 1.5.0 (branch: HEAD, revision: 12c7eab8)
  build user:       root@9873e604c08a
  build date:       2020-05-20T15:54:17Z
  go version:       go1.13.4

Search using broad selector that matches all streams produces no results
Limiting the stream selector to a subset of streams produces results.

Expected behavior
All logs should appear

Environment:

  • Infrastructure: CentOS 7
  • Deployment tool: process running on OS

Screenshots, Promtail config, or terminal output
Search using broad selector produces no result:

logcli --addr http://lokiserver:3100  query  '{service=~".*"}|="qvKp53cVkh"'
http://lokiserver:3100/loki/api/v1/query_range?direction=BACKWARD&end=1590285315386793656&limit=30&query=%7Bservice%3D~%22.%2A%22%7D%7C%3D%22qvKp53cVkh%22&start=1590281715386793656

Query with narrow stream selector

logcli --addr http://lokiserver:3100  query  '{service=~"i.*"}|="qvKp53cVkh"'
http://lokiserver:3100/loki/api/v1/query_range?direction=BACKWARD&end=1590285513524785177&limit=30&query=%7Bservice%3D~%22i.%2A%22%7D%7C%3D%22qvKp53cVkh%22&start=1590281913524785177
Common labels: {hostname="app-1", service="infobanking"}
2020-05-24T01:43:44Z {level="info"}  qvKp53cVkh [Req-stop  ] GET /infobanking/heartbeat status=200, size=255, time=0.0002
2020-05-24T01:43:44Z {level="info"}  qvKp53cVkh [-         ] 127.0.0.1 - - [24/May/2020 01:43:44] "GET /infobanking/heartbeat HTTP/1.1 200 255" 0.0002
2020-05-24T01:43:44Z {level="debug"} qvKp53cVkh [-         ] response header: GET /infobanking/heartbeat
X-Correlation-Id: [qvKp53cVkh]
Access-Control-Expose-Headers: [X-Correlation-ID]
Access-Control-Allow-Origin: [*]
Content-Type: [application/json]
keepalive kinbug kinenhancement

All 7 comments

@tanioschahine This is kind of strange. I am unable to reproduce this

So I dig into this a bit, and in fact the problem comes from the ingesters. On the store side though it works.

The problem is that ".*" is considered a filter and since it's the only matchers given in this query, nothing from the integester is returned.

Here is the ingester code :

// forMatchingStreams will execute a function for each stream that satisfies a set of requirements (time range, matchers, etc).
// It uses a function in order to enable generic stream access without accidentally leaking streams under the mutex.
func (i *instance) forMatchingStreams(
    matchers []*labels.Matcher,
    fn func(*stream) error,
) error {
    i.streamsMtx.RLock()
    defer i.streamsMtx.RUnlock()

    filters, matchers := cutil.SplitFiltersAndMatchers(matchers)
    ids := i.index.Lookup(matchers)

and the SplitFiltersAndMatchers looks like this.

// SplitFiltersAndMatchers splits empty matchers off, which are treated as filters, see #220
func SplitFiltersAndMatchers(allMatchers []*labels.Matcher) (filters, matchers []*labels.Matcher) {
    for _, matcher := range allMatchers {
        // If a matcher matches "", we need to fetch possible chunks where
        // there is no value and will therefore not be in our label index.
        // e.g. {foo=""} and {foo!="bar"} both match "", so we need to return
        // chunks which do not have a foo label set. When looking entries in
        // the index, we should ignore this matcher to fetch all possible chunks
        // and then filter on the matcher after the chunks have been fetched.
        if matcher.Matches("") {
            filters = append(filters, matcher)
        } else {
            matchers = append(matchers, matcher)
        }
    }
    return
}

Now this is not a problem in cortex because they don't accept those queries:
image

Moreover in Cortex/Prometheus you usually give a metric name which translate into an equality matcher like this foo{bar=~".*"} => {__name__="foo", bar=~".*"}

If we test this out no error this time:
image

I see 2 options to solve this:

1) Send a 400 for those requests (like Cortex) and require the user to supply at least one non filter matcher, it can be very simple to add a cluster or a job label that is assigned to all streams.
2) Shortcircuit Cortex when no matchers is supplied and returns the full list of streams/finguerprint we have in the ingesters.

I personally vouch for 1) as those queries will not really scale well and super easy to rewrite with a common label, but I'm open to suggestions since this is kinda an API break. I'm happy to implement 2) too but compared to Cortex we don't have this on our side.

    fps := u.index.Lookup(matchers)
    if len(fps) > u.limiter.MaxSeriesPerQuery(u.userID) {
        return httpgrpc.Errorf(http.StatusRequestEntityTooLarge, "exceeded maximum number of series in a query")
    }

e.g MaxStreamPerQuery doesn't exist to protect us.

/cc @owen-d @slim-bean

I think we should support querying .*. It's very common for a user to search .*. Of course we should have MaxStreamPerQuery/MaxSeriesPerQuery limit.

Hrm, I would prefer 1). By doing '{service=~".*"}, you're basically saying give me "all series", which Prometheus tries to protect against. See: https://github.com/prometheus/prometheus/blob/master/promql/parser/parse.go#L586-L598

Now, you can basically query for all series in Prometheus, by doing {__name__=~".+"}, and similarly here by doing {service=~".+"}, but this also means you kinda know exactly what you're doing (or atleast thats the assumption).

While I would prefer 1), I am not sure if the guard-rail makes a lot of sense for Loki.

Thank you all. I will go with {service=~".+"} for now. But I guess it would make sense to have an "all series" switch by just not specifying the stream selector.

Hrm. I think we validate against this and eventually suggest alternatives like {service=~".*"}. While we could rely on limits to enforce this, I'd prefer not to as we'd still try to resolve _every_ stream against the index. I don't have any concrete data, but I'm afraid that we could see performance issues at the index level for high throughput tenants.

This seems like another data point in favor of the heuristics pkg we haven't built yet :)

We discussed this a bit more and the approach we are going to go with here is #1 of the options @cyriltovena provided.

Additionally it would be good if the error message provided some constructive feedback to the user such as: please add at least one non regex label matcher or use .+ instead of .*

Was this page helpful?
0 / 5 - 0 ratings

Related issues

TakumaNakagame picture TakumaNakagame  路  5Comments

gouthamve picture gouthamve  路  4Comments

suppix picture suppix  路  3Comments

bzon picture bzon  路  5Comments

Menda picture Menda  路  5Comments