Thanos: Reduce unnecessary StoreAPI series calls

Created on 8 Oct 2019 · 21 Comments · Source: thanos-io/thanos

Thanos currently performs a "possibly in set" check to exclude certain StoreAPIs from calls. The check excludes a StoreAPI only if all of its sets of external labels indicate that the time series selected by a query can never include data from that StoreAPI. Although every exclusion made this way is correct, the potential for empty responses is still large, because a store is excluded only when the query actually selects on the labels that the StoreAPI happens to expose.
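
As a rough illustration of the check described above, the following Go sketch excludes a store only when every one of its external label sets contradicts the query. The names `LabelSet`, `Matcher`, `contradicts` and `possiblyInSet` are hypothetical simplifications for this example and not the Thanos implementation; only equality matchers are modeled.

```go
package main

// LabelSet is a hypothetical stand-in for one set of external labels
// advertised by a StoreAPI.
type LabelSet map[string]string

// Matcher is a simplified equality matcher (name="value").
// Real queries also use !=, =~ and !~, which this sketch ignores.
type Matcher struct {
	Name, Value string
}

// contradicts reports whether a label set rules out the matchers:
// some matcher names a label in the set but requires a different value.
func contradicts(set LabelSet, ms []Matcher) bool {
	for _, m := range ms {
		if v, ok := set[m.Name]; ok && v != m.Value {
			return true
		}
	}
	return false
}

// possiblyInSet returns false only if every external label set
// contradicts the query. A store with no label sets is never excluded.
func possiblyInSet(sets []LabelSet, ms []Matcher) bool {
	if len(sets) == 0 {
		return true
	}
	for _, s := range sets {
		if !contradicts(s, ms) {
			return true
		}
	}
	return false
}
```

With a store advertising `cluster="eu"`, a query for `cluster="us"` is excluded, but a query that selects only on `__name__` still reaches the store, which is the gap this issue is about.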

To expand these "possibly in set" checks, I propose that StoreAPIs additionally expose one or more probabilistic data structures via the Info API, next to the sets of external labels. Data structures such as Bloom filters or Cuckoo filters could be good candidates. This way we could minimize unnecessary calls to StoreAPIs in a space-efficient way.
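
To make the idea concrete, here is a minimal Bloom filter sketch in Go. It is an assumption-laden illustration, not a proposed wire format: the type `Bloom`, its sizing parameters `m` and `k`, and the choice of FNV-1a with double hashing are all picked for brevity, and which strings a store would insert is left open by the proposal.

```go
package main

import "hash/fnv"

// Bloom is a minimal Bloom filter: each item sets k positions in an m-bit array.
type Bloom struct {
	bits []uint64
	m, k uint64
}

// NewBloom creates a filter with m bits and k hash probes per item.
// Zero values are replaced with small defaults to avoid division by zero.
func NewBloom(m, k uint64) *Bloom {
	if m == 0 {
		m = 64
	}
	if k == 0 {
		k = 1
	}
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions returns the k bit positions for s, using double hashing
// over the two halves of a 64-bit FNV-1a digest.
func (b *Bloom) positions(s string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, (sum>>32)|1
	pos := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % b.m
	}
	return pos
}

// Add inserts s into the filter.
func (b *Bloom) Add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain returns false only if s was definitely never added.
// A true result may be a false positive.
func (b *Bloom) MayContain(s string) bool {
	for _, p := range b.positions(s) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}
```

The property that matters for the querier is that `MayContain` has no false negatives: a store is skipped only when the filter says "definitely not here", so a false positive costs one unnecessary call but never drops data.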

This would not only improve performance, but also reliability, as any additional and unnecessary network request has the potential to cause what seems like a partial response even when it's not.

To verify that this would be useful, I propose first instrumenting the querier with metrics that show whether these empty StoreAPI series requests are actually happening and therefore impacting performance; otherwise this optimization is likely not worth it.
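
A sketch of what such instrumentation could track, assuming we count Series calls per store and how many of them returned nothing. The type `emptyCallStats` and its methods are hypothetical; a real querier would presumably expose these as Prometheus counters rather than in-memory maps, which are used here only to keep the example self-contained.

```go
package main

import "sync"

// emptyCallStats counts, per store address, how many Series calls were
// made and how many of them returned no series.
type emptyCallStats struct {
	mu    sync.Mutex
	total map[string]int
	empty map[string]int
}

func newEmptyCallStats() *emptyCallStats {
	return &emptyCallStats{total: map[string]int{}, empty: map[string]int{}}
}

// observe records one completed Series call to the given store.
func (s *emptyCallStats) observe(store string, seriesReturned int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.total[store]++
	if seriesReturned == 0 {
		s.empty[store]++
	}
}

// emptyRatio returns the fraction of calls to store that returned
// nothing, or 0 if the store has not been called yet.
func (s *emptyCallStats) emptyRatio(store string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.total[store] == 0 {
		return 0
	}
	return float64(s.empty[store]) / float64(s.total[store])
}
```

A consistently high empty ratio for some stores would indicate that a filter could pay off; a ratio near zero would suggest the optimization is not worth the complexity.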

@bwplotka @domgreen @GiedriusS @metalmatze @squat @kakkoyun

query hard help wanted proposal

brancz

👍6

All 21 comments

Bloom filters would be a great tool for this. +1 on investigating the false positive rate for querier api requests.

squat on 8 Oct 2019

👍5

In theory it sounds good, but I wonder about the performance implications of keeping that Bloom filter up to date, especially in the case of Sidecar/Prometheus. This probably makes the most sense in environments where there are a bunch of disparate Thanos Query instances and one "on top" that is being queried. Either way, it would be nice to have such a metric so that we could justify this idea better.

GiedriusS on 8 Oct 2019

In my eyes, such a filter would maintain only a small, very frequently used portion of the labels, so it should be fine.

bwplotka on 8 Oct 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 11 Jan 2020

I think we have yet to instrument the proxy store type (used with the query component) to validate whether, and how much, this is happening. I suspect that it will vary considerably across environments.

brancz on 13 Jan 2020

With #2030 merged we should be able to at least identify the potential for this.

brancz on 24 Jan 2020

👍2

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] on 23 Feb 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] on 25 Mar 2020

Maybe it's still valid.

jojohappy on 25 Mar 2020

The metrics in our prod environment seem to show that we do _a lot_ of requests that return empty responses.

brancz on 26 Mar 2020

👍1

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] on 25 Apr 2020

🚀2

As mentioned in the previous comment, there is definitely room for improvement still.

brancz on 27 Apr 2020

stale[bot] on 27 May 2020

stale[bot] on 26 Jun 2020

Still valid.

GiedriusS on 26 Jun 2020

stale[bot] on 26 Jul 2020

stale[bot] on 26 Aug 2020

Do we have any plans/design docs about this feature? I have some time recently so maybe I can do some experiments first.

One simple idea is to fetch all metric names (/api/v1/label/__name__/values) for each store in the Info() StoreAPI call and keep them in a Bloom filter in StoreRef. Keeping only metric names does not help for queries that do not specify __name__, but it works for most cases and I think it is a good start.

The problem is that the Info API is called every 5s for each store, so I am not sure about the performance overhead (it might be large, because we need to query all metric names in the store's time range).

How can elements in the Bloom filter be updated or deleted efficiently on each update?
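
The metric-name idea could look roughly like the Go sketch below. Everything here is hypothetical: `nameFilter`, `exactSet`, `Matcher`, `buildFilter` and `shouldQuery` are illustration-only names, and a plain map stands in for the Bloom filter so the example stays short. Queries without an equality matcher on `__name__` fall back to querying the store. Since a standard Bloom filter does not support deletion, one option (an assumption, not something settled here) is to rebuild the whole filter from the current name list on each refresh.

```go
package main

// nameFilter is a hypothetical membership interface. A Bloom filter
// would satisfy it; false positives are allowed, false negatives are not.
type nameFilter interface {
	MayContain(name string) bool
}

// exactSet is a map-backed stand-in for a Bloom filter.
type exactSet map[string]struct{}

func (s exactSet) MayContain(name string) bool {
	_, ok := s[name]
	return ok
}

// buildFilter creates a fresh filter from the store's current metric
// names. Rebuilding on refresh sidesteps deletion entirely.
func buildFilter(names []string) nameFilter {
	s := make(exactSet, len(names))
	for _, n := range names {
		s[n] = struct{}{}
	}
	return s
}

// Matcher is a simplified label matcher; Equal is true for name="value".
type Matcher struct {
	Name, Value string
	Equal       bool
}

// metricName returns the metric name pinned by an equality matcher
// on __name__, if the query has one.
func metricName(ms []Matcher) (string, bool) {
	for _, m := range ms {
		if m.Name == "__name__" && m.Equal {
			return m.Value, true
		}
	}
	return "", false
}

// shouldQuery decides whether a store needs to be called. Without a
// filter or without a pinned metric name it cannot prune and says yes.
func shouldQuery(f nameFilter, ms []Matcher) bool {
	name, ok := metricName(ms)
	if f == nil || !ok {
		return true
	}
	return f.MayContain(name)
}
```

Whether a full rebuild is cheap enough at a 5s Info interval is exactly the open overhead question; refreshing the filter less often than the rest of the Info response would be one way to reduce it.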

yeya24 on 8 Sep 2020

> Do we have any plans/design docs about this feature?

I don't think we have. A design doc would be a great place to start when someone decides to pick up this issue. We can elaborate more on the challenges you have mentioned.

There's also a bit of experimentation needed to pick the right approach/data structure for the implementation.
I have recently come across some nice articles and projects worth looking at before starting (I haven't dug deep, though):
On Bloom Filter efficiency: https://gopiandcode.uk/logs/log-bloomfilters-debunked.html
On Quotient filters: https://github.com/facebookincubator/go-qfext

kakkoyun on 8 Sep 2020

👍1

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] on 7 Nov 2020

I think we have yet to measure how often we're unnecessarily hitting stores to validate further work.

brancz on 8 Nov 2020
