Thanos currently performs a "possibly in set" check to exclude certain StoreAPIs from calls. The check only excludes a StoreAPI if all of its sets of external labels indicate that the time series selected by a query can never include data from that StoreAPI. While the check never wrongly excludes a store, the potential for empty responses is still large, because it only helps when the query actually selects on the labels that a StoreAPI happens to expose.
To expand these "possibly in set" checks, I propose that StoreAPIs additionally expose one or more probabilistic data structures via the Info API, next to the sets of external labels. Bloom filters or Cuckoo filters could be good candidates. This would let us minimize unnecessary calls to StoreAPIs in a space-efficient way.
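To make the limitation concrete, here is a rough sketch of the kind of external-label check described above. It is a simplification under stated assumptions, not the actual Thanos implementation: only equality matchers are modelled, and `EqualMatcher`, `setMayMatch` and `storeMayMatch` are hypothetical names.

```go
package main

import "fmt"

// EqualMatcher is a hypothetical, simplified matcher: label Name must equal Value.
type EqualMatcher struct{ Name, Value string }

// setMayMatch returns false only when a matcher names a label that is
// present in the external label set with a different value.
func setMayMatch(ext map[string]string, ms []EqualMatcher) bool {
	for _, m := range ms {
		if v, ok := ext[m.Name]; ok && v != m.Value {
			return false
		}
	}
	return true
}

// storeMayMatch excludes a store only if every advertised label set rules it out.
func storeMayMatch(labelSets []map[string]string, ms []EqualMatcher) bool {
	if len(labelSets) == 0 {
		return true // nothing advertised, so the store cannot be excluded
	}
	for _, ext := range labelSets {
		if setMayMatch(ext, ms) {
			return true
		}
	}
	return false
}

func main() {
	sets := []map[string]string{{"cluster": "eu"}}
	fmt.Println(storeMayMatch(sets, []EqualMatcher{{"cluster", "us"}}))  // false
	fmt.Println(storeMayMatch(sets, []EqualMatcher{{"__name__", "up"}})) // true
}
```

The second call shows the gap: a query that does not select on an external label is sent to every store, whether or not the store holds matching series.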
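For readers less familiar with the data structure, a minimal Bloom filter sketch in Go follows. Everything in it is an assumption for illustration: the type and method names are hypothetical, the sizes are arbitrary, and which keys a store would insert (metric names, label pairs, ...) is left open by this proposal.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a fixed-size Bloom filter using double hashing.
// Hypothetical type, not part of the Thanos code base.
type BloomFilter struct {
	bits []uint64 // bit array packed into 64-bit words
	m    uint64   // number of bits
	k    uint64   // number of probes per element
}

func NewBloomFilter(m, k uint64) *BloomFilter {
	return &BloomFilter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes from one 64-bit FNV-1a sum.
func hashes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	return sum & 0xffffffff, sum>>32 | 1 // second hash forced odd
}

// Add sets the k probe bits for s.
func (f *BloomFilter) Add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		f.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MayContain returns false only if s was definitely never added.
func (f *BloomFilter) MayContain(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		pos := (h1 + i*h2) % f.m
		if f.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloomFilter(1024, 4)
	f.Add("up")
	f.Add("http_requests_total")
	fmt.Println(f.MayContain("up"))         // true: no false negatives
	fmt.Println(f.MayContain("node_load1")) // usually false; true would be a false positive
}
```

The property that matters here is the same one the external-label check has: a negative answer is always correct, so a store is never skipped by mistake, while a false positive only costs one unnecessary request.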
This would not only improve performance, but also reliability, as any additional and unnecessary network request has the potential to cause what seems like a partial response even when it's not.
To verify that this would be useful, I propose first instrumenting the querier with metrics to confirm that these empty StoreAPI series requests are actually happening and therefore impacting performance. Otherwise this optimization is likely not worth it.
@bwplotka @domgreen @GiedriusS @metalmatze @squat @kakkoyun
Bloom filters would be a great tool for this. +1 on investigating the false positive rate for querier api requests.
In theory it sounds good, but I wonder about the performance implications of keeping that bloom map up to date, especially in the case of Sidecar/Prometheus. This probably makes the most sense in environments where there are a bunch of disparate Thanos Query instances and there is one "on top" which is being queried. Either way, it would be nice to have such a metric so that we could rationalize this idea better.
In my eyes, only a small and very frequently used portion of labels would be maintained by such a filter, so it should be fine.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I think we have yet to instrument the proxy store type (used with the query component) to validate whether and how much this is happening. I do suspect that it will vary widely across environments.
With #2030 merged we should be able to at least identify the potential for this.
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
Maybe it's still valid.
The metrics in our prod environment seem to show that we do _a lot_ of requests that return empty responses.
Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks!
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
As mentioned in the previous comment, there is definitely room for improvement still.
Still valid.
Do we have any plans/design docs about this feature? I have some time recently so maybe I can do some experiments first.
One simple idea is to fetch all metric names (`/api/v1/label/__name__/values`) for each store in the `Info()` StoreAPI call and keep them in a Bloom filter in `StoreRef`. Keeping only metric names does not help for queries that do not specify `__name__`, but it covers most cases and I think it is a good start.
The problem is that the Info API is called every 5s for each store, so I am not sure about the performance overhead (it might be large, because we need to query all metric names in the store's time range).
For updating/deleting elements in the Bloom filter, how do we make each update efficient?
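A rough sketch of how the querier side of the metric-name idea could look. This is only an illustration under assumptions: `Matcher`, `NameFilter`, `exactSet` and `shouldQuery` are hypothetical names, only equality matchers are modelled, and an exact set stands in for the Bloom filter to keep the example short.

```go
package main

import "fmt"

// Matcher is a simplified, hypothetical label matcher.
// Equal is true for `name="value"`; anything else (regex, negation) is false.
type Matcher struct {
	Name, Value string
	Equal       bool
}

// NameFilter is whatever membership structure a store would expose.
type NameFilter interface {
	MayContain(name string) bool
}

// exactSet is an exact stand-in for a Bloom filter (no false positives).
type exactSet map[string]struct{}

func (s exactSet) MayContain(name string) bool {
	_, ok := s[name]
	return ok
}

// shouldQuery reports whether the store may hold series for the matchers.
func shouldQuery(f NameFilter, ms []Matcher) bool {
	for _, m := range ms {
		if m.Name == "__name__" && m.Equal {
			return f.MayContain(m.Value)
		}
	}
	// No equality matcher on the metric name: the filter cannot help.
	return true
}

func main() {
	store := exactSet{"up": {}, "http_requests_total": {}}
	fmt.Println(shouldQuery(store, []Matcher{{Name: "__name__", Value: "up", Equal: true}}))         // true
	fmt.Println(shouldQuery(store, []Matcher{{Name: "__name__", Value: "node_load1", Equal: true}})) // false
	fmt.Println(shouldQuery(store, []Matcher{{Name: "job", Value: "api", Equal: true}}))             // true
}
```

Queries without an equality matcher on `__name__` fall through to the existing behaviour, which matches the limitation noted above.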
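One option that might be worth evaluating for the delete question is a counting Bloom filter, which replaces each bit with a small counter so elements can be removed again. The sketch below is only an assumption about what that could look like (hypothetical names, arbitrary sizes); it trades memory for the ability to delete, and periodically rebuilding a plain filter may turn out to be simpler.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountingBloom is a hypothetical counting Bloom filter with 8-bit counters.
type CountingBloom struct {
	counts []uint8
	k      uint64
}

func NewCountingBloom(m, k uint64) *CountingBloom {
	return &CountingBloom{counts: make([]uint8, m), k: k}
}

// positions returns the k counter indexes for s using double hashing.
func (c *CountingBloom) positions(s string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32|1
	m := uint64(len(c.counts))
	pos := make([]uint64, c.k)
	for i := range pos {
		pos[i] = (h1 + uint64(i)*h2) % m
	}
	return pos
}

// Add increments the counters for s, saturating at 255.
func (c *CountingBloom) Add(s string) {
	for _, p := range c.positions(s) {
		if c.counts[p] < 255 {
			c.counts[p]++
		}
	}
}

// Remove decrements the counters for s. It must only be called for
// elements that were added; saturated counters are left untouched.
func (c *CountingBloom) Remove(s string) {
	for _, p := range c.positions(s) {
		if c.counts[p] > 0 && c.counts[p] < 255 {
			c.counts[p]--
		}
	}
}

// MayContain returns false only if s is definitely not present.
func (c *CountingBloom) MayContain(s string) bool {
	for _, p := range c.positions(s) {
		if c.counts[p] == 0 {
			return false
		}
	}
	return true
}

func main() {
	c := NewCountingBloom(256, 3)
	c.Add("up")
	fmt.Println(c.MayContain("up")) // true
	c.Remove("up")
	fmt.Println(c.MayContain("up")) // false
}
```

Each add or remove touches only `k` counters, so the per-update cost is constant; the open question from above (how to learn which metric names appeared or disappeared in a store) is not addressed by this.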
> Do we have any plans/design docs about this feature?

I don't think we have. It would be a great place to start when one decides to pick up this issue. We can elaborate more on the challenges you have mentioned.
There's also a bit of experimentation needed to pick the correct approach/data structure for the implementation.
I have come across some nice articles and projects recently that are worth a look before starting. (I haven't dug deep, though.)
On Bloom Filter efficiency: https://gopiandcode.uk/logs/log-bloomfilters-debunked.html
On Quotient filters: https://github.com/facebookincubator/go-qfext
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks!
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
I think we have yet to measure how often we're unnecessarily hitting stores, to validate further work.