Thanos: [feature request] store: distributed queries against objstore backends

Created on 28 Mar 2019 · 11 comments · Source: thanos-io/thanos

IIUC, today's store gateway is a dumb gateway that can only be scaled vertically. HA configurations provide redundancy, but every replica runs every query, with no performance or scalability benefit. I'm not proposing any particular implementation here, but I wanted to start the conversation, as distributing queries across all available gateways seems desirable.

Thoughts?

Labels: store · hard · feature request / improvement · help wanted · stale

Most helpful comment

Can you tell us more about why? I agree, but for certain operations it is totally fine, right? Like a simple sum -> it's as easy as summing everything in map-reduce fashion.

Say a time series has samples interleaved across two leaves; you can't safely compute a rate() on either one.

All 11 comments

/cc @SuperQ @bwplotka

Well, if you run the store with more replicas, you should load-balance requests across them. This way only one replica runs each query. The other issue is the in-memory cache, which is per instance and thus can be duplicated, IMHO.

How come you can't scale horizontally? Just put a load balancer in front of Thanos Store and you are good to go.

The problem with "just put a LB in front of it" is cache coherency. While we don't need strict cache coherence, not having it means every node ends up maintaining an identical cache.

Without some kind of caching pool, like a memcached or Redis cluster, all nodes have to maintain the same cache data, since the LB will round-robin the same queries across all the nodes.

But that's only one aspect of this. By doing distributed evaluation, each shard can process a subset of the data, allowing for faster processing of large query blocks.

One option would be to take range queries and distribute the time range among the shards in a consistent way. Say each shard is consistently hashed to a specific day of data.
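The day-to-shard idea above can be sketched in a few lines. This is a toy illustration, not Thanos code; `shard_for_day` is a hypothetical helper, and real deployments would likely want a proper hash ring rather than plain modulo (which remaps most keys whenever the shard count changes):

```python
import hashlib

def shard_for_day(day: str, num_shards: int) -> int:
    """Deterministically map a UTC day (e.g. '2019-03-28') to a shard.

    Because the mapping is a pure function of the day string, every
    querier computes the same assignment, so queries for a given day
    always land on (and warm the cache of) the same shard.
    """
    h = hashlib.sha256(day.encode()).digest()
    return int.from_bytes(h[:8], "big") % num_shards
```

A range query spanning several days would then be split per day and fanned out to the owning shards, with each shard's cache only ever holding its own days.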

I agree, and we are working towards it on https://github.com/improbable-eng/thanos/pull/957

But the issue states it "is a dumb gateway that can only be scaled vertically", which is not true.

While we don't need strict cache coherence, not having it causes all nodes to have to have identical caches.

IMO that's okay for now.

I apologize for my ignorance and any misinformation here... I was paraphrasing an off-the-cuff conversation from the slack channel.

As @povilasv said, at the moment (without gossip), there is a way to put the Store GW behind an LB of some kind (e.g. a Kubernetes Service) and round-robin requests. It's even recommended for HA purposes.

@SuperQ is SUPER right about the cache coherence issue. But I am confused about what exactly this issue is asking for; can we clarify?

  1. Do you want to share cached resources on some backend (e.g. memcached) across stores?
  2. Do we want to start work on distributed querying? IMO it's well worth trying, but really, really difficult. Worth syncing with @bbrazil AND designing it properly, with benchmarks etc. It requires good knowledge of PromQL. To me, distributed querying means spreading PromQL evaluation to leaf nodes. Do you mean that?

BTW, I created an issue for a Querier cache, which should improve things a lot and is more important and easier than this feature proposal, I think (I might be wrong - let me know!): https://github.com/improbable-eng/thanos/issues/1006

It requires good knowledge of PromQL. To me, distributed querying means spreading PromQL evaluation to leaf nodes.

In general this isn't possible. However, if you know that certain data (i.e. one time series for one time period) is only on one leaf node, then you can evaluate functions on range vectors there, as well as some aggregations.

In general this isn't possible

Can you tell us more about why? I agree, but for certain operations it is totally fine, right? Like a simple sum -> it's as easy as summing everything in map-reduce fashion.

I think overall this is extremely complex and we need to be sure about the benefits... but doable, at least for some queries?

And yes, if we add this prior knowledge, it's even better.
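The "simple sum in map-reduce fashion" claim can be made concrete with a toy sketch (not Thanos code; the sample values are made up). `sum` distributes safely because it is associative: each leaf returns a partial sum and the root reduces them.

```python
# Sample values for one query, split across two hypothetical store shards.
leaf_a = [10.0, 5.0, 7.0]   # samples held by shard A
leaf_b = [2.0, 3.0]         # samples held by shard B

map_reduce  = sum(leaf_a) + sum(leaf_b)  # per-leaf "map", root "reduce"
single_node = sum(leaf_a + leaf_b)       # what one store would compute
assert map_reduce == single_node == 27.0
```

Note the caveat: this only works for mergeable aggregations. `avg` needs each leaf to ship a (sum, count) pair rather than a per-leaf average, and quantiles are not mergeable from partial results at all.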

Can you tell us more about why? I agree, but for certain operations it is totally fine, right? Like a simple sum -> it's as easy as summing everything in map-reduce fashion.

Say a time series has samples interleaved across two leaves; you can't safely compute a rate() on either one.
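A toy Python sketch of why interleaving breaks rate()-style functions. The `increase` helper below is hypothetical and much simpler than Prometheus' real `rate()` (no extrapolation), but it keeps the key behavior: counter resets are detected by comparing *adjacent* samples, so a leaf that only holds every other sample can miss a reset entirely.

```python
def increase(samples):
    """Total increase of a counter over its samples, compensating for
    resets (simplified: a drop in value means the counter restarted)."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

# Counter samples as (timestamp, value); a reset happens between t=15 and t=30.
full   = [(0, 0.0), (15, 10.0), (30, 2.0), (45, 12.0)]
leaf_a = full[0::2]  # every other sample landed on leaf A
leaf_b = full[1::2]  # the rest landed on leaf B

assert increase(full) == 22.0                      # +10, reset, then +2 and +10
assert increase(leaf_a) + increase(leaf_b) == 4.0  # the reset is invisible per leaf
```

No combination of the per-leaf answers recovers the correct 22.0, which is why distributed evaluation is only safe when a whole series-and-time-range lives on a single leaf.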

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
