Prometheus: Remote storage

Created on 4 Jan 2013 · 170 comments · Source: prometheus/prometheus

Prometheus needs to be able to interface with a remote and scalable data store for long-term storage/retrieval.

kind/enhancement

Most helpful comment

@sacreman Thanks, great comparison. Glad that Prometheus won the query language category as the only one with 5/5 points :) I do prefer PromQL over SQL dialects for the kind of time-series computations that are common in Prometheus...

As for storage: a bunch of us and people from the companies building the first remote storages put our heads together today (on a Saturday - just for you :)) to finally unblock this whole remote storage discussion. We agreed on how we want to build a generic interface for writing and reading back data, so that anyone can build their own adapter on top of any long-term storage system they want (since we'll never be able to support just one directly and have that be good enough for everyone).

It'll go roughly like this: we'll send raw samples via protobuf+gRPC on write. For reads, we'll send a time range, a set of label selectors, and an opaque data field (useful for things like hinting at desired downsampling), and expect back a list of time series (label sets + samples) for those selectors and time ranges. Exact implementations of this are going to follow soon... expect the generic write side first, with the read side to follow.
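
To make that concrete, here is a minimal Go sketch of what such a generic adapter interface could look like. The type and method names are hypothetical and only illustrate the write and read paths described above; the actual interface will presumably be defined as protobuf messages over gRPC rather than Go types.

    package remote

    // Sample is one timestamped value of a time series
    // (millisecond timestamps, float64 values).
    type Sample struct {
        TimestampMs int64
        Value       float64
    }

    // Series is a label set plus the samples belonging to it.
    type Series struct {
        Labels  map[string]string
        Samples []Sample
    }

    // Selector matches series by label name and value; a real version would
    // also cover negation and regular-expression matchers.
    type Selector struct {
        Name  string
        Value string
    }

    // Adapter is what a long-term storage bridge could implement.
    type Adapter interface {
        // Write receives raw samples as Prometheus ingests them.
        Write(series []Series) error

        // Read returns all series matching the selectors within the given time
        // range. The opaque hint can carry backend-specific options, e.g. a
        // desired downsampling resolution.
        Read(startMs, endMs int64, selectors []Selector, hint []byte) ([]Series, error)
    }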

All 170 comments

Is there anyone planning to work on this? Is the work done in the opentsdb-integration branch still valid or has the rest of the code-base moved past that?

The opentsdb-integration branch is indeed completely outdated (still using the old storage backend etc.). Personally, I'm a great fan of the OpenTSDB integration, but where I work, there is not an urgent enough requirement to justify a high priority from my side...

To be clear, the outdated "opentsdb-integration" was only for the
proof-of-concept _read-back_ support (querying OpenTSDB through Prometheus).

_Writing_ into OpenTSDB should be experimentally supported in master, but
the last time we tried it was a year ago on a single-node OpenTSDB.

You initially asked on https://github.com/prometheus/prometheus/issues/10:

"I added the storage.remote.url command line flag, but as far as I can tell
Prometheus doesn't attempt to store any metrics there."

A couple of questions:

  • did you enable the OpenTSDB option "tsd.core.auto_create_metrics"?
    Otherwise OpenTSDB won't auto-create metrics for you, as the option is
    false by default. See
    http://opentsdb.net/docs/build/html/user_guide/configuration.html
  • if you run Prometheus with -logtostderr, do you see any relevant log
    output? If there is an error sending samples to TSDB, it should be logged
    (glog.Warningf("error sending %d samples to TSDB: %s", len(s), err))
  • Prometheus also exports metrics itself about sending to OpenTSDB. On
    /metrics of your Prometheus server, you should find the counter metrics
    "prometheus_remote_storage_sent_errors_total" and
    "prometheus_remote_storage_sent_samples_total". What do these say? (A
    quick way to check them is sketched right after this list.)

Cheers,
Julius

I cannot +1 this enough

Is InfluxDB on the cards in any way? :)

Radio Yerevan: "In principle yes." (Please forgive that Eastern European digression... ;)

:D That was slightly before my time ;)

See also: https://twitter.com/juliusvolz/status/569509228462931968

We're just waiting for InfluxDB 0.9.0, which has a new data model which
should be more compatible with Prometheus's.

We're just waiting for InfluxDB 0.9.0, which has a new data model which
should be more compatible with Prometheus's.

Can I say awesome more than once? Awesome!

Unfortunately, @juliusv ran some tests with 0.9 and InfluxDB consumed 14x more storage than Prometheus.

Previously the overhead was 11x, but Prometheus has reduced its storage size significantly since then - so in reality InfluxDB has apparently improved in that regard.
Nonetheless, InfluxDB has not turned out to be the eventual answer for long-term storage yet.

At least experimental write support is in master as of today, so anybody can play with InfluxDB receiving Prometheus metrics. Quite possibly somebody will find the reason for the blow-up in storage space, and everything will be unicorns and rainbows in the end...

@beorn7 that's great. TBH I'm not concerned about disk space, it's the cheapest resource on the cloud after all. Besides, I'm expecting to hold data with a very small TTL, i.e. a few weeks.

@pires In that case, why not just run two identically configured Prometheis with a reasonably large disk?
A few weeks or months is usually fine as retention time for Prometheus. (Default is 15d for a reason... :) The only problem is that if your disk breaks, your data is gone, but for that, you have the other server.

@pires do you have a particular reason to hold the data in another database for that time? "A few weeks" does not seem to require a long-term storage solution. Prometheus's default retention time is 15 days - increasing that to 30 or even 60 days should not be a problem.

@beorn7 @fabxc I am currently using a proprietary & very specific solution that writes monitoring metrics into InfluxDB. This can eventually be replaced with Prometheus.

The thing is, I have some tailored apps that read metrics from InfluxDB in order to reactively scale up/down; those would need to be rewritten to read from Prometheus instead. Also, I use continuous queries. Does Prometheus offer such a feature?

http://prometheus.io/docs/querying/rules/#recording-rules are the equivalent to InfluxDB's continuous queries.

+1

:+1:

How does remote storage as currently implemented interact with PromDash or grafana?

I have a use case where I want to run Prometheus in a 'heroku-like' environment, where the instances could conceivably go away at any time.

Then I would configure a remote, traditional influxdb cluster to store data in.

Could this configuration function normally?

This depends on your definition of "normally", but mostly, no.

Remote storage as it is is write-only; from Prometheus you would only get what it has locally.

To get at older data, you need to query OpenTSDB or InfluxDB directly, using their own interfaces and query languages. With PromDash you're out of luck in that regard; AFAIK Grafana knows all of them.

You could build your dashboards fully based on querying them and leave Prometheus to be a collection and rule evaluation engine, but you would miss out on its query language for ad hoc drilldowns over extended time spans.

Also note that both InfluxDB and OpenTSDB support are somewhat experimental, under-exercised on our side, and in flux.

We're kicking around the idea of a flat file exporter, so we can start storing long-term data and then, once the bulk import issue (https://github.com/prometheus/prometheus/issues/535) is done, use that. Would you guys be open to a PR around this?

For #535 take a look at my way outdated branch import-api, where I once added an import API as a proof-of-concept: https://github.com/prometheus/prometheus/commits/import-api. It's from March, so it doesn't apply to master anymore, but it just shows that in principle adding such an API using the existing transfer formats would be trivial. We just need to agree that we want this (it's a contentious issue, /cc @brian-brazil) and whether it should use the same sample transfer format as we use for scraping. The issue with this transfer format is that it's optimized for the many-series-one-sample (scrape) case, while with batch imports you often care more about importing all samples of a series at once, without having to repeat the metric name and labels for each sample (massive overhead). But maybe we don't care about efficiency in the (rare?) bulk import case, so the existing format could be fine.
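
To illustrate the format concern in the last paragraph, here is a hypothetical sketch (not an actual Prometheus transfer format) of the two shapes: the scrape-style shape repeats the full label set for every sample, while a series-oriented batch shape states the labels once and then carries many samples.

    package importformat

    // ScrapeSample is the many-series-one-sample shape that suits scraping:
    // every sample carries its full label set, which is wasteful in bulk.
    type ScrapeSample struct {
        Labels      map[string]string // repeated for every single sample
        TimestampMs int64
        Value       float64
    }

    // SamplePair is a single timestamped value.
    type SamplePair struct {
        TimestampMs int64
        Value       float64
    }

    // SeriesBatch is the one-series-many-samples shape that suits bulk import:
    // the metric name and labels appear once, followed by all the samples.
    type SeriesBatch struct {
        Labels  map[string]string // stated once per series
        Samples []SamplePair
    }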

For the remote storage part, there was this discussion
https://groups.google.com/forum/#!searchin/prometheus-developers/json/prometheus-developers/QsjXwQDLHxI/Cw0YWmevAgAJ about decoupling the remote storage in some generic way, but some details haven't been resolved yet. The basic idea was that Prometheus could send all samples in some well-defined format (JSON, protobuf, or whatever) to a user-specified endpoint which could then do anything it wants with it (write it to a file, send it to another system, etc.).

So it might be ok to add a flat file exporter as a remote storage backend directly to Prometheus, or resolve that discussion above and use said well-defined transfer format and an external daemon.

I think for flat file we'd be talking the external daemon, as it's not something we can ever read back from.
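
As a minimal sketch of that external-daemon idea, assuming Prometheus would simply POST its samples in some well-defined encoding to a user-specified endpoint (the path, port, and file name below are made up), a flat file receiver could look roughly like this:

    package main

    import (
        "io"
        "log"
        "net/http"
        "os"
        "sync"
    )

    func main() {
        f, err := os.OpenFile("samples.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        var mu sync.Mutex
        http.HandleFunc("/receive", func(w http.ResponseWriter, r *http.Request) {
            // Append the raw request body (whatever sample encoding Prometheus
            // ends up sending) to the flat file; a real daemon would decode,
            // validate, and frame the records instead of copying them verbatim.
            mu.Lock()
            defer mu.Unlock()
            if _, err := io.Copy(f, r.Body); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusNoContent)
        })
        log.Fatal(http.ListenAndServe(":9201", nil))
    }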

So the more I think about it, it would be nice to have this /import-api (raw data) API, so we can have backup nodes mirroring the data from the primary Prometheus. Would there be appetite for a PR for this and a corresponding piece inside of Prometheus to import the data, so you can essentially have read slaves?

For that use case we generally recommend running multiple identical Prometheus servers. Remote storage is about long term data, not redundancy or scaling.

I think running multiple scrapers is not a good solution because the data won't match, and there is no way to backfill data. So we have an issue where I need to spin up some redundant nodes and now they are missing a month of data. If you had an API to import raw data you could at least catch them up. The same interface could also be used for backups.

So we have an issue where I need to spin up some redundant nodes and now they are missing a month of data. If you had an API to import raw data you could at least catch them up. The same interface could also be used for backups.

This is the use case for remote storage: you pull the older data from remote storage rather than depending on Prometheus being stateful. Similarly, in such a setup there's no need for backups, as Prometheus doesn't have any notable state.

Remote storage is still not that useful because there is no way to query it. It seems like you could make a pretty quick long-term storage solution with the existing primitives if you allowed nodes to do an initial backfill.

The plan is to have a way to query it. The current primitives do not allow for good long term storage, as you're going to be limited by the amount of SSD you can put on a single node. Something more clustery is required.

The existing hash-modding scheme already allows you to expand past a single node. If you had a way to spin up a new cluster when you want to resize and to import the old data, you could have a poor man's approach to scaling. In the future you could add some more intelligent resharding techniques. I don't think any of the external storage options are even good enough right now to be legitimate solutions.

Hashing is for scaling ingestion and processing, not storage, and comes with significant complexity overhead. It should be avoided unless you've no other choice due to this. As you note you have a discontinuity every time you upshard, and generally you want to keep storage and compute disaggregated for better scaling and efficiency.

We don't want to end up implementing a clustered storage system within Prometheus, as that's adding a lot of complexity and potential failure modes to a critical monitoring system. We'd much prefer to use something external and let them solve those hard problems, even though none of the current options is looking too great. If that doesn't work out we can consider writing our own as a separate component, but hopefully it doesn't come to that.

I appreciate that you'd like long-term storage. I ask you to wait until we can support it properly, rather than depend on users hacking something together in a way that's against the overall architecture of the system and that ends up being an operational and maintenance burden in the future.

It's OK to have grand visions, but it doesn't mean we can't do anything in the short term. Graphite uses sharding for storage and query performance. It can even shard incoming queries, and it's not a particularly sophisticated system.

While I agree that long-term and robust storage should be solved properly, I see some benefits of having a batch import endpoint independent of that. It's at least useful for things such as testing (when you want to quickly import a bunch of data into Prometheus to play with it or do benchmarks) or backfilling data in certain situations (like importing batches of metrics generated from delayed Hadoop event log processing that you want to correlate with other metrics).

The downside would of course be that it could attract lots of users to do the wrong thing (e.g. pushing data when they should really pull), and that any feature in general makes a product worse for all users who don't use it (more perceived product complexity, etc.).

You can make the batch import a pull process: the new secondary (or slave) Prometheus can pull the existing data from another Prometheus via a /raw endpoint. I'm not sure it would be that confusing to new users, as there are other features like federation which do somewhat similar tasks but are probably unused by 95% of the users.

backfilling data in certain situations (like importing batches of metrics generated from delayed Hadoop event log processing that you want to correlate with other metrics).

That's frontfilling, which is a different use case and where the push vs. pull issue lies. Backfilling is what we're talking about here, as we're inserting data before existing data, where the questions are more around expectations of durability of Prometheus storage. Backfilling is less of an issue conceptually, as it's mainly an operational question.

If we were to implement backfilling I think something pushish would make more sense as it's an administrative action against a particular server rather than a more generic "expose data to be used somehow". You'd likely also want a reasonable amount of control around how quickly it's done etc. so as not to interfere with ongoing monitoring.

Is it really that much conceptually different from your /federate endpoint, which allows downstream systems to scrape it? I'm just thinking we could expose the entire time series with some paging.

Yes, /federate is only about getting in new data to provide high-level aggregations and has no bearing on storage semantics or expectations.

What you're talking about is adding data back in time, which is not supported by the storage engine (and not something that should even be considered outside of this exact use case). This changes the default stance that if a Prometheus loses its data, you just bring up a fresh new one and move on.

Just thinking there: the other place we'll need backfill is when someone wants to make an expression take effect back in time. So independent of storage-related discussions, we're likely to add backfill at some point.

@brian-brazil are you thinking about something like downsampling, or aggregations backwards in time?

Aggregations/rules back in time. Downsampling is an explicit non-goal; we believe that belongs in long-term storage.

In https://github.com/prometheus/prometheus/issues/10#issuecomment-90591577 @fabxc stated:

ran some tests with 0.9 and InfluxDB consumed 14x more storage than Prometheus.

I just wanted to chime in that the latest InfluxDB 0.10 GA seems to have improved their storage engine aggressively. Their blog post states:

Support for Hundreds of Thousands of Writes Per Second and 98% Better Compression

Could be worth revisiting.

Does Blueflood have a compatible data model?

With InfluxDB off the table, it seems to be the most promising open source TSDB, backed by Cassandra.

Rackspace have a reasonably good reputation of keeping things open.

http://blueflood.io/

Blueflood seems to have millisecond timestamps; it's unclear if it supports float64 as a data type (I'm guessing it does), but it has no notion of labels. It seems to have the right idea architecturally, but doesn't quite fit.

Newts is another Cassandra backed option.

The data model is described at https://github.com/OpenNMS/newts/wiki/DataModel, supports labels.

I see a bit of contention about how good Cassandra is for time series, but a few of the TSDBs are building on it (KairosDB, Heroic and Newts).

Newts doesn't really have a notion of labels, and I don't think its Cassandra schema will scale well for the amount of data we're dealing with.

Yeah, most of the external options are not very good or require more management than the investment warrants. So currently at DigitalOcean we have some rather large sharded setups with Prometheus. We are investigating the possibility of not needing long-term storage, by having the Prometheus instances allow the data to be backed up to other nodes, and maybe by having a way to reshard data. I haven't heard anyone talking yet about just extending the existing capabilities instead of pushing it to yet another database, which will have a different query language than Prometheus.

Your last part is more or less what it will converge to eventually – at least in my head.
With their custom query languages and models existing solutions come with their own overhead and limitations. For a consistent read/write path it doesn't really make sense to enforce a mapping to work around those.

They are mostly based on Cassandra or HBase anyway and that for good reasons. We have to find a good indexing and chunk storage model that's applicable to similar storage backends, which then might even be choosable.

It's easy to talk about all that – it's not worth much without an implementation, which will take some time of course :)

I see significant challenges with that approach, as it's ultimately making Prometheus into a full-on distributed storage system. I think we need to keep long-term storage decoupled from Prometheus itself, so as not to threaten its primary goal of being a critical monitoring system. We wouldn't want a deadlock or code bug in such a complex system taking out monitoring; it's much easier to get deadlines on a few RPC calls right.

I haven't heard anyone talking yet about just extending the existing capabilities instead of pushing it to yet another database, which will have a different query language than Prometheus.

The plan is that however we resolve this, that you'll be able to seamlessly query the old data via Prometheus. If we just wanted to pump the data to another system with no reading back it'd make things far easier - you can already do that if you want.

Yes, it makes it a full-on distributed storage system. And it shouldn't be part of the main server, of course.
It would be its own thing, but directly catering to our data and querying model.

I know it has challenging implications. But the ones for waiting for a TSDB that fits our model without limitations are worse. The existing ones seem to be unsuitable. And I'm not aware of anyone working on something that will be.

As long as there's decoupling via a generic-ish RPC system I'm okay with that.

The existing and stable ones seem to be unsuitable.

Some are close.

There's actually more problems with OpenTSDB than the 8 tags, there's also limits on the number of values a label can have so we can't even really use it as storage - though we've several users planning on putting their data there.

And I'm not aware of anyone working on something that will be.

:smile:

The question it turns out isn't so much how do you solve it, but more what your budget is.

There's actually more problems with OpenTSDB than the 8 tags

It's not an issue anymore.

there's also limits on the number of values a label can have so we can't even really use it as storage - though we've several users planning on putting their data there.

Limits are pretty big (16M) and you can make them even bigger.

I am an OpenTSDB committer, I would be happy to coordinate the roadmap to meet your needs if possible.

This would be very beneficial to me. I believe the authentication and
startup plugins will be useful for this. You could have a TSD startup and
get its configuration from Prometheus. The startup plugins are designed to
allow tight coupling between systems.

Another way to go about this would potentially be to use the Realtime
Publishing plugin in OpenTSDB to accept data into OpenTSDB and publish
additionally to Prometheus. There are benefits and drawbacks to that I am
sure.

-Jonathan

Additional items I should mention: I have sustained 10 million writes per second to OpenTSDB, and repeated that on another setup at a different company. HBase scales really well for this.

The 2.3.0 branch, which should have an RC1 any day now, has expression support - things like Sum(), Timeshift(), etc. These should make writing your query support easier.
your query support easier.

There is a new query engine, Splicer, written by Turn, which provides significant improvements in query time. It works by breaking up the incoming queries into slices and querying the TSD that is local to the RegionServer. It will also cache the results in 1 hour blocks using Redis. We use it in conjunction with multiple TSD instances per RegionServer running in Docker containers. This allows us to run queries in parallel blocks.

These new features and my experience scaling OpenTSDB should help make this an ideal long-term storage solution, in my opinion.

It's not an issue anymore.

That's good to know, how many tags does it support now?

Limits are pretty big (16M) and you can make them even bigger.

I can imagine users hitting that, mostly by accident.

You could have a TSD startup and get its configuration from Prometheus.

That doesn't make sense to me. I'd expect Prometheus to be configured to send information to a given OpenTSDB endpoint, and that would be the entire configuration required on the write side.

Realtime Publishing plugin in OpenTSDB to accept data into OpenTSDB and publish
additionally to Prometheus.

That's not the Prometheus architecture, Prometheus would be gathering data and also sending it on to OpenTSDB.

If we wanted to pull data in the other direction we'd write an OpenTSDB exporter, similar to how the InfluxDB exporter works.

Additional items I should mention: I have sustained 10 million writes per second to OpenTSDB, and repeated that on another setup at a different company. HBase scales really well for this.

Can you give an idea of the hardware involved in that?

These should make writing your query support easier.

The minimal support we need is the ability to specify a vector selector like {__name__="up",job="myjob",somelabel!="foo",otherlabel=~"a|b"} and get back all the data for all matching time series for a given time period efficiently. Queries may not include a name (though usually should), and it's not out of the question for a single name to have millions of time series across all time; tens of thousands would not be unusual.
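
For illustration, here is a rough Go sketch of those matcher semantics (illustrative types, not Prometheus's actual implementation): each selector is a set of label matchers, and a series matches if every matcher accepts its label set.

    package matcher

    import "regexp"

    type MatchType int

    const (
        Equal MatchType = iota // label="value"
        NotEqual               // label!="value"
        RegexMatch             // label=~"regex"
    )

    type Matcher struct {
        Type  MatchType
        Name  string
        Value string
        re    *regexp.Regexp
    }

    // NewMatcher compiles the regular expression up front for regex matchers,
    // anchoring it so the whole label value must match.
    func NewMatcher(t MatchType, name, value string) (*Matcher, error) {
        m := &Matcher{Type: t, Name: name, Value: value}
        if t == RegexMatch {
            re, err := regexp.Compile("^(?:" + value + ")$")
            if err != nil {
                return nil, err
            }
            m.re = re
        }
        return m, nil
    }

    // Matches reports whether a series' label set satisfies this matcher.
    func (m *Matcher) Matches(labels map[string]string) bool {
        v := labels[m.Name]
        switch m.Type {
        case Equal:
            return v == m.Value
        case NotEqual:
            return v != m.Value
        default:
            return m.re.MatchString(v)
        }
    }

A series would match a selector if Matches returns true for every one of its matchers.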

I am an OpenTSDB committer, I would be happy to coordinate the roadmap to meet your needs if possible.

It's important for us to store full float64s. Given that ye support 64bit integers, could we just send them as that or is full support an option?

Full utf-8 support in tag values would also be useful, though we've already worked around that.

@bobrik @johann8384 Thanks for that info, that's great to know!

For anything more than small or toy use cases, we'll need to move query computation to the long-term storage (otherwise, data sets that need to be transferred back to Prometheus would become too large). So any existing remote storage would have to implement pretty much all of Prometheus's query language features in a semantically compatible way (even if maybe aggregation is the most important one, that's not necessarily at the leaf node of a query).

So having float64 value support is kind of crucial if you want to achieve the above, but the OpenTSDB docs actually mention that that's on the roadmap, so that's good: http://opentsdb.net/docs/build/html/user_guide/writing.html#floating-point-values

Whether OpenTSDB would ever be able to compatibly execute all of Prometheus's query language features is another question.

For anything more than small or toy use cases, we'll need to move query computation to the long-term storage (otherwise, data sets that need to be transferred back to Prometheus would become too large). So any existing remote storage would have to implement pretty much all of Prometheus's query language features in a semantically compatible way (even if maybe aggregation is the most important one, that's not necessarily at the leaf node of a query).

While that's the ideal case, I don't think that's achievable. I would more think along Brian's lines above: A vector selector gives us all the data for the relevant time interval, and then query evaluation is done on Prometheus's side. Obviously, that limits queries to those that don't require Gigabits of sample data. But that's probably fine. The same caution as usual applies where you create recording rules for expensive queries.

Yeah. I can see us pushing down parts of some queries in the future, but likely only ever to other Prometheus servers in a sharded setup. More than that would be nice, but I don't see it happening in the foreseeable future.

For long-term storage, the amount of data you'd need to pull in within the storage system itself is likely to be enough of a bottleneck to prevent a large query from working, before we get to sending the result back to Prometheus.

Limits are pretty big (16M) and you can make them even bigger.

I can imagine users hitting that, mostly by accident.

I'd be willing to live with that limitation.

For anything more than small or toy use cases, we'll need to move query computation to the long-term storage

Should OpenTSDB turn out to be suitable after all, I doubt that we can get query feature parity. If we do, it will always be a limiting factor when we want to extend PromQL.
With a "generic" read/write path it will be using a bridge anyway. That could be extended to care about distributed evaluation.

Ok, if we're fine not supporting large aggregation use cases (but keep a road open to them in the future), that makes things easier of course. I guess there's an argument to be made for the smaller use cases since usually people don't care about as many dimensions (like instance) for historical data, so you might only be operating on metrics with far fewer series.

(edited the comment above for clarification, for the people only reading emails)

I think backfilling would be more important with long-term storage, to support any needed aggregations - not a primary concern though (and backfilling in long-term storage may obviate the need for it in Prometheus itself).

On Wed, Mar 30, 2016 at 9:39 AM, Brian Brazil [email protected]
wrote:

It's not an issue anymore.

That's good to know, how many tags does it support now?

The tag limitation was previously a hard-coded constant; I usually set this to 16 in my setups. It is now a configuration value rather than a constant in the code.

Limits are pretty big (16M) and you can make them even bigger.

I can imagine users hitting that, mostly by accident.

I use a UID width of 4 instead of 3; I didn't do the math, but this significantly increases this number. Even on the largest setups I have deployed, there have only been 3 or 4 million UID assignments, and this includes metrics, tag keys and tag values combined.

You could have a TSD startup and get its configuration from Prometheus.

That doesn't make sense to me. I'd expect Prometheus to be configured to send information to a given OpenTSDB endpoint, and that would be the entire configuration required on the write side.

Yes, that would be the most common use case, but technically you could have a TSD node "bound" to each Prometheus node, or a set of TSDs per Prometheus node, and allow each Prometheus node to have its own query cluster that way. Perhaps a more common way to say what I was trying to say is that the intended use of the startup plugins is for service discovery. When OpenTSDB starts up it can register itself with Curator (ZooKeeper), Consul, etcd, etc. It could technically also get parameters from the service discovery, like the location of the zkQuorum, what HBase tables to use, or what ports to listen on.

Realtime Publishing plugin in OpenTSDB to accept data into OpenTSDB and
publish
additionally to Prometheus.

That's not the Prometheus architecture, Prometheus would be gathering data
and also sending it on to OpenTSDB.

If we wanted to pull data in the other direction we'd write an OpenTSDB exporter, similar to how the InfluxDB exporter works.

Additional items I should mention: I have sustained 10 million writes per second to OpenTSDB, and repeated that on another setup at a different company. HBase scales really well for this.

Can you give an idea of the hardware involved in that?

I believe the original cluster was 24 Dell R710 machines, maybe 64GB RAM; I don't remember much of the other specs. The cluster at Turn is 36 nodes, 24 cores, 128GB RAM, 8 disks.

These should make writing your query support easier.

The minimal support we need is the ability to specify a vector selector like {__name__="up",job="myjob",somelabel!="foo",otherlabel=~"a|b"} and get back all the data for all matching time series for a given time period efficiently. Queries may not include a name (though usually should), and it's not out of the question for a single name to have millions of time series across all time; tens of thousands would not be unusual.

I am an OpenTSDB committer, I would be happy to coordinate the roadmap to meet your needs if possible.

It's important for us to store full float64s. Given that ye support 64bit
integers, could we just send them as that or is full support an option?

Full utf-8 support in tag values would also be useful, though we've
already worked around that.

As far as I am aware, nothing is off the table, let's work together to
implement that.

I'm certain I need to learn more about the Prometheus query structure, but
OpenTSDB (seems to be) pretty good at aggregation across series within the
same metric name. There are also new aggregators, aggregator filters, and
of course the expression support.

I would recommend that we find a way to support pre-aggregating and
downsampling the data as we store it to OpenTSDB. So for example, we may
provide a list of tags to strip when writing. Another thought is to write
${metric}.1m-avg, ${metric}.5m-avg and automatically select those
extensions when reading large time ranges. This would mimic the way an RRD
storage system might work. So for the recent part of the query we pull full
resolution but as we get farther back, we can pull from the 1m-avg, and
5m-avg series.

Just thoughts of course.

I would recommend that we find a way to support pre-aggregating and
downsampling the data as we store it to OpenTSDB.

That's not something we can do without user input for pretty much every metric, as we don't know which labels are okay to remove. It'd also break user queries as it's no longer the same timeseries.

Yes, pre-aggregation would be tricky.

I'll just chime in that I think a lot of people are excited about Prometheus partially for its simplicity when it comes to deployment. I also think that's why InfluxDB was a good candidate, because it follows in that same Go-single-statically-linked-binary fashion.

I understand InfluxDB has been taken out of the equation for good reasons, but I also believe OpenTSDB is a monster when it comes to deployment; no company under 10 employees wants to run a fully fledged Hadoop with HBase, and I think long-term storage should be a viable alternative also for smaller organizations not running Hadoop+HBase. I hope that the long-term storage solution, whatever it may be, is one that can be easily deployed. Both companies with and without Hadoop will want long-term storage. Those were my two cents...

That is a completely valid, and good observation. HBase can run in stand-alone mode, it uses local files for storage rather than HDFS. It isn't really talked about in the documentation, but I have used it for a few small OpenTSDB deployments where the durability and performance of the cluster were not important. This may or may not be a good option here, for obvious reasons.

Running HBase in standalone mode defeats the point of it being distributed ;) Prometheus already has its own store. Honestly, I'm hoping it doesn't go towards OpenTSDB. We used to run a 50-node cluster and it was a full-time job managing the cluster.

Yeah, OpenTSDB being complex to operate is a common complaint and makes me very wary about it as well. Of course, you'd still have to weigh that against the likelihood of any other viable alternative materializing anytime soon... still also hoping for something more Go-ey and with fewer dependencies, but I'm not seeing it quite yet :)

Is anyone actually working on this at this time?

@johann8384 Nobody is currently working on a completely new distributed storage system, no. But there's some related work and discussion around a generic remote write API (https://github.com/prometheus/prometheus/pull/1487), but nothing concrete about read-back yet.

Just wanted to reiterate the sentiment of others above... even though we're a reasonably sized team, we don't want to operationalise HBase. OpenTSDB isn't even on our short list for this reason.

Disclaimer: We're not technically Prometheus users at this time, but we should be in the next couple of weeks :).

Running HBase in standalone mode defeats the point of it being distributed ;)

@mattkanwisher True.

Prometheus already has its own store.

True. However, AFAIK Prometheus storage doesn't support downsampling, while OpenTSDB does. This could be a reason for wanting to run OpenTSDB (or something else) non-distributed.

Is anyone actually working on this at this time?

Good question. I wouldn't be surprised to hear no one is working on it, because this issue is large and not well defined. As I see it, there are actually multiple sub-issues:

  • Prometheus doesn't support downsampling of data which means data takes up more space than necessary. Workaround: Use larger disks.
  • Prometheus doesn't support replication of data. Losing a master means that you will lose that data. Workaround: Use a distributed file system.

Something missing?

Disclaimer Just like @jkinred, I am not a Prometheus user. However, I keep coming back to it...

The general idea is that it's the remote storage that's distributed and does downsampling. Prometheus local storage is quite efficient, but ultimately you want to keep only a few weeks of data in Prometheus itself for fast/reliable access and depend on remote storage beyond that. Then you don't really care about how much space Prometheus uses (as long as it holds at least a few days, you're good) or if you lose one of a HA pair every now and then.

What about a BigQuery exporter by stream loading? Should be the analogous option to borgmon -> tsdb?

I don't see any way to sanely use BigQuery here due to the columnar data model, unless you're querying the data extremely rarely.

What about https://github.com/hawkular/hawkular-metrics
Hawkular Metrics is the metric data store for the Hawkular project. It can also be used independently.

See also https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md#hawkular

Collecting Metrics from Prometheus Endpoints : http://www.hawkular.org/blog/2016/04/22/collecting-metrics-from-prometheus-endpoints.html
The agent now has the ability to monitor Prometheus endpoints and store their metrics to Hawkular Metrics. This means any component that exports Prometheus metric data via either the Prometheus binary or text formats can have those metric data collected by the agent and pushed up into Hawkular Metrics

Prometheus Metrics Scraper : https://github.com/hawkular/hawkular-agent/tree/master/prometheus-scraper

Hawkular has float64, millisecond timestamps and key/value pair labels. A given metric has only one set of tags, but that we could work around. API is JSON, so unclear if it can handle non-real values. C* is backend. Seems to support the operations we need on labels.

What's unclear is how exactly it's using C*, and how it has implemented the label lookups. Looking at the schema, neither looks to be efficient enough for our use case.

See :

  • tags and label based search (https://github.com/openshift/origin-metrics/issues/34)
  • Querying based on tag (https://github.com/openshift/origin-metrics/blob/master/docs/hawkular_metrics.adoc#querying-based-on-tag)

Also, Cassandra 3.4 added support for the SASI custom index :

  • SASIIndex (https://github.com/apache/cassandra/blob/trunk/doc/SASI.md)
  • Improved Secondary Indexing with new Query Capabilities (OR, scoping) for Cassandra (https://github.com/xedin/sasi)
  • SASI Empowering Secondary Indexes (http://www.planetcassandra.org/blog/sasi-empowering-secondary-indexes/)
  • Indexing with SSTable attached secondary indexes (SASI) (https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndexConcept.html)
  • SASI: Real-Time Search and Analytics with Cassandra (http://www.meetup.com/fr-FR/DataStax-Cassandra-South-Bay-Users/events/229467895/?eventId=229467895)

SASI use case : Implement string metric type (https://issues.jboss.org/browse/HWKMETRICS-384)

I created the issue "Hawkular metrics as the Long-term storage backend for prometheus.io" (https://issues.jboss.org/browse/HWKMETRICS-400)

I think it'd take a major redesign of Hawkular to make it work for the data volumes Prometheus produces. For example a single label matcher such as job=node can easily match tens of millions of time series on a large setup. That's going to blow out the 4GB row size limit in C* for metrics_tags_idx.

Hawkular also appears to use at least 32 bytes per sample, before replication.

I don't know about a 4GB row size limit in C*.
Cassandra has a 2 billion column limit (https://github.com/kairosdb/kairosdb/issues/224).
CQL limits (https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html)

What's the relation with PR "Generic write #1487" (https://github.com/prometheus/prometheus/pull/1487) ?

I think it'd take a major redesign of Hawkular to make it work for the data volumes Prometheus produces. For example a single label matcher such as job=node can easily match tens of millions of time series on a large setup. That's going to blow out the 4GB row size limit in C* for metrics_tags_idx.

You are correct that there is the potential that we could wind up with very wide rows in metrics_tags_idx. It has not been a concern because we have not been dealing with data sets large enough to necessitate a change.

Changing the schema for metrics_tags_idx as well as our other index tables is something we certainly could do. One possibility would be to implement some manual sharding. We would add a shard or hash id column to the partition key. This would also allow us to effectively cap the number of rows per partition. As the data set grows we might need to reshard and increase the number of shards. I think the solution would have to take this into account as well.
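
As a toy illustration of that manual-sharding idea (my assumption, not Hawkular's actual schema code), the shard id added to the partition key could simply be a bounded hash of the tag name/value pair:

    package shard

    import "hash/fnv"

    // numShards caps how many partitions a single logical tag row is split into.
    const numShards = 16

    // ShardID returns the partition-key shard for a given tag name/value pair.
    func ShardID(tagName, tagValue string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(tagName))
        h.Write([]byte{0}) // separator so "a"+"bc" and "ab"+"c" don't collide
        h.Write([]byte(tagValue))
        return h.Sum32() % numShards
    }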

Hawkular also appears to use at least 32 bytes per sample, before replication.

Which table(s) are you referring to? We recently moved to Cassandra 3.x, which includes a major refactoring of the underlying storage engine. I need to review the changes some more before I can say precisely how many bytes are used per sample. Keep the following in mind though. In Cassandra 2.x the column name portion of clustering columns was repeated on disk for every cell. This is no longer the case in Cassandra 3. This can result in considerable space savings. And Cassandra stores data in compressed format by default.

Just wanting to make sure I understand the state of this area of enhancement... are the remote providers in /storage/remote supported now? If so, what remains to be completed here?

They are effectively deprecated. They will be removed once we have some form of generic write in Prometheus (e.g. #1487).

Hi all, I wanted to bring up another possibility for a storage layer: Apache Kudu (incubating). Kudu bills itself as an analytics storage layer for the Hadoop ecosystem, and while it is that, it's also an excellent platform for storing timeseries metrics data. Kudu brings some interesting features to the table:

  • Columnar on-disk format: supports great compression with timeseries data, and enables extremely fast scans. Fancy encodings like bitshuffle, dictionary, and run length, as well as multiple compression types (LZ4, etc.) are built in
  • Strongly consistent replication via Raft
  • No dependencies: doesn't require HDFS/Zookeeper/anything else
  • Designed with operations in mind: no garbage collection pauses, and advanced heuristics for disk compactions that make them predictable and smooth
  • Advanced partitioning: metrics can be hash-distributed among nodes, _and_ partitions can be organized by time, so that new partitions can be brought online, and old partitions can be dropped as they ttl. This provides great scalability for metrics workloads
  • Designed for scan-heavy workloads: unlike a lot of databases that are optimized first and foremost for single record or value retrieval, Kudu is optimized for scans

My goal is to let you all know about Kudu as a potential storage solution, and ask what Prometheus is looking for in a storage layer. I've had some experimental success with a TSDB-compatible HTTP endpoint in front of Kudu, but perhaps Prometheus is looking for a different sort of API? It would be great to get a sense of what Prometheus needs from a distributed storage layer, and if Kudu could fill the role.

As far as the maturity of Kudu, we are planning to have a 1.0 production ready release later this summer. The project has been under development for more than three years, and we already have some very large production installations (75 nodes, ingesting 250K records/sec continuously), and routinely test on 200+ node clusters.

I expect Kudu to fall down for the same reasons BigQuery did, the access patterns for time series data and columnar data are very different.

75 nodes, ingesting 250K records/sec continuously

A single Prometheus server can generate over three times that load.

@brian-brazil could you be more specific about the problematic access patterns? Except for being columnar, our architecture isn't really comparable to BigQuery. Kudu is designed for low-latency, highly concurrent scans.

A single Prometheus server can generate over three times that load.

Yah, that is probably a bad example; that use case isn't metrics collection, and I believe the record sizes are quite large comparatively. I was able to max out a single-node Kudu setup at ~200K metric datapoint writes/second without Kudu breaking a sweat on my laptop (it was bottlenecked in the HTTP proxy layer). I haven't really dug in and gotten solid numbers yet, though. Definitely on my TODO list.

In particular - Kudu keeps data sorted by a primary key index, so scanning for a particular metric and time range only requires reading the required data. As a result, timeseries scans can have <10ms latencies.

If I had a TSDB with 1 billion metrics, and gave you an expression that matched 100 of them over a year how would that work out?

With the experimental project I linked to earlier, it keeps a secondary index in a side table of (tag key, tag value) -> [series id], and then uses the resulting IDs to perform the scan. So if you have a billion point dataset but your query only matches 100 points based on the metric name, tagset and time, it will only scan the data table for exactly those 100 points. It's modeled after how OpenTSDB uses HBase, but with a few important differences. There's a bit more info on that here. One huge benefit of having a columnar format instead of rowwise, is that a system like this doesn't need to do any external compactions/datapoint rewriting, like OpenTSDB has to do for HBase. The columnar encoding and compression options are already built in and better than anything that OpenTSDB will do.

That is just a particular instance of how timeseries storage could be done with Kudu. If the data model doesn't look like the OpenTSDB-style metric/tagset/timestamp/value, it could be done differently.

One of those tag matches (of which there are likely 2-4) could easily be for 10M time series. For 100 time series over a year with 5m resolution, that's about 10 million points.

Sorry, I don't follow. So for a query such as

SELECT * WHERE
   metric = some_metric AND
   timestamp >= 2015-01-01T00:00:00 AND
   timestamp < 2016-01-01T00:00:00 AND
   tag_a = "val_a" AND
   tag_b = "val_b" AND
   tag_c = "val_c";

it finds the set of timeseries where tag_a = "val_a", the set where tag_b = "val_b" and the set where tag_c = "val_c". It takes these three sets, finds the intersection (in order to find the series which match all three predicates), and then issues a scan in parallel for each. Each of these scans can read back only the necessary data (although the data may be spread across multiple partitions). The schema wasn't really designed for the case where an individual (tag_key, tag_value) pair might have millions of matching series, so there is probably a more efficient way to do it with that constraint in mind.
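
For reference, the series-lookup step described here boils down to intersecting posting lists of series IDs, roughly like this Go sketch (illustrative only, not the linked project's code):

    package index

    // Intersect returns the IDs present in both sorted slices.
    func Intersect(a, b []uint64) []uint64 {
        var out []uint64
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    // IntersectAll folds Intersect over the posting list of every tag predicate,
    // yielding the series IDs that match all predicates and need to be scanned.
    func IntersectAll(lists [][]uint64) []uint64 {
        if len(lists) == 0 {
            return nil
        }
        result := lists[0]
        for _, l := range lists[1:] {
            result = Intersect(result, l)
        }
        return result
    }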

Yes, a label such as job=node could easily have 10s of millions of matching metrics.

The tag to series ID lookup I mentioned earlier is independent of the metric, so many metrics can share a single series ID. A series ID really just describes a unique set of labels, in Prometheus terms. Again, this is how I structured that solution, so it's not inherent to Kudu as a storage layer. Obviously multi-attribute indexing is a difficult problem and there are a lot of ways to go about it.

@brian-brazil more generally, it sounds like you have a pretty good idea of what you are looking for in an external storage system. Is that written up anywhere, or could you elucidate?

Hi,

This is an interesting thread, and yes, even I was about to ask if there was some proposal somewhere about the long-term storage. It would make it easy for us to get context and understand if this is a task that we can take up and experiment with.

Thanks,
Goutham.

The idea behind #1487 is to expose an interface to allow users to experiment with solutions. The full-on solution is quite difficult, however more constrained forms of the problem (such as not needing indexing) are more tractable.

Realistically, an index able to handle those requirements would be a different problem from storing the actual sample data. Maybe it can leverage some lookup efficiencies provided by that layer.

Hello,

@brian-brazil I was wondering if this read-back from a time series database feature is being worked on at the moment, and if so, do you have any rough estimate of when a beta version might be released, whether sometime this year or next? The company I work for is very interested in this tool and in the ability to query data long term, for a service that we are working on releasing in March 2017. Any information would be greatly appreciated.

Thanks,
Gustavo

There is no set timeline, however I'd expect something to be possible by that time. #1487 is one proposal to get us going on the path, and Weave also had a proposal which'd involve both the read and write aspects.

A couple questions:

  1. Have you looked at InfluxDB recently? While clustering is no longer available (and kind of sucked IMHO), there's an open-source project called influxdb-relay that handles replication and load-balancing amongst several InfluxDB nodes. https://github.com/influxdata/influxdb-relay/blob/master/README.md
  2. Have you considered Riak TS? I haven't seen anyone mention it, so I figured I'd ask. http://basho.com/products/riak-ts/

We haven't reevaluated Influx following the closed source announcement, if a working low-maintenance open clustering solution becomes available we can look at it again.

I had a peek at Riak TS a while back, my recollection is that the storage wasn't quite efficient enough for the sort of volumes we have and you'd still need something for indexing.

Just wondering if you guys have any thoughts on using Prometheus with OpenTSDB storage built on top of Amazon EMR? It seems like that could reduce the cluster maintenance burden - do you agree with that, and if so, would OpenTSDB export be a better option?

Any solution must be workable on-prem, which means we can't delegate Hadoop to the cloud (and last I checked, EMR isn't a fully managed solution anyway).

What about Elasticsearch 5 (currently at alpha5)? They announced that the new version will use Lucene 6, which has lots of optimizations specifically for numeric data indexing.

from Elastic blog:

Lucene 6 brings a major new feature (implemented since alpha2) called Dimensional Points: a new tree-based data structure which will be used for numeric, date, and geospatial fields. Compared to the existing field format, this new structure uses half the disk space, is twice as fast to index, and increases search performance by 25%. It will be of particular benefit to the logging and metrics use cases.

I don't think ES is an appropriate solution for a primary data store, and the type and volume of data isn't really what it's meant for.

I'm not sure that's going to be true any more. Kibana is already a popular tool for displaying metrics from ElasticSearch. I think there is a concerted effort to make it more efficient for this type of work. Also, I don't think volume will be a concern if it's properly set up.

Also, it is the best data store I've used in terms of clustering. Granted, I haven't used that many, but it's been working well in a live cluster so far even with occasional node failures.

@sybrandy that was exactly my point: the Elastic stack as of today is much more focused on monitoring. For example, the Beats tools are similar in their purpose to exporters in Prometheus. And the output goes into ES.

Given that the project is fully open-source, has easy clustering, a nice REST interface, and (with Lucene 6, when ES5 comes out) much better numeric indexing, it could potentially serve Prometheus as long-term storage. IMHO.

ES is commonly used for logs; metrics are a very different use case. Lucene, for example, is irrelevant.

Also, I don't think volume will be a concern if it's properly set up.

I wouldn't associate ES with petabyte scale data sets that need to be safely stored for years, which is what we're talking about here. It's a text search engine, not a bulk data store.

Also, it is the best data store I've used in terms of clustering. Granted, I haven't used that many, but it's been working well in a live cluster so far even with occasional node failures.

That is not my experience, and I note a number of outstanding bugs around incorrect handling of node failures.

Approaching this from an agnostic point of view seemed sensible. If there is a well documented way of getting data out of Prometheus we can build solutions against it to test.

DalmatinerDB was designed for time series data and is based on Riak Core (not KV) and has a Postgres index specifically designed for making dimensional time series queries fast. We could do some additional work to increase the 62bit precision storage to 64bit or above. We're currently working towards making the query language feature comparable.

If there were a sensible way to either push or pull the data from Prometheus, either I or someone else at Dataloop or Project Fifo would write an initial implementation for storing the data in a DalmatinerDB cluster. If that works, we could look at writing query translation from Prometheus to DalmatinerDB.

It would be nice to see some movement on this one as it appears to have gone on for a couple of years now. I feel like a lot of the work could be done and maintained outside of the core Prometheus project team if the interface was just defined, released and iterated on.

I feel like a lot of the work could be done and maintained outside of the core Prometheus project team if the interface was just defined, released and iterated on.

That's also my opinion. #1487 is one approach to this.

I think a lot of people are on board for the interface, as other proprietary DBs may also be used (which may not be in line with the current feature spec).

I wouldn't associate ES with petabyte scale data sets that need to be safely stored for years, which is what we're talking about here. It's a text search engine, not a bulk data store.

First, some quick googling shows that people are doing it. I know it's technically a text search engine, but it looks like it's going beyond that. Even then, you do have replication and backup capabilities, so the risk of losing data is mitigated. (I'm not convinced that any system 100% guarantees no data loss.)

Second, not everybody needs long-term data. Perhaps I only care about what happened in the past month. I don't think a storage solution should be discounted just because it doesn't fit one use case.

However, if the work to create a generic interface gets finalized and works well, it's a moot point and, IMHO, it really should be. Let people contribute code to allow them to use ElasticSearch, InfluxDB, RiakTS, whatever they want. It's fine to recommend certain solutions as "this is what we found to work best", but sometimes you just have to work with whatever is in your stack or that your team is most familiar with.

For example, in our current system, we're using InfluxDB for our metrics. It works great, but we may be moving this into our existing ElasticSearch cluster. Why? Well, we're already putting logs in there, it clusters/scales much easier, Kibana has improved in some areas, and it's one less thing to set up and maintain. Is it the best fit? Probably not. However, there are significant advantages to moving towards it that can't be ignored.

Second, not everybody needs long-term data. Perhaps I only care about what happened in the past month. I don't think a storage solution should be discounted just because it doesn't fit one use case.

The subject of this issue is long-term storage. Prometheus already handles short-term storage.

Ah, apologies. It's been so long since I read the top, I forgot.

I've started a spreadsheet comparing time series databases which is probably quite relevant to this ticket.

https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit#gid=0

If anyone would like edit rights to make changes let me know and I'll send an invite.

There's a corresponding blog to go with it, but it really just sums up the sentiment in this thread so probably isn't all that useful.

https://blog.dataloop.io/top11-open-source-time-series-databases

We are approaching a state where DalmatinerDB would work nicely alongside Prometheus as long-term storage. It's extremely fast on writes, so it can easily handle several Prometheus servers' worth of data. Here's a reproducible benchmark showing 3 million writes per second on some Joyent VMs:

https://gist.github.com/sacreman/b77eb561270e19ca973dd5055270fb28

Queries are also quick and we avoid all of the issues found in the columnar solutions. Setup is pretty easy compared to the Hadoop based systems. It's not very well advertised yet as we are cleaning up the builds and releasing packages next week, although this Gist works right now:

https://gist.github.com/sacreman/9015bf466b4fa2a654486cd79b777e64

The only negative is value precision. I've raised a ticket so we can investigate:

https://github.com/dalmatinerdb/dalmatinerdb/issues/77

However, that doesn't block any work getting started on an integration. Now that a database exists that's at least theoretically viable, it seems a shame I don't have any way to have a play around with it yet.

@sacreman Thanks, great comparison. Glad that Prometheus won the query language competition with the only 5/5 points :) I do prefer PromQL over SQL dialects for the kind of time-series computations that are common in Prometheus...

As for storage: a bunch of us and people from companies building first remote storages stuck our heads together today (on a Saturday - just for you :)) to finally unblock this whole remote storage discussion. We concluded how we want to build a generic interface for writing and reading back data, and then anyone could build their own adapter ontop of any long-term storage system that they want (since we'll never be able to just support one directly and have that be good enough for everyone).

It'll go roughly like this: we'll send raw samples via protobuf+gRPC on write. For reads, we'll send a time range, a set of label selectors, and an opaque data field (useful for things like hinting at desired downsampling), and expect back a list of time series (label sets + samples) for those selectors and time ranges. Exact implementations of this are going to follow soon... expect the generic write side first, with the read side to follow.
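
As a rough illustration of the write side described above, a payload along these lines could be batched up and sent to the remote end. The message and field names here are a sketch, not the finalized interface:

```proto
// Illustrative sketch only; message and field names are assumptions,
// not the finalized Prometheus remote write interface.
syntax = "proto3";

message Sample {
  double value       = 1;
  int64 timestamp_ms = 2;
}

message LabelPair {
  string name  = 1;
  string value = 2;
}

message TimeSeries {
  repeated LabelPair labels = 1;
  repeated Sample samples   = 2;  // raw samples, expected sorted by timestamp
}

message WriteRequest {
  repeated TimeSeries timeseries = 1;
}
```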

@juliusv that is great news! Now when I do a future blog ranking all open source monitoring tools in a top 10 I can rank Prometheus/Grafana/DalmatinerDB as number 1 :)

Will keep an eye out for the first version and then get started on playing.

@sacreman See also the notes we took (we first had a general roadmap discussion, then generic remote API): https://docs.google.com/document/d/1Qj_m-lbatySAU0vsVfAPG7mduN1-nevrpD85uQhi4sY/edit#heading=h.6xcexetcoa4g

@juliusv that all looks good. DalmatinerDB has a metrics proxy that we'll update for both the reads and the writes.

https://github.com/dalmatinerdb/ddb_proxy

We can make it do this:

[image: prometheus]

@sacreman Sounds good!

For what it's worth, we're willing to work with the Prometheus community to make Warp 10 (www.warp10.io) a viable solution for long term storage.

Given that the features of Borgmon/PromQL can probably be fully covered by WarpScript, this makes it a very flexible solution to consider, in my (biased) opinion.

@hbs Cool. The plan is still as outlined in https://github.com/prometheus/prometheus/issues/10#issuecomment-242937078, so we won't have any specific integrations in Prometheus itself, just the generic one, and then people can build whatever they want with it. I'd be happy to see a Warp 10 integration.

My $dayjob has requirements that are going to push us towards long-term (>5 year) storage of certain metrics, or at least performing data downsampling on one or more metrics at much longer intervals. It would be very convenient to have this be a "transparent" feature in Prometheus, where data would move out to the long-term store and be re-imported upon long-timeframe queries, even if looking at older data was significantly slower to access due to disk latency, uncached points, API delays, etc. I leave "significantly" as a purposefully vague term.

I'm hoping to avoid splitting the data into two silos to get the features we need, and having to re-write the views/graphs/etc. that I have with Grafana laid on top of Prometheus. We're even probably willing to put a small amount of dollars/euros towards development of this capability, since it would solve many issues and save some time on our side. I've been encouraged by the ideas in this thread over the last three years, and I'm hoping the recent activity bodes well for running code. Alternately, having data downsampling capability might go a long way towards solving our particular needs.

Checking the v1.4.1 prometheus binary, those "-storage.remote.influxdb." and "-storage.remote.opentsdb" parameters still exist; should they be removed?

$ prometheus --help
usage: prometheus [<args>]
...
   -storage.remote.graphite-address
      The host:port of the remote Graphite server to send samples to.  None, if empty.

   -storage.remote.graphite-prefix
      The prefix to prepend to all metrics exported to Graphite. None, if empty.

   -storage.remote.graphite-transport "tcp"
      Transport protocol to use to communicate with Graphite. 'tcp', if empty.

   -storage.remote.influxdb-url
      The URL of the remote InfluxDB server to send samples to. None, if empty.

   -storage.remote.influxdb.database "prometheus"
      The name of the database to use for storing samples in InfluxDB.

   -storage.remote.influxdb.retention-policy "default"
      The InfluxDB retention policy to use.

   -storage.remote.influxdb.username
      The username to use when sending samples to InfluxDB. The corresponding password must be provided via the INFLUXDB_PW environment variable.

   -storage.remote.opentsdb-url
      The URL of the remote OpenTSDB server to send samples to. None, if empty.
...

@linfan They'll be removed at some point, but it's not been decided exactly when yet. At the very least with Prometheus 2.0, I presume, if not well before that.

I was hoping to get the groundwork in to remove them over Christmas, but load testing took up my time instead. Basically, I believe we should offer example receivers for these before removing them, so as not to leave any existing users in the lurch. That's a fairly easy task if someone would like to take it up, and then we can remove them in 1-2 minor revisions.

OpenTSDB itself is not very hard to run, especially if you can offload the DB to Google or an internal Hadoop team (although it shouldn't really be co-located with MapReduce clusters; it does best on a small dedicated one). Enable salting with a width of the number of regionservers you will have, up to around a max of 12. That's pretty much the only knob you need to worry about, because it affects table creation.

HBase is much harder to get tuned at initial setup. If you have more than a handful of servers, crank the max region size up to 300 or 400GB. Make sure you are using HBase 1.2+ so the region normalizer can run to keep tiny regions from sprawling. I've seen other advice on this, but keeping track of too many regions uses a ton of resources on the regionserver and within the TSD. Too many regions also hurt cold-start performance and node decommission/rebalance; you only want a few hundred to a few thousand regions even for a very, very large dataset. In all, there are about 40 knobs you have to crank way up, paired with a huge JVM heap (or a medium-sized heap and an off-heap L2 cache) and the G1 garbage collector to keep worst-case pause times to around 100ms. After proper setup, it is pretty much hands-off: you can elastically add and remove nodes and get close to linear scalability. I hated it until we figured out the magic HBase tuning, and now it's pretty good and low-maintenance.
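
To make that concrete, here is a rough sketch of the handful of settings referenced above, assuming OpenTSDB 2.2+ and HBase 1.2+. The property names are the commonly documented ones, the values are placeholders, and the HBase part is shown as flat key=value pairs rather than hbase-site.xml for brevity. Note that OpenTSDB splits salting into a byte width plus a bucket count, and the bucket count is the part usually sized to the number of regionservers; treat this as a starting point, not a recipe:

```
# opentsdb.conf: salting has to be decided before the TSDB tables are created
tsd.storage.salt.width=1
tsd.storage.salt.buckets=12          # roughly one per regionserver, capped around 12
tsd.core.auto_create_metrics=true

# hbase-site.xml equivalent (shown as key=value for brevity)
hbase.hregion.max.filesize=322122547200   # ~300 GB max region size
# plus turn on the region normalizer, e.g. 'normalizer_switch true' in the HBase shell

# hbase-env.sh: big heap plus G1 to keep worst-case pauses near 100ms
export HBASE_OPTS="$HBASE_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=100"
```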

FYI: We will add support for labels to BigGraphite to make it usable as a remote storage for Prometheus (https://github.com/criteo/biggraphite/issues/95).

To make the write path efficient, we will need a way to control the batching of samples. We will also need service discovery and basic load balancing if we want something reliable enough. Should a specific label be created for remote storage to track all the things needed to get there?

The API is what it is; any batching or load balancing needs to be done on your end.

My concern is that I'm not sure the current API makes it easy to build something reliable without too much duplicated work.

On the write path, load balancing is not that important as long as the bridge is at least as performant as Prometheus itself. Service discovery is still a must, though (same rationale as for the Alertmanager). For batching: considering that Prometheus already has what looks like a checkpointed buffer cache, it would be sad to have to re-implement this part in every bridge that wants to do smart things such as double-delta encoding, but that's probably fine if it is not present in the early version.

On the read path, basic round-robin load balancing plus service discovery will probably be needed.

The purpose of having the API is to decouple performance and reliability concerns, as writing this sort of system is difficult, and it's all too easy for things like buffers to back up and take out monitoring.

Ok, agreed for buffering / batching. At least as a first step.

The other point was service discovery for remote storage; what do you think? (IMHO it's somewhat the same rationale as for the Alertmanager: you need it if you don't want to hack around it at the network layer.)

SD for it is certain. Load balancing is not determined; I don't believe that belongs in Prometheus.

What's the current status?
Is there anything other people can help with to get the pluggable storage API?

@danni-m There's multiple parts here:

  • The write path: This exists and can be used today (a minimal config sketch follows after this list). See https://prometheus.io/docs/operating/configuration/#

  • The read path, Prometheus reading back from the remote storage. This does not yet exist, but I hope to build an initial version of it within the next one or two months.

  • The remote storage itself: Anyone can build remote storage implementations using the above-mentioned generic read/write interfaces (well, at the moment, only the write one exists). Two very different real-world examples already exist (https://github.com/weaveworks/cortex and https://github.com/ChronixDB/chronix.ingester). There are likely going to be many implementations, as different people need different things. But maybe there will be one that's so well-done, easy to run, and generically useful that it becomes the "default" one we recommend. That's not really clear yet, I'd encourage people to start experimenting with the write path at least.
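
For the write path, the configuration boils down to pointing Prometheus at an adapter endpoint. A minimal sketch (the exact shape of this block has changed between releases, and the localhost URL is just an example of a generic write adapter listening locally):

```yaml
remote_write:
  url: "http://localhost:9201/write"   # example URL of a generic write adapter
```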

Don't forget https://github.com/digitalocean/vulcan. It looks to be the most complicated however.

I'm working on remote read now.

Some questions/points for @brian-brazil et al.:

  • I'm going for multiple remote read endpoints (via reloadable config), similar to us having multiple remote writers. Anything speaking against that?
  • I'm not planning on SD support in the first pass, just a static URL.
  • When to query a remote endpoint will be configured per endpoint as a duration into the past (query everything older than X) for now, if nobody objects; see the hypothetical config sketch after this list. Local storage will be queried for the entire query range for now; results will be merged / deduped.
  • Should remote read have relabeling as well? I'd probably leave that out at first though.
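
To illustrate the per-endpoint idea from the list above, here is a purely hypothetical sketch of what such configuration could look like. None of these option names exist; they just mirror the proposal:

```yaml
# Hypothetical sketch: option names are illustrative of the proposal above,
# not an actual Prometheus configuration.
remote_read:
  - url: "http://lts-1.example.org/read"
    min_age: 6h        # only ask this endpoint for data older than 6 hours
  - url: "http://lts-2.example.org/read"
    min_age: 30d
```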

Cool! It would be interesting to see the API that you're planning to have. What you've described so far looks good.

  • Maybe in the future it would be interesting to have an option to query remote storage up to the start boundary of local storage (when necessary).
  • I guess you're planning to query remote storage each time for now, and we will see later whether there is some kind of caching (disk and/or memory) or not?

@iksaif

Would be interesting to see the API that you're planning to have.

It will look basically like Cortex's internal query request / response: https://github.com/weaveworks/cortex/blob/master/cortex.proto#L24-L28
https://github.com/weaveworks/cortex/blob/master/cortex.proto#L24-L32

We'll send it over plain HTTP, not gRPC, though.
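
For readers who don't want to click through, here is a rough sketch of what such a request/response pair could look like, modeled on the Cortex messages linked above and reusing the TimeSeries/Sample shape from the write sketch earlier in this thread. The names and field numbers are illustrative, not the finalized Prometheus remote read definitions:

```proto
// Illustrative sketch only, modeled on Cortex's internal query messages.
// (TimeSeries, LabelPair and Sample as in the write sketch earlier in this thread.)
syntax = "proto3";

message LabelMatcher {
  enum Type {
    EQUAL          = 0;
    NOT_EQUAL      = 1;
    REGEX_MATCH    = 2;
    REGEX_NO_MATCH = 3;
  }
  Type type    = 1;
  string name  = 2;
  string value = 3;
}

message ReadRequest {
  int64 start_timestamp_ms       = 1;
  int64 end_timestamp_ms         = 2;
  repeated LabelMatcher matchers = 3;
  bytes hints                    = 4;  // opaque field, e.g. for downsampling hints
}

message ReadResponse {
  repeated TimeSeries timeseries = 1;
}
```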

Maybe in the future it would be interesting to have an option to query remote storage up to the start boundary of local storage (when necessary).

That isn't too useful though, as just using the retention time doesn't necessarily mean that you already have data in Prometheus all the way to the retention time (like when you wipe your Prometheus server). Also, your Prometheus server might have random gaps at various times, even if it has some old data.

I guess you're planning to query remote storage each time for now and we will see later if there is a kind of caching (disk and/or memory) or not ?

Yeah, for now it would just be queried every time you do a query that queries data older than a configurable age. We can see about doing something more clever later.

I'm going for multiple remote read endpoints (via reloadable config), similar to us having multiple remote writers. Anything speaking against that? I'm not planning on SD support in the first pass, just a static URL.

These are essential, but don't have to be in v1.

We'll also need the ability to enable/disable remote read on a per-HTTP-query / per-rule-group basis, and to override that so it's forced to always-on for some (non-LTS) endpoints.

When to query a remote endpoint will be configured per endpoint as a duration into the past (query evth. older than X) for now if nobody objects. Local storage will be queried for the entire query range for now, results will be merged / deduped.

For v1, I'd just always query everything.

A simple static duration is not sufficient, as the remote storage may not be caught up that far yet, or Prometheus may have retention going further back. I think this is something we'll have to figure out over time for true LTSes (my guess is we'll end up with min(retention period, when local storage was initialised)), and for non-LTS you always want to query, so that's the simplest way to start.

Should remote read have relabeling as well? I'd probably leave that out at first though.

This is kinda complicated. We won't need relabelling (that's the responsibility of the other end). The label semantics we want are to add external labels as defaults, and remove them coming back if we added them. This can possibly be dealt with a bit later.

We may also need to pass on the external labels directly, but let's wait a bit and see if that's needed as I'm hopeful it won't be.

It will look basically like Cortex's internal query request / response

LGTM

That isn't too useful though, as just using the retention time doesn't necessarily mean that you already have data in Prometheus all the way to the retention time (like when you wipe your Prometheus server). Also, your Prometheus server might have random gaps at various times, even if it has some old data.

It is, though. For example, we run Prometheus on Marathon, and each time an instance restarts it moves to a new, empty "disk". That's more true for gaps if you have network issues, but I feel like this would dramatically improve performance for common queries if you accept losing some consistency. But let's talk about optimizations later.

@brian-brazil Ok, I'll leave out any labeling foo and always query everything for now. Question is whether we'll want the per-query / rule LTS determination in v1. It could influence the design a bit, so I'll think about it at least.

@iksaif ok yeah, so you're not talking about the retention time, but the earliest sample timestamp in Prometheus's current database. We don't track that currently, but that would generally make more sense, yeah.

Question is whether we'll want the per-query / rule LTS determination in v1. It could influence the design a bit, so I'll think about it at least.

It's not v1, but it's something we should have before we start getting into production usage as it's critical for reliability.

@brian-brazil For sure!

The other thing to consider is having the API allow for multiple vector selectors (each with their own timestamps due to offset), so that LTSes can optimise and do better throttling/abuse handling.

@brian-brazil Oh yeah, I have multiple vector selector sets, but great point about different offsets!

A simple static duration is not sufficient, as the remote storage may not be caught up that far yet or Prometheus may have retention going further back. I think this is something we'll have to figure out

I don't think Prometheus having retention going further back is really an issue here, as long as the remote can (already) provide the data. The worst case, with downsampling, is that you lose granularity.

@pilhuhn I meant it the other way around: if you have a Prometheus retention of 15d and you query only data older than 15d from the remote storage, it doesn't necessarily mean that Prometheus will already have all data younger than 15d (due to storage wipe or whatever).

Well, for a first iteration we're just going to query all time ranges from everywhere.

There's a WIP PR for the remote read integration here for anyone who would like to take a look early: https://github.com/prometheus/prometheus/pull/2499

I'm trying to use the remote_storage_adapter to send metrics from prometheus to opentsdb. But I'm getting these errors in the logs:

WARN[0065] cannot send value NaN to OpenTSDB, skipping sample &model.Sample{Metric:model.Metric{"instance":"localhost:9090", "job":"prometheus", "monitor":"codelab-monitor", "location":"archived", "quantile":"0.5", "__name__":"prometheus_local_storage_maintain_series_duration_seconds"}, Value:NaN, Timestamp:1492267735191}  source=client.go:78
WARN[0065] Error sending samples to remote storage       err=invalid character 'p' after top-level value num_samples=100 source=main.go:281 storage=opentsdb

I've also tried using influxdb instead of opentsdb, with similar results:

DEBU[0001] cannot send value NaN to InfluxDB, skipping sample &model.Sample{Metric:model.Metric{"job":"prometheus", "instance":"localhost:9090", "scrape_job":"ns1-web-pinger", "quantile":"0.99", "__name__":"prometheus_target_sync_length_seconds", "monitor":"codelab-monitor"}, Value:NaN, Timestamp:1492268550191}  source=client.go:76

Here's how I'm starting the remote_storage_adapter:

# this is just for influxdb, i make the appropriate changes if trying to use opentsdb
./remote_storage_adapter -influxdb-url=http://138.197.107.211:8086 -influxdb.database=prometheus -influxdb.retention-policy=autogen -log.level debug

Here's the Prometheus config:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

remote_write:
  url: "http://localhost:9201/write"

Is there something I'm misunderstanding about how to configure the remote_storage_adapter?

@tjboring Neither OpenTSDB nor InfluxDB support float64 NaN (not a number) values, so these samples are skipped when sending samples to them. We have mentioned this problem to InfluxDB, and if we're lucky, they will support NaN values sometime in the future, or maybe we can find another workaround.

OpenTSDB issue: https://github.com/OpenTSDB/opentsdb/issues/183
InfluxDB issue: https://github.com/influxdata/influxdb/issues/4089

I am not sure where the invalid character 'p' after top-level value error comes from though.
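
For anyone curious what that skipping looks like on the adapter side, here is a minimal Go sketch of the idea (a simplified stand-in, not the actual remote_storage_adapter code; the Sample type and log line are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// Sample is a simplified stand-in for the Prometheus model.Sample type
// that a write adapter receives.
type Sample struct {
	Metric    map[string]string
	Value     float64
	Timestamp int64
}

// filterSendable drops samples whose values the target store cannot represent.
// OpenTSDB and InfluxDB reject NaN; infinities tend to be problematic as well.
func filterSendable(samples []Sample) []Sample {
	out := make([]Sample, 0, len(samples))
	for _, s := range samples {
		if math.IsNaN(s.Value) || math.IsInf(s.Value, 0) {
			fmt.Printf("cannot send value %v, skipping sample %v\n", s.Value, s.Metric)
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	samples := []Sample{
		{Metric: map[string]string{"__name__": "up"}, Value: 1, Timestamp: 1492267735191},
		{Metric: map[string]string{"__name__": "foo"}, Value: math.NaN(), Timestamp: 1492267735191},
	}
	// Only the first sample survives; the NaN one is skipped with a warning.
	fmt.Println(len(filterSendable(samples)))
}
```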

@juliusv Thanks for the pointers to the opentsdb/influxdb issues. I was just seeing the error messages on the console and thought nothing was being written, not realizing those are just samples that are being skipped. I've since confirmed that samples are indeed making it to the remote storage db. :)

Now that remote read and write APIs are in place (albeit experimental), should this issue be closed in favour of raising more specific issues as they arise?

Has anybody tried this with a container? Please paste a Dockerfile.

I ask because I am not able to find the "remote_storage_adapter" executable in the "prom/prometheus" Docker image, version 1.6:

/prometheus # find / -name remote_storage_adapter
/prometheus #

Please

@prasenforu I have built a docker image with remote_storage_adapter from current master code: gra2f/remote_storage_adapter, feel free to use it.

@juliusv I have a problem similar to @tjboring's:

time="2017-04-21T17:45:00Z" level=warning msg="cannot send value NaN to Graphite,skipping sample &model.Sample{Metric:model.Metric{\"__name__\":\"prometheus_target_sync_length_seconds\", \"monitor\":\"codelab-monitor\", \"job\":\"prometheus\", \"instance\":\"localhost:9090\", \"scrape_job\":\"prometheus\", \"quantile\":\"0.9\"}, Value:NaN, Timestamp:1492796695772}" source="client.go:90"

but I am using Graphite. Is it okay?

@sorrowless

Do you see other metrics in Graphite that you know came from Prometheus?

In my case I verified this by connecting to the Influxdb server I was using, and running a query. It gave me back metrics, which confirmed that Prometheus was indeed writing metrics; it's just that some were being skipped, per the log message.

@tjboring yes, I can see some of the metrics in Graphite, and what's stranger for me is that I cannot understand why some are there and some are not. For example, sy and us per CPU are stored in Graphite, but load average is not.

@sorrowless

I'm not able to find the image; can you please share the URL?

Thanks in advance.

@prasenforu just run
$ docker pull gra2f/remote_storage_adapter
in your command line, that's all you need

@sorrowless

Thanks.

@mattbostock As you suggested, I'm closing this issue. We should open more specific remote-storage related issues in the future.

Further usage questions are best asked on our mailing lists or IRC (https://prometheus.io/community/).

@sorrowless

I was looking at the image and saw the remote_storage_adapter file in /usr/bin, but the rest of the Prometheus files and volumes are not there:

~ # find / -name remote_storage_adapter
/usr/bin/remote_storage_adapter
~ # find / -name prometheus.yml
~ # find / -name prometheus

Anyway, can you please send me the Dockerfile for "gra2f/remote_storage_adapter"?

@prasenforu
You do not need the main prometheus executable to use the remote storage adapter; use the prom/prometheus image for that.
As for the Dockerfile: all it does is copy the prebuilt remote_storage_adapter into the image and run it, that's all (roughly along the lines of the sketch below).
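
For illustration, a hypothetical Dockerfile along those lines might look like this. The base image, Go import path and flags here are assumptions for the sketch, not the actual gra2f/remote_storage_adapter build:

```dockerfile
# Hypothetical sketch only.
FROM golang:1.8
# 'go get' builds the adapter from the Prometheus repo and installs the binary
# into $GOPATH/bin, which is on PATH in the official golang images.
RUN go get github.com/prometheus/prometheus/documentation/examples/remote_storage/remote_storage_adapter
ENTRYPOINT ["remote_storage_adapter"]
# Example flags; adjust for your backend (InfluxDB shown here).
CMD ["-influxdb-url=http://influxdb:8086", "-influxdb.database=prometheus"]
```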

If anyone wants to test it out (like I need to), I wrote a small docker-compose based setup to get this up and running locally - https://github.com/gdmello/prometheus-remote-storage.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
