Elasticsearch: Thinning of historical Data

Created on 6 Oct 2015 · 7 Comments · Source: elastic/elasticsearch

We are using the ELK stack to store and view our production monitoring metrics: response times, error rates, resource utilization, etc.

We are looking to come up with a data retention policy. Ideally we want to keep historical data; however, we do not need the same resolution for older data. For example, we don't need to know CPU usage at 10:32 AM two years ago. Ideally we would be able to thin the data of a time series by removing events within a given interval. (Granted, this would cause an issue with our count aggregations. It would be useful to be able to apply a multiplier to compensate for the removed events, so we can still see count stats like throughput.)

Are there any tools that people have used with Elasticsearch to accomplish this kind of thing?

:CorFeatureIngest :DistributeCRUD discuss high hanging fruit stalled

All 7 comments

I think I have a solution for this...

I think I have a solution for this...

https://xkcd.com/979/

But to be serious I'd love to hear about it!

@nik9000 LOL

APM Data Retention/Archival Process

Snapshot, Compression and Archival of Historical Indices

This process will use Elasticsearch Curator in conjunction with the AWS Cloud Plugin for Elasticsearch to store compressed snapshots of the previous day's index. It should run nightly. Optionally, we can move indices stored in S3 into Amazon Glacier cold storage by setting up an AWS lifecycle rule, so they are stored at a significantly lower cost. The disadvantage is that once files are moved to cold storage they take some time to be read and are not immediately accessible. You can read more about Glacier here.
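As a rough sketch of the nightly job described above, the snippet below only builds the request payloads; it does not call Elasticsearch or AWS. The `logstash-` index prefix, the bucket and path names, the 30-day transition, and all function names are assumptions for illustration. The repository and snapshot bodies follow the shape of the Elasticsearch snapshot REST API (`PUT _snapshot/<repo>` and `PUT _snapshot/<repo>/<snapshot>`), and the lifecycle dictionary follows the shape boto3 accepts for `put_bucket_lifecycle_configuration`.

```python
from datetime import date, timedelta
from typing import Any, Dict


def previous_day_index(today: date, prefix: str = "logstash-") -> str:
    """Name of yesterday's daily index, assuming a <prefix>YYYY.MM.DD naming scheme."""
    return prefix + (today - timedelta(days=1)).strftime("%Y.%m.%d")


def s3_repository_body(bucket: str, base_path: str) -> Dict[str, Any]:
    """Body for registering an S3 snapshot repository (PUT _snapshot/<repo>)."""
    return {"type": "s3", "settings": {"bucket": bucket, "base_path": base_path}}


def snapshot_body(index_name: str) -> Dict[str, Any]:
    """Body for snapshotting a single index (PUT _snapshot/<repo>/<snapshot>)."""
    return {"indices": index_name, "include_global_state": False}


def glacier_lifecycle(prefix: str, days: int) -> Dict[str, Any]:
    """S3 lifecycle configuration that moves objects under `prefix` to Glacier."""
    return {
        "Rules": [
            {
                "ID": "archive-es-snapshots",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


if __name__ == "__main__":
    index = previous_day_index(date.today())
    print(s3_repository_body("my-es-snapshots", "apm"))  # hypothetical bucket/path
    print(snapshot_body(index))
    print(glacier_lifecycle("apm/", 30))  # hypothetical 30-day threshold
```

In practice Curator would issue the snapshot request itself; the payloads are shown only to make the moving parts concrete.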

Sampling/Thinning out of Historical Indices

This section still needs to be fleshed out a bit, and the tooling needs to be created. The general idea is to create a tool that uses the ES Scan and Scroll API to push documents into a new index based on a sample ratio (e.g. Math.random() <= SAMPLE_RATIO). This tool should be run nightly to thin out the data of older indices. In our case we have an alerting system, so the sampling process should skip periods in which we had active alerts; for those periods we should keep full resolution of our data points. Once the tool has been built and a thinning run has completed, the newly created index should be given an index alias of the former index name. Then the old index can be dropped.
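The sampling rule itself can be sketched in a few lines. This is a minimal illustration, not the finished tool: it works on plain dictionaries rather than on scroll results, and the `timestamp` field (epoch seconds), the `(start, end)` alert windows, and the function names are all assumptions. Reading documents with scroll and bulk-indexing the survivors into the new index is left out.

```python
import random
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Window = Tuple[int, int]  # (start, end) in epoch seconds, end exclusive


def in_alert_window(ts: int, alert_windows: List[Window]) -> bool:
    """True if the timestamp falls inside any period with an active alert."""
    return any(start <= ts < end for start, end in alert_windows)


def thin_documents(
    docs: Iterable[Dict[str, Any]],
    sample_ratio: float,
    alert_windows: List[Window],
    rng: Callable[[], float] = random.random,
) -> Iterator[Dict[str, Any]]:
    """Yield the documents to copy into the thinned index.

    Documents inside an alert window are always kept (full resolution);
    all others are kept with probability `sample_ratio`.
    """
    for doc in docs:
        if in_alert_window(doc["timestamp"], alert_windows):
            yield doc
        elif rng() <= sample_ratio:
            yield doc


if __name__ == "__main__":
    docs = [{"timestamp": t, "cpu": 0.5} for t in range(100)]
    kept = list(thin_documents(docs, 0.5, [(40, 50)]))
    print(len(kept), "of", len(docs), "documents kept")
```

Passing `rng` as a parameter makes the sampling deterministic in tests while defaulting to `random.random` for real runs.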

Currently we are using a count aggregation to visualize traffic/throughput, or rate of events. This will not work with sampled data, because our charts would show a lower event count for indices that had been sampled. To fix this, we will have to attach a new field, "countFactor", to all events that we care about. This can easily be done with the Logstash mutate filter. So by default all documents will be indexed into Elasticsearch with a field value "countFactor" : 1. When we run the sampling/thinning process, we will have to multiply the countFactor by the divisor (the inverse of the sample ratio): e.g. when sampling 1/2 of events, "countFactor" : 2. Now let's say a month passes and you want to thin that out by another 50% of documents: you would multiply by 2 again and now have "countFactor" : 4. To visualize the event rate, instead of using a count aggregation you will use a sum aggregation on countFactor. This would give you the approximate rate of events at any given time.
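The countFactor arithmetic can be checked with a small sketch. The function names are hypothetical, and the aggregation body is only one plausible shape: it assumes the Logstash default `@timestamp` field and uses `fixed_interval`, whereas Elasticsearch versions from the time of this thread used `interval` in `date_histogram`. Documents kept at full resolution during alert periods would presumably keep their existing countFactor, since none of their neighbors were dropped.

```python
from typing import Any, Dict, Iterable


def scale_count_factor(doc: Dict[str, Any], sample_ratio: float) -> Dict[str, Any]:
    """Return a copy of a sampled document with countFactor multiplied by 1/sample_ratio."""
    scaled = dict(doc)
    scaled["countFactor"] = doc.get("countFactor", 1) / sample_ratio
    return scaled


def approximate_event_count(docs: Iterable[Dict[str, Any]]) -> float:
    """Estimate the original number of events by summing countFactor."""
    return sum(d.get("countFactor", 1) for d in docs)


def throughput_query(interval: str = "1h", time_field: str = "@timestamp") -> Dict[str, Any]:
    """Search body that sums countFactor per time bucket instead of counting documents."""
    return {
        "size": 0,
        "aggs": {
            "events_over_time": {
                "date_histogram": {"field": time_field, "fixed_interval": interval},
                "aggs": {"approx_events": {"sum": {"field": "countFactor"}}},
            }
        },
    }


if __name__ == "__main__":
    doc = {"countFactor": 1}
    once = scale_count_factor(doc, 0.5)    # first thinning pass, keep 1/2
    twice = scale_count_factor(once, 0.5)  # second pass a month later
    print(once["countFactor"], twice["countFactor"])
    print(throughput_query())
```

The sum is only an estimate: with random sampling the number of surviving documents varies around the expected value, so short buckets will be noisier than long ones.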

@nik9000 not sure if there is a better way to handle this. What are your thoughts?

@cphoover I think it's a good plan. I've been thinking about this for the past few months too and your ideas seem pretty good.

If I am not mistaken, a solution for this exists in our new Rollups API. Closing.
Feel free to reopen if you feel this is not the solution you were looking for.
