Use the ES Rollover API to manage retention. It is an alternative to the date-based indices currently used in Jaeger. We could make it an optional feature.
Before running Jaeger we have to create a write (read) alias. Note that is_write_index works only in ES 6.4 and later:
curl -ivX PUT -H "Content-Type: application/json" localhost:9200/jaeger-span-000001 -d '{
"aliases": {
"jaeger-span": {"is_write_index": true}
}
}'
The command creates index jaeger-span-000001 and alias jaeger-span.
Now the collector can write to the jaeger-span alias. Once the index is too large, an external service can roll over to a new index. The rollover API has to be called periodically; if the conditions are met at the time of the call, ES creates a new index.
curl -ivX POST -H "Content-Type: application/json" localhost:9200/jaeger-span/_rollover -d '{
"conditions": {
"max_age": "7d",
"max_docs": 1
}
}'
The command creates index jaeger-span-000002, which is put into alias jaeger-span. Note that the old index jaeger-span-000001 stays in the alias only if "is_write_index": true is set (supported only in ES >= 6.4).
When using ES < 6.4 we also have to use a read alias, because the main alias jaeger-span can then contain only one index.
curl -ivX POST -H "Content-Type: application/json" localhost:9200/_aliases -d '{
"actions" : [
{ "add" : { "index" : "jaeger-span", "alias" : "jaeger-span-read" } }
]
}'
This command creates the read alias jaeger-span-read, which points to the jaeger-span index (the write index).
When calling rollover we have to specify the alias names. The newly created index will be put into those aliases.
curl -ivX POST -H "Content-Type: application/json" localhost:9200/jaeger-span/_rollover -d '{
"conditions": {
"max_age": "7d",
"max_docs": 1
},
"aliases": {
"jaeger-span-read": {}
}
}'
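The two rollover requests above differ only in the optional "aliases" section. A minimal Python sketch of an external caller, assuming plain HTTP access to ES; the function names are made up for illustration and this is not the esRollover implementation:

```python
import json
import re
import urllib.request


def build_rollover_body(max_age, max_docs, read_alias=None):
    """Build the JSON body for POST <write-alias>/_rollover."""
    body = {"conditions": {"max_age": max_age, "max_docs": max_docs}}
    if read_alias is not None:
        # ES < 6.4 variant: also put the newly created index into the read alias.
        body["aliases"] = {read_alias: {}}
    return body


def next_index_name(index):
    """Mirror the naming seen above: jaeger-span-000001 -> jaeger-span-000002."""
    match = re.fullmatch(r"(.*-)(\d+)", index)
    if match is None:
        raise ValueError(f"index name {index!r} does not end in -<number>")
    prefix, number = match.groups()
    return f"{prefix}{int(number) + 1:06d}"


def rollover(es_url, write_alias, body):
    """Send the rollover request and return the parsed response (needs a running ES)."""
    request = urllib.request.Request(
        f"{es_url}/{write_alias}/_rollover",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A cron job could call rollover("http://localhost:9200", "jaeger-span", build_rollover_body("7d", 1)) and inspect the "rolled_over" field of the response to see whether the conditions were met during that call.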
https://www.elastic.co/guide/en/elasticsearch/reference/6.5/indices-rollover-index.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-rollover-index.html
https://www.elastic.co/blog/managing-time-based-indices-efficiently
Introduce a flag which will use a single index (alias) to read and write.
--es.use-single-index Use a single index name without a date (e.g. "jaeger-span") to write and read.
--es.read-alias Use "-read" alias for read indices.
cc @jaegertracing/elasticsearch
I just read https://www.elastic.co/blog/managing-time-based-indices-efficiently - while the primitives make sense, the process itself is absolutely horrifying: 7 or more steps, any of which can fail, with undefined wait periods between them. At least our daily indices require almost no maintenance, just the delete job with a single step.
I would only consider the rollover pattern if it's fully supported by the curator. If it is, I think it's a good direction, but it sounds like we'd still need to provide a tool to generate the curator yaml files with all the actions.
7 or more steps, any of which can fail, with undefined wait periods between them. At least our daily indices require almost no maintenance, just the delete job with a single step.
What steps _exactly_ do you mean? I think the linked example uses an even more complicated deployment model with hot/cold nodes. Curator already supports rollover: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/rollover.html. In addition to that, I am also interested in adding https://github.com/elastic/curator/issues/1278 to its API.
To make rollover work the only required steps are:
Date-based indices will still be supported. Rollover will be an additional feature for users who can benefit from it.
I was referring to the steps in the blog post. Rollover is just one step, all others have to do with managing the index aliases, relocating index to warm nodes, compressing it, etc. Calling the rollover API only triggers index rollover once in a while, it's not sufficient for managing the whole thing via aliases.
I am not opposed to the approach, as long as curator provides the necessary automation for managing the aliases after the rollover.
Adding questions from weekly meeting:
--es.max-span-age - The maximum lookback for spans in Elasticsearch
Response: The reader will access only one alias pointing to potentially multiple indices. An external component (part of this project) will be executed to remove old indices from the read alias.
If it helps: https://sematext.com/blog/field-stats-plugin-elasticsearch/ (Github repo for the plugin linked in there)
Thanks for the pointer @otisg. I think we would like to stay with only the official ES distribution if possible. --es.max-span-age (the maximum lookback for spans in Elasticsearch) can be managed by curator or the ES API by removing old indices from the read alias. We will provide a script/component to do that.
#1197 introduces esRollover to manage rollover indices. My last open question is how to manage --es.max-span-age (the maximum lookback for spans in Elasticsearch). At the moment the reader creates a list of indices based on the supplied es.max-span-age. With rollover we will always read from a single name, the read alias.
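To make the difference concrete, here is a rough Python sketch of the two ways the reader can pick its targets. The daily naming scheme <prefix>-YYYY-MM-DD and the "-read" suffix are assumptions for illustration, and the function names are hypothetical rather than taken from the Jaeger code:

```python
from datetime import datetime, timedelta, timezone


def daily_indices(prefix, now, max_span_age):
    """Names of the daily indices covering [now - max_span_age, now]."""
    day = (now - max_span_age).date()
    last_day = now.date()
    names = []
    while day <= last_day:
        # Assumed naming scheme: <prefix>-YYYY-MM-DD
        names.append(f"{prefix}-{day.isoformat()}")
        day += timedelta(days=1)
    return names


def read_targets(prefix, now, max_span_age, use_aliases):
    """With rollover the reader queries one alias instead of a list of indices."""
    if use_aliases:
        # max_span_age is ignored here: the alias contents define the lookback.
        return [f"{prefix}-read"]
    return daily_indices(prefix, now, max_span_age)
```

With date-based indices the lookback is enforced by the reader itself. With the alias it has to be enforced from outside, by controlling which indices stay in the alias.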
My design is that an external component would remove indices from the alias to mimic the behavior of es.max-span-age.
Any more thoughts on this from @jaegertracing/elasticsearch ?
seems like es.max-span-age should not be used if the alias mode is selected
Yes, but we should provide an alternative solution for that. I have added this functionality to the esRollover script in #1197. It removes indices from the read alias based on a configured age.
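The idea can be sketched in Python as follows. This is only an illustration of the approach, not the code from #1197; the creation_dates input is a hypothetical mapping that a real tool would fetch from ES:

```python
from datetime import datetime, timedelta, timezone


def lookback_actions(read_alias, creation_dates, now, max_age):
    """Build an _aliases request body removing indices older than max_age.

    creation_dates maps index name -> creation time (hypothetical input).
    """
    cutoff = now - max_age
    actions = [
        {"remove": {"index": index, "alias": read_alias}}
        for index, created in sorted(creation_dates.items())
        if created < cutoff
    ]
    return {"actions": actions}
```

The resulting body would be sent with POST /_aliases. Indices removed from the alias still exist, so deleting them remains a separate step.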
the es.max-span-age is a fairly critical feature for us, unfortunately, and it doesn't seem like the archive indices will be enough? I could absolutely be confused here, based on the multiple PRs inflight for this refactor
@pavolloffay Would it be possible to use a NumericRangeQuery? It feels like this might be the most efficient query method from Elasticsearch's perspective:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html#_date_format_in_range_queries
Possible example (the "format" field does not appear to be needed in Jaeger's query):
GET _search
{
"query": {
"range" : {
"startTimeMillis" : {
"format": "epoch_millis",
"gte" : "now-72h/h",
"lt" : "now/d"
}
}
}
}
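The same kind of query can also be built with explicit epoch-millisecond bounds instead of date math. A small Python sketch; only the startTimeMillis field name comes from the example above, the function name is made up:

```python
from datetime import datetime, timedelta, timezone


def start_time_range_query(now, lookback):
    """Range query on startTimeMillis covering [now - lookback, now)."""

    def to_millis(moment):
        return int(moment.timestamp() * 1000)

    return {
        "query": {
            "range": {
                "startTimeMillis": {
                    "gte": to_millis(now - lookback),
                    "lt": to_millis(now),
                    "format": "epoch_millis",
                }
            }
        }
    }
```

Such a body could be sent to <read-alias>/_search, which leaves it to ES to decide which of the indices behind the alias actually need to be searched.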
@masteinhauser I think using NumericRangeQuery compared to first filtering by indices has performance implications as the ES would have to go through all data for all indices kept in the read alias.
is a fairly critical feature for us
Can you please explain why it is critical to you? Do you deploy multiple query services with a different es.max-span-age? I think this is the only tricky part when using rollover: you would then have to use different index prefixes and keep a different set of indices per prefix.
There is only one PR related to rollover: https://github.com/jaegertracing/jaeger/pull/1197, see the first comment and section Managing max-span-age and delete old indices to better understand how it works.
as the ES would have to go through all data for all indices kept in the read alias.
Yep, we already see that exact behavior with our es.max-span-age=720h.
Can you please explain why is it critical to you?
Unfortunately, we have _far_ too many defects filed from production speakers and customers that get worked on outside the default 72h range, but almost all of them get worked within 30 days. These don't always have an exact TraceID to refer to during their investigation. I'm actively trying to determine how to better support this use case, and was hoping this work might be related. My next attempt is to deploy multiple services with different configurations.
I'm not sure how Kibana does this, but I do know it handles far more data over a larger timeframe much better than the Jaeger Query searches seem to. (We use Kibana to figure out all of our Traces, and then use those TraceIDs to pull up Jaeger's view of the spans)
Oh, apologies, I'll take a look at https://github.com/jaegertracing/jaeger/pull/1197 once again to re-familiarize myself. Thanks for the reference!
https://discuss.elastic.co/t/filter-indices-for-range-query-in-time-based-indices/149913 mentions that range query could be used with a large number of indices, that ES does some optimizations to avoid going through all indices. One way or another this will be done separately with some perf tests.
ES 6.5 and 7.0.0 (I was able to test this with 7 only) support rollover policies: https://www.elastic.co/guide/en/elasticsearch/reference/6.x//using-policies-rollover.html. This means that the rollover conditions are set in a policy and ES creates the new index automatically, so there is no need to periodically call the index/_rollover endpoint.
The following example creates a new index every 5s and deletes indices older than 20s. To make this work at a granularity of seconds we have to set the cluster setting indices.lifecycle.poll_interval=1s when starting ES.
docker run --rm -it -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "indices.lifecycle.poll_interval=1s" docker.elastic.co/elasticsearch/elasticsearch:7.0.0-alpha2
curl -ivX PUT -H "Content-Type: application/json" localhost:9200/_ilm/policy/archive_index_policy -d '{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_age": "5s"
}
}
},
"delete": {
"min_age": "20s",
"actions": {
"delete": {}
}
}
}
}
}'
curl -ivX PUT -H "Content-Type: application/json" localhost:9200/_template/archive_index_template -d '{
"index_patterns": ["jaeger-span-archive-*"],
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"index.lifecycle.name": "archive_index_policy",
"index.lifecycle.rollover_alias": "jaeger-span-archive-write"
}
}'
# Note: a single index is used here.
curl -ivX PUT -H "Content-Type: application/json" localhost:9200/jaeger-span-archive-000001 -d '{
"aliases": {
"jaeger-span-archive-write":{"is_write_index": true}
}
}'
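The three requests above (policy, template, first index) can be generated from a few parameters. A minimal Python sketch, assuming the write alias is named <prefix>-write as in the example; the helper names are hypothetical:

```python
def ilm_policy(rollover_max_age, delete_min_age):
    """Policy body: roll over in the hot phase, delete after delete_min_age."""
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_age": rollover_max_age}}},
                "delete": {"min_age": delete_min_age, "actions": {"delete": {}}},
            }
        }
    }


def index_template(index_pattern, policy_name, write_alias):
    """Template body that attaches the policy to every matching index."""
    return {
        "index_patterns": [index_pattern],
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "index.lifecycle.name": policy_name,
            "index.lifecycle.rollover_alias": write_alias,
        },
    }


def ilm_requests(index_prefix, policy_name, template_name, rollover_max_age, delete_min_age):
    """Return (method, path, body) tuples in the order used above."""
    write_alias = f"{index_prefix}-write"  # assumed alias naming
    return [
        ("PUT", f"/_ilm/policy/{policy_name}", ilm_policy(rollover_max_age, delete_min_age)),
        ("PUT", f"/_template/{template_name}",
         index_template(f"{index_prefix}-*", policy_name, write_alias)),
        ("PUT", f"/{index_prefix}-000001",
         {"aliases": {write_alias: {"is_write_index": True}}}),
    ]
```

Calling ilm_requests("jaeger-span-archive", "archive_index_policy", "archive_index_template", "5s", "20s") reproduces the paths and bodies of the curl commands above.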
Heads up: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/index-lifecycle-management-api.html is an enterprise X-Pack feature, so we cannot use it in OSS. The only improvement we can make is time range queries (#1361).
Maybe there is an OSS plugin which provides index lifecycle management; then the deployment would not need to run the esRollover rollover action. However, we still have to provide it.
Isn't ILM a basic feature of Elastic 7+ now?
Yes, it seems it is no longer listed under X-Pack: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/index-lifecycle-management-api.html
@pavolloffay
https://discuss.elastic.co/t/filter-indices-for-range-query-in-time-based-indices/149913 mentions that range query could be used with a large number of indices, that ES does some optimizations to avoid going through all indices. One way or another this will be done separately with some perf tests.
I have been trying to find more info about that comment. Were you able to confirm how these optimizations work?
Actually I wasn't able to find any concrete docs. There is a PR that implements wildcard index for query - depending only on time range. https://github.com/jaegertracing/jaeger/pull/1969
Our (for now) internal results show that it is slower than providing a compiled list of indices to query.
@pavolloffay - I have been trying to use ILM to manage the Jaeger rollovers and deletion, instead of having a cron job hit the rollover API to perform the rollover manually as described in this blog post (https://medium.com/jaegertracing/using-elasticsearch-rollover-to-manage-indices-8b3d0c77915d).
To achieve this, I create override index templates (for span and service) before running init. Then I run esrollover.py init to create the span and service templates, the aliases and the first indices (span-00001 and service-00001).
PUT _template/override-jaeger-span-index-template
{
"order": 1,
"index_patterns": [
"jaeger-span-*"
],
"settings": {
"index": {
"lifecycle": {
"name": "jaeger-ILM-Policy",
"rollover_alias": "jaeger-span-write"
}
}
},
"aliases": {
"jaeger-span-read": {}
}
}
jaeger-ILM-Policy is created beforehand.
PUT _ilm/policy/jaeger-ILM-Policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"delete": {
"min_age": "1d",
"actions": {
"delete": {}
}
}
}
}
}
In the override template I add an alias "jaeger-span-read", which makes sure all the indices created by Jaeger have the read alias, and I use "jaeger-span-write" as index_rollover_alias. The initial rollover (rollover from hot) works fine. The problem comes when ILM performs its checks after the initial rollover (to delete): it fails because the initial (previous) index is no longer part of index_rollover_alias (jaeger-span-write). I wanted to understand the rationale for using two different aliases for reading and writing; we could have used one alias with "is_write_index". I see the same mentioned in one of the above comments for the archive index.
I am having a challenge, when it tries to perform checks after initial rollover (to delete), it fails as the initial index or previous index no longer is part of index_rollover_alias (jaeger-span-write)
What component is causing the issue?
I wanted to understand the rationale of using two different alias for reading and writing, we could have used one alias and used "is_write_index". I see the same mentioned in one of the above comments for archive-index.
IIRC it was done for ES5. ES5 does not support the is_write_index property.
@pavolloffay - Thanks for the quick reply. We don't set is_write_index on the initial index, so after the first rollover span-0001 is removed from the jaeger-span-write alias (which is the ILM rollover alias). When ILM polls the span-0001 index for further lifecycle events it complains:
{\"type\":\"illegal_argument_exception\",\"reason\":\"index.lifecycle.rollover_alias [jaeger-span-write] does not point to index [jaeger-span-000001]\",....}
If we add is_write_index true while creating span-0001 - I suspect this would work. I am going to give it a try and update.
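A simplified Python model of why the flag matters. It only mimics the documented alias behavior of rollover and is not ES code; both function names are made up:

```python
def rollover_alias_state(alias_state, old_index, new_index):
    """Model how a rollover changes which indices the write alias points to.

    alias_state maps index name -> is_write_index (True, False, or None when
    the flag was never set on the alias).
    """
    if alias_state[old_index] is None:
        # Flag not used: the alias is moved to the new index only.
        return {new_index: None}
    # Flag used: the old index stays in the alias as a non-write index.
    updated = dict(alias_state)
    updated[old_index] = False
    updated[new_index] = True
    return updated


def rollover_alias_points_to(alias_state, index):
    """The condition behind the error above: the alias must still contain index."""
    return index in alias_state


without_flag = rollover_alias_state(
    {"jaeger-span-000001": None}, "jaeger-span-000001", "jaeger-span-000002")
with_flag = rollover_alias_state(
    {"jaeger-span-000001": True}, "jaeger-span-000001", "jaeger-span-000002")
```

In this model rollover_alias_points_to(without_flag, "jaeger-span-000001") is False, which corresponds to the illegal_argument_exception, while the same check on with_flag is True.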
thanks @bhiravabhatla. It would be great to put a guide/docs or blog post on this topic if you are interested.
Will do @pavolloffay, thank you! I think we can set is_write_index to true when adding indices to the write alias by passing extra_settings here: https://github.com/jaegertracing/jaeger/blob/af985aefca5de4b0e4708418be6fc9fd64be427e/plugin/storage/es/esRollover.py#L124
Correct me if I am wrong.
Hi @pavolloffay - I was able to implement this with a few tweaks to esRollover.py. I pushed the updated image here: https://github.com/bhiravabhatla/jaeger-index-rollover-with-ilm. I have tested it with an example application and could see that Jaeger is able to read from the read alias and that ILM is able to roll over and delete indices as specified in the config.
Note - I have not tested this for archive indices.
Summary:
To use ILM for managing Jaeger rollover, I followed the steps below:
-- Create an ILM policy for Jaeger in Elasticsearch. In the sample below, for demo purposes, the rollover max_age and the delete min_age are set in minutes.
Sample:
PUT _ilm/policy/jaeger-ILM-Policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "1m"
},
"set_priority": {
"priority": 100
}
}
},
"delete": {
"min_age": "2m",
"actions": {
"delete": {}
}
}
}
}
}
-- Run init to create the initial set of aliases and templates. I create override templates (with a different name and order=1) because Jaeger creates/updates the templates named jaeger-service and jaeger-span when it starts up.
docker run -it --rm --net=host bhiravabhatla/jaeger-es-rollover-init:latest init <ES HOST>
-- Start Jaeger with es.use-aliases=true
Note - By default indices.lifecycle.poll_interval is set to 10m. For testing, we have to set it to something lower, say 10s:
PUT /_cluster/settings?flat_settings=true
{
"transient" : {
"indices.lifecycle.poll_interval" : "10s"
}
}
@pavolloffay - Could you please share feedback on the above? One thing I could have done was to parameterise the Jaeger ILM policy names in the templates.
One thing I could have done was to parameterise jaeger ILM policy names in the templates
In the jaeger index templates? We should make the ILM work with the upstream Jaeger if possible without requiring users to do changes.
I don't have experience with ILM configuration so I cannot really comment if it's good or not. Perhaps somebody from @jaegertracing/elasticsearch can have a look on the approach mentioned above?
@bhiravabhatla would you be interested in documenting this on jaegertracing.io or writing a Medium post?
In the jaeger index templates? We should make the ILM work with the upstream Jaeger if possible without requiring users to do changes.
I agree we should make this work with upstream Jaeger. The above can be seen as a workaround to use ILM with current Jaeger capabilities. In the future, I think we can have a flag --es.use-ILM or something similar and create the index template accordingly from Jaeger itself - open to discussion on this.
@bhiravabhatla would you be interested in documenting this on jaegertracing.io or writing a Medium post?
Sure. I can, let me know the process.
, I think we can have a flag --es.use-ILM or something similar and create the index template accordingly from jaeger itself - open to discussions on this.
Would you also be interested in submitting a PR to do this?
The docs are hosted here https://github.com/jaegertracing/documentation/blob/master/content/docs/next-release/deployment.md#elasticsearch you can create a PR against that. The blog is hosted on medium https://medium.com/jaegertracing. If you prefer the blog I can add you to the medium Jaeger org so that you can submit a publication there - I will need your medium account.
@pavolloffay - I actually have drafted a blog in my medium account. Have not published yet. My medium account https://medium.com/@bhiravabhatla
Would you also be interested in submitting a PR to do this?
Have not used golang before, I am interested - but would need some help. :)
np we can help you with golang :). I have sent you an invite on medium to join jaegertracing.
np we can help you with golang :)
Thank you :).
I have sent you an invite on medium to join jaegertracing.
Thank you @pavolloffay - I have submitted the draft.