When evaluating ingest pipelines, we determine the full set of pipelines upfront (taking into account default_pipeline and final_pipeline) based on the index name that is included in the incoming request.
While executing those pipelines it is possible that a processor will change the destination index. A trivial example:
PUT /_ingest/pipeline/redirect
{
  "description": "Redirect to another index",
  "processors": [{ "set": { "field": "_index", "value": "new-destination" } }]
}
If that happens it is possible that the new destination index has its own set of default/final pipelines, but we do not execute them.
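To make the behaviour described above concrete, here is a minimal in-memory sketch. It is not Elasticsearch code: `PIPELINES`, `INDEX_SETTINGS`, `resolve_pipelines`, and `ingest_current` are hypothetical stand-ins, as are the index and pipeline names other than `redirect` and `new-destination`. It only assumes the documented rule that a request pipeline overrides the default pipeline and that the final pipeline runs last.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, str]
Processor = Callable[[Doc], None]

# Hypothetical stand-ins for cluster state; all names are illustrative only.
PIPELINES: Dict[str, Processor] = {
    "redirect": lambda doc: doc.update({"_index": "new-destination"}),
    "dest-default": lambda doc: doc.update({"dest_default_ran": "yes"}),
    "dest-final": lambda doc: doc.update({"dest_final_ran": "yes"}),
}
INDEX_SETTINGS: Dict[str, Dict[str, str]] = {
    "source": {"default_pipeline": "redirect"},
    "new-destination": {
        "default_pipeline": "dest-default",
        "final_pipeline": "dest-final",
    },
}


def resolve_pipelines(index: str, request_pipeline: Optional[str] = None) -> List[str]:
    """Resolve the pipeline list once, from the index named in the request."""
    settings = INDEX_SETTINGS.get(index, {})
    first = request_pipeline or settings.get("default_pipeline")
    final = settings.get("final_pipeline")
    return [name for name in (first, final) if name]


def ingest_current(doc: Doc, request_pipeline: Optional[str] = None) -> Doc:
    """Run only the pipelines resolved upfront, ignoring later _index changes."""
    for name in resolve_pipelines(doc["_index"], request_pipeline):
        PIPELINES[name](doc)
    return doc


result = ingest_current({"_index": "source", "message": "hello"})
print(result)
```

The document ends up in `new-destination`, but neither `dest-default` nor `dest-final` has touched it, which is the gap this issue is about.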
It would be preferable if the pipelines attached to the new destination were executed, although there are some questions/issues to work through.
Pinging @elastic/es-core-features (:Core/Features/Ingest)
CC: @ruflin
@exekias @andresrc This is relevant for our discussion to potentially use a "routing" ingest pipeline.
I think we should aim to keep the logic of executing pipelines after the _index has been changed simple.
Prior to performing ingest, a list of pipelines is determined. These pipelines are executed. If at the end the _index has been changed, then only the final pipeline of the new index (if it exists) should be executed.
if the default_pipeline redirects to a new index, should we still execute the final_pipeline of the original index?
Given the above logic, yes.
should we execute the default_pipeline of the new destination, or just the final_pipeline?
I think only the final pipeline should be executed.
how do we prevent the document from being infinitely redirected between indices (could we set a rule that the pipeline of the destination index may not change the destination a second time?)
I think it is fine if request or default pipelines change the index. However, I think that final pipelines shouldn't change the index.
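The rules from this comment can be sketched as follows. This is only an illustration under my reading of the answers above, not the actual implementation: `ingest_proposed`, `run_final`, and the index and pipeline names are hypothetical, and raising an error when a final pipeline changes `_index` is one possible way to enforce the restriction.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, str]
Processor = Callable[[Doc], None]


def run_final(name: str, doc: Doc, pipelines: Dict[str, Processor]) -> None:
    """Run a final pipeline and reject any attempt to change _index."""
    index_before = doc["_index"]
    pipelines[name](doc)
    if doc["_index"] != index_before:
        raise ValueError(f"final pipeline '{name}' must not change _index")


def ingest_proposed(
    doc: Doc,
    index_settings: Dict[str, Dict[str, str]],
    pipelines: Dict[str, Processor],
    request_pipeline: Optional[str] = None,
) -> List[str]:
    """Return the names of the pipelines executed, in order."""
    original_index = doc["_index"]
    settings = index_settings.get(original_index, {})
    executed: List[str] = []

    # Step 1: the list is determined before ingest and is always executed,
    # including the final pipeline of the original index.
    first = request_pipeline or settings.get("default_pipeline")
    if first:
        pipelines[first](doc)
        executed.append(first)
    original_final = settings.get("final_pipeline")
    if original_final:
        run_final(original_final, doc, pipelines)
        executed.append(original_final)

    # Step 2: if _index changed, run only the new index's final pipeline.
    # The new index's default pipeline is deliberately skipped.
    if doc["_index"] != original_index:
        new_final = index_settings.get(doc["_index"], {}).get("final_pipeline")
        if new_final:
            run_final(new_final, doc, pipelines)
            executed.append(new_final)
    return executed


pipelines: Dict[str, Processor] = {
    "redirect": lambda doc: doc.update({"_index": "new-destination"}),
    "source-final": lambda doc: None,
    "dest-default": lambda doc: None,
    "dest-final": lambda doc: None,
}
index_settings = {
    "source": {"default_pipeline": "redirect", "final_pipeline": "source-final"},
    "new-destination": {
        "default_pipeline": "dest-default",
        "final_pipeline": "dest-final",
    },
}
doc = {"_index": "source"}
executed = ingest_proposed(doc, index_settings, pipelines)
print(executed)  # ['redirect', 'source-final', 'dest-final']
```

Because a final pipeline may not change `_index`, at most one redirect is followed, so a document cannot bounce between indices indefinitely.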
We discussed this issue with a broader audience and think that the re-execution logic of pipeline should be limited:
[…] _index key during ingest is a bug, which needs to be addressed.
[…] _index then this changes the default pipeline and final pipeline to be executed, based on whether the new index has a default or final pipeline. If the original index was configured with a default or final pipeline, then these pipelines aren't executed.
[…] _index then it changes the final pipeline. If there is a final pipeline for the original index then that pipeline isn't executed, and if there is a final pipeline for the new index then that pipeline is executed.
The above restrictions would avoid infinite redirections between pipelines.
@ruflin @tvernum What do you think about this?
Just to check: In the current proposal, if a default pipeline redirects to another index then the default pipeline on that new destination index would not be executed. Is that correct?
I do wonder whether that works for Ingest Manager - My understanding is that they use both default and final pipelines, and would want to rely on both being executed after routing to a new index.
Perhaps they could use the pipeline processor to work around that, but that assumes that the routing process knows which pipelines it needs to execute.
Maybe we could introduce an option on the pipeline processor to look up the default pipeline for an index and execute that... as long as there's some way for Ingest Manager/Fleet to say "please execute the default pipeline on the new destination index", even if it needs to be explicit.
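As a rough illustration of the pipeline processor workaround, the sketch below builds a routing pipeline body that sets `_index` and then explicitly invokes a named pipeline via the existing `set` and `pipeline` processors. The helper `build_routing_pipeline` and the name `dest-default` are hypothetical, and the caller still has to know which pipeline the destination expects, which is exactly the limitation mentioned above. The lookup option suggested above does not exist and is not modelled here.

```python
import json
from typing import Any, Dict


def build_routing_pipeline(destination_index: str, destination_pipeline: str) -> Dict[str, Any]:
    """Build a pipeline body that redirects and then runs a named pipeline."""
    return {
        "description": f"Route to {destination_index} and run its pipeline explicitly",
        "processors": [
            # Redirect the document to the destination index.
            {"set": {"field": "_index", "value": destination_index}},
            # Explicitly run the pipeline the destination would have applied.
            {"pipeline": {"name": destination_pipeline}},
        ],
    }


body = build_routing_pipeline("new-destination", "dest-default")
print(json.dumps(body, indent=2))
```

The printed JSON could be used as the body of a `PUT /_ingest/pipeline/<name>` request.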
In the current proposal, if a default pipeline redirects to another index then the default pipeline on that new destination index would not be executed. Is that correct?
Yes, that is the idea.
I do wonder whether that works for Ingest Manager
@ruflin Can you comment on whether the proposal works for ingest manager?
@martijnvg Looking at your proposal, I think this makes sense. @andresrc and @exekias does this still work with the routing idea? I think so.
For Ingest Manager, since we control the pipelines, we could reasonably make the necessary adjustments to the final destination "final" pipelines to make this work. Sketching the solution out:
[…] _index accordingly
One downside that comes to mind is that if a user wants to bypass the "final" pipeline for some reason, they can't. As an example, as far as I understand, reindexing will apply the "final" pipeline again, which could be problematic.
I don't want to derail the conversation too much, but I think it's worth considering if we need to take a step back and think about building the right abstractions into Elasticsearch for a routing/parsing pipeline like this instead of making do with what we have - similar to what we did with data streams - which I think will end up being incredibly valuable for our users over the long term because we now have an abstraction that fits much better with what our users are trying to achieve and it paves the way for us to deliver new features on this abstraction that would have been otherwise very difficult.
One downside that comes to mind is that if a user wants to bypass the "final" pipeline for some reason, they can't. As an example, as far as I understand, reindexing will apply the "final" pipeline again, which could be problematic.
The idea of final pipelines is that it can't be bypassed, so maybe in that case a default pipeline should be used?
I don't want to derail the conversation too much, but I think it's worth considering if we need to take a step back and think about building the right abstractions into Elasticsearch for a routing/parsing pipeline like this instead of making do with what we have - similar to what we did with data streams - which I think will end up being incredibly valuable for our users over the long term because we now have an abstraction that fits much better with what our users are trying to achieve and it paves the way for us to deliver new features on this abstraction that would have been otherwise very difficult.
👍 I think it is a good idea to take a good look at the current implementation and consider alternatives for it.
@tvernum I filed an issue about being able to specify multiple ingest pipelines in a data stream, and perhaps this could also help with the problem here: https://github.com/elastic/elasticsearch/issues/61185