Elasticsearch: Applying destination index pipelines after a processor changes the target index

Created on 11 Jun 2020 · 11Comments · Source: elastic/elasticsearch

When evaluating ingest pipelines, we determine the full set of pipelines upfront (taking into account default_pipeline and final_pipeline) based on the index name that is included in the incoming request.

While executing those pipelines it is possible that a processor will change the destination index , for example (trivial example)

PUT /_ingest/pipeline/redirect
{
  "description": "Redirect to another index",
  "processors": [{ "set": { "field": "_index", "value": "new-destination" } }]
}

If that happens it is possible that the new destination index has its own set of default/final pipelines, but we do not execute them.

It would be preferable if the pipelines attached to the new destination were executed, although there are some questions/issues to work through

if the default_pipeline redirects to a new index, should we still execute the final_pipeline of the original index?
should we execute the default_pipeline of the new destination, or just the final_pipeline?
how do we prevent the document from being infinitely redirected between indices (could we set a rule that the pipeline of the destination index may not change the destination a second time?)

:CorFeatureIngest >bug CorFeatures

Source

tvernum

❤1

Most helpful comment

for ingest manager, since we control the pipelines, we could reasonably make the necessary adjustments to the final destination "final" pipelines to make this work. Sketching the solution out:

we have a "generic" data stream with a default/required pipeline that knows about the packages installed.
we send all data there, and the pipeline will redirect to other data streams by checking specific fields in the document that indicate which input/package they came from and changing _index accordingly
we have added the necessary parsing logic to the final pipelines of the data streams that they get directed to

One downside that comes to mind is that if a user wants to bypass the "final" pipeline for some reason, they can't. As an example, as far as I understand, reindexing will apply the "final" pipeline again, which could be problematic.

I don't want to derail the conversation too much, but I think it's worth considering if we need to take a step back and think about building the right abstractions into Elasticsearch for a routing/parsing pipeline like this instead of making do with what we have - similar to what we did with data streams - which I think will end up being incredibly valuable for our users over the long term because we now have an abstraction that fits much better with what our users are trying to achieve and it paves the way for us to deliver new features on this abstraction that would have been otherwise very difficult.

roncohen on 14 Jul 2020

👍2

All 11 comments

Pinging @elastic/es-core-features (:Core/Features/Ingest)

elasticmachine on 11 Jun 2020

CC: @ruflin

tvernum on 11 Jun 2020

@exekias @andresrc This is relevant for our discussion to potentially use a "routing" ingest pipeline.

ruflin on 11 Jun 2020

👍1

I think we should aim to keep logic of executing pipelines after the _index has been changed simple.

Prior to performing ingest, a list of pipelines is determined. These pipelines are executed. If
at the end the _index has been changed then only the final pipeline of this index (if exists) should
be executed.

if the default_pipeline redirects to a new index, should we still execute the final_pipeline of the original index?

Given with the above logic, yes.

should we execute the default_pipeline of the new destination, or just the final_pipeline?

I think only the final pipeline should be executed.

how do we prevent the document from being infinitely redirected between indices (could we set a rule that the pipeline of the destination index may not change the destination a second time?)

I think it is fine if request or default pipelines change the index. However I think that final pipelines shouldn't be changing the index.

martijnvg on 3 Jul 2020

We discussed this issue with a broader audience and think that the re-execution logic of pipeline should be limited:

When a final pipeline is executed, it always should be the last pipeline being executed. Also the fact that a final pipeline can change the _index key during ingest is a bug, which needs to be addressed.
If the request pipeline changes the _index then this changes the default pipeline and final pipeline to be executed based on whether the new index has a default or final pipeline. If the original index was configured with a default or final pipeline, then these pipelines aren't executed.
If the default pipeline changes the _index then it changes the final pipeline. If there is a final pipeline for the original index then that pipeline isn't executed and if there is a final for the new index then that pipeline is executed.

The above restrictions would avoid infinitely redirections between pipelines.

@ruflin @tvernum What do you think about this?

martijnvg on 7 Jul 2020

Just to check: In the current proposal, if a default pipeline redirects to another index then the default pipeline on that new destination index would not be executed. Is that correct?

I do wonder whether that works for Ingest Manager - My understanding is that they use both default and final pipelines, and would want to rely on both being executed after routing to a new index.
Perhaps they could use the pipeline processor to work around that, but that assumes that the routing process knows which pipelines it needs to execute.
May be could introduce an option on the pipeline processor to lookup the default pipeline for an index and execute that... as long as there's some way for Ingest Manager/Fleet to say "please execute the default pipeline on the new destination index", even if it needs to be explicit.

tvernum on 8 Jul 2020

In the current proposal, if a default pipeline redirects to another index then the default pipeline on that new destination index would not be executed. Is that correct?

Yes, that is the idea.

I do wonder whether that works for Ingest Manager

@ruflin Can you comment on whether the proposal works for ingest manager?

martijnvg on 8 Jul 2020

@martijnvg Looking at your proposal I think this makes sense. @andresrc and @exekias do this still work in the routing idea? I think so.

ph on 9 Jul 2020

for ingest manager, since we control the pipelines, we could reasonably make the necessary adjustments to the final destination "final" pipelines to make this work. Sketching the solution out:

we have a "generic" data stream with a default/required pipeline that knows about the packages installed.
we send all data there, and the pipeline will redirect to other data streams by checking specific fields in the document that indicate which input/package they came from and changing _index accordingly
we have added the necessary parsing logic to the final pipelines of the data streams that they get directed to

roncohen on 14 Jul 2020

👍2

One downside that comes to mind is that if a user wants to bypass the "final" pipeline for some reason, they can't. As an example, as far as I understand, reindexing will apply the "final" pipeline again, which could be problematic.

The idea of final pipelines is that it can't be bypassed, so maybe in that case a default pipeline should be used?

I don't want to derail the conversation too much, but I think it's worth considering if we need to take a step back and think about building the right abstractions into Elasticsearch for a routing/parsing pipeline like this instead of making do with what we have - similar to what we did with data streams - which I think will end up being incredibly valuable for our users over the long term because we now have an abstraction that fits much better with what our users are trying to achieve and it paves the way for us to deliver new features on this abstraction that would have been otherwise very difficult.

👍 I think it is a good idea to take a good look at the current implementation and consider alternatives for it.

martijnvg on 16 Jul 2020

👍1

@tvernum I file an issue around being able to specify multiple ingest pipelines in a data stream and perhaps this could also help with the problem here: https://github.com/elastic/elasticsearch/issues/61185

ruflin on 17 Aug 2020

Was this page helpful?

0 / 5 - 0 ratings