Elasticsearch: 5.0 Default pipelines

Created on 24 Oct 2016 · 34 Comments · Source: elastic/elasticsearch

The more I use pipelines, the more useful they would become if I could specify a list of pipelines that automatically get run on a type or index level.

This would also save the overhead of specifying which pipeline to use, given that for a large share of pipeline use cases the choice never changes. It would also make the API easier to use.

One open question: if I specify a list of pipelines on an index and then also specify a pipeline on a request, would the two be merged into one list, or would only the specified pipeline run?

:Core/Features/Ingest >feature help wanted

niemyjski

👍21

Most helpful comment

With _timestamp removed, if a user wants to add their own timestamp field, a pipeline processor is really the only way to do it. Forcing all clients to specify the same pipeline (or include it in theirs) is problematic. I reached for this within the first 30 minutes of using pipelines and feel it would be very helpful.

djschny on 12 Nov 2016

👍15
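
For illustration, a minimal sketch of the kind of pipeline this comment describes: a single set processor that writes the ingest timestamp into the document. The pipeline id `add-timestamp`, the field name `indexed_at`, and the index and type names are assumptions made for the example, not names from the thread. The last lines show the part the comment calls problematic: every client has to pass the pipeline explicitly.

```python
import json

# Hypothetical names: the pipeline id "add-timestamp" and the target field
# "indexed_at" are illustrative and do not come from the thread.
PIPELINE_ID = "add-timestamp"

pipeline_body = {
    "description": "Stamp each document with the time it was ingested",
    "processors": [
        # The set processor copies the ingest metadata timestamp into the document.
        {"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}}
    ],
}

# Body for: PUT _ingest/pipeline/add-timestamp
print(json.dumps(pipeline_body, indent=2))

# Without a default pipeline, each indexing request must opt in explicitly.
# "my-index" and "my-type" are placeholder names.
index_path = f"my-index/my-type/1?pipeline={PIPELINE_ID}"
print(index_path)
```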

All 34 comments

The more I use pipelines, the more useful they would become if I could specify a list of pipelines that automatically get run on a type or index level.

I've been against default pipelines because pipelines should only be run on the first ingestion. When you update or overwrite a document, you may not want the default to run. For this reason I prefer pipelines to be manually specified.

clintongormley on 5 Nov 2016

djschny on 12 Nov 2016

👍15

When you update or overwrite a document, you may not want the default to run.

I agree for a non-logging use case. The ability to enable default pipelines for a logging use case would be very helpful where document updates are non-existent. Further, I would only want to enable default pipelines for certain indices or have the capability to do so. Could it be an index setting?

Example uses for logging might be password stripping with the set processor or field truncation using a script processor.

inqueue on 9 Feb 2017

👍2
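
A rough sketch of the two logging examples mentioned here, written as a pipeline body. The pipeline id `scrub-logs`, the field names `password` and `message`, and the 1024-character limit are all assumptions for illustration. The Painless snippet has not been run against a cluster, so treat it as a starting point rather than a tested script.

```python
import json

# Hypothetical names: pipeline id "scrub-logs", fields "password" and "message",
# and the 1024-character limit are illustrative, not from the thread.
scrub_logs_pipeline = {
    "description": "Mask passwords and cap message length for log documents",
    "processors": [
        # Overwrites the field with a fixed marker. Note that the set processor
        # also creates the field when it is absent.
        {"set": {"field": "password", "value": "[REDACTED]"}},
        {
            "script": {
                "lang": "painless",
                # Newer releases call this key "source"; 5.x used "inline".
                "source": (
                    "if (ctx.message != null && ctx.message.length() > params.max_len) {"
                    " ctx.message = ctx.message.substring(0, params.max_len); }"
                ),
                "params": {"max_len": 1024},
            }
        },
    ],
}

# Body for: PUT _ingest/pipeline/scrub-logs
print(json.dumps(scrub_logs_pipeline, indent=2))
```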

+1 to allowing default pipelines for indices.
Ideally you would also be able to specify on what type of operation the pipeline would be run: insert-time, update or both.

tipuban on 7 Mar 2017

👍1

Ideally you would also be able to specify on what type of operation the pipeline would be run: insert-time, update or both.

+1 It would be very useful.

cristimagda on 7 Mar 2017

+1 for this. It helps users who were affected by the removal of the _timestamp field.

marius-dr on 19 Apr 2017

👍3

Another argument in favor of being able to specify a default pipeline:
There are scenarios where the PUT command (and some other document preprocessing) is outside the control of the ES operator.

AWS allows feeding an Elasticsearch instance from an Amazon Kinesis Firehose stream. However, the document _id is set by the Firehose stream. Firehose also controls the command that is used to send the data to the Elasticsearch instance, i.e. it is not possible to add the ?pipeline query parameter.
With a default ingest pipeline (based on index/type, ideally specified altogether in the index template) one could set the _id through a preprocessor based on the document _source.

zoellner on 17 May 2017

👍6
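
As a sketch of the idea in this comment, the pipeline body below sets `_id` from a field of the document source. The field name `event_id` and the pipeline id `id-from-source` are assumptions for the example. It relies on `_id` being one of the metadata fields that ingest processors are allowed to write to.

```python
import json

# Hypothetical: "event_id" stands in for whatever _source field holds the
# natural key; the pipeline id "id-from-source" is also made up.
id_from_source_pipeline = {
    "description": "Derive the document _id from a field in _source",
    "processors": [
        # The value is a mustache template resolved against each document.
        {"set": {"field": "_id", "value": "{{event_id}}"}}
    ],
}

# Body for: PUT _ingest/pipeline/id-from-source
print(json.dumps(id_from_source_pipeline, indent=2))
```

With a default pipeline attached to the index, a sender that cannot add `?pipeline=` would still have its documents keyed this way.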

+1 especially as an option block in index templates, I guess this would be a perfect spot for it.

titzi on 31 May 2017

👍1

+1 having this in the index template will be very useful

AWS allows feeding an Elasticsearch instance from an Amazon Kinesis Firehose stream. However, the document _id is set by the Firehose stream. Firehose also controls the command that is used to send the data to the Elasticsearch instance, i.e. it is not possible to add the ?pipeline query parameter.
With a default ingest pipeline (based on index/type, ideally specified altogether in the index template) one could set the _id through a preprocessor based on the document _source.

@zoellner I stumbled upon this issue while looking for the exact same option. Did you happen to figure out a way around this?

whiteboardmonk on 8 Jun 2017

👍1

@whiteboardmonk no, I've since stopped using Firehose Streams because of this issue.

zoellner on 8 Jun 2017

+1, _timestamp-replacement as the use-case

wpongra on 7 Aug 2017

Are there any updates on this?

niemyjski on 10 Aug 2017

+1, for the default pipeline (regarding the use-case for _timestamp)

mr-mos on 22 Aug 2017

+1
I do understand @clintongormley's argument against it, and it should be documented that a default pipeline would be executed on first ingestion as well as on updates. But having the choice between specifying it at the index level or per request gives the flexibility to use whatever is more appropriate for the current job.

soultemptation on 29 Aug 2017

Being able to specify a default pipeline perhaps in an index template would be extremely useful for our case where we don't have control over the bulk put. We are using fluentd and its elasticsearch plugin and I don't believe there is a way for us to specify a pipeline using its output language.

chs-bnet on 6 Sep 2017
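
To make the request concrete, here is a sketch of what an index template carrying a default pipeline could look like. The thread does not name the setting, so `index.default_pipeline` is an assumed name used purely for illustration, as are the template name `logs`, the pattern `logs-*`, and the pipeline id `add-timestamp`.

```python
import json

# Assumption: "index.default_pipeline" is an illustrative setting name; the
# thread does not settle on one. Pattern and pipeline id are also made up.
template_body = {
    "index_patterns": ["logs-*"],
    "settings": {
        "index.default_pipeline": "add-timestamp",
    },
}

# Body for: PUT _template/logs
print(json.dumps(template_body, indent=2))
```

A shipper such as the fluentd plugin would then keep sending plain bulk requests, and the index created from the template would decide which pipeline runs.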

+1 for default pipeline.
The setting should probably live on the index side, not on the pipeline.
Also add an option to skip the pipeline, like ?skip_pipeline=true, for some interfaces, e.g. reindex or other special cases. That may avoid the case @clintongormley mentioned at the beginning.

zfanswer on 8 Sep 2017

+1, to add another real-world use case: we have a tracing implementation that persists to Elasticsearch. It keeps track of timestamps in microseconds instead of milliseconds. Adding an additional field that does the conversion while indexing would be extremely helpful. There is no way to control the trace collector's PUT requests to ES and therefore no way to configure a pipeline via query parameters :/

de-robat on 20 Sep 2017
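
A small sketch of the conversion described here. The field names `timestamp_micros` and `timestamp_millis` and the pipeline id `micros-to-millis` are assumptions, and the Painless one-liner is untested against a cluster. The Python helper mirrors the same integer division so the arithmetic can be checked locally.

```python
import json

# Hypothetical field names; the thread does not say what the tracer calls them.
micros_to_millis_pipeline = {
    "description": "Add a millisecond timestamp next to the microsecond one",
    "processors": [
        {
            "script": {
                "lang": "painless",
                # Newer releases call this key "source"; 5.x used "inline".
                "source": "ctx.timestamp_millis = ctx.timestamp_micros / 1000;",
            }
        }
    ],
}


def micros_to_millis(micros: int) -> int:
    """Local stand-in for the script: truncate microseconds to milliseconds."""
    return micros // 1000


# Body for: PUT _ingest/pipeline/micros-to-millis
print(json.dumps(micros_to_millis_pipeline, indent=2))
print(micros_to_millis(1505900000123456))
```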

@clintongormley Good point. Perhaps a good idea would be having an index-wide setting for a default pipeline, with some parameters controlling for which operations the default applies? (By operations I mean index or update or whatever.)

dandrestor on 5 Oct 2017

I think I could get behind the following:

- an index setting which specifies the default pipeline to use for index or create operations only
- update operations would not use the default pipeline
- specifying ?pipeline=foo in an index request would result in the foo pipeline being applied instead of the default pipeline
- specifying ?pipeline= in an index request would result in no pipeline being applied

clintongormley on 9 Oct 2017

👍8
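
The four rules proposed above can be sketched as a small resolution function. This is plain Python modelling the proposal, not Elasticsearch code; the function name and the operation strings are assumptions. `request_pipeline` is `None` when the `?pipeline` parameter is absent and `""` when it is given as `?pipeline=`.

```python
from typing import Optional


def resolve_pipeline(
    operation: str,
    default_pipeline: Optional[str],
    request_pipeline: Optional[str],
) -> Optional[str]:
    """Return the pipeline id to run, or None when no pipeline applies."""
    if operation not in ("index", "create"):
        # Rule 2: updates never pick up the default. The proposal does not
        # discuss explicit pipelines on updates, so this sketch ignores them.
        return None
    if request_pipeline is None:
        # Rule 1: no parameter given, fall back to the index-level default.
        return default_pipeline
    if request_pipeline == "":
        # Rule 4: "?pipeline=" explicitly disables the default.
        return None
    # Rule 3: an explicit pipeline replaces the default.
    return request_pipeline


print(resolve_pipeline("index", "add-timestamp", None))
print(resolve_pipeline("index", "add-timestamp", "foo"))
print(resolve_pipeline("index", "add-timestamp", ""))
print(resolve_pipeline("update", "add-timestamp", None))
```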

specifying ?pipeline= in an index request would result in no pipeline being applied

It seems like this could easily be a malformed request. For this corner case (that someone wants to get around the default pipeline), one could create a dummy pipeline that does nothing and specify that explicitly here? Then specifying pipeline with an empty string can return an error?

rjernst on 9 Oct 2017

Or, have a specially named value called _none?

rjernst on 9 Oct 2017
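
The two suggestions from rjernst can be combined into one variant of the parameter handling: an empty `?pipeline=` is rejected as malformed, and a reserved value opts out of the default. Again this is a plain Python model of the idea, with a made-up function name, not actual Elasticsearch behaviour.

```python
from typing import Optional

# Reserved value suggested in the thread for "run no pipeline".
NO_PIPELINE = "_none"


def resolve_with_reserved_none(
    default_pipeline: Optional[str],
    request_pipeline: Optional[str],
) -> Optional[str]:
    """Resolve the pipeline for an index request under the _none variant."""
    if request_pipeline is None:
        # Parameter absent: use the index-level default, if any.
        return default_pipeline
    if request_pipeline == "":
        # An empty value is treated as a malformed request.
        raise ValueError("pipeline parameter must not be empty")
    if request_pipeline == NO_PIPELINE:
        # Explicit opt-out of the default pipeline.
        return None
    return request_pipeline


print(resolve_with_reserved_none("add-timestamp", "_none"))
print(resolve_with_reserved_none("add-timestamp", "foo"))
```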

We sure could use this functionality as well.

Has there been an update yet from ES on whether this is on the roadmap? I'm not finding one.

stevenwall on 12 Dec 2017

Is this supported in ES 6.0?

zfanswer on 14 Dec 2017

I like the ingest pipeline, as it decouples me from any pre-processing of my logs at the source.
But if I can't enable it by default, I am still thrown back to manipulating my sources (which I don't necessarily have control over) to use a specific ingest pipeline.

kafis on 1 Mar 2018

prasadkhandagale on 7 Mar 2018

One more use case with default pipelines could be e.g. custom validation/postprocessing of Kibana objects in case of introduction of a pipeline on .kibana index.

sergii-sakharov on 8 Mar 2018

@clintongormley Why do you want to restrict something optional?
I mean, letting the default pipeline be used in updates could be very useful for calculated fields (in our case, suggest fields for completion), while those who want to use a pipeline only at index time could still do it manually with index parameters.
Or we could specify a default one for index and a default one for update?
Again, as using a default pipeline would not be mandatory, I believe there's no point in making it restricted.
My 2 cents :)

SebC99 on 15 Mar 2018

Pinging @elastic/es-core-infra

elasticmachine on 15 Mar 2018

kunna on 18 Mar 2018

vanntomm on 4 May 2018

trippd6 on 10 May 2018

lukeplausin on 22 May 2018

romanpierson on 15 Jun 2018

Requested by a student in Engineer II training. Use case: data validation.

Q. This ^^discussion considers adding a pipeline to index settings. As an alternative, could a default pipeline be specified in an alias, which could be exposed for first ingest while allowing subsequent update or overwrite directly via the index or via an alternative alias using no (or a different) pipeline?