The more I use pipelines, the more useful they would become if I could specify a list of pipelines that automatically get run at a type or index level.
This would also save the overhead of specifying which pipeline to use, since in a huge percentage of use cases the pipeline never changes. It would also make the API easier to use.
One open question: if I specify a list of pipelines on an index and also specify a pipeline in the request, would the result be a merged list, or would only the specified pipeline run?
The more I use pipelines, the more useful they would become if I could specify a list of pipelines that automatically get run at a type or index level.
I've been against default pipelines because pipelines should only be run on the first ingestion. When you update or overwrite a document, you may not want the default to run. For this reason I prefer pipelines to be manually specified.
With _timestamp removed, if a user wants to add their own timestamp field, a pipeline processor is really the only way to do it. Having to force all clients to specify the same pipeline (or include it in theirs) is problematic. I reached for this within the first 30 minutes of using pipelines and feel it would be very helpful.
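For illustration, a pipeline that stamps each document at ingest time might look like the sketch below. The pipeline name `add_timestamp` and the target field `indexed_at` are assumptions made for this example, not names from the thread; the body would be sent with `PUT _ingest/pipeline/add_timestamp`.

```python
import json

# Hypothetical pipeline name and target field, chosen only for this example.
PIPELINE_NAME = "add_timestamp"
TIMESTAMP_FIELD = "indexed_at"

# Pipeline body: the set processor copies the ingest metadata timestamp
# into a regular document field.
timestamp_pipeline = {
    "description": "Replace the removed _timestamp field with an explicit one",
    "processors": [
        {"set": {"field": TIMESTAMP_FIELD, "value": "{{_ingest.timestamp}}"}}
    ],
}


def pipeline_request(name, body):
    """Return the (method, path, JSON body) triple for registering a pipeline."""
    return "PUT", f"_ingest/pipeline/{name}", json.dumps(body, indent=2)


if __name__ == "__main__":
    method, path, payload = pipeline_request(PIPELINE_NAME, timestamp_pipeline)
    print(method, path)
    print(payload)
```

Without a default pipeline, every client would still have to add `?pipeline=add_timestamp` to each index request, which is the problem described above.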
When you update or overwrite a document, you may not want the default to run.
I agree for a non-logging use case. The ability to enable default pipelines for a logging use case would be very helpful where document updates are non-existent. Further, I would only want to enable default pipelines for certain indices or have the capability to do so. Could it be an index setting?
Example uses for logging might be password stripping with the set processor or field truncation using a script processor.
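A rough sketch of such a logging pipeline follows. The field names `password` and `message`, the 256-character limit, and the redaction marker are all assumptions for illustration. The `simulate` function only mimics the two processors locally so the behaviour can be checked without a cluster.

```python
import copy

MAX_MESSAGE_LENGTH = 256      # assumed limit, not from the thread
REDACTED = "[REDACTED]"       # assumed replacement value

# Pipeline body: a set processor overwrites the password field, and a
# Painless script processor truncates overly long messages.
logging_pipeline = {
    "description": "Mask passwords and truncate long messages",
    "processors": [
        # Note: set writes the field even if the document did not have one.
        {"set": {"field": "password", "value": REDACTED}},
        {
            "script": {
                "lang": "painless",
                "source": (
                    "if (ctx.message != null && ctx.message.length() > params.max) "
                    "{ ctx.message = ctx.message.substring(0, params.max); }"
                ),
                "params": {"max": MAX_MESSAGE_LENGTH},
            }
        },
    ],
}


def simulate(doc):
    """Local stand-in for the two processors above (not an Elasticsearch call)."""
    out = copy.deepcopy(doc)
    out["password"] = REDACTED
    message = out.get("message")
    if message is not None and len(message) > MAX_MESSAGE_LENGTH:
        out["message"] = message[:MAX_MESSAGE_LENGTH]
    return out


if __name__ == "__main__":
    print(simulate({"password": "hunter2", "message": "x" * 300}))
```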
+1 to allowing default pipelines for indices.
Ideally you would also be able to specify on what type of operation the pipeline would be run: insert-time, update or both.
Ideally you would also be able to specify on what type of operation the pipeline would be run: insert-time, update or both.
+1 It would be very useful.
+1 for this. It helps users who were affected by the removal of the _timestamp field.
Another argument in favor of being able to specify a default pipeline:
There are scenarios where the PUT command (and some other document preprocessing) is outside the control of the ES operator.
AWS allows feeding an Elasticsearch instance from an Amazon Kinesis Firehose stream. However, the document _id is set by the Firehose stream. Firehose also controls the command that is used to send the data to the Elasticsearch instance, i.e. it is not possible to add the ?pipeline query parameter.
With a default ingest pipeline (based on index/type, ideally specified together with everything else in the index template), one could set the _id through a preprocessor based on the document _source.
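As a sketch of that idea: the pipeline below sets _id from a source field, and the template fragment attaches it to matching indices. The field name `event_id`, the pattern `firehose-*`, the `index_patterns` key, and above all the setting name `index.default_pipeline` are assumptions for illustration; the thread is asking for such a setting, not describing an existing one.

```python
import json

PIPELINE_NAME = "id_from_source"   # hypothetical pipeline name
SOURCE_ID_FIELD = "event_id"       # hypothetical field in the document _source

# Pipeline body: the set processor overwrites the _id metadata field with a
# value rendered from the document source.
id_pipeline = {
    "description": "Derive _id from a field in _source",
    "processors": [
        {"set": {"field": "_id", "value": "{{" + SOURCE_ID_FIELD + "}}"}}
    ],
}

# Index template fragment. The default-pipeline setting name is an ASSUMPTION
# standing in for the feature requested in this thread.
index_template = {
    "index_patterns": ["firehose-*"],
    "settings": {"index.default_pipeline": PIPELINE_NAME},
}

if __name__ == "__main__":
    print(json.dumps(id_pipeline, indent=2))
    print(json.dumps(index_template, indent=2))
```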
+1 especially as an option block in index templates, I guess this would be a perfect spot for it.
+1 having this in the index template will be very useful
AWS allows feeding an Elasticsearch instance from an Amazon Kinesis Firehose stream. However, the document _id is set by the Firehose stream. Firehose also controls the command that is used to send the data to the Elasticsearch instance, i.e. it is not possible to add the ?pipeline query parameter.
With a default ingest pipeline (based on index/type, ideally specified together with everything else in the index template), one could set the _id through a preprocessor based on the document _source.
@zoellner I stumbled upon this issue while looking for the exact same option. Did you happen to figure out a way around this?
@whiteboardmonk no, I've since stopped using Firehose Streams because of this issue.
+1, _timestamp-replacement as the use-case
Are there any updates on this?
+1, for the default pipeline (regarding the use-case for _timestamp)
+1
I do understand @clintongormley's argument against it, and it should be documented that a default pipeline will be executed on first ingestion as well as on updates. But having the choice between specifying it at the index level or per request gives the flexibility to use whichever is more appropriate for the current job.
Being able to specify a default pipeline perhaps in an index template would be extremely useful for our case where we don't have control over the bulk put. We are using fluentd and its elasticsearch plugin and I don't believe there is a way for us to specify a pipeline using its output language.
+1 for default pipeline.
And the setting should likely live on the index side, not on the pipeline.
Also add an option to skip the pipeline, like ?skip_pipeline=true, as a special case for some interfaces, e.g. reindex. That may avoid the case @clintongormley mentioned at the beginning.
+1, to add another real-world use case: we have a tracing implementation that persists to Elasticsearch. It keeps track of timestamps in microseconds instead of milliseconds. Adding an additional field that does the conversion while indexing would be extremely helpful. There is no way to control the trace collector's PUT requests to ES and therefore no way to configure a pipeline via query parameters :/
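A conversion like this could be done with a script processor, roughly as sketched below. The field names `timestamp_micros` and `timestamp_millis` are assumptions, and `to_millis` is just a local mirror of the script's integer division.

```python
# Hypothetical field names for the tracing documents.
MICROS_FIELD = "timestamp_micros"
MILLIS_FIELD = "timestamp_millis"

# Pipeline body: a Painless script adds a millisecond field next to the
# original microsecond value (integer division, so sub-millisecond digits drop).
micros_pipeline = {
    "description": "Add a millisecond timestamp derived from microseconds",
    "processors": [
        {
            "script": {
                "lang": "painless",
                "source": f"ctx.{MILLIS_FIELD} = ctx.{MICROS_FIELD} / 1000",
            }
        }
    ],
}


def to_millis(micros):
    """Local mirror of the script: truncate microseconds to milliseconds."""
    return micros // 1000


if __name__ == "__main__":
    print(micros_pipeline["processors"][0]["script"]["source"])
    print(to_millis(1500000123456789))
```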
@clintongormley Good point. Perhaps a good idea would be having an index-wide setting for a default pipeline, with some parameters controlling for which operations the default applies? (By operations I mean index or update or whatever.)
I think I could get behind the following:
- the default pipeline would apply to index or create operations only
- update operations would not use the default pipeline
- specifying ?pipeline=foo in an index request would result in the foo pipeline being applied instead of the default pipeline
- specifying ?pipeline= in an index request would result in no pipeline being applied
specifying ?pipeline= in an index request would result in no pipeline being applied
It seems like this could easily be a malformed request. For this corner case (that someone wants to get around the default pipeline), one could create a dummy pipeline that does nothing and specify that explicitly here? Then specifying pipeline with an empty string can return an error?
Or, have a specially named value called _none?
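The rules proposed above, together with the two suggestions for the empty-parameter corner case, can be sketched as a small resolution function. This is only a model of the proposal as discussed here, not Elasticsearch's implementation; the function name and the `empty_is_error` switch are made up for the example.

```python
NO_PIPELINE = "_none"                      # sentinel value suggested above
DEFAULT_OPERATIONS = {"index", "create"}   # operations that get the default

_UNSET = object()  # distinguishes "no ?pipeline parameter" from "?pipeline="


def resolve_pipeline(operation, default_pipeline, requested=_UNSET,
                     empty_is_error=False):
    """Return the pipeline name to run, or None for no pipeline.

    operation:        "index", "create" or "update"
    default_pipeline: the index-level default, or None if not configured
    requested:        value of ?pipeline=..., omitted if absent from the request
    empty_is_error:   if True, treat ?pipeline= as a malformed request
    """
    if requested is not _UNSET:
        if requested == "":
            if empty_is_error:
                raise ValueError("empty pipeline parameter; use _none to skip")
            return None          # ?pipeline= disables the default
        if requested == NO_PIPELINE:
            return None          # explicit opt-out via the sentinel
        return requested         # explicit pipeline replaces the default
    if operation in DEFAULT_OPERATIONS:
        return default_pipeline
    return None                  # updates never pick up the default


if __name__ == "__main__":
    print(resolve_pipeline("index", "add_timestamp"))           # default applies
    print(resolve_pipeline("update", "add_timestamp"))          # no pipeline
    print(resolve_pipeline("index", "add_timestamp", "foo"))    # explicit wins
    print(resolve_pipeline("index", "add_timestamp", "_none"))  # opt-out
```

The explicit parameter replaces the default rather than merging with it, which follows the list above and leaves the merged-list alternative raised at the top of the thread unmodelled.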
+1
We sure could use this functionality as well.
Has there been an update yet from ES on whether this is on the roadmap? I'm not finding one.
Is this supported in ES 6.0?
+1
I like the ingest pipeline, as it decouples me from any pre-processing of my logs in the source.
But if I can't enable it by default, I am thrown back to manipulating my sources (which I don't necessarily have control over) to use a specific ingest pipeline.
+1
One more use case for default pipelines could be custom validation or postprocessing of Kibana objects, by introducing a pipeline on the .kibana index.
@clintongormley Why do you want to restrict something optional?
I mean, letting the default pipeline be used in updates could be very useful for calculated fields (in our case, suggest fields for completion), while those who want to use a pipeline only at index time could still do it manually with index parameters.
Or we could specify a default one for index and a default one for update?
Again, as using a default pipeline would not be mandatory, I believe there's no point in restricting it.
My 2 cents :)
Pinging @elastic/es-core-infra
+1
+1
+1
+1
+1
+1
Requested by a student in Engineer II training. Use case: data validation.
Q. The discussion above considers adding a pipeline to index settings. As an alternative, could a default pipeline be specified in an alias, which could be exposed for first ingest while allowing subsequent updates or overwrites directly via the index, or via an alternative alias using no (or a different) pipeline?