Provide a Logstash filter that reads arbitrary ES ingest node pipeline definitions and applies them to Logstash events. To ensure that ES ingest node pipelines in Logstash produce the same results as in ES, the ES ingest node code will be hosted and executed inside Logstash.
Users would supply their ES ingest node pipeline definition(s) in JSON format as a configuration option to the Logstash filter (multiple pipelines could be specified because the pipeline processor would be supported). E.g.:
{
  "set_and_lower": {
    "processors": [
      {
        "set": {
          "field": "my_field1",
          "value": "FOO BAR BAZ"
        }
      },
      {
        "lowercase": {
          "field": "my_field1",
          "target_field": "my_field2",
          "ignore_missing": false
        }
      }
    ]
  },
  "rename_hostname": {
    "processors": [
      {
        "rename": {
          "field": "hostname",
          "target_field": "host",
          "ignore_missing": true
        }
      }
    ]
  }
}
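As a sketch of how such a filter might be invoked from a Logstash pipeline (the plugin name ingest_pipeline and its option names below are hypothetical, not a settled API):

filter {
  # Hypothetical filter that hosts the ES ingest node code;
  # definitions_file and pipeline are illustrative option names.
  ingest_pipeline {
    definitions_file => "/etc/logstash/ingest_pipelines.json"
    pipeline => "set_and_lower"
  }
}

With the definitions above, an event passing through this filter would get my_field1 set to "FOO BAR BAZ" and my_field2 set to "foo bar baz".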
The Logstash filter would apply the ingest node processors defined above to all events in the pipeline.
As of the ES 6.7.0 release, all ingest node processors, including user_agent and geoip, would be supported, with the single exception of the set_security_user processor, which provides security functionality that is relevant only within the ES context. The current proposal is to override set_security_user to be a no-op when run within Logstash.
What advantages would this provide over the existing ingest node pipeline converter tool?
@praseodym, the biggest advantages are probably that unlike the converter tool, there are currently no limitations on the ingest node pipelines that could be run in Logstash -- all processors including user_agent and geoip are supported. There are also no differences in behavior between the Logstash Ruby implementations and ES Java implementations of the various operators such as grok because it is the actual ES ingest node code that is running within Logstash. Additionally, some people may consider it an advantage that the ES ingest node pipeline could be run directly in Logstash without having to go through an intermediate conversion step.
@danhermann @jsvd @jakelandis updated the original issue with the plugin & ES compatibility strategy that we decided on today. I've also removed the open question around dependencies as it's now been resolved.
@praseodym the ingest-converter tool was last updated two years ago, as far as I can see. It also has some bugs (try, for example, converting the Kafka ingest pipeline: https://github.com/elastic/beats/blob/master/filebeat/module/kafka/log/ingest/pipeline.json).
On the other hand, maintaining two different implementations (Java and Ruby) of the same operators could lead to different behaviours.
Lastly, the effort to migrate the Filebeat Logstash pipeline filters between major versions of the Elastic Stack is very cumbersome (take, for example, the breaking changes between 6.x and 7.x regarding renamed Filebeat fields: https://www.elastic.co/guide/en/beats/libbeat/7.0/breaking-changes-7.0.html#_field_name_changes).
+1 for off-loading workload to Logstash. The reason we have a Logstash cluster is to move data-processing work away from Elasticsearch.
As a user, it's awesome to have default out-of-the-box data processing for every module at hand that I don't have to maintain.
If I were to use the conversion tool, I would have to check for differences in the module's ingest code on EVERY update. And since the tool has a lot of limitations, that gets out of hand very quickly.
One option I would love to see from this filter is the ability to specify a pipeline to be read and processed. From there, I could further enhance the result of the ingest pipeline. E.g., if I were to use the Netflow Filebeat module:
input {
  beats {
    port => 5000
  }
}

filter {
  # Read and process the built-in Ingest Pipeline for this module
  # (The pipeline that comes from:
  # "filebeat setup --pipelines --modules netflow ...")
  pipeline {
    name => "%{[@metadata][pipeline]}"
  }

  # Lookup the tcp_flags in the YML file,
  # based on fields from the Ingest Pipeline
  translate {
    dictionary_path => "/etc/logstash/patterns/tcp_flags.yml"
    field => "[netflow][tcp_control_bits]"
    destination => "[netflow][tcp_flags]"
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200/"]
    user => "user"
    password => "password"
    ssl => true
    index => "my_index"
  }
}
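For readers unfamiliar with the translate filter used above: the dictionary could also be supplied inline rather than via dictionary_path. A minimal sketch, where the mapping itself is illustrative (the keys are standard TCP control-bit values):

translate {
  field => "[netflow][tcp_control_bits]"
  destination => "[netflow][tcp_flags]"
  # Illustrative mapping of TCP control-bit values to flag names
  dictionary => {
    "2"  => "SYN"
    "16" => "ACK"
    "18" => "SYN-ACK"
  }
}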
%{[@metadata][pipeline]} could be stored in Elasticsearch (as it is now), but read and processed by Logstash.
I have no idea how portable the Ingest Node code is (from Elasticsearch), but making Logstash nodes eligible as Ingest Nodes too would remove the need to maintain two different implementations of the same thing (Ruby vs. Java).
This would also remove possible bugs and behaviour differences between the two implementations.
This is even more important given that Beats are starting to do more processing via an Elasticsearch ingest node out of the box. This means that when a Beat forwards to Logstash, Logstash doesn't actually have the final copy of the message, as the ingest node will transform the message further.
In the past this wasn't the case and it was trivial to have logstash forward the message to multiple destinations (ES, Kafka, File, Console) and each would end up with the exact same data.
Unfortunately, now that ES is doing the transform, we cannot use Logstash itself to forward the incoming parsed and processed log to multiple destinations.
If we could have logstash perform the more complex transforms by running the exact ES pipeline that ES would have run that would solve our issue.
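To make the fan-out scenario concrete, a minimal sketch of what such a configuration could look like once the ES pipeline runs inside Logstash. The ingest_pipeline filter name and its options are hypothetical (as above), and the hostnames, topic, and paths are placeholders:

filter {
  # Hypothetical filter running the exact ES ingest pipeline in Logstash
  ingest_pipeline {
    definitions_file => "/etc/logstash/ingest_pipelines.json"
    pipeline => "%{[@metadata][pipeline]}"
  }
}

output {
  # Every destination now receives the same fully processed event
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topic_id => "parsed-logs"
  }
  file {
    path => "/var/log/logstash/parsed-logs.json"
  }
}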