Logstash: User-defined lookup enrichment

Created on 29 Apr 2016 · 16 comments · Source: elastic/logstash

Logstash should have more dynamic ways to look up and enrich events, especially with external user-defined datasets. Currently, the main avenue of lookup enrichment is the translate filter, which is primarily a basic key/value lookup and only supports YAML. Here are some ideas:

Use cases

  • Simple key/value lookup enrichment

    • Lookup user name from a user ID

    • Tag/classify bad actors or blacklisted IP addresses

  • Multi-field lookup enrichment

    • “Join” external table data with event

    • Add multiple user fields (name, address, phone #, birthday, etc.) to an event

    • In more traditional BI, there are dimension tables in star schemas or CMDBs tables where relevant lookup data is sourced.

  • Use RDBMS, Elasticsearch, or others as doc store for lookup dataset

Filter plugin additions and enhancements for user-defined data lookup

  • logstash-filter-translate (Phase 1)

    • [ ] Better multi-field enrichment. Currently, multi-field lookups are stored as a complex object in the destination field; we should allow these fields to be added directly at the top level of the event. (https://github.com/logstash-plugins/logstash-filter-translate/issues/44)
    • [ ] Full array support - old PR (https://github.com/logstash-plugins/logstash-filter-translate/pull/15)
  • logstash-filter-jdbc (Phase 1) - there are two plugins we're building for this use case:

    • [x] [jdbc_static](https://github.com/elastic/logstash/issues/6502) - this enables caching a JDBC query result set locally in memory on LS for scalable event enrichment. It will support multi-field lookups on primary/composite keys and also periodic refreshes of the lookup cache. This will have the broadest set of use cases especially with CMDBs and data warehouses.

    • [ ] [jdbc_streaming](https://github.com/logstash-plugins/logstash-filter-jdbc_streaming) - this performs a JDBC call per event, which can be helpful when the lookup data changes often. It does require a network round trip per event, which can limit performance.

  • logstash-filter-elasticsearch (Phase 1)

    • [x] Official plugin support - starting in 5.3

    • [ ] Better integration with ES percolation

    • [ ] Better scalability, query latency optimization

  • logstash-filter-http (Phase 2)
  • Redis document (Future)
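
As a sketch, a per-event jdbc_streaming lookup for the "add multiple user fields" use case might look like the config below. The option names follow the plugin's documentation, but the driver path, connection string, table, and field names here are illustrative, not taken from this issue:

```
filter {
  jdbc_streaming {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/crm"
    jdbc_user => "lookup"
    statement => "SELECT name, address, phone FROM users WHERE id = :id"
    parameters => { "id" => "user_id" }
    target => "user_details"
  }
}
```

Each event's user_id is bound to the :id parameter, and the resulting rows land under the target field, at the cost of one round trip per uncached lookup.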

Ignore below, retained for posterity

Lookup source file formats (for file/http)

  • CSV

    • Multi-field lookup enrichment (Phase 1)

    • Popular format for tabular data, enables RDBMS/Excel table exports for CSV lookup

  • JSON and YAML

    • Simple key/value lookups (Phase 1)

    • Multi-field lookup enrichment (Future)

The lookup data should be cached:

  • O(1) lookups
  • Configurable max memory size allocated
  • Periodic reloading of cache - no need to bounce the pipeline to refresh lookup cache with changes
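
A minimal sketch of such a cache in Ruby (the class and option names are hypothetical, not from any plugin): a Hash provides the O(1) lookups, and the table is rebuilt from the CSV file once it is older than refresh_interval, so the pipeline never needs a restart.

```ruby
require 'csv'

# Hypothetical lookup cache: O(1) Hash lookups with periodic reloading.
class LookupCache
  def initialize(path, refresh_interval: 10)
    @path = path
    @refresh_interval = refresh_interval # seconds between reloads
    @loaded_at = nil
    @table = {}
  end

  def [](key)
    reload if @loaded_at.nil? || Time.now - @loaded_at > @refresh_interval
    @table[key]
  end

  private

  # Rebuild the hash keyed on the first CSV column; the remaining
  # columns become the enrichment fields.
  def reload
    table = {}
    CSV.foreach(@path, headers: true) do |row|
      fields = row.to_h
      key = fields.delete(fields.keys.first)
      table[key] = fields
    end
    @table = table
    @loaded_at = Time.now
  end
end
```

The configurable max memory size is omitted here; a real implementation would also need to bound the cache and evict entries past the limit.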

Multi-field lookup

CSV Format

  • Must contain a header line
  • Must contain at least two columns
  • Looks up on a single or compound key. The lookup key to use should be unique and must be defined at config time.
code,status_description,status_type,color
200,OK,Successful,Green
201,Created,Successful,Green
202,Accepted,Successful,Green
300,Multiple Choices,Redirection,Yellow

Example

#Config
filter {
  lookup_file {
    path => "~/conf/lookup.csv"
    format => "csv"
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => ["http_code", "color"]  # (required) 1+ event keys to match with.  event_fields.length() == lookup_fields.length()
    lookup_fields => ["code", "color"]  # (required for csv) 1+ lookup keys to match against
    target_fields => ["status_description", "status_type"]  # (optional) whitelist of 1+ looked up fields to add to event.  If not defined, adds all fields (not including lookup key fields e.g. "code" and "color") to event top level.
  }
}

#Event in
Event {
  http_code => "202"
  color => "Green"
}

#Event out
Event {
  http_code => "202"
  color => "Green"
  status_description => "Accepted"
  status_type => "Successful"
}
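
The compound-key join in the example above can be sketched in plain Ruby (lookup_enrich is a hypothetical helper, not part of any plugin): event_fields are read from the event, matched against lookup_fields in the CSV, and the target_fields from the matching row are merged into the event.

```ruby
require 'csv'

# Lookup table from the CSV example above.
CSV_DATA = <<~CSV
  code,status_description,status_type,color
  200,OK,Successful,Green
  202,Accepted,Successful,Green
CSV

# Hypothetical helper mirroring the proposed config: match event values
# against a compound lookup key, then merge the target fields.
def lookup_enrich(event, csv_text, event_fields:, lookup_fields:, target_fields:)
  table = {}
  CSV.parse(csv_text, headers: true) do |row|
    key = lookup_fields.map { |f| row[f] }
    table[key] = row.to_h
  end
  match = table[event_fields.map { |f| event[f] }]
  return event unless match
  target_fields.each { |f| event[f] = match[f] }
  event
end
```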

Simple key/value lookup

JSON Format

{
  "200": "Green",
  "201": "Green",
  "202": "Green",
  "300": "Yellow",
  "elastic": true,
  "version": 5.0
}

YAML Format

200: 'Green'
201: 'Green'
202: 'Green'
300: 'Yellow'
elastic: true
version: 5.0

Example

#Config
filter {
  lookup_file {
    path => "~/conf/lookup.json"
    format => ["json" | "yaml"]
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => "key"
    target_fields => "lookup_value"  # (optional) new field name of looked up value.  If not defined, new field name defaults to "lookup_value".
  }
}

#Event in
Event {
  key => "elastic"
  product => "logstash"
}

#Event out
Event {
  key => "elastic"
  product => "logstash"
  lookup_value => true
}

HTTP example

Very similar to its file counterpart, except it uses 'url' instead of 'path'.

filter {
  lookup_http {
    url => "localhost:9200/lookup1/"
    tls => false
    # other fields are the same...
  }
}

Ref: https://github.com/elastic/logstash/issues/5087, https://github.com/elastic/logstash/issues/3633, https://github.com/elastic/logstash/issues/3446, https://github.com/elastic/logstash/issues/4510

P.S. - open to suggestions on new plugin names...

Labels: design, enhancement, new plugin, roadmap, v5.5.0

All 16 comments

Could we use ES as a backend store for the lookup? Just reread carefully.

format => ["json" | "yaml"] # This could be auto-detected by the file name or when reading?

@ph you're right, it could be and we should consider it when implementing.

@acchen97 I love the idea of enhanced lookups for the Logstash pipeline. What about raising the priority of the Redis lookup, especially when the lookup is dynamic? Having lookups with both ES and Redis would be very helpful for enriching events at runtime, especially when two flows are somehow connected.

@ph it could also be detected by the parser; the filename might be tricky, but I agree a .yml extension usually indicates a YAML file :-P

+1
Are there any timelines for this feature?

@vnadgir-ef file lookups are already supported by the translate filter (supports yaml, csv, and json format). There is no timeline. We have tentatively set this for Logstash 5.2.0 but do not have a release date (and Logstash 5.1.0 isn't out yet, either).

I've created a plugin that does a lot of what's requested here. I call the plugin logstash-filter-augment. It allows joining multiple fields from a CSV/JSON/YAML file onto an event. I based it initially on the translate filter.

The gem is published to ruby-gems: https://rubygems.org/gems/logstash-filter-augment
And it's public on github: https://github.com/alcanzar/logstash-filter-augment

I'd appreciate any feedback/bug fixes/enhancement requests.

@acchen97 can you update the description of this ticket (or close it and open a new one) to reflect some of the recent work in this area? I remember us having some discussions on slack/zoom about features we've already got implemented in the translate filter, for example.

@jordansissel updated this based on our most recent discussions. Let me know if I missed anything.

It would be nice to support not only the Elasticsearch _search endpoint, but the _analyze endpoint as well.

I use the translate filter heavily, and the ruby filter as well for the same reasons, so this is a very welcome addition.

In one case I use the translate filter to look up certain values, and if nothing matches, a ruby filter executes a Go program that queries an HTTP API and appends the result to the translate dictionary. The issue is that if, for example, 100 incoming messages share a value that does not exist in the dictionary, the HTTP API is hit 100 times. If there were some way to trigger a reload of the dictionary when the file changes, that would be extremely valuable.

Instead of reloading the file every X seconds, watch the file for modifications and reload it when it changes. To prevent constant reloads when the dictionary changes rapidly, add a setting to wait at least X seconds before reloading again.
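
That suggestion can be sketched in Ruby (all names here are hypothetical): reload only when the file's mtime changes, and debounce so reloads happen at most once per min_reload_seconds, however fast the file is being rewritten.

```ruby
# Hypothetical change-triggered dictionary reloader with a debounce.
class WatchedDictionary
  def initialize(path, min_reload_seconds: 5)
    @path = path
    @min_reload = min_reload_seconds
    @mtime = nil
    @reloaded_at = Time.at(0)
    @dict = {}
  end

  def dict
    maybe_reload
    @dict
  end

  private

  def maybe_reload
    return if Time.now - @reloaded_at < @min_reload # debounce fast writers
    mtime = File.mtime(@path)
    return if mtime == @mtime                       # unchanged: skip the read
    # Parse simple "key,value" lines into a Hash.
    @dict = File.readlines(@path, chomp: true)
                .map { |line| line.split(',', 2) }.to_h
    @mtime = mtime
    @reloaded_at = Time.now
  end
end
```

A production version would likely use inotify or a similar OS facility instead of polling mtime, but the debounce logic would be the same.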

@jordansissel @suyograo just updated this based on our recent discussions with specific action items for translate, elasticsearch, and jdbc filters. One thing we should discuss is the design for better integrating the ES filter with ES percolations.

Any news here or other issues to follow up the work?

Is this still a planned feature?

Database lookup enrichment is now generally available with the JDBC static and JDBC streaming filters.
