Logstash: User-defined lookup enrichment

Created on 29 Apr 2016 · 16 comments · Source: elastic/logstash

Logstash should have more dynamic ways to look up and enrich events, especially with external user-defined datasets. Currently, the main avenue of lookup enrichment is the translate filter, which is primarily a basic key/value lookup and only supports YAML. Here are some ideas:

Use cases

  • Simple key/value lookup enrichment

    • Lookup user name from a user ID

    • Tag/classify bad actors or blacklisted IP addresses

  • Multi-field lookup enrichment

    • “Join” external table data with event

    • Add multiple user fields (name, address, phone #, birthday, etc.) to an event

    • In more traditional BI, there are dimension tables in star schemas or CMDBs tables where relevant lookup data is sourced.

  • Use RDBMS, Elasticsearch, or others as doc store for lookup dataset

Filter plugin additions and enhancements for user-defined data lookup

  • logstash-filter-translate (Phase 1)

    • [ ] Better multi-field enrichment. Currently, multi-field lookups are stored as a complex object in the destination field; we should allow these fields to be added directly at the top level of the event. (https://github.com/logstash-plugins/logstash-filter-translate/issues/44)
    • [ ] Full array support - old PR (https://github.com/logstash-plugins/logstash-filter-translate/pull/15)
  • logstash-filter-jdbc (Phase 1) - there are two plugins we're building for this use case:

    • [x] [jdbc_static](https://github.com/elastic/logstash/issues/6502) - this enables caching a JDBC query result set locally in memory on LS for scalable event enrichment. It will support multi-field lookups on primary/composite keys and also periodic refreshes of the lookup cache. This will have the broadest set of use cases especially with CMDBs and data warehouses.

    • [ ] [jdbc_streaming](https://github.com/logstash-plugins/logstash-filter-jdbc_streaming) - this performs a JDBC call per event, which can be helpful when the lookup data changes often. It does require a network round trip per event, which can limit performance.

  • logstash-filter-elasticsearch (Phase 1)

    • [x] Official plugin support - starting in 5.3

    • [ ] Better integration with ES percolation

    • [ ] Better scalability, query latency optimization

  • logstash-filter-http (Phase 2)
  • Redis document (Future)
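
As a sketch, a per-event jdbc_streaming lookup for the "add multiple user fields" use case might look like the config below. The option names follow the plugin's documentation, but the driver path, connection string, table, and field names here are illustrative, not taken from this issue:

```
filter {
  jdbc_streaming {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/crm"
    jdbc_user => "lookup"
    statement => "SELECT name, address, phone FROM users WHERE id = :id"
    parameters => { "id" => "user_id" }
    target => "user_details"
  }
}
```

Each event's user_id is bound to the :id parameter, and the resulting rows land under the target field, at the cost of one round trip per uncached lookup.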

Ignore below, retained for posterity

Lookup source file formats (for file/http)

  • CSV

    • Multi-field lookup enrichment (Phase 1)

    • Popular format for tabular data, enables RDBMS/Excel table exports for CSV lookup

  • JSON and YAML

    • Simple key/value lookups (Phase 1)

    • Multi-field lookup enrichment (Future)

The lookup data should be cached:

  • O(1) lookups
  • Configurable max memory size allocated
  • Periodic reloading of cache - no need to bounce the pipeline to refresh lookup cache with changes
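
A minimal sketch of such a cache in Ruby (the class and option names are hypothetical, not from any plugin): a Hash provides the O(1) lookups, and the table is rebuilt from the CSV file once it is older than refresh_interval, so the pipeline never needs a restart.

```ruby
require 'csv'

# Hypothetical lookup cache: O(1) Hash lookups with periodic reloading.
class LookupCache
  def initialize(path, refresh_interval: 10)
    @path = path
    @refresh_interval = refresh_interval # seconds between reloads
    @loaded_at = nil
    @table = {}
  end

  def [](key)
    reload if @loaded_at.nil? || Time.now - @loaded_at > @refresh_interval
    @table[key]
  end

  private

  # Rebuild the hash keyed on the first CSV column; the remaining
  # columns become the enrichment fields.
  def reload
    table = {}
    CSV.foreach(@path, headers: true) do |row|
      fields = row.to_h
      key = fields.delete(fields.keys.first)
      table[key] = fields
    end
    @table = table
    @loaded_at = Time.now
  end
end
```

The configurable max memory size is omitted here; a real implementation would also need to bound the cache and evict entries past the limit.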

Multi-field lookup

CSV Format

  • Must contain a header line
  • Must contain at least two columns
  • Looks up on a single or compound key. The lookup key to use should be unique and must be defined at config time.
code,status_description,status_type,color
200,OK,Successful,Green
201,Created,Successful,Green
202,Accepted,Successful,Green
300,Multiple Choices,Redirection,Yellow

Example

#Config
filter {
  lookup_file {
    path => "~/conf/lookup.csv"
    format => "csv"
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => ["http_code", "color"]  # (required) 1+ event keys to match with.  event_fields.length() == lookup_fields.length()
    lookup_fields => ["code", "color"]  # (required for csv) 1+ lookup keys to match against
    target_fields => ["status_description", "status_type"]  # (optional) whitelist of 1+ looked up fields to add to event.  If not defined, adds all fields (not including lookup key fields e.g. "code" and "color") to event top level.
  }
}

#Event in
Event {
  http_code => "202"
  color => "Green"
}

#Event out
Event {
  http_code => "202"
  color => "Green"
  status_description => "Accepted"
  status_type => "Successful"
}
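
The compound-key join in the example above can be sketched in plain Ruby (lookup_enrich is a hypothetical helper, not part of any plugin): event_fields are read from the event, matched against lookup_fields in the CSV, and the target_fields from the matching row are merged into the event.

```ruby
require 'csv'

# Lookup table from the CSV example above.
CSV_DATA = <<~CSV
  code,status_description,status_type,color
  200,OK,Successful,Green
  202,Accepted,Successful,Green
CSV

# Hypothetical helper mirroring the proposed config: match event values
# against a compound lookup key, then merge the target fields.
def lookup_enrich(event, csv_text, event_fields:, lookup_fields:, target_fields:)
  table = {}
  CSV.parse(csv_text, headers: true) do |row|
    key = lookup_fields.map { |f| row[f] }
    table[key] = row.to_h
  end
  match = table[event_fields.map { |f| event[f] }]
  return event unless match
  target_fields.each { |f| event[f] = match[f] }
  event
end
```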

Simple key/value lookup

JSON Format

{
  "200": "Green",
  "201": "Green",
  "202": "Green",
  "300": "Yellow",
  "elastic": true,
  "version": 5.0
}

YAML Format

200: 'Green'
201: 'Green'
202: 'Green'
300: 'Yellow'
elastic: true
version: 5.0

Example

#Config
filter {
  lookup_file {
    path => "~/conf/lookup.json"
    format => ["json" | "yaml"]
    cache_size => "1MB"
    refresh_interval => 10000
    event_fields => "key"
    target_fields => "lookup_value"  # (optional) new field name of looked up value.  If not defined, new field name defaults to "lookup_value".
  }
}

#Event in
Event {
  key => "elastic"
  product => "logstash"
}

#Event out
Event {
  key => "elastic"
  product => "logstash"
  lookup_value => true
}

HTTP example

Very similar to its file counterpart, except it uses 'url' instead of 'path'.

filter {
  lookup_http {
    url => "localhost:9200/lookup1/"
    tls => false
    # other fields are the same...
  }
}

Ref: https://github.com/elastic/logstash/issues/5087, https://github.com/elastic/logstash/issues/3633, https://github.com/elastic/logstash/issues/3446, https://github.com/elastic/logstash/issues/4510

P.S. - open to suggestions on new plugin names...

Labels: design, enhancement, new plugin, roadmap, v5.5.0

All 16 comments

Could we use ES as a backend store for the lookup? Just reread carefully.

format => ["json" | "yaml"] # This could be auto-detected by the file name or when reading?

@ph you're right, it could be and we should consider it when implementing.

@acchen97 I love the idea of enhanced lookups for the Logstash pipeline. What about raising the priority of the Redis lookup, especially when the lookup is dynamic? Having lookups with both ES and Redis would be very helpful for enriching events at runtime, especially when two flows are somehow connected.

@ph it could also be detected by the parser; the filename might be tricky, but I agree a .yml extension usually indicates a YAML file :-P

+1
Are there any timelines for this feature?

@vnadgir-ef file lookups are already supported by the translate filter (supports yaml, csv, and json format). There is no timeline. We have tentatively set this for Logstash 5.2.0 but do not have a release date (and Logstash 5.1.0 isn't out yet, either).

I've created a plugin that does a lot of what's requested here. I call the plugin logstash-filter-augment. It allows joining multiple fields from a CSV/JSON/YAML file onto an event. I based it initially on the translate filter.

The gem is published to ruby-gems: https://rubygems.org/gems/logstash-filter-augment
And it's public on github: https://github.com/alcanzar/logstash-filter-augment

I'd appreciate any feedback/bug fixes/enhancement requests.

@acchen97 can you update the description of this ticket (or close it and open a new one) to reflect some of the recent work in this area? I remember us having some discussions on slack/zoom about features we've already got implemented in the translate filter, for example.

@jordansissel updated this based on our most recent discussions. Let me know if I missed anything.

It would be nice to support not only the Elasticsearch _search endpoint, but the _analyze endpoint as well.

I use the translate filter heavily, and the ruby filter as well for the same reasons, so this is a very welcome addition.

In one case I use the translate filter to look up certain values, and if nothing matches, a ruby filter executes a Go program that queries an HTTP API and appends the result to the translate dictionary. The issue is that if, for example, 100 incoming messages share a value that does not exist in the dictionary, the HTTP API is hit 100 times. If there were some way to trigger a reload of the dictionary when the file changes, that would be extremely valuable.

Instead of reloading the file every X seconds, watch the file for modifications and reload it when it changes. To prevent constant reloads when the dictionary changes rapidly, add a setting to wait at least X seconds before reloading again.
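
That suggestion can be sketched in Ruby (all names here are hypothetical): reload only when the file's mtime changes, and debounce so reloads happen at most once per min_reload_seconds, however fast the file is being rewritten.

```ruby
# Hypothetical change-triggered dictionary reloader with a debounce.
class WatchedDictionary
  def initialize(path, min_reload_seconds: 5)
    @path = path
    @min_reload = min_reload_seconds
    @mtime = nil
    @reloaded_at = Time.at(0)
    @dict = {}
  end

  def dict
    maybe_reload
    @dict
  end

  private

  def maybe_reload
    return if Time.now - @reloaded_at < @min_reload # debounce fast writers
    mtime = File.mtime(@path)
    return if mtime == @mtime                       # unchanged: skip the read
    # Parse simple "key,value" lines into a Hash.
    @dict = File.readlines(@path, chomp: true)
                .map { |line| line.split(',', 2) }.to_h
    @mtime = mtime
    @reloaded_at = Time.now
  end
end
```

A production version would likely use inotify or a similar OS facility instead of polling mtime, but the debounce logic would be the same.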

@jordansissel @suyograo just updated this based on our recent discussions with specific action items for translate, elasticsearch, and jdbc filters. One thing we should discuss is the design for better integrating the ES filter with ES percolations.

Any news here or other issues to follow up the work?

Is this still a planned feature?

Database lookup enrichment is now generally available with the JDBC static and JDBC streaming filters.
