I want to contribute my new filter plugin, logstash-filter-memoize, to logstash-plugins.
This filter provides memoization for any other filter.
See #8530 for the history of this plugin.
filter {
  memoize {
    key => "%{host}"
    fields => ["host_owner", "host_location"]
    filter_name => "elasticsearch"
    filter_options => {
      query => "host:%{host}"
      index => "known_host"
      fields => {
        "host_owner" => "host_owner"
        "host_location" => "host_location"
      }
    }
  }
}
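The wrapping idea in the config above can be sketched in plain Ruby (a minimal sketch with invented names, not the actual plugin code; events are modeled as plain hashes): on a cache miss the wrapped filter runs and the listed fields are stored under the key; on a hit the stored fields are copied back onto the event without re-running the filter.

```ruby
# Minimal memoization sketch: cache the wrapped filter's output fields
# per key, and replay them for later events with the same key.
class MemoizeSketch
  def initialize(fields:, &lookup)
    @fields = fields   # event fields to cache, e.g. ["host_owner"]
    @lookup = lookup   # stands in for the expensive wrapped filter
    @cache = {}
  end

  def filter(event)
    key = event["host"]          # stands in for the interpolated key
    if (cached = @cache[key])
      cached.each { |f, v| event[f] = v }  # hit: restore cached fields
    else
      @lookup.call(event)                  # miss: run the wrapped filter
      @cache[key] = @fields.map { |f| [f, event[f]] }.to_h
    end
    event
  end
end

calls = 0
memo = MemoizeSketch.new(fields: ["host_owner"]) do |event|
  calls += 1
  event["host_owner"] = "owner-of-#{event["host"]}"
end

memo.filter({ "host" => "a" })  # miss: runs the lookup
memo.filter({ "host" => "a" })  # hit: served from the cache
```

The second call never invokes the lookup block, which is exactly the saving the plugin aims for on repeated keys.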
I understand the idea. However, my feeling is that most filters are probably not doing things that would benefit from memoization, so it would be confusing to have this available to wrap all possible filters.
We could think about adding caching tricks specifically to external lookup plugins instead of this?
@jordansissel
> I understand the idea. However, my feeling is that most filters are probably not doing things that would benefit from memoization, so it would be confusing to have this available to wrap all possible filters.

I agree with this comment. Most filters that currently require caching already have it built in, e.g. dns, geoip, useragent, ...

> We could think about adding caching tricks specifically to external lookup plugins instead of this?
Except for the elasticsearch filter, this plugin is expected to be used with grok or ruby filters in limited situations. It can also be used with some filters that are not contributed to logstash-plugins, e.g. logstash-filter-http, logstash-filter-rest.
However, I think it is worth creating this as a new filter plugin. Caching is necessary in quite a few cases, and with this filter, new filters will be free from implementing caching themselves.
@sw-jung I think you're on the right path WRT bringing caching to more things.
I think there are two concerns here:
I think we can solve both of these with mixins, without the complexity of having what amounts to macro plugins (as cool as that is!).
Mixins, like the logstash-http-client mixin, can provide common options and common functionality.
I actually wonder if functionality like this could/should live in LS core to make the dependency graph smaller. WDYT @jordansissel ?
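As a rough illustration of the mixin approach (all names below are invented for the sketch; only logstash-mixin-rabbitmq_connection is a real mixin), a caching mixin could give every lookup plugin the same setup options and a shared fetch-or-compute helper:

```ruby
# Hypothetical caching mixin sketch: shared options plus a cached()
# helper that any lookup filter could include.
module CacheableLookup
  def setup_cache(max_entries: 1000)
    @cache_max = max_entries
    @cache = {}
  end

  # Return the cached value for `key`, computing and storing it on a miss.
  def cached(key)
    return @cache[key] if @cache.key?(key)
    @cache.shift if @cache.size >= @cache_max  # crude FIFO eviction
    @cache[key] = yield
  end
end

# An invented example filter using the mixin.
class ExampleLookupFilter
  include CacheableLookup
  attr_reader :queries  # counts how often the expensive path ran

  def initialize
    setup_cache(max_entries: 100)
    @queries = 0
  end

  def lookup(host)
    cached(host) do
      @queries += 1
      "owner-of-#{host}"  # stands in for a slow external query
    end
  end
end
```

Each including plugin would keep its own cache instance, but the option names and eviction behavior would be consistent across plugins, which is the documentation benefit mentioned later in this thread.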
A method to add caching to more things would solve quite a few problems. For example, I routinely use logstash-filter-elasticsearch for log enrichment in the SIEM space. However, there are some things I've had to quit doing because there is no caching and the performance hit was massive. An example: I collect DNS logs into a specific index, and when I have IDS alerts I take the IP address and use logstash-filter-elasticsearch to look it up and see which client had a DNS query that resolved to the IP address in the alert. This is something the dns module could not handle, yet logstash-filter-elasticsearch cannot handle massive numbers of these lookups.
Adopting @sw-jung's new logstash-filter-memoize has solved some huge performance hurdles for me. The use case above would take hours to process 1+ million documents on my test server, yet logstash-filter-memoize handles it in less than 6 minutes!
@sw-jung thank you for this awesome plugin. I wish I had it earlier. I plan on sharing this with students of my SANS SEC555: SIEM with Tactical Analytics class.
@andrewvc Thanks for your comment.
- Sharing caching code between filters

I was worried about adding this feature; it could be complicated or confusing. If it were implemented in this plugin, it would be used like this:
memoize {
  ...
  cache_namespace => "my_global_cache"
  ...
}
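One plausible reading of this hypothetical cache_namespace option (the registry below is an invented sketch, not plugin code) is a global registry that hands every memoize instance asking for the same namespace the same shared cache object:

```ruby
require "monitor"

# Invented sketch of a process-wide cache registry keyed by namespace,
# so multiple filter instances (or pipeline workers) can share one cache.
class CacheRegistry
  @caches = {}
  @lock = Monitor.new

  # Return the cache for `namespace`, creating it exactly once;
  # synchronized so concurrent workers get the same instance.
  def self.fetch(namespace)
    @lock.synchronize { @caches[namespace] ||= {} }
  end
end
```

Two memoize blocks configured with the same namespace would then read and write the same underlying cache, while distinct namespaces stay isolated.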
Adding the common interface is welcome!
@SMAPPER Thanks for your comment too. Sharing is welcome!
@andrewvc I am in favor of adding this to logstash core if we make no other grammar changes.
Ultimately the question this opens for me is this: Do we expect folks to want to wrap filters with other functionality? Keeping in mind that filters act on events not on keys-and-values, so arbitrary caching support for all filters, for example, is unlikely to be a simple thing. If we did allow this, filters would need to operate on values, not events, imo, so you could, as the memoize filter, know that the value "foo" was passed to a filter for lookup. I have not yet read the code for this PR, so I don't know how much of this is or is not implemented -- bear with my ignorance please until I read the code ;)
I'm +1 on this possibly being an internal feature of the Logstash API, since I feel that only some filters are candidates for memoization.
I need to read more, and have put this on my todo list for this week.
@SMAPPER thank you for testing! I am greatly in favor of adding some kind of memoization to filters needing it. I'll study this PR and get y'all some feedback about next steps.
@sw-jung thank you for your efforts. :)
In a perfect world each plugin that is doing constant lookups would have caching as a built-in option. However, that may not be realistic. Either memoize or something like it may be the fix for this. Below are updates on some use cases with specific benchmark information.

Use case 1 - Pulling in DNS information
Using logstash-filter-elasticsearch to pull in DNS domain names by correlating against a DNS query answer's IP address.
Before, this kind of lookup for 5 million records took about 2 hours. Updating it to the code below shrank this to 5 minutes 38 seconds.
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/benchmark_5_pull_in_dns.conf

Use case 2 - Performing Alexa top-1m lookups
Using logstash-filter-elasticsearch to check if a domain name being accessed is within the Alexa top-1m data set (pre-ingested into an Elasticsearch index).
Using only logstash-filter-elasticsearch took a little over 70 minutes to process 5 million records. Config below:
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/benchmark_7_alexa_top_1m.conf
With memoize it took 5 minutes 28 seconds.
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/benchmark_7_alexa_top_1m_using_memoize.conf

Use case 3 - Performing frequency analysis
Using logstash-filter-rest to query a 3rd-party Python script (freq_server.py) that uses natural language processing to calculate the chance of something being random.
Using only logstash-filter-rest took approximately 80 minutes to process 5 million records.
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/benchmark_8_frequency_scoring.conf
With memoize it took 5 minutes and 19 seconds.
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/benchmark_8_frequency_scoring_with_memoize.conf

Use case 4 - Performing WHOIS lookups
Using logstash-filter-rest to query a 3rd-party Python script (domain_stats.py) that performs WHOIS lookups on a domain name and pulls back the domain's registration creation date.
Using only logstash-filter-rest could not be tested, as performing this many lookups would cause the system performing them to be blocked by domain registrars.
With memoize, processing 5 million logs took a few seconds less than five minutes and functioned perfectly.
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/benchmark_10_domain_creation_date.conf

These benchmarks are not true-to-life comparisons, as memoize is going to completely dominate given the same log over and over. However, these kinds of use cases in my clients' environments perform the same data lookups millions of times per hour, so having caching such as memoize provides is a major breakthrough.
For example, I've got one client who has 6 Logstash nodes just to keep up with the EPS penalty from the above use cases using logstash-filter-rest and logstash-filter-elasticsearch. By implementing memoize this could probably be shrunk down to 3 Logstash nodes, or possibly even 2.
The Excel sheet with the benchmark timings can be found at this link:
https://github.com/HASecuritySolutions/Presentations/blob/master/Modern%20Log%20Parsing%20and%20Enrichment/Webcast%20Modern%20Parsing%20and%20Enrichment%20Benchmarks.xlsx
I think it's clear that this functionality is useful and powerful.
I think that turning this into a mixin would be the best way to go about this, since it would appear on the docs for each plugin supporting it, each plugin could support common options, etc.
WDYT @sw-jung ? For an example of a mixin: https://github.com/logstash-plugins/logstash-mixin-rabbitmq_connection
@andrewvc It sounds good. I didn't know that I could create a mixin as a plugin.
But a mixin is not an 'officially' pluggable spec, and the following problems are expected.
For the above reasons, I'm a bit worried about creating a mixin as a community plugin.
> I think that turning this into a mixin
The lesson I've learned from mixins is that they are hard for us (as developers) to get right, and we end up causing ourselves maintenance burdens, etc.
For mixins, we want to remove them anyway in favor of having multiple plugins in a single repo possibly deployed as a universal plugin (so the 'mixin' concept goes away and is just shared code used by one plugin providing multiple inputs/outputs/etc).
OK, some follow-up here.
I think this is an awesome plugin. In fact, I recommend it to people with some regularity. This was a really clever idea @sw-jung , and I think you implemented it well.
That said, it really feels like a feature more plugins should bake in themselves. I think the current situation is fine, where people can download this when they need it.
The other thing to mention is we're pausing new plugin contributions temporarily because we want a more streamlined process for helping users discover and use new plugins. So, there's no movement here quite yet, but we'll be figuring out a way to get this plugin in front of more users eyes that doesn't necessarily require moving it to the logstash-plugins org.
Once again, thanks for the contribution and excellent work!
awesome!!!
/cc @colinsurprenant RE Shared Caching theme