Logstash: Suspected memory leak leading to heap OOM in Logstash 5.6.6

Created on 9 Feb 2018 · 16 comments · Source: elastic/logstash

I'm opening an issue here as we have had no response to our forum post. The post includes graphs showing the problem and a memory dump screenshot https://discuss.elastic.co/t/suspected-memory-leak-leading-to-heap-oom-in-logstash-5-6-6/117807

We upgraded to Logstash 5.6.6 across our estate and soon saw Logstash heap usage climb until failure with OOM heap errors on all boxes. The heavier the load the quicker the failure. Each server runs a specific config for its role, although all use file input and Redis output plugins.

Downgrading to 5.6.5 fixed the problem.

On two proxies built from the same AMI in an ELB, memory usage climbed and CPU usage showed excessive GC when Logstash was upgraded on 22/01, dropped when Logstash was restarted a couple of times, and then returned to a flat profile once Logstash was downgraded on 30/01.

On two other proxies, identical except for the Logstash version and doing the same job as part of an ELB, heap usage on v5.6.6 climbed until OOM while v5.6.5 stayed flat, and CPU usage on v5.6.6 showed excessive GC.

The heap dump suggests a suspected memory leak, which you can see in the forum post.

We saw Logstash recover a few times (via systemd restart) and sometimes carry on processing at full heap usage, but at other times it stopped processing entirely without recovering and without the process exiting.
This was on Ubuntu 16.04.3 LTS.

All 16 comments

What plugins are you using? Have you looked at plugin changes?
https://github.com/elastic/logstash/compare/v5.6.5...v5.6.6

Instead of downgrading LS, try looking at the plugins you already use and see if downgrading them has any impact on your benchmarks.
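
For anyone following along, one way to compare plugin versions across two hosts is the plugin manager's list command (standard `bin/logstash-plugin` subcommands; the host names below are placeholders):

```
# On each host, dump the installed plugin versions
bin/logstash-plugin list --verbose > /tmp/plugins-$(hostname).txt

# Then diff the results from a 5.6.5 host against a 5.6.6 host
diff /tmp/plugins-proxy-a.txt /tmp/plugins-proxy-b.txt
```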

Thanks, the issue is across all servers, which generally use different plugins dependent on role, although there are a few that they share.

I don't believe any plugins have been upgraded, and the issue occurs on identical servers where the only difference is the logstash version.

Unless an issue with a plugin we use has been introduced in v5.6.6, I think it's unlikely to be a plugin, but we can try.

Hi all,
the same is happening in our cluster; I've never had OOM/memory exhaustion issues with Logstash on versions below 5.6.6, but as soon as we upgraded we began getting hit by this issue. We're going to test downgrading one of the affected instances to 5.6.5 to see if it helps... we'll keep you posted on the outcome. The affected Logstash instances have the following plugins installed (other than the default ones): x-pack, logstash-filter-prune, logstash-filter-cidr.

@cjhodges77 - Can you share (the non-sensitive parts of) your configuration for "On these proxies, 2 were running v5.6.6 and 2 v5.6.5"?

The default plugins get updated with (most) every release, so some of your plugins may have been upgraded with the version bump. (you can see which ones via the Gemfile*.lock in the diff)
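
If it helps, the bundled plugin version changes can be pulled straight from the lock files between the two release tags (a sketch against the public repo; the glob matches whichever Gemfile lock the branch carries):

```
git clone https://github.com/elastic/logstash.git && cd logstash
git diff v5.6.5 v5.6.6 -- 'Gemfile*.lock'
```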

I've been looking through the diff for this release and can't find any suspects for memory leaks, and profiling of a very basic pipeline shows no difference either.

@alextxm - did downgrading help with your case? If so, can you also share your config?

Actually, re-visiting the plugin diff, it seems that the 3.1.26 version of the beats input may have a memory issue: https://github.com/logstash-plugins/logstash-input-beats/issues/286

Downgrading to 3.1.24 may alleviate the issue, and we have a fix in the works.
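
For anyone wanting to try that, a sketch of the downgrade using the standard plugin manager, run from the Logstash home directory:

```
# Pin the beats input back to the version shipped with 5.6.5
bin/logstash-plugin install --version 3.1.24 logstash-input-beats
```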

If you are not using the beats input, then ignore this.

@jakelandis
logstash.yml is default other than paths and logging level.
We aren't using beats; here's the config from those proxies. These were some of the first boxes to fall over because they process a large number of logs, but all our servers were affected.
routing_proxy_logstash_redacted.txt
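
(For readers without access to the attachment, a minimal sketch of what such a file-to-Redis pipeline typically looks like; the path, pattern, and Redis host below are placeholders, not the actual redacted config.)

```
input {
  file {
    path => "/var/log/haproxy.log"              # placeholder path
  }
}
filter {
  grok {
    match => { "message" => "%{HAPROXYHTTP}" }  # placeholder pattern
  }
}
output {
  redis {
    host      => "redis.internal"               # placeholder host
    data_type => "list"
    key       => "logstash"
  }
}
```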

Of the plugin version changes between the two hosts, we are only using the following unless others are used as defaults:
logstash-input-udp -- extremely low level
logstash-filter-split
logstash-filter-grok

Full list of changed plugin versions on those proxy hosts:

| Plugin | 5.6.5 | 5.6.6 |
|------------------------|--------|--------|
| logstash-codec-line | 3.0.5 | 3.0.8 |
| logstash-codec-netflow | 3.8.3 | 3.10.0 |
| logstash-codec-plain | 3.0.5 | 3.0.6 |
| logstash-filter-grok | 4.0.0 | 4.0.1 |
| logstash-filter-ruby | 3.1.2 | 3.1.3 |
| logstash-filter-split | 3.1.5 | 3.1.6 |
| logstash-input-beats | 3.1.24 | 3.1.26 |
| logstash-input-http | 3.0.7 | 3.0.8 |
| logstash-input-irc | 3.0.5 | 3.0.6 |
| logstash-input-jdbc | 4.3.1 | 4.3.3 |
| logstash-input-s3 | 3.1.8 | 3.2.0 |
| logstash-input-syslog | 3.2.3 | 3.2.4 |
| logstash-input-udp | 3.1.3 | 3.2.1 |

@cjhodges77 - I have spent a decent amount of time combing through the change sets and profiling memory with a similar type of config to the one you provided. I keep coming up short on finding suspects (other than beats) or reproducing memory anomalies. I also checked the .deb packaging to ensure it isn't a packaging issue.

Could it be environmental? Perhaps different JVM versions? Could it be a new instance that's processing more data, due to a missing file sincedb, more open firewalls, or beefier machines? Are there any log file differences that might point to something?
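
A quick way to rule the environment in or out is to capture the basics on both a healthy and an affected host and diff the output (a sketch; the data directory line assumes the .deb default paths):

```
java -version 2>&1                     # JVM build
dpkg -s logstash | grep '^Version'     # exact package version
bin/logstash-plugin list --verbose     # full plugin set
ls -l /var/lib/logstash                # sincedb / queue data (.deb default path.data)
```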

Your graphs make a pretty compelling case that there is an issue here, but I have exhausted all of my tricks to hunt this down.

@alextxm - any additional context you can provide would be helpful as well.

There really are no differences I can find other than the Logstash version. They are built using the exact same configuration, in the same ELB, same launch configuration, Java version 1.8.0_151, same instance type, same everything.

The issue started on all instances deployed immediately after version 5.6.6 was included in the build and stopped immediately after pinning to version 5.6.5.

We just had the issue hit a load of Docker hosts that were still running 5.6.6 (and cause significant panic). Like the proxies, none of the hosts deployed since pinning to 5.6.5 OR deployed prior to the 5.6.6 release were affected. Again, these are all built from the same configuration.

I know this doesn't provide much info but appreciate your help.

We are having the same issues. High CPU load due to OOM errors in Logstash 5.6.6 (and 5.6.7). Downgrading to 5.6.5 fixed the issue for us. We are running Debian Jessie.
We are using the beats input, but the error seems to also happen on Logstash instances which do NOT process any logs from Filebeat. (We are running Logstash on all Elasticsearch cluster nodes, but not all are configured to receive logs.)
[screenshot: heap usage, 2018-02-12 18:06]

@roock - what does your memory graph look like for 5.6.5? (Does it still sawtooth 200+ MB?) Can you share the non-sensitive parts of your config? Any unusual workflows (like constant reloading)? Any non-default settings in your logstash.yml?

Hi,
downgrading to Logstash 5.6.5 seems to have solved the issue: the instance is now at 65M+ events without getting stuck; it previously got stuck every 2-3 days (this instance handles about 7-8M events per day).
[screenshot from Kibana monitoring]
As you can see from the attached screenshot (which comes from Kibana monitoring), memory usage rises and then drops; with LS 5.6.6+ it only keeps rising until OOM.

@jakelandis just to give you some more hints in order to pinpoint the problem: the instance does not handle beats, it only handles syslog traffic (it serves as a syslog 'host' for a large range of firewalls and other network devices and appliances). As such, it uses the UDP input plugin, performs a wide range of grok and mutate operations, and outputs the results to ES via the elasticsearch output plugin.
In the filter section it also uses the geoip filter and the logstash-filter-cidr filter (which I manually installed using the logstash-plugin install command).
Just for reference: the LS instance is configured with a file-based persistent queue of 8 GB with pages of 64 MB... does this make any difference?
Just let me know if you need any further details; if needed I can also share the LS conf files of the instance.
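
For reference, that queue setup corresponds to the following logstash.yml keys (a sketch; these are the standard persistent-queue settings in LS 5.x):

```
queue.type: persisted
queue.max_bytes: 8gb        # total on-disk queue capacity
queue.page_capacity: 64mb   # size of each queue page file
```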

@cjhodges77 @alextxm @roock - thanks for the insights. I was finally able to reproduce this by setting the worker count to a high value (48), using a generator input with a few groks and a null output. The magic, I believe, was more workers and running the test for multiple hours. I also ran the same test at the same time against 5.6.5 with no issues (other than a really warm laptop). Also, it seems that once it gets close to OOM it significantly slows down processing (probably due to excessive GC).
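
A sketch of a reproduction along those lines (the exact groks used aren't given above, so the patterns here are stand-ins; run with a high worker count, e.g. `bin/logstash -w 48 -f repro.conf`):

```
input  { generator { } }                 # default "Hello world!" messages
filter {
  grok { match => { "message" => "%{WORD:first} %{GREEDYDATA:rest}" } }
  grok { match => { "rest"    => "%{GREEDYDATA:tail}" } }
}
output { null { } }                      # discard events; isolates the filter stage
```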

The problem is grok 4.0.1, and I will log an issue shortly: https://github.com/logstash-plugins/logstash-filter-grok/compare/v4.0.0...v4.0.1. After staring at this code for a bit, I still don't see anything wrong with it. However, I am pretty certain that this new code is exercising a memory leak in JRuby.

[heap dump screenshot]

The top offenders here are JRuby internals (Logstash does not explicitly use them), and I believe this heap dump roughly matches what @cjhodges77 experienced. I suspect that the lambda is leaking references internally, and will attempt to reproduce it in isolation outside of Logstash.

I should get this sorted out this week, but if you need a quick fix it is possible to downgrade the grok filter back to 4.0.0.
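
A sketch of that quick fix with the standard plugin manager (removing first, since the installer may refuse to downgrade in place):

```
bin/logstash-plugin remove logstash-filter-grok
bin/logstash-plugin install --version 4.0.0 logstash-filter-grok
```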

Is this actually fixed in 4.0.2?

Fixed resource leak where this plugin might get double initialized during plugin reload, leaking a thread + some objects

@IrlJidel - no, the leak fixed in 4.0.2 is separate and only causes issues when there are multiple reloads.

this has been fixed in 4.0.3 of the grok filter

bin/logstash-plugin update logstash-filter-grok

I am going to close this as fixed since the plugin can be updated independently. This version should get shipped by default in the next patch release of Logstash.
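
To confirm the update took, the list command accepts a plugin name filter (the expected output assumes 4.0.3 is now installed):

```
bin/logstash-plugin list --verbose logstash-filter-grok
# logstash-filter-grok (4.0.3)
```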
