Logstash: Add troubleshooting documentation for very slow startup times due to lack of entropy

Created on 25 Oct 2016 · 16 comments · Source: elastic/logstash

A lack of entropy in /dev/random can push the startup time of Logstash (or any other Java/JRuby software) beyond 5 minutes.

Example issue, where starting Logstash took 10 minutes: https://github.com/elastic/logstash/issues/6114

JRuby's wiki has a section on this issue: https://github.com/jruby/jruby/wiki/Improving-startup-time#ensure-your-system-has-adequate-entropy

We should address this in a troubleshooting section by either:

1) making the user aware that the problem exists and that other software could be causing it by draining /dev/random
2) suggesting software that generates entropy for /dev/random, thereby working around the draining issue

Usually there should be no issue unless /dev/random is read from very frequently (for example, a ruby filter that reads from it on every event).

bug v5.5.0

All 16 comments

Maybe we could also add a startup check that uses NIO to try to read from /dev/random; if we time out waiting for data, we abort startup (treating this as a health check required for startup to succeed) and log the reason.
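A rough shell sketch of such a check (hypothetical, not actual Logstash code): attempt a bounded read from /dev/random and abort if it blocks for too long.

```shell
#!/bin/sh
# Hypothetical startup health check: try to read 32 bytes from /dev/random,
# giving up after 5 seconds. A timeout means the entropy pool is too drained
# to seed SecureRandom quickly, so we fail fast instead of hanging.
if ! timeout 5 head -c 32 /dev/random > /dev/null; then
  echo "FATAL: timed out reading /dev/random; system entropy appears exhausted" >&2
  exit 1
fi
echo "entropy health check passed"
```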

+1 on the sanity check. On systems that support it, we could simply check /proc/sys/kernel/random/entropy_avail in the bash scripts.
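For example, a minimal sketch of such a check (the 1000-bit threshold is an arbitrary assumption, not an official recommendation):

```shell
#!/bin/sh
# Hypothetical sanity check for the startup scripts: warn when the kernel's
# entropy pool looks shallow enough that SecureRandom seeding may block.
ENTROPY_FILE=/proc/sys/kernel/random/entropy_avail
MIN_ENTROPY=1000   # assumed threshold, tune as needed

if [ -r "$ENTROPY_FILE" ]; then
  available=$(cat "$ENTROPY_FILE")
  if [ "$available" -lt "$MIN_ENTROPY" ]; then
    echo "WARNING: only $available bits of entropy available;" \
         "startup may block on /dev/random (consider installing haveged)" >&2
  fi
fi
```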

We could add a note for this in "getting started" for now

@suyograo I wonder if it makes sense at this point to start a Troubleshooting container? We could have something like:

Troubleshooting
    Performance Troubleshooting Guide
    Startup Issues

It would be a good place to add other troubleshooting advice. But if you think we'll have a programmatic check for adequate entropy (soonish) we could just add a note to the doc for now, as you suggest.

I hit this on a small VM and fixed it by installing haveged

As per a chat with @jasontedor, we might be able to solve this by switching to /dev/urandom, which should be configurable via securerandom.source=file:/dev/urandom in JAVA_HOME/jre/lib/security/java.security.

Note that this line will already exist and should be edited from securerandom.source=file:/dev/random to securerandom.source=file:/dev/urandom. Alternatively, you can add this as a JVM option via -Djava.security.egd=file:/dev/urandom. Lastly, this will only help if the underlying issue is caused by SecureRandom, which defaults to /dev/random; if something is misbehaving and gathering randomness directly from /dev/random, there is nothing we can do other than the suggestion @jakommo already offered (and correcting it to use /dev/urandom).
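Concretely, the two alternatives look like this (the JAVA_HOME path varies per install; both lines are exactly as described above):

```
# In $JAVA_HOME/jre/lib/security/java.security, edit the existing line to read:
securerandom.source=file:/dev/urandom

# Or, leaving java.security untouched, pass the equivalent JVM option:
-Djava.security.egd=file:/dev/urandom
```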

Changing /dev/random to /dev/urandom is OK in dev, but we should avoid recommending it for production, as it's considered a security issue.

It's not a security issue; /dev/urandom is cryptographically secure.

I agree with @jasontedor that /dev/urandom is good to go on Linux

Anything but documentation or notes - nobody reads those. A self-check plus a message in CAPS on startup will do :-)
P.S. fixed with haveged also

Confirmed this bug today with Logstash v5.2.2 on Ubuntu 16.04. No output appeared when running bin/logstash for 15+ minutes. I attempted to start the process multiple times.

After running sudo apt-get install rng-tools, Logstash started immediately. I don't know how secure this is for production use; haveged may be a better alternative.

For those who are troubleshooting, you can check the available system entropy by running: cat /proc/sys/kernel/random/entropy_avail on Ubuntu.

More information here: http://serverfault.com/questions/214605/gpg-not-enough-entropy

Same problem

Just wanted to chime in and say this made my first experience with logstash a bit stressful (now clocking 3 hours trying to troubleshoot this).

@gtirloni So sorry you had a hard first experience. This is still a bit of a "needle in a haystack" problem that only a relatively small subset of users are experiencing. Some VM environments see it, some don't. We're working out how to document this best, because the problem affects Logstash, but the problem isn't caused by Logstash. We still haven't sorted out the best way forward, and we appreciate your feedback. Sorry again you had a bad experience.

I'm OK with the solution for this (hopefully) being using urandom on Linux systems.

Implementation-wise, we would update our default JVM flags to include -Djava.security.egd=file:/dev/urandom and write some tests to verify that we are reading from the correct randomness source during startup.
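In practice, that would mean a line like this in the shipped config/jvm.options (an illustrative sketch of the proposal, not a merged change):

```
## use a non-blocking entropy source for SecureRandom seeding
-Djava.security.egd=file:/dev/urandom
```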
