Elasticsearch data node crashing with OutOfMemoryError

Created on 29 May 2018 · 9 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version):
6.2.4
Plugins installed: [
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
]

JVM version (java -version):
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux prod-elasticsearch-hot-001 4.13.0-1018-azure #21-Ubuntu SMP Thu May 17 13:58:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:

Many of our data nodes crashed together with OutOfMemoryError.

I can send a link to the memory dump in DM.

call stack of one of the nodes:
<--- OutOfMemoryError happened in this thread State: BLOCKED
java.lang.OutOfMemoryError.&lt;init&gt;() OutOfMemoryError.java:48
io.netty.util.internal.PlatformDependent.allocateUninitializedArray(int) PlatformDependent.java:200
io.netty.buffer.PoolArena$HeapArena.newByteArray(int) PoolArena.java:676
io.netty.buffer.PoolArena$HeapArena.newChunk(int, int, int, int) PoolArena.java:686
io.netty.buffer.PoolArena.allocateNormal(PooledByteBuf, int, int) PoolArena.java:244
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) PoolArena.java:226
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean) PoolArena.java:397
io.netty.buffer.PooledByteBuf.capacity(int) PooledByteBuf.java:118
io.netty.buffer.AbstractByteBuf.ensureWritable0(int) AbstractByteBuf.java:285
io.netty.buffer.AbstractByteBuf.ensureWritable(int) AbstractByteBuf.java:265
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int) AbstractByteBuf.java:1077
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int) AbstractByteBuf.java:1070
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf) AbstractByteBuf.java:1060
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf) ByteToMessageDecoder.java:92
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object) ByteToMessageDecoder.java:263
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.handler.logging.LoggingHandler.channelRead(ChannelHandlerContext, Object) LoggingHandler.java:241
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object) DefaultChannelPipeline.java:1359
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.DefaultChannelPipeline.fireChannelRead(Object) DefaultChannelPipeline.java:935
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read() AbstractNioByteChannel.java:134
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel) NioEventLoop.java:645
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Set) NioEventLoop.java:545
io.netty.channel.nio.NioEventLoop.processSelectedKeys() NioEventLoop.java:499
io.netty.channel.nio.NioEventLoop.run() NioEventLoop.java:459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
java.lang.Thread.run() Thread.java:748

All 9 comments

Do you have any indication that there is a memory leak in Elasticsearch? Unless you have an indication that there is a memory leak or other problem internal to Elasticsearch, I will close this issue as not being a bug. OutOfMemoryErrors can happen due to overloading the cluster with aggregations and/or indexing. For recommendations and help with fixing these issues, you can start a new thread at https://discuss.elastic.co/c/elasticsearch. You may also look at the heap dump that you have in a tool like Eclipse Memory Analyzer (MAT) to provide more details when asking for help in the forums.
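If it helps, a rough sketch of doing that headlessly with MAT (the script name and report ID are taken from recent MAT releases and may differ in your version; the dump file name is illustrative):

# Parse the dump without the GUI and generate MAT's "Leak Suspects" report
./ParseHeapDump.sh java_pid12345.hprof org.eclipse.mat.api:suspects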

Hey @jaymode! Shouldn't the request circuit breaker protect us from exactly this case? It's set on our cluster with its default value.
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker
@danielmitterdorfer @jimczi @clintongormley looks like you have already discussed this here: https://github.com/elastic/elasticsearch/issues/20250
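For reference, the setting in question is untouched on our cluster, i.e. effectively the documented default (shown here only for context, we did not set it explicitly):

# elasticsearch.yml -- request breaker limit at its default
indices.breaker.request.limit: 60%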

The circuit breakers are a best-effort attempt to prevent OOM, but it is still possible to overload Elasticsearch and get an OOM. For example, you might have the breaker set to 60% of the total heap, but you may not actually have 60% of your total heap free, so you can still get an OOM.
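For illustration (the commands assume a node reachable at localhost:9200), you can compare what the breakers think they have reserved with what the heap actually holds:

# Per-breaker accounting: estimated_size vs. limit_size on each node
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'

# Actual heap usage per node, for comparison
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'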

That's a pretty strange way to treat this. IMHO, a query, no matter how complex, should not crash an entire cluster if it is properly configured.

Hey @jaymode,
according to the circuit breaker documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker
the default is 60% of the JVM heap, not of the machine's total memory. We didn't change the circuit breaker defaults, and we set the JVM heap to only half of the total machine memory. My question is: is it possible to configure Elasticsearch so that it doesn't crash with an OOM, or is this by design and the solution is to add more nodes? (A sketch of our sizing is below.)
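A minimal sketch of that sizing, assuming for illustration a 64 GB machine (the numbers are examples, not our exact production values):

# jvm.options -- heap fixed at roughly half of RAM, kept just under the
# ~32 GB compressed-oops cutoff
-Xms31g
-Xmx31g

# circuit breakers left at their documented defaults, e.g.
# indices.breaker.request.limit: 60%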

@ArielCoralogix We are continuously working on improving our handling of memory and adding safeguards to prevent OOM errors. This issue is closed because there is nothing more to go on than “some data nodes crashed with OOM and I can give you a heap dump”, which is not actionable. We use GitHub for confirmed bugs and feature requests, and our forums as the place to get help for issues like this. There are other open issues for specific items that relate to the circuit breakers.

@amnons77 I was referring to the JVM heap in my previous answer. Today we cannot prevent OOMs 100% of the time. I cannot give you an answer without more details, and the forum is the place to get help with these kinds of questions.

@jaymode IMHO an OOM is always a bug, and the memory dump should be analyzed to find the root cause.
We are trying to analyze it ourselves, but of course it is much easier for someone who is more familiar with the code base :)
Regarding more information, please let us know what you need. As this is a production issue in a multi-user environment, it is hard to provide the specific use case. I would argue that this is why Elasticsearch creates a memory dump on OOM by default.
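(For context, the stock jvm.options shipped with 6.x already enables the dump on OOM; the path line below is only an illustrative assumption:)

# Dump the heap when an OutOfMemoryError is thrown (on by default in the shipped jvm.options)
-XX:+HeapDumpOnOutOfMemoryError
# Optionally point the dump at a disk with enough free space (illustrative path)
# -XX:HeapDumpPath=/var/lib/elasticsearch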

IMHO an OOM is always a bug

As a development team, we try our best to prevent OOMs. There are cases of OOM that we know need work, and there are other cases we cannot control, for example when high GC overhead leads the JVM to throw an OOM even though memory can still be allocated.

the memory dump should be analyzed to find the root cause.

GitHub is not the right place for analysis. As I mentioned earlier, the forums would be a good place to ask for help. Developers and community members are active on the forums.

Regarding more information, please let us know what you need

Cluster Topology
Number of CPUs and load
Amount of memory
Size of heap
Changed JVM parameters
Number of indices and shards
Number of search req/s
Did anything spike at the time of the issue?
SSDs/Spinning Disks
Logs
Any GC logs?

Basically, as much information as you can provide when you ask for help on the forums.
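Most of that can be gathered with a handful of API calls (a node at localhost:9200 is assumed; adjust to your setup):

# Cluster-wide topology, shard counts, heap and OS memory totals
curl -s 'localhost:9200/_cluster/stats?pretty'

# Per-node heap, CPU and load at a glance
curl -s 'localhost:9200/_cat/nodes?v'

# Number of indices and shards
curl -s 'localhost:9200/_cat/indices?v'

# JVM version and the exact flags each node was started with
curl -s 'localhost:9200/_nodes/jvm?pretty'

# Heap, GC, and search/indexing counters per node
curl -s 'localhost:9200/_nodes/stats/jvm,os,indices?pretty'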

@jaymode From the memory dump, it looks like org.apache.lucene.search.DisjunctionMaxQuery objects account for ~70% of the heap. Will you please consider reopening the bug, or help us get to the bottom of this crash?
All of the information you requested, plus our observations from the memory dump, is in the discussion.
Thanks!
