Elasticsearch: Bulk task queue grows infinitely after upgrading to 6.2.1 (from 5.6.4)

Created on 17 Feb 2018  路  7Comments  路  Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version):
Version: 6.2.1, Build: 7299dc3/2018-02-07T19:34:26.990113Z, JVM: 1.8.0_131

Plugins installed: []
analysis-icu

JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
FreeBSD 11.1

Description of the problem including expected versus actual behavior:
After upgrading to 6.2.1, one of the data (and also client) nodes reject all bulk operations. In the reject response queued tasks grow infinitely while completed tasks doesn't change (see attached logs).

Steps to reproduce:
We have a lot of different data and indices spread over 40 nodes. So far I could observe this error only on one node. When I try to restart the node with kill, it doesn't stop. Below is the stacktrace.

Provide logs (if relevant):
This is just about a minute (logged by an application, uses python elasticsearch client). queued tasks grow, while completed tasks doesn't.
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@3684362a on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14926, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@11961ec on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14939, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@573952db on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14951, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@573952db on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14951, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4127dfd2 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14960, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4c6147ea on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14961, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4958c58e on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14960, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@7acb9dba on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14967, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@75af7186 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14967, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@7830996e on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14972, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@513cdd87 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14979, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@1166d220 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14983, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@498b99a2 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14998, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@64e61a2f on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15005, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@372ab4a0 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15006, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4381386b on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15006, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@2c661b6d on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15025, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@5fcb36db on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15025, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@21834e56 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15038, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@58444a25 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15038, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@25da450 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15047, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@6f3c0966 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15047, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@27485a8e on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15050, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'} with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@170e6b30 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15056, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}

And this is the stacktrace after I tried to kill the node and it didn't stop:
https://pastebin.com/rnzESu5B

:DistributeEngine >bug

All 7 comments

Hi @bra-fsn, we reserve Github for bug reports and feature requests only. Please ask questions like these in the Elasticsearch forum instead. Thank you!

Some hints to get you started:

  • The stack traces indicate that the cluster is doing a couple of update and delete operations when you took them.
  • You should check the documentation on upgrades.
  • You can use the hot threads API to which operations take a significant amount of time (in your case it might make sense to set ignore_idle_threads to false)

There's a deadlock in InternalEngine (the stack dump even says so). /cc: @s1monw

The stack trace below should be pretty self-explanatory of what's going on:

Found one Java-level deadlock:

============================="


"elasticsearch[fmfe16][index][T#24]":

  waiting for ownable synchronizer 0x00000009df4b02c0, (a java.util.co\
    ncurrent.locks.ReentrantLock$NonfairSync)"
,
      which is held by "elasticsearch[fmfe16][index][T#9]\
    ""


"elasticsearch[fmfe16][index][T#9]":

  waiting to lock monitor 0x0000000afc30c4e8"
 (object 0x00000008b40309c8, a org.elasticsearch.index.engine.LiveVers\
    ionMap)"
,
      which is held by "elasticsearch[fmfe16][get][T#14]""


"elasticsearch[fmfe16][get][T#14]":

  waiting for ownable synchronizer 0x00000009df4b02c0, (a java.util.co\
    ncurrent.locks.ReentrantLock$NonfairSync)"
,
      which is held by "elasticsearch[fmfe16][index][T#9]\
    ""




Java stack information for the threads listed above:

===================================================

"elasticsearch[fmfe16][index][T#24]":

    at sun.misc.Unsafe.park(Native Method)

    - parking to wait for  <0x00000009df4b02c0> (a java.util.concu\
    rrent.locks.ReentrantLock$NonfairSync)

    at java.util.concurrent.locks.LockSupport.park(LockSupport.jav\
    a:175)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkA\
    ndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acqui\
    reQueued(AbstractQueuedSynchronizer.java:870)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acqui\
    re(AbstractQueuedSynchronizer.java:1199)

    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(R\
    eentrantLock.java:209)

    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock\
    .java:285)

    at org.elasticsearch.common.util.concurrent.KeyedLock.acquire(\
    KeyedLock.java:76)

    at org.elasticsearch.index.engine.LiveVersionMap.acquireLock(L\
    iveVersionMap.java:431)

    at org.elasticsearch.index.engine.InternalEngine.pruneDeletedT\
    ombstones(InternalEngine.java:1661)

    at org.elasticsearch.index.engine.InternalEngine.maybePruneDel\
    etedTombstones(InternalEngine.java:1348)

    at org.elasticsearch.index.engine.InternalEngine.refresh(Inter\
    nalEngine.java:1427)

    at org.elasticsearch.index.engine.InternalEngine.get(InternalE\
    ngine.java:637)

    at org.elasticsearch.index.shard.IndexShard.get(IndexShard.jav\
    a:850)

    at org.elasticsearch.index.get.ShardGetService.innerGet(ShardG\
    etService.java:153)

    at org.elasticsearch.index.get.ShardGetService.get(ShardGetSer\
    vice.java:81)

    at org.elasticsearch.action.update.UpdateHelper.prepare(Update\
    Helper.java:74)

    at org.elasticsearch.action.update.TransportUpdateAction.shard\
    Operation(TransportUpdateAction.java:174)

    at org.elasticsearch.action.update.TransportUpdateAction.shard\
    Operation(TransportUpdateAction.java:167)

    at org.elasticsearch.action.update.TransportUpdateAction.shard\
    Operation(TransportUpdateAction.java:67)

    at org.elasticsearch.action.support.single.instance.TransportI\
    nstanceSingleOperationAction$ShardTransportHandler.messageReceived(Tra\
    nsportInstanceSingleOperationAction.java:243)

    at org.elasticsearch.action.support.single.instance.TransportI\
    nstanceSingleOperationAction$ShardTransportHandler.messageReceived(Tra\
    nsportInstanceSingleOperationAction.java:239)

    at org.elasticsearch.transport.TransportRequestHandler.message\
    Received(TransportRequestHandler.java:30)

    at org.elasticsearch.transport.RequestHandlerRegistry.processM\
    essageReceived(RequestHandlerRegistry.java:66)

    at org.elasticsearch.transport.TcpTransport$RequestHandler.doR\
    un(TcpTransport.java:1555)

    at org.elasticsearch.common.util.concurrent.ThreadContext$Cont\
    extPreservingAbstractRunnable.doRun(ThreadContext.java:635)

    at org.elasticsearch.common.util.concurrent.AbstractRunnable.r\
    un(AbstractRunnable.java:37)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoo\
    lExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPo\
    olExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

"elasticsearch[fmfe16][index][T#9]":

    at org.elasticsearch.index.engine.InternalEngine.getVersionFro\
    mMap(InternalEngine.java:732)

    - waiting to lock <0x00000008b40309c8> "
(a org.elasticsearch.index.engine.LiveVersionMap)

    at org.elasticsearch.index.engine.InternalEngine.get(InternalE\
    ngine.java:627)

    at org.elasticsearch.index.shard.IndexShard.get(IndexShard.jav\
    a:850)

    at org.elasticsearch.index.get.ShardGetService.innerGet(ShardG\
    etService.java:153)

    at org.elasticsearch.index.get.ShardGetService.get(ShardGetSer\
    vice.java:81)

    at org.elasticsearch.action.update.UpdateHelper.prepare(Update\
    Helper.java:74)

    at org.elasticsearch.action.update.TransportUpdateAction.shard\
    Operation(TransportUpdateAction.java:174)

    at org.elasticsearch.action.update.TransportUpdateAction.shard\
    Operation(TransportUpdateAction.java:167)

    at org.elasticsearch.action.update.TransportUpdateAction.shard\
    Operation(TransportUpdateAction.java:67)

    at org.elasticsearch.action.support.single.instance.TransportI\
    nstanceSingleOperationAction$ShardTransportHandler.messageReceived(Tra\
    nsportInstanceSingleOperationAction.java:243)

    at org.elasticsearch.action.support.single.instance.TransportI\
    nstanceSingleOperationAction$ShardTransportHandler.messageReceived(Tra\
    nsportInstanceSingleOperationAction.java:239)

    at org.elasticsearch.transport.TransportRequestHandler.message\
    Received(TransportRequestHandler.java:30)

    at org.elasticsearch.transport.RequestHandlerRegistry.processM\
    essageReceived(RequestHandlerRegistry.java:66)

    at org.elasticsearch.transport.TcpTransport$RequestHandler.doR\
    un(TcpTransport.java:1555)

    at org.elasticsearch.common.util.concurrent.ThreadContext$Cont\
    extPreservingAbstractRunnable.doRun(ThreadContext.java:635)

    at org.elasticsearch.common.util.concurrent.AbstractRunnable.r\
    un(AbstractRunnable.java:37)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoo\
    lExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPo\
    olExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

"elasticsearch[fmfe16][get][T#14]":

    at sun.misc.Unsafe.park(Native Method)

    - parking to wait for  <0x00000009df4b02c0> (a java.util.concu\
    rrent.locks.ReentrantLock$NonfairSync)

    at java.util.concurrent.locks.LockSupport.park(LockSupport.jav\
    a:175)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkA\
    ndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acqui\
    reQueued(AbstractQueuedSynchronizer.java:870)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acqui\
    re(AbstractQueuedSynchronizer.java:1199)

    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(R\
    eentrantLock.java:209)

    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock\
    .java:285)

    at org.elasticsearch.common.util.concurrent.KeyedLock.acquire(\
    KeyedLock.java:76)

    at org.elasticsearch.index.engine.LiveVersionMap.acquireLock(L\
    iveVersionMap.java:431)

    at org.elasticsearch.index.engine.InternalEngine.pruneDeletedT\
    ombstones(InternalEngine.java:1661)

    at org.elasticsearch.index.engine.InternalEngine.maybePruneDel\
    etedTombstones(InternalEngine.java:1348)

    at org.elasticsearch.index.engine.InternalEngine.refresh(Inter\
    nalEngine.java:1427)

    at org.elasticsearch.index.engine.InternalEngine.getVersionFro\
    mMap(InternalEngine.java:737)

    - locked <0x00000008b40309c8> "
(a org.elasticsearch.index.engine.LiveVersionMap)

    at org.elasticsearch.index.engine.InternalEngine.get(InternalE\
    ngine.java:627)

    at org.elasticsearch.index.shard.IndexShard.get(IndexShard.jav\
    a:850)

    at org.elasticsearch.index.get.ShardGetService.innerGet(ShardG\
    etService.java:153)

    at org.elasticsearch.index.get.ShardGetService.get(ShardGetSer\
    vice.java:81)

    at org.elasticsearch.action.get.TransportGetAction.shardOperat\
    ion(TransportGetAction.java:88)

    at org.elasticsearch.action.get.TransportGetAction.shardOperat\
    ion(TransportGetAction.java:44)

    at org.elasticsearch.action.support.single.shard.TransportSing\
    leShardAction$ShardTransportHandler.messageReceived(TransportSingleSha\
    rdAction.java:293)

    at org.elasticsearch.action.support.single.shard.TransportSing\
    leShardAction$ShardTransportHandler.messageReceived(TransportSingleSha\
    rdAction.java:286)

    at org.elasticsearch.transport.TransportRequestHandler.message\
    Received(TransportRequestHandler.java:30)

    at org.elasticsearch.transport.RequestHandlerRegistry.processM\
    essageReceived(RequestHandlerRegistry.java:66)

    at org.elasticsearch.transport.TcpTransport$RequestHandler.doR\
    un(TcpTransport.java:1555)

    at org.elasticsearch.common.util.concurrent.ThreadContext$Cont\
    extPreservingAbstractRunnable.doRun(ThreadContext.java:635)

    at org.elasticsearch.common.util.concurrent.AbstractRunnable.r\
    un(AbstractRunnable.java:37)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoo\
    lExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPo\
    olExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)


    Found 1 deadlock.

@danielmitterdorfer: this is a bug report, not a question...

@ywelsch thanks for double-checking. I did not spot the deadlock in the thread dump originally.

I am looking into it

@bra-fsn thanks for reporting this. @danielmitterdorfer @ywelsch I opened a pr for this.

@s1monw : great, thanks for fixing it!
Also thanks @ywelsch for not letting this down into the sink. :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

DhairyashilBhosale picture DhairyashilBhosale  路  3Comments

clintongormley picture clintongormley  路  3Comments

brwe picture brwe  路  3Comments

clintongormley picture clintongormley  路  3Comments

makeyang picture makeyang  路  3Comments