Hi,
We're experiencing a critical production issue in Elasticsearch 6.2.2 related to open_file_descriptors. The cluster is, as far as possible, an exact replica of a 5.2.2 cluster, and documents are indexed into both clusters in parallel.
While the indexing performance of the new cluster seems to be at least as good as that of the 5.2.2 cluster, the open_file_descriptors count on the new nodes is reaching record-breaking levels (especially when compared to v5.2.2).
Elasticsearch version
6.2.2
Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
All machines have a ulimit of 65536, as recommended by the official documentation.
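For reference, this is roughly how we verify that the limit is actually in effect for the running process (a minimal sketch; the pgrep lookup is just one way to find the PID, and localhost:9200 is assumed):
# Find the Elasticsearch PID and check the effective per-process limit
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
grep "Max open files" /proc/$ES_PID/limits
# Cross-check with what Elasticsearch itself reports
curl -s 'localhost:9200/_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors&pretty'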
All nodes in the v5.2.2 cluster stay at up to ~4,500 open_file_descriptors. The v6.2.2 nodes are split: some also stay at up to ~4,500 open_file_descriptors, while others consistently open more and more file descriptors until they reach the limit and crash with java.nio.file.FileSystemException: Too many open files -
[WARN ][o.e.c.a.s.ShardStateAction] [prod-elasticsearch-master-002] [newlogs_20180315-01][0] received shard failed for shard id [[newlogs_20180315-01][0]], allocation id [G8NGOPNHRNuqNKYKzfiPcg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [FileSystemException[/mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files]]
java.nio.file.FileSystemException: /mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_151]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_151]
at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:179) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.TranslogWriter.writeCheckpoint(TranslogWriter.java:419) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:362) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:699) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:724) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$3.write(IndexShard.java:2336) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:107) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcess(AsyncIOProcessor.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:82) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.sync(IndexShard.java:2358) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:362) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.<init>(TransportWriteAction.java:160) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:133) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:110) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1034) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1012) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:103) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:359) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:299) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:975) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:972) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2220) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:984) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:98) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:320) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:295) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:282) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
After this exception, some of the nodes throw many more exceptions and drop back to a lower number of open file descriptors; at other times they simply crash. The issue then repeats itself, alternating between nodes.
Steps to reproduce:
None yet. This is a production-scale issue, so it is not easy to reproduce.
I'd be happy to provide additional details, whatever is needed.
Thanks!
Would you please take the PID of an Elasticsearch process and, as the number of open file descriptors grows, capture the output of ls -al /proc/<PID>/fd and send the output to me privately (first name at elastic.co)? You might want to repeat this a few times as the count grows so we can diff the snapshots and easily see what it is growing on.
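Something like the following capture loop should work (a minimal sketch; adjust the interval, and <PID> is the Elasticsearch process id):
ES_PID=<PID>
# Snapshot the open file descriptors every 5 minutes
while true; do
  ls -al /proc/$ES_PID/fd > fd-$(date +%Y%m%dT%H%M%S).txt
  sleep 300
done
# Later, diff two snapshots on the link targets to see what keeps growing
diff <(awk '{print $NF}' fd-<t1>.txt | sort) <(awk '{print $NF}' fd-<t2>.txt | sort)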
When you say the node is crashing, do you have logs for this?
Pinging @elastic/es-core-infra
Hi Jason,
I've sent you several files with the output of ls -al /proc/<PID>/fd, taken at a number of timestamps over a 17-minute window during which the open file descriptor count increased from ~99k to ~105k.
Nodes that reach the ulimit actually crash the process. Here are the last few log entries of a node which crashed:
[2018-03-14T03:27:47,930][WARN ][o.e.i.e.Engine ] [prod-elasticsearch-data-028] [newlogs_20180314-01][1] failed to read latest segment infos on flush
java.nio.file.FileSystemException: /mnt/nodes/0/indices/-mwGJUcDRXKRtezeXPaZSA/1/index: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427) ~[?:?]
at java.nio.file.Files.newDirectoryStream(Files.java:457) ~[?:1.8.0_151]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:655) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:627) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:434) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:123) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:202) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:187) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.refreshLastCommittedSegmentInfos(InternalEngine.java:1599) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1570) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1030) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$4.doRun(IndexShard.java:2403) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-03-14T03:27:48,040][WARN ][o.e.i.s.IndexShard ] [prod-elasticsearch-data-028] [newlogs_20180314-01][1] failed to flush index
org.elasticsearch.index.engine.FlushFailedEngineException: Flush failed
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1568) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1030) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$4.doRun(IndexShard.java:2403) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: java.nio.file.FileSystemException: /mnt/nodes/0/indices/-mwGJUcDRXKRtezeXPaZSA/1/translog/translog-24471.ckp: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:243) ~[?:?]
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253) ~[?:?]
at java.nio.file.Files.copy(Files.java:1274) ~[?:1.8.0_151]
at org.elasticsearch.index.translog.Translog.rollGeneration(Translog.java:1539) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1560) ~[elasticsearch-6.2.2.jar:6.2.2]
... 7 more
[2018-03-14T03:27:48,398][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [prod-elasticsearch-data-028] fatal error in thread [elasticsearch[prod-elasticsearch-data-028][bulk][T#2]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1779) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1796) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:904) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2314) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2292) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2291) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1663) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.add(Translog.java:504) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:886) ~[elasticsearch-6.2.2.jar:6.2.2]
... 24 more
Could this be related to #25005? https://github.com/elastic/elasticsearch/pull/25005
From running sudo lsof -p <es process>, 90% of the file descriptors belong to translog-xxxxx.tlog files. It seems like Elasticsearch never deletes these files.
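For anyone who wants to reproduce that breakdown, this is roughly the tally we ran over the lsof output (a quick sketch; it simply groups the open files by extension):
# Count open descriptors per file extension to see what dominates
sudo lsof -p $ES_PID -n | awk '{print $NF}' | sed -n 's/.*\.\([a-z0-9]*\)$/\1/p' | sort | uniq -c | sort -rn | head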
@farin99 can you describe how many active shards you have on each node - i.e., how many shards receive index operations in a 12 hour period?
Hey @bleskes
Answering for @farin99 here - we have a few hundred active shards per node.
Hey @bleskes,
Sorry, my previous comment described the number of open shards on the entire cluster. We have about 100 open shards per node per 12 hours.
By the way, this same setup now runs in parallel (data fan-out) on version 5.2 with no issues and about 5,000 open file descriptors, while some nodes of the 6.2 cluster have already exceeded 150,000.
@amnons77 ok, if trimming the retention policy doesn't help we need to figure out what's going on. Let's take this to the forums to sort it out. Can you post a message, including the output of nodes stats, on http://discuss.elastic.co and link it here? Once we know what it is we can, if needed, come back to GitHub.
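Something along these lines will do for collecting the stats (a sketch, assuming the default localhost:9200 endpoint):
# Full nodes stats output to attach to the Discuss post
curl -s 'localhost:9200/_nodes/stats?pretty' > nodes_stats.json
# Quick per-node view of the file descriptor counters
curl -s 'localhost:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.percent,file_desc.max'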
Hey @bleskes, I am referring to the ticket opened by our colleague @redlus on the Discuss forum: https://discuss.elastic.co/t/elasticsearch-6-2-2-nodes-crash-after-reaching-ulimit-setting/124172
Also, Lior ran 'GET /_stats?include_segment_file_sizes&level=shards' and sent the result files from 3 different points in time to @nhat, who is also active on that thread.
The translog for four of your shards (from three distinct indices) is behaving in an unanticipated way. Could you please share the index settings for:
DN_QfOruR5SMVRavPDnRfw
rRrAxVZzSye4W_-exBocCA
r-cR4xWkSFinkKg_RqiDsA
(from /{index}/_settings). I do not know the names of the indices that these correspond to; you will have to backtrace.
And can you also share the shard stats for these indices (from /{index}/_stats?level=shards&include_segment_file_sizes=true)? You can email them to the same email address that you used before.
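A quick way to map those UUIDs back to index names and collect the requested output (a sketch, assuming the default localhost:9200 endpoint):
# Map index UUIDs to index names
curl -s 'localhost:9200/_cat/indices?v&h=index,uuid' | grep -E 'DN_QfOruR5SMVRavPDnRfw|rRrAxVZzSye4W_-exBocCA|r-cR4xWkSFinkKg_RqiDsA'
# Then, for each matching index name:
curl -s "localhost:9200/<index>/_settings?pretty" > settings_<index>.json
curl -s "localhost:9200/<index>/_stats?level=shards&include_segment_file_sizes=true&pretty" > stats_<index>.json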
I opened #29125 to propose a fix for this.
The problem is that one replica got into an endless loop of flushing.
Pinging @elastic/es-distributed
Hey @dnhatn @bleskes and @jasontedor,
We can confirm this affects v6.0.2 as well:
At 09:22 the node enters a state where the number of open file descriptors increases monotonically.
This is accompanied by a higher load average, higher CPU, and higher total indices actions and indices operation time.
PR #29125 changes the mechanism controlling the flush operation, and although it has been merged to master, the latest Elasticsearch release does not include it. We've therefore built v6.2.4-SNAPSHOT from master and have been running it for the last two weeks at production scale. The issue seems to be resolved, and everything works as expected.
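For anyone wanting to do the same, the build was roughly the following (a sketch; pick whichever branch actually contains the fix, and the exact Gradle output path varies by branch):
git clone https://github.com/elastic/elasticsearch.git
cd elasticsearch
git checkout 6.2          # or whichever branch includes the #29125 change
./gradlew assemble        # produces the distribution archives under distribution/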
Thank you for the quick response and handling of the issue!
@redlus thanks for testing and letting us know. Happy to hear it worked.