Elasticsearch: Merge failure due to: java.lang.IllegalStateException: totalPointCount=4424 was passed when we were created, but we just hit 0329 values

Created on 26 Dec 2019 · 25 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version):

Version: 7.4.1, Build: oss/docker/fc0eeb6e2c25915d63d871d344e3d0b45ea0ea1e/2019-10-22T17:16:35.176724Z, JVM: 13

(deployed in k8s with 3 data nodes and 3 dedicated masters)

Plugins installed:

analysis-url   1.0.0 (custom analysis plugin providing a token filter for URLs)
repository-gcs 7.4.1
repository-s3  7.4.1

JVM version (java -version):

openjdk version "13" 2019-09-17
OpenJDK Runtime Environment AdoptOpenJDK (build 13+33)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13+33, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux elasticsearch-0 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Yesterday we had an ES cluster repeatedly going red due to merge failures. In each case it was due to the error: java.lang.IllegalStateException: totalPointCount=4424 was passed when we were created, but we just hit 0329 values.

This was seen a total of 12 times in the last 24 hours, on 3 different indices and 2 different nodes.

After the merge failure ES does not attempt to reallocate the shard and the cluster goes yellow/red.

Doing a /_cluster/reroute?retry_failed=true causes the shard to be allocated instantaneously and the cluster health returns to green.

Attempting a force merge on the index, once it has reallocated, succeeds.
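
For reference, the two recovery steps above are roughly the following calls (a minimal sketch, assuming the cluster is reachable on localhost:9200; the index name is one of the affected indices from the logs below):

  # Retry allocation of shards whose allocation retries were exhausted after the merge failure
  curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

  # Once the shard is assigned again, force merge the affected index
  curl -XPOST 'http://localhost:9200/v27.tcpevent-shrink-000188/_forcemerge'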

We have not found any other cases of this error in any of our historical logs (30 days) or in any of our other clusters (we have 2 other identically configured clusters in different regions, as well as multiple test clusters).

The error seems to come from Lucene, though I don't know enough about ES/Lucene internals to trace this state back.

We tried restarting one of the nodes that was getting into this state and have not seen a failure from it since then. However, these failures have been so infrequent that I'm not sure how significant that is, and I would like to understand what the problem is before restarting other nodes where this has been seen.

Provide logs (if relevant):

{"type": "server", "timestamp": "2019-12-25T15:32:55,877Z", "level": "WARN", "component": "o.e.i.e.Engine", "cluster.name": "elasticsearch", "node.name": "elasticsearch-0", "message": " [v27.tcpevent-shrink-000188][0] failed engine [merge failed]", "cluster.uuid": "Wk2FFMJbQGKUAPPeZjek2w", "node.id": "45jcOptdTjqnA_fG1punmg" , 
"stacktrace": ["org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2389) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.4.1.jar:7.4.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.apache.lucene.util.bkd.BKDWriter$OneDimensionBKDWriter.add(BKDWriter.java:562) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:497) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:213) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:202) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:162) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4462) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4056) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]"] }
{"type": "server", "timestamp": "2019-12-25T15:32:55,885Z", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-0", "message": "[v27.tcpevent-shrink-000188][0] marking and sending shard failed due to [shard failure, reason [merge failed]]", "cluster.uuid": "Wk2FFMJbQGKUAPPeZjek2w", "node.id": "45jcOptdTjqnA_fG1punmg" , 
"stacktrace": ["org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2389) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.apache.lucene.util.bkd.BKDWriter$OneDimensionBKDWriter.add(BKDWriter.java:562) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:497) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:213) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:202) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:162) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4462) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4056) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]"] }

All 25 comments

Pinging @elastic/es-analytics-geo (:Analytics/Geo)

@nknize @iverase Can you please take a look?

Since first posting this we've now started seeing these failures on a second cluster, and we are also seeing a different stack trace from merge failures that I think is related:

{"type": "server", "timestamp": "2020-01-02T09:14:42,067Z", "level": "WARN", "component": "o.e.i.e.Engine", "cluster.name": "elasticsearch", "node.name": "elasticsearch-0", "message": " [v27.tcpevent-shrink-000191][0] failed engine [merge failed]", "cluster.uuid": "Wk2FFMJbQGKUAPPeZjek2w", "node.id": "45jcOptdTjqnA_fG1punmg" , 
"stacktrace": ["org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 16",
"at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2389) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.4.1.jar:7.4.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 16",
"at java.util.Arrays.rangeCheck(Arrays.java:124) ~[?:?]",
"at java.util.Arrays.compareUnsigned(Arrays.java:6126) ~[?:?]",
"at org.apache.lucene.util.bkd.BKDWriter$BKDMergeQueue.lessThan(BKDWriter.java:356) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter$BKDMergeQueue.lessThan(BKDWriter.java:344) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:254) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:131) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:487) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:213) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:202) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:162) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4462) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4056) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]"] }

I can open a separate issue for this, but both cases look to be caused by some state in BKDWriter getting corrupted.

Hi @toby-jn,

I don't think we need a separate issue as I think they are related as well. Would it be possible to share the mapping of the index that is failing?

In addition, I have one more question: I see the indices have names like v27.tcpevent-shrink-000191. Have they been created using the shrink index API?

Hi, thanks for taking a look.

I've attached mappings for the two index templates we've seen this issue on:

v27.agentprocessfileevent-mapping.txt
v27.tcpevent-mapping.txt

We use the rollover API to manage our time-based indexing, and we use the shrink API and force merge when indices reach ~15 GB. A job runs periodically to delete-by-query documents older than our retention period, and I believe these merges are the automatically scheduled merges clearing out the deleted documents.
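
To make that lifecycle concrete, the calls involved look roughly like this (a hedged sketch; the v27.tcpevent write alias, the pre-shrink index name, the rollover condition, and max_num_segments=1 are illustrative assumptions rather than our exact configuration, and shrink prerequisites such as the write block and single-node allocation are omitted):

  # Roll over the write alias once the active index is big enough
  curl -XPOST 'http://localhost:9200/v27.tcpevent/_rollover' \
    -H 'Content-Type: application/json' \
    -d '{"conditions": {"max_size": "15gb"}}'

  # Shrink the rolled-over index down to a single primary shard ...
  curl -XPOST 'http://localhost:9200/v27.tcpevent-000188/_shrink/v27.tcpevent-shrink-000188' \
    -H 'Content-Type: application/json' \
    -d '{"settings": {"index.number_of_shards": 1}}'

  # ... and force merge the shrunken index
  curl -XPOST 'http://localhost:9200/v27.tcpevent-shrink-000188/_forcemerge?max_num_segments=1'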

I've been collecting some statistics on failures. So far we've had 494 errors since 2019-12-24T10:31:41: 145 of the totalPointCount error and 349 of the ArrayIndexOutOfBoundsException. Across each index we've seen:

index                                     totalPointCount errors   ArrayIndexOutOfBounds errors
v27.agentprocessfileevent-shrink-000269   2                        0
v27.agentprocessfileevent-shrink-000270   2                        0
v27.agentprocessfileevent-shrink-000272   104                      0
v27.agentprocessfileevent-shrink-000278   1                        0
v27.agentprocessfileevent-shrink-000286   4                        0
v27.tcpevent-shrink-000185                2                        142
v27.tcpevent-shrink-000188                13                       0
v27.tcpevent-shrink-000190                17                       118
v27.tcpevent-shrink-000191                0                        86
v27.tcpevent-shrink-000201                0                        1
v27.tcpevent-shrink-000204                0                        1
v27.tcpevent-shrink-000206                0                        1

The cluster has a total of 233 indices; 98 of these are v27.agentprocessfileevent-shrink indices and 77 are v27.tcpevent-shrink indices.

Thank you for this information, it sheds more light on the issue.

One last question concerns the timing: the issue first appeared on December 24th, and I would expect some change to your cluster or operational procedures to have happened around then. Do you recall any such changes?

There were no changes on that specific day, and the most recent operational updates were made 5 days earlier (we added a node attribute that required a rolling cluster restart of ES, but no other changes were made).

It looks like this timing corresponds to around 24 days after we upgraded to ES 7, so the affected indices are all nearing the end of their retention period. I would expect them to see a lot of delete-by-query activity and therefore more merge activity, so it's possible we're just starting to see the effects of this. This cluster had previously been running ES 6.5.4, but all of the indices we've had this issue with were created in ES 7.

@iverase I appreciate there isn't a fix for this yet, but do you have any information about what may be causing it, or any more insight into what the problem could be? We're currently running a set of cron jobs retrying failed shard allocation every minute, but this isn't a great long-term solution, and if there are other operational changes we can make to mitigate this we would like to try them. We're seeing around 20-30 merge failures per day on our cluster and are concerned about potential side effects or other underlying issues here.
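
The workaround is essentially just a scheduled retry, something like the following crontab entry (a sketch only; host and schedule are placeholders):

  # Retry failed shard allocations once a minute until a proper fix is available
  * * * * * curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true' > /dev/null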

@toby-jn I have not been able to reproduce your issue, so I am still unsure what the problem is here.

One thing you could do for me is to use the Lucene jar in the lib directory and run the CheckIndex utility. The command should look like:

 java -cp /usr/share/elasticsearch/lib/lucene-core-x.x.x.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $datadir/nodes/0/indices/$index/$shard/index/

Note that it is a bit tedious, as it needs to be run for every shard. It might be slow as well, given that this command will not only verify checksums but also make sure that the index can actually be read and that its data structures are consistent with each other.
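
If it helps, the per-shard invocations can be wrapped in a small loop over the shard directories, roughly like this (untested sketch; adjust the jar version and $datadir for your deployment):

  for shard in "$datadir"/nodes/0/indices/*/*/index; do
    echo "Checking $shard"
    java -cp /usr/share/elasticsearch/lib/lucene-core-8.2.0.jar \
      -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex "$shard"
  done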

Looks like this needs to be run with ES stopped, so we'll try to schedule some downtime for it. One thing I did notice on the first index I checked of those we've seen this error on: the primary had merged down to a single segment, but the replica still had a lot of segments:

/_cat/segments/v27.agentprocessfileevent-shrink-000320?v

index                                   shard prirep ip           segment generation docs.count docs.deleted    size size.memory committed searchable version compound
v27.agentprocessfileevent-shrink-000320 0     p      10.64.132.33 _2b             83   50020427       263771    15gb    23305350 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _0               0    1530101            0 430.8mb      820161 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1               1    1131920         9530 358.2mb      621306 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _2               2    1865342        10383 551.2mb     1019448 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _3               3     334343         2796  82.6mb      184944 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _4               4    1352885        12572 438.5mb      732877 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _5               5    1744796        14466 531.6mb      970092 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _6               6    1309300        11270 427.1mb      711871 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _7               7     231449         1172  75.4mb      111788 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _8               8    1341299        13828 444.9mb      733269 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _9               9    1277097         2104 418.1mb      700411 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _a              10    1226868         8399 404.3mb      670473 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _b              11    1107346            0 367.1mb      591807 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _c              12     190765            0  61.9mb      107388 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _d              13    1222114            0 396.5mb      655998 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _e              14      37285            0  13.1mb       30790 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _i              18      93730            0    31mb       67511 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _j              19     179249            0  57.6mb      109188 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _p              25   13848729        73512   4.2gb     7208905 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _q              26    1950571        13377 561.6mb     1046458 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _r              27      39600          291  12.7mb       30358 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _s              28      78983            0    12mb       51500 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _t              29      35413            0  14.1mb       31012 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _16             42       9444            0   3.4mb       16027 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _17             43       8722            0   3.1mb       15591 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1b             47    1928557         1806   530mb     1019366 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1c             48    1360571         9524 442.7mb      777892 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1d             49    1914889        17285 562.2mb     1024724 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1e             50     363154         2062  88.3mb      193942 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1f             51    1323353        15192 431.4mb      724082 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1g             52    1678005        13021 507.9mb      927195 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1h             53     277047         1938  91.2mb      128091 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1i             54    1484407        15361 501.3mb      830046 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1j             55    1439894        11555 483.5mb      786604 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1k             56    1375682         2327 458.2mb      754077 true      true       8.2.0   false
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1l             57    1246844            0 416.5mb      681799 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1m             58    1426232            0 458.6mb      784492 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1n             59     196784            0  62.2mb      111545 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _1o             60      98496            0    34mb       70240 true      true       8.2.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _27             79     205235            0  64.7mb      123124 true      true       8.3.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _28             80      47493            0  17.8mb       38116 true      true       8.3.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _29             81    1496837            0 478.6mb      702562 true      true       8.3.0   true
v27.agentprocessfileevent-shrink-000320 0     r      10.64.128.46 _2a             82       9596          735   3.8mb       16803 true      true       8.3.0   true

This may be completely normal, and it makes some sense given the errors we're seeing, but it does seem unusual (and perhaps explains why forcing a retry and restarting nodes sometimes seems to fix the problem).

I don't think that is normal, and I was not using replicas when trying to reproduce the issue; I will try that next. If this is common for the failing indices, you might want to regenerate the replicas by setting the number of replicas to 0 and then setting it back to the desired number.
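
For reference, regenerating the replicas would look roughly like this (a sketch; substitute the affected index and your usual replica count, which appears to be 1 from the _cat/segments output above):

  # Drop the replicas for the affected index ...
  curl -XPUT 'http://localhost:9200/v27.agentprocessfileevent-shrink-000320/_settings' \
    -H 'Content-Type: application/json' \
    -d '{"index": {"number_of_replicas": 0}}'

  # ... then restore them so a fresh copy is rebuilt from the primary
  curl -XPUT 'http://localhost:9200/v27.agentprocessfileevent-shrink-000320/_settings' \
    -H 'Content-Type: application/json' \
    -d '{"index": {"number_of_replicas": 1}}'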

Another interesting piece of information is that some segments appear to have been created with Lucene 8.2.0 and others with 8.3.0.

At the beginning of the issue you said you are running Elasticsearch 7.4.1, which uses Lucene 8.2.0. Have you upgraded ES in the meantime? (Lucene 8.3.0 was introduced in Elasticsearch 7.5.0.)

Yes, we upgraded to ES 7.5.0 last week to pick up the fix for #47719: because index shrink was setting max retries to 1, the allocation was not being retried when we hit these merge failures, so we were regularly seeing indices go red. We were definitely seeing these issues before the 7.5.0 upgrade, though.

I made a copy of the segment files for that shard (the replica for v27.agentprocessfileevent-shrink-000320 which had previously had a merge failure), deleted the lock file and ran CheckIndex on that:
checkindex.txt

Doesn't seem to have any issues :(

@toby-jn just to rule out a problem with a specific node: I can see the two errors posted above come from the same node (45jcOptdTjqnA_fG1punmg), so I am wondering if the errors are always happening on the same node.

I think this was just a coincidence; we've seen these failures on multiple nodes across multiple independent (but identically configured) clusters.

This seems to be happening during merges on indices where we've done a lot of delete-by-query. I've tried reproducing it on a test cluster by doing lots of merging and delete-by-query in a loop, but haven't seen any issues there. However, we have seen it on a couple of our test stacks that specifically exercise our retention jobs, so if there's anything else we can do to debug this, we have an environment where we can sometimes reproduce it. We can try running those stacks with additional logging or special builds if needed.
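
Our repro attempt is essentially a loop of this shape (a rough sketch of the test rather than the exact script; the index name, query, and interval are placeholders):

  while true; do
    # Delete a slice of older documents ...
    curl -s -XPOST 'http://localhost:9200/test-index/_delete_by_query' \
      -H 'Content-Type: application/json' \
      -d '{"query": {"range": {"@timestamp": {"lt": "now-30d"}}}}'

    # ... then force merge so the deleted documents are reclaimed
    curl -s -XPOST 'http://localhost:9200/test-index/_forcemerge?max_num_segments=1'
    sleep 60
  done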

If you have an environment where you can reproduce the issue, we would like to ask you to try reproducing it with a different JVM so we can rule out a problem at that level.

Sorry for the delay in doing this; it took some setting up on our end. We tried with JDK 12 and unfortunately still see the errors.

We're still seeing these a lot, but have been running some cron tasks to automatically fix up these errors when we see them. I've seen that this Lucene issue is making it into the 7.7 release: https://issues.apache.org/jira/browse/LUCENE-9300. Is it a possible cause of what we're seeing?

Hi @toby-jn,

Yes, that is indeed the cause of what you are seeing. The issue has been fixed in Lucene and released as part of Lucene 8.5.1. Our development branches have been updated to that Lucene version (https://github.com/elastic/elasticsearch/pull/55229), so the fix should be part of the 7.7 release.

I am closing this issue as it should be fixed in Elasticsearch 7.7. Please feel free to reopen it if you still experience the issue.
