Elasticsearch version (bin/elasticsearch --version):
Version: 7.4.1, Build: oss/docker/fc0eeb6e2c25915d63d871d344e3d0b45ea0ea1e/2019-10-22T17:16:35.176724Z, JVM: 13
(deployed in k8s with 3 data nodes and 3 dedicated masters)
Plugins installed:
analysis-url 1.0.0 (custom analysis plugin providing a token filter for URLs)
repository-gcs 7.4.1
repository-s3 7.4.1
JVM version (java -version):
openjdk version "13" 2019-09-17
OpenJDK Runtime Environment AdoptOpenJDK (build 13+33)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13+33, mixed mode, sharing)
OS version (uname -a if on a Unix-like system):
Linux elasticsearch-0 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Yesterday one of our ES clusters repeatedly went red due to merge failures. In each case the error was: java.lang.IllegalStateException: totalPointCount=4424 was passed when we were created, but we just hit 0329 values.
We have seen this a total of 12 times in the last 24 hours, on three different indices and two different nodes.
After the merge failure ES does not attempt to reallocate the shard, and the cluster goes yellow/red.
Doing a /_cluster/reroute?retry_failed=true causes the shard to be allocated instantaneously and the cluster health returns to green.
Attempting a force merge on the index, once it has reallocated, succeeds.
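For reference, these are roughly the calls we make for those two steps (the host is a placeholder and the index name is just one of the affected indices from the logs below):
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
curl -XPOST 'http://localhost:9200/v27.tcpevent-shrink-000188/_forcemerge'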
We have not found any other cases of this error in our historical logs (30 days) or in any of our other clusters (we have two other identically configured clusters in different regions, as well as multiple test clusters).
The error seems to come from Lucene, though I don't know enough about ES/Lucene internals to trace this state back.
We tried restarting one of the nodes that was getting into this state and have not seen a failure from it since. However, these failures have been infrequent enough that I'm not sure how significant that is, and we would like to understand what the problem is before restarting other nodes where this has been seen.
Provide logs (if relevant):
{"type": "server", "timestamp": "2019-12-25T15:32:55,877Z", "level": "WARN", "component": "o.e.i.e.Engine", "cluster.name": "elasticsearch", "node.name": "elasticsearch-0", "message": " [v27.tcpevent-shrink-000188][0] failed engine [merge failed]", "cluster.uuid": "Wk2FFMJbQGKUAPPeZjek2w", "node.id": "45jcOptdTjqnA_fG1punmg" ,
"stacktrace": ["org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2389) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.4.1.jar:7.4.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.apache.lucene.util.bkd.BKDWriter$OneDimensionBKDWriter.add(BKDWriter.java:562) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:497) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:213) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:202) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:162) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4462) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4056) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]"] }
{"type": "server", "timestamp": "2019-12-25T15:32:55,885Z", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-0", "message": "[v27.tcpevent-shrink-000188][0] marking and sending shard failed due to [shard failure, reason [merge failed]]", "cluster.uuid": "Wk2FFMJbQGKUAPPeZjek2w", "node.id": "45jcOptdTjqnA_fG1punmg" ,
"stacktrace": ["org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2389) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.IllegalStateException: totalPointCount=2051853 was passed when we were created, but we just hit 0782 values",
"at org.apache.lucene.util.bkd.BKDWriter$OneDimensionBKDWriter.add(BKDWriter.java:562) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:497) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:213) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:202) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:162) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4462) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4056) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]"] }
Pinging @elastic/es-analytics-geo (:Analytics/Geo)
@nknize @iverase Can you please take a look?
Since first posting this, we've started seeing these failures on a second cluster, and we're also seeing a different backtrace from merge failures that I think is related:
{"type": "server", "timestamp": "2020-01-02T09:14:42,067Z", "level": "WARN", "component": "o.e.i.e.Engine", "cluster.name": "elasticsearch", "node.name": "elasticsearch-0", "message": " [v27.tcpevent-shrink-000191][0] failed engine [merge failed]", "cluster.uuid": "Wk2FFMJbQGKUAPPeZjek2w", "node.id": "45jcOptdTjqnA_fG1punmg" ,
"stacktrace": ["org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 16",
"at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2389) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [elasticsearch-7.4.1.jar:7.4.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.4.1.jar:7.4.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 16",
"at java.util.Arrays.rangeCheck(Arrays.java:124) ~[?:?]",
"at java.util.Arrays.compareUnsigned(Arrays.java:6126) ~[?:?]",
"at org.apache.lucene.util.bkd.BKDWriter$BKDMergeQueue.lessThan(BKDWriter.java:356) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter$BKDMergeQueue.lessThan(BKDWriter.java:344) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:254) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:131) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:487) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:213) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:202) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:162) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4462) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4056) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]",
"at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.4.1.jar:7.4.1]",
"at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]"] }
I can open a separate issue for this, but both cases look to be caused by some state in BKDWriter getting corrupted.
Hi @toby-jn,
I don't think we need a separate issue, as I agree they are related. Would it be possible to share the mapping of the index that is failing?
In addition, I have one more question: I see the indices have names like v27.tcpevent-shrink-000191; have they been created using the shrink index API?
Hi, thanks for taking a look.
I've attached mappings for the two index templates we've seen this issue on:
v27.agentprocessfileevent-mapping.txt
v27.tcpevent-mapping.txt
We use the rollover API to manage our time-based indexing, and use the shrink API plus a force merge once indices reach ~15 GB. A job periodically runs delete-by-query on documents older than our retention window, and I believe the failing merges are the automatic background merges clearing out those deleted documents.
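To make that lifecycle concrete, it is roughly the following sequence of API calls (the alias, index names, size threshold, timestamp field and retention window below are illustrative placeholders rather than our exact configuration):
curl -XPOST 'http://localhost:9200/v27.tcpevent-write/_rollover' -H 'Content-Type: application/json' -d '{"conditions": {"max_size": "45gb"}}'
# after making the source index read-only and relocating a copy of its shards to one node:
curl -XPOST 'http://localhost:9200/v27.tcpevent-000188/_shrink/v27.tcpevent-shrink-000188' -H 'Content-Type: application/json' -d '{"settings": {"index.number_of_shards": 1}}'
curl -XPOST 'http://localhost:9200/v27.tcpevent-shrink-000188/_forcemerge?max_num_segments=1'
# periodic retention job:
curl -XPOST 'http://localhost:9200/v27.tcpevent-shrink-000188/_delete_by_query' -H 'Content-Type: application/json' -d '{"query": {"range": {"@timestamp": {"lt": "now-30d"}}}}'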
I've been collecting some statistics on failures. So far we've had 494 errors since 2019-12-24T10:31:41: 145 of the totalPointCount error and 349 of the index-out-of-range error. Per index we've seen:
index  totalPointCount  indexOutOfBounds
v27.agentprocessfileevent-shrink-000269 2 0
v27.agentprocessfileevent-shrink-000270 2 0
v27.agentprocessfileevent-shrink-000272 104 0
v27.agentprocessfileevent-shrink-000278 1 0
v27.agentprocessfileevent-shrink-000286 4 0
v27.tcpevent-shrink-000185 2 142
v27.tcpevent-shrink-000188 13 0
v27.tcpevent-shrink-000190 17 118
v27.tcpevent-shrink-000191 0 86
v27.tcpevent-shrink-000201 0 1
v27.tcpevent-shrink-000204 0 1
v27.tcpevent-shrink-000206 0 1
The cluster has a total of 233 indices; 98 of these are v27.agentprocessfileevent-shrink and 77 are v27.tcpevent-shrink.
Thank you for this information, it sheds more light on the issue.
One last question, regarding the issue first appearing on December 24th: I would expect some change to have happened in your cluster or operational procedures around that time. Do you recall anything like that?
There were no changes specifically on that day, and the most recent operational updates were made 5 days earlier (we added a node attribute, which required a rolling restart of the ES cluster, but no other changes).
The timing corresponds to around 24 days after we upgraded to ES7, so the affected indices are all nearing the end of their retention period. I would expect them to see a lot of delete-by-query and therefore more merge activity, so it's possible we're just starting to see the effects of that. This cluster had previously been running ES 6.5.4, but all of the indices we've had this issue with were created in ES7.
@iverase I appreciate there isn't a fix for this yet, but do you have any information about what may be causing it, or any more insight into what the problem could be? We're currently running a set of cron jobs that retry failed shard allocation every minute, but this isn't a great long-term solution, and if there are other operational changes we can make to mitigate this we would like to try them. We're seeing around 20-30 merge failures per day on our cluster and are concerned about potential side effects or other underlying issues here.
@toby-jn I have not been able to reproduce your issue, so I am still unsure what the problem is here.
One thing you could do for me is to use the Lucene jar in the lib directory and run the CheckIndex utility. The command should look like:
java -cp /usr/share/elasticsearch/lib/lucene-core-x.x.x.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $datadir/nodes/0/indices/$index/$shard/index/
Note that it is a bit tedious as it needs to be run for every shard. It might be slow as well, given that this command not only verifies checksums but also makes sure that the index can actually be read and that its data structures are consistent with each other.
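Something along these lines should iterate over all shards of a single index (the paths below are the defaults for the Docker image, and the index is referenced by its on-disk UUID rather than its name, so adjust both to your deployment):
datadir=/usr/share/elasticsearch/data   # your path.data
index_uuid=<index uuid>                 # on-disk UUID of the index, not its name
for shard in "$datadir"/nodes/0/indices/"$index_uuid"/*/index; do
  java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex "$shard"
done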
It looks like this needs to be run with ES stopped, so we'll try to schedule some downtime for it. One thing I did notice on the first of the affected indices I checked: the primary had merged down to a single segment, but the replica still had a lot of segments:
/_cat/segments/v27.agentprocessfileevent-shrink-000320?v
index shard prirep ip segment generation docs.count docs.deleted size size.memory committed searchable version compound
v27.agentprocessfileevent-shrink-000320 0 p 10.64.132.33 _2b 83 50020427 263771 15gb 23305350 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _0 0 1530101 0 430.8mb 820161 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1 1 1131920 9530 358.2mb 621306 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _2 2 1865342 10383 551.2mb 1019448 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _3 3 334343 2796 82.6mb 184944 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _4 4 1352885 12572 438.5mb 732877 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _5 5 1744796 14466 531.6mb 970092 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _6 6 1309300 11270 427.1mb 711871 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _7 7 231449 1172 75.4mb 111788 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _8 8 1341299 13828 444.9mb 733269 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _9 9 1277097 2104 418.1mb 700411 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _a 10 1226868 8399 404.3mb 670473 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _b 11 1107346 0 367.1mb 591807 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _c 12 190765 0 61.9mb 107388 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _d 13 1222114 0 396.5mb 655998 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _e 14 37285 0 13.1mb 30790 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _i 18 93730 0 31mb 67511 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _j 19 179249 0 57.6mb 109188 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _p 25 13848729 73512 4.2gb 7208905 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _q 26 1950571 13377 561.6mb 1046458 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _r 27 39600 291 12.7mb 30358 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _s 28 78983 0 12mb 51500 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _t 29 35413 0 14.1mb 31012 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _16 42 9444 0 3.4mb 16027 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _17 43 8722 0 3.1mb 15591 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1b 47 1928557 1806 530mb 1019366 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1c 48 1360571 9524 442.7mb 777892 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1d 49 1914889 17285 562.2mb 1024724 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1e 50 363154 2062 88.3mb 193942 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1f 51 1323353 15192 431.4mb 724082 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1g 52 1678005 13021 507.9mb 927195 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1h 53 277047 1938 91.2mb 128091 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1i 54 1484407 15361 501.3mb 830046 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1j 55 1439894 11555 483.5mb 786604 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1k 56 1375682 2327 458.2mb 754077 true true 8.2.0 false
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1l 57 1246844 0 416.5mb 681799 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1m 58 1426232 0 458.6mb 784492 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1n 59 196784 0 62.2mb 111545 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _1o 60 98496 0 34mb 70240 true true 8.2.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _27 79 205235 0 64.7mb 123124 true true 8.3.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _28 80 47493 0 17.8mb 38116 true true 8.3.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _29 81 1496837 0 478.6mb 702562 true true 8.3.0 true
v27.agentprocessfileevent-shrink-000320 0 r 10.64.128.46 _2a 82 9596 735 3.8mb 16803 true true 8.3.0 true
This may be completely normal, but it does seem unusual; it also makes some sense given the errors we're seeing, and perhaps explains why forcing a retry and restarting nodes sometimes seems to fix the problem.
I don't think that is normal, and I was not using replicas when trying to reproduce the issue; I will try that next. If this is common for the failing indices, you might want to regenerate the replicas by setting the number of replicas to 0 and then setting it back to the desired number.
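For example, something like this (index name taken from your segment listing above; set the second call back to whatever replica count you normally use):
curl -XPUT 'http://localhost:9200/v27.agentprocessfileevent-shrink-000320/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'
# wait for the replica to be removed, then recreate it
curl -XPUT 'http://localhost:9200/v27.agentprocessfileevent-shrink-000320/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'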
Another interesting piece of information is that some segments appear to have been created with Lucene 8.2.0 and others with 8.3.0.
At the beginning of the issue you said you are running Elasticsearch 7.4.1, which uses Lucene 8.2.0. Have you upgraded ES in the meantime? (Lucene 8.3.0 was introduced in Elasticsearch 7.5.0.)
Yes, we upgraded to ES 7.5.0 last week to pick up the fix for #47719: because index shrink was setting max retries to 1, the allocation was not being retried when we hit these merge failures, so we were regularly seeing indices go red. We were definitely seeing these issues before the 7.5.0 upgrade, though.
I made a copy of the segment files for that shard (the replica for v27.agentprocessfileevent-shrink-000320, which had previously had a merge failure), deleted the lock file, and ran CheckIndex on it:
checkindex.txt
Doesn't seem to have any issues :(
@toby-jn just to rule out a problem with a specific node: I can see the two errors posted above come from the same node (45jcOptdTjqnA_fG1punmg), so I am wondering if the errors always happen on the same node.
I think this was just a coincidence; we've seen these failures on multiple nodes across multiple independent (but identically configured) clusters.
This seems to be happening during merges on indices where we've done a lot of delete-by-query. I've tried reproducing it on a test cluster by running lots of merging and delete-by-query in a loop, but haven't seen any issues there. However, we have seen it on a couple of our test stacks that specifically exercise our retention jobs, so if there's anything else we can do to debug this, we have an environment where we can sometimes reproduce it. We can try running those stacks with additional logging or special builds if needed.
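For what it's worth, the reproduction attempt was roughly a loop of this shape (index name, timestamp field and interval are placeholders, and indexing of fresh documents between iterations is left out):
while true; do
  curl -XPOST 'http://localhost:9200/testindex/_delete_by_query' -H 'Content-Type: application/json' -d '{"query": {"range": {"@timestamp": {"lt": "now-1h"}}}}'
  curl -XPOST 'http://localhost:9200/testindex/_forcemerge?max_num_segments=1'
  sleep 60
done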
If you have an environment where you can reproduce the issue, we would like to ask you to try reproducing it with a different JVM so we can rule out a problem at that level.
Sorry for the delay in doing this; it took some setting up on our end. We tried with JDK 12 and unfortunately still see the errors.
We're still seeing these a lot, but have been running some cron tasks to automatically fix up the errors when they occur. I've seen that this Lucene issue is getting fixed in the 7.7 release: https://issues.apache.org/jira/browse/LUCENE-9300. Is this a possible cause of what we're seeing?
Hi @toby-jn ,
Yes, that is indeed the cause of what you are seeing. The issue has been fixed in Lucene and released as part of Lucene 8.5.1. Our development branches have been updated to that Lucene version (https://github.com/elastic/elasticsearch/pull/55229), so the fix should be part of the 7.7 release.
I am closing this issue as it should be fixed in Elasticsearch 7.7. Please feel free to reopen it if you still experience the issue.