Elasticsearch: Docs missing from a replica

Created on 8 Aug 2016 · 16 Comments · Source: elastic/elasticsearch

Elasticsearch version: 2.3.3
Plugins installed: cloud-aws
JVM version: 1.8.0_25
OS version: Ubuntu 12.04 LTS
Description of the problem including expected versus actual behavior:

I have a document that exists in one replica of a shard but not in the other replica. The initial symptom was that an update on a document id failed; further investigation showed that some nodes could find the id via search and some could not. For example (all identifiers changed to protect the innocent):

GET /myindex/_search?q=doc1234

returns this when the query hits the shard where the doc doesn't exist:

"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
  "_index" : "myindex",
  "_type" : "mydoctype",
  "_id" : "doc1234"
  "_score" : 1.0,
  "_source" : {
    "explain" : true
  }
} ]

But when it hits the shard where the doc does exist, I get the full document back.

When I add the 'explain' parameter I can figure out which shard instance it is, and it turns out that the primary has 3 fewer documents than the replica (out of 2.3M docs).
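For reference, a minimal sketch of pinning the same query to each copy (this assumes the _primary/_replica search preference values available in 2.x), so the two copies can be compared directly:

GET /myindex/_search?q=doc1234&explain=true&preference=_primary
GET /myindex/_search?q=doc1234&explain=true&preference=_replica

With explain enabled, each hit also reports the _shard and _node it was served from.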

The document in question above is in the replica, but not the primary. I restored a snapshot of the index to another cluster, and I guess because the snapshot only captures the primary shard, the document is missing there as well, in both the primary and the replica.

After further investigation I see that the document counts between primary and replica differ in 469 of my 8662 shards. I suspect there are more, but short of comparing doc ids between the primary and replica of each shard (not really possible given the size), that's all I can go by for now. So this is not an isolated problem but quite a bit more widespread.

Of those shards, some were created in 1.7 and brought over; others were created very recently in 2.3.3. There's no pattern based on index age. The document in question was created on August 4, 2016 in an older index, but other time-based indices created in the past few days suffer the same document mismatch.
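One way to surface copies with differing counts, as a sketch (the column names assume the 2.x _cat/shards API), is to list the per-copy document counts and compare the primary and replica rows for each shard:

GET /_cat/shards/myindex?v&h=index,shard,prirep,docs,state,node

Rows that share the same index and shard number but show different docs values point at the suspect shards.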

I have started a discussion here: https://discuss.elastic.co/t/docs-missing-from-a-replica/57382
This does smell like a much older discussion over here: https://discuss.elastic.co/t/how-to-fix-primary-replica-inconsistency/9016/18

Looking for ideas on how to troubleshoot this further.


All 16 comments

Small correction: I realized that a large portion of the shards reporting different document counts are actively being written to, so that's expected. The shards that are not actively being written to and have differing document counts are 8 in total. But the overall nature of the problem doesn't change.

As a workaround I can forcefully "fix" the doc by retrieving it and updating it with the "doc_as_upsert" attribute added and set to true. Seems the two nodes are happy to update or insert the doc as they see fit.
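For illustration, a rough sketch of that workaround against the 2.x update API, where "doc" holds the full document retrieved from the copy that still has it (the field shown here is a hypothetical stand-in):

POST /myindex/mydoctype/doc1234/_update
{
  "doc" : { "status" : "open" },
  "doc_as_upsert" : true
}

The update goes through the primary and is then replicated, so afterwards both copies hold the document.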

How are documents written/updated/deleted in your cluster? Do you use version types? Do you use delete-by-query?
After fixing the inconsistencies between primary/replica, have you seen this happening again?

Are the 8 indices with differing doc counts from old versions, or 2.3.3?

@ywelsch The vast majority of the documents are written using bulk indexing, for both inserts and updates. We rarely delete documents. Version types are not used, we use the default settings. We do not use delete-by-query.

@clintongormley In each case the "/index/_segments" call returns a mix of Lucene versions, 4.10.4 and 5.5.0, which I think means the indices were created prior to 2.3.3. In those cases I didn't upgrade in place, but took a snapshot from v1.7 and restored it to the new v2.3.3 cluster, in case that makes a difference.
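For reference, the per-segment Lucene version shows up in the segments API response, along the lines of this trimmed sketch (values illustrative):

GET /myindex/_segments

"segments" : {
  "_0" : {
    "num_docs" : 123456,
    "version" : "4.10.4",
    ...
  }
}

Segments reporting 4.10.x were written under 1.7 (Lucene 4.10.4), while 5.5.0 segments were written by 2.3.3.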

We are continuing to see this issue and it is becoming more serious.

We have two classes of documents that we are indexing through a bulk operation: one class does a simple index, and the other does an update operation with an upsert and a Groovy script.

We don't have any evidence of missing items with the first class of documents.

But we are missing a small (<1%?), but functionally noticeable percentage of documents in some of the shards. I just worked through a small set of example failures from yesterday. In each case, the document was in the replica, but not the primary. (We have three shards and one replica for these indices). (BTW this is a hack to get our updatable objects to work with ES)

This is the template of the bulk update action we are using:

{"update":{"_index":"<index name>","_type":"<doc type>","_id":"<id>"}}
{"upsert":<doc body>,"script":{"file":"update_incident","params":{"doc":<partial doc>}}}

This is the groovy script used to update the document.

// We always want the latest timestamp
if (doc.get('@timestamp') < ctx._source['@timestamp']) {
  doc.putAt('@timestamp', ctx._source['@timestamp'])
}
// Update the state
ctx._source.putAll(doc)
ctx._source.incident.count++

We have been doing this for over 18 months without issue (on 1.5, 1.6 and 1.7). It is only in the last month since the switch to Elasticsearch 2.x that we have run into consistency problems.

@rtkbkish are you still seeing this issue?

@colings86 Yes

@rtkbkish many, many things changed in this area of the code. There are known problems; many of them have been fixed and some are in the process of being fixed. This is a problem we take seriously, and we can work to figure out what exactly the issue is in your case. Some questions:

1) Which version are you on today?
2) Do you see the same with 5.3.0?
3) Do you see any networking issues / disconnects in the logs?
4) What type of bulk request are you doing - is this an update, a normal index, or indexing with auto-generated ids?

  1. We are currently on 2.3.3
  2. We haven't made the jump to 5.3. The migration from 1.x to 2.x was so time-intensive that the next migration keeps getting delayed.
  3. Networking issues/disconnects are very rare. Does not explain frequency of issue.
  4. We generate the IDs. It is a bulk update operation.

> Networking issues/disconnects are very rare. Does not explain frequency of issue.

Can you reiterate how often you see this? Also - do you see any shard failures?

> We generate the IDs. It is a bulk update operation.

What kind of update do you do?

@bleskes
Sorry for the delay. Numerous local distractions.

I looked through the logs over the past 4 months for our data nodes in this cluster (of which there are four).
This is the distribution of NodeDisconnectExceptions:

  67 2017-01-12
  14 2017-01-13
  20 2017-01-16
  10 2017-03-02
  53 2017-03-08
  45 2017-03-23
  60 2017-04-05
  59 2017-04-10
   7 2017-04-13
   3 2017-04-14
  55 2017-04-20
  12 2017-04-22

These were triggered by GCs on the data nodes. There is a very similar distribution of shard create failures.

We see the missing docs much more frequently than this. Several times per day.

All updates are done using the bulk API.

Originally, we used an upsert statement, but changed that to a create and update in an attempt to get around this issue. (It hasn't fixed things, so we should roll it back).

Current bulk statements:
{"create":{"_index":"","_type":"","_id":""}}
{}
{"update":{"_index":"","_type":"","_id":""}}
{"script":{"file":"update_incident","params":{"doc":{}}}}

Original Bulk statements:
{"update":{"_index":"","_type":"","_id":""}}
{"upsert":{},"script":{"file":"update_incident","params":{"doc":{}}}}

In either case we trigger a groovy script:

// We always want the latest timestamp
if (doc.get('@timestamp') < ctx._source['@timestamp']) {
  doc.putAt('@timestamp', ctx._source['@timestamp'])
}
// Update the state
ctx._source.putAll(doc)
ctx._source.incident.count++

Do you see active primary shard failures in your logs, and if so how often?

Since November 2016, I see two "primary failed while replica initializing" errors, both for the same index on Mar 22.

Otherwise there are clusters of "marking and sending shard failed due to [failed to create shard]" messages when we have GC incidents. These are tied to the NodeDisconnectExceptions in the previous comment.

I lean towards closing this, as so much has changed in this area since 2.x. Prior to 6.0, a replica could fall out of sync in the case of a primary failure. In 6.0, we are introducing a primary/replica re-sync after a replica is promoted to primary. If this occurs again after 6.0 is released, we can revisit, but at this point we are not going to make any changes in 2.x or 5.x to address it.
