Can you please provide more information that would help us to reproduce the problem you are facing?
It would help to get the exact query you are running (including the indices that are hit), how your data looks like, possibly the mapping and how your index process looks like.

Ideally, can you provide a minimal example? Just a few documents that, if indexed, will exhibit the mentioned problem?

cbuescher on 10 Jul 2017

@cbuescher
Thats the problem, we are unable to reproduce it.
Though we are worried it will occur again.

We were targeting 2 types in the query.. e.g.
http://node01:9200/my_index/my_type,my_other_type/_search

What we observed in the screen shots above was that it would randomly return incorrect totals for same query executed multiple times.

With the explain options set in the query we saw 1 particular node would always return the wrong totals so we restarted this node but the issue continued.

Only when we had restarted all nodes did the problem disappear.

We don't know if it was some corruption in the shards of 1 node or what.
As mentioned we didn't see the same issue using _msearch and _count apis with same query.

dannymurphygfk on 10 Jul 2017

@cbuescher
Would you be able to suggest what additional information that we should collect, in order to be able to reproduce the error if we face the same issue again?

JagathJayasinghe on 17 Jul 2017

As mentioned earlier, anything that helps reproducing this on our side: how does your data look look like, what mappings are you using, what's your ES setup, how do you index, how does the query look like that exibits this problem.

cbuescher on 17 Jul 2017

Information provided here would give you some idea about the query, mappings, data, indexing and ES setup and please note that the examples provided here have been greatly simplified.

example query :
Query listed below exhibited the issue while the problem was still there and same query is now producing the expected correct result (after rolling restart of the cluster resolved the issue)

http://node01:9200/my_index/product/_search
{ "size": 10, "query": { "ids": { "values": [183307729] } } }

mappings for the type “product” :
mapping.txt

How data looks like:
product.txt

Indexing process and the ES setup:
elasticsearch.yml (elasticsearch.txt) file attached here is from one of the nodes in the cluster of 9 nodes.

Indexing to the cluster is performed using the HTTP bulk API with a default batch size of 1,000 items with a peak rate of around 2,000 per second.

JagathJayasinghe on 17 Jul 2017

The issue has occurred again, after only 1 month of uptime.

** Correction - _msearch does exhibit the same issue.
We use our userIds with the _msearch as the preference and while 1 might consistently work... another will consistently fail.

Also adding this preference to the _search query e.g. /product/_search?preference=2906 results in 1 preference returning correct results consistently and another consistently failing.

We had to perform a rolling restart of all nodes again to fix the issue.

@cbuescher Is there anything more we can check if/when it happens again ? APIs to find the shards a specific preference targets etc... to see if we can narrow down which shards/nodes are causing the issue ?

dannymurphygfk on 10 Aug 2017

Sorry for the delay, just looking into this again, this seems really odd but hard to debug without reproduction.
Just to clarify a bit: For a query like { "size": 10, "query": { "ids": { "values": [183307729] } } } you would expect to get back exactly one document and a total.hits of 1? But you are getting more than that?
Is the ProductId the same as the document id you index the main document with?
When this happens, does it happen for all ids or just very few? As long as you don't restart, does the problem reproduce with that same id?
What happens if instead of using the ids query or a terms query on the _id field you use the main document ProductId?

My suspicion is this might have something to do with the nested documents and merging, but have no good idea so far what is happening or how to explain this.

cbuescher on 15 Aug 2017

👍1

Another question I forgot to ask: What do you need the result of hits.total for? You mentioned _count as not exhitibing the same behaviour, is that still true as far as you know?

cbuescher on 15 Aug 2017

Let me answer your questions in turn,

1) Yes, for a query like { "size": 10, "query": { "ids": { "values": [183307729] } } } we expect to get back one document and hits.total of 1. But when this issue occurs we get an incorrect hits.total value.
Sometimes it returns 1 (which is correct value) other times a much higher value, this behaviour just feels like round-robin of shards.

2) ProductId and the document _id are the same.

3) When this happens the problem occurs for all ids we've randomly spot checked.

4) We don't currently index the ProductId field.

5) We use hits.total in side one of our web applications to make a business logic decision, _counts API still returns the correct count even when the incident occurs.

JagathJayasinghe on 17 Aug 2017

👍1

Thanks for the feedback, this is indeed a weird behaviour. Just a few follow up questions:

is the index you are running the query on maybe an old 2.x index or was it already created in 5.x
are all nodes running the same Elasticsearch version, so we can exclude serialization bugs?
Can you run the affected query using the Validate API with the explain parameter set to true and post the output so we see the actual Lucene query that gets executed?
Concerning possible corruptions: this seems to be unlikely because they would normally persist even after a restart. It might still make sense to check for corruptions, but there is a command line tool in Lucene you can use to check.
You can run java -cp lib/lucene-core-6.4.1.jar org.apache.lucene.index.CheckIndex -fast <Path-To-Index> in your Elasticsearch home directory, where Path-To-Index would be absolute path to the index segments in your data directory, so usually something like data/nodes/0/indices/<indexUUID>/<shardNo>/index/. The indexUUI can be obtained running the _cat/indices endpoint, and you should try all available shards on that node and see if the tool detects any problems

cbuescher on 17 Aug 2017

One thing I forgot mentioning about CheckIndex: since its a very low level Lucene tool that comes without warranty, you should make a backup of your index before running it. More in the JavaDocs.

cbuescher on 17 Aug 2017

I changed the titel to make the issue a bit more specific, still not sure if this is a bug or not but let's treat is as one for now.

cbuescher on 17 Aug 2017

No, we migrated from 1.X to 5.X using the _reindex API into new incidies.
All 9 nodes in the cluster are running 5.3.1
We will perform this test when the incident occurs again, it's happened twice in the last 7 days
Good information on on checking the corruption of the index, we'll give this a go after making a backup

JagathJayasinghe on 18 Aug 2017

Experiencing the same issue again,

Please see the queries and their results listed below. Would this information help ?
1)
http://node01:9200/my_index/my_product/_search?preference=3677
{ "size": 10, "query": {"ids": { "values": [105226165] }}}

This query consistently returns incorrect hits.total value.

2)
Executing the same query using Validate API with the explain parameter set to true

http://node01:9200/my_index/my_product/_validate/query?explain=true
{"query": {"ids": { "values": [105226165] }} }

Result:
{"valid": true, "_shards": { "total": 1,"successful": 1,"failed": 0},
"explanations": [
{
"index": "my_index",
"valid": true,
"explanation": "+_uid:my_product#105226165 #(#_type:my_product)"
}]}

3)
Executing the same query using Validate API with the explain parameter set to true and rewrite set to true
http://node01:9200/my_index/my_product/_validate/query?explain=true&rewrite=true
{"query": {"ids": { "values": [105226165] }}}

Result:
{"valid": true,"_shards": {"total": 1,"successful": 1,"failed": 0},
"explanations": [
{
"index": "my_index",
"valid": true,
"explanation": "+ConstantScore(_uid:my_product#105226165) #(ConstantScore(_type:my_product))^0.0"
}]}

JagathJayasinghe on 5 Sep 2017

Hi,
another question that might narrow down my initial suspicion this might have something to do with the nested "SellerProducts" documents. For the above document in question (105226165), do you know if and how many nested document it currently has? Maybe it had some that got deleted? Does this number come close to the wrong "total.hits" you are seeing? This is a wild quess but since the querys run in the "_uid" field it might be worth trying to narrow things down here a little.

cbuescher on 5 Sep 2017

Thank you for your response,

It has multiple nested documents but hits.total and the number of nested documents are not the same.
These numbers are way off and doesn't come close (hits.total was 290 whereas the expected number was 1).

JagathJayasinghe on 6 Sep 2017

@JagathJayasinghe thanks for your patience and bearing with me. I talked to @martijnvg who is more familiar with nested docs and we are still leaning in that direction with hypothesis of what might be wrong.

A few more questions:

Do you do many deletes (or updates, delete by query for that matter) on the documents in the index? If yes, how frequent are they?
Just as a clarification: in the first mapping you attached you seem to have nested enabled (for "SellerProducts"), but in the query explain we don't see the clause that should exclude nested documents. This should be there in 5.3, so thats a bit odd for us to see. Also in the type is called product in the mapping that you shared, but in the validate request you query type my_product. Is that the same mapping?
What happens if you close and reopen the index instead of a restart if the problem occurs?

Sorry that we're still poking around in the dark a little on this, there's something fishy here but its hard to put the finger to it.

cbuescher on 6 Sep 2017

Yes, we do many deletes and updates
Updates per day - 20m updates across 2 indices.
Deletes per day - high 100s / low 1000s.
Sorry that was my mistake, I should have used "product" type when simplifying for my reply
It maybe be possible for us to try closing/opening the index but it would require a maintenance window as there are several applications using the data that are not affected by the issue when it occurs.

JagathJayasinghe on 6 Sep 2017

Thanks,

maybe be possible for us to try closing/opening the index but it would require a maintenance window as there are several

okay, so that's not practical then, also this was only out of interest, no concrete hint in mind

I talked to @jimczi in the meantime regarding the explain output you posted and he thinks it looks okay (the nested filter is omitted because there is a filter on _type).
Another thing to recheck, just to be sure:

you mentioned "_count" would work when you see this problem, did you also use the same "preference" parameter in the url that you are using for the search query where you the incorrect total.hits?
does restart solve the problem for a long time? We were thinking that a restart might just change the order of the replicas, so a different "preference" parameter might still hit the same replica that exhibits the problem?

cbuescher on 6 Sep 2017

With "_count" we didn't try using "preference" parameter.
Restarting solve the problem for 1-2 weeks.

When the issue first happened we did use the _search_shards API with a successful call and an unsuccessful call of the same query in the hopes of identifying a faulty node/shard.
We thought we had identified the faulty node through the order they were listed in the responses. I think we restarted the first one that was out of order in the faulty call… but it didn’t fix the problem so we went and did a rolling restart of all nodes. Which we have been doing since.

Is there some other APIs we should use to help us identify the nodes/shards a certain preference is targeting next time this happens ?

Thank you.

JagathJayasinghe on 6 Sep 2017

When the issue first happened we did use the _search_shards API with a successful call and an unsuccessful call of the same query in the hopes of identifying a faulty node/shard.

As far as I can see, if you use _search_shard with a fixed preference=XYZ value you should get the node/shard that would be selected with preference first when you would run an actual search with that value. So if your ids query behaves weird with preference=x but works well with preference=y, that should tell you which shards on which nodes are used and check if restarting only those helps (at least for that id, docs with other ids might live on another shard).

Also setting "explain" : true in your search request should tell you _shard and _node for every hit, so if you use that with a good and a bad preference parameter then you probably see a difference.

Btw. how many shards and replicas are configured for the index in question? I though I'd already asked that and assumed a "standard" setup but just rechecked and couldn't find anything.

cbuescher on 8 Sep 2017

The issue has happened again.
So we ran our query with 2 preferences:

Preference 2904 works and correctly returns "total": 1
Query:
POST my_index/my_type%2Cmy_other_type/_search?preference=2904&explain=true
{ "query": { "ids": { "values": [111637091] }}}
Response: Response.txt

The source has 43 nested documents or 1 type and 9 nested documents of another type.

Preference 2906 fails and returns "total": 595
Query:
POST my_index/my_type%2Cmy_other_type/_search?preference=2906&explain=true
{"query": { "ids": {"values": [111637091] } }}
Response: Response.txt

We tried just restarting the node identified by the failed query but this did not fix the issue.
While the node was recovering, the query switched to another node but once recovered it switched back to the original node but continued to fail.
We restarted the node it switched to and tried again… but it still failed…
Then we restarted the final node with shard 0 and again it still failed…
So we restarted all nodes to clear the issue.

Please note that my_type and my_other_type used in this simplified query/response version are using same "product" mapping.

Thank you.

JagathJayasinghe on 12 Sep 2017

@JagathJayasinghe can you share how many shards and replicas are configured for the index in question? Apologies if this is already somewhere in the ticket, I couldn't find anything.

cbuescher on 14 Sep 2017

There are 6 shards and 2 replicas.

JagathJayasinghe on 14 Sep 2017

Thanks, lets take a step back here.
I didn't focus on the fact that you querying across two types so far, can you elaborate on this a bit more? This might come into play with the nested docs probably.
As far as I see in this issue, you have one index (here "my_index", probably different name in prod) and two types ("my_type", "my_other_type"). The mapping you provided in https://github.com/elastic/elasticsearch/issues/25603#issuecomment-315796289, that is the mapping for one type ("product").

Is the mapping for the other type the same?
Do the document you are querying for with the id are all of one type or are they different? Can there be the same document id on "my_type" and "my_other_type"?

The source has 43 nested documents or 1 type and 9 nested documents of another type.

Can you elaborate on what you mean by this? I assume this is one "parent" document with id "111637091", it is either of "my_type" or "my_other_type"? Do you know which? How can it have nested document under another type?

cbuescher on 14 Sep 2017

Thank you for the response. Please see explanations below in turn,

Yes, the mapping for both types is exactly the same.
The document Ids should be unique across both types.

So there are 43 nested documents of nestedType and 9 of otherNestedType

Yes, there is 1 parent document with id “111637091” in this case it is of type my_other_type.
It has 2 properties of nested documents:
“nestedTypeProp”:{
“dynamic”: “false”,
“type”:”nested”
“properties”:{…}
},
otherNestedTypeProp :{
“dynamic”: “false”,
“type”:”nested”
“properties”:{…}
}

JagathJayasinghe on 18 Sep 2017

Preference 2906 fails and returns "total": 595 Query: POST my_index/my_type%2Cmy_other_type/_search?preference=2906&explain=true {"query": { "ids": {"values": [111637091] } }}

would you be able to run the query without the type filter ie.:
POST my_index/_search?preference=2906&explain=true

and share the result?

s1monw on 21 Sep 2017

Problem is this issue is not occurring at the moment on our production cluster but it just so happens that today I noticed we were experiencing the same issue on our Dev cluster so I have been playing with the various APIs to see if I can find out anything new.
Again querying for a single Id.

Running the query with the &explain=true only gives us the explanation of the node that successfully returned the 1 hit.
response.txt
Whether run with or without the type filters I was getting the same hits total of 598.

Using the ?preference=_shards:n parameter I ran the query against each of the 9 shards we have on 3 nodes in Dev individually.
Node1 correctly returned 1 hit and total:1 for 1 of its shards and 0 hits and total:0 for the other shards on that node.
Node2 returned no hits but totals of 95 for all of its 3 shards
Node3 returned no hits but totals of 104 for all of its 3 shards

(1) + (95 * 3) + (104 *3) => 598

Again I tried the _count API but with the ?preference=_shards:n for each individual shard and it still returns the correct counts for all shards.

Is there something more we can try to find out whats going on in the 2 problem nodes we currently have in our Dev cluster ?

Incidently our Dev cluster is running v5.2.2, whereas our LIVE cluster is v5.3.1

dannymurphygfk on 21 Sep 2017

@dannymurphygfk, just to make sure, because @JagathJayasinghe was talking about 6 shards and 2 replicas before, I assume this was production. How's the situation on the dev cluster that you mentioned now? How many of those 9 shards are primaries, how is their distribution over the three nodes?

cbuescher on 21 Sep 2017

@cbuescher Yes Jagath was talking about the Production Cluster.
Our Dev cluster has 9 shards 0 replicas, 3 shards on each node.

dannymurphygfk on 21 Sep 2017

One thing I haven't asked at all so far is how you index your documents. Do you use any costumized routing by any chance? How about updates to parent documents, e.g. when nested documents are deleted or added?

cbuescher on 21 Sep 2017

@cbuescher We use the Bulk indexing API to index and no we don't use any customized routing.

When nested documents are added/updated/deleted the entire parent document is involved ?
We are using the nested model NOT the parent-child relationship.

I think we might have confused you maybe by previously calling our nested documents 'types' or something ?

dannymurphygfk on 21 Sep 2017

When nested documents are added/updated/deleted the entire parent document is involved ?
We are using the nested model NOT the parent-child relationship.

Yes, understood. I just wanted to exclude the chance of weird custom routing issues from the list of possible causes.

cbuescher on 21 Sep 2017

We are not restarting the nodes on our Dev cluster as we are hoping it can help us troubleshoot the problem.
Don't know it it will help much, but we noticed that we can create a new index, with the default settings for shards etc.. and if we query the index even before adding any documents we get bogus hit totals ?

PUT my_index

{
"acknowledged": true,
"shards_acknowledged": true
}

POST my_index/_search

{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 294,
"max_score": 0,
"hits": []
}
}

POST my_index/_search

{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 407,
"max_score": 0,
"hits": []
}
}

Like I said our Dev cluster is still currently in this state so if there are any other APIs we can call to help troubleshoot it please let us know.

dannymurphygfk on 26 Sep 2017

we can create a new index, with the default settings for shards

Thanks, its great if you can keep the dev cluster in the "problematic" state, and the fact that this reproduced without any data points to a different direction than my previous suspision about nested document I think, although that behaviour on a completely emty index is super strange as well. So just to sum up and double check, correct me if I'm wrong:

You have 3 nodes in the dev setup, this empty index seems to have the default of 5 shards now (not 9 like the other one containing data). How many replicas in this case?
Your don't even issue an "ids" query not on that new index, but simply an empty search (should be match_all), and you see total hits

A few questions again:

You posted two outputs, one with "total": 294, one with "total": 407, does this alternate or turn um at same frequency?
Can you this with again with ?preference=_shards:n agains all 5 shards and compare the output? Are there one or two shards that have this wrong count? Are those on different nodes?
Did (or can you) check the shards of this empty index for corruptions like mentioned in https://github.com/elastic/elasticsearch/issues/25603#issuecomment-323082017, maybe focusing on the nodes/shards with wrong outputs if the previous step resulted in any particular ones?

And thanks for sticking with this puzzling thing for so long.

cbuescher on 26 Sep 2017

We have 3 nodes in the Dev cluster
The new index defaulted to 5 shards and 1 replica
Yes, we are performing an empty search and yes we see total hits coming back.
Yes I ran 2 searches because I was seeing 2 different counts coming back alternately. (this morning this search is alternating between totals 199 and 398)

Running the search with preference=_shards I get

POST my_index/_search?preference=_shards:0

{
"took": 11,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}

POST my_index/_search?preference=_shards:1

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}

POST my_index/_search?preference=_shards:2

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}

POST my_index/_search?preference=_shards:3

{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 95,
"max_score": 0,
"hits": []
}
}

POST my_index/_search?preference=_shards:4

{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 104,
"max_score": 0,
"hits": []
}
}

And the distribution of the shards is as follows:

Actually when I rerun the search with the _shards preference again, its different shards that return the wrong totals.

Can you explain how I check the empty index shards for corruption ?

dannymurphygfk on 27 Sep 2017

Can you explain how I check the empty index shards for corruption ?

To check index shards for corruption you can use a command line tool build into Lucene and run it on the Lucene index that contains the files for that shard. For that, go to the affected node, change to the ES home directory (depending on your installation). There should be a lib folder that contains the lucene-core-6.4.2.jar (the version might be different, this is for 5.3.3). Next you need to know which directory the shard data is in. Figure out the index id of the affected index using the "_cat/indices" api (e.g. its rX1iHVUXSw6z7nYNSWzQ3A for a test on my local machine. The directory you want to check then is $YOUR_DATA_DIR/nodes/$NODE/indices/$INDEX-ID/$SHARD-ID/index where $YOUR_DATA_DIR is the data directory configured in your ES config file, $NODE should be 0 if you are only running one node on each machine, and $SHARD-ID the shard number you want to check (e.g. 4 in the above example where shard 4 seems to do quirky things).

It seems like you need to close the index before running the checker, otherwise you get a LockObtainFailedException. Then run:

java -cp ./lib/lucene-core-6.4.2.jar org.apache.lucene.index.CheckIndex -fast $YOUR_DATA_DIR/nodes/$NODE/indices/$INDEX-ID/$SHARD-ID/index

The tool should give you a report telling you if there are any issues. I'd try running it at least on both nodes that contain shard 4 of the above empty index. Now that the issue also reproduces on your dev environment, can you also try to close the "real" test index and try to run the checker on all the shards and report if it turns up anything of interest? I also asked initially if the problem with the wrong counts persists after closing/re-opening an index, that would be interesting to see if this is possible to do on the dev-cluster index now as well.

cbuescher on 27 Sep 2017

I ran the tool against all shards for the empty index... all report
"No problems were detected with this index."

Closing the empty index and reopening didn't fix the issue either.

I even ran the tool on the "real" test index against all of it shards, it logs how it checks each segment etc... but the outcome was still "no problems were detected with this index".

I don't see anything else remarkable in the report but have attached the output from node one just in case.
node1Shards.txt

Closing the this index and reopening didn't fix the issue.

I dunno if we already mentioned it before but specifying the "size" appears to have an impact on the results. If we specify size 0 we get the correct totals... but if we specify a size > 0 we get incorrect totals, both for our "real" index and our empty index.

e.g. with the empty index:
POST my_index/_search
{
"size": 0
}
vs
POST my_index/_search
{
"size": 1
}

dannymurphygfk on 29 Sep 2017

Hi @dannymurphygfk,

now that you are able to reproduce the issue on an empty index, would it be possible to provide us a copy of empty index shards? I'd like to see if I can recreate the problem locally using your faulty empty index shards. I think it should work if you zip the empty index ids data directory on each of the three nodes of your test cluster and attach it to this issue. I'm not sure at all this will allow me to reproduce the issue but I think its worth a try. Let me know if this is possible for you.

cbuescher on 2 Oct 2017

As requested the copy of the shards from the empty index.
my_index.zip

dannymurphygfk on 2 Oct 2017

Thanks for the copy of the shards, I imported them into a local three node 5.2.2 cluster (I think thats the version you're running in Dev), but unfortunately I'm not able to reproduce the issue like this. So the "bad" state that your cluster is in when returning the wrong "hits.total" is probably an in-process thing. This would also explain why closing/opening the index doesn't make the issue disappear, but a restart of the nodes does.

I dunno if we already mentioned it before but specifying the "size" appears to have an impact on the results. If we specify size 0 we get the correct totals... but if we specify a size > 0 we get incorrect totals, both for our "real" index and our empty index.

I think this is the most interesting hint we currently have, but I don't have an idea yet how to dig into this behaviour further. Thanks for bearing with me, let me know if you have any new information on your side.

cbuescher on 5 Oct 2017

Sorry for the long silence here, just wanted to let you know I'm still curious about this but currently have no good ideas of how to proceed without being able to reproduce this on our side. But it's still on the radar, let me know if you have anything new or you don't see the issue any more (maybe in a different environment)...

cbuescher on 1 Dec 2017

Hi @cbuescher ,

We have upgraded to v5.6.4 last week so will let you know if it happens again on this version.

dannymurphygfk on 4 Dec 2017

👍1

After posting yesterday about having upgraded... unfortunately we noticed the same problem happening again. It took a full cluster rolling restart to sort it.

dannymurphygfk on 5 Dec 2017

So we noticed the issue 6 days ago.... so we performed a rolling restart to clear.
The issue has been noticed again just now so we are yet again performing a rolling restart to clear.

Just in case this has any bearing, we do have 2 clusters a Head and a Tail
Both clusters data get indexed by the same process using the bulk indexing apis.

The issue is only happening on our Head cluster.

What is different about our Head cluster is:

Data older than 6 months is removed periodically.
Mappings are slightly different as the use case for querying each cluster is slightly different.
We use custom Native scripts for filtering and sorting.

Also might have some bearing, we periodically notice 1 or more nodes on our Head cluster consuming too much memory and crashing or getting stuck in a broken state and requiring manual intervention to recover. This is also something we are investigating.

dannymurphygfk on 11 Dec 2017

Data older than 6 months is removed periodically.
Mappings are slightly different as the use case for querying each cluster is slightly different.
We use custom Native scripts for filtering and sorting.

These might be interesting differences. Can you share a bit more e.g. about how the deletion happens (e.g. with which frequency, any querys/scripts involved). Same for the mappings, what are those slight differences?
As for the scripted filtering/sorting, do you also only use these on the Head cluster? If so, can you also share those?

cbuescher on 11 Dec 2017

The deletions are done through a Cron job that runs daily at 9pm.
/usr/bin/curl --silent -X POST -d '{"query":{"range":{"LastUpdated":{"lt":"now-180d/d"}}}}' http://node1:9200/my_index/my_type,my_other_type/_delete_by_query 2>&1 |/usr/bin/logger -t cleanup

The differences in the mappings are more to avoid indexing properties that we don't need to query by in either cluster. e.g. in Our Head cluster we index a property which we later use in the filtering/sorting scripts.
"MySortingMetaProperty":{ "type":"keyword" }

This field is not indexed in the Tail cluster.

Yes the scripts are only used by our Head cluster, they are quite involved though... not sure I can share them in their entirety.

E.g. some of the scripts use the "MySortingMetaProperty".
This property contains an array of slugs (CSV) of the important information in the nested documents we spoke about above. Our script parses this information and processes to provide a sort value for the main document.

dannymurphygfk on 11 Dec 2017

Just noticed while Cluster is in this bad state, if I query for an Id that does NOT exist I also get a bogus total value.

POST my_index/my_type%2Cmy_other_type/_search?preference=2906 {"query": { "ids": {"values": [1] } }}

There are no documents of my_type or my_other_type with Id 1 yet its still returned total of 296.

dannymurphygfk on 12 Dec 2017

I don't know if it will help much, but it looks like this issue may be happening more often than we originally thought.

We setup an hourly scheduled task to query 3 or 4 Ids including the id 1 which doesn't exist.
It queries the same ids with 20 different preference numbers. The idea was to identify early when this issue begins.

What we experienced over the xmas period is as follows:

15th Dec - Product id 1 started returning bogus totals
later that day 1 node (node 5) was restarted for other reasons and it appeared the issue was fixed until
17th Dec - Again product id 1 started returning bogus totals
2 nodes (nodes 4 and 5) were restarted for other reasons and again it appeared the issue was fixed until
21st Dec - Again product id 1 started returning bogus totals
23rd Dec - 1 node (node 4) went down and was was restarted - appeared to fix the issue until
25th Dec - Again product id 1 started returning bogus totals

2 nodes (node 4 and 9) have been restarted since but the issue persisted this time and is still present.

Right now we are looking at a workaround for the broken feature that relied on this being correct so avoid having to perform time consuming rolling restarts every time... but elsewhere in our application it will still be reporting the incorrect totals for us.

dannymurphygfk on 2 Jan 2018

Given we are not restarting the nodes immediately due to this issue I have more time to investigate it if you have any more ideas ?

What I have noticed today is.
I can query for an Id

POST my_index/my_type my_other_type/_search?preference=2904
{
    "query": {
        "ids": {
            "values": [239563172]
        }
    }
}

Which returns incorrect totals.

I can query for another Id

POST my_index/my_type my_other_type/_search?preference=2904
{
    "query": {
        "ids": {
            "values": [225238857]
        }
    }
}

Correctly returns 1

I can query for both Ids

POST my_index/my_type my_other_type/_search?preference=2904
{
    "query": {
        "ids": {
            "values": [239563172, 225238857]
        }
    }
}

Correctly returns 2 !

Using the profile parameter I ran the query with incorrect totals and the one with both.
The profile results at the top level look the same... querying the same nodes and shards ?
Looks like all 6 shards participating in the query ?

"profile": {
    "shards": [{
            "id": "[-aSRX8nEQdO_augx-5OAFw][my_index][4]",
            "searches": [...],
            "aggregations": [...]
        }, {
            "id": "[-aSRX8nEQdO_augx-5OAFw][my_index][5]",
            "searches": [...],
            "aggregations": [...]
        }, {
            "id": "[QmSRDHkyQeaeZU8sAkRwcw][my_index][2]",
            "searches": [...],
            "aggregations": [...]
        }, {
            "id": "[RBL1o_YbSvOk88d1agNmrw][my_index][1]",
            "searches": [...],
            "aggregations": [...]
        }, {
            "id": "[RBL1o_YbSvOk88d1agNmrw][my_index][3]",
            "searches": [...],
            "aggregations": [...]
        }, {
            "id": "[uUJfY22pSp2y8bFAsBbFIQ][my_index][0]",
            "searches": [...],
            "aggregations": [...]
        }
    ]
}

dannymurphygfk on 3 Jan 2018

I can query for both Ids

This is very interesting indeed. Thanks for trying, I'm really sorry I haven't had the bandwidth to dig into this further so far, but I have a few more vague suspicions I want to run past somebody else more knowledgable in those areas first. I hope to be back with something else to try to narrow this down soon.

cbuescher on 3 Jan 2018

Hi @dannymurphygfk,

after reading you comment from two days ago I realized we didn't consider looking into potential interferences of caches yet, most prominently the request cache. There is an option to disable it per request that I'd like you to try with the problematic ids queries and see if the result changes with and without using the cache.

If that doesn't turn up anything new, maybe clearing all caches is something you can try next. This might have a small effect when using it on your production cluster, so I'd try to do this on the dev cluster first when it is in a "problematic" state. However ES 5.x shouldn't rely to much on caching if your cluster isn't under heavy load, so it should also be relatively save to try on the production cluster if you are not running near full capacity.

Two other things I wanted to ask or re-check:

I assume you are running all the above tests with either Kibana console or some HTTP client and you are not targeting any client node running on your cluster by any chance. Would be interesting to know if otherwise, also if any of the data/client nodes in this case have different version
You mentioned before that restarting individual nodes doesn't usually resolve the issue. In this comment you mention that in December sometimes restarting one or two nodes seems to resolve it somehow for a time, is this correct? Is this still the case generally?

Hope this brings us a step nearer to narrowing this strange bug down.

cbuescher on 5 Jan 2018

Hi @cbuescher ,

Thanks I have just tried what you suggested but unfortunately to no avail.

Disabling the request cache doesn't affect the result... returns same result with or without cache disabled.

Likewise, clearing of all caches didn't solve the issue.

Yes I am using Kibana dev tools to execute the queries and all nodes are on the same version. 5.6.4.
Yes we setup a hourly scheduled task which is querying for some known ids separately with 20 different preferences... and if any results in a total > 1 then we get an email notification. In December we would start receiving these emails every hour... but when 1 or 2 nodes were recycled the emails stopped so have to assume the totals were all correct for a time... then we would start receiving emails again.

Does the inclusion of the size parameter give any clues ? Because I can query an id that returns bogus totals with size:0 and in this case it correctly returns 1... but anything else and its incorrect ?

Incidentally the ids I was using 2 days ago both fail for me today :(

dannymurphygfk on 5 Jan 2018

Thanks I have just tried what you suggested but unfortunately to no avail.

:-(
I was hoping to get closer to the source of all this, but its good to check nevertheless.
Thanks for the other answers which confirm my current assumptions.

Does the inclusion of the size parameter give any clues ? Because I can query an id that returns bogus totals with size:0 and in this case it correctly returns 1... but anything else and its incorrect ?

I will need to do some digging and thinking about this, might be the best clue at the moment.

Incidentally the ids I was using 2 days ago both fail for me today

Can you specify this a bit more? Do you mean that now

POST my_index/my_type my_other_type/_search?preference=2904
{
    "query": {
        "ids": {
            "values": [239563172, 225238857]
        }
    }
}

Doesn't correctly return 2 any more? Was this before or after the clearing of the caches? Any Nodes restarted in between?

cbuescher on 5 Jan 2018

Yes
Both queries that were returning correct totals yesterday are now returning bogus totals, both before and after the clearing of the cache.

POST my_index/my_type my_other_type/_search?preference=2904
{
    "query": {
        "ids": {
            "values": [239563172, 225238857]
        }
    }
}

POST my_index/my_type my_other_type/_search?preference=2904
{
    "query": {
        "ids": {
            "values": [225238857]
        }
    }
}

Looks like 1 node went down last night (node 6 - host issue) and was restarted.

btw neither of these ids worked for a colleague yesterday who would have been using a different preference.

dannymurphygfk on 5 Jan 2018

We are also experiencing this issue at nimble.com
any id query like {ids: {values: [id1]}} returns hits.total ~ 1000 what confuses user when he/she operates with one or a few(<30) docs and operation progress then shows like 1k docs.

we are running es 5.4.2
we have never seen this on 2.4
we use external versioning, child/nested docs
sometimes we index docs with version_type force

we are also encountering way too many errors like 'search_phase_execution_exception', u'No search context found for id' we didn't have on 2.4

serj-p on 10 Jan 2018

Hey,
Just to chime in, I also have this problem.
Creating a brand new index on my cluster and indexing 10 documents, I get 50 hits!
It seems a suspicious correlation to the fact that I also have 5 nodes (10x5=50)

{
  "name" : "elasticsearch-master-1",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "RxDSRv09Rt2zhQVNUeC_xQ",
  "version" : {
    "number" : "6.1.1",
    "build_hash" : "bd92e7f",
    "build_date" : "2017-12-17T20:23:25.338Z",
    "build_snapshot" : false,
    "lucene_version" : "7.1.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

{
"took": 120,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 50,
"max_score": 1,
"hits": [ { "_index": "metrics20-2018.01.10","_type": "test","_id": "14","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "19","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "22","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "26","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "5","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "10","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "21","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "32","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "33","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "metrics20-2018.01.10","_type": "test","_id": "20","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}}]
}
}

Stono on 11 Jan 2018

This is the same on a different version of elasticsearch too?:

{
  "name" : "ip-10-155-0-205",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "OyESK4dhQKOXvR6V8W-Dtg",
  "version" : {
    "number" : "5.2.2",
    "build_hash" : "f9d9b74",
    "build_date" : "2017-02-24T17:26:45.835Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 50,
"max_score": 1,
"hits": [ { "_index": "karl20-2018.01.10","_type": "test","_id": "14","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "19","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "22","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "24","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "25","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "26","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "29","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "40","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "41","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}},{ "_index": "karl20-2018.01.10","_type": "test","_id": "44","_score": 1,"_source": { "dotted-name-by-zone": "part-exchange-api.not-set.by-zone.not-implemented-yet.jvm.threads.terminated.count","dotted-name-by-server": "part-exchange-api.not-set.by-server.mcr-al33030.jvm.threads.terminated.count","logstash-shipping-production": "false","event-type": "gauge","server-domain-name": "MCR-AL33030","server-identifier": "mcr-al33030","logstash-shipping-deployment-revision": "1029","server-ip": "not-implemented-yet","type": "metrics20-event","version": "not-set","logstash-shipping-version": "5.6.1","tags": [ "metrics20-event"],"logstash-shipping-hostname": "nonprodlogstashshipping8.node.dc2.consul","environment": "not-set","@timestamp": "2018-01-11T10:42:47.000Z","application": "part-exchange-api","@version": "1","host": "172.28.224.135","name": "jvm.threads.terminated.count","dotted-name-by-all": "part-exchange-api.not-set.by-all.jvm.threads.terminated.count","value": 0}}]
}
}

Stono on 11 Jan 2018

@Stono could you include the queries you're using as well as the results?

DaveCTurner on 11 Jan 2018

@Stono @DaveCTurner I don't suspect this to be the same issue, I think you only didn't set the "size" parameter. Your result contains 10 hits which is the default fetch size, total.hits should give you an idea about all the documents that matched. I don't think we should discuss this in this issue. If you have questions, would you please open a question in out Discourse forum? I'd prefer to keep this issue concentrated on the original issue which is complicated enough as it stands. Thanks.

cbuescher on 11 Jan 2018

@serj-p thanks for reporting this, can you check if the following things also apply to make sure this is a similar issue:

[ids] query gives different total.hits than if you filter for the id with e.g. a terms query
does this happen for you always or just after a certain time of running the nodes in the cluster?
after this happened for the first time, does it reproduce for certain fixed ids?

If this is the case I'd like to collect similarities/differences with the issue reported by @JagathJayasinghe to try and pin the cause down a bit more. It would also be great if you'd go over some of the findings we made above and try them and see if this helps in your case (e.g. individual node restarts etc...)

cbuescher on 11 Jan 2018

@cbuescher : the index only contains 10 documents as that's all I indexed in test data. @DaveCTurner the query was the same in both metricstest/_search?pretty=true&q=*:*

Stono on 11 Jan 2018

@stono yes but thats a match-all query, your hits all have different ids. Thats different from the issue discussed here. If you want to discuss this please open a forum thread or another issue with the data to reproduce this if you suspect a bug.

cbuescher on 11 Jan 2018

👍1

@cbuescher ahhh I see, thank you :-)

Stono on 11 Jan 2018

👍1

Hi @cbuescher
We did notice at one point that once the cluster was in this invalid state we were able to create a new index on our cluster and even before indexing anything we were getting bogus totals back.
https://github.com/elastic/elasticsearch/issues/25603#issuecomment-332237106

So to me the issue @Stono is seeing does look somewhat related to me ?
He is saying he only has 10 documents in the index ? so even a match all query should only return hits.total of 10 ? yet its returning 50 ?

dannymurphygfk on 15 Jan 2018

@dannymurphygfk I rember that issue with the empty cluster, however from the brief example @Stono described I suspect there is something else going on at indexing time, thats why I suggested discussing it on a different issue/threat to not make this issue longer than it already is without need. If it turns out it is related I would link back to this issue.

cbuescher on 15 Jan 2018

@cbuescher

[ids] query gives different total.hits than if you filter for the id with e.g. a terms query

we have id field indexed separately. query {"query":{"terms":{"id":["599c3d28f9f00fdc447e763c"]}}}
gives 1 hit and 908,928,903... other random number ~900 total. ids query behaves in the same random way. it seems like counts are broken not only for ids query, but here it's the most obvious. right
{"query":{"match":{"name":"Marketing"}}} gives from 7 to 50 total, I am sure this value can't change that frequently.

does this happen for you always or just after a certain time of running the nodes in the cluster?

i can't say as i cannot restart nodes in production in order to see when it appears again. I don't see this issue in test environment with 2 search server nodes(32 shards).

after this happened for the first time, does it reproduce for certain fixed ids?

All requests I've checked give close to random total value.

UPDATE

when I specify index name and type in request, query {"query":{"match":{"name":"Marketing"}}} gives persistent result which seems to be right.
in the same conditions ids and terms of ids queries give random results

serj-p on 16 Jan 2018

query {"query":{"terms":{"id":["599c3d28f9f00fdc447e763c"]}}}

Can you post the complete search request, including the endpoint URL (with index, types etc...) you are using please, together with a "good" and a "bad" result. Are you querying across multiple indices or types? If so how many do you have, which ones do you expect the request to hit?

Can you do the same for a good and bad "ids" query result please? The above terms query isn't really

Does specifying "size: 0" in the requests change anything about the erroneous total.hits like it does for @dannymurphygfk in https://github.com/elastic/elasticsearch/issues/25603#issuecomment-333148138?

cbuescher on 17 Jan 2018

@cbuescher

http://search.prod.com:9200/contacts_v3.3/contact/_search -d '{"query":{"terms":{"id": "599c3d28f9f00fdc447e763c"]}}}'

results in

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 32,
    "successful": 32,
    "failed": 0
  },
  "hits": {
    "total": 903,
    "max_score": 13.536535,
    "hits": [
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "599c3d28f9f00fdc447e763c",
        "_score": 13.536535,
        "_source": {
          "updated": "2017-08-22T14:52:57+0000",
          "name": "Marketing Manager",
          "created": "2017-08-22T14:18:16+0000",
          "company_id": "50a8822bb544310ba33694ee",
          "is_account": true,
          "type": "contact",
          "id": "599c3d28f9f00fdc447e763c",
          "owner_id": "576670db2007d05cd450e067"
        }
      }
    ]
  }
}

http://search.prod.com:9200/contacts_v3.3/contact/_search -d '{"query":{"ids":{"values":["599c3d28f9f00fdc447e763c"]}}}'

in

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 32,
    "successful": 32,
    "failed": 0
  },
  "hits": {
    "total": 928,
    "max_score": 7.933596,
    "hits": [
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "599c3d28f9f00fdc447e763c",
        "_score": 1.0,
        "_source": {
          "updated": "2017-08-22T14:52:57+0000",
          "name": "Marketing Manager",
          "created": "2017-08-22T14:18:16+0000",
          "company_id": "50a8822bb544310ba33694ee",
          "is_account": true,
          "type": "contact",
          "id": "599c3d28f9f00fdc447e763c",
          "owner_id": "576670db2007d05cd450e067"
        }
      }
    ]
  }
}

it seems it doesn't metter whether i provide index/type or not

http://search.prod.com:9200/contacts_v3.3/contact/_search -d '
{
  "query": {
    "term": {
      "last name.lc": "safaa"
    }
  },
  "stored_fields": [],
  "size": "100"
}'

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 32,
    "successful": 32,
    "failed": 0
  },
  "hits": {
    "total": 814,
    "max_score": 13.097227,
    "hits": [
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "52d71f0178d200148c2c4a53",
        "_score": 13.097227
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "59f16f7a950fa207340bc193",
        "_score": 13.096043
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "50c357d40f4bd72b9b003dbe",
        "_score": 13.089661
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "56418b0b000efa7efa9961cd",
        "_score": 13.087736
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "52a9b02578d2001e40fa86ef",
        "_score": 13.086037
      }
    ]
  }
}

http://search.prod.com:9200/_search -d '
{
  "query": {
    "term": {
      "last name.lc": "safaa"
    }
  },
  "stored_fields": [],
  "size": "100"
}'

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 32,
    "successful": 32,
    "failed": 0
  },
  "hits": {
    "total": 814,
    "max_score": 13.097229,
    "hits": [
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "52d71f0178d200148c2c4a53",
        "_score": 13.097229
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "59f16f7a950fa207340bc193",
        "_score": 13.096043
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "50c357d40f4bd72b9b003dbe",
        "_score": 13.089661
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "56418b0b000efa7efa9961cd",
        "_score": 13.087733
      },
      {
        "_index": "contacts_v3.3",
        "_type": "contact",
        "_id": "52a9b02578d2001e40fa86ef",
        "_score": 13.086031
      }
    ]
  }
}

reducing size to 0 does fix the issue:

{
  "query": {
    "term": {
      "last name.lc": "safaa"
    }
  },
  "stored_fields": [ ],
  "size": "0"
}

{"took":2,"timed_out":false,"_shards":{"total":32,"successful":32,"failed":0},"hits":{"total":5,"max_score":0.0,"hits":[]}}

same for ids query

{"query":{"ids":{"values":["599c3d28f9f00fdc447e763c"]}},"size": 0}

{"took":7,"timed_out":false,"_shards":{"total":32,"successful":32,"failed":0},"hits":{"total":1,"max_score":0.0,"hits":[]}}

serj-p on 19 Jan 2018

Any progress on this ?
We don't have any new information but we still have this issue with showing incorrect totals, unless we do a rolling restart every time.

dannymurphygfk on 20 Feb 2018

@dannymurphygfk sorry, no progress on this, but the issue is definitely still on the radar
Adding @elastic/es-search-aggs for wider audience

cbuescher on 13 Mar 2018

@dannymurphygfk I've taken a look through this issue and one thing that stands out to me that we need to eliminate as a cause is We use custom Native scripts for filtering and sorting. especially since the cluster that does not use the plugin containing your native scripts does not show this problem.

Does you Native Script plugin only install scripts that are called as part of your searches or do it also customise other elements of Elasticsearch? (I'd be interested in anything else it does but especially if it adds a custom fetch phase)
You have mentioned a few times that you are simplifying the queries you are posting. Do the queries you are running when you are reproducing the problem use any native scripts or anything form this plugin?
How do you use the native scripts? Can you provide an example request that includes all the uses of your scripts?
Since you are able to reproduce this problem on your Dev cluster, could you please disable the plugin and see if the problem reproduces?

colings86 on 23 Mar 2018

@colings86

The plugin just installs scripts that we use as part of searches... so derive from AbstractDoubleSearchScript or AbstractSearchScript.
We can reproduce the problem with just the simple Ids query using no scripts.
How we use the scripts

We have 5 scripts in total.
2 are derived from AbstractDoubleSearchScript and we use for sorting and aggregations.
3 derived from AbstractSearchScript used for filtering

Our documents contains a summary of each of the nested docs data as an array of CSVs

NestedMeta:["nestedDoc1,,,,,,,,", "nestedDoc2,,,,,,,,,", ",,,,,,,,,", ",,,,,,,,,", ",,,,,,,,,", ",,,,,,,,,", ",,,,,,,,,", ",,,,,,,,,"]

The sorting scripts and 2 of the filtering scripts use this data to generate a value depending on the various parameters we pass it.

"sort": [{
        "_script": {
            "type": "number",
            "script": {
                "inline": "sortScript",
                "params": {
                    "param1": 99999999,
                    "param2": false,
                    "param3": "110",
                    "param4": {
                        "param4_1": 636575194711828737
                    }
                    ...
                },
                "lang": "native"
            },
            "order": "asc"
        }
    }
],
"query": {
    "bool": {
        "must": [
        ...
        ],
        "filter": [{
                "script": {
                    "script": {
                        "inline": "filterScript",
                        "lang": "native",
                        "params": {
                            "param1": 99999999,
                            "param2": false,
                            "param3": "110",
                            "param4": {
                                "param4_1": 636575194711828737
                            }
                            ...
                        }
                    }
                }
            }
        ]
    }
},
"aggs": {
    "scriptGeneratedAggs": {
        "stats": {
            "script": {
                "inline": "sortScript",
                "params": {
                    "param1": 99999999,
                    "param2": false,
                    "param3": "110",
                    "param4": {
                        "param4_1": 636575194711828737
                    }
                    ...
                },
                "lang": "native"
            }
        }
    }
}

The last filtering script operates on the nested documents and uses a similar approach, with a field on each nested document with summary data on its sibling nested documents.

Unfortunately we no longer have the dev cluster that reproduced the issue. If we notice the issue on our new dev cluster I will try disabling the native scripts then.

dannymurphygfk on 26 Mar 2018

Hi, just wanted to add a note, this issue is definitely still happening, my report is:

single node (it's standard AWS ES cluster 1 instance of t2.medium.elasticsearch)
we have docs that are arguably 'complex': arrays of data inside, text, array of objects
we do deletes (let's say we've done 10~ deletes in the last couple of days)
the index is not big (1000-2000 docs)
the counts are ok when you search specifying a field equals to something
the counts are wrong on REST/node.js API but right on Kibana
we've fixed the counts after a good day of swearing by adding a workaround, when we search to count ALL the docs we still put a query, in our indexes 'createdAt' field is always present, so an _exists:createdAt does the trick

No idea (and no time right now to properly figure this out) but the smell are the deletions and the 'complex' docs. A delete doesn't seem to propagate properly.

Bear in mind, I've also tested getting all the docs out, and deleted docs are coming out (I was pulling just the _ids)

~~It's pretty crazy and to be quite frank, I am glad ES is not our primary store because something like this should never happen on a production-ready database.~~

Update 06/06/18 : next time I won't jump into conclusion, it was our misuse of ES

aterreno on 5 Jun 2018

@aterreno Thanks for the additional information. Could you also share the version you are using and whether you have nested fields in your mappings and what plugins are installed?

the counts are wrong on REST/node.js API but right on Kibana

This one is intriguing since Kibana is supposed to consume the REST API. Also when you say REST, do you mean that you are using curl to query Elasticsearch and check total hit counts? When you look at the query that Kibana sends to Elasticsearch, is there any difference with the ones that you are sending via curl?

the counts are ok when you search specifying a field equals to something

So what is the query that reproduces the issue? Is it an ids query like @JagathJayasing or can you also reproduce with other queries?

Bear in mind, I've also tested getting all the docs out, and deleted docs are coming out

How did you do pull docs out? Using a match_all query? With scroll?

It's pretty crazy and to be quite frank, I am glad ES is not our primary store because something like this should never happen on a production-ready database.

Agreed. It is a very embarassing bug.

jpountz on 5 Jun 2018

@aterreno can you provide the way you query elasticsearch:

do you have more than one index in the cluster?
do you specify an index pattern in kibana?

it would be very very helpful if you could provide a sample request and the response.

s1monw on 6 Jun 2018

Thanks for your prompt replies @s1monw & @jpountz, I've wrote a bash script to reproduce:

ES_CLUSTER=
INDEX=

curl -s -k "$ES_CLUSTER/$INDEX/_stats" | jq . | grep -A3 $INDEX | grep count 
#1597 (same as node api)

curl -s -k "$ES_CLUSTER/_cat/count/$INDEX" | awk '{$1=$2=""; print $0}'
#1597 (same as node api)

curl -s -k "$ES_CLUSTER/$INDEX/_count?q=email:*&preference=_primary&ignore_unavailable" | jq .count
#1587 (one more than kibana)

curl -s -k "$ES_CLUSTER/$INDEX/_count?q=email:*" | jq .count
#1587 (one more than kibana)

curl -s -k "$ES_CLUSTER/$INDEX/_search?q=email:*" | jq .hits.total 
#1587 (one more than kibana)

curl -k -s "$ES_CLUSTER/_msearch" -H 'Content-Type: application/json' -d'
{"index":["'"$INDEX"'"]}
{"version":true,"size":2000,"_source":{"excludes":[]},"query":{"bool":{"must":[{"match_all":{}}],"filter":[],"should":[],"must_not":[]}}}
' | jq .responses | jq .[0].hits.total
#1597 (same as node api)

curl -k -s "$ES_CLUSTER/_msearch" -H 'Content-Type: application/json' -d'
{"index":["'"$INDEX"'"],"ignore_unavailable":true,"preference":1528210810366}
{"version":true,"size":500,"sort":[{"updatedAt":{"order":"desc","unmapped_type":"boolean"}}],"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"updatedAt","interval":"1M","time_zone":"Europe/London","min_doc_count":1}}},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["cads.created","cads.mfgTime","createdAt","emails.createdAt","notes.createdAt","postDate","switches.createdAt","switches.updatedAt","updatedAt","userIssues.createdAt","userIssues.resolvedAt"],"query":{"bool":{"must":[{"match_all":{}},{"range":{"updatedAt":{"gte":1370444411553,"lte":1528210811554,"format":"epoch_millis"}}}],"filter":[],"should":[],"must_not":[]}},"highlight":{"pre_tags":["@kibana-highlighted-field@"],"post_tags":["@/kibana-highlighted-field@"],"fields":{"*":{}},"fragment_size":2147483647}}
' | jq .responses | jq .[0].hits.total
#1564

I am not an ES expert so I might be missing something, but I just don't understand how these 'similar' queries return different counts on the same index.

To reply to @s1monw specifally:

yes we do have a few indexes in the cluster
yes we did specify an index pattern in kibana

aterreno on 6 Jun 2018

On Tue, Jun 5, 2018 at 6:21 PM Adrien Grand notifications@github.com
wrote:

@aterreno https://github.com/aterreno Thanks for the additional
information. Could you also share the version you are using and whether you
have nested fields in your mappings and what plugins are installed?

the counts are wrong on REST/node.js API but right on Kibana

This one is intriguing since Kibana is supposed to consume the REST API.
Also when you say REST, do you mean that you are using curl to query
Elasticsearch and check total hit counts? When you look at the query that
Kibana sends to Elasticsearch, is there any difference with the ones that
you are sending via curl?

Yes, I hope that the BASH script would help explaining the issue

the counts are ok when you search specifying a field equals to something

So what is the query that reproduces the issue? Is it an ids query like
@jagathjayasing or can you also reproduce with other queries?

Bear in mind, I've also tested getting all the docs out, and deleted docs
are coming out

How did you do pull docs out? Using a match_all query? With scroll?

A simple search was returning also delted docs, to add more info, we
deleted the index, worked on it for a day or so and deleted about 10 docs,
a plain match all was returning them.

It's pretty crazy and to be quite frank, I am glad ES is not our primary
store because something like this should never happen on a production-ready
database.

Agreed. It is a very embarassing bug.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/elastic/elasticsearch/issues/25603#issuecomment-394791480,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEpCm6Dar5xcYtapOYwRJIsr_aNJZQ6ks5t5r4dgaJpZM4ORM4Y
.

aterreno on 6 Jun 2018

Just to add some more details... I am pasting here the json schema... I hope it helps to understand the problem or reproducing it. It's indeed a farily complex (perhaps overcompliated, something we need to work on) document structure.

{
  "$id": "http://example.com/example.json", 
  "type": "object", 
  "definitions": {}, 
  "$schema": "http://json-schema.org/draft-07/schema#", 
  "properties": {
    "_index": {
      "$id": "/properties/_index", 
      "type": "string"
    }, 
    "_type": {
      "$id": "/properties/_type", 
      "type": "string"
    }, 
    "_id": {
      "$id": "/properties/_id", 
      "type": "string"
    }, 
    "_score": {
      "$id": "/properties/_score", 
      "type": "integer"
    }, 
    "_source": {
      "$id": "/properties/_source", 
      "type": "object", 
      "properties": {
        "email": {
          "$id": "/properties/_source/properties/email", 
          "type": "string"
        }, 
        "channel": {
          "$id": "/properties/_source/properties/channel", 
          "type": "string"
        }, 
        "createdAt": {
          "$id": "/properties/_source/properties/createdAt", 
          "type": "string"
        }, 
        "info": {
          "$id": "/properties/_source/properties/info", 
          "type": "object", 
          "properties": {
            "addressesHistory": {
              "$id": "/properties/_source/properties/info/properties/addressesHistory", 
              "type": "array", 
              "items": {
                "$id": "/properties/_source/properties/info/properties/addressesHistory/items", 
                "type": "object", 
                "properties": {
                  "nMonths": {
                    "$id": "/properties/_source/properties/info/properties/addressesHistory/items/properties/nMonths", 
                    "type": "integer"
                  }, 
                  "address": {
                    "$id": "/properties/_source/properties/info/properties/addressesHistory/items/properties/address", 
                    "type": "string"
                  }, 
                  "postcode": {
                    "$id": "/properties/_source/properties/info/properties/addressesHistory/items/properties/postcode", 
                    "type": "string"
                  }
                }
              }
            }, 
            "bankDetails": {
              "$id": "/properties/_source/properties/info/properties/bankDetails", 
              "type": "object"
            }, 
            "contact": {
              "$id": "/properties/_source/properties/info/properties/contact", 
              "type": "object", 
              "properties": {
                "title": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/title", 
                  "type": "string"
                }, 
                "firstname": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/firstname", 
                  "type": "string"
                }, 
                "lastname": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/lastname", 
                  "type": "string"
                }, 
                "email": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/email", 
                  "type": "string"
                }, 
                "phone": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/phone", 
                  "type": "string"
                }, 
                "homeAddress": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/homeAddress", 
                  "type": "object", 
                  "properties": {
                    "address": {
                      "$id": "/properties/_source/properties/info/properties/contact/properties/homeAddress/properties/address", 
                      "type": "string"
                    }, 
                    "postcode": {
                      "$id": "/properties/_source/properties/info/properties/contact/properties/homeAddress/properties/postcode", 
                      "type": "string"
                    }, 
                    "ehlAddressId": {
                      "$id": "/properties/_source/properties/info/properties/contact/properties/homeAddress/properties/ehlAddressId", 
                      "type": "string"
                    }
                  }
                }, 
                "manualAddress": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/manualAddress", 
                  "type": "object"
                }, 
                "billingAddress": {
                  "$id": "/properties/_source/properties/info/properties/contact/properties/billingAddress", 
                  "type": "object"
                }
              }
            }, 
            "loginLink": {
              "$id": "/properties/_source/properties/info/properties/loginLink", 
              "type": "string"
            }, 
            "personal": {
              "$id": "/properties/_source/properties/info/properties/personal", 
              "type": "object", 
              "properties": {
                "broadband": {
                  "$id": "/properties/_source/properties/info/properties/personal/properties/broadband", 
                  "type": "boolean"
                }, 
                "ownership": {
                  "$id": "/properties/_source/properties/info/properties/personal/properties/ownership", 
                  "type": "string"
                }, 
                "passwordHint": {
                  "$id": "/properties/_source/properties/info/properties/personal/properties/passwordHint", 
                  "type": "string"
                }, 
                "dob": {
                  "$id": "/properties/_source/properties/info/properties/personal/properties/dob", 
                  "type": "string"
                }, 
                "employmentStatus": {
                  "$id": "/properties/_source/properties/info/properties/personal/properties/employmentStatus", 
                  "type": "string"
                }, 
                "specialNeeds": {
                  "$id": "/properties/_source/properties/info/properties/personal/properties/specialNeeds", 
                  "type": "string"
                }
              }
            }
          }
        }, 
        "meters": {
          "$id": "/properties/_source/properties/meters", 
          "type": "object", 
          "properties": {
            "userMeter": {
              "$id": "/properties/_source/properties/meters/properties/userMeter", 
              "type": "object", 
              "properties": {
                "smartMeter": {
                  "$id": "/properties/_source/properties/meters/properties/userMeter/properties/smartMeter", 
                  "type": "string"
                }, 
                "installingSupplier": {
                  "$id": "/properties/_source/properties/meters/properties/userMeter/properties/installingSupplier", 
                  "type": "string"
                }, 
                "sameSupplier": {
                  "$id": "/properties/_source/properties/meters/properties/userMeter/properties/sameSupplier", 
                  "type": "string"
                }
              }
            }, 
            "serials": {
              "$id": "/properties/_source/properties/meters/properties/serials", 
              "type": "object"
            }, 
            "gbgInfo": {
              "$id": "/properties/_source/properties/meters/properties/gbgInfo", 
              "type": "object"
            }
          }
        }, 
        "preferences": {
          "$id": "/properties/_source/properties/preferences", 
          "type": "object", 
          "properties": {
            "creditCheck": {
              "$id": "/properties/_source/properties/preferences/properties/creditCheck", 
              "type": "string"
            }, 
            "customerService": {
              "$id": "/properties/_source/properties/preferences/properties/customerService", 
              "type": "string"
            }, 
            "fixedContract": {
              "$id": "/properties/_source/properties/preferences/properties/fixedContract", 
              "type": "string"
            }, 
            "fuelSource": {
              "$id": "/properties/_source/properties/preferences/properties/fuelSource", 
              "type": "string"
            }, 
            "fuelType": {
              "$id": "/properties/_source/properties/preferences/properties/fuelType", 
              "type": "string"
            }, 
            "paymentMethod": {
              "$id": "/properties/_source/properties/preferences/properties/paymentMethod", 
              "type": "string"
            }, 
            "smart": {
              "$id": "/properties/_source/properties/preferences/properties/smart", 
              "type": "string"
            }, 
            "supplier": {
              "$id": "/properties/_source/properties/preferences/properties/supplier", 
              "type": "string"
            }, 
            "switch": {
              "$id": "/properties/_source/properties/preferences/properties/switch", 
              "type": "string"
            }, 
            "cancellationFee": {
              "$id": "/properties/_source/properties/preferences/properties/cancellationFee", 
              "type": "string"
            }
          }
        }, 
        "status": {
          "$id": "/properties/_source/properties/status", 
          "type": "string"
        }, 
        "tariff": {
          "$id": "/properties/_source/properties/tariff", 
          "type": "object", 
          "properties": {
            "postcode": {
              "$id": "/properties/_source/properties/tariff/properties/postcode", 
              "type": "string"
            }, 
            "dnoRegion": {
              "$id": "/properties/_source/properties/tariff/properties/dnoRegion", 
              "type": "string"
            }, 
            "currentTariff": {
              "$id": "/properties/_source/properties/tariff/properties/currentTariff", 
              "type": "object", 
              "properties": {
                "elecPaymentMethod": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/elecPaymentMethod", 
                  "type": "string"
                }, 
                "elecPaymentMethodLabel": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/elecPaymentMethodLabel", 
                  "type": "string"
                }, 
                "elecSupplier": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/elecSupplier", 
                  "type": "string"
                }, 
                "elecSupplierId": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/elecSupplierId", 
                  "type": "string"
                }, 
                "elecTariff": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/elecTariff", 
                  "type": "string"
                }, 
                "elecTariffId": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/elecTariffId", 
                  "type": "string"
                }, 
                "gasPaymentMethod": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/gasPaymentMethod", 
                  "type": "string"
                }, 
                "gasPaymentMethodLabel": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/gasPaymentMethodLabel", 
                  "type": "string"
                }, 
                "gasSupplierId": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/gasSupplierId", 
                  "type": "string"
                }, 
                "gasSupplier": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/gasSupplier", 
                  "type": "string"
                }, 
                "gasTariff": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/gasTariff", 
                  "type": "string"
                }, 
                "gasTariffId": {
                  "$id": "/properties/_source/properties/tariff/properties/currentTariff/properties/gasTariffId", 
                  "type": "string"
                }
              }
            }
          }
        }, 
        "updatedAt": {
          "$id": "/properties/_source/properties/updatedAt", 
          "type": "string"
        }, 
        "usage": {
          "$id": "/properties/_source/properties/usage", 
          "type": "object", 
          "properties": {
            "simpleEstimate": {
              "$id": "/properties/_source/properties/usage/properties/simpleEstimate", 
              "type": "object", 
              "properties": {
                "elec": {
                  "$id": "/properties/_source/properties/usage/properties/simpleEstimate/properties/elec", 
                  "type": "string"
                }, 
                "gas": {
                  "$id": "/properties/_source/properties/usage/properties/simpleEstimate/properties/gas", 
                  "type": "string"
                }
              }
            }
          }
        }, 
        "cads": {
          "$id": "/properties/_source/properties/cads", 
          "type": "object"
        }, 
        "emails": {
          "$id": "/properties/_source/properties/emails", 
          "type": "array", 
          "items": {
            "$id": "/properties/_source/properties/emails/items", 
            "type": "object", 
            "properties": {
              "email": {
                "$id": "/properties/_source/properties/emails/items/properties/email", 
                "type": "string"
              }, 
              "createdAt": {
                "$id": "/properties/_source/properties/emails/items/properties/createdAt", 
                "type": "string"
              }, 
              "type": {
                "$id": "/properties/_source/properties/emails/items/properties/type", 
                "type": "string"
              }
            }
          }
        }, 
        "notes": {
          "$id": "/properties/_source/properties/notes", 
          "type": "array"
        }, 
        "switches": {
          "$id": "/properties/_source/properties/switches", 
          "type": "array"
        }, 
        "userIssues": {
          "$id": "/properties/_source/properties/userIssues", 
          "type": "array", 
          "items": {
            "$id": "/properties/_source/properties/userIssues/items", 
            "type": "object", 
            "properties": {
              "category": {
                "$id": "/properties/_source/properties/userIssues/items/properties/category", 
                "type": "string"
              }, 
              "createdAt": {
                "$id": "/properties/_source/properties/userIssues/items/properties/createdAt", 
                "type": "string"
              }, 
              "issue": {
                "$id": "/properties/_source/properties/userIssues/items/properties/issue", 
                "type": "string"
              }, 
              "userIssueId": {
                "$id": "/properties/_source/properties/userIssues/items/properties/userIssueId", 
                "type": "string"
              }, 
              "email": {
                "$id": "/properties/_source/properties/userIssues/items/properties/email", 
                "type": "string"
              }, 
              "subcategory": {
                "$id": "/properties/_source/properties/userIssues/items/properties/subcategory", 
                "type": "string"
              }, 
              "filters": {
                "$id": "/properties/_source/properties/userIssues/items/properties/filters", 
                "type": "object", 
                "properties": {
                  "status": {
                    "$id": "/properties/_source/properties/userIssues/items/properties/filters/properties/status", 
                    "type": "string"
                  }, 
                  "category": {
                    "$id": "/properties/_source/properties/userIssues/items/properties/filters/properties/category", 
                    "type": "string"
                  }, 
                  "subcategory": {
                    "$id": "/properties/_source/properties/userIssues/items/properties/filters/properties/subcategory", 
                    "type": "string"
                  }
                }
              }
            }
          }
        }, 
        "statuses": {
          "$id": "/properties/_source/properties/statuses", 
          "type": "object", 
          "properties": {
            "account": {
              "$id": "/properties/_source/properties/statuses/properties/account", 
              "type": "string"
            }, 
            "meter": {
              "$id": "/properties/_source/properties/statuses/properties/meter", 
              "type": "string"
            }, 
            "smartMeter": {
              "$id": "/properties/_source/properties/statuses/properties/smartMeter", 
              "type": "string"
            }, 
            "issues": {
              "$id": "/properties/_source/properties/statuses/properties/issues", 
              "type": "boolean"
            }, 
            "switches": {
              "$id": "/properties/_source/properties/statuses/properties/switches", 
              "type": "boolean"
            }
          }
        }
      }
    }, 
    "fields": {
      "$id": "/properties/fields", 
      "type": "object", 
      "properties": {
        "createdAt": {
          "$id": "/properties/fields/properties/createdAt", 
          "type": "array", 
          "items": {
            "$id": "/properties/fields/properties/createdAt/items", 
            "type": "string"
          }
        }, 
        "userIssues.createdAt": {
          "$id": "/properties/fields/properties/userIssues.createdAt", 
          "type": "array", 
          "items": {
            "$id": "/properties/fields/properties/userIssues.createdAt/items", 
            "type": "string"
          }
        }, 
        "emails.createdAt": {
          "$id": "/properties/fields/properties/emails.createdAt", 
          "type": "array", 
          "items": {
            "$id": "/properties/fields/properties/emails.createdAt/items", 
            "type": "string"
          }
        }, 
        "updatedAt": {
          "$id": "/properties/fields/properties/updatedAt", 
          "type": "array", 
          "items": {
            "$id": "/properties/fields/properties/updatedAt/items", 
            "type": "string"
          }
        }
      }
    }
  }
}

aterreno on 6 Jun 2018

Thanks for the details, I suspect things are fine in your case actually.

I just don't understand how these 'similar' queries return different counts on the same index

They don't all do the same thing. For intance the query email:* will only match email addresses that have at least on token for the analyzer on this field. For instance if the value of this field is an empty string or a string that only contain characters that the analyzer splits on, then the document won't match.

The last query has a required range query on the updatedAt which probably excludes some documents as well.

jpountz on 6 Jun 2018

But email is the _id, a document won't be inserted without email and I've
exported all docs, mapped only the email, and no dupes or empty... That was
very weird.

The updatedAt is more interesting as, it's not a mandatory field, the way
we fixed our 'node api' was to filter on createdAt _exists_ as that field
is definitely always there.

On Wed, Jun 6, 2018 at 10:20 AM Adrien Grand notifications@github.com
wrote:

Thanks for the details, I suspect things are fine in your case actually.

I just don't understand how these 'similar' queries return different
counts on the same index

They don't all do the same thing. For intance the query email:* will only
match email addresses that have at least on token for the analyzer on this
field. For instance if the value of this field is an empty string or a
string that only contain characters that the analyzer splits on, then the
document won't match.

The last query has a required range query on the updatedAt which probably
excludes some documents as well.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/elastic/elasticsearch/issues/25603#issuecomment-395003373,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEpCiDXwMrtupFsmDQVN04AmUSxyFLfks5t5573gaJpZM4ORM4Y
.

aterreno on 6 Jun 2018

Indeed this index looks healthy, I can't see any issues with these numbers. The email question looks indeed like a tokenization issue.

A simple search was returning also delted docs, to add more info, we
deleted the index, worked on it for a day or so and deleted about 10 docs,
a plain match all was returning them.

was this index refreshed after deleting the docs? I mean ES doesn't refresh it's point in time view once the delete returns. it might do depending on you refresh interval. I am not sure what you are seeing is in any way related to this issue. Also these are all basic usecases and there wasn't a single report like this in the past. Nevertheless, lets find the root cause of your issues...

s1monw on 6 Jun 2018

Thanks @s1monw , I don't want to waste anyone time, it could be we are using ES the wrong way, I forgot to mention that I've also tried to refresh/flush/clear cache before running those curls

ES_CLUSTER=
INDEX=

curl -XPOST -k '$ES_CLUSTER/$INDEX/_cache/clear'
curl -XPOST -k '$ES_CLUSTER/$INDEX/_flush'     
curl -XPOST -k '$ES_CLUSTER/$INDEX/_refresh'

Would that be sufficient?

I can try to go into further details with the sequence of events that happen on that index, but that will take me some time...

aterreno on 6 Jun 2018

You should be able to look at the problematic documents by running a query like that:

GET $INDEX/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "must_not": { "query_string": { "query": "email:*" } }
    }
  }
}

Their email field will likely be only made of characters that your analyzer splits on.

jpountz on 6 Jun 2018

Bingo, that returns the deleted users, plus one document which has a wrong schema.

ES_CLUSTER=
INDEX=

curl -s -k $ES_CLUSTER/$INDEX/_search -H 'Content-Type: application/json'  -d'
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "must_not": { "query_string": { "query": "email:*" } }
    }
  }
}' | jq .hits.hits |  jq 'map(._id)'

I don't understand why though, the doc with the wrong schema is 'on us', that needs to go but the other ones really should be the same as the other ones.

Now, if I try to delete those again, from curl I get:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Rejecting mapping update to [$INDEX] as the final mapping would have more than 1 type: [_doc, user]"}],"type":"illegal_argument_exception","reason":"Rejecting mapping update to [$INDEX] as the final mapping would have more than 1 type: [_doc, user]"},"status":400}%

What does it mean?
And how then I can cleanup this index by properly remove this 'dirty data' from it?

Thank you so much!

aterreno on 6 Jun 2018

Now, if I try to delete those again, from curl I get:

can you share the curl command?

s1monw on 6 Jun 2018

It's pretty crazy and to be quite frank, I am glad ES is not our primary store because something like this should never happen on a production-ready database.

@aterreno I do wonder if you, give that we figured out what's going on and it's not nearly as terrible as we thought it is, want to reiterate on this statement?

s1monw on 6 Jun 2018

❤4

@s1monw the curl is the default, as per docs:
curl -X DELETE "localhost:9200/twitter/_doc/1"

anything wrong with that?

Answering the other question, I might have been harsh, but I am still quite happy that it's not my primary data store TBH, I've used tons of different SQL and noSQL solutions in my career and I don't understand why ES is not as straightforward, maybe something missing in the docs?

If I have a document in a database, no matter what the token analyser is, shouldn't I be able to just delete it? And if deleted it should be gone for good.

If the curl above is right, I can finally see the root cause of the issue, assuming that http request is what is behind the scenes from the ES node.js client, the lambdas are failing at deleting those docs.

Sorry, I might need to read an ES Bible before pontificating here but it's still not a behaviour that I'd expect from a data store.

Or we are using it completely wrong and we shouldn't tokenise the _id, is that the biggest mistake we could possibly do with ES?

aterreno on 6 Jun 2018

Given the error message, you probably need to run curl -X DELETE "localhost:9200/twitter/user/1" instead.

jpountz on 6 Jun 2018

Sorry, I might need to read an ES Bible before pontificating here but it's still not a behaviour that I'd expect from a data store.

please keep in mind elasticsearch is a search engine and you are indexing documents that get tokenized. Everything worked as expected and all the data is available. I do understand that coming from a datastore that is not a search engine is difficult and needs some training and reading. I want to understand why using a different technology that you haven't got enough experience and causing you trouble warrant such a statement. We take issues like this very serious, we jumped on it and resolved it in literally no time. All issues we found were on your end. I would have really appreciated you correcting your statement. If you are happy you using a different technology because you are more familiar with it, I am more than fine with that. I think you should live up to your comment and revert it, there are many people working very hard to build a good product, they don't deserve statements like this.

Or we are using it completely wrong and we shouldn't tokenise the _id, is that the biggest mistake we could possibly do with ES?

yes if you send ES text it will tokenize it by default as I would expect from a search engine. If your email field is the ID you should make it a keyword field.

s1monw on 6 Jun 2018

❤3

Thanks @jpountz that fixed the docs that were still 'hanging', I'll check the node code (which I didn't write) to do the delete to understand why they didn't get deleted in the first place.
I'll also keep an eye on eventual docs that don't get deleted, until numbers are low I can diff the docs and figure out why the models are 'upsetting' ES.
We might protect ourselves further by using some sort of schema validation before inserting into ES to avoid this sort of issues.
I appreciate a lot the support guys, thanks again.

aterreno on 6 Jun 2018

@s1monw I've updated my comment, and I apologise for the comment, I've appreciated a lot the support.

Given that there's no BUG on ES side I'll do the necessary reading to figure out how to both using the uax_url_email tokenizer (as we are doing) but also using email as keyword for _id.

It's still not 100% clear (but I'll figure it out myself, don't want to waste any further time of your / your team time) why those docs didn't get deleted in the first place as the others.

aterreno on 6 Jun 2018

❤3 👍2

thanks I really appreciate that @aterreno

s1monw on 6 Jun 2018

👍2

no worries @s1monw keep the good work :-) I can prove I (still) love ES (https://medium.com/@javame/visualise-all-things-82adc32bcf64 posted just last night ;)

aterreno on 6 Jun 2018

👍3 ❤2

I don't suppose there was any progress on this issue ? Its rearing its head for us again.

murfee25 on 1 Feb 2019

Technically is not an issue, and it should probably be closed, Simon
explained very well the 'odds' of ES, this is the way it supposed to work,
as he said (and I'll never forget that lesson) it's a search engine not a
database ;)

On Fri, Feb 1, 2019 at 11:46 AM murfee25 notifications@github.com wrote:

I don't suppose there was any progress on this issue ? Its rearing its
head for us again.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/elastic/elasticsearch/issues/25603#issuecomment-459697194,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEpCnTeDiSOx_9vvsWbNeZVxMICVNMjks5vJCkmgaJpZM4ORM4Y
.

aterreno on 1 Feb 2019

I think your issue was different to the original issue...
We are querying the search engine for a specific id and it finds it for us... but the totals it says for the query are wildly out.
e.g. instead of totals :1 we get totals: 319 ?

murfee25 on 1 Feb 2019

Its rearing its head for us again.

Sorry, I cannot find your user handle on this issue before. Do you mean you are with one of the parties involved with this issue so for or are you seeing some similar behaviour? In the later case it would probably be great to open a new issue and link this one so we can first check if and how they are related before mixing this already quite long thread with another case?

If you're refering to the same systems that @dannymurphygfk and @JagathJayasinghe were facing this issue then could you probably start by recapping if you were having trouble in the meantime, which ES versions you are running now or did everything stay the same?

cbuescher on 1 Feb 2019

@cbuescher, Hi sorry yes, didn't realise I was logged in with a different handle... it is rearing its head for us again.

Still Running 5.6.4 and everything else still the same.

dannymurphygfk on 1 Feb 2019

@dannymurphygfk okay, sorry to hear that you are still having trouble with this, there is indeed very little for us to go with since we are unable to reproduce still. Did you manage to rule out the possibility of your custom plugin interfering like @colings86 suggested in https://github.com/elastic/elasticsearch/issues/25603#issuecomment-375631978 by disabling the scripts on your new dev cluster?

cbuescher on 1 Feb 2019

Unfortunately that wasn't really an option to disable those scripts as it would have disabling them for an indefinite period of time, which would have rendered the cluster unusable from our product. The cluster is used by our Beta/QA systems.

dannymurphygfk on 1 Feb 2019

I'm having the same issue. Fortunately, it runs in an off-line job queue and I can just fail the job when this behavior is exhibited. The job gets resubmitted for another try and eventually works -- usually on the second try.

Here's the hits object captured from my debugger:

{
  "total": 48,
  "max_score": 507.43854,
  "hits": []
}

Note: Usually the query will return with the hits object correctly populated but the same query with the same values will occasionally exhibit the anomalous behavior.

I've been having the issue since 5.4. I can't reproduce it on demand. When it breaks in my debugger, I'll copy / paste the generated JSON to Kibana Console and run it with expected results.

I migrated the index over when I upgraded to ES 6.6 -- I did not rebuild it -- and I have the same intermittent behavior there, too.

Here's some context:

ES 5.4 on Ubuntu 14, 32GB 8 processors, 3 nodes
ES 6.6 on Ubuntu 18, 16GB 4 processors, 7 nodes (migrated index with S3 snapshot / restore from the 5.4 cluster)
Using elasticsearch.js library in node.js v8.11.2
size is always 1
no from: / to: specification
5 shards, 2 replicas
The target index is largely static with low-frequency maintenance.

Here's the query (old _type style):

GET /address/segment/_search
{
  "size": 1,
  "_source": [
    "segmentmid.coordinates",
    "ziptype",
    "street",
    "placename",
    "usps_prefname",
    "placename",
    "admin_name",
    "altplacename",
    "localadmin",
    "borough",
    "locality",
    "neighbourhood",
    "altneighbourhood",
    "countryiso2",
    "county",
    "verifiedlocations.housenumber",
    "state",
    "region",
    "altregion",
    "street_pref",
    "street",
    "verifiedlocations.postcode",
    "postcode"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "common": {
                  "street_pref": {
                    "query": "PIONEER",
                    "_name": "street",
                    "boost": 5
                  }
                }
              },
              {
                "common": {
                  "street": {
                    "query": "PIONEER",
                    "_name": "street",
                    "boost": 1
                  }
                }
              },
              {
                "common": {
                  "street_pref": {
                    "query": "PIONEER BLVD",
                    "_name": "street",
                    "boost": 5
                  }
                }
              },
              {
                "common": {
                  "street": {
                    "query": "PIONEER BLVD",
                    "_name": "street",
                    "boost": 1
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "multi_match": {
                  "query": "ARTESIA",
                  "fields": [
                    "usps_prefname",
                    "placename",
                    "admin_name",
                    "altplacename",
                    "localadmin",
                    "borough",
                    "locality",
                    "neighbourhood",
                    "altneighbourhood"
                  ],
                  "operator": "and",
                  "fuzziness": 0,
                  "_name": "city",
                  "boost": 12
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "multi_match": {
                  "query": "CA",
                  "fields": [
                    "state",
                    "region",
                    "altregion"
                  ],
                  "operator": "and",
                  "fuzziness": 0,
                  "_name": "state",
                  "boost": 7.5
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "multi_match": {
                  "query": "90701",
                  "fields": [
                    "verifiedlocations.postcode",
                    "postcode"
                  ],
                  "operator": "and",
                  "fuzziness": 0,
                  "_name": "zip",
                  "boost": 16.6
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "multi_match": {
                  "query": "US",
                  "fields": [
                    "countryiso2"
                  ],
                  "operator": "and",
                  "fuzziness": 0,
                  "_name": "countryiso2",
                  "boost": 2
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Here's the proper / desired result returned from Kibana console:

{
  "took" : 96,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 48,
    "max_score" : 507.43854,
    "hits" : [
      {
        "_index" : "address",
        "_type" : "segment",
        "_id" : "LID1101576711766_1",
        "_score" : 507.43854,
        "_source" : {
          "countryiso2" : "US",
          "street_pref" : "Pioneer Blvd",
          "street" : [
            "Pioneer",
            "Pioneer Blvd",
            "Pioneer Boulevard"
          ],
          "postcode" : "90701",
          "usps_prefname" : "ARTESIA",
          "state" : "CA",
          "admin_name" : "Artesia",
          "placename" : [
            "Artesia",
            "ARTESIA",
            "CERRITOS"
          ],
          "segmentmid" : {
            "coordinates" : [
              -118.08214449992698,
              33.862094500000296
            ]
          },
          "ziptype" : "Default"
        },
        "matched_queries" : [
          "zip",
          "state",
          "countryiso2",
          "city",
          "street"
        ]
      }
    ]
  }
}

--Ray

RayRenteria on 4 Mar 2019

I've been having the issue since 6.1.3

yc1024 on 11 May 2019

Still having this issue in 2020, using version 6.8.

Any solutions?

aleiakkim on 9 Nov 2020

I think we're mixing lots of different issues/problems here so I am going to close this issue. The original issue was about a terms query on the _id field that should match a single document but returns a hits.total greater than 1.
We couldn't reproduce this specific issue in 6.x and 7.x so please reopen if you have a reproducible case.

Still having this issue in 2020, using version 6.8.

Please make sure that the issue is the same as the original one and if not, don't hesitate to open a new issue that describes the problem and how to reproduce.

jimczi on 9 Nov 2020

Elasticsearch: The hits.total differs from actual number of documents in `ids` query

Most helpful comment

All 105 comments

PUT my_index

POST my_index/_search

POST my_index/_search

POST my_index/_search?preference=_shards:0

POST my_index/_search?preference=_shards:1

POST my_index/_search?preference=_shards:2

POST my_index/_search?preference=_shards:3

POST my_index/_search?preference=_shards:4

UPDATE

Related issues