Elasticsearch: ES 2.2.1 strange document count in _cat/indices

Created on 24 Mar 2016 · 20 comments · Source: elastic/elasticsearch

Elasticsearch version: 2.2.1
JVM version: Oracle Java 8 (1.8.0_74)
OS version: Ubuntu Server 14.04.4 LTS
Description of the problem including expected versus actual behavior: /_cat/indices reports twice the number of docs (docs.count column) that actually exist in the index.

curl -XGET 'http://127.0.0.1:9200/_cat/indices?v'
health status index                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index-1doc              2   1          2            0     39.9kb         19.9kb
green  open   index-1doc-1s0r         1   0          2            0     19.8kb         19.8kb
green  open   index-10000doc-7s3r     7   3      20000            0     76.3mb           19mb

Steps to reproduce:

  1. Feed a new index with N documents using bulk insert
  2. Query /_cat/indices and check docs.count column
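The two steps above can be sketched as a small shell script (the index name `index-test`, type `t`, and document count are placeholders, and a local node on port 9200 is assumed):

```shell
#!/bin/sh
# Build a bulk request body: one action line plus one source line per document.
N=1000
: > request
i=1
while [ "$i" -le "$N" ]; do
  printf '{ "index": { "_index": "index-test", "_type": "t" } }\n' >> request
  printf '{ "f": "v" }\n' >> request
  i=$((i + 1))
done
# Send the bulk body, then inspect the docs.count column
# (these calls are skipped gracefully if no local node is listening).
curl -sS -XPOST localhost:9200/_bulk --data-binary "@request" > /dev/null 2>&1 || true
curl -sS -XGET 'localhost:9200/_cat/indices?v' 2>/dev/null || true
```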
Labels: >docs, help wanted

All 20 comments

If I had to guess, that is the total number of documents across all shards. IIRC _cat/indices has done this for a long time, presumably because it is an index-metadata-level thing. I agree it is confusing, though.

If I had to guess, that is the total number of documents across all shards.

I don't think that's what it does; we only sum up the documents over the primary shards. See RestIndicesAction.

This does not replicate for me. On a fresh two-node cluster (to ensure that replicas are allocated):

$ curl -XPOST localhost:9200/i/t/1?pretty=1 -d '{ "f":"v" }'
{
  "_index" : "i",
  "_type" : "t",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   i       5   1          1            0      6.7kb          3.3kb
$ for i in `seq 1 8192`; do curl -sS -XPOST localhost:9200/i2/t2/$i -d '{ "f":"v" }' > /dev/null; done
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   i2      5   1       8192            0    394.3kb        190.4kb
green  open   i       5   1          1            0      6.7kb          3.3kb

If you are able to provide a reproduction script that reproduces the issue on a _fresh_ install of Elasticsearch, could you provide it here?

It's definitely not the index shard/replica settings: all three indices I showed previously have the problem despite having different shard/replica settings, and docs.count is always twice the total:

green  open   index-1doc              2   1          2            0     39.9kb         19.9kb
green  open   index-1doc-1s0r         1   0          2            0     19.8kb         19.8kb
green  open   index-10000doc-7s3r     7   3      20000            0     76.3mb           19mb

Index index-1doc: 1 document, 2 shards, 1 replica → docs.count = 2
Index index-1doc-1s0r: 1 document, 1 shard, 0 replicas → docs.count = 2
Index index-10000doc-7s3r: 10000 documents, 7 shards, 3 replicas → docs.count = 20000

The 10-node cluster I'm testing on is a fresh one. These are the global settings I changed on each node:

cluster.name: ****************
node.name: ****************
index.number_of_shards: 2
index.number_of_replicas: 1
path.data: ****************
path.logs: ****************
bootstrap.mlockall: true
network.host: ****************
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ **************** 10 IPs **************** ]
discovery.zen.minimum_master_nodes: 3
gateway.expected_nodes: 10
gateway.expected_master_nodes: 3
gateway.expected_data_nodes: 10
gateway.recover_after_time: 30m
gateway.recover_after_nodes: 5
gateway.recover_after_master_nodes: 2
gateway.recover_after_data_nodes: 5
node.max_local_storage_nodes: 1
action.destructive_requires_name: true
threadpool.bulk.queue_size: 1000
index.merge.scheduler.max_thread_count: 1
index.translog.flush_threshold_size: 1gb
index.search.slowlog.threshold.query.warn  : 10s
index.search.slowlog.threshold.query.info  : 5s
index.search.slowlog.threshold.query.debug : 2s
index.search.slowlog.threshold.query.trace : 500ms
index.search.slowlog.threshold.fetch.warn  : 1s
index.search.slowlog.threshold.fetch.info  : 800ms
index.search.slowlog.threshold.fetch.debug : 500ms
index.search.slowlog.threshold.fetch.trace : 200ms

I tested on two other 3-node clusters with similar settings, and on another single-node server, also with similar settings, and they all showed the same count.

I inserted the document(s) using bulk request, even with a single document.

I inserted the document(s) using bulk request, even with a single document.

This is what we need to see, because I suspect that this is where the issue is.

@jasontedor Why is this issue closed?!

Reopening, I think it was closed by mistake, sorry @hgfischer

@hgfischer we're still waiting for the info that @jasontedor asked for, as we are unable to replicate this issue with the info provided thus far.

This is what we need to see, because I suspect that this is where the issue is

Since you suspect where the issue is, do I still need to build something to reproduce this?

Why is this issue closed?!

Because it does not replicate with the information provided. We are happy to reopen when there is a verified bug.

Since you suspect where the issue is, do I still need to build something to reproduce this?

Yes, and I'm sorry it was not clear, but the issue does not replicate for me so we need to see what you are doing.

To be clear, I also attempted to replicate via bulk requests and the issue does not replicate. Again, starting from a fresh two-node cluster.

$ cat > request
{ "index": { "_index": "i", "_type": "t" } }
{ "f": "v" }
$ for i in `seq 1 8192`; do echo '{ "index": { "_index": "i2", "_type": "t2" } }' >> request; echo '{ "f": "v" }' >> request; done
$ curl -sS -XPOST localhost:9200/_bulk --data-binary "@request" > /dev/null
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   i2      5   1       8192            0    267.4kb        132.2kb
green  open   i       5   1          1            0      6.9kb          3.4kb

Note that I did not pre-assign document IDs because I suspect that whatever is going on involves requests _without_ document IDs being sent to Elasticsearch twice.

I'm preparing a script to replicate the problem.

BTW I'm setting document IDs with UUIDv4.
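As an aside, a sketch of how such client-assigned IDs might be generated (assumes `uuidgen` from util-linux, with the Linux kernel's UUID source as a fallback; the index name `index-1doc` and type `t` are placeholders):

```shell
# Generate a random version 4 UUID for a client-assigned document ID.
id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
# Index one document under that explicit ID
# (the call is skipped gracefully if no local node is listening).
curl -sS -XPUT "localhost:9200/index-1doc/t/$id" -d '{ "f": "v" }' > /dev/null 2>&1 || true
echo "$id"
```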

I'm preparing a script to replicate the problem.

Thank you! It will receive my full attention as soon as it is in hand.

BTW I'm setting document IDs with UUIDv4.

Well, there goes that theory; we'll get to the bottom of it either way. :smile:

@hgfischer Thank you. I will take a very close look at this later tonight.

@hgfischer Thanks for the very thorough and careful reproduction, it should be considered a model for future reproductions. What you're experiencing can be boiled down to the following reproduction, starting from a fresh single-node cluster:

$ curl -sS -XPUT localhost:9200/i -d '
> {
>   "mappings": {
>     "t": {
>       "properties": {
>         "f": {
>           "type": "nested"
>         }
>       }
>     }
>   }
> }' > /dev/null
$ curl -sS -XPOST localhost:9200/i/t/1 -d '
> {
>   "f": { "v": 1 }
> }' > /dev/null
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   i       5   1          2            0      3.5kb          3.5kb

What you're observing here is due to your use of the nested type. From your mapping:

    "mappings": {
      "customer": {
        .
        .
        .
        "additionalData": {
          "type": "nested"

When a document with a field mapped as a nested type is indexed into Elasticsearch, Elasticsearch creates a hidden document for each value of the field that is mapped as a nested type. To be clear about what I mean here, with the same mapping above:

$ curl -sS -XDELETE localhost:9200/i/t/1 > /dev/null
$ curl -sS -XPOST localhost:9200/i/t/1 -d '
> {
>   "f": [ { "v": 1 }, { "v": 2 } ]
> }' > /dev/null
$ curl -XGET localhost:9200/_cat/indices
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   i       5   1          3            0       650b           650b

Note that there are three documents here: the actual document, and the two hidden documents, one for each of the values of the nested field f.

These hidden documents are returned in the counts because Elasticsearch retrieves the store count directly from Lucene (which of course counts these hidden documents as actual documents).
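This also accounts for the exact counts reported earlier. A quick sanity check in shell arithmetic (the per-document nested-value count of 1 is an assumption inferred from the reported numbers):

```shell
# Lucene-level docs.count = root_docs * (1 + nested_values_per_doc),
# assuming every document carries the same number of nested values.
root_docs=10000
nested_values_per_doc=1   # inferred from the reported counts; an assumption
lucene_docs=$((root_docs * (1 + nested_values_per_doc)))
echo "$lucene_docs"   # 20000, matching index-10000doc-7s3r
```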

This is operating as intended.

@jasontedor Thanks for the detailed explanation! Would you please consider adding this info to the _cat/indices docs.count documentation?

What about adding a new column to _cat/indices with the _root_ docs.count, and renaming docs.count to lucene.docs.count maybe?

Thanks

What about adding a new column to _cat/indices with the root docs.count, and renaming docs.count to lucene.docs.count maybe?

That probably isn't going to happen. Those APIs don't need to execute queries to do their thing, instead relying on Lucene APIs that just read metadata. I suspect getting the root count would require executing a query.

Would you please consider adding this info to the _cat/indices docs.count documentation?

Reopening this issue to do just that. Since you've been so good to us I have to offer you first dibs on it - the documentation is in docs/reference/cat/indices.asciidoc if you want to edit it. If not, one of us will do it.

Would you please consider adding this info to the _cat/indices docs.count documentation?

Sure, unless you want to take @nik9000's invitation to submit a PR yourself, I am happy to take it. Let us know either way?

What about adding a new column to _cat/indices with the _root_ docs.count, and renaming docs.count to lucene.docs.count maybe?

I'm hesitant to change this; I think that for the cat indices API, this is doing the right thing: counting the number of documents that are in the index. That is, this API is working at the physical index level and should return the physical count.

Note that you can get the number of root documents (non-hidden) via the cat count API:

$ curl -XGET localhost:9200/_cat/count/i?v
epoch      timestamp count
1459348721 10:38:41  1
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   i       5   1          3            0      3.7kb          3.7kb

This is on the same data as above, with my example of a document having two values for the field mapped as a nested type. This does exactly what @nik9000 suggested: it executes a query to get the count.

Yes, I'll do the PR. I would like to see a CONTRIBUTORS file (like http://golang.org/CONTRIBUTORS) in the project too; can I add it? :)

Regarding the changes on _cat/indices, ok then. I guess the documentation is enough.

Yes, I'll do the PR.

Awesome. :heart:

I would like to see a CONTRIBUTORS file (like http://golang.org/CONTRIBUTORS) in the project too, can I add it? :)

We have contributing guidelines in the main GitHub repo. I think that we should add a pull request template to draw attention to them, though.

Regarding the changes on _cat/indices, ok then. I guess the documentation is enough.

Cool, thanks again!
