Elasticsearch: ES 2.2.1 strange document count in _cat/indices

Created on 24 Mar 2016 · 20 comments · Source: elastic/elasticsearch

Elasticsearch version: 2.2.1
JVM version: Oracle Java 8 (1.8.0_74)
OS version: Ubuntu Server 14.04.4 LTS
Description of the problem including expected versus actual behavior: /_cat/indices reports twice the number of docs (docs.count column) that actually exist in the index.

curl -XGET 'http://127.0.0.1:9200/_cat/indices?v'
health status index                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index-1doc              2   1          2            0     39.9kb         19.9kb
green  open   index-1doc-1s0r         1   0          2            0     19.8kb         19.8kb
green  open   index-10000doc-7s3r     7   3      20000            0     76.3mb           19mb

Steps to reproduce:

  1. Feed a new index with N documents using bulk insert
  2. Query /_cat/indices and check docs.count column
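The two steps above can be sketched as a small shell script (the index name `index-test`, type `t`, and document count are placeholders, and a local node on port 9200 is assumed):

```shell
#!/bin/sh
# Build a bulk request body: one action line plus one source line per document.
N=1000
: > request
i=1
while [ "$i" -le "$N" ]; do
  printf '{ "index": { "_index": "index-test", "_type": "t" } }\n' >> request
  printf '{ "f": "v" }\n' >> request
  i=$((i + 1))
done
# Send the bulk body, then inspect the docs.count column
# (these calls are skipped gracefully if no local node is listening).
curl -sS -XPOST localhost:9200/_bulk --data-binary "@request" > /dev/null 2>&1 || true
curl -sS -XGET 'localhost:9200/_cat/indices?v' 2>/dev/null || true
```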
Labels: >docs, help wanted

All 20 comments

If I had to guess, that is the total number of documents across all shards. IIRC _cat/indices has done this for a long time, presumably because it is an index-metadata-level thing. I agree it is confusing, though.

If I had to guess, that is the total number of documents across all shards.

I don't think that's what it does; we only sum up the documents over the primary shards. See RestIndicesAction.

This does not replicate for me. On a fresh two-node cluster (to ensure that replicas are allocated):

$ curl -XPOST localhost:9200/i/t/1?pretty=1 -d '{ "f":"v" }'
{
  "_index" : "i",
  "_type" : "t",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   i       5   1          1            0      6.7kb          3.3kb
$ for i in `seq 1 8192`; do curl -sS -XPOST localhost:9200/i2/t2/$i -d '{ "f":"v" }' > /dev/null; done
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   i2      5   1       8192            0    394.3kb        190.4kb
green  open   i       5   1          1            0      6.7kb          3.3kb

If you are able to provide a reproduction script that reproduces the issue on a _fresh_ install of Elasticsearch, could you provide it here?

It's definitely not the index shard/replica settings: all three indices I showed previously have the problem despite having different shard/replica settings, and docs.count is always twice the total:

green  open   index-1doc              2   1          2            0     39.9kb         19.9kb
green  open   index-1doc-1s0r         1   0          2            0     19.8kb         19.8kb
green  open   index-10000doc-7s3r     7   3      20000            0     76.3mb           19mb

Index index-1doc: 1 document, 2 shards, 1 replica → docs.count = 2
Index index-1doc-1s0r: 1 document, 1 shard, 0 replicas → docs.count = 2
Index index-10000doc-7s3r: 10000 documents, 7 shards, 3 replicas → docs.count = 20000

The 10-node cluster I'm testing on is a fresh one. These are the global settings I changed on each node:

cluster.name: ****************
node.name: ****************
index.number_of_shards: 2
index.number_of_replicas: 1
path.data: ****************
path.logs: ****************
bootstrap.mlockall: true
network.host: ****************
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ **************** 10 IPs **************** ]
discovery.zen.minimum_master_nodes: 3
gateway.expected_nodes: 10
gateway.expected_master_nodes: 3
gateway.expected_data_nodes: 10
gateway.recover_after_time: 30m
gateway.recover_after_nodes: 5
gateway.recover_after_master_nodes: 2
gateway.recover_after_data_nodes: 5
node.max_local_storage_nodes: 1
action.destructive_requires_name: true
threadpool.bulk.queue_size: 1000
index.merge.scheduler.max_thread_count: 1
index.translog.flush_threshold_size: 1gb
index.search.slowlog.threshold.query.warn  : 10s
index.search.slowlog.threshold.query.info  : 5s
index.search.slowlog.threshold.query.debug : 2s
index.search.slowlog.threshold.query.trace : 500ms
index.search.slowlog.threshold.fetch.warn  : 1s
index.search.slowlog.threshold.fetch.info  : 800ms
index.search.slowlog.threshold.fetch.debug : 500ms
index.search.slowlog.threshold.fetch.trace : 200ms

I tested on two other 3-node clusters with similar settings, and on another single-node server, also with similar settings, and they all showed the same count.

I inserted the document(s) using bulk request, even with a single document.

I inserted the document(s) using bulk request, even with a single document.

This is what we need to see, because I suspect that this is where the issue is.

@jasontedor Why is this issue closed?!

Reopening, I think it was closed by mistake, sorry @hgfischer

@hgfischer we're still waiting for the info that @jasontedor asked for, as we are unable to replicate this issue with the info provided thus far.

This is what we need to see, because I suspect that this is where the issue is

Since you suspect where the issue is, do I still need to build something to reproduce this?

Why is this issue closed?!

Because it does not replicate with the information provided. We are happy to reopen when there is a verified bug.

Since you suspect where the issue is, do I still need to build something to reproduce this?

Yes, and I'm sorry it was not clear, but the issue does not replicate for me so we need to see what you are doing.

To be clear, I also attempted to replicate via bulk requests and the issue does not replicate. Again, starting from a fresh two-node cluster.

$ cat > request
{ "index": { "_index": "i", "_type": "t" } }
{ "f": "v" }
$ for i in `seq 1 8192`; do echo '{ "index": { "_index": "i2", "_type": "t2" } }' >> request; echo '{ "f": "v" }' >> request; done
$ curl -sS -XPOST localhost:9200/_bulk --data-binary "@request" > /dev/null
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   i2      5   1       8192            0    267.4kb        132.2kb
green  open   i       5   1          1            0      6.9kb          3.4kb

Note that I did not pre-assign document IDs because I suspect that whatever is going on involves requests _without_ document IDs being sent to Elasticsearch twice.

I'm preparing a script to replicate the problem.

BTW I'm setting document IDs with UUIDv4.
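As an aside, a sketch of how such client-assigned IDs might be generated (assumes `uuidgen` from util-linux, with the Linux kernel's UUID source as a fallback; the index name `index-1doc` and type `t` are placeholders):

```shell
# Generate a random version 4 UUID for a client-assigned document ID.
id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
# Index one document under that explicit ID
# (the call is skipped gracefully if no local node is listening).
curl -sS -XPUT "localhost:9200/index-1doc/t/$id" -d '{ "f": "v" }' > /dev/null 2>&1 || true
echo "$id"
```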

I'm preparing a script to replicate the problem.

Thank you! It will receive my full attention as soon as it is in hand.

BTW I'm setting document IDs with UUIDv4.

Well, there goes that theory; we'll get to the bottom of it either way. :smile:

@hgfischer Thank you. I will take a very close look at this later tonight.

@hgfischer Thanks for the very thorough and careful reproduction, it should be considered a model for future reproductions. What you're experiencing can be boiled down to the following reproduction, starting from a fresh single-node cluster:

$ curl -sS -XPUT localhost:9200/i -d '
> {
>   "mappings": {
>     "t": {
>       "properties": {
>         "f": {
>           "type": "nested"
>         }
>       }
>     }
>   }
> }' > /dev/null
$ curl -sS -XPOST localhost:9200/i/t/1 -d '
> {
>   "f": { "v": 1 }
> }' > /dev/null
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   i       5   1          2            0      3.5kb          3.5kb

What you're observing here is due to your use of the nested type. From your mapping:

    "mappings": {
      "customer": {
        .
        .
        .
        "additionalData": {
          "type": "nested"

When a document with a field mapped as a nested type is indexed into Elasticsearch, Elasticsearch creates a hidden document for each value of the field that is mapped as a nested type. To be clear about what I mean here, with the same mapping above:

$ curl -sS -XDELETE localhost:9200/i/t/1 > /dev/null
$ curl -sS -XPOST localhost:9200/i/t/1 -d '
> {
>   "f": [ { "v": 1 }, { "v": 2 } ]
> }' > /dev/null
$ curl -XGET localhost:9200/_cat/indices
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   i       5   1          3            0       650b           650b

Note that there are three documents here: the actual document, and the two hidden documents, one for each of the values of the nested field f.

These hidden documents are returned in the counts because Elasticsearch retrieves the store count directly from Lucene (which of course counts these hidden documents as actual documents).
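This also accounts for the exact counts reported earlier. A quick sanity check in shell arithmetic (the per-document nested-value count of 1 is an assumption inferred from the reported numbers):

```shell
# Lucene-level docs.count = root_docs * (1 + nested_values_per_doc),
# assuming every document carries the same number of nested values.
root_docs=10000
nested_values_per_doc=1   # inferred from the reported counts; an assumption
lucene_docs=$((root_docs * (1 + nested_values_per_doc)))
echo "$lucene_docs"   # 20000, matching index-10000doc-7s3r
```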

This is operating as intended.

@jasontedor Thanks for the detailed explanation! Would you please consider adding this info to the _cat/indices docs.count documentation?

What about adding a new column to _cat/indices with the _root_ docs.count, and renaming docs.count to lucene.docs.count maybe?

Thanks

What about adding a new column to _cat/indices with the root docs.count, and renaming docs.count to lucene.docs.count maybe?

That probably isn't going to happen. Those APIs don't need to execute queries to do their thing, instead relying on Lucene APIs that just read metadata. I suspect getting the root count would require executing a query.

Would you please consider adding this info to the _cat/indices docs.count documentation?

Reopening this issue to do just that. Since you've been so good to us I have to offer you first dibs on it - the documentation is in docs/reference/cat/indices.asciidoc if you want to edit it. If not, one of us will do it.

Would you please consider adding this info to the _cat/indices docs.count documentation?

Sure, unless you want to take @nik9000's invitation to submit a PR yourself, I am happy to take it. Let us know either way?

What about adding a new column to _cat/indices with the _root_ docs.count, and renaming docs.count to lucene.docs.count maybe?

I'm hesitant to change this; I think that for the cat indices API, this is doing the right thing: counting the number of documents that are in the index. That is, this API is working at the physical index level and should return the physical count.

Note that you can get the number of root documents (non-hidden) via the cat count API:

$ curl -XGET localhost:9200/_cat/count/i?v
epoch      timestamp count
1459348721 10:38:41  1
$ curl -XGET localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open   i       5   1          3            0      3.7kb          3.7kb

This is on the same data as above, with my example of a document having two values for the field mapped as a nested type. This does exactly what @nik9000 suggested: it executes a query to get the count.

Yes, I'll do the PR. I would like to see a CONTRIBUTORS file (like http://golang.org/CONTRIBUTORS) in the project too; can I add it? :)

Regarding the changes on _cat/indices, ok then. I guess the documentation is enough.

Yes, I'll do the PR.

Awesome. :heart:

I would like to see a CONTRIBUTORS file (like http://golang.org/CONTRIBUTORS) in the project too, can I add it? :)

We have contributing guidelines in the main GitHub repo. I think that we should add a pull request template to draw attention to them, though.

Regarding the changes on _cat/indices, ok then. I guess the documentation is enough.

Cool, thanks again!
