Elasticsearch: Search using highlight returns fewer hits

Created on 6 Jan 2021 · 7 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version): AWS ES 7.9, 2 nodes

Plugins installed: none, as far as I know

JVM version (java -version): unknown (the cluster is managed by AWS)

OS version (uname -a if on a Unix-like system): mine is Linux (Ubuntu 20), but ES is hosted on the AWS ES service

Description of the problem including expected versus actual behavior:
I also opened this issue in elasticsearch-js, but I think it is more appropriate to open it here, since it seems ES related rather than client related. I will update both issues (closing or keeping them open) when one gets resolved. Rather than copy-pasting, you can read a detailed description of my problem there.

Any help is appreciated, thanks.

Labels: bug, feedback_needed

All 7 comments

> since I think it's ES related

We need more info to diagnose this as a core Elasticsearch issue. We would need JSON for:
1) The index mapping
2) The example docs
3) The example query

Can you supply examples of the above that reproduce the problem? Kibana's "dev tools" panel is useful for constructing these.

You can find the code in the other issue I linked; I'll copy it here too, plus the mapping:

import { Client } from '@elastic/elasticsearch';

// Client setup (the ES_NODE variable name is illustrative; point it at your cluster).
const esClient = new Client({ node: process.env.ES_NODE });

function getByQuery(q: string): Promise<any> {
  return esClient.search({
    index: process.env.ES_INDEX,
    q, // Lucene query string; see main() below
    body: {
      highlight: {
        fields: {
          text: { pre_tags: ['<b>'], post_tags: ['</b>'] },
        },
      },
    },
  });
}

async function main() {
  try {
    const res = await getByQuery('(text:"foo") AND source:"bar"'); // example query; it also happens with simply text:"foo"
    console.log(res.body.hits.hits.length); // prints 1 <- wtf?
    console.log(res.body.hits.total.value); // prints 6
  } catch (err) {
    console.log(err);
  }
}

The mapping is the default dynamic one; I never changed it and never created it before saving the docs. In the source field I usually save a single word like "example", while in the text field I save very large text, imagine the content of a book:

...
  source: { type: 'text', fields: { keyword: { type: 'keyword', ignore_above: 256 } } },
  text: { type: 'text', fields: { keyword: { type: 'keyword', ignore_above: 256 } } },
...
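
If it helps, here is a quick sketch for confirming what mapping was generated dynamically, using the same esClient as above (it just dumps the mapping JSON):

// Sketch: inspect the dynamically generated mapping (run inside an async function).
const mapping = await esClient.indices.getMapping({ index: process.env.ES_INDEX });
console.log(JSON.stringify(mapping.body, null, 2));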

I wonder if there is a limit where highlight says: "doc is too large, I won't highlight this and won't return the doc, but I count it in total.value"

Like I mentioned in the other issue, if I comment out the highlight object, the query returns the correct number of results.

> I wonder if there is a limit where highlight says: "doc is too large, I won't highlight this and won't return the doc, but I count it in total.value"

There is a limit on how big fields can be before they are highlighted, but hitting it should cause an error on the whole search request (which is not a great experience and something we are working to fix).

Maybe your query is matching docs that don't have a text value at all?
It's hard to speculate. We need your help to reproduce it, and working through your application code is not the best way to do that. Below is a copy of the text that is included when you open a new issue:

> Please include a *minimal* but *complete* recreation of the problem,
> including (e.g.) index creation, mappings, settings, query etc. The easier
> you make it for us to reproduce it, the more likely that somebody will take
> the time to look at it.

I'd need the JSON of the docs and queries to reproduce here. It will get us to a solution faster.

> There is a limit

Good to know there is a limit for highlighting. I found a 1M-character limit in the docs, do you mean this? I'm sorry I didn't see it before; thanks for telling me.
Now I'll try to recreate the problem by playing with docs smaller and larger than 1M characters.

> do you mean this?

The index.highlight.max_analyzed_offset setting is what controls this.
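
For reference, a minimal sketch of raising that limit on an existing index with the JS client (the index name and new value are illustrative; the setting is dynamic, so no reindex is needed):

// Sketch: raise the per-index highlighting limit (defaults to 1000000).
await esClient.indices.putSettings({
  index: 'estest',
  body: {
    index: { 'highlight.max_analyzed_offset': 2000000 },
  },
});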

You may want to consider breaking up your large books into separate Elasticsearch documents rather than upping this limit (it's there for a reason). Your highlighting should be quicker with smaller documents, plus there's some neat stuff you can do with documents that are fragments of books. A sketch of that chunking approach follows.
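
A minimal sketch of what that could look like (the book_id/part field names and the chunk size are illustrative, not a prescribed schema):

// Sketch: index a book as fixed-size chunks, each safely under
// index.highlight.max_analyzed_offset.
async function indexBookInChunks(index: string, bookId: string, fullText: string) {
  const chunkSize = 500000; // well below the 1M default limit
  for (let i = 0, part = 0; i < fullText.length; i += chunkSize, part++) {
    await esClient.index({
      index,
      body: { book_id: bookId, part, text: fullText.slice(i, i + chunkSize) },
    });
  }
}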

OK, I reproduced it and found the 1M error message we were talking about before, so that's why it's returning fewer results than expected:

{
  "took": 46,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 0,
        "index": "estest",
        "node": "RoJxzBNMS-KosDqot8Rp_w",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "The length of [text] field of [VSrs13YBm8yxvDKoSpiX] doc of [estest] index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!"
        }
      }
    ]
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.632142,
    "hits": [
      {
        "_shard": "[estest][1]",
        "_node": "RoJxzBNMS-KosDqot8Rp_w",
        "_index": "estest",
        "_type": "_doc",
        "_id": "gLbs13YB5DddV4n-W-xU",
        "_score": 0.53413403,
        "_source": {
          "text": "<very long text here>"
        },
        "highlight": {
          "text": "Array of 5 (correct)"
        }
      }
    ]
  }
}

I don't know if this is the correct behaviour (probably yes), but the search request does not fail "as a whole" (maybe that's even better), as you stated above:

> this should cause an error on the whole search request
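
In client code, one way to surface these partial failures instead of silently getting fewer hits is to check the _shards section of the response (a sketch building on the getByQuery function above):

// Sketch: warn when a shard failed during the fetch phase, since
// hits.total.value still counts matches from the failed shard(s).
const res = await getByQuery('text:"foo"');
if (res.body._shards.failed > 0) {
  console.warn('Partial shard failures:', JSON.stringify(res.body._shards.failures));
}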

If you want to reproduce it, at least on AWS ES v7.9 (the free tier offers 1 ES instance), it's really easy:

Given any 2 docs like the following (the content doesn't matter, only the length):

{
  "text": "<more than 1 million chars here>"
}
{
  "text": "<less than 1 million chars here>"
}
  1. Add these 2 docs to a new index with no explicit mapping
  2. Perform a search query with highlight on the text field (the query doesn't matter, only the highlight), e.g. text:"anything" - see the scripted version of these steps below
  3. The response above is returned.
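
A sketch of those steps with the JS client (the index name matches the response above; the padding string is arbitrary - the long doc just has to match the query and exceed 1M analyzed characters):

// Sketch: reproduce the partial highlight failure (run inside an async function).
const index = 'estest';
await esClient.index({ index, body: { text: 'anything ' + 'pad '.repeat(300000) } }); // > 1M chars
await esClient.index({ index, body: { text: 'anything short' } });                    // < 1M chars
await esClient.indices.refresh({ index });

const res = await esClient.search({
  index,
  q: 'text:"anything"',
  body: { highlight: { fields: { text: {} } } },
});
console.log(res.body.hits.total.value); // 2 - both docs match
console.log(res.body.hits.hits.length); // likely 1 on a multi-shard index: the oversized doc's shard fails in the fetch phase
console.log(res.body._shards.failures); // the max_analyzed_offset error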

If this is not the intended behaviour, I'm available to help further.

For now, to solve my problem, I split the 1M+ docs into multiple smaller docs as you suggested, and everything is working fine.

> the search request does not fail "as a whole" (maybe that's even better)

Ah - maybe it's only when there's a single index/shard that it fails the whole request (that was the scenario when I encountered it).

The total number of hits is calculated in the query phase, but the selected results are grabbed in a subsequent "fetch" phase - which, as you encountered, can have partial failures, leading to the discrepancy.

It sounds like it's working as designed in this scenario, and you've moved to a better doc design anyhow, so I'll close this.

Thanks for reporting.
