Elasticsearch version (bin/elasticsearch --version): AWS Elasticsearch Service 7.9, 2 nodes
Plugins installed: none, as far as I know
JVM version (java -version): unknown (managed by AWS)
OS version (uname -a if on a Unix-like system): my machine runs Ubuntu 20.04, but ES is hosted on the AWS Elasticsearch Service
Description of the problem including expected versus actual behavior:
I also opened this issue in elasticsearch-js, but I think it's more appropriate to open it here, since I think it's ES related and not client related. I will close one of the two issues once either gets resolved. Rather than copy-pasting everything, you can read a detailed description of my problem there.
Any help is appreciated, thanks.
since I think it's ES related
We need more info to diagnose this as a core Elasticsearch issue. We would need the JSON for:
1) The index mapping
2) The example docs
3) The example query
Can you supply examples of the above that reproduce the problem? Kibana's "dev tools" panel is useful for constructing these.
You can find the code in the other issue I linked; I'll copy it here too, plus the mapping:
import { Client } from '@elastic/elasticsearch';

// client setup elided in the original snippet; endpoint is a placeholder
const esClient = new Client({ node: process.env.ES_ENDPOINT });

function getByQuery(q: string): Promise<any> {
  return esClient.search({
    index: process.env.ES_INDEX,
    q, // Lucene query string, see below
    body: {
      highlight: {
        fields: {
          text: { pre_tags: ['<b>'], post_tags: ['</b>'] },
        },
      },
    },
  });
}
async function main() {
  try {
    // example query, but it happens also with simply: text:"foo"
    const res = await getByQuery('(text:"foo") AND source:"bar"');
    console.log(res.body.hits.hits.length); // prints 1 <- wtf?
    console.log(res.body.hits.total.value); // prints 6
  } catch (err) {
    console.log(err);
  }
}
The mapping is the default (dynamic) one; I never changed it or created it explicitly before saving the docs. In the source field I usually save a single word like "example", while in the text field I save very large text, think of a book's content:
...
source: { type: 'text', fields: { keyword: { type: 'keyword', ignore_above: 256 } } },
text: { type: 'text', fields: { keyword: { type: 'keyword', ignore_above: 256 } } },
...
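If it helps, this is roughly how the mapping can be fetched from the client to double-check (a sketch, reusing the same esClient and index as above):

// Sketch: print the current mapping to confirm the dynamic defaults shown above.
async function printMapping(): Promise<void> {
  const res = await esClient.indices.getMapping({ index: process.env.ES_INDEX });
  console.log(JSON.stringify(res.body, null, 2));
}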
I wonder if there is a limit where highlight says: "doc is too large, I won't highlight this and won't return the doc, but I count it in total.value"
Like I mentioned in the other issue, if I comment out the highlight object in the snippet above, the query returns the correct number of results.
I wonder if there is a limit where highlight says: "doc is too large, I won't highlight this and won't return the doc, but I count it in total.value"
There is a limit on how big fields can be before they are highlighted, but this should cause an error on the whole search request (which is not a great experience and something we are working to fix).
Maybe your query is matching docs that don't have a text value at all?
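A quick way to rule that out is to count docs with no text value at all, e.g. with something like this sketch against your client code above:

// Sketch: count documents that have no `text` field; if > 0, some matching docs
// may simply have nothing to highlight.
async function countDocsMissingText(): Promise<void> {
  const res = await esClient.count({
    index: process.env.ES_INDEX,
    body: { query: { bool: { must_not: { exists: { field: 'text' } } } } },
  });
  console.log(res.body.count);
}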
It's hard to speculate. We need your help to reproduce this, and working through your application code is not the best way to do it. Below is a copy of the text that is included when you open a new issue:
Please include a *minimal* but *complete* recreation of the problem,
including (e.g.) index creation, mappings, settings, query etc. The easier
you make for us to reproduce it, the more likely that somebody will take the
time to look at it.
I'd need the JSON of the docs and queries to reproduce here. It will get us to a solution faster.
There is a limit
Good to know there is a limit for highlighting. I found a 1M-character limit in the docs, do you mean this? I'm sorry I didn't see it before, thanks for telling me.
Now I'll try to recreate the problem, playing with docs smaller and larger than 1M chars.
do you mean this?
The index.highlight.max_analyzed_offset setting is what controls this.
You may want to consider breaking up your large books into separate Elasticsearch documents rather than upping this limit (it's there for a reason). Your highlighting should be quicker with smaller documents, plus there's some neat stuff you can do with documents that are fragments of books.
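If you do decide to raise the limit instead, it's a dynamic per-index setting, so something along these lines should work (a sketch against the JS client; the value is just an example):

// Sketch: raise the per-index highlighting limit (use with care, the default exists for a reason).
async function raiseHighlightLimit(): Promise<void> {
  await esClient.indices.putSettings({
    index: process.env.ES_INDEX,
    body: { 'index.highlight.max_analyzed_offset': 2000000 }, // example value
  });
}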
Ok, I reproduced it and I found the 1M error message we were talking about before, so that's why it's returning fewer results than expected:
{
  "took": 46,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 0,
        "index": "estest",
        "node": "RoJxzBNMS-KosDqot8Rp_w",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "The length of [text] field of [VSrs13YBm8yxvDKoSpiX] doc of [estest] index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!"
        }
      }
    ]
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.632142,
    "hits": [
      {
        "_shard": "[estest][1]",
        "_node": "RoJxzBNMS-KosDqot8Rp_w",
        "_index": "estest",
        "_type": "_doc",
        "_id": "gLbs13YB5DddV4n-W-xU",
        "_score": 0.53413403,
        "_source": {
          "text": "<very long text here>"
        },
        "highlight": {
          "text": "Array of 5 (correct)"
        }
      }
    ]
  }
}
I don't know if this is the correct behaviour, probably yes, but the search request does not fail "as a whole" (which is maybe even better), unlike what you stated above:
this should cause an error on the whole search request
If you want to reproduce it, at least on AWS ES v7.9 (the free tier offers 1 ES instance), it's really easy:
Given any 2 docs like the following (the content doesn't matter, only the length):
{
  "text": "<more than 1 million chars here>"
}
{
  "text": "<less than 1 million chars here>"
}
then highlight on the text field with a query that matches both, e.g. text:"anything", as in the sketch below. If this is not the intended behaviour, I'm available to help further.
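In client code, the full reproduction looks roughly like this (index name and filler text are illustrative):

// Rough reproduction sketch: two docs on either side of the 1M-char default limit,
// then a highlighted search against them.
async function reproduce(): Promise<void> {
  const index = 'estest';
  await esClient.indices.create({ index }, { ignore: [400] }); // 400 if it already exists
  await esClient.index({ index, body: { text: 'anything '.repeat(150000) } }); // > 1M chars
  await esClient.index({ index, body: { text: 'anything '.repeat(10000) } });  // < 1M chars
  await esClient.indices.refresh({ index });

  const res = await esClient.search({
    index,
    q: 'text:"anything"',
    body: { highlight: { fields: { text: {} } } },
  });
  console.log(res.body.hits.total.value); // 2
  console.log(res.body.hits.hits.length); // 1, plus a failure under _shards.failures
}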
For now, to solve my problem, I split the 1M docs into multiple smaller docs as you suggested, and everything is working fine.
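For reference, the split is nothing fancy, roughly like this (the chunk size is arbitrary, just well under the limit):

// Sketch: index one large text as several smaller documents via the bulk API.
const CHUNK_SIZE = 500000; // arbitrary, well under the 1M-char highlighting limit

async function indexInChunks(source: string, text: string): Promise<void> {
  const body: any[] = [];
  for (let i = 0; i < text.length; i += CHUNK_SIZE) {
    body.push({ index: { _index: process.env.ES_INDEX } });
    body.push({ source, text: text.slice(i, i + CHUNK_SIZE), chunk: i / CHUNK_SIZE });
  }
  await esClient.bulk({ body });
}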
the search request does not fail "as a whole" (which is maybe even better), unlike what you stated above:
Ah - maybe it's only when there's a single index/shard that it fails the whole request (that was the scenario when I encountered it).
The total number of hits is calculated in the query phase, but the selected results are grabbed in a subsequent "fetch" phase, which, as you encountered, can have partial failures, leading to the discrepancy.
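If you want to surface that in your application, checking the shard header on each response is usually enough; a sketch using the client code from above:

// Sketch: flag partially failed searches instead of silently working with fewer hits.
async function searchStrict(q: string): Promise<any> {
  const res = await esClient.search({ index: process.env.ES_INDEX, q });
  if (res.body._shards.failed > 0) {
    // each failure carries the shard, index and reason, as in the response pasted above
    console.warn('partial results:', JSON.stringify(res.body._shards.failures));
  }
  return res;
}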
It sounds like it's working as designed in this scenario and you've moved to a better doc design anyhow so I'll close this.
Thanks for reporting.