I have a document whose title is `Women's Funny T-Shirt (#wine (Hashtag Tee Shirt) Drinking) Ladies Shirt`.
The analyze API shows there are 11 terms in the title field:
GET /products/_analyze
{
"field": "custom.title",
"text": "Women's Funny T-Shirt (#wine (Hashtag Tee Shirt) Drinking) Ladies Shirt"
}
```json
{
"tokens": [
{
"token": "women's",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "funny",
"start_offset": 8,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "t",
"start_offset": 14,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "shirt",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "wine",
"start_offset": 24,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "hashtag",
"start_offset": 30,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "tee",
"start_offset": 38,
"end_offset": 41,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "shirt",
"start_offset": 42,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "drinking",
"start_offset": 49,
"end_offset": 57,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "ladies",
"start_offset": 59,
"end_offset": 65,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "shirt",
"start_offset": 66,
"end_offset": 71,
"type": "<ALPHANUM>",
"position": 10
}
]
}
```
But the explain API says that the `title` field has a fieldLength of 16:
GET /products/custom/1971/_explain
{
"query": {
"bool": {
"should": {
"match": {
"title": {
"query": "women t shirt"
}
}
}
}
}
}
```json
{
"_index": "products",
"_type": "custom",
"_id": "1971",
"matched": true,
"explanation": {
"value": 5.2311754,
"description": "sum of:",
"details": [
{
"value": 5.2311754,
"description": "sum of:",
"details": [
{
"value": 2.0648851,
"description": "weight(title:t in 278491) [PerFieldSimilarity], result of:",
"details": [
{
"value": 2.0648851,
"description": "score(doc=278491,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 2.2896154,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 446517,
"description": "docFreq",
"details": []
},
{
"value": 4407636,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.901848,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 12.637836,
"description": "avgFieldLength",
"details": []
},
{
"value": 16,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 3.1662903,
"description": "weight(title:shirt in 278491) [PerFieldSimilarity], result of:",
"details": [
{
"value": 3.1662903,
"description": "score(doc=278491,freq=3.0 = termFreq=3.0\n), product of:",
"details": [
{
"value": 2.1297789,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 523907,
"description": "docFreq",
"details": []
},
{
"value": 4407636,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.4866756,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 3,
"description": "termFreq=3.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 12.637836,
"description": "avgFieldLength",
"details": []
},
{
"value": 16,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "_type:amazon, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
```
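For reference, plugging the values from that explanation back into the BM25 formulas reproduces both term scores exactly, so the similarity really is using fieldLength = 16 rather than the 11 terms the analyzer produces. A quick check with the reported numbers:

```python
import math

# Reproduce the two term scores from the explain output above,
# using only the values it reports (fieldLength = 16, not 11).
k1, b = 1.2, 0.75
field_length, avg_field_length = 16, 12.637836
doc_count = 4407636

def idf(doc_freq):
    # idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(freq):
    # tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

score_t = idf(446517) * tf_norm(1)      # ~2.0648851  (title:t)
score_shirt = idf(523907) * tf_norm(3)  # ~3.1662903  (title:shirt)
print(score_t, score_shirt, score_t + score_shirt)  # sum ~5.2311754
```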
@Corei13 The field length is stored as a single byte (in fact, as 7 bits currently, but this is going to change), so it is inaccurate by design: it needs to represent numbers much bigger than 128.
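To make the rounding concrete, here is a rough sketch of how that single-byte encoding behaves. It is a Python re-implementation for illustration only, assuming an index-time boost of 1 and the pre-LUCENE-7730 scheme in which boost / sqrt(fieldLength) is squeezed through SmallFloat.floatToByte315; it is not the actual Lucene code:

```python
import math
import struct

def float_to_byte315(f):
    # Rough Python port of Lucene's SmallFloat.floatToByte315
    # (3 mantissa bits, zero exponent 15) -- for illustration only.
    fzero = (63 - 15) << 3
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> (24 - 3)
    if smallfloat <= fzero:
        return 0 if bits <= 0 else 1
    if smallfloat >= fzero + 0x100:
        return 255
    return smallfloat - fzero

def byte315_to_float(b):
    # Inverse mapping: each byte decodes to one representative float.
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>i', bits))[0]

def explained_field_length(actual_length, boost=1.0):
    # Encode boost / sqrt(length) into one byte, decode it again, and turn
    # it back into a length -- the value the explain API then reports.
    f = byte315_to_float(float_to_byte315(boost / math.sqrt(actual_length)))
    return 1.0 / (f * f)

for n in (10, 11, 12, 16, 17):
    print(n, explained_field_length(n))
# 10 -> 10.24, 11..16 -> 16.0, 17 -> ~20.9
```

Under this scheme every length from 11 to 16 decodes back to 16, which is exactly the fieldLength reported in the explain output above.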
@clintongormley Thanks for the explanation. I have a couple more questions.
A bit of context: I'm training a neural net to learn how to rank. It tries to learn optimal values for k1, b, and the boost for each field. Without knowing how ES computes fieldLength, I'm not able to generate a correct dataset.
Thanks in advance!
@Corei13

> What's the approximate margin of error?

You can see what it is currently here: https://issues.apache.org/jira/browse/LUCENE-6819 and the proposed new encoding here: https://issues.apache.org/jira/browse/LUCENE-7730

> Is there any way to make Elasticsearch store the correct field length by sacrificing more storage?

No, it's hard-coded in Lucene.
Thank you @clintongormley.
I would like to follow up on whether this has changed in Elasticsearch 6.2: is it still using 7 bits to store the field length?
I have created a new index in Elasticsearch 7.3.2 (`"lucene_version": "8.1.0"`) with the following analysis settings:
"skillset_synonyms_filter": {
"type": "synonym_graph",
"synonyms": [
"c-sharp,c sharp",
".net,mvc,core, asp.net,winservice",
"sql,structured query language,dba,database,backend",
"javascript,js"
]
},
"my_stopword_filter": {
"type": "stop",
"stopwords": [
"a","an","and","are","as","at","be","but","by","for","if","in","into","is","it","no","not","of","on","or",
"such","that","the","their","then","there","these","they","this","to","was","will","with"
]
},
"analyzer": {
"my_synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"skillset_synonyms_filter",
"my_stopword_filter"
]
}
}
I have also added the following documents to the index:
1. skills: .NET,MVC,SQL,WEB API
2. skills: .NET,MVC,SQL,WEB API
3. skills: javascript
4. skills: java
5. skills: SQL
6. skills: .NET,.NET MVC,.NET Core,SQL,WEB API
My query to fetch the records looks like this:
{
"explain": "true",
"query": {
"bool": {
"must": [
{
"match": {
"skills": {
"query": "sql"
}
}
}
]
}
}
}
I am getting the correct result set, but I don't understand how tf is calculated: the explanation reports a length of field (dl) of 3 and an average length of field (avgdl) of 13.666667, which is confusing. Could anyone please help me with this?
Here are the details:
```json
{
"value": 0.66775244,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1,
"description": "phraseFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 3,
"description": "dl, length of field",
"details": []
},
{
"value": 13.666667,
"description": "avgdl, average length of field",
"details": []
}
]
}
```
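For what it's worth, plugging the reported values back into the tf formula does reproduce 0.66775244, so the formula itself checks out; what I don't understand is where dl = 3 comes from. A quick check:

```python
# Recompute tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
# with the values reported in the explanation above.
k1, b = 1.2, 0.75
freq, dl, avgdl = 1.0, 3, 13.666667
print(freq / (freq + k1 * (1 - b + b * dl / avgdl)))  # ~0.66775244
```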
Thanks.
So what is dl? It is not the length (in characters) of the field I'm searching on.