I have a document whose title is `Women's Funny T-Shirt (#wine (Hashtag Tee Shirt) Drinking) Ladies Shirt`.
The analyze API shows there are 11 terms in the title field:
GET /products/_analyze
{
"field": "custom.title",
"text": "Women's Funny T-Shirt (#wine (Hashtag Tee Shirt) Drinking) Ladies Shirt"
}
```json
{
"tokens": [
{
"token": "women's",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "funny",
"start_offset": 8,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "t",
"start_offset": 14,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "shirt",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "wine",
"start_offset": 24,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "hashtag",
"start_offset": 30,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "tee",
"start_offset": 38,
"end_offset": 41,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "shirt",
"start_offset": 42,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "drinking",
"start_offset": 49,
"end_offset": 57,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "ladies",
"start_offset": 59,
"end_offset": 65,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "shirt",
"start_offset": 66,
"end_offset": 71,
"type": "<ALPHANUM>",
"position": 10
}
]
}
```
But the explain API says that the `title` field has a fieldLength of 16:
GET /products/custom/1971/_explain
{
"query": {
"bool": {
"should": {
"match": {
"title": {
"query": "women t shirt"
}
}
}
}
}
}
```json
{
"_index": "products",
"_type": "custom",
"_id": "1971",
"matched": true,
"explanation": {
"value": 5.2311754,
"description": "sum of:",
"details": [
{
"value": 5.2311754,
"description": "sum of:",
"details": [
{
"value": 2.0648851,
"description": "weight(title:t in 278491) [PerFieldSimilarity], result of:",
"details": [
{
"value": 2.0648851,
"description": "score(doc=278491,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 2.2896154,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 446517,
"description": "docFreq",
"details": []
},
{
"value": 4407636,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.901848,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 12.637836,
"description": "avgFieldLength",
"details": []
},
{
"value": 16,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 3.1662903,
"description": "weight(title:shirt in 278491) [PerFieldSimilarity], result of:",
"details": [
{
"value": 3.1662903,
"description": "score(doc=278491,freq=3.0 = termFreq=3.0\n), product of:",
"details": [
{
"value": 2.1297789,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 523907,
"description": "docFreq",
"details": []
},
{
"value": 4407636,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.4866756,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 3,
"description": "termFreq=3.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 12.637836,
"description": "avgFieldLength",
"details": []
},
{
"value": 16,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "_type:amazon, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
```
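For reference, plugging the values from that explanation back into the BM25 formulas reproduces both term scores exactly, so the similarity really is using fieldLength = 16 rather than the 11 terms the analyzer produces. A quick check with the reported numbers:

```python
import math

# Reproduce the two term scores from the explain output above,
# using only the values it reports (fieldLength = 16, not 11).
k1, b = 1.2, 0.75
field_length, avg_field_length = 16, 12.637836
doc_count = 4407636

def idf(doc_freq):
    # idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(freq):
    # tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

score_t = idf(446517) * tf_norm(1)      # ~2.0648851  (title:t)
score_shirt = idf(523907) * tf_norm(3)  # ~3.1662903  (title:shirt)
print(score_t, score_shirt, score_t + score_shirt)  # sum ~5.2311754
```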
@Corei13 The field length is stored as a single byte (in fact, as 7 bits currently, but this is going to change), so it is inaccurate by design: it needs to represent numbers much bigger than 128.
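To make the rounding concrete, here is a rough sketch of how that single-byte encoding behaves. It is a Python re-implementation for illustration only, assuming an index-time boost of 1 and the pre-LUCENE-7730 scheme in which boost / sqrt(fieldLength) is squeezed through SmallFloat.floatToByte315; it is not the actual Lucene code:

```python
import math
import struct

def float_to_byte315(f):
    # Rough Python port of Lucene's SmallFloat.floatToByte315
    # (3 mantissa bits, zero exponent 15) -- for illustration only.
    fzero = (63 - 15) << 3
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> (24 - 3)
    if smallfloat <= fzero:
        return 0 if bits <= 0 else 1
    if smallfloat >= fzero + 0x100:
        return 255
    return smallfloat - fzero

def byte315_to_float(b):
    # Inverse mapping: each byte decodes to one representative float.
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>i', bits))[0]

def explained_field_length(actual_length, boost=1.0):
    # Encode boost / sqrt(length) into one byte, decode it again, and turn
    # it back into a length -- the value the explain API then reports.
    f = byte315_to_float(float_to_byte315(boost / math.sqrt(actual_length)))
    return 1.0 / (f * f)

for n in (10, 11, 12, 16, 17):
    print(n, explained_field_length(n))
# 10 -> 10.24, 11..16 -> 16.0, 17 -> ~20.9
```

Under this scheme every length from 11 to 16 decodes back to 16, which is exactly the fieldLength reported in the explain output above.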
@clintongormley Thanks for the explanation. I have a couple more questions.
A bit of context: I'm training a neural net to learn how to rank. It tries to learn optimal values for k1, b, and the boost for each field. Without knowing how ES computes fieldLength, I'm not able to generate a correct dataset.
Thanks in advance!
@Corei13

> What's the approximate margin of error?

You can see what it is currently here: https://issues.apache.org/jira/browse/LUCENE-6819 and the proposed new encoding here: https://issues.apache.org/jira/browse/LUCENE-7730

> Is there any way to make Elasticsearch store the correct field length by sacrificing more storage?

No, it's hard-coded in Lucene.
Thank you @clintongormley.
I would like to follow up on whether this has changed in Elasticsearch 6.2: is it still using 7 bits to store the field length?
I have created a new index in Elasticsearch 7.3.2 (`"lucene_version": "8.1.0"`) with the following analysis settings:
"skillset_synonyms_filter": {
"type": "synonym_graph",
"synonyms": [
"c-sharp,c sharp",
".net,mvc,core, asp.net,winservice",
"sql,structured query language,dba,database,backend",
"javascript,js"
]
},
"my_stopword_filter": {
"type": "stop",
"stopwords": [
"a","an","and","are","as","at","be","but","by","for","if","in","into","is","it","no","not","of","on","or",
"such","that","the","their","then","there","these","they","this","to","was","will","with"
]
},
"analyzer": {
"my_synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"skillset_synonyms_filter",
"my_stopword_filter"
]
}
}
I have also added the following documents to the index:
1. skills: .NET,MVC,SQL,WEB API
2. skills: .NET,MVC,SQL,WEB API
3. skills: javascript
4. skills: java
5. skills: SQL
6. skills: .NET,.NET MVC,.NET Core,SQL,WEB API
My query to fetch the records looks like this:
{
"explain": "true",
"query": {
"bool": {
"must": [
{
"match": {
"skills": {
"query": "sql"
}
}
}
]
}
}
}
I am getting the correct result set, but I don't understand how tf is calculated: the explanation reports a length of field (dl) of 3 and an average length of field (avgdl) of 13.666667, which is confusing. Could anyone please help me with this?
Here are the details:
```json
{
"value": 0.66775244,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1,
"description": "phraseFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 3,
"description": "dl, length of field",
"details": []
},
{
"value": 13.666667,
"description": "avgdl, average length of field",
"details": []
}
]
}
```
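For what it's worth, plugging the reported values back into the tf formula does reproduce 0.66775244, so the formula itself checks out; what I don't understand is where dl = 3 comes from. A quick check:

```python
# Recompute tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
# with the values reported in the explanation above.
k1, b = 1.2, 0.75
freq, dl, avgdl = 1.0, 3, 13.666667
print(freq / (freq + k1 * (1 - b + b * dl / avgdl)))  # ~0.66775244
```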
Thanks.
So what is dl? It is not the length (in characters) of the field I'm searching on.