Elasticsearch: Add match count scoring option

Created on 25 Sep 2015 · 9 comments · Source: elastic/elasticsearch

Feature Request:

I would like to be able to run a query that simply returns a score indicating the number of terms that matched my query. As far as I can tell after poring over the documentation (& internet), this is not currently supported.

Say my index contains the following records:

  • { "name": "john jacob smith junior 3rd" }
  • { "name": "john jacob smith junior" }
  • { "name": "john smith 3rd" }
  • { "name": "fred smith" }

These are the scores I'd want to get back for the following match queries:

search terms: john smith

  • john jacob smith junior 3rd => 2
  • john jacob smith junior => 2
  • john smith 3rd => 2
  • fred smith => 1

search terms: fred smith 3rd

  • john jacob smith junior 3rd => 2
  • john jacob smith junior => 1
  • john smith 3rd => 1
  • fred smith => 2

etc.

This should support all the semantics of a standard match query, e.g. fuzziness, minimum_should_match, etc., and I should be able to set a boost so that I may weight these queries in a should clause.

I feel like such an approach would fill a gap between the effectively infinite granularity of standard match scores (closeness of the search terms to the searched field, normalized by TF-IDF) and constant_score queries that collapse everything down to a simple yes or no.

The rationale for this is that in our application we don't actually care how many terms in the searched field _don't_ match; we just care how many _do_ match, and we don't care how rare or important a search term is in the index. We want to give every single record that matches the same number of search terms precisely the same score, and then use our own business logic to boost the ranking appropriately based on other fields, e.g. recency, popularity, promoted status, etc.

Using the examples above, a search for smith would give every record a 1, and we would use our other fields (not shown for brevity) in a function_score query to boost the ones we feel are most relevant to the top of the pack.
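
To make that concrete, here is a minimal sketch of the kind of function_score wrapper described above; the index name and the popularity field are illustrative assumptions, not part of the records shown:

GET my_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "constant_score": {
          "filter": { "term": { "name": "smith" } }
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "missing": 1
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

Every record matching smith starts from the same constant score of 1 and is then scaled by its popularity value.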

I feel like it would make sense to implement this as a parameter that could be added to the existing query types. Something like score_mode that accepted values like similarity or match_count, where similarity would be the default and represents how scoring currently works, and match_count would be my proposed addition.

So a simple query might look like:

{
    "query": {
        "match": {
            "display_name": {
                "fuzziness": 0.75,
                "query": "smith",
                "scoring_mode": "match_count"
            }
        }
    }
}

In our use case we'd want to bundle several of these up in a should clause, which does seem to normalize the overall score (I wrote a query with two constant_score clauses with boost: 1; matches on both clauses had a score of 1.4142135 and matches on one clause had a score of 0.35355338). That's not ideal, but I could work with it. In a perfect world (or my perfect world!) a should clause containing multiple of these match_count queries would emit the sum of all those scores without modification.
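
For reference, a minimal sketch of the kind of test query described above, using the name field from the earlier examples (the index name and the exact clauses are illustrative):

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "boost": 1,
            "filter": { "term": { "name": "john" } }
          }
        },
        {
          "constant_score": {
            "boost": 1,
            "filter": { "term": { "name": "smith" } }
          }
        }
      ]
    }
  }
}

With the default query normalization, documents matching both clauses came back at 1.4142135 and documents matching only one at 0.35355338, rather than 2 and 1.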

All 9 comments

UPDATE: In going over this with a colleague I realized that only knowing how many search terms matched a record in the index could obscure the fact that it really was a low-quality match. For example, the query terms john jacob smith junior would score a 2 against john smith, but the query john smith would produce exactly the same score against that record despite having no term misses.

Perhaps in addition to or instead of match_count it could have match_percent. In that case john smith would get a score of 0.5 against the query john jacob smith junior, while the same query against the record john smith would produce a score of 1.
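
To make the proposed syntax concrete, a hypothetical match_percent query might look like the following; neither score_mode nor match_percent exists in Elasticsearch today, this is purely the shape of the proposal:

{
    "query": {
        "match": {
            "display_name": {
                "query": "john jacob smith junior",
                "score_mode": "match_percent"
            }
        }
    }
}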

This is a little closer to Elasticsearch's behavior today, but with the important removal of the TF-IDF normalization and the exceedingly fine granularity that overwhelms our custom function_score ranking.

So basically what you are saying is that you just want to wrap each term in the boolean query in a constant_score query that returns 1 on a match?

You have a few options. You can disable term stats completely by setting index_options to docs.
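
For example, a minimal mapping sketch that indexes only document IDs for the field (no term frequencies or positions); the index and type names just mirror the example index used later in this thread:

PUT t
{
  "mappings": {
    "my_type": {
      "properties": {
        "field": {
          "type": "string",
          "index_options": "docs"
        }
      }
    }
  }
}

With index_options set to docs, repeated occurrences of a term within a field can no longer inflate its contribution to the score.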

You can use BM25 similarity and tune k1 and b to control term frequency saturation and field length normalization.

Use a constant score query per term, wrapped in a bool query (with disable_coord set to true).

You're already using function score, so you could just filter on the terms you're interested in and score each one with the weight you desire.

Alternatively, use the weight and boost_mode parameters in function score to tune how query weights are combined with your function score weights.

Write your own custom similarity that does exactly what you want.

I think there are enough options here that we don't need to further complicate the match query.

@clintongormley thanks for the info. I definitely don't want to gum up the works if this is already possible. I will look into the options you describe but based on your reply I want to be sure I explained this clearly. You mentioned using constant_score but that would collapse all those scores down to the same value (value of boost or default of 1), no? And re: function_score I don't want to "filter on the terms [I'm] interested in", I want to include everything.

I'm not looking to filter results or get precise weighting across multiple should clauses, I'm trying to get a score that tells me how many search terms matched the searched record, without normalizing for term frequency, etc. Something along the lines of having the search terms a b c d produce a score of:

  • 1 against the field "A"
  • 2 against the record "A B x y z"
  • 4 against the record "A x B x C x D foo bar guacamole"
  • etc.

My apologies if my initial description was misleading, or if I've misunderstood your reply. I will of course defer to you completely on this but I just want to make sure we're not talking past each other.

Here's what I mean:

You can tune BM25 to ignore term frequency (k1) and field length normalization (b) as follows:

PUT t
{
  "similarity": {
    "only_idf": {
      "type": "BM25",
      "k1": 0,
      "b": 0
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "field": {
          "type": "string",
          "similarity": "only_idf"
        }
      }
    }
  }
}

POST t/my_type/_bulk
{"index": {}}
{"field": "A"}
{"index": {}}
{"field": "A B x y z"}
{"index": {}}
{"field": "A x B x C x D foo bar guacamole"}

This query will add up the IDF for each matching term (you mentioned the requirement in your second comment to differentiate between high and low quality terms):

GET t/_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        {
          "term": {
            "field": "a"
          }
        },
        {
          "term": {
            "field": "b"
          }
        },
        {
          "term": {
            "field": "c"
          }
        },
        {
          "term": {
            "field": "d"
          }
        }
      ]
    }
  }
}

You can use constant scores to count each match as 1 (although the final score for each document is normalised, so you get 0.5, 1, and 2 instead of 1, 2, and 4):

GET t/_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "field": "a"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "field": "b"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "field": "c"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "field": "d"
              }
            }
          }
        }
      ]
    }
  }
}

And here's a function score query which gives you 1, 2, and 4:

GET t/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "field": "a b c d"
        }
      },
      "boost_mode": "replace",
      "score_mode": "sum",
      "functions": [
        {
          "filter": {
            "term": {
              "field": "a"
            }
          },
          "weight": 1
        },
        {
          "filter": {
            "term": {
              "field": "b"
            }
          },
          "weight": 1
        },
        {
          "filter": {
            "term": {
              "field": "c"
            }
          },
          "weight": 1
        },
        {
          "filter": {
            "term": {
              "field": "d"
            }
          },
          "weight": 1
        }
      ]
    }
  }
}

Thanks @clintongormley. This is _exceedingly_ helpful. I didn't realize I'd effectively have to shatter my search term into individual should clauses ("field": "a", "field": "b", etc.). So I guess I need to tokenize my search terms on whitespace and build the query dynamically, one should array item per term?

And I didn't mean I need to distinguish between high- and low-quality _terms_ (like the vs. guacamole); I meant I want to distinguish high-quality matches (all my search terms match) from low-quality matches (some terms were found, some were not). For example, if I search for a b, the record a b c d would be a high-quality match, as it contains all my search terms, and would have a score of 2 using your function_score query.

My concern is that the query a b c d against the record a b would also produce a score of 2 even though half my search terms missed. To be clear, I still want them in the results; I'd just want them to score lower. But now that I think it through, I wonder whether this would actually cause issues in practice. I don't actually care about correctly ranking results for the search a b against those for the search a b c d; I only care that all results that contain a and b and c and d are ranked identically at the top, followed by all records that are missing one term, then missing two, all the way down to the minimum_should_match I specify.

Any parting words of advice/caution regarding the above? Otherwise you're free to go. ;-) (And regardless, I owe you a beer or three!)

My concern is that the query a b c d against the record a b would also produce a score of 2 even though half my search terms missed. To be clear I still want them in the results, I'd just want them to score lower.

That's where query coordination comes in handy (i.e. the thing I disabled with disable_coord). See https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html#coord for more.
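
To illustrate, here is a compact sketch of the earlier constant_score query with coordination left enabled (disable_coord simply omitted) and a minimum_should_match floor added, against the same t index; the floor of 2 is just an example:

GET t/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 2,
      "should": [
        { "constant_score": { "filter": { "term": { "field": "a" } } } },
        { "constant_score": { "filter": { "term": { "field": "b" } } } },
        { "constant_score": { "filter": { "term": { "field": "c" } } } },
        { "constant_score": { "filter": { "term": { "field": "d" } } } }
      ]
    }
  }
}

With coordination enabled, documents matching more of the should clauses score proportionally higher, which gives the graded ranking described above.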

What if I can't analyze the query string myself?
For the query string "iphone5s", I need the analyzer to analyze the query string and get the same effect that jpotisch is asking for.

For example:
query string: "iphone5s"
terms produced by the index analyzer: "iphone", "5s"
expected score: 2, 1, or 0

Is this possible?
@clintongormley

Ideally I would also know which terms matched the document in a script score; then I could adjust the weight of each term, e.g. by building a mapping from term to weight in the script.
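
As a side note, one way to see which terms the index analyzer produces for a query string is the _analyze API; a minimal sketch against the t index (depending on the Elasticsearch version this may need to be sent as a request body instead of query parameters, and the tokens returned depend entirely on the analyzer configured for the field):

GET t/_analyze?field=field&text=iphone5s

The returned token list could then be used to build one per-term clause each, as in the earlier examples.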

@clintongormley Thanks for your illuminating post. Do you think the tuned BM25 approach will be faster, in terms of search performance, than writing the filter queries inline? Thanks again!
