Elasticsearch: Expose the Lucene Matches API to searches

Created on 20 Apr 2018 · 10 Comments · Source: elastic/elasticsearch

The ability to tell exactly where a query has matched in a document is something that's been requested many times (see for example #5736). The solution has normally been to try to parse something out of a highlighter or query explanation, but in Lucene 7.4 we will be able to get this information directly from the new Matches API.

This can be implemented similarly to the existing Explain API:

- by adding matches=true to the body of a search request
- via a REST endpoint [index]/[doc]/_explain

The response in both cases would be an array of match objects, looking something like this:

"matches" : [
    "field1" : [ { "startPosition" : 1, "endPosition" : 1, 
                      "startUTF16Offset" : 34, "endUTF16Offset" : 43,
                      "terms" : [ "badgering" ] }, ... ],
    "field2" : [ ... ],
    ...
]

We need to make it clear that the offsets are using UTF-16 encoding. If offsets are not stored in the index, they will need to be derived by re-analyzing the source somehow (either using a MemoryIndex or just extracting offsets from a TokenStream), and limits will need to be put on this, similar to the limits on highlighting. This process may be expensive, so it should also be possible to return only the positions.
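
For a client whose strings are not indexed by UTF-16 code units, using such offsets takes an extra step. A minimal Python sketch, assuming a match object shaped like the example above and end offsets that are exclusive (as the 34/43 pair for "badgering" suggests). The helper `slice_utf16` and the sample text are made up for illustration:

```python
def slice_utf16(text: str, start: int, end: int) -> str:
    """Return the substring covered by UTF-16 code unit offsets [start, end)."""
    # Each UTF-16 code unit is two bytes, so a unit offset is a byte offset / 2.
    units = text.encode("utf-16-le")
    return units[2 * start:2 * end].decode("utf-16-le")


# Hypothetical document text; U+1F600 needs two UTF-16 code units.
source = "\U0001F600 stop badgering me"
# Hypothetical match object using the field names from the example response.
match = {"startUTF16Offset": 8, "endUTF16Offset": 17, "terms": ["badgering"]}

start, end = match["startUTF16Offset"], match["endUTF16Offset"]
extracted = slice_utf16(source, start, end)  # honours UTF-16 units
naive = source[start:end]                    # treats offsets as code points

print(repr(extracted))
print(repr(naive))
```

Here the naive slice is off by one character because Python indexes strings by code point, while the offsets count UTF-16 code units.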

:Search/Search >feature

romseygeek

👍8

Most helpful comment

It could also be nice to have this information per query. So instead of setting matches:true on the main search request, we could allow this option to be set per query builder, like we do for named queries?

jimczi on 20 Apr 2018

👍4 😄1

All 10 comments

Pinging @elastic/es-search-aggs

elasticmachine on 20 Apr 2018

We could extend the named query functionality to allow you to get the Matches, score and Explanation for each named query.

romseygeek on 20 Apr 2018

👍2

> to get the Matches, score and Explanation for each named query

This would align with #29606.

> We need to make it clear that the offsets are using UTF-16 encoding. If offsets are not stored in the index, they will need to be derived by re-analyzing the source somehow (either using a MemoryIndex or just extracting offsets from a TokenStream), and limits will need to be put on this, similar to the limits on highlighting. This process may be expensive, so it should also be possible to return only the positions.

I'm afraid that returning UTF16 offsets is going to make the API super hard to use for everyone but Java clients? Maybe the encoding should be a (required?) parameter of the API? This would make the API slower for sure, but much easier to use.

jpountz on 20 Apr 2018

👍1

> I'm afraid that returning UTF16 offsets is going to make the API super hard to use for everyone but Java clients?

I suspect that whatever offsets are returned will break some clients in some more-or-less subtle ways. It's reasonable to expect offsets to be used in expressions like s.substring(startIndex,endIndex), and the meaning of the numbers in this kind of expression is very language-dependent.

UTF16 offsets are equivalent to offsets in byte-based strings using ASCII (also ISO-8859-1) encoding, and to offsets into codepoint-based strings that do not contain surrogates. The fact that these numbers _almost_ coincide is the source of much trappiness. I don't know whether UTF16 offsets are likely to be more widely useful than the other possibilities, but I think they work well for C++ and possibly Python too (but badly for Haskell FWIW).
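
To illustrate the near-coincidence, here is a small Python sketch that measures where the same term starts under three conventions: code points, UTF-16 code units, and UTF-8 bytes. The sample strings and the helper name `offsets_of` are made up for illustration.

```python
def offsets_of(text: str, needle: str) -> dict:
    """Start offset of needle in text, counted in three different units."""
    cp = text.index(needle)  # Python str is indexed by code point
    prefix = text[:cp]
    return {
        "code_points": cp,
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,
        "utf8_bytes": len(prefix.encode("utf-8")),
    }


# Pure ASCII: all three conventions agree.
ascii_case = offsets_of("stop badgering me", "badgering")
# One non-ASCII BMP character (U+00EA): only the UTF-8 byte offset differs.
latin_case = offsets_of("arr\u00eate de badgering", "badgering")
# One character outside the BMP (U+1F600): all three differ.
astral_case = offsets_of("\U0001F600 stop badgering", "badgering")

for case in (ascii_case, latin_case, astral_case):
    print(case)
```

The three conventions only diverge once the text before the match contains non-ASCII characters, which is why a wrong assumption can go unnoticed in testing.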

DaveCTurner on 20 Apr 2018

👍1

I can't speak for other languages, but eg in Perl, I'd get back the JSON and the characters would be decoded from UTF8, so my substr function would work on characters, not on raw UTF8 bytes.

clintongormley on 20 Apr 2018

@clintongormley Actually it seems to work on UTF8 bytes? perl -e "print substr(\"àbcd\", 2, 1)" -> "b", perl -e "print substr(\"abcd\", 2, 1)" -> "c".

jpountz on 20 Apr 2018

@jpountz that's because the UTF8 bytes haven't been decoded into characters in that example - they're just raw bytes.

This example handles the encoding and decoding correctly:

perl -C -e "use utf8; print substr(\"àbcd\", 2, 1)"

Decoding JSON would automatically convert the raw UTF8 bytes into characters.
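
The same distinction shows up in other languages. A minimal Python sketch (an analogue of the Perl one-liners above, not taken from any client code) comparing a slice of raw UTF-8 bytes with a slice of the string obtained by decoding JSON; the payload is made up for illustration:

```python
import json

# Raw UTF-8 bytes of a JSON document, as they might arrive over the wire.
payload = '{"text": "\u00e0bcd"}'.encode("utf-8")

# Slicing undecoded bytes: U+00E0 occupies two bytes, so offset 2 is "b".
raw = "\u00e0bcd".encode("utf-8")
byte_slice = raw[2:3]

# Slicing the decoded string: offset 2 counts characters, so it is "c".
decoded = json.loads(payload)["text"]
char_slice = decoded[2:3]

print(byte_slice, char_slice)
```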

clintongormley on 20 Apr 2018

👍1

I did a bit more digging with @bleskes: Perl, Python and Ruby seem to make it easy to work with Unicode code points. Go, Java, C# and C++ are not so easy since you need to explicitly convert back and forth between Unicode code point offsets and byte/character offsets: you can't use substring utility methods directly.

I think it means we have the following options, each of which has pros/cons:

- always return Unicode code point offsets (UTF32 offsets)
  - nicer API
  - easy to use with Perl, Python, Ruby, less with Go, Java, C# and C++
  - potentially still trappy if the user doesn't use the right string abstraction, eg. Python 2 users need to make sure they use a unicode string, not a regular string
- make the encoding a required option of the API
  - less nice but forces the user to reason about encodings
  - easier to work with Go, Java and C# by requesting offsets that can directly be used with their substring utility methods, eg. offsets into the UTF16 representation of the string for Java/C# and offsets into the UTF8 representation of the string for Go
- make the API return extracted matches and context in a way that would make it easy to build snippets in a search results UI, but no offsets
  - less flexible but no potential to use the offsets wrong
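
The first two options both imply converting UTF-16 offsets into another unit, which requires a scan over the text and is where the extra cost would come from. A sketch of the UTF-16 to code point direction; the helper name is hypothetical and not part of any proposed API:

```python
def utf16_to_code_point_offset(text: str, utf16_offset: int) -> int:
    """Map a UTF-16 code unit offset to a code point offset by scanning text."""
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the BMP take two UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


# Made-up sample: one BMP non-ASCII character and one astral character.
text = "na\u00efve \U0001F600 badgering"
start_cp = utf16_to_code_point_offset(text, 9)

print(start_cp, text[start_cp:start_cp + 9])
```

An offset that falls in the middle of a surrogate pair is rounded up to the next code point in this sketch; a real implementation would need to decide how to treat that case.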

Another source of traps is that some language APIs iterate graphemes (eg. C#'s 'TextElement') rather than code points, which would merge combining sequences into a single item. This would be trappy for the same reason that @DaveCTurner mentioned above: the offsets would almost always coincide.
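
A small Python sketch of how a combining sequence shifts code point offsets. Python's standard library has no grapheme iterator, so NFC normalization stands in here to show the count a grapheme-based API would see for this particular string; the sample text is made up for illustration:

```python
import unicodedata

# "e" followed by U+0301 (combining acute accent): two code points, one grapheme.
decomposed = "cafe\u0301 badgering"
# NFC merges the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

offset_decomposed = decomposed.index("badgering")  # counts the accent separately
offset_composed = composed.index("badgering")      # matches the grapheme count

print(offset_decomposed, offset_composed)
```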

jpountz on 23 Apr 2018

Rather than implementing this as a REST API, we'll look into exposing it via highlighter plugins, so that encoding issues won't be a problem.

romseygeek on 21 Aug 2018
