The ability to tell exactly where a query has matched in a document is something that's been requested many times (see for example #5736). The solution has normally been to try to parse something out of a highlighter or query explanation, but in Lucene 7.4 we will be able to get this information directly from the new Matches API.
This can be implemented similarly to the existing Explain API: by adding matches=true to the body of a search request, or through an endpoint modeled on [index]/[doc]/_explain.
The response in both cases would be an array of match objects, looking something like this:
"matches" : [
"field1" : [ { "startPosition" : 1, "endPosition" : 1,
"startUTF16Offset" : 34, "endUTF16Offset" : 43,
"terms" : [ "badgering" ] }, ... ],
"field2" : [ ... ],
...
]
We need to make it clear that the offsets are using UTF-16 encoding. If offsets are not stored in the index, they will need to be derived by re-analyzing the source somehow (either using a MemoryIndex or just extracting offsets from a TokenStream), and limits will need to be put on this, similar to the limits on highlighting. This process may be expensive, so it should also be possible to return only the positions.
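To make the re-analysis idea concrete, here is a minimal sketch in Python. It is not how Lucene's MemoryIndex or TokenStream work; the regex tokenizer, the Token type, and the max_analyzed_chars and with_offsets parameters are all hypothetical stand-ins. It only illustrates two points from the paragraph above: offsets are counted in UTF-16 code units, and both a length limit and a positions-only mode keep the cost bounded.

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    term: str
    position: int                 # token position, as in startPosition/endPosition
    start_utf16: Optional[int]    # None when only positions were requested
    end_utf16: Optional[int]


def utf16_len(s: str) -> int:
    """Number of UTF-16 code units needed for s (2 per non-BMP character)."""
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s)


def analyze(text: str, max_analyzed_chars: int = 1000,
            with_offsets: bool = True) -> List[Token]:
    """Toy re-analysis: split on word characters and lowercase.

    max_analyzed_chars is a hypothetical limit, playing the role of the
    limits applied to highlighting. Text beyond it is ignored, so a token
    straddling the limit is cut short (a simplification of this sketch).
    """
    limited = text[:max_analyzed_chars]
    tokens: List[Token] = []
    units = 0   # UTF-16 code units consumed so far
    last = 0    # code point index where the previous token ended
    for position, m in enumerate(re.finditer(r"\w+", limited)):
        if not with_offsets:
            tokens.append(Token(m.group().lower(), position, None, None))
            continue
        units += utf16_len(limited[last:m.start()])
        start = units
        units += utf16_len(m.group())
        last = m.end()
        tokens.append(Token(m.group().lower(), position, start, units))
    return tokens


tokens = analyze("😀 stop badgering me")
print(tokens[1])
```

Calling analyze(text, with_offsets=False) skips the offset bookkeeping entirely, which corresponds to returning only the positions.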
Pinging @elastic/es-search-aggs
It could also be nice to have this information per query. So instead of adding matches:true to the main search request, we could allow this option to be set per query builder, like we do for named queries?
We could extend the named query functionality to allow you to get the Matches, score and Explanation for each named query.
> to get the Matches, score and Explanation for each named query
This would align with #29606.
> We need to make it clear that the offsets are using UTF-16 encoding. If offsets are not stored in the index, they will need to be derived by re-analyzing the source somehow (either using a MemoryIndex or just extracting offsets from a TokenStream), and limits will need to be put on this, similar to the limits on highlighting. This process may be expensive, so it should also be possible to return only the positions.
I'm afraid that returning UTF16 offsets is going to make the API super hard to use for everyone but Java clients? Maybe the encoding should be a (required?) parameter of the API? This would make the API slower for sure, but much easier to use.
> I'm afraid that returning UTF16 offsets is going to make the API super hard to use for everyone but Java clients?
I suspect that whatever offsets are returned will break some clients in some more-or-less subtle ways. It's reasonable to expect offsets to be used in expressions like s.substring(startIndex,endIndex) and the meaning of the numbers in this kind of expression is very language-dependent.
UTF16 offsets are equivalent to offsets in byte-based strings using ASCII (also ISO-8859-1) encoding, and as offsets into codepoint-based strings that do not contain surrogates. The fact that these numbers _almost_ coincide is the source of much trappiness. I don't know whether UTF16 offsets are likely to be more widely useful than the other possibilities, but I think they work well for C++ and possibly Python too (but badly for Haskell FWIW).
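The "almost coincide" point can be checked with a small sketch. The helper name match_offsets is made up for illustration; it reports where one substring starts and ends when counted in code points, UTF-16 code units, and UTF-8 bytes.

```python
def match_offsets(text: str, needle: str) -> dict:
    """Start/end of needle in text under three different offset conventions."""
    cp_start = text.index(needle)          # Python str indexes by code point
    cp_end = cp_start + len(needle)

    def units(prefix: str, encoding: str, width: int) -> int:
        # Length of the prefix in code units of the given encoding.
        return len(prefix.encode(encoding)) // width

    return {
        "code_points": (cp_start, cp_end),
        "utf16_units": (units(text[:cp_start], "utf-16-le", 2),
                        units(text[:cp_end], "utf-16-le", 2)),
        "utf8_bytes": (units(text[:cp_start], "utf-8", 1),
                       units(text[:cp_end], "utf-8", 1)),
    }


# Pure ASCII: all three conventions agree.
print(match_offsets("stop badgering", "badgering"))
# One accented letter and one emoji before the match: all three differ.
print(match_offsets("àb 😀 badgering", "badgering"))
```

For the ASCII input every convention gives (5, 14). For the second input the code point offsets are (5, 14), the UTF-16 offsets are (6, 15), and the UTF-8 byte offsets are (9, 18), so a client that picks the wrong convention only notices once non-ASCII text shows up.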
I can't speak for other languages, but e.g. in Perl, I'd get back the JSON and the characters would be decoded from UTF8, so my substr function would work on characters, not on raw UTF8 bytes.
@clintongormley Actually it seems to work on UTF8 bytes?
perl -e "print substr(\"àbcd\", 2, 1)" -> "b"
perl -e "print substr(\"abcd\", 2, 1)" -> "c"
@jpountz that's because the UTF8 bytes haven't been decoded into characters in that example - they're just raw bytes.
This example handles the encoding and decoding correctly:
perl -C -e "use utf8; print substr(\"àbcd\", 2, 1)"
Decoding JSON would automatically convert the raw UTF8 bytes into characters.
I did a bit more digging with @bleskes: Perl, Python and Ruby seem to make it easy to work with Unicode code points. Go, Java, C# and C++ are not so easy since you need to explicitly convert back and forth between Unicode code point offsets and byte/character offsets: you can't use substring utility methods directly.
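Even in a code-point-based language, UTF-16 offsets cannot be passed straight to a slice. As a rough sketch of the conversion a client would need (slice_by_utf16 is a hypothetical helper, not part of any proposed API), one can round-trip through the UTF-16 encoding:

```python
def slice_by_utf16(text: str, start: int, end: int) -> str:
    """Return the part of text between two UTF-16 code unit offsets.

    Raises UnicodeDecodeError if an offset falls inside a surrogate pair.
    """
    encoded = text.encode("utf-16-le")     # 2 bytes per UTF-16 code unit
    return encoded[2 * start:2 * end].decode("utf-16-le")


doc = "😀 badgering"
# Suppose the match is reported at UTF-16 offsets 3..12.
print(slice_by_utf16(doc, 3, 12))   # uses the offsets as UTF-16 units
print(doc[3:12])                    # naive slice treats them as code points
```

The first call returns "badgering"; the naive slice returns "adgering", because the emoji occupies two UTF-16 code units but only one code point.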
I think it means we have the following options which both have pros/cons:
Another source of traps is that some language APIs iterate graphemes (eg. C#'s 'TextElement') rather than code points, which would merge combined sequences into a single item. This would be trappy for the same reason that @DaveCTurner mentioned above that it would almost always coincide.
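A short sketch of the grapheme trap, with the caveat that Python's standard library has no full grapheme segmentation: approx_cluster_count below is a made-up helper that merely skips combining marks, which is only a rough approximation of what an API like TextElement does.

```python
import unicodedata


def approx_cluster_count(text: str) -> int:
    """Count characters that are not combining marks.

    This is a crude stand-in for grapheme segmentation, not an
    implementation of the Unicode text segmentation rules.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
print(len(decomposed))                                  # code points
print(approx_cluster_count(decomposed))                 # user-perceived characters
print(len(unicodedata.normalize("NFC", decomposed)))    # code points after composing
```

The decomposed string has 5 code points but 4 user-perceived characters, so offsets counted per grapheme drift from code point offsets only after the first combined sequence, and agree everywhere before it.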
Rather than implementing this as a REST API, we'll look into exposing it via highlighter plugins, so that encoding issues won't be a problem.