Today field collapsing (or grouping) is "doable" in ES through a top_hits aggregation under a terms aggregation:
https://www.elastic.co/guide/en/elasticsearch/guide/master/top-hits.html
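For reference, a minimal sketch of that workaround (the aggregation names are made up for illustration; the groups here are ordered by document count, ordering them by best score would need an extra max metric on the score):

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "collapsed_by_product": {
      "terms": {
        "field": "product_id",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "top_product_hits": {
          "top_hits": {
            "size": 2
          }
        }
      }
    }
  }
}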
If the number of groups to retrieve is low this can be very efficient, since it is coupled with the breadth_first collection mode of the aggregation framework. There are, however, limitations inherent to the terms aggregation that prevent its use for a simple field collapsing use case. One of the main limitations is that it's impossible to paginate over the groups. There are many reasons why we don't want to implement pagination on the terms aggregation, but to me the main one is that it's impossible to ensure that the ranking will be accurate. It all depends on the sort that is applied: if you sort on the sum of a field, for instance, we might present a wrong order to the end user, because each shard only sees a partial sum for the groups it holds. Sorting on the maximum value of a field, on the contrary, is fine and should lead to an accurate ranking.
Since determining the accuracy of the sort is a difficult task, we could introduce a new type of aggregation that limits the kind of sorting you can apply to your groups. This new aggregation could look like this:
"aggs": {
"my_top_groups": {
"top_groups": {
"field": "product_id",
"sort": ["_score"],
"size": 10,
"group_size": 2
}
This aggregation would return the 10 best groups (size:10) sorted by the max _score of their inner hits with a maximum of 2 inner hits per group (group_size:2). It would be possible to sort on any field as long as the field exists in the index and to group by any field as long as the field has a unique value per document.
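For instance, a hypothetical variant of the same proposal that sorts the groups on a numeric price field (the field name is made up here) could look like:

"aggs": {
  "my_top_groups": {
    "top_groups": {
      "field": "product_id",
      "sort": [{ "price": { "order": "asc" } }],
      "size": 10,
      "group_size": 2
    }
  }
}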
Additionally it should be possible to paginate over the top groups by specifying a start parameter:
"aggs": {
"my_top_groups": {
"top_groups": {
"field": "product_id",
"sort": ["_score"],
"size": 10,
"group_size": 2,
"start": 10
}
So start:10 would be the initial offset for the list of groups to return (this would be "page 2" of the first example). It won't be optimal, since we still need to retrieve 20 groups per shard to find the groups ranked 10 to 20, but it would be accurate. Of course we could put a limit on the start offset (like we do for normal search).
One limitation of this approach is that only the ranking of the groups and the first hit inside each group would be accurate. The inner hits of each group after the first one may be wrong, because a group can appear on multiple shards. One way to solve this is to ensure that all documents of a group are co-located on the same shard; doing so would also let us efficiently return the number of groups that match the query. I don't think this should be a requirement, though, but it could be documented as the way to always get accurate results.
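Co-location per group can already be achieved with custom routing, for instance by routing on the collapse field at index time. A minimal sketch (index name and documents are made up for illustration):

PUT my-index/items/1?routing=A1
{ "product_id": "A1", "title": "first doc of group A1" }

PUT my-index/items/2?routing=A1
{ "product_id": "A1", "title": "second doc of group A1" }

Both documents share the same routing value, so they are guaranteed to end up on the same shard.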
This makes sense to me. How deep could it go? I think going down to 1000, 2000 groups (say) should still give reasonable performance. This would be 5-10 pages deep if pages are 10-20 groups each.
@paullovessearch I think 1000/2000 is reasonable; more or less that's the order of magnitude we should aim for.
Looking forward to it 👍
Looks great. Another important caveat is that it would not support child aggs.
I discussed with @colings86 regarding the feasibility of such an aggregation and we came to the conclusion that this would require a lot of changes in the aggregation framework. Today aggregations are executed in a single phase, but to implement field collapsing and paging efficiently we would need at least two phases. The first phase retrieves the top groups, then the coordinating node merges them and selects the N best groups globally. The second phase sends the N best groups to all shards to retrieve the M best hits per group. Finally the coordinating node merges the top hits per group. With the aggregation framework as it is, we would need to pack everything into one phase: each shard would return the N top groups and the M top hits per group. With pagination it's not the N top groups but the N+from top groups that need to be returned by each shard.
We could add a "fetch" phase for the aggregations but I don't think that other aggs could benefit from it. IMO this is not useful to have an agg that is intended to be a root aggregation and where sub-aggs are not allowed.
Considering that the use case that this issue is trying to solve is a search problem, I'd like to try another approach:
The intent of this issue is to provide a simple and accurate solution for single-level field collapsing on search results. Since we want to enable pagination on top of collapsed results we need to keep things simple. The top_groups aggregation is one solution, but it would be inefficient and too costly to run without big modifications to the aggregation framework (multi-phase aggregations).
Reusing the search capability of ES, we could instead implement this feature in the search layer directly. A query for collapsed results would look like this:
GET _search
{
  "collapse": {
    "field": "product_id"
  },
  "query": {
    "match_all": {}
  }
}
The query above would just "collapse" the top hits based on the provided field.
An example of a response snippet:
"hits": [
{
"_index": "my-index",
"_type": "items",
"_id": "1",
"_source": ...,
"_group": 0,
"_score", 200
},
{
"_index": "my-index",
"_type": "items",
"_id": "1",
"_source": ...,
"_group": 3
"_score": 100
}
]
The response is a normal search response except that only one hit per group is returned. This hit is the head of the group: the best candidate based on the sort specification of the query. An additional meta field named _group indicates the id of the group for the hit.
By returning only the best hit per group we ensure that the ranking is accurate and the pagination works exactly like a normal search.
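Pagination over the collapsed hits could then presumably reuse the standard from/size parameters. A sketch of what requesting the second page of 10 groups might look like (an assumption about the API, not a finalized design):

GET _search
{
  "from": 10,
  "size": 10,
  "collapse": {
    "field": "product_id"
  },
  "query": {
    "match_all": {}
  }
}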
If we need more than one result per group then we can reuse the inner_hits idea:
GET _search
{
  "collapse": {
    "field": "product_id",
    "inner_hits": {
      "size": 2
    }
  },
  "query": {
    "match_all": {}
  }
}
Each search hit would then contain an inner_hits object with the next 2 most competitive hits for its group:
{
  "hits": [
    {
      "_index": "my-index",
      "_type": "items",
      "_id": "1",
      "_source": ...,
      "_group": 0,
      "inner_hits": {
        "product_id": {
          "hits": {
            "total": ...,
            "hits": [
              {
                "_id": "12",
                "_source": ...,
                "_score": 15
              },
              {
                "_id": "22",
                "_source": ...,
                "_score": 13
              },
              ...
            ]
          }
        }
      }
    }
  ]
}
The advantage of this solution is that it can be smoothly integrated into the existing search layer. inner_hits for the collapse parameter can be implemented as a sub fetch phase, like parent/child does. It would not be efficient in a scroll context though, so we should probably disable collapse in scroll queries if this beast gets in. Simple pagination would work with the same limitations as normal search, which is why deep pagination should also be prohibited.
Does it make any sense?
I like it!
Awesome!! @jimczi I liked the proposal 👍
We discussed this internally and decided to move on with a POC in order to evaluate the difficulty and the amount of code needed.
I'll try to work on this soon but I have other issues to finish first. I'll update the issue when I start the implementation.