Describe the feature:
A method needs to be provided to collapse results around a field. This feature has many uses and purposes, and the list goes on and on.
Note that the cardinality of the field can be very large.
This feature can substantially improve search results quality by not allowing results to be cluttered up with many similar results. In other words, if you return (for example) three results per domain ID, this prevents one domain with many results from swamping the search results page.
Ideally, results are sorted based on the highest-sorted record which belongs to the group of records with the same value in the field. Also, ideally, the top N documents (for example, N=3) for each field value would be provided with each result.
Other search engines are able to provide this feature, including the FAST-ESP search engine and Solr Results Grouping.
I would accept a solution which does not allow jumping to an arbitrary page number, but where users can only go from page 1 to page 2 to page 3, sequentially. I would even accept a solution where the total number of groups is an estimate. (Others may feel differently, however; please chime in.)
> Other search engines are able to provide this feature,
Taking a quick look at the Solr docs, it seems that all docs sharing the common grouping field value have to be physically located on the same shard. If that is a concession you are willing to make, then I expect it is possible to use parent/child queries to achieve what you are after in Elasticsearch.
Yes, I can make it work with parent/child. But that's more effort (and more work on the ingestion side) than simply collapsing on a field.
> But that's more effort
Are you at least prepared to route related content to the same shard? Otherwise I expect you'll struggle to find a technology that can gather the scattered pieces of the puzzle together in a way that supports fast searches.
For this we have the top-hits aggregation
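As a sketch of that approach (index and field names here are illustrative), a `terms` aggregation wrapped around a `top_hits` aggregation returns the top N hits per field value:

```json
{
  "size": 0,
  "aggs": {
    "by_domain": {
      "terms": { "field": "domain_id", "size": 10 },
      "aggs": {
        "top_per_domain": { "top_hits": { "size": 3 } }
      }
    }
  }
}
```

Each `by_domain` bucket then carries its own top three hits.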
> not allowing results to be cluttered up with many similar results
If that is your use-case, then something like https://issues.apache.org/jira/browse/LUCENE-6066 looks like a better fit to me.
@markharwood : Sure, this could require related content to be routed to the same shard. Of course, it would be nice if the index had a configuration which did this automatically, i.e. choose the field name and the routing happens automatically.
@jpountz : Yes, but it needs to work within the context of the Elasticsearch cluster as a whole.
@clintongormley : What got this started is that you can't page through aggregation buckets. So, what is needed is, essentially, the top-hits aggregation with paging through the buckets.
> I.e. choose the field name and the routing happens automatically.
That adds a fair amount of cost if you don't supply the routing value separately - the coordinating node has to parse the JSON body just to know where to route. Kind of like the post office having to open every envelope just to read where it should be sent.
> what is needed is, essentially, the top-hits aggregation with paging through the buckets.
To me it feels closer to parent/child with inner hits but relaxing the requirement for a physical parent doc to group the children. Parent-child already enforces the idea of shard-locality and routing based on a value. Pagination support is already there (unlike terms aggs). Child docs can exist without the parent. The sticking point is that, currently, "orphaned" child docs are not returned in results and you don't want to have to create parent docs for them in order to group. I'm not sure how much work would be involved in adding support for "orphans".
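For reference, the existing parent/child mechanics being described look roughly like this (the type and field names are illustrative): `inner_hits` on a `has_child` query returns the best-matching children grouped under each parent:

```json
{
  "query": {
    "has_child": {
      "type": "log_line",
      "score_mode": "max",
      "query": { "match": { "message": "error" } },
      "inner_hits": { "size": 3 }
    }
  }
}
```

The gap under discussion is that this only returns groups for which a physical parent document exists.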
@markharwood : This all makes sense to me. It is a really useful feature, and definitely worth it.
In another use case, I have a customer who has bill lines and wants to group them into bills. It's another type of group-by clause, which I think could also be useful for aggregations.
Grouping log lines by purchase transaction ID, for example.
The list goes on and on.
Let's try to keep this focused. If you're analyzing transactions or bills that represent a single unit then create transaction or bill objects. See entity-centric indexing: https://www.youtube.com/watch?v=yBf7oeJKH2Y
If we're talking about Google-style collapsing of free-text search results (e.g. pages grouped under a site) then we need to be discussing different techniques.
@markharwood : I'm sorry, I was just providing examples of lists of records which could be collapsed over a field. I believe they can all be handled by the new feature proposed.
I'll watch the video on entity-centric indexing, but in all the cases I'm talking about, there is no record which is the "parent". The parent is a collection of records with some sort of binding value (domain ID, transaction ID, etc.).
In the log-line example, one of our customers is an ecommerce site. When a new order is created it is stamped with an order ID, and that order ID is carried along across many different servers and is stamped into hundreds (if not thousands) of log lines in dozens of log files.
This customer wants to show a list of orders in the search results, not a list of log lines.
Of course, we could reach back into the RDBMS where the order was first created and index that, then make that the parent and parse out all of the log lines with the order ID and make them children of the original order.
But 1) We don't actually have access to the original RDBMS table [it is managed by a separate group] and 2) Sometimes orders are deleted [and so the record goes away in the RDBMS - but it lives on in the logs], and 3) we would need to modify the records for all of the log lines with order IDs so that they are children records (extra processing and integration required).
Another option (the one which we are pursuing right now) is to create a Kafka consumer which reads the Kafka data with Spark, aggregates on order ID, then writes another Kafka stream with the stream of orders which is then indexed by Elasticsearch. This allows us some additional aggregation power which may be useful later.
But it would be much easier if we could simply collapse the results by order-ID.
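To illustrate, the request being asked for might look something like this (a hypothetical `collapse` option with illustrative field names; Elasticsearch later shipped a field collapsing feature along roughly these lines):

```json
{
  "query": { "match": { "message": "order processed" } },
  "collapse": {
    "field": "order_id",
    "inner_hits": { "name": "order_lines", "size": 3 }
  }
}
```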
Besides (tell me if I'm on the wrong topic), Solr has a numberOfGroups option (group.ngroups in Result Grouping) that returns the total number of groups.
In fact, with the following request:
```json
{
  "from": 0,
  "size": 0,
  "aggs": {
    "groupby.city": {
      "terms": {
        "field": "city",
        "size": 999,
        "order": {
          "_term": "asc"
        }
      }
    }
  }
}
```
I will fetch up to 999 city buckets, but the result won't tell me how many cities really match that aggregation. So trying to paginate group results is a real pain.
Pinging #4915 again.
@paullovessearch i don't think you meant to close it, so reopening
I think we need to distinguish collapsing from grouping. To me, collapsing means that you want to limit the number of top documents that belong to the same collapsing key. You don't group top hits with the same collapsing key; you just filter out those that break the collapsing rule.
Today this can be done with the diversified_sampler agg, but it's a best effort at the shard level:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-diversified-sampler-aggregation.html#_controlling_diversity
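A sketch of that best-effort approach (field name and sizes are illustrative): `diversified_sampler` caps how many documents per key feed into the sub-aggregations on each shard, so the `top_hits` below it is de-duplicated per `domain_id`:

```json
{
  "size": 0,
  "aggs": {
    "sample": {
      "diversified_sampler": {
        "field": "domain_id",
        "max_docs_per_value": 3,
        "shard_size": 200
      },
      "aggs": {
        "diversified_top": { "top_hits": { "size": 10 } }
      }
    }
  }
}
```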
Grouping is different: first it finds the best groups matching the query, and then for each group it keeps the N best matches. The results are top groups, not top hits. For instance, in an e-commerce context I want to group my results by product and keep the N best items for each product. For this we have the top_hits aggregation coupled with the terms aggregation. For accurate results you can also route your documents at indexing time based on the grouping field. It works fine except when you want to paginate the results. We could paginate the terms aggregation like we do for regular search pagination, but it would suffer from the same limitation: if you want to access page number 10, all shards need to send their results for pages 0 to 10. This is one of the reasons why we are reluctant to add pagination to the terms aggregation. It won't scale. Solr is no different here: they have pagination on grouping, but if you try to go deep you'll likely kill your cluster.
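One scalable alternative for strictly sequential (page 1, then 2, then 3) bucket pagination is a cursor-based approach; later Elasticsearch versions expose this as the composite aggregation. A sketch, with illustrative names:

```json
{
  "size": 0,
  "aggs": {
    "cities": {
      "composite": {
        "size": 100,
        "sources": [
          { "city": { "terms": { "field": "city" } } }
        ]
      }
    }
  }
}
```

Each response includes an `after_key`; passing it back as `"after"` fetches the next page, so shards only ever stream buckets forward. The trade-off is that buckets are ordered by key, not by relevance.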
> Solr has a numberOfGroups option (group.ngroups in Result Grouping) that returns the total number of groups.
It returns the total number of groups, but to be accurate all documents in each group must be co-located on the same shard. In ES you can simply run a cardinality aggregation on the grouping field:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
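For example (field name illustrative), counting the distinct grouping keys looks like:

```json
{
  "size": 0,
  "aggs": {
    "group_count": {
      "cardinality": { "field": "city" }
    }
  }
}
```

Note that `cardinality` is an approximate (HyperLogLog++-based) count, which matches the earlier concession that an estimated group total would be acceptable.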
> I'll watch the video on entity-centric indexing, but in all the cases I'm talking about, there is no record which is the "parent".
That is also true of _all_ the examples in the video (I do recommend you watch it). Many of the examples you have given, Paul (transactions + log records, orders + items, bills + lines), have potential attributes, e.g. duration, which are better computed in an entity-centric fashion rather than assembled at query time using aggs. The benefits are:
1) More complex consolidation logic is possible e.g. flagging anomalous sequences of events.
2) Expensive query-time aggs are avoided
3) The fused entity is a document that can be paginated using normal search APIs
4) The event store can still be organized along time-based indices (no routing of event docs required)
So the general advice is that it doesn't always pay to do _all_ of your aggregation logic at query time.
Of course there is still Jim's other use case of result-collapsing to diversify search results but that doesn't sound like the primary use case you want to discuss here? (again, we need to establish a clear focus for this issue).
@markharwood : Agreed, all of those are useful when collapsing at index time - and we are pursuing methods (using Kafka and Spark) for doing this sort of management of these sorts of streams at index time.
But there are lots of use cases where these sorts of complex benefits are not required. Lots of times all you need to do is collapse on a field and show the top N matching records which belong to the same value of that field. Additional use cases (such as your 1-4 above) are not required.
@jimczi : In terms of diversity, I think a lot of users are looking for a very simple way to explain the results to their users. And (like Google) - collapsing results on a field is very intuitive and easy to understand for the end-user. "Oh look, you only get a single result from each host name", etc.
Collapse-by-field implementations that I've seen imply that the groups are ordered by the strongest (first-occurring) record within the group, based on the sort order. So it's not really "ordering of the groups"; it's really "ordering of the records" and then collapsing by group.
So it's not a "group by" / "order by" clause in the traditional RDBMS sense. It's more of "show the results in the same order, but then skip/collapse all the ones which occur later in the results with a field value which has already been seen before".
However, if we can provide some of the same advantages of parent-child functions for more complex processing of the groups, then cool. But that isn't what I've used in the past and what my customers need right now.
@clintongormley : Thanks for re-opening! Why do they put the "close and comment" button right next to the "comment" button anyway? Crazy.
> Additional use cases (such as your 1-4 above) are not required.
Except use case 3 which is pagination - where we came from in https://github.com/elastic/elasticsearch/issues/4915
Search (and not aggs) is the mechanism for paginated results. We support (for good reason) the following styles of doc search, in order of performance:
1) Single documents e.g. text doc
2) Nested documents e.g. order plus order lines
3) Parent/child e.g. company and employees
You'll notice the order of performance is directly related to the locality of the data (same doc, same segment, or same shard). We do not support an option 4 where we try to rank elements scattered across multiple nodes, as we can't defy physics, e.g. overcome the speed limitations of networks, to support fast searches. However, entity-centric indexing is a valid approach to consolidating related data together as a background process rather than deferring these tasks until query time.
So these are the tools that are available to you for "grouped search". You'll need to organise your indexing strategies to pick the best approach. I recognise that this can require effort on your part, and I'd love to offer a miracle cure-all pill, but at some point "eat less, exercise more" is the only advice you can give. The one change we could possibly explore is that option 3 (parent/child) might be made to work where clients had not actually chosen to provide a physical parent doc, only children.
If you agree with these points, let's put this "grouped" thing to bed and move on to tackling collapsing with diversity, which has a different set of challenges to discuss...
Okay then. I'll state outright that what I want is option 3) parent/child, without a physical parent document.
Does that help?
Does that help?
Always helps to discuss something concrete :) We've meandered a little here in getting to this point so maybe close this issue in favour of a more targeted one?
Sure, makes sense. Should I do it?
Closing in favour of https://github.com/elastic/elasticsearch/issues/21820