Elasticsearch: Order by scripted_metric sub aggregation

Created on 14 Nov 2014 · 94Comments · Source: elastic/elasticsearch

Since there is a new Scripted metric aggregation (scripted_metric) in 1.4, it is possible to do a lot of amazing stuff.
For example it is possible to implement Weighted Average aggregation, which we were missing before.

Now we are really missing a possibility to sort by scripted_metric results.

Live example:

We calculate weightedAvgVis with scripted_metric and want to get ids with TOP 5 values of weightedAvgVis. Since script returns double, it looks logically possible.

{  
   "from":0,
   "size":0,
   "query":{  
      "match_all":{   }
   },
   "aggregations":{  
      "idNodes":{  
         "terms":{  
            "field":"id",
            "size":5,
            "order":{  
               "weightedAvgVis":"asc"
            }
         },
         "aggregations":{  
            "weightedAvgVis":{  
               "scripted_metric":{  
                  "init_script":"_agg['weightedSum'] = 0d; _agg['countSum'] = 0L;",
                  "map_script":"_agg['weightedSum'] = _agg.weightedSum + _source['avgVis'] * _source['count']; _agg['countSum'] = _agg.countSum + _source['count'];",
                  "reduce_script":"weightedSum = 0d; countSum = 0L; for(a in _aggs) {weightedSum += a.weightedSum; countSum += a.countSum;};if(countSum == 0L) {return 0d;} else {return weightedSum / countSum}"
               }
            }
         }
      }
   }
}

:AnalyticAggregations >enhancement feedback_needed help wanted high hanging fruit

Source

eryabitskiy

👍13

Most helpful comment

@colings86, @clintongormley

You asked us (August 19) to describe the use cases for this feature. Many people from the community posted their use cases. After more than 3 months, I think reasonable to get an official response from ElasticSearch.

As mentioned in this thread this feature is supported by Crate.io (based on ElasticSearch) and by Solr but still not supported by ES. I think we all like ES but it becomes very hard (at least for me) to understand/accept this missing feature.

Please give us a response or even better a fix!

lquerel on 30 Nov 2016

👍13

All 94 comments

@eryabitskiy this is something that I have been thinking about but requires a few outstanding issues to be resolved first. Specifically these are https://github.com/elasticsearch/elasticsearch/pull/8421 and https://github.com/elasticsearch/elasticsearch/issues/8434. These would allow us to specify much more powerful order paths and the getProperty method on the scripted metric aggregation could be used to retrieve arbitrary properties of the scripts results.

colings86 on 17 Nov 2014

Hi there, I'm not sure if this is appropriate but I thought you may want to gauge interest. We're very keen to see this as well. Currently we end up doing a lot of sorting on oversized resultsets in Go, whereas being able to sort on scripted metrics would save us this hassle.

snikch on 4 Dec 2014

+1 This would be great if solved!

vb3 on 9 May 2015

felipegs on 9 May 2015

concordiadiscors on 12 May 2015

:+1: sort on scripted metrics aggregations would be a killer feature. In the meantime we're also doing client-side sorting on oversized results.

rmruano on 12 May 2015

👍3

hendrik-schumacher on 13 May 2015

aaneja on 18 May 2015

edigu on 20 May 2015

brianstebar2 on 27 May 2015

+1 Aside from the other use-cases mentioned here, this feature would give the Kibana product some major strength, particularly for rapid prototyping.

tekwiz on 15 Jun 2015

genme on 26 Jun 2015

adimasuhid on 10 Aug 2015

(hopping on the "+1 to this feature" bandwagon)

iantruslove on 26 Aug 2015

mpereira on 26 Aug 2015

vhiroki on 10 Sep 2015

EvgenyGusarov on 23 Sep 2015

lquerel on 14 Oct 2015

bkj on 15 Oct 2015

mariomara on 20 Oct 2015

ssickles on 10 Nov 2015

antmat on 11 Nov 2015

vasiliy-kiryukhin on 2 Dec 2015

Regarding the computation of weighted average I sent a pull request to extend the current avg aggregation in order to support this kind of average natively without scripted metric.

Here the issue created on github:
https://github.com/elastic/elasticsearch/issues/15731

Here the pull request (only the first approach is implemented):
https://github.com/elastic/elasticsearch/pull/15781

lquerel on 6 Jan 2016

Regarding the support of the order clause for scripted metric I sent a pull request to support that.

https://github.com/elastic/elasticsearch/pull/15718

lquerel on 6 Jan 2016

orrchen on 18 Jan 2016

justinmpier on 1 Feb 2016

jayhahn on 1 Feb 2016

gchen on 3 Feb 2016

rajagopals on 10 Feb 2016

Hexer338 on 3 Mar 2016

raphaelMalie on 6 Mar 2016

dhaval1986 on 11 Mar 2016

Looking forward for this one...

Vineeth-Mohan on 18 Mar 2016

zzheka on 18 Mar 2016

@elasticsearch-guys, Is there any chance to integrate quickly my pull request or an improvement of it.

lquerel on 18 Mar 2016

👍6

praeto on 7 Apr 2016

appasahebs on 13 Apr 2016

golubev on 18 Apr 2016

I realy need this feature please add it :)
+1

Barlog951 on 25 Apr 2016

+1 It would be a great feature

jschwindt on 4 May 2016

brenuart on 10 May 2016

lukaszkarpeta on 10 May 2016

Is there any other approach to this other than client side sorting at present? Trying to sort on a user -specified weighted average on the fly.

JonPeel on 22 May 2016

+1, (and a pretty please) 😄

chevin99 on 14 Jun 2016

giocaputo on 29 Jun 2016

vipulbhavsar94 on 2 Aug 2016

boatmeme on 12 Aug 2016

👍1

zaobao on 17 Aug 2016

I would like to understand the use cases when users are wanting to order by a scripted_metric aggregation. Could users who currently order by a scripted_metric aggregation please reply with a short description of what the scripted metric aggregation is doing and what the overall description of the search request is? (if you can please also include the request body and the scripts, probably in a gist linked on your comment rather than inlined so it's easy to read)

colings86 on 19 Aug 2016

My use case: I have classifieds in cities. When a user is on a specific city, I want to display the total number of classifieds in cities arround (< 50km), and sort these neighbors cities by distance.

So basically I do a term aggregation to group the classifieds per cities (so I can get the total number of classifieds per cities), and then I do a scripted_metric aggregation to get the distance from the origin city for each city group I get, because I couldn't find an other way to achieve this. I need finally to sort the initial aggregation with the distance I got.

My query: https://gist.github.com/raphaelMalie/203e411fddb64d75444014d3f28e6f8e

raphaelMalie on 19 Aug 2016

@colings86 Here's a simplified example of what I'm trying to do: http://stackoverflow.com/questions/37799130/elasticsearch-sort-terms-by-pipeline-bucket-script-aggregation

As another example to clarify, say I have 10,000 names of students and I want to build a table showing the top 10 students based on grades, and grades are determined by combining values of multiple fields. So I want to be able to do a terms agg on the name and order it by the grade (a bucket_script agg that calculates multiple fields). Without ordering them, I would have to return all 10,000 students with their grades and do a client-side sort to see just the top 10.

Thanks. 👍

chevin99 on 19 Aug 2016

@raphaelMalie I commented on your gist with a suggestion of how you might be able to avoid the scripted_metric agg in your case (commented there so we could have a conversation there if necessary rather than it getting lost on this thread).

Thanks to @clintongormley for helping me with the solution

colings86 on 19 Aug 2016

@chevin99 in your grades example what do the individual documents in the index represent? (e.g. a student and their scores on all tests)

colings86 on 19 Aug 2016

@colings86 each doc would represent a test score and other info about the test that would be used to calculate the overall score/grade.

Does this help? https://gist.github.com/chevin99/930c537be7caf2e22abdcf1086350c55

chevin99 on 19 Aug 2016

@colings86 Here is the kind of queries we implemented with scripted metrics.

weighted average (not supported by the avg function, except with a patch I wrote but not integrated). See this feature request for a more detailed description https://github.com/elastic/elasticsearch/issues/15731 and this pull request for a working (but not super clean) solution https://github.com/elastic/elasticsearch/pull/15781
simple arithmetic operations on fields of documents (like sum(field1)/sum(field2)) to compute some kind of average, see this gist for an example of query https://gist.github.com/lquerel/08c98e7a870d4631214d58a6c8efdda1

For all these queries we use the clause "ORDER BY" to compute a top 10 for example.

IMO a better way to solve these different queries could be to support weighted average natively and extend the avg function to support other variant of average like the second example. In addition for our case we don't really need a fully accurate top 10 of something, we just need a good approximation (probably a common use case).

lquerel on 22 Aug 2016

@colings86 This is my use case, Can't find a way of doing something like this in ES 2.3.5 currently.

SELECT TOP 3 Name, SUM(Value) [Value] FROM @table
GROUP BY Name 
ORDER BY ABS(SUM(Value)) DESC

Take top N results ordered by aggregation with applied function (ABS in this case). So aggregation is performed first (using scripted_metric map reduce) and then it needs to be ordered by the result to get top 3.

More info here
SQL Style GroupBy & OrderBy with function in ES?

madamowski on 26 Aug 2016

@colings86 So motivation of my issue is still actual:

1) We have a timeline of activity for each Topic.
2) Timeline is stored in documents per day for each topic separately.
3) To get activity for some time period over multiple topics we use TERMS aggregation by topicID and sub scripted aggregations for each activity field..
4) The problem that in this case you get activity for all topics and it is too much. We need some pagination to cut down number of topics calculated at one time. And for that we need sorting (sorting over topicID doesn't help, because we need most active topics first).

I mean there is size property for TERMS aggregation, But it is very limited in use, while you can sort only by terms.

P.S. Also offset would be nice for pagination... but sorting is a very good start.
P.S.2 Now we are loading tuples topicID+sortingFiledValue to WebServer, do sort+paginate and load page topics. Any functionality that helps to move sorting/pagination to ES would be useful.

eryabitskiy on 27 Aug 2016

Hi @colings86!

You can look at https://discuss.elastic.co/t/ordering-terms-aggregation/47791 to see my use case for example:

Let's suppose that we have documents with fields (group_id, price, x,...). Our goal is to:

run a query that filters the whole documents set;

group documents by group_id;

in each group fetch the document with the highest x value, let's call it a "group's representative";

take that document's price and order all groups by that price (order groups by the price of theirs representatives).

I also tried to use scripted_metric to implement this query. I succeeded with calculation of the desired scripted metric representative_price for each bucket, but when it came to ordering terms by it, I found it to be impossible.

golubev on 29 Aug 2016

Hi @colings86

Is there any progress/plan for this issue?

ElasticSearch 5.0 is there, but still no support of "order by" on scripted metric.

lquerel on 28 Oct 2016

👍10

cszhouwei on 15 Nov 2016

jjj8jay on 15 Nov 2016

lubp123 on 17 Nov 2016

baptistemanson on 17 Nov 2016

+1024!

Hi @colings86, here's a very simple use case.

My ES index is made of documents that store:

A date (a full day)
A referer
The amount of displays of an ad on that referer
The amount of clicks on that ad on that referer

I want to display the top 10 referers having the best conversion rate between impressions and clicks. I may have 10000+ referers.
Here's the query sample:
https://gist.github.com/bpolaszek/00fec1dedc0ed02263f2328a9c0cd161

This comes with #4915.
Of course I could store the conversion rate on each doc, but CTR is just one of many metrics I need and I have a huge amount of data. And inline scripts are intended for that :)

Besides, grouping, sorting and slicing are available in the SQL world, despite choosing a NoSQL engine for rendering analytics is obvious in 2016.

Today, I have to silently load an arbitrary amount of aggregations (999999 to be _almost_ sure I have all my referers, sort the result myself, and slice the aggregation result from 10 to 20 if the end user goes on page 2, for instance). This is really annoying.

These 2 features are really mandatory if we want ES to be a real BI engine - these features exist in Apache SolR for a few years, for instance. But I love ES and I'd rather wait for these features to be implemented :)

Thank you much for your consideration,
Ben

bpolaszek on 21 Nov 2016

👍5

@colings86, @clintongormley

Please give us a response or even better a fix!

lquerel on 30 Nov 2016

👍13

Sorting the terms aggregation by any sub aggregation (with the exception of min and max) has un-bounded error for the same reason as sorting by ascending count has unbounded error which is described in https://github.com/elastic/elasticsearch/issues/17614#issuecomment-244114209. This issue is also discussed in https://github.com/elastic/elasticsearch/issues/17588. To be clear, the problem here isn't that we calculate a value with some error to the true value, as we do this with other aggregations such as the cardinality aggregation and indeed with the terms aggregation when sorted by descending document count. The difference between sorting the terms aggregation by a sub-aggregation and other aggs that given an approximation to the true value is that in the sorting by sub-aggregation case the error is unbounded (i.e. we have no knowledge of how big the error might be) whereas with the other aggregations mentioned we can either calculate the worst-case error, or we have a good idea of the expected error. This means that the results returned when ordering terms by a sub-aggregation are not even a good approximation to the true values.

Currently the sorting parameter for the terms aggregation can be _count, _term an aggregation that produces a single number. The scripted_metric aggregation can return any object (including complex structures) and is not confined to return a numeric value. Changing the sorting to support other aggregations outside the above constraints is non-trivial and would make the sorting code much more complex (for example one challenge would be that we would need to have some way of making sure that the values the aggregations return are sortable).

The scripted_metric aggregation was intended to be used mainly for prototyping and we did not envisage it being used in this way. Because this would add more cases where the error in the terms aggregation is unbounded and since adding support for this would be non-trivial we don't feel it’s correct to add this feature. What we would like to do instead is work towards adding support for the mentioned use cases in other ways, like adding new aggregations that support them.

From the comments above there seem to be three main use cases:

Grouping on a terms and sorting the groups by a calculation involving the result of multiple aggregation on the group's contents. This is a tricky problem to solve for the same reasons as listed above.
Weighted Average Aggregation - This could be implemented as a new aggregation type and there is an issue discussing this here: https://github.com/elastic/elasticsearch/issues/15731
The equivalent of SQL Group by - there are a few different forms of this listed here:
- SQL group by sorted by a field value - we have a proposal for implementing this functionality here: https://github.com/elastic/elasticsearch/issues/21833
- SQL group by with an aggregation with groups sorted by that aggregation - this is actually very similar to 1.

colings86 on 7 Dec 2016

@colings86

Just want to make things clear.

Sorting the terms aggregation by any sub aggregation (with the exception of min and max) has un-bounded error for the same reason as sorting by ascending count has unbounded error which is described in....

At same time there is an example from documentation that demonstrates sorting also by AVG aggregation not only min and max.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order
"order" : { "avg_play_count" : "desc" }

QUESTION: Do I understand you right that this example from documentation with avg is malfunction and could return wrong results?

Otherwise, I don't see any advantage of avg aggregation over scripted aggregation returning some primitive float/double value (In case of min and max, I for sure see difference).

eryabitskiy on 7 Dec 2016

QUESTION: Do I understand you right that this example from documentation with avg is malfunction and could return wrong results?

Yes, that is true and that is a bug in the documentation which I will fix to be clearer. Also note that there is a note at the top that section which warns that sorting by a sub aggregation is discouraged.

Otherwise, I don't see any advantage of avg aggregation over scripted aggregation returning some primitive float/double value

There is no advantage and sorting the terms aggregation by an average aggregation has the same pitfalls in this regard.

colings86 on 7 Dec 2016

I've opened https://github.com/elastic/elasticsearch/pull/22032 to address the documentation issue

colings86 on 7 Dec 2016

@colings86

Thank you for your answer!

I understand that there is always a trade of between performance and accuracy especially in FTS engine. But if you fetch all buckets from shard you can always reach 100% accuracy.

And I suspect, that actually most of folks would be ok with all buckets returned from every shard for the sake of 100% accuracy over performance and memory consuming (we can make a vote to check it). We do it already on Java server side anyway.

Also you can always set some protector max size property that triggers an error on too much memory consumption during such queries.

Can you consider such feature?

eryabitskiy on 7 Dec 2016

👍2

+1.

For my use case, I actually need to fetch all buckets, then sort and slice on client-side.
This is not only destroying performance, this also destroys developer's experience.

bpolaszek on 7 Dec 2016

👍2

And I suspect, that actually most of folks would be ok with all buckets returned from every shard for the sake of 100% accuracy over performance and memory consuming (we can make a vote to check it).

We only add features to Elasticsearch which are horizontally scalable. Whatever we add should work when you're running one node on your laptop with 50GB of data or 1000 nodes in your data server with 50 PB of data.

Fetching all terms from all shards does not scale horizontally, and so we will not add it.

We do it already on Java server side anyway.

Exactly. This is a problem that should be solved client side instead (where you know the limits of your data and how much you will need to scale).

clintongormley on 7 Dec 2016

I see...

Than I can mention one horizontally scalable use case: if each bucket is fully allocated withing one shard (by using terms field also as a routing) you can simply fetch only top buckets from shards and still get accurate results. Unfortunately it is only works together with proper routing.

eryabitskiy on 7 Dec 2016

@colings86

In order to be able to use an aggregation for sorting the terms aggregation it must _currently_ be a numeric metric aggregation. Pipeline Aggregations are a different family of aggregations than the Metric Aggregations and cannot be used for sorting. You couldn't sort using a Pipeline aggregation because they are only executed in the reduce phase on the coordinating node and so you do not have the information on the shard in order to be able to sort the shards buckets to pick the top N to send to the coordinating node.

Is there any plan that will fulfill this feature?
Is there any plan that will remove the restriction about numeric-metric-aggregation ?
Is there any plan that will support pipeline-aggregation-metric on node or shard?

In my opinion, these aggregations, relevant to calculation like sum/min/max/avg/count etc, are all metrics event the result is not in numeric format. For example:
{ "first_student":{ "min": { "field": "first_name" } } }
It will help a lot that elasticsearch suport different types of metric widely. Thus it will be more powerful than traditional SQL-data-analysis tools.

anhzhi on 19 Sep 2017

I'm currently moving back my analytics to an SQL platform because of this.
Simple and frequent use case: compute progressions between aggregated metrics (i.e CTR : +15%) and sort by progression. Impossible to do with ES without storing progression first.

bpolaszek on 19 Sep 2017

@colings86

Currently, I will try to translate different kinds of bucket-script-metric into the forms of map-reduce-script in a general method, then ordering-phase about these metrics can be supported.
I think it needs large-mount or repeated calculation, which will lead to performance destroying and relatively-low efficiency.
Another solution is that the client sides execute the ordering-phase. This solution requires the huge (otherwise the order result will be in poor accuracy) results derived from elaticsearch 's aggregation.
Considering the big-data transfering and lacking of distributed-calculation-support about the client side, this solution will be more ineffective.

anhzhi on 19 Sep 2017

Another solution is that the client sides execute the ordering-phase. This solution requires the huge (otherwise the order result will be in poor accuracy) results derived from elaticsearch 's aggregation.

This requires the knowledge of how much values the aggregation will return. You can not ask ES to "return all buckets" on an aggregation and you have to specify an arbitrary limit number, which is completely nonsense and performance-killing.

bpolaszek on 19 Sep 2017

@anhzhi I'm having trouble finding the text that you quoted for context (sorry if I've missed it in the comments above), could you paste the link to the comment or documentation you saw this in? The statement is however very true.

Is there any plan that will fulfill this feature?

If you are talking about being able to sort the terms aggregation by a pipeline aggregation then I don't see how this would be possible to implement. The terms aggregation needs to sort the buckets collected on the shard so it can return only the shard_size number of them to the coordinating node. If it did not do this sorting on the shard it would have to return all the buckets it collected to the coordinating node which would not scale since the number of buckets can be very large for high cardinality fields.

So, given the fact that we have to sort on the shard the only other solution would be to 'push down' the calculation of pipeline aggregations onto the shards. But this would not work either since pipeline aggregations at their heart compare the final results of different aggregations (think derivative where you are comparing the final values in consecutive buckets, or dividing two sum aggregations). On the shard the final result is not known because that result depends on the results from all the shards not just one, so if you calculated the value just based on the information on that shard the result would almost certainly be wrong and worse it could be wildly affected by the results of the other shards so would not even be a good estimate of the final value.

So pushing down the calculation of pipeline aggregation onto the shards doesn't help with sorting the terms aggregation by pipeline aggregations either.

If you were referring instead to sorting the terms aggregation by the scripted_metric aggregations, it was discussed in https://github.com/elastic/elasticsearch/issues/8486#issuecomment-265457470 and we decided that rather than we would rather not add more cases where the error in the terms aggregation is unbounded so we would instead like to try to solve the specific use cases for the scripted_metric aggregation.

Is there any plan that will remove the restriction about numeric-metric-aggregation ?

There are no plans at this time. We need a value to compare the buckets in order to sort them and a numeric value gives us a value that can always be compared without complicating the API. On the other hand arbitrary objects do not naturally compare so the API would need to be made more complex and harder to use to support this.

Is there any plan that will support pipeline-aggregation-metric on node or shard?

See my comments to your first question above

In my opinion, these aggregations, relevant to calculation like sum/min/max/avg/count etc, are all metrics event the result is not in numeric format. For example:
{ "first_student":{ "min": { "field": "first_name" } } }

I'm not sure I understand this since the output of the min aggregation is a singular double value. For the rest API we do wrap this up in an object so you can navigate to it in the response but for the purposes of comparing when sorting it is a singular double value, and importantly we know that it will always be a singular double value so we can validate it in the request before we execute the request.

Another solution is that the client sides execute the ordering-phase. This solution requires the huge (otherwise the order result will be in poor accuracy) results derived from elasticsearch 's aggregation.
Considering the big-data transferring and lacking of distributed-calculation-support about the client side, this solution will be more ineffective.

Unfortunately this is also true for Elasticsearch itself because of the reasons I outlined answering your first question; that we need the information from all shards to be able to calculate the results of the pipeline aggregations in order to be able to do the sorting.

I would love to find a solution to this but I don't see how we could do so at this stage in a reliable, scalable way within the constraints of sharding.

colings86 on 19 Sep 2017

@colings86 The quoted text is from
https://github.com/elastic/elasticsearch/issues/14101 or
https://discuss.elastic.co/t/ordering-terms-aggregation-based-on-pipeline-metric/31839/7
I am refering to both aspects you mentioned:

sorting the terms aggregation by a pipeline aggregation， such like bucket_script
sorting the terms aggregation by the scripted_metric aggregations
example as follows:

{
   "query": {
      "bool": {
         "filter": [
            {
               "query_string": {
                  "query": "*",
                  "analyze_wildcard": true
               }
            }
         ],
         "must": [
            {
               "range": {
                  "@timestamp": {
                     "to": 1505804143152,
                     "from": 1505800543152
                  }
               }
            }
         ]
      }
   },
   "from": 0,
   "aggs": {
      "http.client_ip.topn": {
         "terms": {
            "field": "http.client_ip",
            "order":[{
               # only normal metric aggregation supported
               # while script metric aggregation or bucket-script pipeline aggregation not supported
                "sum_bytes_total": "desc"
            }],
            "size": 10
         },
         "aggs": {
            "sum_bytes_total": {
               "sum": {
                  "script": "doc['http.bytes_in'].value+doc['http.bytes_out'].value"
               }
            },
            "bucket_script_metric": {
               "bucket_script": {
                  "buckets_path": {
                     "sum_bytes_total": "sum_bytes_total"
                  },
                  "script": {
                     "lang": "groovy",
                     "inline": "sum_bytes_total"
                  }
               }
            },
            "mr_script_metric": {
               "scripted_metric": {
                  "init_script": "params._agg.bytes_total = []",
                  "map_script": "params._agg.bytes_total.add(doc['http.bytes_in'].value+doc['http.bytes_out'].value)",
                  "combine_script": "double bytes_total = 0; for (item in params._agg.bytes_total) { bytes_total += item } return bytes_total",
                  "reduce_script": "double bytes_total = 0; for (a in params._aggs) { bytes_total += a } return bytes_total"
               }
            }
         }
      }
   },
   "size": 0
}

Now i get what you say. I think it needs some more boost improvements for distributed-calculation algorithms and software frameworks of ElasticSearch.
I'm still eagerly expecting for your new version, new innovations and new solutions about big data analysis.

anhzhi on 20 Sep 2017

👍3

@anhzhi , totally agreed. Distributed calculation is a big miss for any big data analytics. At this exactly moment, I need to do such thing, but I just got disappointed when I reached and read this thread.

tberne on 21 Dec 2017

@tberne @anzhi I have moved away from ElasticSearch because of this: needed to sort on scripted metrics and create range facets. Since ES was unable to provide this, I used to retrieve all documents and do that logic on the PHP side, which was completely counterproductive and performance-killing.

Last week I have delivered a new version of this application on which I worked for several months, and the storage engine is now based on MariaDb Columnstore (the only open-source SQL engine that stores data in columns instead of rows). Performance is not as good as ElasticSearch (and INSERT/UPDATE/DELETE take ages), but at least, creating scripted metrics and sorting on them (SELECT SUM(bytes_in + bytes_out) AS sum_bytes_total [...] ORDER BY sum_bytes_total DESC) is a piece of cake and performance is acceptable (queries about ~150ms on tables with dozens of millions rows). Of course, you don't have facets (aggregations) out of the box and you have to create them by yourself, which is more complicated to do than ES, but at least it works, and all the logic is done within the storage engine; no more tricky things to complete a non-supported behavior.

I formerly thought ElasticSearch was the good choice as the main engine of an BI application, but I was wrong: ElasticSearch is perfect for full-text, not for analytics.

bpolaszek on 22 Dec 2017

👍6

I formerly thought ElasticSearch was the good choice as the main engine of an BI application, but I was wrong: ElasticSearch is perfect for full-text, not for analytics.

I couldn't agree more. ES is just perfect when the thing is Full Text Search. Even with some simple aggregation scenario, it is a killing tool. But when we start doing some serious analysis, it lacks some major things.

I don't know yet, but I think I will keep ES as a big data storage and, perhaps, I will introduce the use of some stream processor (like Apache Flink) to consolidate some data back to ES.

tberne on 22 Dec 2017

@bpolaszek - Cant you get the conversion rate per referer per ad using the script provision in sum aggregation and sum it using the same. This can be used later to order the referers.
Something like below is what I have in mind -

{
   "aggs" : {
             "refereres" : {
                         "terms" : { "field" : "referer" },
                         "order" : "sumOfConvertion" 
                      },
                 "aggs" : {
                        "convertionSum" : {
                                    "sum" : { "script" : " doc['clicks'].value / doc['views'].value "}
                            }
                 }
}

Vineeth-Mohan on 24 Dec 2017

There's a solution with ES that still requires client-side sorting, but should be more efficient for retrieving all the results you need in order to sort them: try the composite agg which allows you to retrieve all results (with pagination) and then you can do a merge sort client side.

clintongormley on 9 Jan 2018

👍1

Very interesting discussion! I do wonder how Solr solved this efficiently?

Are there any new feature that would allow this, without retrieving all results and sorting client side? Our use case is similar: calculate CTR (clicks / impressions) on aggregated fields and sort results by the highest values.

rathko on 18 Jan 2019

Very interesting discussion! I do wonder how Solr solved this efficiently?

Are there any new feature that would allow this, without retrieving all results and sorting client side? Our use case is similar: calculate CTR (clicks / impressions) on aggregated fields and sort results by the highest values.

AFAIK Solr did not solve this either.

bpolaszek on 19 Jan 2019

Thanks @bpolaszek

It appears in our case we can shard/route the data, which means that data for every search query will always live on the same server and there would be no need to shuffle it around the cluster.

We are considering to write a custom plugin to allow us sorting on calculated fields. Any suggestions on how this could be achieved, given the above routing limitation? I had a look at the existing plugins but couldn't find anything similar.

rathko on 21 Jan 2019

pplant on 15 Feb 2019

Issue +1. Help add support

xiaojiu01 on 23 Dec 2019

Are there any workarounds for the moment?

damianoporta on 17 Apr 2020

Because there is still a some activity on this bug, I like to mention a solution:

We added transform in 7.2 which you can use to group data, aggregate it and write it back into secondary index. It provides a lot of freedom, e.g. scripted metric. At least the original ask of the thread starter can be solved using a transform. In the first step you can group the data and apply the custom aggregation (scripted metric). Afterwards the destination can be queried to get the top-n results.

This solution is scalable and works with large amount of data. Starting with 7.3 the transform can run in continuous mode and update the destination.