Since there is a new Scripted metric aggregation (scripted_metric) in 1.4, it is possible to do a lot of amazing stuff.
For example it is possible to implement Weighted Average aggregation, which we were missing before.
Now we are really missing a possibility to sort by scripted_metric results.
Live example:
We calculate weightedAvgVis with scripted_metric and want to get ids with TOP 5 values of weightedAvgVis. Since script returns double, it looks logically possible.
{
"from":0,
"size":0,
"query":{
"match_all":{ }
},
"aggregations":{
"idNodes":{
"terms":{
"field":"id",
"size":5,
"order":{
"weightedAvgVis":"asc"
}
},
"aggregations":{
"weightedAvgVis":{
"scripted_metric":{
"init_script":"_agg['weightedSum'] = 0d; _agg['countSum'] = 0L;",
"map_script":"_agg['weightedSum'] = _agg.weightedSum + _source['avgVis'] * _source['count']; _agg['countSum'] = _agg.countSum + _source['count'];",
"reduce_script":"weightedSum = 0d; countSum = 0L; for(a in _aggs) {weightedSum += a.weightedSum; countSum += a.countSum;};if(countSum == 0L) {return 0d;} else {return weightedSum / countSum}"
}
}
}
}
}
}
@eryabitskiy this is something that I have been thinking about but requires a few outstanding issues to be resolved first. Specifically these are https://github.com/elasticsearch/elasticsearch/pull/8421 and https://github.com/elasticsearch/elasticsearch/issues/8434. These would allow us to specify much more powerful order paths and the getProperty method on the scripted metric aggregation could be used to retrieve arbitrary properties of the scripts results.
Hi there, I'm not sure if this is appropriate but I thought you may want to gauge interest. We're very keen to see this as well. Currently we end up doing a lot of sorting on oversized resultsets in Go, whereas being able to sort on scripted metrics would save us this hassle.
+1 This would be great if solved!
+1
+1
:+1: sort on scripted metrics aggregations would be a killer feature. In the meantime we're also doing client-side sorting on oversized results.
+1
+1
+1
+1
+1 Aside from the other use-cases mentioned here, this feature would give the Kibana product some major strength, particularly for rapid prototyping.
+1
+1
(hopping on the "+1 to this feature" bandwagon)
+1
+1
+1
+1
+1
+1
+1
+1
+1
Regarding the computation of weighted average I sent a pull request to extend the current avg aggregation in order to support this kind of average natively without scripted metric.
Here the issue created on github:
https://github.com/elastic/elasticsearch/issues/15731
Here the pull request (only the first approach is implemented):
https://github.com/elastic/elasticsearch/pull/15781
Regarding the support of the order clause for scripted metric I sent a pull request to support that.
+1
+1
+1
+1
+1
+1
+1
+1
Looking forward for this one...
+1
@elasticsearch-guys, Is there any chance to integrate quickly my pull request or an improvement of it.
+1
+1
+1
+1
I realy need this feature please add it :)
+1
+1 It would be a great feature
+1
+1
+1
Is there any other approach to this other than client side sorting at present? Trying to sort on a user -specified weighted average on the fly.
+1, (and a pretty please) 😄
+1
+1
+1
+1
I would like to understand the use cases when users are wanting to order by a scripted_metric
aggregation. Could users who currently order by a scripted_metric
aggregation please reply with a short description of what the scripted metric aggregation is doing and what the overall description of the search request is? (if you can please also include the request body and the scripts, probably in a gist linked on your comment rather than inlined so it's easy to read)
My use case: I have classifieds in cities. When a user is on a specific city, I want to display the total number of classifieds in cities arround (< 50km), and sort these neighbors cities by distance.
So basically I do a term aggregation to group the classifieds per cities (so I can get the total number of classifieds per cities), and then I do a scripted_metric aggregation to get the distance from the origin city for each city group I get, because I couldn't find an other way to achieve this. I need finally to sort the initial aggregation with the distance I got.
My query: https://gist.github.com/raphaelMalie/203e411fddb64d75444014d3f28e6f8e
@colings86 Here's a simplified example of what I'm trying to do: http://stackoverflow.com/questions/37799130/elasticsearch-sort-terms-by-pipeline-bucket-script-aggregation
As another example to clarify, say I have 10,000 names of students and I want to build a table showing the top 10 students based on grades, and grades are determined by combining values of multiple fields. So I want to be able to do a terms agg on the name and order it by the grade (a bucket_script agg that calculates multiple fields). Without ordering them, I would have to return all 10,000 students with their grades and do a client-side sort to see just the top 10.
Thanks. 👍
@raphaelMalie I commented on your gist with a suggestion of how you might be able to avoid the scripted_metric
agg in your case (commented there so we could have a conversation there if necessary rather than it getting lost on this thread).
Thanks to @clintongormley for helping me with the solution
@chevin99 in your grades example what do the individual documents in the index represent? (e.g. a student and their scores on all tests)
@colings86 each doc would represent a test score and other info about the test that would be used to calculate the overall score/grade.
Does this help? https://gist.github.com/chevin99/930c537be7caf2e22abdcf1086350c55
@colings86 Here is the kind of queries we implemented with scripted metrics.
For all these queries we use the clause "ORDER BY" to compute a top 10 for example.
IMO a better way to solve these different queries could be to support weighted average natively and extend the avg function to support other variant of average like the second example. In addition for our case we don't really need a fully accurate top 10 of something, we just need a good approximation (probably a common use case).
@colings86 This is my use case, Can't find a way of doing something like this in ES 2.3.5 currently.
SELECT TOP 3 Name, SUM(Value) [Value] FROM @table
GROUP BY Name
ORDER BY ABS(SUM(Value)) DESC
Take top N results ordered by aggregation with applied function (ABS in this case). So aggregation is performed first (using scripted_metric map reduce) and then it needs to be ordered by the result to get top 3.
More info here
SQL Style GroupBy & OrderBy with function in ES?
@colings86 So motivation of my issue is still actual:
1) We have a timeline of activity for each Topic.
2) Timeline is stored in documents per day for each topic separately.
3) To get activity for some time period over multiple topics we use TERMS aggregation by topicID and sub scripted aggregations for each activity field..
4) The problem that in this case you get activity for all topics and it is too much. We need some pagination to cut down number of topics calculated at one time. And for that we need sorting (sorting over topicID doesn't help, because we need most active topics first).
I mean there is size property for TERMS aggregation, But it is very limited in use, while you can sort only by terms.
P.S. Also offset would be nice for pagination... but sorting is a very good start.
P.S.2 Now we are loading tuples topicID+sortingFiledValue to WebServer, do sort+paginate and load page topics. Any functionality that helps to move sorting/pagination to ES would be useful.
Hi @colings86!
You can look at https://discuss.elastic.co/t/ordering-terms-aggregation/47791 to see my use case for example:
Let's suppose that we have documents with fields
(group_id, price, x,...)
. Our goal is to:
- run a query that filters the whole documents set;
- group documents by
group_id
;- in each group fetch the document with the highest
x
value, let's call it a "group's representative";- take that document's
price
and order all groups by thatprice
(order groups by the price of theirs representatives).
I also tried to use scripted_metric
to implement this query. I succeeded with calculation of the desired scripted metric representative_price
for each bucket, but when it came to ordering terms
by it, I found it to be impossible.
Hi @colings86
Is there any progress/plan for this issue?
ElasticSearch 5.0 is there, but still no support of "order by" on scripted metric.
+1
+1
+1
+1
+1024!
Hi @colings86, here's a very simple use case.
My ES index is made of documents that store:
I want to display the top 10 referers having the best conversion rate between impressions and clicks. I may have 10000+ referers.
Here's the query sample:
https://gist.github.com/bpolaszek/00fec1dedc0ed02263f2328a9c0cd161
This comes with #4915.
Of course I could store the conversion rate on each doc, but CTR is just one of many metrics I need and I have a huge amount of data. And inline scripts are intended for that :)
Besides, grouping, sorting and slicing are available in the SQL world, despite choosing a NoSQL engine for rendering analytics is obvious in 2016.
Today, I have to silently load an arbitrary amount of aggregations (999999 to be _almost_ sure I have all my referers, sort the result myself, and slice the aggregation result from 10 to 20 if the end user goes on page 2, for instance). This is really annoying.
These 2 features are really mandatory if we want ES to be a real BI engine - these features exist in Apache SolR for a few years, for instance. But I love ES and I'd rather wait for these features to be implemented :)
Thank you much for your consideration,
Ben
@colings86, @clintongormley
You asked us (August 19) to describe the use cases for this feature. Many people from the community posted their use cases. After more than 3 months, I think reasonable to get an official response from ElasticSearch.
As mentioned in this thread this feature is supported by Crate.io (based on ElasticSearch) and by Solr but still not supported by ES. I think we all like ES but it becomes very hard (at least for me) to understand/accept this missing feature.
Please give us a response or even better a fix!
Sorting the terms aggregation by any sub aggregation (with the exception of min and max) has un-bounded error for the same reason as sorting by ascending count has unbounded error which is described in https://github.com/elastic/elasticsearch/issues/17614#issuecomment-244114209. This issue is also discussed in https://github.com/elastic/elasticsearch/issues/17588. To be clear, the problem here isn't that we calculate a value with some error to the true value, as we do this with other aggregations such as the cardinality aggregation and indeed with the terms aggregation when sorted by descending document count. The difference between sorting the terms aggregation by a sub-aggregation and other aggs that given an approximation to the true value is that in the sorting by sub-aggregation case the error is unbounded (i.e. we have no knowledge of how big the error might be) whereas with the other aggregations mentioned we can either calculate the worst-case error, or we have a good idea of the expected error. This means that the results returned when ordering terms by a sub-aggregation are not even a good approximation to the true values.
Currently the sorting parameter for the terms aggregation can be _count
, _term
an aggregation that produces a single number. The scripted_metric
aggregation can return any object (including complex structures) and is not confined to return a numeric value. Changing the sorting to support other aggregations outside the above constraints is non-trivial and would make the sorting code much more complex (for example one challenge would be that we would need to have some way of making sure that the values the aggregations return are sortable).
The scripted_metric
aggregation was intended to be used mainly for prototyping and we did not envisage it being used in this way. Because this would add more cases where the error in the terms aggregation is unbounded and since adding support for this would be non-trivial we don't feel it’s correct to add this feature. What we would like to do instead is work towards adding support for the mentioned use cases in other ways, like adding new aggregations that support them.
From the comments above there seem to be three main use cases:
@colings86
Just want to make things clear.
Sorting the terms aggregation by any sub aggregation (with the exception of min and max) has un-bounded error for the same reason as sorting by ascending count has unbounded error which is described in....
At same time there is an example from documentation that demonstrates sorting also by AVG aggregation not only min and max.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order
"order" : { "avg_play_count" : "desc" }
QUESTION: Do I understand you right that this example from documentation with avg is malfunction and could return wrong results?
Otherwise, I don't see any advantage of avg aggregation over scripted aggregation returning some primitive float/double value (In case of min and max, I for sure see difference).
QUESTION: Do I understand you right that this example from documentation with avg is malfunction and could return wrong results?
Yes, that is true and that is a bug in the documentation which I will fix to be clearer. Also note that there is a note at the top that section which warns that sorting by a sub aggregation is discouraged.
Otherwise, I don't see any advantage of avg aggregation over scripted aggregation returning some primitive float/double value
There is no advantage and sorting the terms aggregation by an average aggregation has the same pitfalls in this regard.
I've opened https://github.com/elastic/elasticsearch/pull/22032 to address the documentation issue
@colings86
Thank you for your answer!
I understand that there is always a trade of between performance and accuracy especially in FTS engine. But if you fetch all buckets from shard you can always reach 100% accuracy.
And I suspect, that actually most of folks would be ok with all buckets returned from every shard for the sake of 100% accuracy over performance and memory consuming (we can make a vote to check it). We do it already on Java server side anyway.
Also you can always set some protector max size property that triggers an error on too much memory consumption during such queries.
Can you consider such feature?
+1.
For my use case, I actually need to fetch all buckets, then sort and slice on client-side.
This is not only destroying performance, this also destroys developer's experience.
And I suspect, that actually most of folks would be ok with all buckets returned from every shard for the sake of 100% accuracy over performance and memory consuming (we can make a vote to check it).
We only add features to Elasticsearch which are horizontally scalable. Whatever we add should work when you're running one node on your laptop with 50GB of data or 1000 nodes in your data server with 50 PB of data.
Fetching all terms from all shards does not scale horizontally, and so we will not add it.
We do it already on Java server side anyway.
Exactly. This is a problem that should be solved client side instead (where you know the limits of your data and how much you will need to scale).
I see...
Than I can mention one horizontally scalable use case: if each bucket is fully allocated withing one shard (by using terms field also as a routing) you can simply fetch only top buckets from shards and still get accurate results. Unfortunately it is only works together with proper routing.
@colings86
In order to be able to use an aggregation for sorting the terms aggregation it must _currently_ be a numeric metric aggregation. Pipeline Aggregations are a different family of aggregations than the Metric Aggregations and cannot be used for sorting. You couldn't sort using a Pipeline aggregation because they are only executed in the reduce phase on the coordinating node and so you do not have the information on the shard in order to be able to sort the shards buckets to pick the top N to send to the coordinating node.
Is there any plan that will fulfill this feature?
Is there any plan that will remove the restriction about numeric-metric-aggregation ?
Is there any plan that will support pipeline-aggregation-metric on node or shard?
In my opinion, these aggregations, relevant to calculation like sum/min/max/avg/count etc, are all metrics event the result is not in numeric format. For example:
{
"first_student":{
"min": {
"field": "first_name"
}
}
}
It will help a lot that elasticsearch suport different types of metric widely. Thus it will be more powerful than traditional SQL-data-analysis tools.
I'm currently moving back my analytics to an SQL platform because of this.
Simple and frequent use case: compute progressions between aggregated metrics (i.e CTR : +15%) and sort by progression. Impossible to do with ES without storing progression first.
@colings86
Currently, I will try to translate different kinds of bucket-script-metric into the forms of map-reduce-script in a general method, then ordering-phase about these metrics can be supported.
I think it needs large-mount or repeated calculation, which will lead to performance destroying and relatively-low efficiency.
Another solution is that the client sides execute the ordering-phase. This solution requires the huge (otherwise the order result will be in poor accuracy) results derived from elaticsearch 's aggregation.
Considering the big-data transfering and lacking of distributed-calculation-support about the client side, this solution will be more ineffective.
Another solution is that the client sides execute the ordering-phase. This solution requires the huge (otherwise the order result will be in poor accuracy) results derived from elaticsearch 's aggregation.
This requires the knowledge of how much values the aggregation will return. You can not ask ES to "return all buckets" on an aggregation and you have to specify an arbitrary limit number, which is completely nonsense and performance-killing.
@anhzhi I'm having trouble finding the text that you quoted for context (sorry if I've missed it in the comments above), could you paste the link to the comment or documentation you saw this in? The statement is however very true.
Is there any plan that will fulfill this feature?
If you are talking about being able to sort the terms aggregation by a pipeline aggregation then I don't see how this would be possible to implement. The terms aggregation needs to sort the buckets collected on the shard so it can return only the shard_size
number of them to the coordinating node. If it did not do this sorting on the shard it would have to return all the buckets it collected to the coordinating node which would not scale since the number of buckets can be very large for high cardinality fields.
So, given the fact that we have to sort on the shard the only other solution would be to 'push down' the calculation of pipeline aggregations onto the shards. But this would not work either since pipeline aggregations at their heart compare the final results of different aggregations (think derivative where you are comparing the final values in consecutive buckets, or dividing two sum aggregations). On the shard the final result is not known because that result depends on the results from all the shards not just one, so if you calculated the value just based on the information on that shard the result would almost certainly be wrong and worse it could be wildly affected by the results of the other shards so would not even be a good estimate of the final value.
So pushing down the calculation of pipeline aggregation onto the shards doesn't help with sorting the terms aggregation by pipeline aggregations either.
If you were referring instead to sorting the terms aggregation by the scripted_metric aggregations, it was discussed in https://github.com/elastic/elasticsearch/issues/8486#issuecomment-265457470 and we decided that rather than we would rather not add more cases where the error in the terms aggregation is unbounded so we would instead like to try to solve the specific use cases for the scripted_metric aggregation.
Is there any plan that will remove the restriction about numeric-metric-aggregation ?
There are no plans at this time. We need a value to compare the buckets in order to sort them and a numeric value gives us a value that can always be compared without complicating the API. On the other hand arbitrary objects do not naturally compare so the API would need to be made more complex and harder to use to support this.
Is there any plan that will support pipeline-aggregation-metric on node or shard?
See my comments to your first question above
In my opinion, these aggregations, relevant to calculation like sum/min/max/avg/count etc, are all metrics event the result is not in numeric format. For example:
{ "first_student":{ "min": { "field": "first_name" } } }
I'm not sure I understand this since the output of the min aggregation is a singular double value. For the rest API we do wrap this up in an object so you can navigate to it in the response but for the purposes of comparing when sorting it is a singular double value, and importantly we know that it will always be a singular double value so we can validate it in the request before we execute the request.
Another solution is that the client sides execute the ordering-phase. This solution requires the huge (otherwise the order result will be in poor accuracy) results derived from elasticsearch 's aggregation.
Considering the big-data transferring and lacking of distributed-calculation-support about the client side, this solution will be more ineffective.
Unfortunately this is also true for Elasticsearch itself because of the reasons I outlined answering your first question; that we need the information from all shards to be able to calculate the results of the pipeline aggregations in order to be able to do the sorting.
I would love to find a solution to this but I don't see how we could do so at this stage in a reliable, scalable way within the constraints of sharding.
@colings86 The quoted text is from
https://github.com/elastic/elasticsearch/issues/14101 or
https://discuss.elastic.co/t/ordering-terms-aggregation-based-on-pipeline-metric/31839/7
I am refering to both aspects you mentioned:
{
"query": {
"bool": {
"filter": [
{
"query_string": {
"query": "*",
"analyze_wildcard": true
}
}
],
"must": [
{
"range": {
"@timestamp": {
"to": 1505804143152,
"from": 1505800543152
}
}
}
]
}
},
"from": 0,
"aggs": {
"http.client_ip.topn": {
"terms": {
"field": "http.client_ip",
"order":[{
# only normal metric aggregation supported
# while script metric aggregation or bucket-script pipeline aggregation not supported
"sum_bytes_total": "desc"
}],
"size": 10
},
"aggs": {
"sum_bytes_total": {
"sum": {
"script": "doc['http.bytes_in'].value+doc['http.bytes_out'].value"
}
},
"bucket_script_metric": {
"bucket_script": {
"buckets_path": {
"sum_bytes_total": "sum_bytes_total"
},
"script": {
"lang": "groovy",
"inline": "sum_bytes_total"
}
}
},
"mr_script_metric": {
"scripted_metric": {
"init_script": "params._agg.bytes_total = []",
"map_script": "params._agg.bytes_total.add(doc['http.bytes_in'].value+doc['http.bytes_out'].value)",
"combine_script": "double bytes_total = 0; for (item in params._agg.bytes_total) { bytes_total += item } return bytes_total",
"reduce_script": "double bytes_total = 0; for (a in params._aggs) { bytes_total += a } return bytes_total"
}
}
}
}
},
"size": 0
}
Now i get what you say. I think it needs some more boost improvements for distributed-calculation algorithms and software frameworks of ElasticSearch.
I'm still eagerly expecting for your new version, new innovations and new solutions about big data analysis.
@anhzhi , totally agreed. Distributed calculation is a big miss for any big data analytics. At this exactly moment, I need to do such thing, but I just got disappointed when I reached and read this thread.
@tberne @anzhi I have moved away from ElasticSearch because of this: needed to sort on scripted metrics and create range facets. Since ES was unable to provide this, I used to retrieve all documents and do that logic on the PHP side, which was completely counterproductive and performance-killing.
Last week I have delivered a new version of this application on which I worked for several months, and the storage engine is now based on MariaDb Columnstore (the only open-source SQL engine that stores data in columns instead of rows). Performance is not as good as ElasticSearch (and INSERT/UPDATE/DELETE take ages), but at least, creating scripted metrics and sorting on them (SELECT SUM(bytes_in + bytes_out) AS sum_bytes_total [...] ORDER BY sum_bytes_total DESC) is a piece of cake and performance is acceptable (queries about ~150ms on tables with dozens of millions rows). Of course, you don't have facets (aggregations) out of the box and you have to create them by yourself, which is more complicated to do than ES, but at least it works, and all the logic is done within the storage engine; no more tricky things to complete a non-supported behavior.
I formerly thought ElasticSearch was the good choice as the main engine of an BI application, but I was wrong: ElasticSearch is perfect for full-text, not for analytics.
I formerly thought ElasticSearch was the good choice as the main engine of an BI application, but I was wrong: ElasticSearch is perfect for full-text, not for analytics.
I couldn't agree more. ES is just perfect when the thing is Full Text Search. Even with some simple aggregation scenario, it is a killing tool. But when we start doing some serious analysis, it lacks some major things.
I don't know yet, but I think I will keep ES as a big data storage and, perhaps, I will introduce the use of some stream processor (like Apache Flink) to consolidate some data back to ES.
@bpolaszek - Cant you get the conversion rate per referer per ad using the script provision in sum aggregation and sum it using the same. This can be used later to order the referers.
Something like below is what I have in mind -
{
"aggs" : {
"refereres" : {
"terms" : { "field" : "referer" },
"order" : "sumOfConvertion"
},
"aggs" : {
"convertionSum" : {
"sum" : { "script" : " doc['clicks'].value / doc['views'].value "}
}
}
}
There's a solution with ES that still requires client-side sorting, but should be more efficient for retrieving all the results you need in order to sort them: try the composite agg which allows you to retrieve all results (with pagination) and then you can do a merge sort client side.
Very interesting discussion! I do wonder how Solr solved this efficiently?
Are there any new feature that would allow this, without retrieving all results and sorting client side? Our use case is similar: calculate CTR (clicks / impressions) on aggregated fields and sort results by the highest values.
Very interesting discussion! I do wonder how Solr solved this efficiently?
Are there any new feature that would allow this, without retrieving all results and sorting client side? Our use case is similar: calculate CTR (clicks / impressions) on aggregated fields and sort results by the highest values.
AFAIK Solr did not solve this either.
Thanks @bpolaszek
It appears in our case we can shard/route the data, which means that data for every search query will always live on the same server and there would be no need to shuffle it around the cluster.
We are considering to write a custom plugin to allow us sorting on calculated fields. Any suggestions on how this could be achieved, given the above routing limitation? I had a look at the existing plugins but couldn't find anything similar.
+1
Issue +1. Help add support
Are there any workarounds for the moment?
Because there is still a some activity on this bug, I like to mention a solution:
We added transform in 7.2
which you can use to group data, aggregate it and write it back into secondary index. It provides a lot of freedom, e.g. scripted metric
. At least the original ask of the thread starter can be solved using a transform. In the first step you can group the data and apply the custom aggregation (scripted metric). Afterwards the destination can be queried to get the top-n results.
This solution is scalable and works with large amount of data. Starting with 7.3
the transform can run in continuous mode and update the destination.
Sounds cool! Do you have an example query? (I'm no longer in ES, but still curious 😀)
Most helpful comment
@colings86, @clintongormley
You asked us (August 19) to describe the use cases for this feature. Many people from the community posted their use cases. After more than 3 months, I think reasonable to get an official response from ElasticSearch.
As mentioned in this thread this feature is supported by Crate.io (based on ElasticSearch) and by Solr but still not supported by ES. I think we all like ES but it becomes very hard (at least for me) to understand/accept this missing feature.
Please give us a response or even better a fix!