Elasticsearch: Add K-means clustering feature

Created on 24 Mar 2014 · 59 Comments · Source: elastic/elasticsearch

Add k-means clustering to allow detection of clusters in data sets.
http://en.wikipedia.org/wiki/K-means_clustering

This would be useful for geo points, and for other use cases too.

Thanks to https://github.com/koobs for suggesting this one at the Sydney Elastic Training.

:AnalyticAggregations >feature Analytics help wanted

geekpete

👍32 ❤12 🚀3

Most helpful comment

It's certainly gotten some attention. While a bit stalled at the moment, due to other priorities, @colings86 has a branch for geo_point k-means: https://github.com/colings86/elasticsearch/tree/feature/geokmeans

So a little more patience and this feature will be available soon.

nknize on 14 Mar 2017

👍10 ❤3 🎉3

All 59 comments

It would be great! I really need this feature. Is there any estimate of when you will start coding?

savioteles on 20 May 2014

Not that I'm seeking to drop her in it, but Britta https://twitter.com/a2tirb would definitely have the madskillz to build this feature but not sure of her priorities/bandwidth/interest to attack this feature with sticks.

I'm also not sure how new features are selected or voted up for prioritisation by the elasticsearch overlords.

geekpete on 21 May 2014

If I recall correctly, @geekpete proposed to have that in the context of aggregations, that is, build clusters and then use these as buckets inside the aggregations framework. Indeed, this would be an extremely useful feature.

While it would be a lot of fun to implement, unfortunately I do not think we will implement it in the near future. Anyone coming up with a pull request for this is of course more than welcome :-)

For now I can only point you to the carrot2 plugin which does an excellent job in clustering search results.

brwe on 21 May 2014

I'll add the comment that 'k' clusters ought to be user-suppliable as an argument to the aggregation for maximum value, with possible k-values being:

N
Random
Basic Rule-Of-Thumb
Document (m) by Term(n) matrix
Some other cool method :)
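
Two of the options above can be sketched as small helpers. This assumes that "Basic Rule-Of-Thumb" refers to the commonly cited k ≈ sqrt(n/2) heuristic and that the document-by-term option refers to the k ≈ (m × n) / t heuristic for text collections, where t is the number of non-zero entries in the matrix. The function names are hypothetical and not part of any Elasticsearch API.

```python
import math


def k_rule_of_thumb(n_points: int) -> int:
    """Rule-of-thumb estimate: k is roughly sqrt(n / 2)."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    return max(1, round(math.sqrt(n_points / 2)))


def k_from_doc_term_matrix(m_docs: int, n_terms: int, nonzero_entries: int) -> int:
    """Text heuristic: k is roughly (m * n) / t, with t the non-zero entry count."""
    if nonzero_entries < 1:
        raise ValueError("nonzero_entries must be positive")
    return max(1, round(m_docs * n_terms / nonzero_entries))
```

Both are only starting points; a user-supplied k would still take precedence over either estimate.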

For context, I brought this up @ the ElasticSearch Training in response to a brief conversation about search vs 'insight' in relation to data, the former where you know what you're looking for, the latter where you don't, or might not. The specific example was geospatial result sets with arbitrary demography data fields. It was a great session @brwe!

koobs on 22 May 2014

I also would like to cast my vote for some kind of automated clustering feature. Carrot2 is great but as far as I understand it can only work on small amounts of data. It would be great to have something that clusters ALL the data all the time. Maybe a custom clustering analyzer?

mishakogan on 21 Aug 2014

@brwe would #8110 help here?

clintongormley on 10 Nov 2014

@clintongormley not really. Bucket reducers from #8110 would run on the final aggregation but clustering needs the documents.

brwe on 12 Nov 2014

@brwe I think implementing clustering as a reducer could help reduce the cost very significantly? K-means is costly so running such an algorithm on a dataset containing lots of documents could be very slow. On the other hand, if we take geo-clustering as an example, we could make it very fast (though a bit lossy) by working on top of the output of the geo-hash grid aggregation as a bucket reducer?
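
The idea of clustering grid buckets rather than raw documents can be sketched as a weighted k-means over cell centroids. This is only an illustration under stated assumptions: each cell is a hypothetical `(lat, lon, doc_count)` tuple such as a grid aggregation might yield, distance is plain Euclidean on lat/lon (not geodesic), and the initial centroids are simply the k heaviest cells. None of these names come from Elasticsearch.

```python
from typing import List, Tuple

Cell = Tuple[float, float, int]  # (lat, lon, doc_count), a hypothetical bucket shape


def weighted_kmeans(cells: List[Cell], k: int, iterations: int = 20):
    """Cluster grid cells, weighting each cell by its document count."""
    if not 1 <= k <= len(cells):
        raise ValueError("k must be between 1 and the number of cells")
    # Assumption: seed with the k heaviest cells so the result is deterministic.
    heaviest = sorted(cells, key=lambda c: -c[2])[:k]
    centroids = [(lat, lon) for lat, lon, _ in heaviest]
    assignment = [0] * len(cells)
    for _ in range(iterations):
        # Assignment step: each cell goes to its nearest centroid.
        for i, (lat, lon, _) in enumerate(cells):
            assignment[i] = min(
                range(k),
                key=lambda j: (lat - centroids[j][0]) ** 2 + (lon - centroids[j][1]) ** 2,
            )
        # Update step: move each centroid to the weighted mean of its cells.
        updated = []
        for j in range(k):
            members = [cells[i] for i in range(len(cells)) if assignment[i] == j]
            total = sum(w for _, _, w in members)
            if total == 0:
                updated.append(centroids[j])  # keep empty or zero-weight clusters in place
            else:
                updated.append((
                    sum(lat * w for lat, _, w in members) / total,
                    sum(lon * w for _, lon, w in members) / total,
                ))
        if updated == centroids:
            break
        centroids = updated
    return centroids, assignment
```

The cost then depends on the number of cells instead of the number of documents, which is the trade-off described above: faster, but lossy, because every document is represented by its cell.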

jpountz on 12 Nov 2014

👍1

True, I should distinguish use cases. For up to 2d it might help indeed. For text clustering I do not see it.

brwe on 14 Nov 2014

just found this - would be great. +1

yehosef on 29 Jun 2015

dsingley on 9 Oct 2015

Searched for this... this would be a great feature. Other mining algorithms too.

redmouthch on 12 Oct 2015

Implementing this as a pipeline aggregation should now be possible. In that case we would first collect values into buckets using other aggregations and then use the pipeline aggregation to create clusters from those buckets.

colings86 on 13 Nov 2015

👍1

that would be mad!

lessless on 20 Dec 2015

@koobs is there a recording of this session somewhere out there?

lessless on 30 Dec 2015

@lessless I hope not :)

koobs on 30 Dec 2015

This would really be awesome!

irony on 27 Jan 2016

audriusbugas on 10 Mar 2016

:+1:

reinier-pv on 10 Mar 2016

trupin on 9 Apr 2016

chenryn on 28 Apr 2016

marfago on 4 May 2016

hkulekci on 9 May 2016

👍1

amazium on 23 Jun 2016

I am removing the discuss label and making it adopt me - there has been enough discussion on this.

s1monw on 1 Jul 2016

+1. carrot2 is good for text clustering but does not use the aggregations framework. It would be great to have a text clustering option that we can build sub-aggregations / child aggregations underneath.

ryanrozich on 4 Jul 2016

will we be able to use it with geopoints?

lessless on 11 Aug 2016

SimoneTosato on 7 Oct 2016

+1 This feature would be so helpful and powerful

ddavidebor on 7 Oct 2016

What happened with this feature? Is there any work in progress? I think it would be really useful.

diugalde on 14 Oct 2016

tol on 13 Mar 2017

sebastianovide on 28 Jun 2017

mkarakucuk on 28 Jul 2017

A-Tokyo on 24 Nov 2017

vasily-kirichenko on 18 Dec 2017

ItshakEli on 24 Dec 2017

debb-hp-com on 9 Jan 2018

@nknize, @colings86 pushed the last update to his branch in Jul '17. Is it ready, or has it been pushed aside by higher-priority work?

lessless on 9 Jan 2018

@lessless There has not yet been further work on this and it's still a little way off. There are actually other cluster-like aggregations which are likely to be merged first (e.g. https://github.com/elastic/elasticsearch/pull/26659), as they are a bit simpler to implement, validate, and test as a first implementation of aggregations which merge buckets at collection-time. Although we would like to make progress here, it's not something that is currently being tackled as a main task due to other priorities.

colings86 on 9 Jan 2018

Could we collate potential future features on a special section of the roadmap perhaps?
This ticket could be closed and referenced to the "potential future features" area of the roadmap. This might help to clear a number of other github tickets that don't have major focus if priority is on other work at the moment.

geekpete on 9 Jan 2018

Stalled waiting on https://github.com/elastic/elasticsearch/pull/26659

/cc @elastic/es-search-aggs

colings86 on 13 Mar 2018

Still desired :)

lessless on 10 Sep 2018

Indeed !! very desired !

LaurentChardin on 11 Oct 2018

@colings86 should the "stalled" label be removed now? #26659 was closed in favor of #28993, which is merged now.

lessless on 11 Oct 2018

It is true that because #28993 is merged the "stalled" label can be removed.

colings86 on 12 Oct 2018

I also confirm it's very desired and I'd be happy to see it.

Destroy666x on 22 Oct 2018

+1 for this

ivssh on 3 Dec 2018

👍3

ThomasSolti on 22 Jan 2019

Is the size parameter in https://www.elastic.co/guide/en/elasticsearch/reference/7.0/search-aggregations-bucket-geotilegrid-aggregation.html something like k-means-clustering for geo-search?

barracuda317 on 22 Feb 2019

@barracuda317 Not really, no. GeoTile just overlays a fixed grid over the area and aggregates documents into those grid cells. The grids are constructed regardless of the data distribution (think of it more like a heatmap).

Clustering like k-means dynamically identifies regions of data that are "similar" and groups them together into a cluster. Clustering can give you individual clusters that are different shapes, sizes, and densities. For a practical example, a clustering algo might group all the values inside a city together, then cluster the rural countryside together as a different group (much larger but also more sparse).
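
The difference can be illustrated with a simplified fixed grid. This is a sketch, not the actual GeoTile tiling scheme: `grid_cell` is a hypothetical helper that bins points into square cells of `cell_deg` degrees, purely to show that cell boundaries ignore where the data sits.

```python
import math
from collections import Counter


def grid_cell(lat: float, lon: float, cell_deg: float = 1.0):
    """Map a point to a fixed grid cell; the grid never looks at the data."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


# Three points that sit close together near a cell corner, plus one far away.
points = [(0.99, 0.99), (1.01, 1.01), (1.02, 0.98), (5.5, 5.5)]
counts = Counter(grid_cell(lat, lon) for lat, lon in points)
```

Here the three neighbouring points land in three different cells because they straddle grid lines, whereas a clustering algorithm would be expected to group them together and keep the distant point separate.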

polyfractal on 22 Feb 2019

@colings86 is there a reason why this was never completed? If the issue is time commitment, I'm very interested in this feature and would like to try finishing the implementation.

jamesdorfman on 17 Apr 2019

@jamesdorfman Can you please describe your use case? Are you interested in having k-means clustering on geo data (they can be up to 8 dims)?

mayya-sharipova on 18 Apr 2019

👍1

@mayya-sharipova +1 to exactly that use case

lessless on 18 Apr 2019

👍1

@mayya-sharipova yes, I was specifically interested in implementing the geo data use case. This thread made it seem as though this specific feature is highly desired.

Furthermore, I'm not completely certain about how difficult this will be to implement, so I also think that the restricted use case of clustering only geo data is a good starting point.

jamesdorfman on 19 Apr 2019

👍1

Another use case would be to group the prices of products within an index into ranges, using k-means clustering to propose clusters of prices for price selection.
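
As an illustration of the price use case, here is a minimal one-dimensional k-means (Lloyd's algorithm) in plain Python. It is a sketch under assumptions: `kmeans_1d` is a hypothetical helper, the initial centroids are spread evenly over the sorted values so the result is deterministic, and nothing here reflects how an Elasticsearch aggregation would be implemented.

```python
from typing import List


def kmeans_1d(values: List[float], k: int, iterations: int = 50) -> List[List[float]]:
    """Group one-dimensional values (e.g. prices) into k clusters."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    data = sorted(values)
    # Assumption: deterministic seeding, evenly spaced over the sorted values.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: each centroid becomes the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters


prices = [9.99, 10.49, 11.0, 49.0, 52.5, 199.0, 205.0, 210.0]
# Each non-empty cluster becomes a proposed (min, max) price range.
ranges = [(c[0], c[-1]) for c in kmeans_1d(prices, 3) if c]
```

The resulting ranges follow the gaps in the data rather than fixed interval widths, which is what makes them useful as suggested price filters.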

LaurentChardin on 21 Apr 2019

👍1

Upon further research and experimentation it seems that a more straightforward approach would be to implement an agglomerative hierarchical clustering algorithm, rather than k-means clustering.

K-means involves creating k buckets, and then reassigning data points at each iteration of the algorithm. On the other hand, in agglomerative hierarchical clustering each point is initially placed in its own cluster. Then, these clusters are merged together on subsequent iterations.
https://en.wikipedia.org/wiki/Hierarchical_clustering

I am currently working on implementing this clustering feature as a histogram multi-bucket aggregation. The k-means approach would involve moving documents between buckets at each iteration; however, the hierarchical method would simply entail creating a bucket for each document and then merging them until the desired number of buckets is reached. This functionality is very similar to the existing Auto Date Histogram Aggregation, where buckets are created and then merged. Since bucket merging functionality was already created for that aggregation, this approach is significantly easier to implement.
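
The merge-based approach described above can be sketched for one-dimensional values as follows. This is an illustration only, not the implementation being worked on: `merge_until` is a hypothetical helper that starts with one bucket per value and repeatedly merges the two adjacent buckets whose means are closest, until the target bucket count is reached.

```python
from typing import List


def merge_until(values: List[float], target_buckets: int) -> List[List[float]]:
    """Agglomerative bucketing: one bucket per value, then merge neighbours."""
    if target_buckets < 1:
        raise ValueError("target_buckets must be at least 1")
    buckets = [[v] for v in sorted(values)]
    while len(buckets) > target_buckets:
        means = [sum(b) / len(b) for b in buckets]
        # Buckets cover contiguous sorted ranges, so their means are ordered and
        # the closest pair of means is always a pair of neighbours.
        i = min(range(len(buckets) - 1), key=lambda j: means[j + 1] - means[j])
        buckets[i:i + 2] = [buckets[i] + buckets[i + 1]]
    return buckets
```

For example, `merge_until([1, 2, 3, 10, 11, 50], 3)` yields `[[1, 2, 3], [10, 11], [50]]`. Documents never move between existing buckets; buckets are only ever merged, which is the property that makes this approach resemble the bucket merging in the Auto Date Histogram Aggregation.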

Furthermore, it seems that both methods can produce clusters of similar quality. https://www.cs.utah.edu/~piyush/teaching/4-10-print.pdf

Please let me know if this line of reasoning makes sense :)

jamesdorfman on 1 May 2019

@jamesdorfman Isn't Agglomerative Hierarchical Clustering expensive in terms of time and space complexity compared to K-means?

Agglomerative clustering (naive implementation):
time complexity: O(n^3)
space complexity: O(n^2)

K-means (n points, k clusters, m dimensions):
time complexity: O(n * k * m) per iteration
space complexity: O((n + k)m)

And since K-means has implementations that support incremental learning, the space complexity can be further reduced to make it constant.
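
A sketch of what incremental k-means can look like, assuming the sequential (online) update in which each incoming point pulls its nearest centroid towards itself by a step of 1/count. `OnlineKMeans` is a hypothetical class for illustration; it stores only the k centroids and their counts, so memory does not grow with the number of points seen.

```python
from typing import List, Sequence


class OnlineKMeans:
    """Sequential k-means: memory is O(k * m), independent of n."""

    def __init__(self, initial_centroids: Sequence[Sequence[float]]):
        self.centroids: List[List[float]] = [list(c) for c in initial_centroids]
        self.counts: List[int] = [0] * len(self.centroids)

    def update(self, point: Sequence[float]) -> int:
        """Assign one point to its nearest centroid and nudge that centroid."""
        nearest = min(
            range(len(self.centroids)),
            key=lambda j: sum((p - c) ** 2 for p, c in zip(point, self.centroids[j])),
        )
        self.counts[nearest] += 1
        # A step of 1/count keeps the centroid equal to the running mean of its
        # points; the first point therefore replaces the initial seed entirely.
        rate = 1.0 / self.counts[nearest]
        self.centroids[nearest] = [
            c + rate * (p - c) for p, c in zip(point, self.centroids[nearest])
        ]
        return nearest


model = OnlineKMeans([(0.0, 0.0), (10.0, 10.0)])
for point in [(1.0, 1.0), (9.0, 9.0), (3.0, 3.0), (11.0, 11.0)]:
    model.update(point)
```

The trade-off is that each point is seen once, so the result depends on the seed centroids and on the order in which points arrive.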

arshad171 on 4 Sep 2019

sroui on 1 Jan 2021
