Elasticsearch: Variance calculation should use more sophisticated algorithm

Created on 17 Jun 2020  ·  10Comments  ·  Source: elastic/elasticsearch

ExtendedStats uses a naive algorithm for calculating variance which can lead to floating point errors when the difference between sumSq and sum is very small (see here and here). This _won't_ be caught by our existing Kahan summation which is more concerned about accumulated errors when summing values.

There are alternative algos that behave better, which will be increasingly important as other tools build on variance (like t-test).

:AnalyticAggregations >enhancement Analytics

Most helpful comment

Yeah, what Nik said :) Usually these sort of things fall into one of two camps:

  1. New algo is compatible with old algo somehow (maybe by falling back to loss of precision, etc), so it just has to be "translated" back and forth
  2. New algo is fundamentally incompatible, so we end up maintaining "old" in newer versions for a while while the BWC matrix overlaps.

Since this is a loss of precision situation, we can probably figure out a way to translate back/forth. But that's just a guess, haven't really thought about it much :)

All 10 comments

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Yes exactly, the way it is done now with sum of squares can lead to a numerical phenomenon called Catastrophic Cancellation due to floating point arithmetic, that ultimately calculates an incorrect variance and even sometime negative values. The Welford algorithm is the key, and the way I see it done in Elastic Search is to first implement an aggregate called the M2 moment that the Welford algo uses, when this is done, the formula becomes straightforward. More info on all of this here https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

Can I send a PR to change the existing algo to Welford algorithm?

I think a PR never hurts, this would be great! 😇 🙏🏻

@polyfractal I read your comment here, so shall I leave this to be handled by a pro?

If you open a PR with the new algorithm one of us would be happy to advise
on the backwards compatibility side. There is usually something fairly
simple that'd make it work. The difficult thing is usually figuring out
what that is.

On Mon, Jun 22, 2020, 08:09 Vishnu Gt notifications@github.com wrote:

@polyfractal https://github.com/polyfractal I read your comment here
https://github.com/elastic/elasticsearch/pull/37384#issuecomment-645999659,
so shall I leave this to be handled by a pro?


You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub
https://github.com/elastic/elasticsearch/issues/58275#issuecomment-647476322,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AABUXIQEQBGYACTAXADR3MTRX5CWXANCNFSM4OAY6YPQ
.

Awesome, I'll start working on the PR with the new algorithm.

Yeah, what Nik said :) Usually these sort of things fall into one of two camps:

  1. New algo is compatible with old algo somehow (maybe by falling back to loss of precision, etc), so it just has to be "translated" back and forth
  2. New algo is fundamentally incompatible, so we end up maintaining "old" in newer versions for a while while the BWC matrix overlaps.

Since this is a loss of precision situation, we can probably figure out a way to translate back/forth. But that's just a guess, haven't really thought about it much :)

I suspect the PR I submitted for this issue was somehow missed, so commenting here, just in case.

I suspect the PR I submitted for this issue was somehow missed, so commenting here, just in case.

Yup, it did. I've marked it for review on our side. I'll see about reviewing it in a bit if no one else gets to it first.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jpountz picture jpountz  ·  75Comments

rjernst picture rjernst  ·  43Comments

jordansissel picture jordansissel  ·  69Comments

eryabitskiy picture eryabitskiy  ·  94Comments

monken picture monken  ·  160Comments