Elasticsearch: Dynamically map all numerics to floats by default?

Created on 15 Jan 2016 · 10 Comments · Source: elastic/elasticsearch

Elasticsearch assumes that if a number contains a dot, then it should be mapped as a floating-point number (double in 2.x and float in master) and otherwise as a long. But this is quite trappy, as it means we expect floating-point numbers to be consistently serialized with a dot (see e.g. https://twitter.com/bitemyapp/status/687415657651154944 or #15961).
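To illustrate the trap, here is a minimal sketch that uses Python's `json` parser as a stand-in for the dot-based rule. The function name `guessed_type` is hypothetical and this is not Elasticsearch code; it only shows that the guessed type depends on how a producer happened to serialize the number.

```python
import json

def guessed_type(json_number):
    """Return the type a dot-based rule would guess for one JSON number token."""
    value = json.loads(json_number)
    # Python's parser yields int for tokens without a fraction or exponent,
    # and float otherwise, which mirrors the rule described above.
    return "float" if isinstance(value, float) else "long"

# The same quantity, serialized by two different producers.
for token in ("10", "10.0"):
    print(token, "->", guessed_type(token))
```

Which of the two tokens shows up in the first indexed document decides the mapping for the field.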

Instead, we could map all numerics to floats by default (but you could still use dynamic templates to override it if you want). This would have two drawbacks:

  • floats can only represent integer values accurately up to 2^24 (~16M)
  • it would increase storage requirements
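The first drawback can be checked with a small sketch that round-trips integers through IEEE 754 single precision using Python's `struct` module. The helper name `to_float32` is made up for illustration.

```python
import struct

def to_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack(">f", struct.pack(">f", float(value)))[0]

limit = 2 ** 24  # 16,777,216

# Integers up to 2^24 survive the round trip unchanged.
print(to_float32(limit) == limit)

# 2^24 + 1 has no exact float representation and comes back as a neighbor.
print(to_float32(limit + 1) == limit + 1)
```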

I ran some simulations to see how much worse it would be to store integers as floating-point numbers. The good news is that since most bits on the right side of the mantissa will be zeros, gcd compression will help save some bits:

  • less than 256 unique values: storing numbers as a float or as a long doesn't matter since we would use table compression in both cases
  • more than 256 unique values between 1 and 1000: an int would require 10 bits per value, rounded to 12 for retrieval efficiency, while a float would use 17 bits rounded to 20 for retrieval efficiency (+67%)
  • more than 256 unique values between 1 and 100,000: an int would require 17 bits per value, rounded to 20 for retrieval efficiency, while a float would use 24 bits rounded to 24 for retrieval efficiency (+20%)
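The unrounded bit counts in this list can be approximated with the sketch below. It assumes the simulation took the raw single-precision bit pattern of each integer, divided all patterns by their gcd, and counted the bits needed for the largest quotient. That is a guess at the method, not a description of the actual simulation or of Lucene's encoding (which may, for instance, also subtract the minimum value), and all function names are hypothetical.

```python
import struct
from functools import reduce
from math import gcd

def float32_bits(value):
    """Return the IEEE 754 single-precision bit pattern of value as an int."""
    return struct.unpack(">I", struct.pack(">f", float(value)))[0]

def bits_per_value(encoded):
    """Bits needed for the largest value after dividing out the common gcd."""
    common = reduce(gcd, encoded)
    return (max(encoded) // common).bit_length()

def compare(upper):
    """Return (bits as integers, bits as float patterns) for 1..upper."""
    ints = list(range(1, upper + 1))
    floats = [float32_bits(v) for v in ints]
    return bits_per_value(ints), bits_per_value(floats)

for upper in (1000, 100_000):
    as_int, as_float = compare(upper)
    print(f"1..{upper}: int={as_int} bits, float={as_float} bits")
```

Under these assumptions the sketch yields 10 vs. 17 bits for 1..1000 and 17 vs. 24 bits for 1..100,000, matching the figures quoted before rounding.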

I'm not sold yet about what we should do but thought we should have this discussion. Again, note that it would only apply to dynamically mapped fields, integers that are mapped as integers would remain as efficient as they are today.

:Search/Mapping >breaking >enhancement

All 10 comments

I think you'd make half the people happy, and the other half unhappy. It's so easy to add a dynamic mapping rule that allows you to add all numeric fields as float should you choose to do so, I'm not sure it's worth the change.
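For reference, such a dynamic mapping rule might look like the sketch below, written as a Python dict that serializes to the JSON mapping fragment. It relies on `match_mapping_type`, where `long` and `double` are the numeric types that dynamic detection can produce. The template names are hypothetical, and where the fragment sits inside the mapping body depends on the Elasticsearch version, so treat it as a starting point rather than a tested configuration.

```python
import json

# Hypothetical template names; only the keys inside each template are
# Elasticsearch mapping options.
numerics_as_float = {
    "dynamic_templates": [
        {
            "longs_as_float": {
                "match_mapping_type": "long",
                "mapping": {"type": "float"},
            }
        },
        {
            "doubles_as_float": {
                "match_mapping_type": "double",
                "mapping": {"type": "float"},
            }
        },
    ]
}

# Print the fragment as JSON, ready to paste into a mapping request body.
print(json.dumps(numerics_as_float, indent=2))
```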

I think this should be done because the defaults should try to favor usability over performance (or storage in this case)

This might become more necessary as we are considering rejecting numbers that have a decimal part on integer types: #25861.

+1, this is very trappy.
Just realized I had some random docs getting dropped due to float vs. long indexing, depending on which type is picked up from the first doc. Often people don't fully control how JSON numbers get serialized.
It would not be realistic to name every field in the mapping, so I'd set up some catch-all numeric templates to force everything to floats, but I have to check the docs to see which types can be autodetected, or whether I need to catch all possible numeric types (which would be pretty ugly).
Maybe I'm missing something easy in the docs.
Thanks!

@elastic/es-search-aggs

We chatted about this in the search/aggs meeting a little while ago (forgot to update, sorry).

We decided that the breaking change + potential confusion around floating point error made this less than ideal. In our experience, floating point errors are difficult to understand for even relatively savvy users. Especially if we were to map to floats instead of doubles, it was feared many users could be bitten by rounding without understanding what was happening and start seeing strange search results because of it.

Ranges can look very strange when fp rounding errors happen. And while a keyword should be used instead, many people accidentally use the dynamic long for IDs, which also tend to be very large and could easily hit fp errors with _very_ strange side effects (returning the wrong users, etc.)
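A small sketch of the ID problem: two distinct integer IDs can collapse to the same stored value once they pass through a floating-point type. The helper names and the sample IDs are made up for illustration, and `struct` is used here only to emulate single and double precision.

```python
import struct

def stored_value(identifier, fmt):
    """Round-trip an integer ID through a float format (">f" or ">d")."""
    return struct.unpack(fmt, struct.pack(fmt, float(identifier)))[0]

def collide(id_a, id_b, fmt):
    """True if two different IDs become indistinguishable after storage."""
    return id_a != id_b and stored_value(id_a, fmt) == stored_value(id_b, fmt)

# Neighboring IDs above 2^24 can collide in single precision.
print(collide(20_000_000, 20_000_001, ">f"))

# Neighboring IDs above 2^53 can collide even in double precision.
print(collide(2 ** 53, 2 ** 53 + 1, ">d"))
```

A lookup on one such ID would then also match the other, which is the "wrong users" scenario described above.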

We felt it would be at least as tricky as truncation errors, so breaking for a different set of hard-to-understand semantics wasn't worth it.

I just realized we didn't discuss the decision made in #25861 to remove coerce however, and how that might affect this issue. I was going to close, but perhaps it should be discussed again. @jpountz thoughts?

@polyfractal would you be able to clarify how removing coerce might affect the decision in this issue (and require further discussion)?

@jtibshirani I think it was related to Adrien's earlier comment in https://github.com/elastic/elasticsearch/issues/16018#issuecomment-319312005

E.g. if coerce goes away, then any number with a fractional portion (except fractional parts that are zero, like 1.0) will be rejected. So dynamically mapping all values to float/double makes it more user friendly, in that all values can be indexed by default. If this wasn't implemented, I think we'd be in a situation where floats/doubles would have to be explicitly mapped first.

But I'm not positive, that's just my guess based on Adrien's comment. :)

Thanks for the additional context! To me, even with the coerce option it seems like floats need to be mapped explicitly -- if the first document indexed happens to contain a number without a decimal point, then all subsequent floats will be truncated (which is likely undesirable/confusing behavior). Hopefully @jpountz will be able to clarify his comment, and we can see whether we can close or whether another discussion is needed.

> if the first document indexed happens to contain a number without a decimal point, then all subsequent floats will be truncated (which is likely undesirable/confusing behavior)

Agreed! To be sure we are on the same page, the behavior you are describing is our current default behavior.
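The first-document-wins behavior can be mimicked with a toy dynamic mapper. This is a simplified simulation, not Elasticsearch code: `index_docs` is a hypothetical name, only numeric fields are handled, and truncation is modeled with Python's `int()`.

```python
import json

def index_docs(raw_docs):
    """Toy dynamic mapper: the first value seen for a field fixes its type."""
    mapping = {}
    stored = []
    for raw in raw_docs:
        doc = json.loads(raw)
        indexed = {}
        for field, value in doc.items():
            if field not in mapping:
                # No dot in the first value means the field becomes a long.
                mapping[field] = "long" if isinstance(value, int) else "double"
            if mapping[field] == "long":
                indexed[field] = int(value)  # coerce: fractional part dropped
            else:
                indexed[field] = float(value)
        stored.append(indexed)
    return mapping, stored

mapping, stored = index_docs(['{"price": 10}', '{"price": 10.75}'])
print(mapping, stored)
```

Reversing the order of the two documents would map the field as a floating-point type and keep the fraction.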

> And while a keyword should be used instead, many people accidentally use the dynamic long for IDs which also tend to be very large and could easily hit fp errors with very strange side effects (returning the wrong users, etc)

This argument convinces me that we should not map all numbers as floats by default, so I'll close this issue.
