Elasticsearch: Support unsigned number types

Created on 5 Oct 2015 · 12 comments · Source: elastic/elasticsearch

There are a lot of cases where you want to store a byte/short/int/long in an Elasticsearch field, but the value is unsigned by nature. With byte/short/int it's always possible to fall back to the next bigger type (at the cost of increased storage space? I haven't measured that), but with long there is no bigger type.
Splitting numbers or subtracting/adding the "magic number" from and to the value is a nightmare.
I think it would be much more convenient to support unsigned types natively.

Theoretically, on the Elasticsearch side it wouldn't be too hard: if a field is marked as unsigned (ubyte/ushort/uint/ulong?), every value that comes in is decreased by 2^(bitsize(type)-1), and every value going out is increased by the same amount. Otherwise Elasticsearch internals wouldn't change.
Of course representing the unsigned values with Java native signed types can be tricky (BigInteger or Java 8 native unsigned types?).
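The offset idea above has a well-known bit-level equivalent: flipping the sign bit maps unsigned order onto signed order. A minimal sketch in plain Java (the method names `toSortableSigned`/`fromSortableSigned` are hypothetical, not Elasticsearch code; the Java 8 `Long` unsigned helpers handle parsing and printing):

```java
// Sketch only: map unsigned 64-bit values onto Java's signed long so that
// signed ordering matches unsigned ordering. Flipping the sign bit shifts
// the unsigned range [0, 2^64) onto [Long.MIN_VALUE, Long.MAX_VALUE].
public class UnsignedLongDemo {
    static long toSortableSigned(long unsignedBits) {
        return unsignedBits ^ Long.MIN_VALUE; // XOR with the sign bit
    }

    static long fromSortableSigned(long signedBits) {
        return signedBits ^ Long.MIN_VALUE; // the mapping is its own inverse
    }

    public static void main(String[] args) {
        // Java 8 helpers reinterpret a signed long as unsigned:
        long max = Long.parseUnsignedLong("18446744073709551615"); // 2^64 - 1, bits of -1L
        long zero = Long.parseUnsignedLong("0");

        // Unsigned order is zero < max; after mapping, signed order agrees.
        System.out.println(toSortableSigned(zero) < toSortableSigned(max)); // true

        // Round-trip back to the unsigned decimal representation.
        System.out.println(Long.toUnsignedString(fromSortableSigned(toSortableSigned(max))));
    }
}
```

This is essentially the same transformation as the "decrease by 2^(bitsize-1)" proposal, expressed as a bit flip instead of arithmetic, which avoids overflow concerns.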

:Search/Mapping discuss

All 12 comments

The values are mostly already compressed in lucene. For example, if your values range from -1 to 253, only a single byte should be used (depends on what features you have enabled for the field eg indexed, stored, doc values).

I don't think we should complicate the api with more types. Other than input validation (which the user can do on their end), the different types for long/int/short/byte don't matter to lucene. They are simply numeric and we will represent them according to the values seen. I would rather see the ability to set the valid range of values in the mapping, and limit to just "integral" and "floating point" field types. (Granted the types do matter for fielddata, but with doc values by default for 2.0 this doesn't really matter anymore).

Granted the types do matter for fielddata

Actually they don't, so even on 1.x things are just fine.

I think the only time that type may be relevant is when you want to use an unsigned long that doesn't fit into the range of a signed long?

Once we get BigDecimal support it becomes irrelevant, I think? I would rather wait for that (hopefully not too long now, with BKD going into Lucene core soon) than add something to the API right now that would require complex internal changes (how do you represent an unsigned long without BigDecimal in Java?).

++

@clintongormley: Currently, I have two use cases:

  1. I have several unsigned int fields, which I can't express with a signed integer. If I map them as long, the "binding" part of the ES schema (mapping) disappears: the devs think the field really is a long and handle it accordingly, and apps break if anybody writes a value that can't be expressed in the unsigned int range.
  2. Yes, unsigned long is another (big :) problem. I have to do math or express the values as strings, but then I lose the arithmetic operators (search for ranges etc).
  3. And a third (which I didn't think ES would ever support): arbitrary-size numbers. For example, I store several 128, 256, 384 and 512 bit hash values. Currently I can only do this by:
     - splitting the numbers (into 64-bit chunks, with unsigned-signed-unsigned conversions)
     - number-to-string encoding
     - hex storage as a string
     - binary storage (base64)

    Instead of just writing a 512-bit number...

As far as I can remember, the first two of those workarounds were the most space-efficient (I'm storing several billion values, so space matters), but having an arbitrary-size integer would be the best.
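The hex-as-string workaround above has one subtlety worth spelling out: the hex string must be zero-padded to a fixed width, otherwise lexicographic order on a keyword field diverges from numeric order and range queries break. A sketch (the helper name `toFixedHex` is hypothetical, not an Elasticsearch API):

```java
import java.math.BigInteger;

// Sketch of the "hex storage as string" workaround: zero-pad the hex form
// to a fixed width so that lexicographic string order matches numeric order.
public class HashHexDemo {
    static String toFixedHex(BigInteger value, int bits) {
        String hex = value.toString(16);
        int width = bits / 4;                          // hex digits for `bits` bits
        return "0".repeat(width - hex.length()) + hex; // left-pad with zeros
    }

    public static void main(String[] args) {
        BigInteger small = BigInteger.valueOf(255);          // "ff"
        BigInteger big = BigInteger.ONE.shiftLeft(511);      // a 512-bit value

        String s = toFixedHex(small, 512);
        String b = toFixedHex(big, 512);

        // Unpadded, "ff" would sort AFTER "8000..."; padded, order is correct.
        System.out.println(s.compareTo(b) < 0); // true
    }
}
```

With fixed-width encoding, term range queries on the string field behave like numeric range queries, at the cost of roughly 2x the bytes of a raw binary encoding.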

So my point is:

  1. having different (unsigned/signed) types helps keep consistency (value-range enforcement) across different users; sometimes there are just too many devs hanging around
  2. having BigDecimal support would also be a big advantage

Thanks,

@rjernst: "(how do you represent an unsigned long without BigDecimal in java?)"
With BigInteger? :)

  1. having different (unsigned/signed) types helps keep consistency (value-range enforcement) across different users; sometimes there are just too many devs hanging around

See my comments before. Would the problem you have with storing an unsigned int in a long be handled by allowing you to set the minimum and maximum value allowed on numeric types? This way the field could be a long, but you would set 0 as the min and the unsigned-int maximum (2^32 - 1) as the max. The only limitation (for now, until we have bigint/decimal support) would be that you still could not represent anything outside the range of a signed long.
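The range check suggested here is trivial to do client-side today. A sketch (the class and method names are hypothetical, not an Elasticsearch API):

```java
// Sketch only: enforce the unsigned-int range on the client before indexing
// the value into a signed long field.
public class UnsignedIntRange {
    static final long MIN = 0L;
    static final long MAX = 0xFFFFFFFFL; // 2^32 - 1 = 4294967295

    static boolean fitsUnsignedInt(long value) {
        return value >= MIN && value <= MAX;
    }

    public static void main(String[] args) {
        System.out.println(fitsUnsignedInt(4294967295L)); // true: max unsigned int
        System.out.println(fitsUnsignedInt(4294967296L)); // false: too big
        System.out.println(fitsUnsignedInt(-1L));         // false: negative
    }
}
```

This gives the "value range enforcement" half of the request without any mapping change, but only for writers that go through the validating client.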

  2. having a bigdecimal would also be a big advantage

See issue #5683, and the latest Lucene issue to support this: https://issues.apache.org/jira/browse/LUCENE-6825

@rjernst: yes, for me having a min/max (and an arbitrary precision integer :) would be enough, given that it won't explode the storage space needs.

We just discussed this issue in FixitFriday: we like specifying types using byte/short/int/long better than min and max values.

And given that doc values compute dynamically how many bits per value are required, space requirements would be pretty similar between an unsigned int field and a long, so implementing support for unsigned numeric types would not help.

@rjernst Thanks for your suggestion, however it won't help us or anyone else indexing 64-bit hashes, our workaround for those at the moment is to store them as a string.

This is a seriously disappointing limitation. 64-bit numbers cannot be stored (since long is signed) -- years after this issue was opened.
