Elasticsearch: Add support for 64-bit unsigned integers

Created on 27 Jul 2018 · 13 comments · Source: elastic/elasticsearch

As a follow-up to discussions in #17006, we agreed in FixitFriday to add support for 64-bit unsigned integers. These numbers will support accurate term and range queries, but aggregations will work on the nearest double value.

:Search/Mapping >feature Search

All 13 comments

Pinging @elastic/es-search-aggs

Just adding this note here for reference.

We ran into a use case for this in https://github.com/elastic/ecs/pull/673. The Windows registry only has a concept of unsigned numbers, so we saw two paths to take for numbers greater than 2^63 in a u64. We could either cast to s64, which is hard to enforce and too easy to get wrong, or store only the string representation. Since we have little need for math, we opted for a keyword string field in the interim.
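A minimal sketch of the interim workaround described above: render a u64 value as its decimal string so it can be indexed into a keyword field. Java has no unsigned type, so the raw bits live in a signed long and the unsigned helpers on java.lang.Long convert in both directions (the class name here is illustrative, not from the ECS change):

    public final class U64AsKeyword {
        public static void main(String[] args) {
            long rawBits = 0xFFFFFFFFFFFFFFFFL;               // u64 max, as raw bits
            String keyword = Long.toUnsignedString(rawBits);  // "18446744073709551615"
            long roundTrip = Long.parseUnsignedLong(keyword); // back to the same bits
            System.out.println(keyword + " " + (roundTrip == rawBits)); // ... true
        }
    }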

Following up on @rw-access's comment, there are a fair number of data fields coming from the Elastic Endpoint that we would like to store as u64 if possible. Is this issue being actively worked on?

Same problem here while processing network performance data from routers as part of a network management system. This data contains counters (e.g. received octets) which are 64-bit unsigned values. It would be great if we could store them properly.

We, within the search team, had a discussion about the following points:

I. How to represent unsigned long internally, in JSON, and in the Elasticsearch clients, for:

  • Queries
  • Sort values
  • Aggs
  • Scripts

Several options here are:

  1. Long for queries, sort values, and scripts; double for aggs

    • easy to handle

    • values > Long.MAX_VALUE are presented as negative numbers, which is a poor experience for users, especially when the unsigned long is used for IDs (see the sketch after this list)

  2. BigInteger for query field values, sort values, and scripts; double for aggs

    • difficult to handle internally

    • some Elasticsearch clients may not support them (need to confirm with the lang-client team)

  3. String

    • easy to handle

    • easy to support

    • no support for numeric metrics aggs (do we need them on an unsigned long field?)
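To make option 1's drawback concrete, here is a short Java illustration (hypothetical class name): when a u64 is carried in a plain signed long, every value above Long.MAX_VALUE surfaces as a negative number, which is what users would see in sort values and scripts:

    public final class Option1Pitfall {
        public static void main(String[] args) {
            // 2^63, i.e. Long.MAX_VALUE + 1 in unsigned terms
            long id = Long.parseUnsignedLong("9223372036854775808");
            System.out.println(id);  // -9223372036854775808
            // Unsigned-aware comparison on the raw bits still works:
            System.out.println(Long.compareUnsigned(id, 1L) > 0); // true, 2^63 > 1
        }
    }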

II. Should we accept mixing longs and unsigned longs when sorting on a field?
- Do we need this mixed sorting on longs and unsigned longs?

Option 3 is a non-starter for any use cases that I have worked on. The most common scenario I see relates to metrics that are counters. Besides the network equipment mentioned by @ktetzlaff, a significant amount of server and application monitoring data consists of unsigned 64-bit counters. For example, this is the package that Metricbeat uses to gather disk I/O stats (https://github.com/shirou/gopsutil/blob/master/disk/disk.go#L14), and as you can see there are a number of uint64 values. Beats actually sends these to Elasticsearch as a long, which is technically wrong, but to date simply unavoidable. So this feature is needed to "fix" many Beats modules as well.

Given that counters will make up the majority of unsigned values, it is important (mandatory really) that at least the derivative aggregation work well and accurately. Whether that is best achieved with option 1 or 2... I would say that 2 is probably preferred. However if 1 is the only way to get the necessary support, and it allows for derivative aggregations, I could live with it.

I am in the same situation as @robcowart: option 3 would not achieve anything for my use cases. The main reason I want 64-bit integers is that metric values are sometimes greater than the maximum the current data types offer (mostly durations of long-running tasks, counted in milliseconds).

I think we should eliminate option 1 from the picture. It's confusing to rely on a convention that requires translating from a signed long. Option 2 is more natural since we already have an abstraction for numerics in scripts and sort values. They return a Number, and the user can choose the resolution they need when consuming the value. Unsigned longs could, for instance, return a BigInteger for values greater than Long.MAX_VALUE.
@elastic/es-clients would you be able to comment on the proposal? The JSON specification accepts numbers without restrictions, so we were wondering if we need to do anything special to ease the usage of these 64-bit numbers in the clients?

Thanks for the comments everyone, very valuable.

@jimczi I have chatted with the es-clients team; all the clients (PHP, Perl, Ruby, Python, .NET) support BigInteger except JavaScript. Also, some clients on 32-bit platforms (which should be quite outdated by now) don't support BigInteger, e.g. PHP and Ruby.

For JavaScript, the language does support BigInts, but the native JSON parser does not. This should not be a blocker for us, though: the es-clients team is willing to write a custom JSON parser to handle this case.

One thing left is to confirm with the Kibana team that they can handle 64-bit integers.

Update:
I have chatted with the Kibana team, and Kibana has problems handling big integers (relevant issue: https://github.com/elastic/kibana/issues/40183); this is not going to be fixed any time soon.
But considering that Kibana currently also has issues with large longs, and citing @timroes, "I don't think adding a new type to the list that will cause exactly the same issues will be a problem", I guess we can still proceed with our option 2.
@jimczi what do you think?

Updated option 2:

  1. Sort: return long for values < 2^63, BigInteger for values >= 2^63
  2. Scripts: return long for values < 2^63, BigInteger for values >= 2^63 (see the sketch after this list)
  3. Aggregations:

    • terms agg: return long or BigInteger for buckets (if this field is used for "ids", it is valuable to have precise bucket keys)

    • numeric metric aggs: return doubles
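A minimal sketch (not the actual Elasticsearch implementation) of the dispatch described in items 1 and 2: an unsigned 64-bit value stored in a raw signed long is exposed as a java.lang.Number, a Long when it fits the signed range and a BigInteger otherwise:

    import java.math.BigInteger;

    public final class UnsignedLongs {
        private static final BigInteger TWO_POW_64 = BigInteger.ONE.shiftLeft(64);

        static Number toNumber(long rawBits) {
            if (rawBits >= 0) {
                return rawBits;  // value < 2^63: a plain long is exact
            }
            // value >= 2^63: the raw signed long is negative; add 2^64 to recover it
            return BigInteger.valueOf(rawBits).add(TWO_POW_64);
        }

        public static void main(String[] args) {
            System.out.println(toNumber(42L)); // 42
            System.out.println(toNumber(-1L)); // 18446744073709551615
        }
    }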

Just to confirm: are we planning for a maximum of a 64-bit unsigned long, or for arbitrary-precision BigInteger numbers? The implications for the changes are the same, but in terms of (de)serialization performance it's good to know what to optimize for.

There are some potential client breaking changes.

JavaScript client: it will need a dedicated generic serializer, which is going to be slower for everything. The workaround here is that the default WON'T support ulong/bigints OOTB, but the client can be injected with a serializer that supports them, either globally or per request.

Java HLRC: the Java HLRC relies heavily on boxing, e.g. the to and from on RangeQueryBuilder are Objects, so it is future-proof 👍 (at the cost of boxing). Not sure if this is the case everywhere the change is intended to be supported.

.NET NEST client: in most cases we have dedicated types and interfaces, e.g. the range query is backed by an interface IRangeQuery with implementations like LongRangeQuery and DateRangeQuery that strongly type the to/from parameters. We can extend this quite easily with e.g. a BigIntegerRangeQuery. In some cases (especially responses) values are currently typed as long or double; changing them to BigInteger or BigDecimal is technically a breaking change.

I suggest we proceed with option 2 and:

  • Try to solve any breaking changes as best we can.
  • Accept a technical breaking change that in practice is hard to hit (e.g. long implicitly converting to BigInteger and back might be good enough?).
  • If all else fails, document the limitations of a particular client and set it up for success when 8.0 comes around.

PHP client: it supports 64-bit integers but not unsigned ones. We can support BigInteger using the GMP extension of PHP (not enabled by default in the language).
Perl: it supports 64-bit integers by default and BigInteger using bigint.

Because some clients may need to support 64-bit and BigInteger values in different ways, I suggest providing some facilities, for instance (see the sketch after this list):

  • add a bigint_as_string=1 flag in the query string (0 by default) to return 64-bit and BigInteger values as strings; this will prevent the JSON decoder from throwing errors when a numeric value is wider than 32 bits.
  • add a numeric_type field in the response that alerts the client that 64 (bit) or big (BigInteger) values are in use. This alert has no 32 (bit) value, since that is the default; it is triggered if at least one numeric value in the response is 64 or big.
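As a sketch of how a client might consume responses under the proposed (not yet implemented) bigint_as_string=1 flag: large integers would arrive as JSON strings, and the client could promote them back to exact numeric values on its side. The helper below is hypothetical, shown in Java for concreteness:

    import java.math.BigInteger;

    public final class BigintAsString {
        static Number fromWire(String value) {
            BigInteger big = new BigInteger(value);
            // Fits in a signed long (bitLength excludes the sign bit)?
            // Return a Long; otherwise keep the exact BigInteger.
            return big.bitLength() <= 63 ? Long.valueOf(big.longValueExact()) : big;
        }

        public static void main(String[] args) {
            System.out.println(fromWire("42"));                   // 42 (a Long)
            System.out.println(fromWire("18446744073709551615")); // exact BigInteger
        }
    }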

I've been thinking a lot about this, and I fear that sending numbers above the interop range for JSON could cause problems for some implementations.

From the JSON spec:

   This specification allows implementations to set limits on the range
   and precision of numbers accepted.  Since software that implements
   IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
   generally available and widely used, good interoperability can be
   achieved by implementations that expect no more precision or range
   than these provide, in the sense that implementations will
   approximate JSON numbers within the expected precision.  A JSON
   number such as 1E400 or 3.141592653589793238462643383279 may indicate
   potential interoperability problems, since it suggests that the
   software that created it expects receiving software to have greater
   capabilities for numeric magnitude and precision than is widely
   available.

   Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.
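
A small Java illustration of the interop range quoted above: integers outside [-(2^53)+1, (2^53)-1] are no longer exactly representable as IEEE 754 doubles, which is all a plain JSON consumer such as JavaScript's JSON.parse has to work with:

    public final class InteropRange {
        public static void main(String[] args) {
            long maxSafe = (1L << 53) - 1;  // 9007199254740991, still exact
            System.out.println((long) (double) maxSafe == maxSafe); // true
            // One step past the safe range: 2^53 and 2^53 + 1 collapse
            // to the same double value.
            System.out.println((double) (1L << 53) == (double) ((1L << 53) + 1)); // true
        }
    }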

For instance, if JavaScript receives a value outside that range, it can no longer guarantee arithmetic precision (see the MDN docs).
Furthermore, if JS tries to parse JSON containing a bigint value, it will round it without warning the user, causing a lot of pain to debug.

Given that a custom serializer would introduce significant overhead, and it should be the user's choice whether to accept that, I agree with @ezimuel that introducing a query parameter such as bigint_as_string, which instructs Elasticsearch on how those values are being sent and read, could be a good tradeoff.
By doing so, we can guarantee interoperability and still support all the use cases that need bigint as numeric values.

The JS client could use this query parameter as follows:
Send it by default on all requests, so we are sure not to lose precision, and let the user decide how to handle the bigint values, but also give users the ability to disable this default and use a custom serializer to handle BigInt automatically. (This cannot be the default for the JS client for two reasons: the JSON bigint parser is significantly slower, and the number and BigInt types are not compatible, so the user must know what they are doing.)

Unfortunately, this is not only a JS problem; the same issue occurs in other languages. For example, if you use jq, any value above 1.79E+308 (the largest finite double) will be clamped to 1.79E+308:

$ echo '{"n":1.7976931348623157e+310}' | jq .

{
  "n": 1.7976931348623157e+308
}

By supporting something like bigint_as_string, we delegate this choice to the user, who can make it based on the platforms they need to support.

@ezimuel @delvedor Thanks a lot for the comments and explanations. We've discussed this and decided that we should find a systematic way for the REST layer to handle big integers (and big longs as well), and that this should not be part of this issue/PR. I have created a separate issue for this and put your ideas there.

For 64-bit unsigned integers, we have decided to proceed with our current plan to return Long/BigInteger.
