As a follow-up to discussions in #17006, we agreed in FixitFriday to add support for 64-bit unsigned integers. These numbers will support accurate term and range queries, but aggregations will work on the nearest double value.
Pinging @elastic/es-search-aggs
Just adding this note here for reference.
We ran into a use case for this in https://github.com/elastic/ecs/pull/673. The Windows registry only has a concept of unsigned numbers, so for u64 values greater than 2^63 we saw two paths: either cast to s64, which is hard to enforce and too easy to get wrong, or store only the string representation. Since we have little need for math, we opted for a keyword string field in the interim.
Following up on @rw-access's comment, there are a fair number of data fields coming from the Elastic Endpoint that we would like to store as u64 if possible. Is this issue being actively worked on?
Same problem here during processing of network performance data from routers as part of a network management system. This data contains counters (e.g. received octets) which are 64-bit unsigned values. It would be great if we could store them properly.
We, within the search team, had a discussion about the following points:
I. How to represent unsigned long internally, in JSON, and in the Elasticsearch clients for:
Several options here are:
II. Should we accept mixing longs and unsigned longs when sorting on a field?
- Do we need this mixed sorting on longs and unsigned longs?
Option 3 is a non-starter for any use cases I have worked on. The most common scenario I see is related to metrics which are counters. Besides the network equipment mentioned by @ktetzlaff, a significant amount of server and application monitoring data consists of unsigned 64-bit counters. For example, this is the package that Metricbeat uses to gather disk I/O stats (https://github.com/shirou/gopsutil/blob/master/disk/disk.go#L14), and as you can see there are a number of `uint64` values. Beats actually sends these to Elasticsearch as a `long`, which is technically wrong, but to date simply unavoidable. So this feature is needed to "fix" many Beats modules as well.
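To make the problem concrete, here is a minimal Java sketch (mine, not from the thread) of what happens when such a counter passes 2^63 and is reinterpreted as a signed `long`:

```java
import java.math.BigInteger;

public class UnsignedCounterDemo {
    public static void main(String[] args) {
        // A disk I/O counter that has passed 2^63, as an OS might report it.
        BigInteger counter = BigInteger.ONE.shiftLeft(63).add(BigInteger.valueOf(42));

        // Reinterpreting the same 64 bits as a signed long flips the sign,
        // which is what happens today when such values are indexed as `long`.
        long asSignedLong = counter.longValue();
        System.out.println(asSignedLong);                        // -9223372036854775766
        System.out.println(Long.toUnsignedString(asSignedLong)); // 9223372036854775850
    }
}
```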
Given that counters will make up the majority of unsigned values, it is important (mandatory, really) that at least the derivative aggregation works well and accurately. Whether that is best achieved with option 1 or 2... I would say that 2 is probably preferred. However, if 1 is the only way to get the necessary support, and it allows for `derivative` aggregations, I could live with it.
I am in the same situation as @robcowart: option 3 would not achieve anything for my use cases. The main reason I want 64-bit integers is that metric values are sometimes greater than the maximum the current data types offer (mostly long-running tasks, counted in milliseconds).
I think we should eliminate option 1 from the picture. It's confusing to rely on a convention that requires translating a signed long. Option 2 is more natural since we already have an abstraction for numerics in scripts and sort values. They return a `Number` and the user can choose the resolution they need when consuming the value. Unsigned longs could, for instance, return a `BigInteger` for values greater than `Long.MAX_VALUE`.
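As a rough sketch of what option 2 could mean on the consuming side, assuming sort values and script fields keep returning `Number` (the `asNumber` helper below is hypothetical, not an Elasticsearch API):

```java
import java.math.BigInteger;

public class UnsignedLongValueDemo {
    // Hypothetical helper: widen an unsigned_long doc value (stored as raw bits)
    // to the smallest java.lang.Number that represents it exactly.
    static Number asNumber(long rawBits) {
        if (rawBits >= 0) {
            return rawBits; // fits in a signed long, return a Long
        }
        // the unsigned value is >= 2^63: promote to BigInteger
        return BigInteger.valueOf(rawBits).add(BigInteger.ONE.shiftLeft(64));
    }

    public static void main(String[] args) {
        Number small = asNumber(42L);
        Number big = asNumber(Long.parseUnsignedLong("18446744073709551615")); // 2^64 - 1
        System.out.println(small.getClass().getSimpleName() + " " + small); // Long 42
        System.out.println(big.getClass().getSimpleName() + " " + big);     // BigInteger 18446744073709551615
    }
}
```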
@elastic/es-clients would you be able to comment on the proposal? The JSON specification accepts numbers without restrictions, so we were wondering if we need to do anything special to ease the usage of these 64-bit numbers in the clients.
Thanks for comments everyone, very valuable.
@jimczi I have chatted with the es-clients team; all the clients (PHP, Perl, Ruby, Python, .NET) support BigInteger except JavaScript. Also, some clients on 32-bit platforms (which should be quite outdated) don't support BigInteger, e.g. PHP and Ruby.
For JavaScript, the language does support BigInts, but the native JSON parser does not. This should not be a blocker for us, though: the es-clients team is willing to write a custom JSON parser to handle this case.
The one thing left is to confirm with the Kibana team that they can handle 64-bit integers.
Update:
I have chatted with the Kibana team, and Kibana has problems with handling big integers (relevant issue: https://github.com/elastic/kibana/issues/40183), and this is not going to be fixed any time soon.
But considering that currently Kibana also has issues with large longs, and citing @timroes "I don't think adding a new type to the list that will cause exactly the same issues will be a problem", I guess we can still proceed with our option 2.
@jimczi what do you think?
Updated option 2:
Just to confirm: are we planning a maximum of a 64-bit `unsigned long`, or an arbitrary-precision `BigInteger`? The implications for the changes are the same, but in terms of (de)serialization performance it's good to know what to optimize for.
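For what it's worth, a small Java sketch of the difference between the two scopes: a value capped at 64 bits can go through a fixed-width unsigned parse, while arbitrary precision needs `BigInteger`:

```java
import java.math.BigInteger;

public class ParsePathsDemo {
    public static void main(String[] args) {
        String max = "18446744073709551615"; // 2^64 - 1, the largest unsigned long

        // If the field is capped at 64 bits, clients can use a fixed-width parse...
        long packed = Long.parseUnsignedLong(max);
        System.out.println(Long.toUnsignedString(packed)); // 18446744073709551615

        // ...and anything larger is rejected rather than silently accepted:
        try {
            Long.parseUnsignedLong("18446744073709551616"); // 2^64
        } catch (NumberFormatException e) {
            System.out.println("out of range for unsigned long");
        }

        // Arbitrary precision would instead go through BigInteger, which has
        // no upper bound but is slower to (de)serialize.
        BigInteger unbounded = new BigInteger("18446744073709551616");
        System.out.println(unbounded);
    }
}
```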
There are some potential client breaking changes.
JavaScript client: it will need a dedicated generic serializer, which is going to be slower for everything. The workaround here is that the default WON'T support ulong/bigints OOTB, but the client can be injected with a serializer that supports them, either globally or per request.
Java HLRC: the Java HLRC relies heavily on boxing, e.g. the `to` and `from` on `RangeQueryBuilder` are `Object`s, so it is future proof 👍 (at the cost of boxing). Not sure if this is the case everywhere the change is intended to be supported.
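For reference, a small sketch against the HLRC's `RangeQueryBuilder`, whose bounds are plain `Object`s; whether the server-side field type accepts `BigInteger` bounds is exactly what this issue would add, so treat this as illustrative only:

```java
import java.math.BigInteger;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.RangeQueryBuilder;

public class HlrcRangeSketch {
    public static void main(String[] args) {
        // from(Object)/to(Object) accept any Object, so a BigInteger bound
        // compiles without any client-side change...
        RangeQueryBuilder query = QueryBuilders.rangeQuery("counter")
                .from(new BigInteger("9223372036854775808"))  // 2^63
                .to(new BigInteger("18446744073709551615"));  // 2^64 - 1

        // ...the open question is only how the server handles such bounds.
        System.out.println(query.from() + " .. " + query.to());
    }
}
```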
.NET NEST client: in most cases we have dedicated types and interfaces, e.g. the range query is backed by an interface `IRangeQuery` with implementations like `LongRangeQuery` and `DateRangeQuery` that strongly type the `to`/`from` parameters. We can extend this with e.g. a `BigIntegerRangeQuery` quite easily. In some cases (especially responses) that are now capped as `long` or `double`, changing that to `BigInteger` or `BigDecimal` is technically a breaking change.
I suggest we proceed with option 2 and:
- … (`long` implicitly converting to `BigInteger` and back might be good enough?)
- … 8.0 comes around.

PHP client: it supports 64-bit integers but not unsigned. We can support `BigInteger` using the GMP extension of PHP (not enabled by default in the language).
Perl: it supports 64-bit integers by default and BigInteger via `bigint`.
Because some clients may need to support 64-bit and `BigInteger` values in different ways, I suggest providing some facilitation, for instance (a sketch follows this list):
- `bigint_as_string=1` in the query string (0 by default) to return 64-bit and `BigInteger` values as strings; this will prevent the JSON decoder from throwing errors if a numeric value is larger than 32 bits.
- a `numeric_type` field in the response that alerts the client that `64` (bit) or `big` (as `BigInteger`) values are in use. This alert will not have a `32` (bit) value, as that is the default. The alert will be triggered if at least one numeric value in the response is `64` or `big`.
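To illustrate why the string form helps, a small Java sketch (emulating a double-only decoder such as JavaScript's or jq's; not from the thread):

```java
import java.math.BigInteger;

public class BigintAsStringDemo {
    public static void main(String[] args) {
        String value = "18446744073709551615"; // 2^64 - 1, e.g. a wrapped counter

        // A decoder that only has IEEE 754 doubles silently rounds the numeric form:
        double rounded = Double.parseDouble(value);
        System.out.printf("%.0f%n", rounded); // 18446744073709551616

        // The string form survives transport untouched and can be promoted to an
        // arbitrary-precision type on the client if needed:
        BigInteger exact = new BigInteger(value);
        System.out.println(exact); // 18446744073709551615
    }
}
```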
I've been thinking a lot about this, and I fear that sending numbers above the interop range for JSON could cause problems for some implementations.
From the JSON spec:
This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.
Note that when such software is used, numbers that are integers and
are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.
For instance, if JavaScript gets a value outside that range, it can no longer guarantee arithmetic precision (MDN docs).
Furthermore, if JS tries to parse a JSON document with a bigint value, it will round it without warning the user, causing a lot of pain to debug.
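To make the 2^53 limit concrete, a small Java sketch of the rounding an IEEE 754 double (and therefore a JavaScript number) applies to integers just outside that range:

```java
public class DoublePrecisionDemo {
    public static void main(String[] args) {
        long maxSafe = (1L << 53) - 1; // 9007199254740991, the edge of the interop range

        // Inside the range, every integer maps to a distinct double.
        System.out.println((double) maxSafe == (double) (maxSafe - 1)); // false

        // One past the range, neighbouring integers collapse onto the same double,
        // which is the silent rounding described above.
        System.out.println((double) (maxSafe + 1) == (double) (maxSafe + 2)); // true
    }
}
```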
Given that a custom serializer will introduce significant overhead, and it should be the user's choice whether to accept that, I agree with @ezimuel that introducing a query parameter such as `bigint_as_string`, which would instruct Elasticsearch on how those values are sent and read, could be a good tradeoff.
By doing so, we can guarantee interoperability and still support all the use cases that need bigint as numeric values.
The JS client could use this query parameter as follows:
Send it by default with all requests, so we are sure not to lose precision, and let the user decide how to handle the bigint values; but also give users the ability to disable this default and use a custom serializer that handles BigInt automatically. (This cannot be the default for the JS client for two reasons: the JSON bigint parser is significantly slower, and the number and bigint types are not compatible, so the user must know what they are doing.)
Unfortunately, this is not a JS-only problem; the same issue occurs in other languages. For example, if you use `jq`, any value above `1.79E+308` will be rounded to `1.79E+308`.
```
$ echo '{"n":1.7976931348623157e+310}' | jq .
{
  "n": 1.7976931348623157e+308
}
```
By supporting something like `bigint_as_string`, we delegate this choice to the user, who can make it based on the platform they need to support.
@ezimuel @delvedor Thanks a lot for the comments and explanation. We've discussed this and decided that we should find a systematic way for the REST layer to handle big integers (and big longs), and that this should not be part of this issue/PR. I have created a separate issue for this and put your ideas there.
For 64-bit unsigned integers, we have decided to proceed with our current plan to return Long/BigInteger.