I don't like the fact that the coerce option combines a useful feature, the ability to convert strings to numbers (which most APIs are expected to do anyway), with a dangerous one: the truncation (it does not even round) of floating-point numbers when they are indexed into a byte/short/integer/long field.
I think we should remove this option, accept strings in all cases, and require users to use an ingest processor if they want to ignore the decimal part of their floating-point numbers. Note that e.g. "1.0" would still be accepted in an integer field since its decimal part is equal to zero.
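To make this concrete, here is a sketch of what indexing could look like under the proposed semantics (the index and field names are made up, and this illustrates the proposal, not the current behaviour):

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_int": {
          "type": "integer",
          "coerce": true
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "my_int": "5.0"
}

PUT test/_doc/2
{
  "my_int": "5.1"
}

The first document would be accepted since the decimal part of "5.0" is zero, while the second would be rejected with a mapping error instead of being silently truncated to 5.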
We had a brief discussion about it in the search meeting and there was consensus that enabling truncation by default can be trappy. We discussed the following alternatives:
- keep the coerce option but make it only about accepting strings

We should also try to gather feedback about use-cases for truncating, if any, and what users should migrate to instead (the convert processor?).
Discussed in Fix it Friday and we agreed that we should keep the coerce option and make it only about accepting strings. The automatic truncation is not a good option and should be replaced with an ingest processor that does the right thing (floor/ceil/round).
Hi, I'd like to look into this as my first PR. Is there a good place to start looking in the code to address this?
@Nablyy You should look at NumberFieldMapper, and try to disallow floating-point numbers that have a non-zero decimal part on integer types in 7.x indices.
@Nablyy If you're not working on this, can I take this?
@liketic If you want, take this.
Hello everyone,
If no one is working on this anymore, I would like to make a PR #27219 :) Thanks in advance for all your reviews and comments.
@jpountz Hi, I am trying to understand the issue and I think I need some clarification. When you say make coerce all about accepting strings, do you mean that from now on all it should mean to a user is that strings are converted into the target type as long as the value itself does not need to be manipulated, and that strings which cannot be converted will now simply result in an error instead of being truncated?
i.e. do you mean the change should be such that "5.0" --> Integer succeeds (coerce, in this context, would actually coerce the value to 5), but "5.1" --> Integer fails (coerce, in this context, means nothing, because "5.1" can't be coerced to an integer without manipulating it, and so should result in an error)?
I think this is, frankly, a terrible idea. It is perfectly understandable what happens when you convert a float to an int: the float is truncated in every sensible language.
Why conflate this with questions about rounding? In what language does int(float) result in a rounded number?
I find this feature eminently useful because it allows me to define int(float) in my indexing config. Please leave it in. This is a documentation issue at worst.
@akotlar Agreed casting may be useful. The thing I dislike is that it happens implicitly.
cc @elastic/es-search-aggs
@jpountz Sorry, maybe I misunderstood. Does it not only happen when coerce: true? If so, isn't that explicit? I would agree if it did something like rounding implicitly; when coercing from float to int I expect truncation.
@akotlar Could you give more details about the kind of fields that you index and want to truncate?
Absolutely. I work with genetic data, indexing all mutations, and a lot of annotated information about those mutations, found in say 1000 people.
One of those annotations comes from dbSNP. As it arrives from the UCSC Genome Browser, there is a field, alleleNs, which is ostensibly an integer count, because it is the number of chromosomes (most people have 2 copies of each chromosome) averaged over the number of people. Unfortunately, on rare occasion someone will have 3 or 1 copies of a chromosome, so the average ends up being a float. This happens very rarely, and I mostly don't care about being off by < 1 once in a great while.
In general, genetic data can be really inconsistent, and so say 1/10000 entries will be badly formatted. These I want to cast to the majority type.
The way I wrote the program that handles all of this stuff (and which uses ES), I have attempted to place as much of my API as possible into YAML configs, so that one program could accommodate many permutations of this messy data. This maps really well to ES, since the search layer can be largely configured from a single JSON/YAML. I don't want to write programming logic or custom functions per annotation feature, since there are on the order of a thousand possible features, and millions of possibly desirable combinations of these (and the feature space grows monthly, more or less). That is a good use case for config files.
So I realize that I could use an ingestion pipeline, but for simple transformations like this, I vastly prefer to just use an already existing mechanism (the mapping configs ES supports) that can handle this in a perfectly understandable way.
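For reference, the kind of mapping this use case relies on presumably looks something like the sketch below (the alleleNs field name comes from the comment above; the index name and the rest of the mapping are assumptions):

PUT variants
{
  "mappings": {
    "_doc": {
      "properties": {
        "alleleNs": {
          "type": "integer",
          "coerce": true
        }
      }
    }
  }
}

With coerce enabled, the occasional fractional alleleNs value is truncated at index time without any extra pipeline configuration.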
OK, I'm adding back the discuss label to make sure that we take your feedback into account.
On my end I would still prefer that we make this change, though I understand there could be disagreement. I also think there might be work to do on the integration of the index APIs with pipelines: for instance, the mapping option has the benefit that you can never skip it, while it's easy to just forget to pass the name of the ingest pipeline in a call to the index APIs.
Sounds good
We discussed this in FixitFriday and agreed to still remove support for coerce despite the latest arguments. For the record, here is an example of how to handle rounding on top of Elasticsearch thanks to an ingest pipeline. Another option would be to do the same on the client side before sending data to Elasticsearch.
PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_long": {
          "type": "long",
          "coerce": false
        }
      }
    }
  }
}
PUT _ingest/pipeline/round_my_long
{
  "description": "Round the my_long field to the nearest integer",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx[params.field] = (long) Math.round(ctx[params.field])",
        "params": {
          "field": "my_long"
        }
      }
    }
  ]
}
PUT test/_doc/1?pipeline=round_my_long
{
  "my_long": 3.6
}
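To check the pipeline without indexing anything, the simulate API can be used (a small sketch reusing the pipeline above):

POST _ingest/pipeline/round_my_long/_simulate
{
  "docs": [
    {
      "_source": {
        "my_long": 3.6
      }
    }
  ]
}

The response should show my_long rewritten to 4 before any document is actually stored.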
My position hasn't changed; it is clear that this is a needless complication.
Thanks for the ingestion pipeline tip.
I am wondering if we still want to coerce values during queries. Currently we truncate the decimal part in range queries on integer data types. I assume this is useful when a query runs across different indices where the same field name can have different field types.
@mayya-sharipova Indeed we want to keep doing the right thing when querying integer field types with ranges on floating-point numbers. (We're not always truncating, we sometimes round up, sometimes down, so that querying integer fields with floating-point numbers always gives the same answer as if the field had been indexed as a floating-point number too.)
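As a hedged illustration of that behaviour (reusing the test index and my_long field from the example above):

GET test/_search
{
  "query": {
    "range": {
      "my_long": {
        "gte": 2.5,
        "lte": 7.5
      }
    }
  }
}

On a long field this behaves like gte: 3 and lte: 7: the lower bound is rounded up and the upper bound rounded down, so the query matches the same documents it would if the field had been indexed as a floating-point number.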
I think this is a good initiative. However, I am concerned about the breaking change when a field is mapped as an integer type but _source contains values with a floating-point part. Users might rely on this feature; a re-index would suddenly fail.
I think such a change should be spread over at least 2 major release cycles: one to flip the default so that it is strict by default, and one to remove the option completely.
I am not fully sure whether accepting .0 is good or bad; unsigned_long, which lacks coerce by design, errors on .0. If .0 gets accepted, unsigned_long should be relaxed too. It's probably better to be fully strict, which is also more compliant with e.g. JSON.
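For example, with an unsigned_long field today, a request like the following sketch is rejected even though the value has no fractional part (the index and field names are made up):

PUT u_test
{
  "mappings": {
    "properties": {
      "my_unsigned": {
        "type": "unsigned_long"
      }
    }
  }
}

PUT u_test/_doc/1
{
  "my_unsigned": 1.0
}

If .0 were accepted for integer types, this restriction on unsigned_long would need to be relaxed to match.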
We've discussed this issue today in the Search meeting, and decided for now to postpone implementation of this change. One reason is that we still don't have an appealing way to index numbers with a floating-point part as integers; an ingest pipeline can be too complex for some use cases.
Another reason is runtime fields. If we change the behaviour of the coerce option, it may break runtime fields that were relying on the original behaviour.
As runtime fields are shaping up, they may provide better ways to extract values depending on the index version and to index documents from source instead of using ingest pipelines.
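A rough sketch of what that could look like with a search-time runtime field (runtime_mappings and emit are the runtime-fields API; reading the raw value through params._source is an assumption about how it would be reached, and is typically slow):

GET test/_search
{
  "runtime_mappings": {
    "my_long_rounded": {
      "type": "long",
      "script": {
        "source": "emit(Math.round((double) params._source['my_long']))"
      }
    }
  },
  "fields": ["my_long_rounded"]
}

This would expose a rounded long view of the raw value at search time without requiring an ingest pipeline at index time.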