Describe the feature:
Currently in master, the correct way to access a doc value is to do a check on size ( see #32207) to avoid an exception when dealing with missing values: doc['field'].size() > 0
However our examples do not use this form (though the docs mentions that in a paragraph).
Without this check, a document with missing fields can cause a working Painless script to throw an exception which is unexpected.
As discussed in the search/aggs meeting it would be desirable to have an intuitive, short and safe access to the doc value regardless of the existence of the value.
Maybe related to #25913 /cc @nik9000
Pinging @elastic/es-core-infra
/cc @elastic/es-search-aggs
There are 2 types of existence checks that need to be supported. First is whether a mapping exists. This is what checking for a field in doc
does today. Second is whether the current document has any documents. This is what the size() check does. What changed in #32207 is what happens when asking for .value
when no values exist.
it would be desirable to have an intuitive, short and safe access to the doc value regardless of the existence of the value.
What is unintuitive about checking size? It's like doing hasNext()
on an iterator before you can call next()
.
Also note the choice of adopting size() checks as the preferred way to check for missing values is indeed related to #25913. If we change the return value per document (eg return null for some documents), we can't lazily cache the return value, which is the plan to do only one hash lookup regardless of the number of documents the script is run on. The returned object from doc['field']
is already advanced externally as the documents are evaluated by scripts.
I'll present my reasons below; the group options (which I might not remember) is in the meeting recording.
What is unintuitive about checking size? It's like doing hasNext() on an iterator before you can call next().
Both the name (naming is hard!) and the fact that it has to be called in the first place.
iterator
and size
are concepts associated with collections thus multiple values not a single value, which is what we advertise doc values to be through our examples.
Since the intent is to check whether the value exists, something akin to Optional
/isPresent
would have conveyed the intent better imo however that comes with its own "bureaucracy".
Naming aside it's the safety aspect that concerns me.
I'm arguing that making the call in first place is problematic; we make it difficult for folks to do the right thing: if one doesn't check the size (just like we do not through-out _all_ our examples), she will end up with an exception _if_ the doc value field is missing.
To me this is trappy and unpredictable because a query can work fine until a doc with a missing value is indexed and then will fail; fixing it requires then _every_ doc value access becomes at least a tertiary expression:
doc['field'].value
-> doc['field'].size() > 0 ? doc['field'].value : null/0/-1
That's a lot of boiler-plate code _per_ doc value; in case of multiple doc values (ex: a+b*c) the script becomes a mess.
Frankly returning null
(despite the penalty of auto-boxing/lack of caching) would be a better compromise since it's safer.
To use the iterator
example, I'd rather we have a for-loop
instead of the hasNext/next
dance.
A number of ideas has been thrown around to improve the situation:
doc['field'].getOrDefault(default)/.valueOrDefault(default)
Have a default value similar to Map
(@nik9000's idea).
doc['field']?.value
This goes back to the null
discussion - the issue I believe is that Painless cannot do (yet) rewriting of a query to apply the Elvis operator correctly.
getOrDefault
to provide a wrapping method:docValue(doc, 'field')
or doc.value('field')
Again, I might have missed a few, @nik9000 ?
Not sure how feasible these are though. /cc @jdconrad
Thanks,
Probably a bad idea, but what about removing value
entirely? These things are lists so maybe we should force people to think of them that way? I had been suggesting valueOrDefault
but maybe folks really should just treat them like lists and do def f = doc['f']; return f.empty ? 0 : f.get(0);
.
This goes back to the
null
discussion - the issue I believe is that Painless cannot do (yet) rewriting of a query to apply the Elvis operator correctly.
Painless has groovy's safe dereference syntax and the elvis operator. So if doc['field']
returned null when a field isn't mapped and .value
returned null when the document doesn't have a value for that index then doc['field']?.value?:0
could be read as "get the first value of field
if there is one on this document but if there isn't, return 0". At this point, ?.
and ?:
only do things with null
.
I'm fine keeping the doc['field']
throws an exception if the field is unmapped behavior. I think it is a useful warning for the common case - that folks mistype field
. I think it is less common that folks want to query across indices that don't have the field. I like that adding the containsKey
check forces you to think about what you should do for documents that don't have any notion of that field.
The isEmpty
check paired with value
is what I don't like. Well, mostly what I don't like is that our docs don't include the isEmpty
check. I've always thought value
was confusing. That didn't stop me from using it when I knew that that field only ever had one value.
I'm in favor of removing .value
, but I think it is a huge change for users, as it has been around forever in scripts. I would be in favor of valueOrDefault
for this reason, but I could go either way.
TL;DR:
.value
(if I understand what that means) will break BC.Prompted by a Discuss issue, I'd like to improve the public documentation around this issue, specifically in the Script
sections of various individual Metrics Agg pages. The accessors there are typically shown "naively" (neglecting the behavior where a doc is missing the field, or when a doc has multiple values for the field), e.g.
"script" : {"source" : "doc.grade.value" }
My initial intention was to propose an additional section below each of these that mentions the error returned by that query when a doc is missing the value, and show a guarded form that matches the error, e.g.
"script" : {
"source" : "if (doc['grade'].size()==0) { return null; } return doc.age.value;"
}
The Discuss issue mentioned that omitting .value
remediated the problem, and I was able to confirm this - except that it has the consequence of causing Metrics aggs to behave differently for documents with multiple values on the field.
In 7.4, using .value
causes only the first value of the field in each doc to be included in the calculation of the metric, where omitting .value
, e.g. doc.grade
includes all values from each doc in the metric calculation.
If the implementation is changed so that doc.<field>
now only returns the first value, extant aggregation scripts may be disrupted.
To be honest, though, I wonder how many people use doc.<field>
_intentionally_ to get all values versus accidentally, .e.g. like the above user they discovered dropping .value
made the missing value error go away.
Regarding possible changes, IMO:
Regarding possible current doc improvements, please advise broad strokes what would be acceptable. I want to say links to documentation for accessing fields in scripts would be helpful, but then I see _that_ doc doesn't have any references to dot operator (doc.field
) access, nor does it distinguish between the behaviors of doc.field
and doc.field.value
where there is an array of values.
@GlenRSmith I completely agree this is a gap in the documentation. Common context parameters such as doc and source should at the very least have entire sections with simple examples dedicated to each. I was considering adding these sections to the Painless guide. Behavioral changes are obviously more difficult, but if it's the right thing to do we should come up with a plan for appropriate deprecations. I think the problem with obscurity is largely that users may not be aware that doc['field']
(or doc.field
) is actually returning what amounts to a List. .value
was likely added a shortcut for users since most fields end up being single-valued so it wasn't as strange to constantly have to do doc['field'][0]
.
This is covered by the planned fields api for Painless.
Most helpful comment
I'm in favor of removing
.value
, but I think it is a huge change for users, as it has been around forever in scripts. I would be in favor ofvalueOrDefault
for this reason, but I could go either way.