Elasticsearch: 5.0 painless feedback: it's a PAIN to get the source document

Created on 29 Nov 2016 · 11 Comments · Source: elastic/elasticsearch

I just wanted to log this as it caused me much frustration and isn't very consistent for the end user trying to write scripts...

1) To get the document as part of a pipeline inline script you would use ctx. Previously in transforms it was ctx._source.
2) To get the document as part of an update by query script you would use ctx._source. This hasn't changed since the previous versions.
3) Somewhat related, if you want to access the source doc and its properties in a pipeline processor (https://www.elastic.co/guide/en/elasticsearch/reference/master/accessing-data-in-pipelines.html), you just use _source or the property name.
4) For combining fields / accessing doc values you would use total += doc['goals'][i]; (https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-painless.html#_accessing_doc_values_from_painless).
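
To make the access patterns above concrete, here is a minimal JavaScript model of the shapes involved. It is not Painless and not Elasticsearch code: the document, its values, and the sum helper are hypothetical, and the only things taken from this issue are where the fields live in each case (ctx for a pipeline script, ctx._source for update by query, doc['field'] for doc values). The pipeline processor template case is not modelled.

```javascript
// Hypothetical document and metadata, used only for illustration.
const source = { name: "example", goals: [9, 27, 1] };
const metadata = { _id: "1" };

// Pipeline inline script: ctx is the document itself, so source fields
// sit at the top level next to metadata such as _id.
const ingestCtx = { ...metadata, ...source };

// Update-by-query script: the document is one level down, under ctx._source.
const updateCtx = { _source: source };

// Doc values: a lookup keyed by field name that returns a list of values.
const doc = { goals: [9, 27, 1] };

// Add up a list of numbers with an index loop, as in total += doc['goals'][i].
function sum(values) {
    let total = 0;
    for (let i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}

// The same calculation, reached through three different paths.
const totals = {
    ingest: sum(ingestCtx.goals),
    update: sum(updateCtx._source.goals),
    docValues: sum(doc["goals"]),
};
console.log(totals);
```

All three totals are 37 in this model; the only difference is the path each script has to take to reach the goals field.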

For the love of PAINLESS can we please standardize on one common name to get the source doc. This is just out of hand and very unintuitive. I prefer just to have doc be the common variable across all of these.

:CorInfrScripting >enhancement

niemyjski

👍3

Most helpful comment

Painless is very painful. I am looking to do something very simple: change every array-typed element in my JSON so that its key has "_nested_" prepended (so that I can do dynamic nested type mapping).

So: { arr: [1,2] } -> { _nested_arr: [1,2] }

In JavaScript, this was as simple as:

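(function transform(c) {
    // Snapshot the keys first so renaming entries does not disturb the loop.
    for (var keys = Object.keys(c), i = 0; i < keys.length; i++) {
        var key = keys[i],
            val = c[key];

        if (Array.isArray(val)) {
            // Move the array under a "_nested_"-prefixed key, then recurse into it.
            c["_nested_" + key] = val;
            delete c[key];
            transform(val);
        } else if (val !== null && typeof val === "object") {
            // typeof null is "object", so skip null before recursing.
            transform(val);
        }
    }
})(ctx);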

But the painless documentation is very confusing!
a. What is the type of ctx/ctx._source (sounds like map), and which one should I be referring to?
b. How does one iterate through ctx/ctx._source?
c. How does one delete a key on ctx/ctx._source?
d. Is painless pass by reference, or pass by value (will updating an element in a map update its parent)?

In javascript this would be much easier.

benorgera on 27 Jul 2018

👍2

All 11 comments

Hi @niemyjski

Re point 4, the doc[] indicates a completely different access pattern, so I think that should remain distinct from the other 3.

To get the document as part of a pipeline inline script you would use ctx.

Not sure what you mean here?

Somewhat related, if you want to access source doc and properties in a pipeline processor you just use _source or the property name.

This seems OK to me, where you can leave out the _source as a commonly used shortcut.

But I agree that in scripts we should try to make things more consistent.

@talevy @martijnvg what are your thoughts?

clintongormley on 30 Nov 2016

To get the document as part of a pipeline inline script you would use ctx. Previously in transforms it was ctx._source.

The case here is that you are not necessarily working with the same type of document as the Lucene context. Here, we are dealing with something known as a sourceAndMetadata object, which contains both the original _source fields and any other ingest metadata like _id, _type, etc. The reason for this skewed view into the originally sent source document is a unified retrieval strategy within the mustache scripts of ingest's field templating (independent of Painless). We can definitely revisit this to keep a more uniform view into the document, more similar to how the indexing stage sees it.

Somewhat related, if you want to access source doc and properties in a pipeline processor you just use _source or the property name.

Right, that is just a convenience scheme. Also, in this case we are not using Painless; this is a custom ingest scheme. I don't mean that it is exempt from consistency, just that it is not a Painless context.

Do you have anything to add about this, @martijnvg? I am not sure how we can change things within 5.x while keeping them backwards compatible, but I totally agree we should revisit this for changes in 6.0, maybe?

talevy on 30 Nov 2016
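
The convenience scheme discussed above, where a field template can name a field either directly or through _source, can be sketched roughly as follows. This is a JavaScript illustration only: resolveField and render are hypothetical helpers, not Elasticsearch's implementation, and the merged object is assumed to hold source fields and metadata side by side as described for sourceAndMetadata.

```javascript
// Hypothetical merged view: source fields and ingest metadata in one object.
const sourceAndMetadata = {
    _id: "1",
    _type: "doc",
    name: "example",
    user: { city: "Austin" },
};

// Resolve a dotted path. A leading "_source." is accepted and skipped,
// because the source fields already sit at the top level of the merged view.
function resolveField(view, path) {
    const parts = path.split(".");
    if (parts[0] === "_source") {
        parts.shift();
    }
    let current = view;
    for (const part of parts) {
        if (current === null || typeof current !== "object" || !(part in current)) {
            return undefined;
        }
        current = current[part];
    }
    return current;
}

// Replace every {{path}} placeholder with the resolved value (empty if missing).
function render(template, view) {
    return template.replace(/\{\{\s*([^}\s]+)\s*\}\}/g, (match, path) => {
        const value = resolveField(view, path);
        return value === undefined ? "" : String(value);
    });
}

console.log(render("{{name}} / {{_source.name}} / {{_id}}", sourceAndMetadata));
// prints "example / example / 1"
```

In this model {{name}} and {{_source.name}} resolve to the same value, while metadata such as {{_id}} is read from the same top-level object, which is the shortcut being described.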

@clintongormley point 4 is still accessing the raw document, so I think it should be included. In a pipeline inline script you can't do ctx._source; ctx itself is the document object. In a pipeline processor (not a script) you can use _source or the property name (e.g., {{name}}).

niemyjski on 30 Nov 2016

point 4 is still accessing the raw document so I think it should be included.

No it is not accessing the raw document, it is accessing the value from doc-values. So doc[] wouldn't be available in a pipeline or update script.

In a pipeline inline script you can't do ctx._source; ctx itself is the document object.

Ah right, that is unfortunate. Would be good to make this consistent

The case here is that you are not necessarily working with the same type of document as the Lucene context. Here, we are dealing with something known as a sourceAndMetadata object, which contains both the original _source fields and any other ingest metadata like _id, _type, etc. The reason for this skewed view into the originally sent source document is a unified retrieval strategy within the mustache scripts of ingest's field templating (independent of Painless).

I think it'd be clearer to be able to access ctx._source.some_field and ctx._id etc, while today it looks like it'd be ctx.some_field vs ctx._id? That seems wrong.

We can definitely revisit this to keep a more uniform view into the document, more similar to how the indexing stage sees it.

Yeah, although I don't see a clear path to changing this without breaking bwc.

clintongormley on 1 Dec 2016

Discussed in FixitFriday: agreed with @clintongormley's last comment that we should try to make ingest more consistent with other APIs. The bw compat looks challenging, however.

jpountz on 2 Dec 2016

In my 5.1.1 inline Painless script for terms agg values I had to use params._source. I was hoping for _source (which fails with 'Variable [_source] is not defined.'), and the docs seemed to indicate ctx._source (which fails with 'null_pointer_exception' at ctx.); the only reference I found was https://www.elastic.co/guide/en/elasticsearch/reference/5.1/modules-scripting-painless.html#_updating_fields_with_painless. I eventually pieced together params._source from the mailing list.

nezda on 4 Feb 2017

🎉1

@nezda Thanks for pointing this out. It's super confusing that inside inline scripts _source has to be accessed from the params object and not directly as _source.

elasticsearcher on 13 Jun 2017

👍2

This still needs to be documented now that contexts are done.

jdconrad on 13 Mar 2018

I would like to discuss creating consistency for input variables in contexts related to source, doc, and params.

jdconrad on 5 Dec 2019

There are a few things in this issue:
1) Consistent naming of source, moved to #52593. Backwards compatibility needs to be addressed systemically; we'll take that on in #52594.
2) source does not have good documentation, moved to #52600.
3) source vs doc is confusing: source is nested, doc is not nested.

doc is an accessor for fields in Lucene. Fields are flat; the . characters are just part of the field name.
source is a JSON object, which is nested. Keys may or may not have . in them, so we could not flatten source without introducing ambiguities.
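
The ambiguity is easy to reproduce. The sketch below is plain JavaScript, not Painless, and flatten is a hypothetical helper: it joins nested keys with dots so they look like flat field names, and two different source objects end up with the same flattened key.

```javascript
// Flatten a nested object into dotted keys, e.g. { a: { b: 1 } } -> { "a.b": 1 }.
function flatten(value, prefix = "", out = {}) {
    for (const [key, child] of Object.entries(value)) {
        const path = prefix === "" ? key : prefix + "." + key;
        if (child !== null && typeof child === "object" && !Array.isArray(child)) {
            // Nested object: keep descending and extend the dotted path.
            flatten(child, path, out);
        } else {
            // Leaf value (primitive, null, or array): store it under the full path.
            out[path] = child;
        }
    }
    return out;
}

// Two hypothetical sources that are different as JSON...
const nested = { user: { name: "a" } };
const dotted = { "user.name": "a" };

// ...but indistinguishable once flattened.
const flatNested = flatten(nested);
const flatDotted = flatten(dotted);
console.log(flatNested, flatDotted);

// A source holding both forms keeps only one value after flattening,
// because both entries map to the same "user.name" key.
const both = flatten({ user: { name: "nested" }, "user.name": "dotted" });
console.log(Object.keys(both));
```

Once flattened, there is no way to tell which of the two original shapes a key such as user.name came from, which is why source has to stay nested.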

The best option we have is to document this, as will happen in #52600.

If there are any other thoughts related to above, we'd love to hear them in the issues I mentioned.

Regarding the questions posed in this issue:

a. What is the type of ctx/ctx._source (sounds like map), and which one should I be referring to?

ctx._source is a Map at the top level that represents a JSON blob using Maps, Lists, and primitives. For an update script, you need to use ctx._source; for ingest, you'd use ctx directly.

b. How does one iterate through ctx/ctx._source?

If you know what your source is, you can use a for loop. Otherwise, iterate through the top-level map and use instanceof to determine the types of values, e.g. if (ctx._source['foo'] instanceof Map)...
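
As a rough illustration of that advice, and of the rename asked about earlier in this thread, here is a JavaScript sketch that models the source with Map and Array objects standing in for the Maps and Lists described above. It is not Painless: renameArrayKeys is a hypothetical helper, the sample data is made up, and the Painless equivalents of the set and delete calls used here would need to be checked against the Painless documentation.

```javascript
// Recursively prefix the key of every array-valued map entry with "_nested_".
function renameArrayKeys(node) {
    if (node instanceof Map) {
        // Copy the keys first so the map can be modified while looping.
        for (const key of Array.from(node.keys())) {
            const value = node.get(key);
            renameArrayKeys(value);
            if (Array.isArray(value)) {
                node.delete(key);
                node.set("_nested_" + key, value);
            }
        }
    } else if (Array.isArray(node)) {
        // List items keep their positions; only maps inside them are rewritten.
        for (const item of node) {
            renameArrayKeys(item);
        }
    }
    // Primitives (numbers, strings, booleans, null) need no work.
}

// Hypothetical source, equivalent to { arr: [1, 2], inner: { tags: ["x"] }, n: 3 }.
const inner = new Map([["tags", ["x"]]]);
const source = new Map([["arr", [1, 2]], ["inner", inner], ["n", 3]]);
renameArrayKeys(source);

console.log(Array.from(source.keys()));
console.log(Array.from(inner.keys()));
```

Because inner is the same object that source holds, the rename made while visiting it is visible through the parent; in this JavaScript model nested containers are shared, not copied.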