Elasticsearch: Ingest processor cannot access _id on autogenerated id

Created on 12 Apr 2019 · 8 Comments · Source: elastic/elasticsearch

I think this is merely a documentation issue for now. Found at https://discuss.elastic.co/t/accessing-id-in-ingest-pipeline/176503

A document that is indexed with an autogenerated ID has no way of accessing its _id in an ingest pipeline, because the ID does not exist yet when the pipeline runs. However, no error is raised, and the user might simply not know the order of operations.

Elasticsearch version (bin/elasticsearch --version): 7.0.0

Steps to reproduce:

PUT _ingest/pipeline/my_pipeline
{
    "processors": [
      {
        "set" : {
          "field" : "id",
          "value" : "{{_id}}"
        }
      }
    ]
}

DELETE foo

POST foo/_doc?pipeline=my_pipeline&refresh=true
{"foo":"rab"}

# id field will be empty
GET foo/_search

:CorFeatureIngest >docs >enhancement CorFeatures Docs

spinscale

👍5

All 8 comments

Pinging @elastic/es-core-features

elasticmachine on 15 Apr 2019

Hi, may I ask whether a decision has been made on when this issue will be addressed? I would also like to have a field that holds a copy of the generated ID.

chigix on 27 May 2019

Agreed, accessing the _id in a pipeline for documents with autogenerated ids leads to unexpected behaviour, so this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error when no _id is present.

martijnvg on 28 May 2019

We discussed this issue: if the id is missing and a pipeline uses {{_id}}, failing with a descriptive error is preferred over the current behaviour.

martijnvg on 30 May 2019

👍1
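
Until such an error exists on the server side, a client could approximate the proposed behaviour with a pre-flight check. The following is only a minimal sketch under stated assumptions: `pipeline_uses_id` and `check_before_indexing` are hypothetical helper names, not part of Elasticsearch or any client library, and the check only looks for Mustache templates such as {{_id}} in the pipeline definition (it does not inspect script processors).

```python
import re

# Matches a Mustache reference to the _id metadata field, e.g. {{_id}} or {{{_id}}}.
_ID_TEMPLATE = re.compile(r"\{\{\{?\s*_id\s*\}?\}\}")


def pipeline_uses_id(node):
    """Return True if any string inside the pipeline definition references {{_id}}."""
    if isinstance(node, str):
        return bool(_ID_TEMPLATE.search(node))
    if isinstance(node, dict):
        return any(pipeline_uses_id(value) for value in node.values())
    if isinstance(node, list):
        return any(pipeline_uses_id(item) for item in node)
    return False


def check_before_indexing(pipeline, doc_id):
    """Hypothetical client-side guard: fail loudly instead of storing an empty field."""
    if doc_id is None and pipeline_uses_id(pipeline):
        raise ValueError(
            "pipeline references {{_id}} but the document has no _id yet "
            "(autogenerated ids are assigned after ingest)"
        )


# The pipeline from the reproduction steps in this issue.
my_pipeline = {
    "processors": [
        {"set": {"field": "id", "value": "{{_id}}"}}
    ]
}

check_before_indexing(my_pipeline, doc_id="1")  # explicit id: passes silently
try:
    check_before_indexing(my_pipeline, doc_id=None)  # autogenerated id: rejected
except ValueError as err:
    print(err)
```

The guard runs entirely on the client, so it only helps callers that know which pipeline their request will use.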

[docs issue triage]

Leaving open. This is still relevant.

jrodewig on 7 Oct 2019

> Agreed, accessing the _id in a pipeline for documents with autogenerated ids leads to unexpected behaviour, so this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error when no _id is present.

I do not agree.

You should at least provide read-only access to the _id field in the pipeline.
We ingest about 1TB of logs per day from hundreds of different entities and we analyze those logs every night. Without read-only access to the _id field, we have to use the expensive scroll API. We cannot use the search_after feature, because duplicating the _id value into another field with doc_values enabled is a very slow operation.

I don't know if it's possible to run thousands of scrolls in parallel on tens of TB of data.

There is no elegant way to duplicate the id, even though https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-request-body.html#request-body-search-search-after advises: "Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort."

I could generate flake ids the way Elasticsearch does by developing a Flake Id Logstash plugin, but this would slow down indexing (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).

If I cannot duplicate _id as the official documentation advises, search_after is totally useless for me.

hlzhang on 3 Feb 2020
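
For reference, the client-side duplication discussed in the comment above could look roughly like the sketch below. It is an illustration under assumptions, not a recommended implementation: `bulk_lines` and the target field name `id` are hypothetical, and a random UUID stands in for Elasticsearch's own id generator. As the comment notes, supplying ids from the client may cost indexing speed.

```python
import json
import uuid


def bulk_lines(index, docs, id_field="id"):
    """Build an NDJSON _bulk body in which every document gets a client-generated
    _id that is also copied into `id_field` (a hypothetical field name)."""
    lines = []
    for doc in docs:
        doc_id = uuid.uuid4().hex  # assumption: stand-in for a flake-style id
        action = {"index": {"_index": index, "_id": doc_id}}
        source = dict(doc)          # copy so the caller's dict is not modified
        source[id_field] = doc_id   # duplicate of _id, usable as a sort tiebreaker
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The _bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


body = bulk_lines("foo", [{"foo": "rab"}, {"foo": "baz"}])
print(body)
```

The resulting string would be sent as the body of a `POST _bulk` request. If the duplicated field is mapped as a keyword, it has doc values by default.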

Maybe we can investigate generating the id prior to doing ingest. Currently generating an id happens after ingest has occurred.

martijnvg on 4 Feb 2020

👍3

This issue has been open for about a year now and nothing has happened to your documentation, which is clearly wrong!
I've implemented the set processor solution suggested in the documentation:

> Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort.

only to realize now that it was a waste of time, as it's never going to work. Why haven't you been able to update the documentation in about a year?