I think this is merely a documentation issue for now. Found at https://discuss.elastic.co/t/accessing-id-in-ingest-pipeline/176503
A document whose ID is autogenerated obviously has no way of accessing that ID in an ingest pipeline. However, no error is raised, and the user may simply not know the order of operations.
Elasticsearch version (bin/elasticsearch --version): 7.0.0
Steps to reproduce:
PUT _ingest/pipeline/my_pipeline
{
  "processors": [
    {
      "set" : {
        "field" : "id",
        "value" : "{{_id}}"
      }
    }
  ]
}
DELETE foo
POST foo/_doc?pipeline=my_pipeline&refresh=true
{"foo":"rab"}
# id field will be empty
GET foo/_search
Pinging @elastic/es-core-features
Hi, may I ask whether a decision has been made on when this issue will be addressed? I would also like to have a field that holds a copy of the generated ID.
Agreed, accessing the _id in a pipeline for documents with auto-generated IDs leads to unexpected behaviour, so this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error when no _id is present.
We discussed this issue and failing with a descriptive error is preferred over the current behaviour if the id is missing and a pipeline uses {{_id}}.
[docs issue triage]
Leaving open. This is still relevant.
"Agreed, accessing the _id in a pipeline for documents with auto generated ids leads to unexpected behaviour. So this needs to be documented, on top of that I'm leaning towards also throwing a descriptive error in the case there is no _id present."
Do not agree.
You should at least provide read-only access to the _id field in a pipeline.
We ingest about 1 TB of logs per day from hundreds of different entities and analyze those logs every night. Without read-only access to the _id field, we have to use the expensive scroll API. We cannot use the Search After feature because duplicating the _id value into another field with doc_values enabled is a very slow operation.
I don't know if it's possible to run thousands of scrolls in parallel on tens of TBs of data.
There is no elegant way to duplicate the id, even though https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-request-body.html#request-body-search-search-after says: "Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort."
I can generate a flake ID as Elasticsearch does by developing a Flake ID Logstash plugin, but this would slow down indexing (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).
If I cannot duplicate _id as the official documentation says, Search After is totally useless for me.
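For reference, the "client side" variant of that advice could look roughly like the sketch below. It is only an illustration under assumptions: the field name id_copy, the helper names, and the use of a random UUID as the id are made up for this example and are not taken from the documentation. It builds a _bulk request body in which every document gets an explicit _id and carries the same value in a regular field.

```python
import json
import uuid


def with_client_side_id(doc, id_field="id_copy"):
    """Return (doc_id, copy of doc) where the copy stores the id in id_field.

    Assumption: a random UUID is used as the id; id_field is a hypothetical name.
    """
    doc_id = uuid.uuid4().hex
    enriched = dict(doc)          # do not mutate the caller's document
    enriched[id_field] = doc_id   # duplicate the id into a regular field
    return doc_id, enriched


def bulk_body(index, docs, id_field="id_copy"):
    """Build a newline-delimited JSON payload for the _bulk API with explicit ids."""
    lines = []
    for doc in docs:
        doc_id, enriched = with_client_side_id(doc, id_field)
        # Action line: index the document under the client-generated id
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document, including the duplicated id
        lines.append(json.dumps(enriched))
    return "\n".join(lines) + "\n"  # the bulk payload must end with a newline


if __name__ == "__main__":
    print(bulk_body("foo", [{"foo": "rab"}]), end="")
```

Supplying ids from the client gives up the auto-generated id optimization described on the indexing speed page linked above, so this trades some indexing throughput for a usable tiebreaker field.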
Maybe we can investigate generating the id prior to doing ingest. Currently generating an id happens after ingest has occurred.
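To make the ordering concrete, here is a toy model of the two sequences. It is not Elasticsearch code: the function names are hypothetical, the template handling is a simplified stand-in for Mustache rendering (a missing variable renders as an empty string), and a random UUID stands in for the real id generator.

```python
import uuid


def set_processor(doc, metadata, field, template):
    """Toy 'set' processor: substitutes {{_id}}, rendering a missing id as ''."""
    rendered = template.replace("{{_id}}", metadata.get("_id") or "")
    result = dict(doc)
    result[field] = rendered
    return result


def index_current(doc):
    """Current order as described above: ingest first, id generated afterwards."""
    metadata = {"_id": None}                      # no id exists during ingest
    doc = set_processor(doc, metadata, "id", "{{_id}}")
    metadata["_id"] = uuid.uuid4().hex            # id assigned too late
    return metadata["_id"], doc


def index_proposed(doc):
    """Proposed order: generate the id first, then run the pipeline."""
    metadata = {"_id": uuid.uuid4().hex}
    doc = set_processor(doc, metadata, "id", "{{_id}}")
    return metadata["_id"], doc


if __name__ == "__main__":
    print(index_current({"foo": "rab"}))   # the "id" field is an empty string
    print(index_proposed({"foo": "rab"}))  # the "id" field matches the generated id
```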
This issue has been open for about a year now and nothing has happened to your documentation, which is clearly wrong!
I've implemented the suggested set processor solution from the documentation:
Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort.
only to realize now that it was a waste of time, as it is never going to work. Why haven't you been able to update the documentation in about a year?