Elasticsearch: Typeless parent/child

Created on 31 Aug 2016 · 34 Comments · Source: elastic/elasticsearch

Related to #15613

Types are going to be removed. Parent/child is tightly coupled to types, so that needs to change. Parent/child still needs to distinguish between parent and child documents, so the idea is that a new meta field type named _join will allow that. This new field would replace the _parent meta field. Each _join meta field would maintain an indexed field to distinguish between parent and child documents and a doc values field needed for the join operation.

Example of how to use typeless parent/child:

PUT /stackoverflow
{
  "mappings": {
    "_join": ["question-to-answers"]
  }
}

Indexing question document (parent):

PUT stackoverflow/1?join=question-to-answers
{
  "question" : "..."
}

Indexing answer document (child):

PUT stackoverflow/2?join=question-to-answers&parent=1
{
  "answer" : "..."
}

Besides the parent parameter, typeless parent/child also needs a join URL parameter. To avoid adding more feature-specific options to the transport and REST layers, meta fields should be completely isolated, from the REST layer to the mapping layer. This will result in cleaner code and allow parent/child to be moved to a module.

Adding answer-to-comments join field:

PUT /index1
{
  "mappings": {
    "_join": ["question-to-answers", "answer-to-comments"]
  }
}

Indexing question document (parent):

PUT stackoverflow/1?join=question-to-answers
{
  "question" : "..."
}

Indexing answer document (child):

PUT stackoverflow/2?join=question-to-answers,answer-to-comments&parent=1
{
  "answer" : "..."
}

Indexing comment document (grand child):

PUT stackoverflow/3?join=answer-to-comments&parent=2&routing=1
{
  "comment" : "..."
}
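
The grandchild request above carries routing=1 even though its parent is 2, because every document of a family has to be routed by the id of the root ancestor. A minimal Python sketch of that rule, assuming a hypothetical in-memory `parents` map from document id to parent id (this is an illustration, not an Elasticsearch API):

```python
from typing import Dict, Optional


def resolve_routing(doc_id: str, parents: Dict[str, Optional[str]]) -> str:
    """Walk up the parent chain and return the id of the root ancestor.

    `parents` is a hypothetical lookup (doc id -> parent id, None for roots);
    it only illustrates the rule and is not part of Elasticsearch.
    """
    seen = set()
    current = doc_id
    while parents.get(current) is not None:
        if current in seen:
            raise ValueError(f"cycle in parent chain at {current!r}")
        seen.add(current)
        current = parents[current]
    return current


# question 1 -> answer 2 -> comment 3, as in the requests above
parents = {"1": None, "2": "1", "3": "2"}
routing_for_comment = resolve_routing("3", parents)
```

Here `routing_for_comment` is the question id, which is the value passed as `routing` when indexing the comment.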

:Search/Search >breaking

martijnvg

❤3 👍1

Most helpful comment

The typeless parent-join has landed in master and 5.x.
You can find the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html

The first release for this will be 5.6 where users can start exploring this typeless parent/child by setting mapping.single_type to true:
https://www.elastic.co/guide/en/elasticsearch/reference/5.x/parent-join.html

New issues can be opened for enhancements or bugs, but this long-standing issue can be closed!

jimczi on 16 Jun 2017

🎉7

All 34 comments

I'm not sure we need to retain the notion of types, maybe we could go with something like below? You say that parent_type is used for the name of the internal join field, but couldn't we use the name of the join field directly for that (in this case question-to-answers)?

PUT /stackoverflow
{
  "mappings": {
    "properties": {
      "question-to-answers": {
        "type": "join"
      }
    }
  }
}

PUT stackoverflow/1
{
  "question" : "...",
  "question-to-answers" : null # no parent id, so this doc is a parent
}

PUT stackoverflow/2?routing=1
{
  "answer" : "...",
  "question-to-answers" : "1" # id of the parent
}

Something else that made me wonder when reading your proposal is that users generally do not like modifying the structure of their documents for specifying metadata (about the join in that case), so maybe it should remain a meta field, something like below?

PUT /stackoverflow
{
  "mappings": {
    "_join": [
      "question-to-answers"
    ]
  }
}

PUT stackoverflow/1?join=question-to-answers
{
  "question" : "..."
}

PUT stackoverflow/2?join=question-to-answers&parent=1
{
  "answer" : "..."
}

jpountz on 1 Sep 2016

👍3

I'm not sure we need to retain the notion of types, maybe we could go with something like below?

When thinking about this refactoring I thought we needed to distinguish between the different documents (in case of multiple join fields), but when answering this question I realize we don't :), because all documents in the same index will have different ids (currently that is not the case, as ids are only unique within a single type).

Something else that made me wonder when reading your proposal is that users generally do not like modifying the structure of their documents for specifying metadata (about the join in that case)

The other reason I moved away from metadata is that besides the parent there is a need of an additional meta field. I didn't want to add more options to IndexRequest and other classes for a feature that isn't used as much as our core features. This gives us cleaner code (p/c's code being encapsulated in only a FieldMapper impl and QueryBuilder impls) and does give the opportunity to eventually isolate parent/child into a module. The price of this is that the relationship needs to be specified in the source of the document.

If we can make metadata completely pluggable (from rest layer to field mapper layer) then we specify the required parameter in the url and still keep the ability to isolate the p/c code. But I don't feel that this should be a requirement for this refactoring, although if we really want this we do have the time to develop it.

martijnvg on 1 Sep 2016

I didn't want to add more options to IndexRequest and other classes for a feature that isn't used as much as our core features.

This is a very good point.

jpountz on 1 Sep 2016

If we can make metadata completely pluggable (from rest layer to field mapper layer) then we specify the required parameter in the url

I think we should do this. It is already a deficiency in the pluggability of metadata fields. And I like @jpountz's suggested API.

rjernst on 1 Sep 2016

@rjernst @jpountz I've updated the description of the issue to match with Adrien's proposal.

martijnvg on 3 Sep 2016

Hi All,

Interesting discussion, I'm jumping-in with a couple of questions/comments.

Right now, as a user I can independently define the properties of my parent and child documents.
If I understand correctly, one consequence of removing types, is that I would define ONE mapping defining all properties of both my parent and child document (basically a merge of my current child and parent mapping files). Is that right?

Then, when inserting a document, I would either set properties allowed by the parent OR properties allowed by the child. ES won't have any way to tell if the properties are 'parent properties' or 'child properties'. Am I correct ?
If that's right, it seems like a major step backward in terms of functionality. What do you guys think?

Another question: Would it be possible to support cross index parent-child relationships?

All the best

stephanebastian on 15 Mar 2017

👍1

Your assumptions are correct. It is indeed hard to keep feature parity when moving forward, however we think that removing types remains a good trade-off by making Elasticsearch easier to use, easier to understand and possibly faster.

Cross index parent-child relationships would not be possible as-is since Elasticsearch relies on the fact that a parent and all its children are on the same shard.
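
The same-shard requirement can be illustrated with a small Python sketch. It assumes a simplified model in which the shard is a hash of the routing value modulo the shard count; `zlib.crc32` is a stand-in chosen for this example and is not the hash Elasticsearch uses, so real shard numbers will differ:

```python
import zlib


def shard_for(routing: str, num_shards: int) -> int:
    """Pick a shard from the routing value.

    zlib.crc32 is a stand-in hash chosen for this sketch; Elasticsearch uses
    its own hash function internally, so real shard numbers will differ.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


NUM_SHARDS = 5  # assumed shard count for the example

parent_shard = shard_for("1", NUM_SHARDS)          # parent routed by its own id
child_shard = shard_for("1", NUM_SHARDS)           # child routed by the parent id
unrouted_child_shard = shard_for("2", NUM_SHARDS)  # child routed by its own id
```

Routing the child by the parent id always gives the parent's shard, whereas routing it by its own id may or may not. A second index has its own set of shards, so under this model no routing value can place a child next to a parent that lives in another index.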

jpountz on 15 Mar 2017

Then, when inserting a document, I would either set properties allowed by the parent OR properties allowed by the child. ES won't have any way to tell if the properties are 'parent properties' or 'child properties'.

I would like to add that today we use the type as the marker of what is a child and what is a parent. Today you could use the same field in both parent and child, so properties alone aren't enough to identify what is what. In this case both types would have the same field defined in their mappings.

With the proposed change, the join parameter would indicate what is the parent and what is the child document. If both parent and child documents use the same field they will use the same field mapping instead of two field mappings in the two type mappings.

So I think there is no step backwards in terms of functionality; the way parent/child relationships are defined is just different, with types no longer being there.
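
To make the "one mapping for both" point concrete, here is a rough Python sketch of merging two per-type property maps into a single index-level map. The function name and the sample fields are made up for illustration; the only rule it encodes is the one described above, namely that a field shared by parent and child ends up with a single definition:

```python
from typing import Any, Dict


def merge_properties(*type_properties: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Merge per-type property maps into one index-level property map.

    A field that appears in several types must have an identical definition,
    because after the merge there is only one mapping for that field.
    """
    merged: Dict[str, Dict[str, Any]] = {}
    for properties in type_properties:
        for field, definition in properties.items():
            if field in merged and merged[field] != definition:
                raise ValueError(f"conflicting definitions for field {field!r}")
            merged[field] = definition
    return merged


# Hypothetical former type mappings for questions (parent) and answers (child).
question_props = {"title": {"type": "text"}, "created": {"type": "date"}}
answer_props = {"body": {"type": "text"}, "created": {"type": "date"}}

index_props = merge_properties(question_props, answer_props)
```

The shared `created` field is kept once, and a field defined differently in the two former types would be rejected instead of silently picking one definition.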

martijnvg on 16 Mar 2017

I totally understand moving mappings up to the index level (basically removal of types), however we use strict mapping and there will be no way with this proposal that we'll be able to enforce which fields are allowed in a child vs a parent. If my understanding is correct, I'll be able to enforce mapping on the union of my logical children fields with logical parent fields, but if a field is logically "allowed" in a child there is nothing preventing it from also being in a parent document.
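
One way to live with this constraint is to enforce the per-relation field rules in the application before indexing. The following Python sketch assumes a hypothetical `ALLOWED_FIELDS` table maintained by the client; nothing here is provided by Elasticsearch:

```python
from typing import Any, Dict, Set

# Hypothetical client-side rules: which source fields each kind of document may carry.
ALLOWED_FIELDS: Dict[str, Set[str]] = {
    "question": {"title", "body"},
    "answer": {"body", "accepted"},
}


def check_fields(relation: str, source: Dict[str, Any]) -> None:
    """Raise if `source` contains a field that `relation` documents may not use."""
    allowed = ALLOWED_FIELDS[relation]
    unexpected = sorted(set(source) - allowed)
    if unexpected:
        raise ValueError(f"fields {unexpected} are not allowed in a {relation} document")


check_fields("question", {"title": "How will types work?", "body": "..."})
```

A strict index-level mapping would still reject fields outside the union of both sets; this check only adds the parent-versus-child distinction on top.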

I can certainly live with these constraints, I just want to be sure I understand and plan for the future.

Thanks - C

cmorss on 28 Mar 2017

After some internal discussions with @jpountz and @martijnvg I'd like to update this issue with the latest status.
Firstly, naming: instead of join I'd like to call it parent join. I think this is important since it indicates what kind of join we're supporting.
Secondly, since this feature is targeted at 5.5, I think we should add it as a module with the following restrictions:

the value of the join (name + parent) must be encoded in the source
the routing for any document (parent, child or grandchild) is checked but not automatically added. This means that users should add the routing manually for each document. If the document is only a child we can ensure that routing == parent, but if it's also a grandchild we can only ensure that routing was set to some value.

These restrictions simplify the integration of this feature as a module without being intrusive in all indexing layers (REST and mappings).
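
The routing rule in the second restriction could be sketched in Python roughly as follows. The function and its `is_grandchild` flag are hypothetical; in practice that flag would be derived from the relation hierarchy in the mapping, and what exactly gets checked for parent documents is left open here:

```python
from typing import Optional


def check_routing(parent: Optional[str], routing: Optional[str], is_grandchild: bool = False) -> None:
    """Check, but never fill in, the routing of a document.

    - a document without a parent is not checked in this sketch
    - a direct child must be routed by its parent id
    - a grandchild only needs *some* routing value, since its parent id is
      not the id of the root document
    """
    if parent is None:
        return
    if routing is None:
        raise ValueError("routing is required for a child document")
    if not is_grandchild and routing != parent:
        raise ValueError("routing must equal the parent id for a direct child")


check_routing(parent=None, routing=None)                     # question
check_routing(parent="1", routing="1")                       # answer
check_routing(parent="2", routing="1", is_grandchild=True)   # comment
```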
We can later work on:

If we can make metadata completely pluggable (from rest layer to field mapper layer) then we specify the required parameter in the url
but I don't think we can achieve this for 5.5. Requiring the join metadata to be inside the source is not ideal, but adding more metadata handling to all required layers would be worse, I think.

So the proposal is as follows:

Defining a single parent-child relation:

"join": { "questions": "answers" }

Defining a parent-child-grand-child relation:

"join": { "questions": { "answers": "comments" } }

Defining multiple parent-child relation:

"join": { "questions": { "answers": "comments" }, "products": "items" }
With this format the relation between each entity is explicit and we can use the hierarchy to validate join values inside documents.

Adding a child to an existing parent is also possible through a mapping update:
`````
"join": {
  "parent": "child1"
}

"join": {
  "parent": {
    "child1": {},
    "child2": {}
  }
}

"join": {
  "parent": {
    "child1": "grand_child",
    "child2": {}
  }
}
`````
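
As a rough illustration of how the nested format can be read, the Python sketch below flattens such a tree into a parent-to-children map. It assumes the three value shapes used in the examples above (a string for a single child, an object for children that may have their own children, and an empty object for no children); the function name is made up:

```python
from typing import Any, Dict, List


def flatten_relations(tree: Dict[str, Any]) -> Dict[str, List[str]]:
    """Turn the nested relation tree into a flat parent -> children map."""
    flat: Dict[str, List[str]] = {}

    def visit(parent: str, children: Any) -> None:
        if isinstance(children, str):
            # "questions": "answers" -> a single child without children
            flat.setdefault(parent, []).append(children)
        elif isinstance(children, dict):
            # "questions": {"answers": "comments"} -> children with their own subtree
            for child, grandchildren in children.items():
                flat.setdefault(parent, []).append(child)
                visit(child, grandchildren)
        else:
            raise TypeError(f"unsupported relation value for {parent!r}: {children!r}")

    for parent, children in tree.items():
        visit(parent, children)
    return flat


flat = flatten_relations({"questions": {"answers": "comments"}, "products": "items"})
```

For the multiple-relation example this gives questions with the child answers, answers with the child comments, and products with the child items. Each key of the flat map is then one parent-to-child relation against which a document's join name can be checked.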

Indexing a parent question:

PUT stackoverflow/1
{ "join" : { "name": "questions" }, "question": "..." }

Indexing a child answer:

PUT stackoverflow/2?routing=1
{ "join" : { "name": "answers", "parent": "1" }, "answer": "..." }

Indexing a grandchild comment:

PUT stackoverflow/3?routing=1
{ "join" : { "name": "comments", "parent": "2" }, "comment": "..." }
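
For client code, the three requests above follow one pattern that could be captured in a small helper. This Python sketch only builds the request line and JSON body as strings; the helper name is made up, and it assumes the join field is literally called `join` as in this proposal:

```python
import json
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlencode


def index_request(
    index: str,
    doc_id: str,
    name: str,
    source: Dict[str, Any],
    parent: Optional[str] = None,
    routing: Optional[str] = None,
) -> Tuple[str, str]:
    """Build the request line and JSON body for indexing one document."""
    join: Dict[str, Any] = {"name": name}
    if parent is not None:
        join["parent"] = parent
    body = {"join": join, **source}
    path = f"{index}/{doc_id}"
    if routing is not None:
        path += "?" + urlencode({"routing": routing})
    return f"PUT {path}", json.dumps(body)


request_line, body = index_request(
    "stackoverflow", "3", "comments", {"comment": "..."}, parent="2", routing="1"
)
```

The caller still has to supply the routing explicitly, in line with the restriction that routing is checked but not added automatically.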

So the plan here is to have a metadata ParentJoinFieldMapper separated in a module. Then we would migrate has_parent, has_child queries and the children aggregation to it. These queries and aggs would still be compatible with the legacy parent_child but only for 5.x and 6.x. Lastly we must find a way to also migrate inner_hits in order to be able to completely replace the functionality of the current parent_child.

@clintongormley WDYT ?

jimczi on 26 Apr 2017

How about replacing

"join": {
  "questions": {
     "answers": "comments"
   }
}

with

"join": {
  "questions": "answers",
  "answers": "comments"
}

It feels more natural to me, but maybe I'm missing something.

jpountz on 26 Apr 2017

IMO the first option is more visual, you see more clearly the relation tree and you don't have to repeat the join name but I am fine either way.

jimczi on 26 Apr 2017

The first option is more visual but is somewhat confusing, eg why does questions take a map, but answers just takes comments? A stricter syntax would be:

"join": {
  "questions": {
     "answers": {
        "comments": {}
     }
   }
}

but that encourages users to use several layers of inheritance, which is unwise.

I think I prefer the simple version that @jpountz suggested. In the case that a parent has multiple child types, it could be:

"join": {
  "questions": ["answers","comments"]
}

Question: Does the parent document need to know about which join field to use? If a document is neither a parent nor a child, does the join field still contain a value, or is it null?

clintongormley on 26 Apr 2017

👍3

but that encourages users to use several layers of inheritance, which is unwise.

Right and since most of the use cases are for a single parent-child relation the syntax would be the same anyway: "questions": "answers".
I'll start with the simple version that @jpountz suggested.

Does the parent document need to know about which join field to use

Yes because we use a different docvalue field for each "parent=>child" relation.

If a document is neither a parent nor a child, does the join field still contain a value, or is it null?

It is not required to add a join field in this case so the field can be missing in the document.
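
Put together, the answers above suggest validation along these lines. The Python sketch assumes the flat mapping form with lists of children and the `name`/`parent` document format from the proposal; the exact rules (for example, rejecting a parent id on a root document) are an assumption rather than confirmed behaviour:

```python
from typing import Any, Dict, List, Optional

# Flat relation map, assumed to come from the mapping.
RELATIONS: Dict[str, List[str]] = {"questions": ["answers"], "answers": ["comments"]}


def validate_join(join: Optional[Dict[str, Any]], relations: Dict[str, List[str]] = RELATIONS) -> None:
    """Validate the join value of one document against the configured relations."""
    if join is None:
        return  # neither parent nor child: the field may simply be missing
    name = join.get("name")
    child_names = {child for children in relations.values() for child in children}
    if name not in relations and name not in child_names:
        raise ValueError(f"unknown join name {name!r}")
    has_parent = join.get("parent") is not None
    if name in child_names and not has_parent:
        raise ValueError(f"{name!r} documents need a parent id")
    if name not in child_names and has_parent:
        raise ValueError(f"{name!r} is a root relation and cannot have a parent")


validate_join(None)
validate_join({"name": "questions"})
validate_join({"name": "answers", "parent": "1"})
```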

jimczi on 26 Apr 2017

This is an interesting change. I've used parent-child documents a bit and there are a few things I don't understand. Perhaps someone could help..

In the current ES design there would be an index with two types - eg stackoverflow/questions and stackoverflow/answers and the mapping of the answers specifies that the parent document is a question. When indexing the children you need to specify the routing param to know what the parent is (and which shard to put it.)

In all the examples given here, it seems as if there is one index, stackoverflow, and somehow you're specifying that it's a parent or child with query-string options. I don't understand this - shouldn't there be two indices, so_questions and so_answers?

Also - there seems to be references to this relationship in the indexing of the parent - but I'm not sure I understand why that is. Currently, you need to specify the parent when indexing the child but you don't have to let the parent know anything about even a possible relationship with children. The current approach makes more sense to me since the parent doesn't need to have children and is not dependent on them. If there is something needed under the hood to handle it differently than it works now, I would think it should be transparent to the user so it will work similarly to how it does now.

Basically, as a user of parent-child, it's not clear why it has to work so differently at the API level than how it does now (even if you want to use different terms like join vs parent to distinguish.)

I'm curious if there is any POC of this approach, regardless of the API syntax. If the data is not in the same shard, then I would be concerned the performance would be significantly worse (unless you guarantee the shards are on the same node, which seems like a nightmare). But it's very possible I don't understand how it will work.

yehosef on 23 May 2017

👍1

What you are proposing makes a lot of sense but I did want to throw out what is probably a crazy idea and just see if it makes sense. I'm not sure that it does.

Would it make any sense to wrap parent joins inside the ingest API and require that the parent dataset (a) live in a different index and (b) fit entirely within memory on an ingest node? In my head this would be similar to a broadcast (aka map side) join in Hive/Spark/etc. It may not be necessary to wrap it in the ingest API. Another option might be to provide a _broadcast endpoint on an index that caches the results of a query against an index in memory for the sake of performing joins.

Potential benefits from doing this might be (A) the ability to support joins beyond parent joins (B) the ability to perform lookups (for example, geoip analysis) while indexing data and (C) the limitations of the parent dataset would be explicit -- the results of curl foo/_broadcast/ -d '{ mah awesome query }' have to fit in memory.
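
For readers unfamiliar with the term, a broadcast (map-side) join can be sketched in a few lines of Python. This is only the general technique on plain dictionaries, with made-up field names; it says nothing about how an Elasticsearch `_broadcast` endpoint would actually work:

```python
from typing import Any, Dict, Iterable, Iterator


def broadcast_join(
    small_side: Iterable[Dict[str, Any]],
    large_side: Iterable[Dict[str, Any]],
    key: str,
) -> Iterator[Dict[str, Any]]:
    """Join a small in-memory dataset against a streamed large dataset.

    The small side is loaded once into a dict (this is the part that has to
    fit in memory); the large side is only iterated.
    """
    lookup = {row["id"]: row for row in small_side}
    for row in large_side:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, "parent": match}


questions = [{"id": "1", "title": "How will types work?"}]
answers = [{"id": "2", "question_id": "1"}, {"id": "3", "question_id": "9"}]
joined = list(broadcast_join(questions, answers, key="question_id"))
```

Only the answer whose `question_id` exists in the in-memory lookup is emitted, which also shows why frequent inserts on the small side are awkward: every copy of the lookup has to be refreshed.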

Potential downsides that come to mind immediately are it would probably be terrible when there are a lot of inserts on the parent side. It's also pretty far removed from the current model of parent child.

This could be totally bonkers. I'm honestly not sure.

neuroticnetworks on 23 May 2017

@evanv
I'm not sure about the ingest API - but it sounds like what you're suggesting might be similar to what I suggest here https://github.com/elastic/elasticsearch/pull/3278#issuecomment-290063164. It's different, but I think after the same goal, doing arbitrary joins.

But solving in a direction like this would be much slower than the current parent/child approach, I would think. Because the parent and children are on the same shard, you don't have to pass a lot of data around (inside a node or even worse across the network) and the "joining" is happening locally because when you indexed it you have to ensure the parent-child are together with the routing param.

yehosef on 23 May 2017

👍1

@yehosef I'll take a look at 3278. With respect to speed, I'm not sure I follow. What I mean by broadcast join is in fact "broadcast this entire result set to all of the nodes that would be performing the joins" Ingest API definitely wouldn't be necessary... but it would be one place to potentially limit memory usage (eg cache the data on the ingest nodes and stream results over them).

neuroticnetworks on 23 May 2017

I just read through the blog post more and discussed this a little with warkolm at https://discuss.elastic.co/t/parent-child-and-elastic-6/85722/5 and I think I'm starting to understand better how it's planned to work.

Based on the discussion there, I summarize the problems with types from the blog post:

misconceptions, miseducation and bad practices when people think of types like tables
sparsity - which will be less of an issue for Lucene 7, by the time this is required
doc scoring - I'm curious to the extent that this causes real problems and if the switch to BM25 changes it at all. If you have any articles/bugs, etc talking about this problem, I'd be interested to hear.

And it seems the biggest problem is the first - I understand that when you have a "feature" that most people don't get, don't really need, and that causes problems, there is a natural inclination to drop it. But I'm not sure if the new solution is much better (but I may just not understand it fully).

Here's an alternative suggestion. Internally, keep the concept of types and most of the internal plumbing for how they work, how queries work, etc. But change the documentation and API to make the current concept of type go away. To reinforce this, you could remove the "_type" field and use instead a "_join_type" field.

Here are some examples:
The mapping might be like

"join": {
  "questions": ["answers"],
  "answers":["comments"]
}

As an example - I'm not sure if a parent doc needs to know about the grandchildren.

The normal/default indexing would look like:

POST so_question/1 
{
   "title": "how will types work"
}

This would create a doc

 {
    "_index": "so_question",
    "_id": "1",       
    "_source": {
        "title": "how will types work"
     }
}

This is the default use case everywhere you don't need joins.. You could include a _join_type of NULL or default if it makes things easier.

but

POST stackoverflow/1?join_type=question
{
   "title": "how will types work"
}

would create a doc

 {
    "_index": "stackoverflow",
    "_join_type": "question",
    "_id": "question-1",       
    "_source": {
        "title": "how will types work"
     }
}

Note - it doesn't need to know that it has children at index time. The _id is a place for discussion - it could be that if you want to specify id, it has to be unique across types, etc.

POST stackoverflow/2?join_type=answer&parent=1
{
   "text": "It's full of stars!"
}

would create a doc

 {
    "_index": "stackoverflow",
    "_join_type": "answer",
    "_parent": 1,
    "_id": "answer-2",       
    "_source": {
        "text": "It's full of stars!"
     }
}

And then regular queries and parent-child queries would work more or less as they do now. Just instead of specifying the join type as part of the rest endpoint, it would be part of the query (or could be a special query string, but I don't think it's needed.)

So basically the suggestion is to keep types in a form very similar to their current implementation. This will avoid having to make big internal changes (I think), allow parent-child to work in a similar way to how it works now, and accomplish the goal of weaning users off types for the normal use case.

I'm interested to hear what people think about this approach. If there are some things I misunderstood, I would very much like to understand it better.

yehosef on 24 May 2017

Here's an alternative suggestion. Internally, keep the concept of types and most of the internal plumbing for how they work, how queries work, etc. But change the documentation and API to make the current concept of type go away. To reinforce this, you could remove the "_type" field and use instead a "_join_type" field.

@yehosef that's more or less the plan. Internally the feature will work exactly the same as before except that it will use a join_name instead of a type.
See my comment here:
https://github.com/elastic/elasticsearch/issues/20257#issuecomment-297446091

jimczi on 24 May 2017

I'm sorry to be a bit late with this question, but why not remove all this types hassle completely and just make a single parent-child relationship per index?

ei-grad on 24 May 2017

@jimczi - thanks for the clarification. I think I understand a little better - but still not sure I'm at 100%. I'll explain where my confusion is coming from.

The usual way to post data is like the first example at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

PUT twitter/tweet/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

where the URL describes the document, including the type and the id, and the JSON message is the document.

in the example you gave

PUT stackoverflow/1
{
  "join" : {
     "name": "questions"
  },
  "question": "..."
}

It wasn't clear to me because the "join" information is inside the document payload. But I assume that's not part of the document. What if in my document schema I have a field called "join"? That's why I was thinking it should still be in the url (though perhaps now as a query_string). While you could give it an underscore, it still doesn't feel right to include meta data at the same level as the fields. If you really don't want it in the url, you could maybe allow a separate field for the "_source" or doc - but then you have special rules for parsing for parent/child vs not - which really seems wrong.

Also - it seems that you're making "join" be an object instead of a simple key because you want to make room for the "parent" field - but why not put it into the query string as it is now? You could just call it "join_name":"question".

Also the term "join_name" is, to me, less clear. We are dealing with different "types" of data - questions and answers, posts and comments, etc. They are not the same thing. If you ask me, I would think saying "this document has a name of 'question'" is less clear than "this document has a type of 'question'". "Name" implies a high degree of specificity whereas "type" inherently does not.

I also think it may make things simpler if you allow an index to have an optional default "_join_type" (whatever it will be called now). That will probably make the internal transition simpler (you can make the default something like "doc"). If I'm using parent-child, I could make the default be the parent and only specify the child, or the opposite.. not sure.

NOTE
I was just working on something that has different "types" that I would usually have put in the same index (not parent-child). I tried pretending I don't have types and put them into different indices. While I really understand the problems you're trying to solve, it doesn't really help me.

I don't have any of the problems you're having

I am fine with the same fields being the same - that's the default now - it's fine.
I don't care about sparsity - so it'll take some more space.. and it's going to get better with lucene 7 when this is required. And different docs even of the same type sometimes have different fields anyway..
I don't care about score differences (I haven't seen examples where it causes real differences.. but I think usually it's not really so different for many/most use cases - would love to hear otherwise) Everything within that type should be skewed accordingly.

And it just makes things more complicated (though not too much so). I have to deal with more indices, have more rules for backups/restores (eg, I want posts-2017-05 and comments-2017-05), and when I want to look through both things - "find where someone mentioned 'elasticsearch'" - I need to specify both indices (I know I can use both or a wildcard). It's just more for me to manage. If I really care about the problems you identified, I can already solve those problems now using different indices. The change will just force me to follow your best practices.

These are not deal-breakers - I'm not going to drop Elasticsearch. But it's not a feature I'm excited to upgrade to.

yehosef on 25 May 2017

I'm sorry to be a bit late with this question, but why not remove all this types hassle completely and just make a single parent-child relationship per index?

@ei-grad Ideally an index should have a single parent/child relation. But in cases where there is parent-child-grandchild relation or parent-child1 and parent-child2 relations (reuse of the same parent for different types of children), then multiple join fields need to be configured.

It wasn't clear to me because the "join" information is inside the document payload. But I assume that's not part of the document.

@yehosef Yes, it is part of the json document.

What if in my document schema I have a field called "join"?

In the mapping you could configure a different field to be of type join, so there is no clash with your own fields.
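
A possible shape for such a mapping, written as a Python dict builder. The `relations` key is taken from my recollection of the released parent-join documentation rather than from this thread, so treat the exact mapping layout as an assumption; the helper and the field name `qa_relation` are made up:

```python
from typing import Any, Dict, Iterable


def join_properties(field_name: str, relations: Dict[str, Any], source_fields: Iterable[str]) -> Dict[str, Any]:
    """Build the properties entry for a join field with a user-chosen name.

    Refuses a name that the application already uses for a regular field.
    The "relations" key is an assumption about the final mapping format.
    """
    if field_name in set(source_fields):
        raise ValueError(f"{field_name!r} clashes with an existing document field")
    return {field_name: {"type": "join", "relations": relations}}


# The documents already have their own "join" field, so pick another name.
properties = join_properties(
    "qa_relation",
    {"questions": "answers"},
    source_fields=["join", "question", "answer"],
)
```

Documents would then carry the relation under `qa_relation` and remain free to use `join` for their own data.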

martijnvg on 26 May 2017

@martijnvg So, for a greenfield project that needs to use parent/child today, what would be the recommended approach to lessen the future pain?

A bit of context: the Elasticsearch parent/child feature is great for entity/event relationships, i.e. you have a number of entities and a much larger number of events for those entities. ES makes it super fast to filter and get the events of entities that match a certain query (filter parents and children at the same time and get the children or parents). I hope the long-term goal is not to remove the parent/child feature.

kutchar on 26 May 2017

@yehosef Yes, it is part of the json document.

I hope that is reconsidered. It's not that it's a common problem or something that can't be worked around, it just doesn't feel like an elegant solution. Is it really a regular document field, or do you just access it through the document while it's really stored as metadata? If it's really a document field, is it possible for someone to add multiple values like you can in other fields? If so, what does that do? If not, will you have to write special code to enforce that rule?

Maybe if the join fields could be set to specific field names, it might be better. Eg, I might have a real field in my document called "type" or "class" - that could be used as the join type also. You could also do the same with the parent id - it's common that that exists as a real field in the data, so if you could define it to be "parent_id" you could just use that field for the mapping also. (This doesn't address multiple values.. )

While I'm still not excited about removing types and I think it could/should have been handled differently, I see that there has been a lot of work on this for some time and it doesn't look like the type extermination squad is going to give up easily. So I am preparing to resign myself to this direction. I just want to give one last cry from an Elasticsearch advocate..

What I'll do in the future if I would want to use types (for the reasons I mentioned) is to just create a "type" field in the document and filter based on that where needed. I'll have sparsity issues, and I may have scoring issues, but I'll deal.

yehosef on 28 May 2017

I hope the long-term goal is not to remove parent/child feature.

@kutchar The feature will not be removed, but we will be changing the way it is exposed and how it can be used in ES. The _parent meta field type will be replaced by a join field type, but the has_child query, has_parent query and children aggregation remain untouched, so there is no loss in functionality.

Is it really a regular document field ... If it's really a document field, is it possible for someone to add multiple values like you can in other fields? If so, what does that do

@yehosef Yes and if someone tries to add multiple values by specifying an array then the validation in the new join field will fail. In fact the join field type would only accept a json string.

Maybe if the join fields could be set to specific field names, it might be better. Eg, I might have a real field in my document called "type" or "class" - that could be used as the join type also. You could also do the same with the parent id - it's common that that exists as a real field in the data, so if you could define it to be "parent_id" you could just use that field for the mapping also.

Yes, you will be able to control what your join field is in the mapping.

What I'll do in the future if I would want to use types (for the reasons I mentioned) is to just create a "type" field in the document and filter based on that where needed. I'll have sparsity issues, and I may have scoring issues, but I'll deal.

If the different types of documents use their own fields (the fields you use for free text search) then you don't have scoring issues, so you can still put these different types of documents into the same index (and if you like you can add a type field, but that isn't required). However, if these different types have a substantial number of documents then I would recommend storing each type of document in a different index.

martijnvg on 29 May 2017

@yehosef Yes and if someone tries to add multiple values by specifying an array then the validation in the new join field will fail. In fact the join field type would only accept a json string.

I assume it also has to be protected for scripted updates.

Yes, you will be able to control what your join field is in the mapping.

Just to confirm, I'll be able to use fields in the root of the document payload or fields such as "type" or "parent_id" and I won't have to use a nested object like "join.type" and "join.parent_id". Is that correct?

Would this change allow documents to have multiple parents?

yehosef on 1 Jun 2017

Just to confirm, I'll be able to use fields in the root of the document payload or fields such as "type" or "parent_id" and I won't have to use a nested object like "join.type" and "join.parent_id". Is that correct?

You'll have to use an object field defining the name of the join and an optional parent.
The parent-join is just an object field that needs to be defined at the root level; we don't want to hack the document updates to magically handle this feature. I understand that changing the logic of a feature that has existed for years is not ideal, but the benefit of the rewrite is that it is now isolated in a module and that it can work without _type. As said earlier, only the mapping and the document handling change. The query and agg side will remain identical.

Would this change allow documents to have multiple parents?

Nope, that's a different issue and there is no plan to add it in the future.

jimczi on 1 Jun 2017

@jimczi - thanks for the clarification. So the "join" field is an object and will be "special". If I have a field named "join" in my document, I'll be able to change the mapping to something like "parent-join" and that will be where the join data lives? Eg. parent-join.name, parent-join.parent ?

But don't you still have to hack those fields to disallow arrays or scripted doc updates? Once this "metadata" lives in the document, you have to do checks to avoid invalid data, no?

Is the field "join" automatically special in all indices and I have to override it, or does it only become special if I define in the mapping that there is a parent-child relationship?

yehosef on 1 Jun 2017

@yehosef we'll write complete documentation for this feature when it's ready (which should happen soon). In the meantime you can follow the iterations we're doing to bring this feature to life.
For instance the new field mapper has been added here:
https://github.com/elastic/elasticsearch/commit/b5d62ae74766ff11a739d867192d93dc674cd191

I'll update this issue when we have something that people can start to test and then we'll still have some time to discuss the pros/cons of the adopted solution which will remain experimental in 5.x

jimczi on 1 Jun 2017

@jimczi will this new join field type rely on Global Ordinals? Or, more to the point, what will be the practical considerations relevant to using _join? How will important details like the byte-length of record identifiers impact the use of _join?

Thanks for any guidance on this -- or anticipated guidance :-)

Very excited to see this new feature coming together.

johnrfrank on 2 Jun 2017

Will this change also remove Nested objects? Or are there plans to improve the performance gap between a nested object query and a parent-child query?

Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

shawnjohnson on 15 Jun 2017

Will this change also remove Nested objects?

No, nested objects are not impacted by this change.

Or are there plans to improve the performance gap between a nested object query and a parent-child query?

This issue is just about typeless parent join that will replace the parent/child with types. We did not change the internals and how the query is executed so it should be similar in terms of performance.
Though we decided to index the parent id this time so inner_hits that retrieves children should be faster and the indexing cost should be a little higher.

Or, more to the point, what will be the practical considerations relevant to using _join? How will important details like the byte-length of record identifiers impact the use of _join?

The same as before ;)
We use doc_values and global_ordinals so the length of the id can have an impact on disk consumption and RAM consumption. But again, this is not a new approach for parent/child, it is just a rewriting without types so if your use case worked with the previous model it should behave the same in the new one.
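
A simplified Python model of why id length matters with ordinals: each distinct parent id is stored once in a sorted dictionary, and every document only carries a small integer pointing into it. This is a conceptual sketch with made-up helper names, not how Lucene actually encodes doc values or global ordinals:

```python
from typing import List, Tuple


def build_ordinals(parent_ids: List[str]) -> Tuple[List[str], List[int]]:
    """Return (sorted dictionary of distinct ids, per-document ordinals)."""
    dictionary = sorted(set(parent_ids))
    position = {value: ordinal for ordinal, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in parent_ids]


def dictionary_bytes(dictionary: List[str]) -> int:
    """Rough size of the id dictionary: the sum of the UTF-8 lengths of the ids."""
    return sum(len(value.encode("utf-8")) for value in dictionary)


# Parent id recorded for three child documents.
dictionary, ordinals = build_ordinals(["q2", "q1", "q2"])
size = dictionary_bytes(dictionary)
```

In this model the per-document cost is independent of the id length, while the dictionary grows with both the number of distinct ids and their byte length, which is the part that shows up in disk and RAM consumption.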

jimczi on 16 Jun 2017

👍1

The typeless parent-join has landed in master and 5.x.
You can find the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html

New issues can be opened for enhancements or bugs, but this long-standing issue can be closed!

jimczi on 16 Jun 2017

🎉7
