Currently, the _field_caps API presents a transparent view of fields where aliases and multi-fields are not linked. However for some applications it is important to be able to retrieve the source field that was used to populate a specific field since the value of a sub-field (inside a multi-field) or an alias is not present in the _source. For this reason we have some duplicated logic in ml and sql that tries to infer the source path using some fragile heuristics. In order to simplify this retrieval this issue is a proposal to add a source_path section to the fields inside the field_caps response. This way it would be easy for consumers of this API to redirect some fields to their source path when values need to be retrieved:
{
"indices" : [ "index1", "index2" ],
"fields" : {
"alias_field" : {
"keyword" : {
"type" : "keyword",
"searchable" : true,
"aggregatable" : true,
"source_path": "concrete_field"
}
},
"some_field.multi_field" : {
"keyword" : {
"type" : "keyword",
"searchable" : true,
"aggregatable" : true,
"source_path": "some_field"
}
}
}
}
This information should be easy to add for alias field and sub-fields inside a multi-field. It might be more problematic for copy_to since the target could have multiple source.
Pinging @elastic/es-search (:Search/Mapping)
One complexity to note is that a field could have a different source_path across indices. For example, a field could be mapped as an field alias in one index, and as a concrete field in another. (We actually expect this set-up to be common, as it is part of the workflow for renaming a field with time-based indices). So I think the response format would need to support separate source paths for different indexes.
Would it be good enough to return an array that contains both the old and new name in that case? I believe we would need the source_path to be an array anyway because copy_to means that a single field might have sources in multiple fields? And we would compute the union when merging field caps from multiple indices?
We discussed offline and came up with the proposal that source_path could be treated differently than the existing components in the response. We largely expect components like searchable and aggregatable to be consistent across indices, and what's most helpful to the user is a merged view of their values. But for source_path, there is no expectation of consistency across indices, and in some common use cases (like the field alias case described above), the values will differ.
Here's a sketch for how it could look to include index information for source_path. The last section all_field represents a field where other field values have been copied into it through copy_to.
{
"indices" : [ "index1", "index2" ],
"fields" : {
"alias_field" : {
"keyword" : {
"searchable" : true,
"aggregatable" : true,
"source_path": [{
"paths": ["concrete_field"],
"indices": ["index1"]
}]
}
},
"some_field.multi_field" : {
"keyword" : {
"searchable" : true,
"aggregatable" : true,
"source_path": [{
"paths": ["some_field"],
"indices": ["index1", "index2"]
}]
}
}
},
"all_field" : {
"text" : {
"searchable" : true,
"aggregatable" : true,
"source_path": [{
"paths": ["title_field", "abstract_field", "body_field"],
"indices": ["index1"],
}, {
"paths": ["title_field", "body_field"],
"indices": ["index2"]
}]
}
}
}
This makes sense to me. I wonder how to handle meta fields (e.g. _id or _index) and constant_keyword fields too.
For meta fields, maybe we don't need to do anything and just expect clients to rely on the assumption that fields that start with an underscore are a meta field and expected to be returned directly as part of the hit?
Regarding constant_keyword maybe we could slightly modify your proposal to make it possible to return a value directly, e.g.
{
"indices" : [ "index1", "index2" ],
"fields" : {
"my_keyword" : {
"constant_keyword" : {
"searchable" : true,
"aggregatable" : true,
"source": [{
"value": ["foo"],
"indices": ["index1"]
}]
},
"keyword" : {
"searchable" : true,
"aggregatable" : true,
"source": [{
"path": ["my_keyword"],
"indices": ["index2"]
}]
}
},
"alias_field" : {
"keyword" : {
"searchable" : true,
"aggregatable" : true,
"source": [{
"paths": ["concrete_field"],
"indices": ["index1"]
}]
}
},
"some_field.multi_field" : {
"keyword" : {
"searchable" : true,
"aggregatable" : true,
"source": [{
"paths": ["some_field"],
"indices": ["index1", "index2"]
}]
}
}
},
"all_field" : {
"text" : {
"searchable" : true,
"aggregatable" : true,
"source": [{
"paths": ["title_field", "abstract_field", "body_field"],
"indices": ["index1"],
}, {
"paths": ["title_field", "body_field"],
"indices": ["index2"]
}]
}
}
}
Another question is whether we should always return the source information, or whether we should skip it in the common case that the source path is equal to the field name on all indices?
It seems like this proposal would work for alias and multi-mapped fields like the default text/keyword multi-mapping, and I'm looking forward to using it in Lens.
What is the expected behavior when the field is mapped with optional properties like index: false or doc_values: true? Are there edge cases to be careful of?
Thanks for taking a look @wylieconlon !
What is the expected behavior when the field is mapped with optional properties like index: false or doc_values: true? Are there edge cases to be careful of?
This proposal centers on adding information about the field's path in _source and won't change the existing behavior of the field caps API around parameters like index and doc_values. These parameters are reflected in the searchable, aggregatable, non_searchable_indices, etc. components of the field caps response. Let me know if I'm missing part of your question.
Answering a couple questions asked over email:
Is adding information about the 'source path' enough, or would you need explicit indication that the field is an alias?
I think we might still need an explicit indication that the field is an alias. Unless I'm missing something, the currently proposed API response wouldn't allow us to determine which fields are aliases and which are copy_to. I don't have an example off the top of my head, but perhaps we'd want to treat those two differently somewhere in Kibana. A "sub_type" property on the source_path objects might be generally useful. Although it is possible to determine which fields are multi fields currently, the logic for doing so is a bit hairy聽and it would be nice to have a central place to check for any sub type. To be clear, this is what I'm thinking:
{
"indices": [
"index1",
"index2"
],
"fields": {
"alias_field": {
"keyword": {
"searchable": true,
"aggregatable": true,
"source_path": [
{
"paths": [
"concrete_field"
],
"indices": [
"index1"
],
"sub_type": "alias"
}
]
}
},
"some_field.multi_field": {
"keyword": {
"searchable": true,
"aggregatable": true,
"source_path": [
{
"paths": [
"some_field"
],
"indices": [
"index1",
"index2"
],
"sub_type": "multi"
}
]
}
}
},
"all_field": {
"text": {
"searchable": true,
"aggregatable": true,
"source_path": [
{
"paths": [
"title_field",
"abstract_field",
"body_field"
],
"indices": [
"index1"
],
"sub_type": "copy_to"
},
{
"paths": [
"title_field",
"body_field"
],
"indices": [
"index2"
],
"sub_type": "copy_to"
}
]
}
}
}
The multi-index case is a bit tricky. In particular, a field can be mapped as an alias in one index, and as a concrete field in another. We currently propose to return the full source path information separated out by index, but I'm curious if it'd be just as helpful to return a merged list of source paths across all indices.
I like how you're currently separating it out by index. Since we don't know of 100% of the places we'll need to use this info in Kibana, I think the more info the better.
Aside from my sub_type suggestion I think the API looks great and it should definitely give us the info we need in Kibana.
@jtibshirani @jimczi Should we close this in favor of #49028?
That works for me, we can always reopen if the idea becomes relevant again.
Most helpful comment
One complexity to note is that a field could have a different
source_pathacross indices. For example, a field could be mapped as an field alias in one index, and as a concrete field in another. (We actually expect this set-up to be common, as it is part of the workflow for renaming a field with time-based indices). So I think the response format would need to support separate source paths for different indexes.