Elasticsearch: Not all meta fields in _source print a nice error

Created on 1 May 2017  路  27Comments  路  Source: elastic/elasticsearch

Given this example:

POST /_bulk
{"index":{"_index":"releaseindex","_type":"rootobject","_id":"1"}}
{"product": [{"_version": 14}],"seo": {"section": [{"_version": 1}]},"_version": 1}
{"index":{"_index":"releaseindex","_type":"rootobject","_id":"2"}}
{"seo": {"section": [{"_version": 1}]},"_version": 1}
{"index":{"_index":"releaseindex","_type":"rootobject","_id":"3"}}
{"_version": 1}
{"index":{"_index":"releaseindex","_type":"rootobject","_id":"4"}}
{"product": [{"_version": 14}],"seo": {"section": [{"_version": 1}]}}
{"index":{"_index":"releaseindex","_type":"rootobject","_id":"5"}}
{"_field_names": 1}
{"index":{"_index":"releaseindex","_type":"rootobject","_id":"5"}}
{"_index": 1}

We get different messages for adding meta fields in the root of _source e.g look at _index vs _field_names

{
      "index": {
        "_index": "releaseindex",
        "_type": "rootobject",
        "_id": "5",
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "Field [_field_names] is defined twice in [rootobject]"
        }
      }
    },
    {
      "index": {
        "_index": "releaseindex",
        "_type": "rootobject",
        "_id": "5",
        "status": 400,
        "error": {
          "type": "mapper_parsing_exception",
          "reason": "Field [_index] is a metadata field and cannot be added inside a document. Use the index API request parameters."
        }
      }
:SearcMapping >non-issue Search

Most helpful comment

@Rijul1204 Please feel free to work on this, or any other issue. Those labeled with help wanted in particular indicate nobody else plans to work on them.

All 27 comments

I agree that the message is a bit confusing since we usually fail any document that contain a metadata field. Though _field_names is not listed as a metadata field in MapperService.META_FIELDS and this is why the message is different.
Thanks @narph, are you interested in submitting a PR for this ?

Hello,

if @narph is not available to submit a PR, I am interested in handling this issue.

@asettouf sure go for it, I'd be happy to review !

Note that MapperService.META_FIELDS is just wrong now and needs to be removed. Since adding MapperPlugin.getMetadataFields(), that is what should be used as the definition of which meta fields exist. So MapperService should use the MapperRegistry that is passed in the ctor, instead of a static set.

So for beginning I have just added "_field_names" in the MapperService.META_FIELDS. Especially given that some refactoring is needed, as right now the parsing and checking is made by static methods.

Am I right to understand that to use the MapperRegistry, I would need to refactor the static methods inside the DocumentParser to get a reference to it and use it?

I'd love to do the refactoring, now as I am new, I would just like some guidance on what would be the best way (perhaps I've missed a much smarter way of refactoring) to proceed before I start breaking everything!

@asettouf The META_FIELDS set is only used within MapperService. I think you only need to use the keyset of the mapperRegistry to reimplement isMetadataField(String) and getAllMetaFields().

@elastic/es-search-aggs

I guess the existing PR for this issue is closed now. Can I work on this issue ?

@Rijul1204 Please feel free to work on this, or any other issue. Those labeled with help wanted in particular indicate nobody else plans to work on them.

Hi, as there has been no activity for some time, can I pick this issue?

@Sandmannn, Sure please proceed. I am not currently working on it.

Hi @rjernst , you mentioned in your reply before, So MapperService should use the MapperRegistry that is passed in the ctor,. Since we cannot use exactly the same instance that has been passed in ctor due to isMetadataField being a static method, we need a way to grab such an instance of MapperRegistry inside this static method.

I looked at several places (constructor of MapperRegistry, IndicesModule) but did not find a nice way of getting a MapperRegistry instance (or metadataMapperParsers instance which is what we are really looking for). I am still looking for relevant usages in the codebase, but maybe there is a very straightforward way of getting a MapperRegistry instance that I am missing?

The isMetadataField and getAllMetaFields methods would need to become instance methods, and any callers would need to have an instance of MapperService. This can likely be done in increments by converting each use of the method independently. It will also likely require reworking several chains of construction. For example, DocumentField calls the method statically. Instead, it would need to take in whether the given field is a meta field as a ctor arg, and that would need to be passed through a few layers from there is access to the MapperService.

Thanks for your suggestion, @rjernst ! I think it is the only possible way to proceed here.

As number of classes affected seems to be slightly bigger than originally expected, lets maybe address the necessary class changes one by one. I looked for usages of DocumentField constructor. In half of the cases it is either obvious that we are looking at metadata field, or we have access to mapper service instance that can provide such information. Hovever, there are 2 groups of calls, where it is quite ambiguous: first, several static methods within DocumentField itself, which are creating DocumentField either from InputStream or from XContent. Second, the usage in PercolatorMatchedSlotSubFetchPhase. These are real blockers for adding the metadata information to the DocumentField constructor. As a next step i will check where are those methods used, but currently I don't see a straightforward way to proceed.

I was looking up various chains of funciton calling of DocumentField constructor, among others found this chain

static DocumentField.readDocumentField(StreamInput in)
Called from: GetResult.readFrom(StreamInput) 
Called from static static DocumentField.readGetResult (StreamInput in) 
Called from UpdateResponse.readFrom(StreamInput) 
Called from BulkItemResponse.readFrom(StreamInput) 
Called from static BulkItemResponse.readBulkItem(StreamInput)
Called from BulkItemRequest.readFrom(StreamInput)
Called from BulkItemRequest.readBulkItem(StreamInput)
Called from BulkShardRequest.readFrom(StreamInput)

BulkShardRequest readFrom() implements ReplicatedWriteRequest, thus if we want to extend the function signature of the methods above by handing over additional context, then we also need to adjust all calls to ReplicatedWriteRequest.readFrom(), affected classes DeleteRequest, IndexRequest,ResyncReplicationRequest. (further call chain must be researched separately for all affected classes)

is going up the call chain up to creation of a Node instance and passing the contexts from there really the only possible way of dealing with this issue? Maybe there is some easy way to get access to the instances of Node or IndicesService directly and reading the MapperRegistry from there?

For DocumentField, I would make this a member variable that is set on construction. The readFrom case is then simpl, because the new flag (whether a field is a meta field) would be serialized/deserialized on the StreamOutput/Input methods. So nothing else would need to be passed into those StreamInput methods.

FTIW, passing the boolean to the DocumentField constructor should be relatively simple: see FetchPhase.getSearchFields. This method has SearchContext, which has a ref to the MapperService.

Assuming it was already written during serialization indeed makes it much simpler. 2 questions though:
1) Do we have to care about deserializing DocumentField instances serialized in previous version (without the flag)?
2) What is the policy on splitting PRs? Is it possible to first create a PR concentrating on change in DocumentField and whoever calls its constructor and afterwards a separate PR for changing the static methods of the MapperService?

  1. Yes, backcompat needs to be addressed. In this case, the deserialization can fall back to calling the static method (it would need to stick around in 6.x then).
  2. Splitting PRs is highly encourage, assuming the split makes logical sense (each PR should leave the system in a stable state, still fully tested, etc). If I understand your suggested split, you would change the construction of DocumentField (and the deserialization) to call the static method in a first PR, and then change to instance methods in a followup? That would be fine, but I would personally prefer adding the instance methods first (which can internally call the static method) in order to begin getting rid of uses of the static method. After the uses are all changed, the impl can change. But in this case I think the order doesn't matter much (at least at a glance).

sure, lets add instance methods first. Does it mean that in the scope of the first PR we just adding the instance equivalents of isMetadataField and getAllMetaFields, which in turn internally call the static methods? We have 2 options then:
1) accept that we have both instance and static methods available in the codeline for some time, meaning we have to give the instance methods different names for the time of refactoring, as they have otherwise same signature. Is there anything less ugly than isMetadataFieldInstance and getAllMetaFieldsInstance?
2) refactor all uses of the static methods by instantiating empty MapperService, calling instance method which in turn internally calls static method. In this case we can rename static method (which is then called only inside MapperService body), plus get lots of calls to empty MapperService constructor, which can look puzzling.

Both options look ugly, but in different ways. What do you prefer?

I don't think we need to have both static and instance methods at the same time. My suggestion is to convert the uses of the meta fields checking to first call the existing static methods directly, instead of indirectly through wrapped layers.

For example, a first PR can change the constructor of DocumentField to take a boolean indicating whether or not the field is a metadata field. All the calls to the constructor would be changed to pass this statically (eg with percolator), or by calling the existing static method on MapperService. A second PR would then change those static method callers to check on the MapperService instance (and at the same time change to an instance method).

I am currently working on a first PR to change the constructor of DocumentField, plus adapting the serialization/deserialization. Previously each field-related serialization fragment was serialized as key/array, {"field":["value1","value2"]} but after the change it has to be serialized as an object, which contains the old value and the new metadata-related infromation {"field":{"value":["value1","value2"],"isMetadata":true}}

I cannot rely on the order of fields in this inner object anymore, thus instead of using low level methods of XContentParser which tokenize json sequentially, I would need to use some different API that first parses all the content of object first and stores it in a map. The right tool in this case would be using ObjectParser as done in SearchHit, right?

https://github.com/elastic/elasticsearch/blob/995bf0ee668282fbde420eaf2eb028af75a74d12/server/src/main/java/org/elasticsearch/search/SearchHit.java#L509-L517

Changing the xcontent structure of a document would be problematic from a bwc standpoint, as there is a large amount of code out there relying on that structure. When parsing a doc from xcontent, we should know what part of the json we are in (eg underneath _source or not), which indicates whether the field is a metadata field or not.

maybe lets continue discussing this issue related to the parsing in the respective PR https://github.com/elastic/elasticsearch/pull/38373 . While generally belonging to _source is a good rule for determining if the field is metadata, there seem to be some exceptions

@rjernst Can I work on this issue?

@sd1998 you can review https://github.com/elastic/elasticsearch/pull/41656, which was there for some time

Hi, as mentioned in the discussion in the last commit #41656 I will not have capacity to contribute at least in the next 3-6 months starting from the April 2020. so as of now everybody is welcome to take over this broader refactoring.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

clintongormley picture clintongormley  路  3Comments

clintongormley picture clintongormley  路  3Comments

abrahamduran picture abrahamduran  路  3Comments

rpalsaxena picture rpalsaxena  路  3Comments

jpountz picture jpountz  路  3Comments