Indexing a document with an object type on a field that has already been mapped as a string type causes MapperParsingException, even if index.mapping.ignore_malformed has been enabled.
On Elasticsearch 1.6.0:
$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}
$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}
Indexing a document with an object field where Elasticsearch expected a string field should not fail the whole document when index.mapping.ignore_malformed is enabled. Instead, Elasticsearch should ignore the invalid object field.
+1
While working on this issue, I found out that it fails on other types too, but for another reason: For example, for integer:
$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test2": 10}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test2":{"nested": 20}}'
[elasticsearch] [2015-09-26 02:20:23,380][DEBUG][action.index ] [Tyrant] [broken][1], node[7WAPN-92TAeuFYbRLVqf8g], [P], v[2], s[STARTED], a[id=WlYpBZ6vTXS-4WMvAypeTA]: Failed to execute [index {[broken][type][AVAIGFNQZ9WMajLk5l0S], source[{"test2":{"nested":1}}]}]
[elasticsearch] MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: END_OBJECT];
[elasticsearch] at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:157)
[elasticsearch] at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:77)
[elasticsearch] at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:319)
[elasticsearch] at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:475)
[elasticsearch] at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1053)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1061)
[elasticsearch] at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:580)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:453)
[elasticsearch] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
[elasticsearch] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[elasticsearch] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[elasticsearch] at java.lang.Thread.run(Thread.java:745)
[elasticsearch] Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: END_OBJECT
[elasticsearch] at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:142)
[elasticsearch] ... 13 more
That's happening because, unlike in the string case, we do handle ignoreMalformed for numeric types, but when we throw the exception here we haven't parsed the field object through to XContentParser.Token.END_OBJECT, and that comes back to bite us later, here.
So, I think two things must be done:
(1) Use the ignoreMalformed settings in StringFieldMapper, which is not happening (hence the original reported issue)
(2) Parse until the end of the current object before throwing IllegalArgumentException("unknown property [" + currentFieldName + "]"); in the Mapper classes, to prevent the exception I reported from happening. Or maybe just ignore this exception in innerParseDocument when ignoreMalformed is set?
Does this make sense, @clintongormley? I'll happily send a PR for this.
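A toy illustration of point (2), not the actual Elasticsearch code: the token names mirror XContentParser.Token, but `skip_children` and `parse_document` are hypothetical helpers written for this sketch. It shows why abandoning a malformed object without consuming its tokens leaves a stray END_OBJECT behind, and how skipping to the matching end token avoids that.

```python
# Hypothetical token stream for {"test2": {"nested": 20}}.
TOKENS = ["START_OBJECT", "FIELD_NAME", "START_OBJECT", "FIELD_NAME",
          "VALUE_NUMBER", "END_OBJECT", "END_OBJECT"]


def skip_children(tokens, pos):
    """Return the index of the token that closes the object/array opened at pos."""
    if tokens[pos] not in ("START_OBJECT", "START_ARRAY"):
        return pos  # scalar values have no children to skip
    depth = 0
    for i in range(pos, len(tokens)):
        if tokens[i] in ("START_OBJECT", "START_ARRAY"):
            depth += 1
        elif tokens[i] in ("END_OBJECT", "END_ARRAY"):
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced token stream")


def parse_document(tokens, skip_malformed_object):
    """Toy parser for a mapping where every top-level field is an integer."""
    pos = 1  # step past the root START_OBJECT
    ignored = 0
    while tokens[pos] != "END_OBJECT":
        pos += 1  # move from FIELD_NAME to its value
        if tokens[pos] == "START_OBJECT":
            ignored += 1  # object where a number was expected: ignore the field
            if skip_malformed_object:
                pos = skip_children(tokens, pos)
            # without the skip, pos still points inside the nested object
        pos += 1
    leftover = tokens[pos + 1:]
    if leftover:
        raise ValueError("found extra data after parsing: " + leftover[0])
    return ignored
```

With `skip_malformed_object=False` the toy parser stops at the inner END_OBJECT and reports leftover data, which is the same shape of failure as in the stack trace above; with `True` it ignores one field and finishes cleanly.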
ah - i just realised that the original post refers to a string field, which doesn't support ignore_malformed...
@andrestc i agree with your second point, but i'm unsure about the first...
@rjernst what do you think?
Sorry for the delayed response, I lost this one in email.
@clintongormley I think it is probably worth making the behavior consistent, and it does seem to me finding an object where a specific piece of data is expected constitutes "malformed" data.
@andrestc A PR would be great.
I want to upvote this issue!
I have fields in my JSON that are objects, but when they are empty, they contain an empty string, i.e. "" (this is the result of an XML2JSON parser). Now when I add a document where this is the case, I get a
MapperParsingException[object mapping for [xxx] tried to parse field [xxx] as object, but found a concrete value]
This is not at all what I would expect from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html; please improve the documentation or fix the behavior (preferred!).
@clintongormley "i just realised that the original post refers to a string field, which doesn't support ignore_malformed..." Why should string fields not support ignore_malformed?
+1
I think much more could be done, e.g. set the field to a default value and add an annotation to the document, so users can see what went wrong. In my case all documents from Apache logs having "-" in the size field (Integer) got ignored. I could tell you 100 stories about why Elasticsearch doesn't take documents from real data sources ... (just to mention one more: https://github.com/elastic/elasticsearch/issues/3714)
I think this problem could be handled much better:
A good example is Logsene: it adds error annotations to failed documents together with the String version of the original source document (@sematext can catch Elasticsearch errors during the indexing process). So at least Logsene users can see failed index operations and the original document in their UI or in Kibana. Thanks to this feature I'm able to report this issue to you.
It would be nice if such improvements were available out of the box for all Elasticsearch users.
any news here?
I wish to upvote the issue too.
My understanding of the ignore_malformed purpose is to not lose events, even when you might lose some of its content.
In my current situation, an issue similar to the one described here is occurring. It has been identified, and multiple mid-term approaches are being looked into: the issue in our case relates to multiple sources sending similar events, so options like splitting the events into separate mappings, or even cleaning up the events before they reach Elasticsearch, could be pursued. Still, I would have liked something similar to the ignore_malformed functionality to be in place to help in the short term.
Same problem with dates.
When adding an object with a field of type "date", in my DB whenever it is empty it's represented as "" (empty string) causing this error:
[DEBUG][action.admin.indices.mapping.put] [x] failed to put mappings on indices [[all]], type [seedMember]
java.lang.IllegalArgumentException: mapper [nms_recipient.birthDate] of different type, current_type [string], merged_type [date]
Same problem with me. I'm using the ELK stack, in which people may use the same properties but with different types. I don't want those properties to be searchable, but I don't want to lose the entire event either. I thought ignore_malformed would do that, but apparently it does not work in all cases.
We are having issues with this same feature. We have documents that sometimes have objects inside something that was intended to hold strings. We would like to not lose the whole document just because one of the nodes of data is malformed.
This is the behaviour I expected to get from setting ignore_malformed on the properties, and I would applaud such a feature.
Hey, I have the same problem. Is there any solution (even if it is a bit hacky) out there?
Facing this in Elasticsearch 2.3.1. Until this bug is fixed, we should at least have a list of bad fields inside the mapper_parsing_exception error so that the app can choose to remove them. Currently there is no standard field in the error through which these keys can be retrieved:
"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}
The app would have to parse the reason string and extract A.B.C.D, which will fail if the error format changes. Additionally, mapper_parsing_exception presumably uses different formats for different parsing error scenarios, all of which would need to be handled by the app.
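As a rough sketch of the string-parsing fallback described above (and of why it is fragile), the following assumes the exact message wording quoted in this comment. `conflicting_field` and `remove_path` are hypothetical helper names, not part of any Elasticsearch client, and field names that themselves contain dots are not handled.

```python
import re

# Pattern for the one message format quoted above; other mapper_parsing_exception
# messages use different wording and will simply not match.
OBJECT_CONFLICT = re.compile(
    r"object mapping for \[(?P<path>[^\]]+)\] tried to parse field "
    r"\[(?P<leaf>[^\]]+)\] as object, but found a concrete value"
)


def conflicting_field(reason):
    """Return the dotted field path named in the reason string, or None."""
    match = OBJECT_CONFLICT.search(reason)
    return match.group("path") if match else None


def remove_path(doc, dotted_path):
    """Delete the value at dotted_path from a nested dict; return True if removed."""
    parts = dotted_path.split(".")
    node = doc
    for key in parts[:-1]:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return False
    if isinstance(node, dict) and parts[-1] in node:
        del node[parts[-1]]
        return True
    return False
```

An application could call `conflicting_field` on the error reason, strip that path with `remove_path`, and retry the index request; any change to the message wording makes `conflicting_field` return None, which is exactly the brittleness the comment warns about.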
I used a workaround for this matter following the recommendations from Elasticsearch forums and official documentation.
Declaring the mapping of the objects you want to index (if you know it), and setting ignore_malformed on dates and numbers, should do the trick. The tricky fields that could have either string or nested content could simply be declared as object.
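A minimal sketch of what such an explicit mapping might look like, built as a Python dict and serialised to JSON. The field names (`timestamp`, `size`, `details`) are made up for illustration; only the `ignore_malformed` parameter and the type names come from Elasticsearch, and the wrapper around `properties` (for example a type name in older versions) depends on the version in use.

```python
import json

# Hypothetical field names; adjust to your own documents.
properties = {
    # malformed dates and numbers are skipped instead of rejecting the document
    "timestamp": {"type": "date", "ignore_malformed": True},
    "size": {"type": "integer", "ignore_malformed": True},
    # field that may carry nested content, declared as an object as suggested above
    "details": {"type": "object"},
}

body = json.dumps({"properties": properties}, indent=2)
print(body)
```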
For usage as a real log stash I would say something like https://github.com/elastic/elasticsearch/issues/12366#issuecomment-175748358 is a must have!
I can get accustomed to losing indexed fields but losing log entries is a no-go for ELK from my perspective
Bumping. This issue is preventing a number of my messages from being processed successfully, because a field that is normally an object is returned as an empty string in rare cases.
Bump, this is proving to be an extremely tedious (non) feature to work around.
I've found a way around this but it comes at a cost. It could be worth it for those like me who are in a situation where intervening directly on your data flow (like checking and fixing the log line yourself before sending it to ES) is something you'd like to avoid in the short term. Set the enabled setting of your field to false. This will make the field non-searchable though. This isn't too big of an issue in my context because the reason this field is so unpredictable is the reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyway, though you still have access to the data when you search for that document using another field. Incidentally, this solves both situations: writing an object to a non-object field and vice versa.
Hope this helps. It certainly saved me a lot of trouble...
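For reference, a small sketch of the mapping fragment this workaround implies. `disable_fields` is a hypothetical helper and `payload` a made-up field name; `enabled` is the Elasticsearch mapping parameter that tells it to keep the field in _source without parsing or indexing it. Whether a given version accepts every value shape in a disabled field is worth verifying on your own cluster.

```python
import json


def disable_fields(properties, field_names):
    """Return a copy of properties where each named field is an unparsed object."""
    updated = dict(properties)
    for name in field_names:
        # enabled: false keeps the value in _source but skips parsing and indexing
        updated[name] = {"type": "object", "enabled": False}
    return updated


# Hypothetical example: "payload" is the unpredictable field.
mapping = {"properties": disable_fields({"message": {"type": "text"}}, ["payload"])}
print(json.dumps(mapping, indent=2))
```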
That's a good trick. I'll try that out.
+1
+1
+1
+1
Also an issue on ES 5.2.1. Very frustrating when dealing with some unexpected input that may possibly be malformed.
👍
Would definitely be great to enable the ignore_malformed property for object. I've had many cases of mapping errors due to the fact that someone tried to index a string where a nested object should be and vice versa.
👍
👍
+1
👍
👍
I had a use case similar to @patrick-oyst and found enabled=false helps me avoid the issue for now.
One additional observation is that the ignore_malformed setting worked fine until I did a snapshot/restore on my ES instance a day ago. After the restore, no matter what I did (delete index, clear cache, refresh index patterns, etc.), ES just keeps comparing between old and new types.
+1
enabled=false works for me.
:+1: relates to #10070
Quite a useful feature which has been lacking a good implementation for too long. And the official documentation is incomplete and misleading.
/tmp/elastic_dev/filebeat/current/filebeat -c /tmp/elastic_dev/filebeat/config/filebeat.yml -e
2017/11/23 08:05:06.633737 beat.go:426: INFO Home path: [/tmp/elastic_dev/filebeat/current] Config path: [/tmp/elastic_dev/filebeat/current] Data path: [/tmp/elastic_dev/filebeat/current/data] Logs path: [/tmp/elastic_dev/filebeat/current/logs]
2017/11/23 08:05:06.633916 beat.go:433: INFO Beat UUID: ca5704f8-9b1a-4c94-8766-1dc76b119230
2017/11/23 08:05:06.633952 beat.go:192: INFO Setup Beat: filebeat; Version: 6.0.0
2017/11/23 08:05:06.634604 metrics.go:23: INFO Metrics logging every 30s
2017/11/23 08:05:06.635838 client.go:123: INFO Elasticsearch url: https://sample.test.raju.com:9200
2017/11/23 08:05:06.636048 client.go:123: INFO Elasticsearch url: https://sample.test.raju.com:9220
2017/11/23 08:05:06.636161 client.go:123: INFO Elasticsearch url: https://sample.test.raju.com:9230
2017/11/23 08:05:06.636812 module.go:80: INFO Beat name: 10.20.175.66
2017/11/23 08:05:06.641468 beat.go:260: INFO filebeat start running.
2017/11/23 08:05:06.642313 registrar.go:88: INFO Registry file set to: /tmp/elastic_dev/filebeat/current/data/registry
2017/11/23 08:05:06.642475 registrar.go:108: INFO Loading registrar data from /tmp/elastic_dev/filebeat/current/data/registry
2017/11/23 08:05:06.643372 registrar.go:119: INFO States Loaded from registrar: 4
2017/11/23 08:05:06.643439 crawler.go:44: INFO Loading Prospectors: 2
2017/11/23 08:05:06.643746 registrar.go:150: INFO Starting Registrar
2017/11/23 08:05:06.644503 prospector.go:103: INFO Starting prospector of type: log; id: 9119168733948319376
2017/11/23 08:05:06.645260 harvester.go:207: INFO Harvester started for file: /opt/hello1/test_ServiceAudit.log
2017/11/23 08:05:06.645842 prospector.go:103: INFO Starting prospector of type: log; id: 17106901312407876564
2017/11/23 08:05:06.645874 crawler.go:78: INFO Loading and starting Prospectors completed. Enabled prospectors: 2
2017/11/23 08:05:06.648357 harvester.go:207: INFO Harvester started for file: /opt/hello2/test_ProtocolAudit.log
2017/11/23 08:05:07.697281 client.go:651: INFO Connected to Elasticsearch version 6.0.0
2017/11/23 08:05:07.700284 client.go:651: INFO Connected to Elasticsearch version 6.0.0
2017/11/23 08:05:07.704069 client.go:651: INFO Connected to Elasticsearch version 6.0.0
2017/11/23 08:05:08.722058 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [service-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
2017/11/23 08:05:08.722107 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [protocol-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
I am getting the error below:
2017/11/23 08:05:08.722058 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [service-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
2017/11/23 08:05:08.722107 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [protocol-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
+1
+1
+1
+1
+1
+1
I don't like the ignore_malformed option; it is a bit like silent data loss, since indexed documents cannot be retrieved based on the malformed value. Say a document has all its values malformed, for instance: only a match_all query would match it, and nothing is going to warn the user about that. Even an exists query would not match, which might be surprising to some users. I suspect a subsequent feature request would be to add the ability to tag malformed documents so that they could be found later, which I don't like either, since the time we index a document is too late in my opinion to deal with this.
The reason I am semi-ok with the current ignore_malformed option is that figuring out whether a date, a geo-point or a number is well-formed on the client side is not easy. But I don't like the idea of making ignore_malformed silently ignore objects, which it doesn't today, even on field types that support the ignore_malformed option. In my opinion we should not expand the scope of this option.
To me the need for this option arises from the lack of a cleanup process earlier in the ingestion process. A better option would be to have an ingest processor that cleans up malformed values and adds a malformed: true flag to the document so that all malformed documents can be found later.
Isn't that the point though? Users are explicitly opting in to this functionality, generally in cases where we don't control/know our schema. Happens all the time in security use cases. Certainly we'd love to clean up and verify in advance, but that's not always possible.
Often, the desire is to index a partially dynamic schema which also contains well defined meta/header fields, to enable the best possible level of "schemaless" analysis and data integration. Think structured event log payloads. Right now, these are a liability.
I have written es dynamic mapping retrieval => json schema validator tooling that strips out mismatched fields, to work around this limitation, but that's a substantial amount of work to achieve what would be much more simply handled by dropping non-indexable fields with this setting consistently :(
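To give a sense of what such tooling involves, here is a much-reduced sketch, not the tooling described above. It assumes the mapping has already been fetched and reduced to its `properties` dict, only detects object versus concrete-value conflicts, and leaves arrays and nulls untouched. `strip_mismatched` is a hypothetical name.

```python
def strip_mismatched(doc, properties, path=""):
    """Return (clean_doc, dropped_paths) for a document checked against mapping properties."""
    clean, dropped = {}, []
    for key, value in doc.items():
        full = path + key
        spec = properties.get(key)
        if spec is None or value is None or isinstance(value, list):
            # unmapped fields go to dynamic mapping; nulls and arrays are not inspected here
            clean[key] = value
            continue
        is_object_field = "properties" in spec or spec.get("type") == "object"
        if is_object_field and isinstance(value, dict):
            sub, sub_dropped = strip_mismatched(value, spec.get("properties", {}), full + ".")
            clean[key] = sub
            dropped += sub_dropped
        elif is_object_field != isinstance(value, dict):
            dropped.append(full)  # object vs. concrete value conflict
        else:
            clean[key] = value
    return clean, dropped
```

The list of dropped paths could be logged or stored alongside the document so that partially indexed events remain discoverable; a real validator would also need to check dates, numbers and the other field types.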
Essentially, right now there's no safe way to use "dynamic" mapping templates with even partially schemaless data, which seems very unfortunate.
I agree with @EricMCornelius. It's great to use ingest to clean things up if you know that your troublesome fields are limited to a few names. The problem is that many users don't have that kind of control over their data - they have to deal with what is thrown their way. Using ingest pipelines for that would be like playing whack-a-mole.
"I don't like the ignore_malformed option, it is a bit like silent data loss since indexed documents cannot be retrieved based on the malformed value."
I understand what you mean, but I don't think it is silent. The user has to opt in to ignore_malformed, at which stage all bets are off. It's best effort only. But it is a get-out-of-jail-free card that a significant number of users need in the real world.
This still sounds like a dangerous game to me. What if a malformed document is the first to introduce a new field? It makes this field totally unusable. Furthermore, it's not like we are otherwise happy when mappings are not under control: for instance, we enforce a maximum number of fields, a maximum depth of objects and a maximum number of nested fields.
In the worst-case scenario that you don't know anything about how your documents are formatted, you could still use an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts. There wouldn't be any data loss, and I don't think this would be like playing whack-a-mole.
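The renaming idea can be sketched client-side as follows. This is only an illustration of the naming convention, written in Python rather than as an ingest processor; `suffix_by_type` and the `_bool`/`_num`/`_str` suffixes are hypothetical choices, and arrays and nulls are passed through unchanged.

```python
def suffix_by_type(doc):
    """Rename scalar fields to '<name>_<type>'; keep object names and recurse into them."""
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            out[key] = suffix_by_type(value)
        elif isinstance(value, bool):  # check bool before int: bool is an int subclass
            out[key + "_bool"] = value
        elif isinstance(value, (int, float)):
            out[key + "_num"] = value
        elif isinstance(value, str):
            out[key + "_str"] = value
        else:
            out[key] = value  # arrays and nulls are left alone in this sketch
    return out
```

Because only objects keep the bare field name, a scalar and an object that share a source name end up under different mapped names and cannot conflict, at the cost of every consumer having to know the suffix convention.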
"In the worst-case scenario that you don't know anything about how your documents are formatted, you could still use an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts."
In other words, rewrite our dynamic mapping rules in Painless? Imagine somebody who runs a cluster for users in the org who want to log data. The sysadmin has no control of the data coming in. Somebody sends foo: true then foo.bar: false. This may even be a really minor field, compared to the other fields in the document, but now the whole document is rejected unless this poor sysadmin finds the user responsible and gets them to change their app, or tries to write ingest processors (whack-a-mole style) to cover all these issues.
It would be exceptionally difficult to build an ingest processor that checks for all the types of malformed data that might cause an exception in ES, and by the time the document is rejected it is too late. Also, the default action taken in the presence of malformed data will be to simply delete the field, which is essentially what ignore_malformed does. I can imagine users writing ingest processors in special cases where (a) there is a common malformation limited to one or a few fields, and (b) there is something specific you could do to correct the malformation, but this will be the exception, not the rule.
We already support ignore_malformed, users find it a very useful tool, and nobody complains about it being dangerous. The only complaint is that it is not supported by all fields, or that the implementation on supported fields is sometimes incomplete.
Elasticsearch shouldn't only work with perfect data, it should do the best it can with imperfect data too.
"Elasticsearch shouldn't only work with perfect data, it should do the best it can with imperfect data too."
I disagree. Indexing data and cleaning up data should be separate concerns. Your arguments are based on the assumption that there are a minority of malformed documents and that not indexing some fields is harmless. I'm not willing to trade predictability of Elasticsearch, it's too important, not only for users, but also for those who are in the business of supporting Elasticsearch like us.
You mentioned the poor sysadmin who has to identify the user who sent a malformed document. What about not being able to investigate a production issue because the field that you need has not been indexed for the last 2 days because of a schema change that got silently ignored due to this ignore_malformed leniency?
"what about not being able to investigate a production issue because the field that you need has not been indexed for the last 2 days because of a schema change that got silently ignored due to this ignore_malformed leniency?"
sure, agreed. Like I said, once you opt in to ignore malformed, you take it with all its issues. But think about this common example: Twitter frequently sends illegal geoshapes. If we didn't have ignore malformed support on that field, the user would either have to:
All this for a field which is nice to have, but not required.
This is why I think that ignore malformed is an important tool for users. And, from this issue, it appears that the number of users who have similar problems with object vs non-object is high. Sure, this is an easier one to fix with ingest than geoshapes, but why wouldn't we just extend this feature that we already have to cover all field types?
Yes, this is why I haven't complained too much about the existing ignore_malformed option. But I still think this option is dangerous and support for it should be carefully considered. It could cause confusion due to bad scores, slow queries, documents that cannot be found or suspicious evolution of disk usage depending on how many documents are malformed.
I am ok to discuss adding such an option on a per-field basis. For instance I agree this could make sense on geo shapes, as an opt-in. However adding the ignore_malformed option to all fields would significantly increase the surface of our API and at the same time I'm not convinced it is required. The object vs. non-object case which has been used a couple times as an example here can be worked around by adopting naming conventions that make sure that objects and regular fields cannot have the same field name.
We've had a long discussion about this ticket in our Fix It Friday session without reaching a conclusion. If you are a user who would like this feature, your feedback would be greatly appreciated.
There are two camps:
The first says that ignore_malformed should be restricted only to use cases where it can be difficult to check validity client-side, eg a string which should be a date, or a malformed geoshape. This doesn't apply to the case where the same field name may be used with an object and (eg) a string. This would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.
This first camp says that it is a bad thing to use ignore_malformed, because you may base business decisions on the results of aggregations, without realising that only 10% of your documents have valid values.
The second camp says that some users are doing data exploration, and all they want to do is to get as much messy data loaded as possible so that they can explore, which allows them to build cleanup rules which will get them as close as possible to 100% clean data, possibly without ever reaching the goal of 100%.
What is your story? Why would you like (or not like) ignore_malformed to support the difference between object and string fields? Why might the ingest pipeline solution not be sufficient? The more detail you can provide, the better.
"It would be exceptionally difficult to build an ingest processor that checks for all the types of malformed data that might cause an exception in ES, and by the time the document is rejected it is too late."
Therein lies the rub for me. Most security use-cases involve a large effort to integrate external data sources and write message payload parsers. Formats frequently change during application upgrades and configuration changes. It's a continuous battle to get data clean and keep it clean, one that is never won. Having an option to index partially malformed documents makes detecting said partially erroneous documents significantly easier, because controlled invariants (e.g. standard injected metadata fields) can still be relied upon.
Of course, nothing that can't be achieved with (significant) external code that protects ES during indexing, discovers these errors, and deletes the fields from the documents before retrying.
I confess I haven't fully researched how a pipeline transformation would work, so stop me if I'm missing something obvious, but "use an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts" is not a viable solution. It presumes a controlled schema to begin with, which is not useful for integrating ad hoc external sources that don't follow such conventions.
So, I would cast a vote for the pragmatic over idealized choice here, having suffered at the garbage that manages to sneak into logs in the real world, and knowing first hand how much effort is necessary to work around it.
At the end of the day, it's still an explicit opt-in feature, users can blow their own foot off any number of other ways as well.
Is anyone else seeing any difference in any of the ES metrics when this occurs? We graph all of the ES node metrics via Grafana, and I just can't see any difference when this is happening vs. not.
TL;DR If we have dynamic field mappings, we should have ignore_malformed as an advanced opt-in option
"Elasticsearch shouldn't only work with perfect data, it should do the best it can with imperfect data too."
Based on experience with real-world data sets I agree with the comments that it is often a constant challenge to maintain consistent schemas given application and configuration changes.
Elasticsearch clearly can be configured to be strict, but its ability to ingest unclean data seems useful and popular.
As mentioned above this could be done on the client side. But if options such as dynamic field mapping and ignore_malformed are available and are used correctly, they can help users understand data quality and consistency (at scale) via post-index exploration rather than exception logging in client code. For example, if I have 1TB of JSON, I would prefer to ingest it into a system where I can explore and report on inconsistencies, rather than decipher client exceptions (whack-a-mole cleaning). With the partially indexed data, I can at least get an understanding of the data and hopefully some useful insights.
However, I also agree with @jpountz that there are dangers if these features are used without fully understanding their liabilities. Clearly, ignore_malformed can be dangerous and confusing as it can result in silently unindexed fields that are visible via match_all but unsearchable and presented as empty strings in Kibana. But it is an advanced opt-in feature.
I'd therefore like to help enhance the documentation and examples in this area to show benefits and liabilities. For example, it would be great to show together examples of strict configuration, dynamic field mapping behaviours, ignore_malformed behaviours, adding copy_to to index all as text (even for malformed) and an example ingest processor that can suffix types on conflict!
"Why would you like (or not like) ignore_malformed to support the difference between object and string fields?"
Finally, my 2¢ is that if we are accepting ignore_malformed is useful, it should be consistent. Accepting all types (including for example array) and not object seems inconsistent. But as objects are different I don't have a strong view on this feature.
@patrick-oyst @subhash @lizhongz For people using the enabled=false workaround, are you finding that this works for nested objects? I'm setting it on a nested object value, but when I reindex using the new mapping, I'm still seeing ES trying to read all the datatypes of the subfields in that object.
@elastic/es-search-aggs
Wanted to add my voice/usecase.
We use elasticsearch to ingest somewhat unpredictable data stored in mongoDB. It's not uncommon for a single field to store different types across objects, especially a mixture of objects and strings. Because our data is unpredictable (and can have lots of fields defined by client systems we don't control), we rely heavily on elasticsearch's dynamic mapping capability. Of course, anytime an object tries to save a value to a field with an incompatible type, it errors.
This is the correct default behavior, to be sure. And I generally do go back and clean up these inconsistencies once I find them. But as we use elasticsearch/kibana as our primary view into the data, and we don't control the creation of this data, we typically don't know about one of these conflicts until after the mapping has been created. At that point, I "fix" the data in our mongoDB to have consistent types. To fix elasticsearch, however, I have 2 options if the initial mapping wasn't correct:
For option 2, I'd like to be able to use ignore_malformed
to get the reindex to complete, and then I can separately reprocess the objects that had the wrong type in the old index (which I can find using a search on that index). But since ignore_malformed
doesn't work in most of my cases (like objects/strings), this doesn't work for me.
So in short, I agree with those who think ignore_malformed
should work consistently for all types. And it should be clearly documented that this is a dangerous feature that can cause silent data loss (though an awesome improvement would be for elasticsearch to add a "_warnings" field or something that lists all the issues it suppressed for a given object, so we'd have an easy way to identify incomplete objects).
I am of the opinion that one of elasticsearch's biggest selling points is how well it works with unclean/unpredictable data. Making ignore_malformed work as described would be a big help.
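To make the "_warnings" idea above concrete, here is a minimal client-side sketch in Python, run before indexing. It is not an Elasticsearch feature: the function name strip_malformed, the EXPECTED table and the "_warnings" field itself are all hypothetical, and in practice the expected types would have to be derived from the index mapping.

```python
from typing import Any, Dict, List

# Hypothetical table of expected Python types per field.
# A real version would derive this from the index mapping.
EXPECTED: Dict[str, type] = {"test": str, "count": int}


def strip_malformed(doc: Dict[str, Any], expected: Dict[str, type]) -> Dict[str, Any]:
    """Return a copy of doc without type-mismatched fields.

    Every dropped field is recorded in a "_warnings" list so that
    incomplete documents can be found with a search later on.
    """
    clean: Dict[str, Any] = {}
    warnings: List[str] = []
    for key, value in doc.items():
        want = expected.get(key)
        if want is not None and not isinstance(value, want):
            # Drop the field, but leave a trace of what was suppressed.
            warnings.append(
                f"{key}: expected {want.__name__}, got {type(value).__name__}"
            )
            continue
        clean[key] = value
    if warnings:
        clean["_warnings"] = warnings
    return clean
```

With something like this, a query for documents where "_warnings" exists would list every object that was indexed incompletely.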
Now that #29494 has been merged, is the intention to pick this issue back up for application to multiple types (specifically object vs. primitive mismatches)?
Curious what the direction is.
I think so. I think I was the main person who voiced concerns about increasing the scope of ignore_malformed, which are now mitigated with #29494.
@clintongormley In one of your comments, you mention
[…] the case where the same field name may be used with an object and (eg) a string. This would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.
Can I ask you to elaborate a bit on how you would detect and fix this in an ingest pipeline? I cannot seem to find any processor that allows me to test if a field is an object or a string. Would I have to resort to scripting?
@schourode Yes you would have to use scripting.
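For reference, the check itself is small. The sketch below shows the logic in Python, applied client-side before indexing rather than as an ingest script; the function name split_string_from_object and the "string" suffix are made up for illustration, following the my_field.string renaming quoted above.

```python
from typing import Any, Dict


def split_string_from_object(
    doc: Dict[str, Any], field: str, suffix: str = "string"
) -> Dict[str, Any]:
    """If doc[field] is a plain string, move it to doc[field][suffix].

    Objects (and missing fields) are left untouched, so the result is
    always compatible with an object mapping for the field.
    """
    value = doc.get(field)
    if isinstance(value, str):
        out = dict(doc)  # shallow copy, the caller's document is not modified
        out[field] = {suffix: value}
        return out
    return doc
```

The same test (is the value a string?) is what a script in an ingest pipeline would have to perform.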
In case anyone else runs into this issue as well, here is what I used. If the field json.error (supposed to be an object) is a text string, it's copied to the errormessage field and then dropped. If it's an object, it remains unchanged.
curl -H 'Content-type: application/json' -s --write-out "%{response_code}" -o reindex.txt -XPOST -u user:pass http://127.0.0.1:9200/_reindex -d '{
  "conflicts": "proceed",
  "source": {
    "index": "terrible-rubbish"
  },
  "dest": {
    "index": "so-shiny"
  },
  "script": {
    "source": "if (ctx._source.json != null && ctx._source.json.error != null && ctx._source.json.error instanceof String) { ctx._source.errormessage = ctx._source.json.remove(\"error\"); }",
    "lang": "painless"
  }
}'
You need to check whether each parent object element is null, or you'll get an error if you hit an index entry where it is absent.
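To illustrate why those guards matter, here is the same logic mirrored in Python. It is only a sketch for testing the idea locally (the function name move_string_error is made up); the script that actually runs during the reindex is the Painless one above.

```python
from typing import Any, Dict


def move_string_error(source: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror of the reindex script: move a string json.error to errormessage.

    The parent object is checked first, so documents without "json"
    (or with a null "json") pass through unchanged instead of failing.
    """
    json_obj = source.get("json")
    if isinstance(json_obj, dict) and isinstance(json_obj.get("error"), str):
        source["errormessage"] = json_obj.pop("error")
    return source
```

Documents where json.error is already an object, where json is missing, or where json is null all come back unchanged.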
I was referred here after raising #41372
Please, please, consider which options actual users have.
If „dirty” data is allowed to enter ES (and preferably flagged somehow) I can inspect it, I can analyze it, I can find it to test with it, I can count it. And I can see that it exists. In particular, I have the full power of Kibana at my disposal.
If „dirty” data is rejected, I must visit the ES logs with those horrible Java stack traces to find a cryptic error message about bulk post rejects. In most cases I don't even have a clue which data caused the problem or what the problem really is (see my #41372 for an example error, good luck guessing why it happened).
Regarding data loss: you fear business decisions made on data with a field missing? I could just as well make those business decisions based on a database which is missing 20% of its records entirely because they were rejected (perhaps due to a minor field that is irrelevant in most cases). And unless I am the ES sysadmin, I won't even know (with dirty data I have a good chance of noticing problematic records while exploring, and I can even have sanity queries).
From ELK's own field: Logstash does a very good thing with its _grokparsefailure tags (which can be further improved to differentiate between rules with custom tags). Something is wrong? I see records with those tags, can inspect them, count them, and analyze the situation.
One issue to consider during implementation, if this does get addressed: dynamic templates currently allow this setting even though it is rejected when directly mapping a field.
Mapping definition for [ignore_bool] has unsupported parameters: [ignore_malformed : true]:
PUT test
{
  "mappings": {
    "properties": {
      "ignore_bool": {
        "type": "boolean",
        "ignore_malformed": true
      }
    }
  }
}
ok:
PUT test
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "ignore_bools.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "boolean",
            "ignore_malformed": true
          }
        }
      }
    ]
  }
}
The resulting fields are also created without issue.
I'm integrating ES through Spring Cloud Alibaba and get the error below when creating an index.
ES version 7.6.2
springboot 2.3.4
springcloud alibaba 2.2.3
springcloud Hoxton.SR8
ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Invalid UTF-8 start byte 0xb5
at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@762d957b; line: 1, column: 86]]];