Elasticsearch accepts invalid JSON with unpredictable behavior

Created on 27 Jul 2016  ·  11 comments  ·  Source: elastic/elasticsearch

When a key is present in a JSON object multiple times, no parse error is raised and only the last value is used. This should instead raise a json_parse_exception.

Elasticsearch version: verified on 2.x, 5.0.0-alpha3

Steps to reproduce:

  1. curl -X PUT localhost:9200/i -d '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
  2. curl -X GET localhost:9200/i
Labels: :Core/Infra/REST API, >bug
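
For illustration, here is a minimal standalone sketch of the same behaviour using Jackson (the JSON library Elasticsearch uses). It goes through databind's readTree rather than the Elasticsearch code path, so it is only an approximation of the reproduction above, but it shows the same last-value-wins outcome:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class DuplicateKeyDemo {
        public static void main(String[] args) throws Exception {
            // Same body as the PUT request in step 1 above.
            String json = "{\"settings\": {\"number_of_replicas\": 2}, "
                        + "\"settings\": {\"number_of_shards\": 1}}";

            // With Jackson's defaults, duplicate keys are accepted and the
            // last occurrence silently replaces the earlier ones.
            JsonNode root = new ObjectMapper().readTree(json);

            // Prints {"number_of_shards":1} - the replica setting is gone.
            System.out.println(root.get("settings"));
        }
    }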

Most helpful comment

Allowing duplicate keys adds a lot of confusion: https://discuss.elastic.co/t/using-the-remove-processor-for-ingest-node/56500

Maybe for certain APIs we should enable strict parsing? (admin-like APIs?)

All 11 comments

I could see this becoming a long discussion around whether this is invalid JSON or not and whether we should return a parse exception or some other error. Even if the JSON library we use for parsing allows this, we should improve this on our end rather than being lenient.

This reminds me of #19547 too and is a very common problem with the way we pull-parse JSON. It can easily be solved case by case, but every single parser in our codebase is subject to this, so it would be nice to have a generic solution. I'm not sure if there are alternatives to adding lots of ifs to all our pull parsers; we should evaluate that.

Even if the JSON library we use for parsing allows this, we should improve this on our end rather than being lenient.

For the record, the option is JsonParser.Feature.STRICT_DUPLICATE_DETECTION and it comes with the following warning:

Note that enabling this feature will incur performance overhead 
due to having to store and check additional information: 
this typically adds 20-30% to execution time for basic parsing.
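
For reference, a minimal standalone sketch (not Elasticsearch code) of what enabling the feature looks like and how the parser then reacts to a duplicate key:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParseException;
    import com.fasterxml.jackson.core.JsonParser;

    public class StrictDuplicateDemo {
        public static void main(String[] args) throws Exception {
            JsonFactory factory = new JsonFactory();
            // Duplicate detection is off by default; opt in explicitly.
            factory.enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);

            String json = "{\"settings\": {\"number_of_replicas\": 2}, "
                        + "\"settings\": {\"number_of_shards\": 1}}";
            try (JsonParser parser = factory.createParser(json)) {
                // Pull every token; the second "settings" field triggers the error.
                while (parser.nextToken() != null) {
                }
            } catch (JsonParseException e) {
                System.out.println("Rejected: " + e.getOriginalMessage());
            }
        }
    }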

According to the JSON spec, this isn't invalid JSON. The spec doesn't say how duplicate keys should be treated. Many languages will simply overwrite older values with newer values without generating any warning. This is essentially what Elasticsearch does today, and I'm not sure it is worth a 20-30% penalty to prevent this behaviour.

Yes, strictly speaking this is valid (the RFC only says the keys SHOULD be unique). I also agree that the performance penalty isn't worth it. It would, however, be nice to document this behavior and perhaps (if it's easy) have an option to turn on strict checking, ideally per request - it would be useful as a debugging tool and perhaps when running tests.

Allowing duplicate keys adds a lot of confusion: https://discuss.elastic.co/t/using-the-remove-processor-for-ingest-node/56500

Maybe for certain APIs we should enable strict parsing? (admin-like APIs?)

Discussed in FixitFriday: let's play with the Jackson feature to reject duplicate keys and make sure that it works and has a reasonable performance hit. If it is not satisfactory, then let's look into whether there are things we can do at a higher level, such as ObjectParser.

Macrobenchmark Results

We have run our whole macrobenchmark suite with JsonParser.Feature.STRICT_DUPLICATE_DETECTION == false (baseline) and JsonParser.Feature.STRICT_DUPLICATE_DETECTION == true (STRICT_DUPLICATE_DETECTION) and published the results: https://elasticsearch-benchmarks.elastic.co/19614/

We see at most a reduction in median indexing throughput of 3% for our macrobenchmark suite (PMC track).

Microbenchmark Results

I also double-checked a few scenarios with a microbenchmark and saw similar results (see https://gist.github.com/danielmitterdorfer/9236796a46f3956447171313a6a0b365):

Below are the results of both configurations showing the average time for one iteration (smaller is better).

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: false:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  19.414 ± 0.044  us/op
JsonParserBenchmark.smallJson  avgt   60   0.479 ± 0.001  us/op

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: true:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  20.642 ± 0.064  us/op
JsonParserBenchmark.smallJson  avgt   60   0.487 ± 0.001  us/op

For small JSON objects (49 bytes) the overhead of the duplication check is 8 ns or 1.6%. For a large JSON object (6440 bytes) the overhead of the duplication check is between 1.12 us [1] and 1.3 us [2], or between 5.8% and 6.7%.

[1] best case duplication check enabled: 20.578 us, worst case duplication check disabled: 19.458 us
[2] worst case duplication check enabled: 20.706 us, best case duplication check disabled: 19.370 us

Please refer to the gist for more details.
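
The actual benchmark code is in the gist linked above; the sketch below is only an illustration of the general JMH setup that was measured (class name, payload and parameter handling here are made up, not copied from the gist):

    import java.util.concurrent.TimeUnit;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class DuplicateDetectionBenchmark {

        // Measured once with the feature off (baseline) and once with it on.
        @Param({"false", "true"})
        public boolean strictDuplicateDetection;

        private JsonFactory factory;
        private String smallJsonDoc;

        @Setup
        public void setUp() {
            factory = new JsonFactory();
            if (strictDuplicateDetection) {
                factory.enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);
            }
            smallJsonDoc = "{\"name\": \"test\", \"value\": 42, \"enabled\": true}";
        }

        @Benchmark
        public int smallJson() throws Exception {
            // Pull-parse the document end to end, as Elasticsearch would.
            int tokens = 0;
            try (JsonParser parser = factory.createParser(smallJsonDoc)) {
                while (parser.nextToken() != null) {
                    tokens++;
                }
            }
            return tokens;
        }
    }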

Thanks @danielmitterdorfer. To me that means we should do it. We can have an undocumented escape hatch if we do not feel confident the overhead will be low in all cases.

We can have an undocumented escape hatch

@jpountz The relevant code is in a static block, so we can't use our settings infrastructure. I guess that means we'd use a system property?

That would work for me. Or we could handle it like INDICES_MAX_CLAUSE_COUNT_SETTING I suppose, which is a node setting that sets the static limit on the number of boolean clauses.
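
As a sketch of what such an escape hatch could look like - the property name, default and holder class below are hypothetical, not anything that exists in the codebase:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;

    public final class JsonFactoryHolder {

        public static final JsonFactory JSON_FACTORY;

        static {
            JsonFactory factory = new JsonFactory();
            // Hypothetical system property, read once in the static initializer,
            // since node settings are not available at this point.
            boolean strict = Boolean.parseBoolean(
                    System.getProperty("es.json.strict_duplicate_detection", "true"));
            if (strict) {
                factory.enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);
            }
            JSON_FACTORY = factory;
        }

        private JsonFactoryHolder() {}
    }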

Thanks @danielmitterdorfer.

I agree with @jpountz, and a first step would be to see if our tests pass (I'm pretty sure we will have to adapt some of them). Also, the same JSON factory is used for both parsing and generating JSON: if we enable this feature then we'll also see if we generate duplicate keys somewhere, which is cool.
