Many issues have been reported, mainly related to JSON decoding in either a codec or a filter, where a valid JSON document contains a key that starts with `[`. The key is interpreted as a Logstash field reference and results in a `LogStash::Json::ParserError: Invalid FieldReference` error.
To reproduce:
echo '{"[foo":"bar"}' | bin/logstash -e 'input{stdin{codec=>json_lines}} output{stdout{codec=>rubydebug}}'
```
...
[2020-02-19T11:46:58,786][WARN ][logstash.codecs.jsonlines][main][ee68f56b1186b09c0ebc08387e2d8df11ff00788d3a22c61eeda228a073bb104] JSON parse error, original data now in message field {:error=>#<LogStash::Json::ParserError: Invalid FieldReference: `[foo`>, :data=>"{\"[foo\":\"bar\"}"}
{
    "@timestamp" => 2020-02-19T16:46:58.803Z,
       "message" => "{\"[foo\":\"bar\"}",
          "host" => "mbp15r",
      "@version" => "1",
          "tags" => [
        [0] "_jsonparsefailure"
    ]
}
```
The problem is that these keys are in fact valid JSON, but they are not parsable by Logstash and result in a bad user experience.
I believe we should offer some way to mitigate that, maybe by allowing the user to specify some replacement character for the brackets that denote a field reference? Open to suggestions.
This relates to the FieldReference strict mode introduced in #9543
WDYT?
Hey, I'm also running into this. I found a workaround using a filter, but it kills my performance: throughput drops from 2k events/sec to 500~1000.
@rafael-adcp Right. What kind of solution would work for you? We cannot allow the use of field reference syntax in a field key, so we have to come up with an idea to deal with that kind of situation. Would replacing the brackets with another char work for you?
@colinsurprenant
Place this in your Logstash filter. Also, heads up: I'm grabbing it from the doc, so if you are using a different field/structure, just change that.
```
filter {
  # community workaround solution: https://discuss.elastic.co/t/avoiding-field-reference-grammer-in-json-parsing/177899/4
  ## The goal of applying this only on JSON failures is that
  ## filtering is very expensive, meaning it uses much more CPU and slows down throughput
  ## (and I really mean it),
  ## so by applying this only when parsing fails we ensure
  ## that only "messages" with issues get filtered, which preserves performance.
  ## HEADS UP: note that I'm only replacing "[" and "]"; if more characters appear,
  ## it's just a matter of adding them to the regex.
  if "_jsonparsefailure" in [tags] {
    ruby {
      code => "
        def sanitize_field_reference(item)
          case item
          when Hash
            item.keys.each { |k| item[k.gsub(/\[|\]/, '_')] = sanitize_field_reference(item.delete(k)) }
            return item
          when Array
            return item.map { |e| sanitize_field_reference(e) }
          else
            return item
          end
        end
        event.set('doc', sanitize_field_reference(event.get('doc')))
      "
    }
  }
}
```
Also, I'm not 100% familiar with the reasons why this behaviour changed from Logstash 6.x to 7+.
Though as a user, if my JSON is valid it should work without any hacks on my side, so whatever behaviour happens when `[` appears should be behind a config option.
It's also important to note that there are some well-known "special characters" in Logstash:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html#plugins-filters-kv-remove_char_key
so I bet those (`<>[],`) will also cause confusion.
@rafael-adcp Thanks, I am well aware of the filtering workaround and its performance & complexity consequences.
This is a [discuss] issue (also tagged as such) to discuss how we can improve this behaviour. In my previous question «What kind of solution would work for you?» I was more interested in hearing what could be improved in Logstash in the future to deal with this situation, not your current workaround, but thanks for sharing.
Also,
> though as a user if my json is valid it should work without any hacks on my side
I agree in principle, but practically speaking we still have to deal with the problem of not allowing a field key which is ambiguous with a field reference. Although the JSON is valid, we cannot allow a key such as `[foo]` to be used as a key in the event, so we have to think about ways to mitigate that situation.
> also important to notice that there are some well known "special characters" to logstash so i bet those (<>[],) will also cause confusion
Not really. We are really focusing on the case where a valid JSON document (note that this could also happen with other inputs/codecs) has a field key that uses the field reference syntax which cannot be allowed as-is.
For example, imagine your JSON input is the following: `{"[foo]":"bar"}`, and imagine nothing was done in Logstash to prevent that; it would lead to problems trying to access that field in the config because the field key itself is a field reference. You cannot do something like `if [[foo]]` in your config, and just writing `if [foo]` refers to the `foo` field key and not `[foo]`.
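A minimal sketch of that ambiguity, assuming a Logstash/JRuby environment where `LogStash::Event` can be loaded:

```ruby
require "logstash/event" # assumes a Logstash/JRuby environment

event = LogStash::Event.new("foo" => "bar")
event.get("foo")   # => "bar"  bare key
event.get("[foo]") # => "bar"  field reference resolving to the same key
# A key literally named "[foo]" would be unreachable: "[foo]" already means
# "the top-level foo field", and "[[foo]]" is rejected by the strict
# FieldReference parser rather than treated as an escaped literal.
```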
The event is being tagged with `_jsonparsefailure` because the Ruby `LogStash::Event::from_json` in Logstash core raises a Ruby `LogStash::Json::ParserError` upon encountering _any_ Java exception (even exceptions that are unrelated to JSON parsing):
```java
try {
    events = Event.fromJson(value.asJavaString());
} catch (Exception e) {
    throw RaiseException.from(context.runtime, RubyUtil.PARSER_ERROR, e.getMessage());
}
```
-- src/main/java/org/logstash/ext/JrubyEventExtLibrary.java:[email protected]
The various JSON codecs handle this specific exception by creating a new event with the un-decoded payload and the _jsonparsefailure tag.
This _tag_ leads people to believe that we cannot parse the JSON (which we can), when the real problem is that we cannot create a LogStash::Event from the _structure_ the encoded JSON represents.
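A minimal reproduction of that wrapping behaviour (a sketch, assuming a Logstash/JRuby environment where `LogStash::Event` is loadable):

```ruby
require "logstash/event" # assumes a Logstash/JRuby environment

LogStash::Event.from_json('{"foo":"bar"}')
# => [#<LogStash::Event ...>]  valid JSON, valid structure

begin
  LogStash::Event.from_json('{"[foo":"bar"}') # equally valid JSON...
rescue LogStash::Json::ParserError => e
  e.message # => Invalid FieldReference: `[foo`  ...but an invalid event structure
end
```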
@yaauie that's right, and I believe that this is the topic of the discussion here: specifically, find/offer a way for users to be able to create events from JSON (or another decoded format) that contains a field key which is invalid in our event structure but valid in the original format (JSON), without having to resort to exception handling using the `_jsonparsefailure` tag, which will be highly inefficient if this situation is not in fact exceptional but regular throughout the input data.
bump.
There are a few options for how this could be handled:
- `escape => "[]"` characters by default
- `trim => ["[", "]", " "]` characters from begin/end (a partial solution)
- `replace => [ "[", "", "]", "" ]` characters using a user-provided mapping

`replace` seems the most "universal" option; `escape` would be nice (the most user-friendly one out-of-the-box), but LS would need changes to the event API to not process escaped `\[` references.
I would also use this opportunity to decouple the JSON parsing from `Event`, to have the ability to parse raw data using LS semantics into a Hash/Map-like structure.
p.s. it's a bit annoying that there isn't a specific error type: RuntimeError (Invalid FieldReference: `[foo`)
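For illustration, a sketch of what a dedicated exception type could look like; `InvalidFieldReferenceError` is a hypothetical name, and the `ParserError` base class is stubbed here so the snippet stands alone:

```ruby
module LogStash
  module Json
    # Stub of the existing base class, for a self-contained sketch.
    class ParserError < StandardError; end

    # Hypothetical dedicated type: callers could rescue it precisely
    # instead of matching on the message text.
    class InvalidFieldReferenceError < ParserError
      attr_reader :reference

      def initialize(reference)
        @reference = reference
        super("Invalid FieldReference: `#{reference}`")
      end
    end
  end
end

begin
  raise LogStash::Json::InvalidFieldReferenceError.new("[foo")
rescue LogStash::Json::InvalidFieldReferenceError => e
  e.reference # => "[foo"
end
```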
If we support escaping of the chars `[` and `]` in field references inside Logstash pipeline definitions, we should be OK.
I mean, if a JSON document contains a field named `[field]`, we create a field in the event with the same name.
So whereas in Logstash pipelines we usually use the syntax `[normal_field]` to reference `normal_field`, in this case we would use `[\[squared\]]` to reference the field `[squared]`.
Am I missing something?
@kares agree, we should use a specific exception for Invalid FieldReference.
@andsel I think that could make sense, I'll play with this idea to see how it could work. I like it because it would be completely independent from the actual parser used, would be consistent, would not require special configuration, and would shift the burden to the config author to correctly address field names with brackets by escaping them.
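As a sketch of how such escaping could parse, assuming backslash-escaped brackets are treated as literals (`parse_reference` is a hypothetical name; the real FieldReference parser lives in Java):

```ruby
# Split an escaped field reference into path segments, honoring
# backslash-escaped brackets as literal characters.
def parse_reference(ref)
  ref.scan(/\[((?:\\.|[^\[\]\\])*)\]/)
     .map { |(segment)| segment.gsub(/\\([\[\]\\])/, '\1') }
end

parse_reference('[normal_field]') # => ["normal_field"]
parse_reference('[a][b]')         # => ["a", "b"]
parse_reference('[\[squared\]]')  # => ["[squared]"]
```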
Pointing out here that this is how syslog-ng encodes JSON arrays, i.e. a log message of `{"foo": ["bar", "baz"]}` gets encoded as `{"foo[0]": "bar", "foo[1]": "baz"}`. So when we upgraded, this broke our syslog handling for any log that logged an array (e.g. tags).
> @rafael-adcp Right. What kind of solution would work for you? We cannot allow the use of field reference syntax in a field key, so we have to come up with an idea to deal with that kind of situation. Would replacing the brackets with another char work for you?
My preferences:
replace => [ "[", "<", "]", ">" ]replace => [ "[", "_", "]", "_" ]replace => [ "[", "", "]", "" ]Whichever solution is chosen, I do not believe a JSON provider should have to modify valid JSON output due to a Logstash limitation. And it definitely should not crash the pipeline.
Although this may not be the correct place to discuss a workaround, I would expect that Elastic/Logstash would recommend official workaround(s) until this issue is addressed.