Fluent-bit: Duplicated JSON fields on record

Created on 20 Dec 2019 · 7 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug

When filters add entries to a record using a key that already exists, the record ends up with duplicated keys. If we then use an output with JSON format, it produces invalid JSON with duplicated keys. This also happens in nested maps.

To Reproduce
As an example, we have the following filters:

    [FILTER]
        Name                kubernetes
        Match               kubernetes.algorithms.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kubernetes.algorithms.drivers.var.log.containers.
        Merge_Log           On

    [FILTER]
        Name            parser
        Match           kubernetes.algorithms.*
        Key_Name        log_file
        Preserve_Key    True
        Reserve_Data    True
        Parser          pod_metadata

    [FILTER]
        Name            nest
        Match           kubernetes.algorithms.*
        Operation       nest
        Wildcard        kubernetes_*
        Nest_under      kubernetes
        Remove_prefix   kubernetes_
...

    [PARSER]
        Name      pod_metadata
        Format    regex
        Regex     \/var\/log\/containers\/(?<kubernetes_pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<kubernetes_namespace_name>[^_]+)_(?<kubernetes_container_name>.+)-(?<kubernetes_docker_id>[a-z0-9]{64})\.log$
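
For reference, the `pod_metadata` regex can be sanity-checked outside fluent-bit. This is a minimal sketch in Python (fluent-bit itself uses Oniguruma, whose `(?<name>...)` groups become `(?P<name>...)` in Python); the sample path and its 64-character docker id are made up for illustration:

```python
import re

# Python translation of the pod_metadata regex from the [PARSER] block above.
PATTERN = re.compile(
    r"/var/log/containers/"
    r"(?P<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9](?:[-a-z0-9]*[a-z0-9])?)*)"
    r"_(?P<namespace_name>[^_]+)"
    r"_(?P<container_name>.+)"
    r"-(?P<docker_id>[a-z0-9]{64})\.log$"
)

# Hypothetical sample path; real docker ids are 64 hex chars, shortened here to "a" * 64.
path = (
    "/var/log/containers/consents-importer-1576685889687-exec-1"
    "_consents_executor-" + "a" * 64 + ".log"
)
m = PATTERN.search(path)
print(m.group("pod_name"))        # consents-importer-1576685889687-exec-1
print(m.group("namespace_name"))  # consents
print(m.group("container_name"))  # executor
```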

We have the second filter due to problems with the kubernetes filter (https://github.com/fluent/fluent-bit/issues/1399): when the kubernetes filter fails, we want to extract some metadata using the parser.

If the kubernetes filter works fine, it adds kubernetes.pod_name, etc., and the parser adds the same keys, so the second should overwrite the first. However, it generates JSON like this:

  "kubernetes": {
    "pod_name": "consents-importer-1576685889687-exec-1",
    "namespace_name": "consents-importer-f6e67dc4cd",
    "container_name": "executor",
    "docker_id": "2ecd08079f80d1483a8940a6545923430e4bb96ab77aaef55799a3c2a434cf0f",
    "pod_name": "consents-importer-1576685889687-exec-1",
    "namespace_name": "consents-importer-f6e67dc4cd",
    "pod_id": "4665e30a-21b4-11ea-b9af-0e875ecffd2f",
    ...

which has duplicated keys, so it's invalid JSON.

Expected behavior
The second filter should override the first, avoiding duplicated keys.

Your Environment

  • Version used: 1.3.4
  • Configuration: shown above.
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.14.6
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context

Labels: enhancement, fixed

Most helpful comment

FYI:

I've pushed fix 1d148860a8825d5f80aef40efd0d6d2812419740 to handle duplicated keys in a map. The workaround is basically: when converting the data to JSON, if a duplicated key is found, only pack the last value found for that key.

This will be part of v1.6 release next week.

@GeorgFleig the double/integer issue in Lua scripts was already fixed in the latest version of Fluent Bit (v1.5.7)

All 7 comments

We are hitting the same issue; we are about to roll back our fluent-bit usage and start using fluentd again.

What's the expected behavior? How should the duplicates be resolved from a user perspective?

In Fluentd this is the default behavior (tested with in_forward -> out_stdout)

input

{"key1": 123, "key2": 456, "key1": 789}

output in Fluentd

{"key1":789,"key2":456}
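
For comparison, many JSON decoders silently apply the same last-value-wins rule. A quick sketch with Python's standard `json` module:

```python
import json

raw = '{"key1": 123, "key2": 456, "key1": 789}'

# json.loads keeps the last value for a duplicated key,
# matching the Fluentd output shown above.
print(json.loads(raw))  # {'key1': 789, 'key2': 456}

# Duplicates can still be observed explicitly with object_pairs_hook,
# which receives every key/value pair before deduplication happens.
pairs = json.loads(raw, object_pairs_hook=list)
print(pairs)  # [('key1', 123), ('key2', 456), ('key1', 789)]
```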

There are two things here: on one hand, somewhere you are adding a duplicate field; on the other hand, the JSON spec doesn't say this is forbidden:

An object structure is represented as a pair of curly brackets
   surrounding zero or more name/value pairs (or members).  A name is a
   string.  A single colon comes after each name, separating the name
   from the value.  A single comma separates a value from a following
   name.  The names within an object SHOULD be unique.

https://tools.ietf.org/html/rfc7159#section-4

it's a SHOULD, not a MUST.

Now the problem is that every document database works differently.

I think the sanest workaround is to add an option to every plugin that converts data to JSON, allowing duplicates to be overridden.

It does seem tough to solve this in a single pass when converting from MessagePack to JSON. In a two-pass arrangement, we could first count occurrences per key and then, in the second pass, only output the last occurrence of each duplicated key.
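
That two-pass idea could be sketched like this (a hypothetical helper operating on decoded key/value pairs, not fluent-bit's actual C implementation):

```python
def dedup_last_wins(pairs):
    """Keep only the last occurrence of each key, preserving its position."""
    # First pass: count remaining occurrences of each key.
    remaining = {}
    for key, _ in pairs:
        remaining[key] = remaining.get(key, 0) + 1
    # Second pass: emit a pair only when it is the last occurrence of its key.
    out = []
    for key, value in pairs:
        remaining[key] -= 1
        if remaining[key] == 0:
            out.append((key, value))
    return out

print(dedup_last_wins([("key1", 123), ("key2", 456), ("key1", 789)]))
# [('key2', 456), ('key1', 789)]
```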

@edsiper do you plan on implementing the per-plugin duplicate overwrite that you described earlier?

I'm also hitting this issue using fluentbit in a k8s cluster. An application (running in a pod) whose log output I don't control prints duplicate keys inside the log message. The JSON log is parsed and sent to Elasticsearch with duplicate keys; Elasticsearch does not accept the JSON, and fluentbit keeps retrying.

After some hours, fluentbit on that k8s node stopped sending any new log messages to Elasticsearch, because it seemed to be busy retrying the messages with duplicate keys. This way, legitimate messages were no longer processed by fluentbit.

So a single pod caused fluentbit to fail on the whole k8s node :/

Unfortunately, Elasticsearch no longer provides a flag to accept duplicate JSON keys. As I cannot control the log output of the application, my only chance is to somehow do this in fluentbit.

I found a somewhat messy workaround using a Lua filter. When the log message is converted to a Lua table, the duplicate fields are dropped automatically; converting it back to JSON results in a cleaned log message. However, this round trip also transforms integers into doubles (https://www.lua.org/pil/2.3.html) and does not handle empty JSON lists :/

Integers become doubles:

"statusCode"=>200
"statusCode"=>200.000000

And an empty list becomes an object

"tags"=>[]
"tags"=>{}

If anyone is interested in this workaround anyway, here is the configuration:

[FILTER]
    Name    lua
    Match   *
    Script  /fluent-bit/etc/filter_duplicate_fields.lua
    Call    filter_duplicate_fields

filter_duplicate_fields.lua:

-- By the time this function runs, the record has already been converted
-- to a Lua table, which silently drops duplicate keys. Returning 1 tells
-- fluent-bit the record was modified, so it re-serializes the
-- (now deduplicated) table.
function filter_duplicate_fields(tag, timestamp, record)
    return 1, timestamp, record
end

Would love to see this overwrite mechanism implemented within the fluent-bit plugins themselves.

Edit (for reference): Found another ticket where a Lua filter is modifying data due to type conversion: https://github.com/fluent/fluent-bit/issues/2015

@edsiper Can you confirm that what you mentioned in https://github.com/fluent/fluent-bit/issues/1835#issuecomment-568002973 is that Fluentd prunes duplicate fields?

In Fluentd this is the default behavior (tested with in_forward -> out_stdout)

input

{"key1": 123, "key2": 456, "key1": 789}

output in Fluentd

{"key1":789,"key2":456}

I'm currently testing Fluent Bit to replace Fluentd and have noticed many of my output requests failing because Stackdriver will not accept logs with duplicate fields. I don't (yet) see evidence of this occurring with Fluentd to Stackdriver.

Similarly to @GeorgFleig I cannot fully control the log outputs of the application in our cluster.

Using a Lua filter as a workaround has complications. If I want to preserve integers, I will have to know field names in advance, which again comes back to the issue that I don't fully control the log output.


