Fluent-bit: Error encoding to GELF message when the Kubernetes filter is applied

Created on 31 Jan 2019 · 5 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
An error occurs when a message that has gone through the Kubernetes filter is encoded to a GELF message by the gelf output plugin.

To Reproduce
The created message is:

[159] kubernetes.fluent-bit-1.0.3-lstmf_kube-system_fluent-bit-ed3ada5ce94d2da9e98d762ed1b501b343ae86abbace8ed02734d80b751e6c08.log: [1548903233.004025739, {"log"=>"", "stream"=>"stderr", "time"=>"2019-01-31T02:53:48.869602266Z", "short_message"=>"short_message", "level"=>"5", "kubernetes"=>{"pod_name"=>"weave-net-m8xhq", "namespace_name"=>"kube-system", "pod_id"=>"81c1bd93-e4f9-11e8-81a0-005056151053", "labels"=>{"controller-revision-hash"=>"67692866", "name"=>"weave-net", "pod-template-generation"=>"6"}, "host"=>"worker-2", "container_name"=>"weave-npc", "docker_id"=>"0a96bd14e9f52cf4edd7ed93f3112a299bb3be03f100f1e42a677828fa684c31"}}]

The error message is:

[error] [out_gelf] error encoding to GELF

I have no idea why the error occurred :thinking:

The trace-level log is:

[engine] caught signal (SIGSEGV)
#0  0x7efd4c9ab830      in  ???() at ???:0
#1  0x7efd4c972db9      in  ???() at ???:0
#2  0x7efd4c99ae88      in  ???() at ???:0
#3  0x7efd4e60e643      in  flb_log_print() at src/flb_log.c:365
#4  0x7efd4e62c3a0      in  flb_msgpack_gelf_key() at src/flb_pack.c:801
#5  0x7efd4e62ce0e      in  flb_msgpack_gelf_flatten() at src/flb_pack.c:1060
#6  0x7efd4e62cd27      in  flb_msgpack_gelf_flatten() at src/flb_pack.c:1041
#7  0x7efd4e62d4ec      in  flb_msgpack_to_gelf() at src/flb_pack.c:1246
#8  0x7efd4e679711      in  cb_gelf_flush() at plugins/out_gelf/gelf.c:307
#9  0x7efd4e619f4a      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:314
#10 0x7efd4e928686      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#11 0xffffffffffffffff  in  ???() at ???:0

Expected behavior
The GELF message should be sent via UDP.

Your Environment

  • Version used: 1.0.3
  • Environment name and version: Kubernetes 1.10.3
  • Server type and version: x86_64
  • Operating System and version: centos7
  • Filters and plugins:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         trace
        Daemon            off
        Buffer_Path       /var/log/fluentd-buffers/kubernetes.system.buffer
        HTTP_Server       off
        Parsers_File      parsers.conf

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-gelf.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kubernetes.*
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  filter-kubernetes.conf: |

    [FILTER]
        Name record_modifier
        Match *
        Record short_message short_message

    [FILTER]
        Name                kubernetes
        Match               kubernetes.*
        Merge_Log           On
        K8S-Logging.Parser  On

  output-gelf.conf: |

    [OUTPUT]
        Name                     gelf
        Match                    kubernetes.*
        Host                     graylog.kube-system
        Port                     12201
        Mode                     udp
        Gelf_Short_Message_Key   short_message

  parsers.conf: |

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
        Decode_Field_As   escaped_utf8    log

Additional context

I have no idea why the error occurred, because there are no logs and no documentation related to the gelf output plugin.

All 5 comments

Same issue with Fluent Bit v1.0.3.

I think there is a problem with parsing some fields.
When using the Kubernetes filter, fluent-bit adds additional fields with metadata.

Output from the stdout plugin shows:

[1] tail.0: [1548923665.487556365, {"log"=>{"name"=>"Test", "age"=>10}, "short_message"=>{"name"=>"Test", "age"=>10}, "stream"=>"stdout", "time"=>"2019-01-31T08:34:25.487556365Z", "kubernetes"=>{"pod_name"=>"nginx-ingress-controller-8bd4d", "namespace_name"=>"ingress-nginx", "pod_id"=>"c3aef641-248e-11e9-9749-005056b45133", "labels"=>{"app"=>"ingress-nginx", "controller-revision-hash"=>"1802405329", "pod-template-generation"=>"5"}, "annotations"=>{"field.cattle.io/ports"=>"[[{\\\"containerPort\\\":80,\\\"dnsName\\\":\\\"nginx-ingress-controller-\\\",\\\"name\\\":\\\"http\\\",\\\"protocol\\\":\\\"TCP\\\",\\\"sourcePort\\\":0},{\\\"containerPort\\\":443,\\\"dnsName\\\":\\\"nginx-ingress-controller-\\\",\\\"name\\\":\\\"https\\\",\\\"protocol\\\":\\\"TCP\\\",\\\"sourcePort\\\":0}]]", "field.cattle.io/publicEndpoints"=>"[{\\\"nodeName\\\":\\\"local:machine-ngz5g\\\",\\\"addresses\\\":[\\\"1.1.1.1\\\"],\\\"port\\\":80,\\\"protocol\\\":\\\"TCP\\\",\\\"podName\\\":\\\"ingress-nginx:nginx-ingress-controller-8bd4d\\\",\\\"allNodes\\\":false},{\\\"nodeName\\\":\\\"local:machine-ngz5g\\\",\\\"addresses\\\":[\\\"1.1.1.1\\\"],\\\"port\\\":443,\\\"protocol\\\":\\\"TCP\\\",\\\"podName\\\":\\\"ingress-nginx:nginx-ingress-controller-8bd4d\\\",\\\"allNodes\\\":false}]", "prometheus.io/port"=>"10254", "prometheus.io/scrape"=>"true"}, "host"=>"ingress02", "container_name"=>"nginx-ingress-controller", "docker_id"=>"62c32ed9a3e7dd51927f13b4bdd421563db2f28865415d6a7204216b618e9d11"}, "host"=>"6edda30181d4"}]

Fields like prometheus.io/port or prometheus.io/scrape cannot be parsed.

I tried to apply this filter:

[FILTER]
    Name modify
    Match *
    Remove annotations

But it did not work: the record still contained the annotations field (bug or feature?).

However

[FILTER]
    Name modify
    Match *
    Remove kubernetes

works and removes all of the Kubernetes metadata, so the GELF plugin can encode all fields.

P.S. When the debug or trace log level is set:

[2019/02/04 12:44:59] [debug] [task] created task=0x7f137323b180 id=0 OK
[engine] caught signal (SIGSEGV)
#0  0x7f137482a830      in  ???() at ???:0
#1  0x7f13747f1db9      in  ???() at ???:0
#2  0x7f1374819e88      in  ???() at ???:0
#3  0x55f4de1ba643      in  flb_log_print() at src/flb_log.c:365
#4  0x55f4de1d83a0      in  flb_msgpack_gelf_key() at src/flb_pack.c:801
#5  0x55f4de1d8e0e      in  flb_msgpack_gelf_flatten() at src/flb_pack.c:1060
#6  0x55f4de1d8d27      in  flb_msgpack_gelf_flatten() at src/flb_pack.c:1041
#7  0x55f4de1d94ec      in  flb_msgpack_to_gelf() at src/flb_pack.c:1246
#8  0x55f4de20abb6      in  http_gelf() at plugins/out_http/http.c:343
#9  0x55f4de20ad70      in  cb_http_flush() at plugins/out_http/http.c:387
#10 0x55f4de1c5f4a      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:314
#11 0x55f4de4d4686      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#12 0xffffffffffffffff  in  ???() at ???:0

The GELF format limits key characters to [A-Za-z0-9_.-], so the "/" in prometheus.io/scrape is not a valid character.

I will check the crash that happens when printing the debug message.

Instead of discarding the message we can skip the "invalid" keys or rewrite the incorrect characters.
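
As an illustration of the first option (skipping invalid keys), a minimal standalone sketch in C of a validity check against the GELF key character set; the helper name gelf_key_is_valid is hypothetical, not part of fluent-bit's actual API:

    #include <ctype.h>
    #include <stddef.h>

    /* Hypothetical helper: return 1 if every character of the key is in the
     * GELF-allowed set [A-Za-z0-9_.-], 0 otherwise. A caller could then skip
     * an offending key instead of discarding the whole record. */
    static int gelf_key_is_valid(const char *key, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            unsigned char c = (unsigned char) key[i];
            if (!isalnum(c) && c != '_' && c != '.' && c != '-') {
                return 0;  /* e.g. the '/' in "prometheus.io/scrape" */
            }
        }
        return 1;
    }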

The problem with the crash is that in the flb_debug call the prefix_key and prefix_key_len arguments are in the wrong order:

        flb_debug("[%s] invalid key char '%.*s'",  __FUNCTION__, prefix_key, prefix_key_len);

should be:

        flb_debug("[%s] invalid key char '%.*s'",  __FUNCTION__, prefix_key_len, prefix_key);

I will open a pull request to fix this.
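
For context on why the swapped arguments crash rather than just print garbage: "%.*s" makes printf-style functions read an int precision first and the string pointer second, so reversing them causes the length value to be dereferenced as an address. A minimal standalone C sketch (the key and length values are illustrative):

    #include <stdio.h>

    int main(void)
    {
        const char *prefix_key = "prometheus.io/scrape";
        int prefix_key_len = 20;

        /* Correct: "%.*s" consumes the int precision, then the string. */
        printf("invalid key char '%.*s'\n", prefix_key_len, prefix_key);

        /* Broken, as in the original flb_debug() call: the pointer is read
         * as the precision and the int is read as a char pointer, so a
         * small integer gets dereferenced as an address -- which is what
         * raises the SIGSEGV inside flb_log_print():
         *
         *     printf("invalid key char '%.*s'\n", prefix_key, prefix_key_len);
         */
        return 0;
    }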

Fixed by #1132

While the underlying debug-logging bug has been fixed, I don't think the initial issue has been properly addressed. I also got bitten by this, and for a new user it's not immediately apparent that GELF doesn't support Kubernetes annotations and labels, especially considering that a default Kubernetes install already has lots of pods with annotations and labels containing /. In other words, running into issues with a default Kubernetes & fluent-bit setup is not good UX.

My suggestion would be to replace, in flb_msgpack_to_gelf(), all invalid GELF key characters with _, or at least /, which is the single cause of this issue. If you think that's a reasonable approach, I could open a PR.
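
A minimal sketch in C of such a sanitizer, assuming it would be applied to each flattened key before it is written out; the helper name gelf_sanitize_key is hypothetical, not fluent-bit's actual API:

    #include <ctype.h>
    #include <stddef.h>

    /* Hypothetical helper: rewrite, in place, every character outside the
     * GELF key set [A-Za-z0-9_.-] to '_', so that a key such as
     * "prometheus.io/scrape" becomes "prometheus.io_scrape". */
    static void gelf_sanitize_key(char *key, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            unsigned char c = (unsigned char) key[i];
            if (!isalnum(c) && c != '_' && c != '.' && c != '-') {
                key[i] = '_';
            }
        }
    }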
