Describe the bug
An error occurs when a message that has gone through the kubernetes filter is encoded to a GELF message by the gelf plugin.
To Reproduce
The created message is:
[159] kubernetes.fluent-bit-1.0.3-lstmf_kube-system_fluent-bit-ed3ada5ce94d2da9e98d762ed1b501b343ae86abbace8ed02734d80b751e6c08.log: [1548903233.004025739, {"log"=>"", "stream"=>"stderr", "time"=>"2019-01-31T02:53:48.869602266Z", "short_message"=>"short_message", "level"=>"5", "kubernetes"=>{"pod_name"=>"weave-net-m8xhq", "namespace_name"=>"kube-system", "pod_id"=>"81c1bd93-e4f9-11e8-81a0-005056151053", "labels"=>{"controller-revision-hash"=>"67692866", "name"=>"weave-net", "pod-template-generation"=>"6"}, "host"=>"worker-2", "container_name"=>"weave-npc", "docker_id"=>"0a96bd14e9f52cf4edd7ed93f3112a299bb3be03f100f1e42a677828fa684c31"}}]
The error message is:
[error] [out_gelf] error encoding to GELF
I have no idea why the error occurred :thinking:
The trace-level log is:
[engine] caught signal (SIGSEGV)
#0 0x7efd4c9ab830 in ???() at ???:0
#1 0x7efd4c972db9 in ???() at ???:0
#2 0x7efd4c99ae88 in ???() at ???:0
#3 0x7efd4e60e643 in flb_log_print() at src/flb_log.c:365
#4 0x7efd4e62c3a0 in flb_msgpack_gelf_key() at src/flb_pack.c:801
#5 0x7efd4e62ce0e in flb_msgpack_gelf_flatten() at src/flb_pack.c:1060
#6 0x7efd4e62cd27 in flb_msgpack_gelf_flatten() at src/flb_pack.c:1041
#7 0x7efd4e62d4ec in flb_msgpack_to_gelf() at src/flb_pack.c:1246
#8 0x7efd4e679711 in cb_gelf_flush() at plugins/out_gelf/gelf.c:307
#9 0x7efd4e619f4a in output_pre_cb_flush() at include/fluent-bit/flb_output.h:314
#10 0x7efd4e928686 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#11 0xffffffffffffffff in ???() at ???:0
Expected behavior
The message is sent in GELF format via UDP.
Your Environment
fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    trace
        Daemon       off
        Buffer_Path  /var/log/fluentd-buffers/kubernetes.system.buffer
        HTTP_Server  off
        Parsers_File parsers.conf
    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-gelf.conf
input-kubernetes.conf: |
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Tag              kubernetes.*
        Parser           docker
        DB               /var/log/flb_kube.db
        Mem_Buf_Limit    5MB
        Skip_Long_Lines  On
        Refresh_Interval 10
filter-kubernetes.conf: |
    [FILTER]
        Name   record_modifier
        Match  *
        Record short_message short_message
    [FILTER]
        Name               kubernetes
        Match              kubernetes.*
        Merge_Log          On
        K8S-Logging.Parser On
output-gelf.conf: |
    [OUTPUT]
        Name                   gelf
        Match                  kubernetes.*
        Host                   graylog.kube-system
        Port                   12201
        Mode                   udp
        Gelf_Short_Message_Key short_message
parsers.conf: |
    [PARSER]
        Name            docker
        Format          json
        Time_Key        time
        Time_Format     %Y-%m-%dT%H:%M:%S.%L
        Time_Keep       On
        Decode_Field_As escaped_utf8 log
Additional context
I have no idea why the error occurred, because there are no logs and no documentation for the gelf output plugin.
The same issue occurs with Fluent Bit v1.0.3.
I think there is a problem with parsing some fields.
When the kubernetes filter is used, fluent-bit adds additional metadata fields.
The output of the stdout plugin shows:
[1] tail.0: [1548923665.487556365, {"log"=>{"name"=>"Test", "age"=>10}, "short_message"=>{"name"=>"Test", "age"=>10}, "stream"=>"stdout", "time"=>"2019-01-31T08:34:25.487556365Z", "kubernetes"=>{"pod_name"=>"nginx-ingress-controller-8bd4d", "namespace_name"=>"ingress-nginx", "pod_id"=>"c3aef641-248e-11e9-9749-005056b45133", "labels"=>{"app"=>"ingress-nginx", "controller-revision-hash"=>"1802405329", "pod-template-generation"=>"5"}, "annotations"=>{"field.cattle.io/ports"=>"[[{\\\"containerPort\\\":80,\\\"dnsName\\\":\\\"nginx-ingress-controller-\\\",\\\"name\\\":\\\"http\\\",\\\"protocol\\\":\\\"TCP\\\",\\\"sourcePort\\\":0},{\\\"containerPort\\\":443,\\\"dnsName\\\":\\\"nginx-ingress-controller-\\\",\\\"name\\\":\\\"https\\\",\\\"protocol\\\":\\\"TCP\\\",\\\"sourcePort\\\":0}]]", "field.cattle.io/publicEndpoints"=>"[{\\\"nodeName\\\":\\\"local:machine-ngz5g\\\",\\\"addresses\\\":[\\\"1.1.1.1\\\"],\\\"port\\\":80,\\\"protocol\\\":\\\"TCP\\\",\\\"podName\\\":\\\"ingress-nginx:nginx-ingress-controller-8bd4d\\\",\\\"allNodes\\\":false},{\\\"nodeName\\\":\\\"local:machine-ngz5g\\\",\\\"addresses\\\":[\\\"1.1.1.1\\\"],\\\"port\\\":443,\\\"protocol\\\":\\\"TCP\\\",\\\"podName\\\":\\\"ingress-nginx:nginx-ingress-controller-8bd4d\\\",\\\"allNodes\\\":false}]", "prometheus.io/port"=>"10254", "prometheus.io/scrape"=>"true"}, "host"=>"ingress02", "container_name"=>"nginx-ingress-controller", "docker_id"=>"62c32ed9a3e7dd51927f13b4bdd421563db2f28865415d6a7204216b618e9d11"}, "host"=>"6edda30181d4"}]
Fields like prometheus.io/port or prometheus.io/scrape cannot be parsed.
I tried to apply this FILTER:
[FILTER]
    Name   modify
    Match  *
    Remove annotations
But it did not work: the record still contained the annotations field (bug or feature?).
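This is presumably because the modify filter's Remove only matches top-level keys, while annotations is nested inside the kubernetes map. A possible workaround (untested, and assuming the nest filter's lift operation with the Nested_under and Add_prefix options) would be to lift the nested keys to the top level first, then remove the unwanted one:
[FILTER]
    Name         nest
    Match        *
    Operation    lift
    Nested_under kubernetes
    Add_prefix   kubernetes_
[FILTER]
    Name   modify
    Match  *
    Remove kubernetes_annotations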
However,
[FILTER]
    Name   modify
    Match  *
    Remove kubernetes
works and removes all of the Kubernetes metadata, so the GELF plugin can encode every field.
P.S. When the debug or trace log level is set, the crash looks like this:
[2019/02/04 12:44:59] [debug] [task] created task=0x7f137323b180 id=0 OK
[engine] caught signal (SIGSEGV)
#0 0x7f137482a830 in ???() at ???:0
#1 0x7f13747f1db9 in ???() at ???:0
#2 0x7f1374819e88 in ???() at ???:0
#3 0x55f4de1ba643 in flb_log_print() at src/flb_log.c:365
#4 0x55f4de1d83a0 in flb_msgpack_gelf_key() at src/flb_pack.c:801
#5 0x55f4de1d8e0e in flb_msgpack_gelf_flatten() at src/flb_pack.c:1060
#6 0x55f4de1d8d27 in flb_msgpack_gelf_flatten() at src/flb_pack.c:1041
#7 0x55f4de1d94ec in flb_msgpack_to_gelf() at src/flb_pack.c:1246
#8 0x55f4de20abb6 in http_gelf() at plugins/out_http/http.c:343
#9 0x55f4de20ad70 in cb_http_flush() at plugins/out_http/http.c:387
#10 0x55f4de1c5f4a in output_pre_cb_flush() at include/fluent-bit/flb_output.h:314
#11 0x55f4de4d4686 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#12 0xffffffffffffffff in ???() at ???:0
The GELF format limits key characters to [A-Za-z0-9_.-], so the "/" in prometheus.io/scrape is not a valid character.
I will check the crash that happens when printing the debug message.
Instead of discarding the message, we can skip the "invalid" keys or rewrite the incorrect characters.
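As a rough sketch of the rewrite approach (a hypothetical helper, not the actual flb_pack.c code), any character outside the GELF key alphabet could be mapped to '_':

#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: rewrite, in place, any character outside the
 * GELF key alphabet [A-Za-z0-9_.-] to '_'. */
static void gelf_sanitize_key(char *key, size_t len)
{
    size_t i;
    unsigned char c;

    for (i = 0; i < len; i++) {
        c = (unsigned char) key[i];
        if (!isalnum(c) && c != '_' && c != '.' && c != '-') {
            /* e.g. "prometheus.io/scrape" becomes "prometheus.io_scrape" */
            key[i] = '_';
        }
    }
}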
The problem with the crash is that in the flb_debug call, prefix_key and prefix_key_len are passed in the wrong order:
flb_debug("[%s] invalid key char '%.*s'", __FUNCTION__, prefix_key, prefix_key_len);
should be:
flb_debug("[%s] invalid key char '%.*s'", __FUNCTION__, prefix_key_len, prefix_key);
I will open a pull request to fix this.
Fixed by #1132
While the underlying debug logging bug has been fixed, I don't think that the initial issue has been properly addressed. I also got bitten by this, and for a new user it's not immediately apparent that GELF doesn't support kubernetes annotations and labels — especially considering that in a default kubernetes install you already have lots of pods with annotations and labels containing /. In other words, I consider it not being good UX when you run into issues with a default kubernetes & fluent-bit setup.
My suggestion would be to replace, in flb_msgpack_to_gelf(), all non-valid GELF key characters with _ — or at least /, which is the single cause of this issue. If you think that's a reasonable approach, I could open a PR.