Describe the bug
Fluent-bit randomly crashes with SIGSEGV, and from the stack trace it looks like polling the Prometheus metrics endpoint is what triggers it:
[engine] caught signal (SIGSEGV)
#0 0x7f719373053e in ???() at ???:0
#1 0x55a0f2c6eb29 in string_cmp() at src/http_server/api/v1/metrics.c:142
#2 0x7f71936d5d20 in ???() at ???:0
#3 0x7f71936d5a97 in ???() at ???:0
#4 0x7f71936d5a81 in ???() at ???:0
#5 0x7f71936d5a97 in ???() at ???:0
#6 0x7f71936d5a97 in ???() at ???:0
#7 0x7f71936d5fde in ???() at ???:0
#8 0x55a0f2c6f248 in cb_metrics_prometheus() at src/http_server/api/v1/metrics.c:287
#9 0x55a0f2f50da5 in thread_cb_init_vars() at lib/monkey/mk_server/mk_http_thread.c:68
#10 0x55a0f2f5bd86 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#11 0xffffffffffffffff in ???() at ???:0
2019-11-19 09:45:47,071 INFO exited: fluentbit (terminated by SIGABRT (core dumped); not expected)
Additional context
Fluent-bit version: 1.3.2
Running in Kubernetes
Added valgrind to the fluent-bit container and found this output related to cb_metrics_prometheus:
==15== Conditional jump or move depends on uninitialised value(s)
==15== at 0x27785D: metrics_get_latest (metrics.c:46)
==15== by 0x277C51: cb_metrics_prometheus (metrics.c:196)
==15== by 0x559DA5: thread_cb_init_vars (mk_http_thread.c:68)
==15== by 0x564D86: co_init (amd64.c:117)
==15== Uninitialised value was created by a stack allocation
==15== at 0x563D83: co_swap_function (in /opt/td-agent-bit/bin/td-agent-bit)
==15==
==15== Use of uninitialised value of size 8
==15== at 0x277881: metrics_get_latest (metrics.c:54)
==15== by 0x277C51: cb_metrics_prometheus (metrics.c:196)
==15== by 0x559DA5: thread_cb_init_vars (mk_http_thread.c:68)
==15== by 0x564D86: co_init (amd64.c:117)
==15== Uninitialised value was created by a stack allocation
==15== at 0x563D83: co_swap_function (in /opt/td-agent-bit/bin/td-agent-bit)
@maglun2 I experimented with /api/v1/metrics/prometheus today and could
not reproduce your issue, nor could I find the root cause by reading the related
part of the code.
@maglun2 I also took a deep dive into the metrics module. I found a potential memory leak by inspection, but no obvious race conditions. It seems to me that metrics uses both TLS (thread-local storage), as in metrics_list = pthread_getspecific(hs_metrics_key), and reference counting, as in buf->users++, which seemed curious on the face of it. My impression is that batches of metrics arrive from a queue, and the queue is flushed taking care to check that the user count is zero. But it's not clear to me how multiple consumers all get copies of the data unless there is pub/sub going on on the queuing side. And if so, why also reference counting? Roughly, the shape I'm seeing is sketched below.
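A minimal, single-threaded sketch of that shape as I read it (hypothetical, not fluent-bit's actual code; only hs_metrics_key, metrics_get_latest, and buf->users are names taken from the source):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct metrics_buf {
    char *data;   /* serialized metrics payload */
    int   users;  /* reference count: consumers currently holding this buffer */
};

static pthread_key_t hs_metrics_key;  /* TLS slot, name from the source */

/* Consumer: look up this thread's latest buffer and take a reference. */
static struct metrics_buf *metrics_get_latest(void)
{
    struct metrics_buf *buf = pthread_getspecific(hs_metrics_key);
    if (buf) {
        buf->users++;
    }
    return buf;
}

/* Consumer: drop the reference once the HTTP response is written. */
static void metrics_release(struct metrics_buf *buf)
{
    if (buf) {
        buf->users--;
    }
}

/* Flush side: a buffer may only be reclaimed once users == 0. The part
 * I can't see is how a producer gets fresh data into every consumer
 * thread's TLS slot in the first place, hence the pub/sub question. */
static void metrics_try_flush(struct metrics_buf *buf)
{
    if (buf && buf->users == 0) {
        free(buf->data);
        free(buf);
    }
}

int main(void)
{
    pthread_key_create(&hs_metrics_key, NULL);

    struct metrics_buf *buf = calloc(1, sizeof(*buf));
    buf->data = strdup("fluentbit_input_records_total 42");
    pthread_setspecific(hs_metrics_key, buf);

    struct metrics_buf *latest = metrics_get_latest();
    printf("%s (users=%d)\n", latest->data, latest->users);
    metrics_release(latest);
    metrics_try_flush(latest);
    return 0;
}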
@fujimotos yes, it looks like it's always the same stack trace. Here is our config:
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Merge_JSON_Log Off
@INCLUDE input-kubernetes.conf
@INCLUDE input-tcp.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-forward.conf
filter-kubernetes.conf: |
# Merge_Log must be off, otherwise messages logged as JSON will bloat Elasticsearch with lots of fields
[FILTER]
Name kubernetes
Match kube_file.*
Kube_Tag_Prefix kube_file.var.log.containers.
Kube_URL https://kubernetes.default.svc.cluster.local:443
Merge_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
Labels On
Annotations Off
# we need merge log on for nginx ingress controller access logs to work
[FILTER]
Name kubernetes
Match kube_file.*nginx-ingress-controller*
Kube_Tag_Prefix kube_file.var.log.containers.
Kube_URL https://kubernetes.default.svc.cluster.local:443
Merge_Log On
Merge_Log_Trim On
K8S-Logging.Parser On
K8S-Logging.Exclude On
Labels On
Annotations Off
# add cluster name
[FILTER]
Name record_modifier
Match *
Record cluster ${K8S_CLUSTER_NAME}
# we don't know what records from the tcp forward source will be tagged with, since the tag is set by the sender
# assume all records come from tcp forward source
[FILTER]
Name record_modifier
Match *
Record log_source fluentbit_tcp
# and then fix those that we know come from other source(s)
[FILTER]
Name record_modifier
Match kube_file.*
Record log_source fluentbit_tail
input-kubernetes.conf: |
[INPUT]
Name tail
Flush 2
Path /var/log/containers/*.log
#Exclude_Path /var/log/containers/fluent*.log
Tag kube_file.*
Parser docker
DB /var/log/flb_kube.db
Buffer_Chunk_Size 64k
Buffer_Max_Size 256k
Mem_Buf_Limit 100m
Skip_Long_Lines Off
Refresh_Interval 10
Merge_JSON_Log Off
Multiline Off
input-tcp.conf: |
[INPUT]
Name forward
Flush 2
# pretty sure tag isn't supported for forward input
Tag kube_tcp.*
Listen 0.0.0.0
Log_Level info
Port 5500
Buffer_Chunk_Size 64k
Buffer_Max_Size 256k
Mem_Buf_Limit 100m
output-forward.conf: |
[OUTPUT]
Name forward
Match *
Host ${FLUENT_HOST}
Port ${FLUENT_PORT}
Retry_Limit False
parsers.conf: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
# Command | Decoder | Field | Optional Action
# =============|==================|=================
Decode_Field_As escaped log
@nigels-com I appreciate you digging into the problem, though I hope the question isn't directed at me 😄
@maglun2 Sometimes the process of putting things into words can help narrow down or pinpoint a problem. But not this time.
Hm. I'm not entirely sure why this is happening.
#1 0x55a0f2c6eb29 in string_cmp() at src/http_server/api/v1/metrics.c:142
...
#8 0x55a0f2c6f248 in cb_metrics_prometheus() at src/http_server/api/v1/metrics.c:287
The stack trace suggests that metrics_list somehow contains an invalid
string which makes strcmp() run past the end of the buffer ... but that's
unlikely, since we use flb_sds_cat(), which guarantees null termination.
The only possible route I can think of is flb_sds_cat() returning NULL
at metrics.c#L279, and that causing the SIGSEGV in qsort().
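To make that failure mode concrete, here is a hypothetical, self-contained illustration (not fluent-bit's actual code): if one element of the array handed to qsort() ends up NULL, the comparator dereferences it and the crash surfaces inside libc's strcmp(), one frame above string_cmp(), which would match frames #0 and #1 in the backtrace.

#include <stdlib.h>
#include <string.h>

/* A comparator of the same shape as the one in metrics.c. */
static int string_cmp(const void *a, const void *b)
{
    return strcmp(*(const char **) a, *(const char **) b);
}

int main(void)
{
    /* Pretend the middle concatenation failed: flb_sds_cat() returns
     * NULL when its internal allocation fails. */
    const char *names[] = { "input_records", NULL, "output_errors" };

    /* strcmp() dereferences the NULL pointer somewhere inside libc's
     * qsort() machinery: SIGSEGV, with string_cmp() on the stack just
     * like the backtrace above. */
    qsort(names, 3, sizeof(char *), string_cmp);
    return 0;
}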
I posted #1762 that attempts to fix the issue based on this hypothesis.
@maglun2 Can you check out that version and see if it solves your problem?
Thanks @fujimotos, will try it now.
@fujimotos initial tests look very promising. No SIGSEGV since I deployed your version. Will deploy to more test clusters and let it run during the day.
I realize now that your branch is based off master so there might be other changes affecting the behaviour I guess?
> I realize now that your branch is based off master so there might be other changes affecting the behaviour I guess?
For metrics.c, we have no particular diff between v1.3.2 and master.
You can confirm it by running
$ git log v1.3.2..master -- src/http_server/api/v1/metrics.c
So if the patch fixes the issue, that's it.
Looks really good, no SIGSEGV at all since I deployed your patched version! The previous version got multiple SIGSEGVs a day. I would say your patch fixed the problem. Thank you so much for looking into and fixing the problem @fujimotos! Should I make a comment in the PR?
> Looks really good, no SIGSEGV at all since I deployed your patched version! The previous version got multiple SIGSEGVs a day. I would say your patch fixed the problem.
Nice! I'd appreciate it if you made a comment in the PR for reference.
Now we can just wait for @edsiper to merge that patch into the main line.
Fix merged into the v1.3 branch as well, for the next v1.3.4 release (this week).