Fluent-bit: Kubernetes filter can't find pods anymore after upgrading from 0.14.9 -> 1.0.2

Created on 18 Jan 2019 · 20 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
The configuration works with 0.14.9, but after upgrading to 1.0.2 the kubernetes filter fails with the following error:

```
[2019/01/21 12:32:37] [ info] [filter_kube] https=1 host=10.233.0.1 port=443
[2019/01/21 12:32:37] [ info] [filter_kube] local POD info OK
[2019/01/21 12:32:37] [ info] [filter_kube] testing connectivity with API server...
[2019/01/21 12:32:37] [debug] [filter_kube] API Server (ns=kube-system, pod=fluent-bit-2bsvt) http_do=0, HTTP Status: 200
[2019/01/21 12:32:37] [ info] [filter_kube] API server connectivity OK
[2019/01/21 12:32:38] [debug] [filter_kube] API Server (ns=online-tst, pod=services.xx-prerender-cache-redis-ha-server-2) http_do=0, HTTP Status: 404
[2019/01/21 12:32:38] [debug] [filter_kube] API Server response {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"services.xx-prerender-cache-redis-ha-server-2\" not found","reason":"NotFound","details":{"name":"services.xx-prerender-cache-redis-ha-server-2","kind":"pods"},"code":404}
```

Suddenly it cannot find the pods anymore.

Configuration:
Kubernetes 1.8.4
Docker version 17.05.0-ce, build 89658be
OS: Redhat Linux 4.17.11 x86_64 x86_64 x86_64 GNU/Linux

```
[PARSER]
    Decode_Field_As escaped log
    Format json
    Name json
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Key time

[PARSER]
    Decode_Field_As escaped log
    Format json
    Name docker
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep true
    Time_Key time

[PARSER]
    Format regex
    Name mongo
    Regex ^(?<time>[^ ]*)\s+(?<severity>\w)\s+(?<context>[^ ]+)\s+\[(?<connection>[^\]]+)]\s+(?<message>.*)$
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep true
    Time_Key time

[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    debug
    HTTP_Server    On
    HTTP_Listen    0.0.0.0
    HTTP_PORT    2020
    Parsers_File parsers_custom.conf

[INPUT]
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size 25MB
    DB /var/log/containers/fluent-bit.db
    Exclude_Path *kube-system*.log
    Mem_Buf_Limit 25MB
    Name tail
    Parser docker
    Path /var/log/containers/*.log
    Refresh_Interval 5
    Skip_Long_Lines On
    Tag kube.services.*

[INPUT]
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size 25MB
    DB /var/log/containers/fluent-bit-nginx.db
    Mem_Buf_Limit 25MB
    Name tail
    Parser docker
    Path /var/log/containers/*nginx*.log
    Refresh_Interval 5
    Skip_Long_Lines On
    Tag kube.ingress.*

[FILTER]
    K8S-Logging.Parser On
    K8S-Logging.exclude True
    Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_URL https://${KUBERNETES_SERVICE_HOST}:443
    Match kube.*
    Merge_Log On
    Merge_Log_Key k8s
    Name kubernetes
    tls.verify On

[OUTPUT]
    Host xxx.xxx.xxx.xxx
    Include_Tag_Key true
    Logstash_DateFormat %G.%V
    Logstash_Format On
    Logstash_Prefix k8s-services-tst
    Match kube.services.*
    Name es
    Port 9200
    Retry_Limit False

[OUTPUT]
    Host xxx.xxx.xxx.xxx
    Include_Tag_Key true
    Logstash_DateFormat %G.%V
    Logstash_Format On
    Logstash_Prefix k8s-ingress-tst
    Match kube.ingress.*
    Name es
    Port 9200
    Retry_Limit False

```

Am I missing some changes in the kubernetes filter? I can't find anything. This has already cost me a day of work.

Already tried:
- Changing Kube_URL to the IP address
- Changing Kube_URL to the DNS name
- Removing the port
- Changing the port to 6443

Changing back to 0.14.9 makes it all work again... :-S

Labels: enhancement, fixed

Most helpful comment

FYI:

I've updated our dev docs for 1.1 (not published yet) but you can see the new explanation of the Workflow here:

https://github.com/fluent/fluent-bit-docs/blob/master/filter/kubernetes.md#workflow-of-tail--kubernetes-filter

All 20 comments

Are the namespace and pod in your error log correct?
I had the same issue but found out that it happens because of #814.
The tag is where the plugin gets the namespace and pod info from, and the change in the regex is the cause of this:
https://github.com/fluent/fluent-bit/blob/v1.0.2/plugins/filter_kubernetes/kube_regex.h#L25

I haven't dug deep enough to see whether this is intentional or a bug, but you can work around it by adding a custom 'Regex_Parser' that parses your tag correctly.
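For illustration, a minimal sketch of that workaround against the kube.services.* tagging from the original config. The parser name is hypothetical, and it assumes the 1.0.x tail input produces tags of the form kube.services.<file name>, ending in .log, which Regex_Parser then matches in full:

```
[PARSER]
    Name    kube-services-tag
    Format  regex
    # Named captures the kubernetes filter looks for: pod_name,
    # namespace_name, container_name and docker_id. The literal
    # kube\.services\. consumes the custom tag prefix so that the
    # pod name is extracted cleanly.
    Regex   ^kube\.services\.(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$

[FILTER]
    Name         kubernetes
    Match        kube.services.*
    Regex_Parser kube-services-tag
```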

I am also facing the same issue.
The debug logs show that the namespace is getting prepended with a `.` instead of a `/`:

```
[2019/01/25 06:47:40] [debug] [in_tail] file=/var/log/containers/prod-interactions-consumer-6bf8f96c84-nr789_production_fluent-bit-f28d0640ffa2945125b3d8d6ed7c8b1fe59950f462b3f4781a2079eeb1e35a2c.log event
[2019/01/25 06:47:40] [debug] [filter_kube] API Server (ns=production, pod=production.prod-interactions-consumer-6bf8f96c84-nr789) http_do=0, HTTP Status: 404
[2019/01/25 06:47:40] [debug] [filter_kube] API Server response
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"production.prod-interactions-consumer-6bf8f96c84-nr789\" not found","reason":"NotFound","details":{"name":"production.prod-interactions-consumer-6bf8f96c84-nr789","kind":"pods"},"code":404}
```

Fluent Bit version: 1.0.2

@puremad Well... actually... no, I did not even notice. Thank you, that must be the problem. It explains why it can't find anything, because there is no "services" namespace.

@edsiper Is this a bug or a new feature? If it's a new feature I'll have to create my own regex, but that takes some time to test, etc.

I tried to investigate the new regex but failed to find a good regex tester. Regex101.com is probably not the right one.

Old regex:

```
var\\.log\\.containers\\.(?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\\.log$
```

New regex:

```
(?<tag>[^.]+)?\\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\\.log$
```

(I removed the escaping backslashes while testing.)

Test string:

```
/var/log/containers/idp-54f445cbfd-9zgsf_online-tst_idp-73dbd7f11c1f0b367df1fd179963ed45c8b6e773a485f35e9fe5f409138ad317.log
```

Fluent Bit logs:

```
[in_tail] file=/var/log/containers/idp-54f445cbfd-9zgsf_online-tst_idp-73dbd7f11c1f0b367df1fd179963ed45c8b6e773a485f35e9fe5f409138ad317.log read=747 lines=1
[filter_kube] API Server (ns=online-tst, pod=services.idp-54f445cbfd-9zgsf) http_do=0, HTTP Status: 404
[filter_kube] API Server response\n","stream":"stderr","time":"2019-01-29T09:22:23.022331859Z"}
{"log":"{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"pods \\\"services.idp-54f445cbfd-9zgsf\\\" not found\",\"reason\":\"NotFound\",\"details\":{\"name\":\"services.idp-54f445cbfd-9zgsf\",\"kind\":\"pods\"},\"code\":404}
```

Sure, I can replace the new regex with the old one and build a new version, but that takes a lot of time and I would need to roll it out on many clusters.

@edsiper Can you please shine a light on this?

Also to note: if you use Tag_Regex in your INPUT and then use kube.<namespace>.<pod> or similar in the Match of your kubernetes FILTER, it won't work; it skips over the logs. If you switch the INPUT back to remove the Tag_Regex and simply use Tag kube.*, the kubernetes filter works.

I think the kubernetes filter expects the log tag to match that regex and won't match anything else, even when specified with <pattern>.

@chiefy Sorry for responding so late. I'm not sure what you meant. I think my configuration looks fine, but you said something about Tag_Regex in my INPUT. My kubernetes filter is not skipping the records; it's just not using the right regex.

Probably I'm one of the few having this issue.

> Also to note: if you use Tag_Regex in your INPUT and then use kube.<namespace>.<pod> or similar in the Match of your kubernetes FILTER, it won't work; it skips over the logs. If you switch the INPUT back to remove the Tag_Regex and simply use Tag kube.*, the kubernetes filter works.
>
> I think the kubernetes filter expects the log tag to match that regex and won't match anything else, even when specified with <pattern>.

Yes, I think I hit this issue. How should we format the Match pattern (on the FILTER) for this to work? With anything other than kube.*, it looks like the filter plugin skips the records.

@mailtoraja18 you're correct; there seems to be a bug with the K8s filter where it only accepts a certain kube.* pattern.

I also have a problem with this:

```
[2019/03/01 10:56:33] [debug] [in_tail] file=/var/log/containers/coredns-7966c859dd-t4nm8_kube-system_coredns-cd673551bf0e11ccf0d184142c4aa9544f0d5f28b78c4889bed299baabb69be0.log read=2398 lines=10
[2019/03/01 10:56:33] [debug] [in_tail] file=/var/log/containers/coredns-7966c859dd-77dss_kube-system_coredns-7ffe22f9bb5558856a878c5bee658b07811d8813a972927f23e7a752b5489059.log event
[2019/03/01 10:56:33] [debug] [filter_kube] API Server (ns=kube-system, pod=kubernetes.coredns-7966c859dd-77dss) http_do=0, HTTP Status: 404
[2019/03/01 10:56:33] [debug] [filter_kube] API Server response
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"kubernetes.coredns-7966c859dd-77dss\" not found","reason":"NotFound","details":{"name":"kubernetes.coredns-7966c859dd-77dss","kind":"pods"},"code":404}
```

In my case an extra 'kubernetes.' is prepended to the pod name, I think, for some reason. But it should not be there:

(Note: log updated. Initially I accidentally took the logs from a different system, so the pod names did not match the ones above. Now coredns-7966c859dd-77dss appears in both logs.)

```
$ kubectl get all --all-namespaces | grep coredns
kube-system   deploy/coredns                2           2         2           2                2d
kube-system   rs/coredns-7966c859dd         2           2         2           2d
kube-system   po/coredns-7966c859dd-77dss   1/1         Running   0           2d
kube-system   po/coredns-7966c859dd-t4nm8   1/1         Running   0           2d
kube-system   svc/coredns                   ClusterIP   10.3.0.10   <none>    53/UDP,53/TCP    2d
```

@donbowman @edsiper you touched the line mentioned above (https://github.com/fluent/fluent-bit/blob/v1.0.2/plugins/filter_kubernetes/kube_regex.h#L25). Could you please kindly comment on this issue?

As I read at https://github.com/fluent/fluent-bit-docs/blob/master/installation/kubernetes.md : "... a built-in filter plugin called kubernetes talks to the Kubernetes API Server to retrieve relevant information such as the pod_id, labels and annotations, other fields such as pod_name, container_id and container_name are retrieved locally from the log file names. All of this is handled automatically, no intervention is required from a configuration aspect."

So the pod name is supposed to come from the log file name? Hmm, but then how does this 'kubernetes.' get prepended to the pod name?

The relevant parts of our configmap (written by a colleague):

```
[INPUT]
    Name              tail
    Tag               edge.kubernetes.*
    Exclude_Path      /var/log/containers/fluent*.log
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   1MB

[FILTER]
    Name           kubernetes
    Match          edge.kubernetes.*
    Kube_URL       https://kubernetes.default.svc.cluster.local:443
    Merge_JSON_Log On

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On
    Decode_Field_As   escaped    log
```
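A hedged reading of what happens with that config (my own walkthrough, not from the thread): with Tag edge.kubernetes.*, the record tag starts with edge.kubernetes., and the 1.0.x default tag regex takes everything up to the first dot as the tag while allowing dots inside pod_name:

```
# Tag: edge.kubernetes.coredns-7966c859dd-77dss_kube-system_coredns-<64-hex-docker-id>.log
#
#   (?<tag>[^.]+)    -> "edge"
#   (?<pod_name>...) -> "kubernetes.coredns-7966c859dd-77dss"
#
# So the "kubernetes." seen in the 404s above is the second tag component
# leaking into the pod name, not anything coming from the file name itself.
```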

Due to a lack of resources I'm not able to spend time on fixing this, so for now I'll stick with 0.14.9.

Update: my problem was that a colleague had earlier changed the kube.* in the input plugin's Tag (and in the filter's Match) to edge.kubernetes.* for some reason (maybe he just wanted a descriptive/custom name). This worked prior to 1.0.0, but not with 1.0.x. I debugged it by temporarily adding some extra debug logging to flb_regex_do() (in src/flb_regex.c). After changing this setting back to kube.*, it seems to work fine with fluent-bit 1.0.4.

The main problem I see with this is that the regex (based on the discussion above, I guess the problem is something like that) does not seem to be configurable (per https://github.com/fluent/fluent-bit-docs/blob/master/installation/kubernetes.md, which tells users to use the following configmap: https://raw.githubusercontent.com/fluent/fluent-bit-kubernetes-logging/master/output/elasticsearch/fluent-bit-configmap.yaml), yet users may change this tag.
This gives users a false sense that the Tag is configurable.
@edsiper you seem to have edited this configmap; what do you think?

@edsiper Could you please give some input on this? I can confirm what @attila123 mentioned above: changing the tag on the input plugin from kube.services* to kube.* makes it work again. But this has consequences: I have multiple input plugins and multiple output plugins that are controlled by different tags.

FYI: I will take a look today.

Thanks everyone for your feedback on this issue.

I've pushed some improvements to Git master that will be reflected in the 1.1 release and aim to address the main issues. Please refer to the following relevant commits and their explanations:

  • 33b189a: in_tail: restore the support of absolute path in expanded Tags - Eduardo Silva
  • 6a9c8e0: filter_kubernetes: new configuration property 'kube_tag_prefix' - Eduardo

Note that from 1.1 onward, having different tail sections with expanded tag prefixes that need the kubernetes filter will require separate kubernetes filter sections with defined prefixes (to be documented shortly).
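For reference, a sketch of what that looks like for the two inputs in the original config, assuming the 1.1 behavior where tail expands the wildcard to the full dotted file path and Kube_Tag_Prefix (the property from commit 6a9c8e0) tells the filter how much of the tag to strip; the exact values are untested:

```
[FILTER]
    Name            kubernetes
    Match           kube.services.*
    # Strip the custom prefix plus the expanded path before extracting
    # <pod>_<namespace>_<container>-<docker_id> from the remainder.
    Kube_Tag_Prefix kube.services.var.log.containers.
    Merge_Log       On

[FILTER]
    Name            kubernetes
    Match           kube.ingress.*
    Kube_Tag_Prefix kube.ingress.var.log.containers.
    Merge_Log       On
```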

FYI:

I've updated our dev docs for 1.1 (not published yet) but you can see the new explanation of the Workflow here:

https://github.com/fluent/fluent-bit-docs/blob/master/filter/kubernetes.md#workflow-of-tail--kubernetes-filter

@edsiper Thank you very much! That explains a lot. The changes you made are great and give us back the flexibility. Any idea when version 1.1 will be released?

Just FYI, I can't confirm this, but I believe we are experiencing similar issues after going from 1.0.4 to 1.1.2. Was this a breaking change, @edsiper?

Also, this is weird, but the pod name is getting truncated in the logs (this happens for other pods as well):

```
[2019/06/04 12:00:14] [debug] [in_tail] file=/var/log/containers/fluent-bit-6mh28_halo-system_fluent-bit-d6fe25e85927ddd3365ba71eea5c16bcbfd885d48a00fb1dfe993622bae80706.log read=1214 lines=6
[2019/06/04 12:00:14] [debug] [in_tail] file=/var/log/containers/fluent-bit-6mh28_halo-system_fluent-bit-d6fe25e85927ddd3365ba71eea5c16bcbfd885d48a00fb1dfe993622bae80706.log event
[2019/06/04 12:00:14] [debug] [filter_kube] API Server (ns=halo-system, pod=luent-bit-6mh28) http_do=0, HTTP Status: 404
[2019/06/04 12:00:14] [debug] [filter_kube] API Server response
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"luent-bit-6mh28\" not found","reason":"NotFound","details":{"name":"luent-bit-6mh28","kind":"pods"},"code":404}
```

Upgrade notes:

https://docs.fluentbit.io/manual/installation/upgrade_notes

If that doesn't cover the issue you are facing, let me know.

@edsiper I figured it out: we were using k8s.* as a tag, so the new default didn't work.
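That would also explain the truncated pod name above: the filter strips a fixed default prefix of kube.var.log.containers. (24 characters), while k8s.var.log.containers. is only 23, so one extra leading character of the file name is eaten and fluent-bit becomes luent-bit. A sketch of the likely fix while keeping the k8s.* tag (assuming the Kube_Tag_Prefix property described in the upgrade notes):

```
[FILTER]
    Name            kubernetes
    Match           k8s.*
    # Match the tag prefix actually in use so no characters of the
    # pod's file name are stripped along with it.
    Kube_Tag_Prefix k8s.var.log.containers.
```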

Closing as fixed.
