Fluent-bit: Excessive memory usage on kubernetes

Created on 8 Nov 2018 · 9 comments · Source: fluent/fluent-bit

Running on Azure Kubernetes Service (Kubernetes v1.11.3) as a DaemonSet using the fluent/fluent-bit:0.14.6 image. The nodes are quite small, each running roughly 15 containers that send JSON logs over TCP. The pod memory limit is currently set to 200Mi, and fluent-bit keeps hitting it and restarting. Any suggestions? Here is the config:

[SERVICE]
    Flush         5
    Log_Level     info
    Daemon        off
    Parsers_File  parsers.conf
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020

[INPUT]
    Name      tcp
    Listen    0.0.0.0
    Port      5170

[OUTPUT]
    Name      null
    Match     *

parsers.conf:

[PARSER]
    Name   json
    Format json
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z
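For reference, the Time_Format above uses strptime-style directives, so a matching record would carry a time field like 08/Nov/2018:12:00:00 +0000 (a made-up sample timestamp). The format string can be sanity-checked with Python, which uses the same directives:

```python
from datetime import datetime

# Same directives as the parser's Time_Format: %d/%b/%Y:%H:%M:%S %z
ts = datetime.strptime("08/Nov/2018:12:00:00 +0000", "%d/%b/%Y:%H:%M:%S %z")
print(ts.isoformat())  # 2018-11-08T12:00:00+00:00
```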
Labels: fixed, question

Most helpful comment

Similar problem: we are observing very high memory usage on the fluent-bit pod, around 10GB. We have not set a resource limit on the pod for this test.

kubectl top po -n logging fluent-bit-gmnrt 
NAME               CPU(cores)   MEMORY(bytes) 
fluent-bit-gmnrt    46m             9861Mi 

When Elasticsearch is heavily loaded it returns HTTP 429 errors to Fluent Bit, and Fluent Bit keeps the unsent logs in memory for retry. It retries up to the number of times configured in the output plugin's Retry_Limit setting; after that many retries it should discard the message, but I am not sure whether it actually discards it or keeps it in memory.
Mem_Buf_Limit is also set to 5MB, but Fluent Bit is still using 10GB.

To Reproduce
Start the application and fluent-bit while Elasticsearch is heavily loaded.

Expected behavior
Once the retry limit is reached, fluent-bit should not keep the record in its memory.

Your Environment
Kubernetes version is v1.12.2
Fluent bit version 0.14.7
Snippet of fluentbit Configuration

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        Ignore_Older      1d

    [OUTPUT]
        Name            es
        Match           *
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        Logstash_Format On
        Retry_Limit     2
        Buffer_Size     False

    [FILTER]
        Name            record_modifier
        Match           *
        Remove_key      time

    [FILTER]
        Name            grep
        Match           *
        Regex           log [a-zA-Z1-9]*SOME_STRING[a-zA-Z1-9]*

All 9 comments

Is this even excessive memory usage? Are there any recommendations for what the resource limits/requests should be?

If all Pods send around 200MB of data within 5 seconds, yeah, it will be killed.

While Fluent Bit receives data, it will not deliver the logs until the Flush time expires. My suggestion is to set Flush to 1 (one second) and add a Mem_Buf_Limit option to the TCP input plugin for protection. You can read more about memory handling here:

https://docs.fluentbit.io/manual/configuration/backpressure
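Applied to the original config, those two suggestions would look roughly like this (the 5MB limit is just an illustrative value, not a recommendation):

```
[SERVICE]
    Flush         1
    Log_Level     info
    Daemon        off
    Parsers_File  parsers.conf
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020

[INPUT]
    Name           tcp
    Listen         0.0.0.0
    Port           5170
    Mem_Buf_Limit  5MB
```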

@edsiper It takes about 10-15 minutes for the fluent-bit pod to be killed - it just has a straight-line memory graph that looks like it never frees any memory:
[image: memory usage graph climbing steadily]
The drop in memory usage is when the pod gets killed:
[image: memory usage graph dropping at pod restart]

I have tried various settings for the Mem_Buf_Limit but none of them make any difference.

Did you try Flush 1?

Yeah - those graphs are with Flush set to 1.

Looks like the issue is with our app - it never closed the TCP connection to fluent-bit and instead reused it for each batch of logs. Now we close the connection after each batch, and that has fixed the issue.

@tomstreet I am curious to learn more about the issue. My expectation is that Fluent Bit will protect itself from that scenario. Would you please share some steps to reproduce the problem?

Sure - the config is above; here is the DaemonSet yaml:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit-logging
    kubernetes.io/cluster-service: "true"
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: fluent-bit-logging
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.7
        imagePullPolicy: Always
        ports:
          - containerPort: 2020
          - containerPort: 5170
            hostPort: 5170
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        resources:
          limits:
            cpu: 2
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule

and our app is written in c# and here is a simplified version of the log emitter:

public class Emitter 
{
    private TcpClient _client;
    private FluentBitSettings _settings;

    private async Task Connect()
    {
        if(_client != null)
        {
            if (_client.Connected)
            {
                return;
            }

            _client.Dispose();
            _client = null;
        }

        _client = new TcpClient();

        await _client.ConnectAsync(_settings.Host, _settings.Port);
    }

    private void Disconnect()
    {
        _client?.Dispose();
        _client = null;
    }

    public async Task Emit(byte[] logsBatch) 
    {
        try
        {
            await Connect();

            var tcpStream = _client.GetStream();

            await tcpStream.WriteAsync(logsBatch);
            await tcpStream.FlushAsync();
        }
        finally
        {
            Disconnect();
        }
    }
}

If we remove the Disconnect() call in the finally block of Emit, the TCP connection is reused without ever being closed between calls - this is what causes the memory issue in Fluent Bit. Including it not only stopped the issue in Fluent Bit but also reduced the memory usage of our own service.
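The connect-send-close pattern from the Emitter above can be sketched in a few lines of Python as well (host and port are placeholders); this is just an illustration of the workaround, not Fluent Bit's required protocol:

```python
import socket


def emit_batch(host, port, logs_batch):
    """Send one batch of newline-delimited JSON logs, then close the
    connection, mirroring the Connect/Emit/Disconnect pattern above."""
    # Opening and closing a connection per batch lets the TCP input
    # release its per-connection state after every batch, which is
    # what resolved the memory growth in the report above.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(logs_batch)
```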

