Fluent-bit: [v1.6] storage.total_limit_size is throwing error

Created on 14 Oct 2020 · 43 Comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
The error no enough space in filesystem to buffer chunk 1-1602687091.955423455.flb in plugin es.0 appears when I enable storage.total_limit_size 5M on the es output.

To Reproduce

  • Example log message if applicable:
[2020/10/14 15:18:04] [debug] [input:tail:tail.0] inode=1291902 events: IN_MODIFY
[2020/10/14 15:18:04] [debug] [storage] tail.0:1-1602688684.470023866.flb mapped OK
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.470023866.flb required 840 bytes and 8206935 bytes left in plugin es.0
[2020/10/14 15:18:04] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1602688684.470023866.flb in plugin es.0
[2020/10/14 15:18:04] [debug] [storage] [cio file] synced at: tail.0/1-1602688684.470023866.flb
[2020/10/14 15:18:04] [error] [input chunk] no available chunk
[2020/10/14 15:18:04] [debug] [storage] tail.0:1-1602688684.470514408.flb mapped OK
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.470514408.flb required 963 bytes and 8206935 bytes left in plugin es.0
[2020/10/14 15:18:04] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1602688684.470514408.flb in plugin es.0
[2020/10/14 15:18:04] [debug] [storage] [cio file] synced at: tail.0/1-1602688684.470514408.flb
[2020/10/14 15:18:04] [error] [input chunk] no available chunk
[2020/10/14 15:18:04] [debug] [input:tail:tail.0] inode=1291902 events: IN_MODIFY
[2020/10/14 15:18:04] [debug] [storage] tail.0:1-1602688684.471037398.flb mapped OK
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.471037398.flb required 943 bytes and 8206935 bytes left in plugin es.0
[2020/10/14 15:18:04] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1602688684.471037398.flb in plugin es.0
[2020/10/14 15:18:04] [debug] [storage] [cio file] synced at: tail.0/1-1602688684.471037398.flb
  • Steps to reproduce the problem:

Expected behavior

Storage LRU should kick in once the limit is exceeded.

Your Environment

  • Environment name and version (e.g. Kubernetes? What version?):
Client Version: v1.19.2
Server Version: v1.18.6+k3s1

Additional context
My disk got completely utilized by fluentbit when my es was down. So figured out I've to upgrade to 1.6 to have LRU like fluentd.

bug fixed

All 43 comments

@JeffLuoo FYI

Thank you, I will look into it.

Could you please provide the config file that you are using if possible? Thank you so much!

The config file and full manifest are available at https://github.com/fluent/fluent-bit/files/5379167/fluentbit.txt

Thanks!

Hi @JeffLuoo please let me know if you need someone to test or help.
Thanks

One thing I noticed is that you set the limit to 5M; however, the log message

โ”‚ [2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.470023866.flb required 840 bytes and 8206935 bytes left in plugin es.0                                                           

shows that there are 8206935 bytes (~8M) left. Is this the beginning of the log, or just a random excerpt?

@JeffLuoo This is a random part. If I enable the storage limit, I get this error all the time. I tried this on multiple k8s clusters.
If you want, I can run the same setup and give you a complete output.

@rjshrjndrn A complete output would be perfect. That will help me a lot! Thank you!

@JeffLuoo
Here is the complete log with trace input.
Steps

  1. Created a fresh cluster
  2. Installed bitnami/es in es namespace helm install es bitnami/es -n es
  3. Applied above manifest
  4. Got this log file, log level is error. fluentbit.log.0.txt
  5. Uninstalled the manifest kubectl delete -f fluent.txt
  6. Installed using trace logging: log file fluentbit.log

Please let me know if you need anything else.
Sorry for the delayed reply.

@rjshrjndrn Got it! Thank you so much.

I noticed that you set the file system limit to 5M

storage.total_limit_size  5M

However from the log file fluentbit.log I noticed

fluent-bit-xmjw9 fluent-bit [2020/10/20 07:12:48] [debug] [input chunk] chunk 1-1603177968.740186336.flb required 4115 bytes and 5000000000 bytes left in plugin es.0

and this is 5GB, which is weird. Could you please double-check what value is assigned to storage.total_limit_size? Thank you so much!
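For reference, suffix expansion along these lines would explain the numbers in the logs. This is an illustrative re-implementation, not fluent-bit's actual parser; the decimal (1000-based) multipliers are an assumption inferred from the 5000000000 logged for a 5G setting:

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: expand a size string like "5M" or "5G" to bytes.
 * Decimal multipliers are assumed here because the fluent-bit log shows
 * "5G" becoming exactly 5000000000 (5 * 1000^3). */
static int64_t size_to_bytes(const char *s)
{
    char *end;
    int64_t val = strtoll(s, &end, 10);

    if (end == s || val < 0) {
        return -1;                          /* not a number */
    }
    switch (toupper((unsigned char) *end)) {
    case 'K':  return val * 1000;
    case 'M':  return val * 1000 * 1000;
    case 'G':  return val * 1000LL * 1000 * 1000;
    case '\0': return val;                  /* plain bytes */
    default:   return -1;                   /* unknown suffix */
    }
}
```

Under this assumption, a 5M limit is 5,000,000 bytes, so the "8206935 bytes left" message above is already inconsistent with the configured limit.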

@rjshrjndrn would you please share your full configuration so we can try to reproduce the issue?

@edsiper Thank you for taking a look into this issue. I've watched a lot of your YouTube videos. Kudos.

This is my configuration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     trace
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M
        # How many chunks from the filesystem queue (waiting to flush to es) are kept in memory (warm)
        # storage.max_chunks_up 50

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  1
        storage.type  filesystem

  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off

  output-elasticsearch.conf: |
    [OUTPUT]
        Name            es
        Match           *
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        Logstash_Format On
        Replace_Dots    On
        Retry_Limit     5
        storage.total_limit_size  5G

  parsers.conf: |
    [PARSER]
        Name   apache
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache2
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache_error
        Format regex
        Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name   nginx
        Format regex
        Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.6
        ports:
          - containerPort: 2020
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "es-elasticsearch-coordinating-only.es.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: mnt
          mountPath: /mnt
          readOnly: true
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: mnt
        hostPath:
          path: /mnt
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"

@JeffLuoo Sorry for the misunderstanding.
First I tried with 5M; when I hit the issue, I updated it to 5G.

I have the following:

[2020/10/20 16:15:54] [debug] [storage] tail.0:1-1603210554.901364045.flb mapped OK
[2020/10/20 16:15:54] [debug] [input chunk] chunk 1-1603210554.901364045.flb required 937 bytes and 1024034522 bytes left in plugin http.0
[2020/10/20 16:15:54] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1603210554.901364045.flb in plugin http.0

Immediately after I start Fluent Bit, I start getting these messages. I have plenty of free space. Fluent Bit detects that 1G is free for the output plugin but still reports no space. I am running Fluent Bit as a Kubernetes DaemonSet.

Here is my config:

config:
  ## https://docs.fluentbit.io/manual/service
  service: |
    [SERVICE]
        Flush                     5
        Daemon                    Off
        Log_Level                 debug
        Parsers_File              parsers.conf
        Parsers_File              custom_parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.type              memory
        storage.sync              normal
        storage.checksum          off
        storage.metrics           on
        storage.backlog.mem_limit 1024M
        storage.path              /var/log/flb-storage/

  ## https://docs.fluentbit.io/manual/pipeline/inputs
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Exclude_Path /var/log/containers/*fluent*.log
        Parser docker
        Tag kube.*
        storage.type filesystem
        Mem_Buf_Limit 32M 
        Buffer_Chunk_Size 2M 
        Buffer_Max_Size 2M
        Skip_Long_Lines On
        DB /var/log/containers/fluent-bit.db
        DB.locking true

  ## https://docs.fluentbit.io/manual/pipeline/filters
  filters: |
    [FILTER]
        Name record_modifier
        Match kube.*
        Record CLUSTER_NAME ${CLUSTER_NAME}
        Record REGION ${REGION}
        Record ACCOUNT_ALIAS ${ACCOUNT_ALIAS}
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
    [FILTER]
        Name    lua
        Match   kube.*
        script  /dedot-lua/dedot.lua
        call    dedot

  ## https://docs.fluentbit.io/manual/pipeline/outputs
  outputs: |
    [OUTPUT]
        Name http
        Match kube.*
        json_date_format iso8601
        Host <dest_host>
        Port 443        
        Header Origin <src_host>  
        Retry_Limit False
        Format json_stream
        tls On
        tls.verify Off
        storage.total_limit_size  1024M
    [OUTPUT]
        Name flowcounter
        Match *
        Unit second

  ## https://docs.fluentbit.io/manual/pipeline/parsers
  customParsers: |
    [PARSER]
        Name docker
        Format json
        Time_Keep On
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
    [PARSER]
        Name        multiline_time
        Format      regex
        Regex       ^\d+\s(?<level>\w{3})\s(?<time>\w{3}\s\d+\s\d+:\d+:\d+\.\d+)\s(?<message>.*)
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
    [PARSER]
        Name        java_multiline
        Format      regex
        Regex       /^(?<time>(?:\d{4}-\d{1,2}-\d{1,2} )?\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,})?) \[(?<thread>.*)\] (?<level>[^\s]+)(?<message>.*)/
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L

@rjshrjndrn
@Cyanopus
Confirmed that I was able to reproduce the issue in a cluster environment. Debugging now.

Issue can be reproduced with:

$ bin/fluent-bit -i random -F record_modifier -p "record=hostname aaabbbbbccc" -m '*' -o stdout -p storage.total_limit_size=55 -f 1 -v

Note that depending on the scheduler, a plugin might try to ingest data before a chunk gets destroyed, so it gets limited.

I am posting a possible fix. @JeffLuoo, can you validate that this is the right one?

diff --git a/src/flb_input_chunk.c b/src/flb_input_chunk.c
index 8fe6641a..64e498ec 100644
--- a/src/flb_input_chunk.c
+++ b/src/flb_input_chunk.c
@@ -459,6 +459,7 @@ int flb_input_chunk_destroy(struct flb_input_chunk *ic, int del)
         bytes = flb_input_chunk_get_size(ic);
         if ((ic->routes_mask & o_ins->mask_id) > 0) {
             o_ins->fs_chunks_size -= bytes;
+            printf("chunks size => %" PRIu64 "\n", o_ins->fs_chunks_size);
         }
     }

@@ -692,7 +693,9 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
     int ret;
     int set_down = FLB_FALSE;
     int min;
+    size_t diff;
     size_t size;
+    size_t pre_size;
     struct flb_input_chunk *ic;
     struct flb_storage_input *si;

@@ -739,6 +742,12 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
         set_down = FLB_TRUE;
     }

+    /*
+     * Previous size from the chunk, used to calculate the difference
+     * after filtering
+     */
+    pre_size = cio_chunk_get_content_size(ic->chunk);
+
     /* Write the new data */
     ret = flb_input_chunk_write(ic, buf, buf_size);
     if (ret == -1) {
@@ -748,8 +757,6 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
         return -1;
     }

-    flb_input_chunk_update_output_instances(ic, buf_size);
-
     /* Update 'input' metrics */
 #ifdef FLB_HAVE_METRICS
     if (ic->total_records > 0) {
@@ -758,6 +765,8 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
     }
 #endif

+    printf("pre size => %lu\n", buf_size);
+
     /* Apply filters */
     flb_filter_do(ic,
                   buf, buf_size,
@@ -765,6 +774,10 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,

     /* Get chunk size */
     size = cio_chunk_get_content_size(ic->chunk);
+    printf("pos size => %lu\n", size);
+
+    diff = llabs(size - pre_size);
+    flb_input_chunk_update_output_instances(ic, diff);

     /* Lock buffers where size > 2MB */
     if (size > FLB_INPUT_CHUNK_FS_MAX_SIZE) {

FYI: the function flb_input_chunk_update_output_instances(...) always __appends__ the new number of bytes, so the patch tries to compute the exact number of bytes being added after filtering.
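The accounting change in the patch can be modeled in isolation: instead of charging the raw buf_size to every matching output, measure the chunk before and after filtering and charge only the difference, so the destroy-time decrement matches the append-time increment. A minimal sketch with stand-in names (struct output, account_append, account_destroy are illustrative, not fluent-bit's real API):

```c
#include <stddef.h>

/* Simplified stand-in for an output instance's filesystem accounting. */
struct output {
    size_t fs_chunks_size;   /* bytes currently charged to this output */
};

/* Charge only the bytes the chunk actually grew (or shrank) by after
 * filtering; this mirrors the patch's diff = |post - pre| logic. */
static void account_append(struct output *o, size_t pre_size, size_t post_size)
{
    size_t diff = post_size > pre_size ? post_size - pre_size
                                       : pre_size - post_size;
    o->fs_chunks_size += diff;
}

/* On chunk destroy, release exactly the chunk's current size. */
static void account_destroy(struct output *o, size_t chunk_size)
{
    o->fs_chunks_size -= chunk_size;
}
```

With this scheme the counter returns to zero once every chunk is destroyed; charging the pre-filter buf_size instead (as the old code did) leaks bytes whenever filters change the payload size.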

@edsiper I tried this patch in my cluster environment and verified that the bytes added to fs_chunks_size match the bytes removed from fs_chunks_size for a given input chunk. From my view, the wrong calculation of fs_chunks_size is the root cause of this issue, and with the patch fs_chunks_size is updated correctly. Thank you so much for providing the patch!

Thanks everyone. The fix c27896fd is already in place and will be part of the v1.6.2 release today.

Fluent Bit versions: 1.6.2 and 1.6.3
I don't think this issue is fixed. I'm still getting the same error, less frequently, but after some time. I set storage.total_limit_size to 20G and still get the same error: fluent-bit-4cgwq fluent-bit [2020/11/04 03:53:14] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1604461994.357757503.flb in plugin es.0
@JeffLuoo @edsiper
@Cyanopus Are you seeing this behaviour?

@rjshrjndrn please share your configuration

@edsiper Please find the config attached

Config

  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        # Merge application log to kubernetes structure
        Merge_Log           On
        # Disabling merge key so all the fields land at the root
        # otherwise all merged content would be nested under
        # log_processed:
        #    {
        #    }
        # Merge_Log_Key       log_processed
        #
        # Adding a custom parser so no JSON parsing is done
        Merge_Parser        catchall
        # Don't need merged logs
        # More over, we're using custom parser, which catch all and tag it as log
        Keep_Log            Off
        # Annotate pod to select the parser
        # annotations:
        #   fluentbit.io/parsers: nginx2
        K8S-Logging.Parser  On
        # Disable fluent-bit from parsing logs
        # annotations:
        #   fluentbit.io/exclude: true
        K8S-Logging.Exclude On
    [FILTER]
        # Discard all health debug and info logs
        Name grep
        Match kube.*
        exclude message /^.*(?:debug|info|GET (\/service)?\/health).*$

  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     WARNING
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path              /mnt/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 120M
        # How many chunks from the filesystem queue (waiting to flush to es) are kept in memory (warm)
        storage.max_chunks_up 50

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf
  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /mnt/log/flb_kube.db
        Mem_Buf_Limit     120MB
        # Set the initial buffer size to read files data.
        # This value is used too to increase buffer size.
        Buffer_Chunk_Size 256k
        # Set the limit of the buffer size per monitored file.
        # When a buffer needs to be increased (e.g: very long lines),
        # this value is used to restrict how much the memory buffer can grow.
        # If reading a file exceed this limit, the file is removed from the monitored file list
        Buffer_Max_Size   512k
        Skip_Long_Lines   On
        Refresh_Interval  1
        storage.type  filesystem
        # Ignore logs older than 30m
        Ignore_Older       30m
  output-elasticsearch.conf: |
    [OUTPUT]
        Name            es
        Match           *
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        Logstash_Format On
        Replace_Dots    On
        Retry_Limit     10
        Type            _doc
        # Networking Setup
        net.connect_timeout         10
        # net.source_address          127.0.0.1
        net.keepalive               on
        net.keepalive_idle_timeout  100
        net.keepalive_max_recycle 200
        #
        # Limit the maximum number of Chunks in the filesystem for the current output logical destination.
        # After this LRU will kick in; that means LeastRecentlyUsed will get deleted
        # 1 chunk =~ 2MB
        # This config has some issues; see https://github.com/fluent/fluent-bit/issues/2688
        storage.total_limit_size 20G
  parsers.conf: |
    [PARSER]
        Name   apache
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache2
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache_error
        Format regex
        Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name   nginx
        Format regex
        Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
    [PARSER]
        Name        nginx2
        Format      regex
        # https://rubular.com/r/ZSLRpLS1B4XFbC
        Regex       ^(?<n_cip>[^ ]*) (?<n_host>[^ ]*) (?<n_uid>[^ ]*) \[(?<n_time>[^\]]*)\] "(?<n_verb>\S+)(?: +(?<n_path>[^\"]*?)(?: +\S*)?)?" (?<n_rscode>[^ ]*) (?<n_rqsize>[^ ]*) (?<n_rssize>[^ ]*) (?<n_rqtime>[^ ]+) (?<n_rstime>[^ ]+) (?<n_pipe>\.*) (?:"(?<n_referer>[^\"]*)" "(?<n_agent>[^\"]*)") "(?<n_tid>[^\"]*)" "(?<n_did>[^\"]*)" "(?<n_cid>[^\"]*)" "(?<n_aid>[^\"]*)" "(?<n_aver>[^\"]*)" "(?<n_sid>[^\"]*)"
        Time_Key    time
        Types       n_rstime:float n_rqtime:float n_rssize:integer n_rqsize:integer
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S
    [PARSER]
        Name    catchall
        Format  regex
        Regex   ^(?<message>.*)$

Daemonset

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  creationTimestamp: "2020-11-04T05:39:35Z"
  generation: 4
  labels:
    app.kubernetes.io/instance: fluent-bit
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: fluent-bit
    app.kubernetes.io/version: 1.6.2
    helm.sh/chart: fluent-bit-0.1.0
    k8s-app: fluent-bit-logging
    kubernetes.io/cluster-service: "true"
    version: v1
  name: fluent-bit
  namespace: logging
  resourceVersion: "19410581"
  selfLink: /apis/extensions/v1beta1/namespaces/logging/daemonsets/fluent-bit
  uid: 219612f9-3a1f-4c7f-a3d0-d1c94660494d
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: fluent-bit
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: fluent-bit
      app.kubernetes.io/version: 1.6.2
      helm.sh/chart: fluent-bit-0.1.0
      k8s-app: fluent-bit-logging
      kubernetes.io/cluster-service: "true"
      version: v1
  template:
    metadata:
      annotations:
        fleuntbit.io/exlude: "true"
        fluent-bitRolloutID: VtS9j
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: fluent-bit
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: fluent-bit
        app.kubernetes.io/version: 1.6.2
        helm.sh/chart: fluent-bit-0.1.0
        k8s-app: fluent-bit-logging
        kubernetes.io/cluster-service: "true"
        version: v1
    spec:
      containers:
      - env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: log-es.logging.svc.cluster.local
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        image: fluent/fluent-bit:1.6.3
        imagePullPolicy: Always
        name: fluent-bit
        ports:
        - containerPort: 2020
          name: http-metrics
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/log
          name: datapath
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: fluent-bit
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 10
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /mnt/log
          type: ""
        name: datapath
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: fluent-bit-config
        name: fluent-bit-config
  templateGeneration: 4
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 25%
    type: RollingUpdate

Hi @edsiper, were you able to reproduce the issue? Do you need any additional info?

Hi @rjshrjndrn , I saw a similar problem just now when I _updated and applied_ only the configuration change. However, if I delete the pod completely and recreate it with the new configuration, the problem goes away.

I would appreciate seeing whether you can reproduce the problem in a stable way.

@erain I tried the fix with version 1.6.2 and after some time the issue popped up, so I tried again with 1.6.3, which also didn't resolve it. One thing I noticed is that the issue doesn't appear at the beginning, only after some time.
BTW, I don't know whether the backlog I had caused the issue; in that case the LRU should kick in, not the error.
What log ingestion rate do you have? I have around 30MBps.

I am seeing this issue as well, and I also have a feeling that the backlog caused this issue. Here is why:

One possible way to reproduce this issue:

  • Run fluent-bit in an older version, say 1.4
  • Run fluent-bit in the latest version 1.6.3, with the same configuration + storage.total_limit_size <size>

@edsiper @JeffLuoo : FYI, I think we should re-open this issue.

Hi @erain Thank you for the information! You mentioned that "if I delete the pod completely and recreate it with the new configuration, the problem goes away", so my guess at the root cause was also the backlog of logs on the container. If you run a new pod (without a log backlog) with storage.total_limit_size 200M, do you see an error like fluent-bit-4cgwq fluent-bit [2020/11/04 03:53:14] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1604461994.357757503.flb in plugin es.0?

Hi @JeffLuoo , no, I could not reproduce it with an empty buffer folder.

In fact, I've tried to reproduce the bug in vanilla Fluent Bit (a non-containerized environment), but have failed so far:

  1. compile fluent-bit from 1.5 branch and get some offline buffer files.
  2. change the config to include something like storage.total_limit_size 1M
  3. compile fluent-bit from master and start it with those offline buffer files.

This leads me to think the problem may have something to do with a containerized environment / k8s. I will keep digging.

Just a guess, but what about counter adjustments when a retry exceeds its limit and the chunk is no longer available for that output?

I think I found something that is not right, specifically here upon initialization:

ref: https://github.com/fluent/fluent-bit/blob/master/src/flb_output.c#L567

ins->total_limit_size = (size_t) limit;

The struct flb_output_instance field total_limit_size has type size_t, but limit can be -1 when the feature is turned off; assigning -1 sets total_limit_size to 18446744073709551615. There are also a couple of places where the code checks whether the struct field is -1.

I am not saying the data type is to blame, since in the majority of cases the compiler handles this conversion predictably. Anyway, I think the internal logic for total_limit_size should be: a value of zero means no limit, and to check whether a limit is set, just use > 0.

Now @JeffLuoo's comment might make sense: what if the storage directory already has data queued for a given destination? That would limit appending new data. We need to confirm whether this happened. cc: @rjshrjndrn

note: we need to verify whether, upon starting with a full storage queue, the oldest chunks get deleted

@erain I tried the solution with version 1.6.2 and after some time the issue popped up again, so I tried 1.6.3, which also didn't resolve it. One thing I noticed is that the issue does not appear at startup, only after some time.
BTW, I don't know whether the backlog I had caused the issue; if it did, the LRU should kick in rather than the error.

@edsiper As I mentioned earlier, I had a backlog from 1.6; I upgraded to 1.6.2 and then to 1.6.3. Since I have a hostPath attached to that path, I had previous pending data.

Just guessing, but what about counter adjustments when a retry exceeds its limit and the chunk is no longer available for that output?

@edsiper I did not modify the logic whereby, if the retry reaches its limit, flb_task_retry_destroy(1) is called, followed by a call to flb_sched_request_invalidate(2). However, I was not sure whether the call to flb_task_retry_destroy(1) results in the task being destroyed. If the task is not destroyed, then I may have missed updating the chunk bytes counter for this case.

@JeffLuoo any idea how to get a reproducible case? We have an out_retry plugin, which is disabled; it just triggers a retry N times.

@edsiper Do you mean a reproducible case to verify the counter adjustments when a retry exceeds its limit and the chunk is no longer available for that output?

Edit:
@edsiper
To force a retry, I ran fluent-bit with

bin/fluent-bit -i random -o http://0.0.0.0 -m "*" -p storage.total_limit_size=10000 -p Retry_Limit=0 -f 1 -v

I believe it wasn't the best way to test the case. According to the fluent-bit logs, the retry created for task 0 (the first task) reaches the limit after the first retry; after reaching the limit, it invalidates the retry and destroys the task as well. Below is the part of the logs with information about task 0:

[2020/11/12 04:48:17] [debug] [retry] new retry created for task_id=7 attempts=1
[2020/11/12 04:48:17] [ warn] [engine] failed to flush chunk '29912-1605156496.340157658.flb', retry in 10 seconds: task_id=7, input=random.0 > output=http.0
[2020/11/12 04:48:18] [debug] [task] created task=0x7fc2ac006690 id=1 OK
[2020/11/12 04:48:18] [debug] [input chunk] chunk 29912-1605156498.339987799.flb required 32 bytes and 9744 bytes left in plugin http.0
[2020/11/12 04:48:18] [debug] [input chunk] update plugin http.0 fs_chunk_size by 32 bytes, the current fs_chunk_size is 288 bytes
[2020/11/12 04:48:18] [error] [io] TCP connection failed: 0.0.0.0:80 (Connection refused)
[2020/11/12 04:48:18] [debug] [upstream] connection #31 failed to 0.0.0.0:80
[2020/11/12 04:48:18] [error] [output:http:http.0] no upstream connections available to 0.0.0.0:80
[2020/11/12 04:48:18] [error] [io] TCP connection failed: 0.0.0.0:80 (Connection refused)
[2020/11/12 04:48:18] [debug] [upstream] connection #24 failed to 0.0.0.0:80
[2020/11/12 04:48:18] [error] [output:http:http.0] no upstream connections available to 0.0.0.0:80
[2020/11/12 04:48:18] [debug] [task] task_id=0 reached retry-attempts limit 1/0
[2020/11/12 04:48:18] [ warn] [engine] chunk '29912-1605156489.339756951.flb' cannot be retried: task_id=0, input=random.0 > output=http.0
[2020/11/12 04:48:18] [debug] [task] destroy task=0x7fc2ac006210 (task_id=0)
[2020/11/12 04:48:18] [debug] [task] Delete chunk 29912-1605156489.339756951.flb of the task with bytes 32, fs_chunks_size after it is 256

I confirm that I am able to reproduce the issue. It only occurs when there is a backlog on the container. Troubleshooting it now.

@edsiper
I have provided a patch in #2771. PTAL when you have a chance, thank you!

I've built a new image and I'm testing it

It doesn't seem to work:

[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967257035.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967302301.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967358640.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967404287.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967452877.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967509009.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967557220.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967604588.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk

[2021/01/07 10:51:00] [debug] [input chunk] chunk 1-1610016660.50582922.flb required 0 bytes and -54 bytes left in plugin kafka.0
[2021/01/07 10:51:00] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610016660.50582922.flb in plugin kafka.0
[2021/01/07 10:51:00] [error] [input chunk] no available chunk
