Describe the bug
The error "no enough space in filesystem to buffer chunk 1-1602687091.955423455.flb in plugin es.0" appears when I enable storage.total_limit_size 5M in the es output.
To Reproduce
[2020/10/14 15:18:04] [debug] [input:tail:tail.0] inode=1291902 events: IN_MODIFY
[2020/10/14 15:18:04] [debug] [storage] tail.0:1-1602688684.470023866.flb mapped OK
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.470023866.flb required 840 bytes and 8206935 bytes left in plugin es.0
[2020/10/14 15:18:04] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1602688684.470023866.flb in plugin es.0
[2020/10/14 15:18:04] [debug] [storage] [cio file] synced at: tail.0/1-1602688684.470023866.flb
[2020/10/14 15:18:04] [error] [input chunk] no available chunk
[2020/10/14 15:18:04] [debug] [storage] tail.0:1-1602688684.470514408.flb mapped OK
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.470514408.flb required 963 bytes and 8206935 bytes left in plugin es.0
[2020/10/14 15:18:04] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1602688684.470514408.flb in plugin es.0
[2020/10/14 15:18:04] [debug] [storage] [cio file] synced at: tail.0/1-1602688684.470514408.flb
[2020/10/14 15:18:04] [error] [input chunk] no available chunk
[2020/10/14 15:18:04] [debug] [input:tail:tail.0] inode=1291902 events: IN_MODIFY
[2020/10/14 15:18:04] [debug] [storage] tail.0:1-1602688684.471037398.flb mapped OK
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.471037398.flb required 943 bytes and 8206935 bytes left in plugin es.0
[2020/10/14 15:18:04] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1602688684.471037398.flb in plugin es.0
[2020/10/14 15:18:04] [debug] [storage] [cio file] synced at: tail.0/1-1602688684.471037398.flb
Expected behavior
Storage LRU eviction should kick in once the limit is exceeded.
Your Environment
Client Version: v1.19.2
Server Version: v1.18.6+k3s1
Additional context
My disk got completely filled by fluent-bit when my es was down. So I figured out I have to upgrade to 1.6 to get LRU eviction like fluentd.
@JeffLuoo FYI
Thank you, I will look into it.
Could you please provide the config file that you are using if possible? Thank you so much!
The config file and the entire manifest are in https://github.com/fluent/fluent-bit/files/5379167/fluentbit.txt
Thanks!
Hi @JeffLuoo, please let me know if you need someone to test or help.
Thanks
One thing I noticed is that you set the limit to 5M, yet the log message
[2020/10/14 15:18:04] [debug] [input chunk] chunk 1-1602688684.470023866.flb required 840 bytes and 8206935 bytes left in plugin es.0
shows that there are 8206935 bytes (~8M) left. Is this the first part of the log, or just a random excerpt?
@JeffLuoo This is a random part. If I enable the storage limit, I get this error all the time. I tried this in multiple k8s clusters.
If you want, I can run the same setup and give you the complete output.
@rjshrjndrn That would be perfect. A complete output will help me a lot! Thank you!
@JeffLuoo
Here is the complete log with trace input.
Steps
1. helm install es bitnami/es -n es
2. kubectl delete -f fluent.txt
Please let me know if you need anything else.
Sorry for the delayed reply.
@rjshrjndrn Got it! Thank you so much.
I noticed that you set the filesystem limit to 5M:
storage.total_limit_size 5M
However, in the log file fluentbit.log I noticed
fluent-bit-xmjw9 fluent-bit [2020/10/20 07:12:48] [debug] [input chunk] chunk 1-1603177968.740186336.flb required 4115 bytes and 5000000000 bytes left in plugin es.0
and 5000000000 bytes is 5GB, which is odd. Could you please double-check what value is assigned to storage.total_limit_size? Thank you so much!!
Similar issue reported in slack here:
https://fluent-all.slack.com/archives/C0CTQGHKJ/p1603210660214400
@rjshrjndrn would you please share your full configuration so we can try to reproduce the issue?
@edsiper Thank you for taking a look into this issue. I have watched a lot of your YouTube videos. Kudos.
This is my configuration
apiVersion: v1
kind: ServiceAccount
metadata:
name: fluent-bit
namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: fluent-bit-read
rules:
- apiGroups: [""]
resources:
- namespaces
- pods
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: fluent-bit-read
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: fluent-bit-read
subjects:
- kind: ServiceAccount
name: fluent-bit
namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
labels:
k8s-app: fluent-bit
data:
# Configuration files: server, input, filters and output
# ======================================================
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level trace
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M
# How many chunks from the filesystem queue (waiting to flush to es) are kept up in memory (warm)
# storage.max_chunks_up 50
@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-elasticsearch.conf
input-kubernetes.conf: |
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 1
storage.type filesystem
filter-kubernetes.conf: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
output-elasticsearch.conf: |
[OUTPUT]
Name es
Match *
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
Logstash_Format On
Replace_Dots On
Retry_Limit 5
storage.total_limit_size 5G
parsers.conf: |
[PARSER]
Name apache
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache2
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache_error
Format regex
Regex ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name syslog
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: logging
labels:
k8s-app: fluent-bit-logging
version: v1
kubernetes.io/cluster-service: "true"
spec:
selector:
matchLabels:
k8s-app: fluent-bit-logging
template:
metadata:
labels:
k8s-app: fluent-bit-logging
version: v1
kubernetes.io/cluster-service: "true"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "2020"
prometheus.io/path: /api/v1/metrics/prometheus
spec:
containers:
- name: fluent-bit
image: fluent/fluent-bit:1.6
ports:
- containerPort: 2020
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "es-elasticsearch-coordinating-only.es.svc.cluster.local"
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: fluent-bit-config
mountPath: /fluent-bit/etc/
- name: mnt
mountPath: /mnt
readOnly: true
terminationGracePeriodSeconds: 10
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: fluent-bit-config
configMap:
name: fluent-bit-config
- name: mnt
hostPath:
path: /mnt
serviceAccountName: fluent-bit
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
- operator: "Exists"
effect: "NoExecute"
- operator: "Exists"
effect: "NoSchedule"
@JeffLuoo Sorry for the misunderstanding.
First I tried with 5M; when I hit the issue, I updated it to 5G.
I have the following:
[2020/10/20 16:15:54] [debug] [storage] tail.0:1-1603210554.901364045.flb mapped OK
[2020/10/20 16:15:54] [debug] [input chunk] chunk 1-1603210554.901364045.flb required 937 bytes and 1024034522 bytes left in plugin http.0
[2020/10/20 16:15:54] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1603210554.901364045.flb in plugin http.0
Immediately after I start fluent-bit, I start getting these messages. I have plenty of free disk space, and fluent-bit detects that ~1G is left for the output plugin, yet it still reports no space. I am running fluent-bit as a Kubernetes DaemonSet.
Here is my config:
config:
## https://docs.fluentbit.io/manual/service
service: |
[SERVICE]
Flush 5
Daemon Off
Log_Level debug
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.type memory
storage.sync normal
storage.checksum off
storage.metrics on
storage.backlog.mem_limit 1024M
storage.path /var/log/flb-storage/
## https://docs.fluentbit.io/manual/pipeline/inputs
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
Exclude_Path /var/log/containers/*fluent*.log
Parser docker
Tag kube.*
storage.type filesystem
Mem_Buf_Limit 32M
Buffer_Chunk_Size 2M
Buffer_Max_Size 2M
Skip_Long_Lines On
DB /var/log/containers/fluent-bit.db
DB.locking true
## https://docs.fluentbit.io/manual/pipeline/filters
filters: |
[FILTER]
Name record_modifier
Match kube.*
Record CLUSTER_NAME ${CLUSTER_NAME}
Record REGION ${REGION}
Record ACCOUNT_ALIAS ${ACCOUNT_ALIAS}
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name lua
Match kube.*
script /dedot-lua/dedot.lua
call dedot
## https://docs.fluentbit.io/manual/pipeline/outputs
outputs: |
[OUTPUT]
Name http
Match kube.*
json_date_format iso8601
Host <dest_host>
Port 443
Header Origin <src_host>
Retry_Limit False
Format json_stream
tls On
tls.verify Off
storage.total_limit_size 1024M
[OUTPUT]
Name flowcounter
Match *
Unit second
## https://docs.fluentbit.io/manual/pipeline/parsers
customParsers: |
[PARSER]
Name docker
Format json
Time_Keep On
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
[PARSER]
Name multiline_time
Format regex
Regex ^\d+\s(?<level>\w{3})\s(?<time>\w{3}\s\d+\s\d+:\d+:\d+\.\d+)\s(?<message>.*)
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name java_multiline
Format regex
Regex /^(?<time>(?:\d{4}-\d{1,2}-\d{1,2} )?\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,})?) \[(?<thread>.*)\] (?<level>[^\s]+)(?<message>.*)/
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
@rjshrjndrn
@Cyanopus
Confirmed that I was able to reproduce the issue in a cluster environment. Debugging now.
Issue can be reproduced with:
$ bin/fluent-bit -i random -F record_modifier -p "record=hostname aaabbbbbccc" -m '*' -o stdout -p storage.total_limit_size=55 -f 1 -v
Note that, depending on the scheduler, a plugin might try to ingest data before a chunk gets destroyed, so it gets limited.
I am posting a possible fix. @JeffLuoo, can you validate whether this is the right one?
diff --git a/src/flb_input_chunk.c b/src/flb_input_chunk.c
index 8fe6641a..64e498ec 100644
--- a/src/flb_input_chunk.c
+++ b/src/flb_input_chunk.c
@@ -459,6 +459,7 @@ int flb_input_chunk_destroy(struct flb_input_chunk *ic, int del)
bytes = flb_input_chunk_get_size(ic);
if ((ic->routes_mask & o_ins->mask_id) > 0) {
o_ins->fs_chunks_size -= bytes;
+ printf("chunks size => %" PRIu64 "\n", o_ins->fs_chunks_size);
}
}
@@ -692,7 +693,9 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
int ret;
int set_down = FLB_FALSE;
int min;
+ size_t diff;
size_t size;
+ size_t pre_size;
struct flb_input_chunk *ic;
struct flb_storage_input *si;
@@ -739,6 +742,12 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
set_down = FLB_TRUE;
}
+ /*
+ * Previous size from the chunk, used to calculate the difference
+ * after filtering
+ */
+ pre_size = cio_chunk_get_content_size(ic->chunk);
+
/* Write the new data */
ret = flb_input_chunk_write(ic, buf, buf_size);
if (ret == -1) {
@@ -748,8 +757,6 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
return -1;
}
- flb_input_chunk_update_output_instances(ic, buf_size);
-
/* Update 'input' metrics */
#ifdef FLB_HAVE_METRICS
if (ic->total_records > 0) {
@@ -758,6 +765,8 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
}
#endif
+ printf("pre size => %lu\n", buf_size);
+
/* Apply filters */
flb_filter_do(ic,
buf, buf_size,
@@ -765,6 +774,10 @@ int flb_input_chunk_append_raw(struct flb_input_instance *in,
/* Get chunk size */
size = cio_chunk_get_content_size(ic->chunk);
+ printf("pos size => %lu\n", size);
+
+ diff = llabs(size - pre_size);
+ flb_input_chunk_update_output_instances(ic, diff);
/* Lock buffers where size > 2MB */
if (size > FLB_INPUT_CHUNK_FS_MAX_SIZE) {
FYI: the function flb_input_chunk_update_output_instances(...) always __appends__ the given number of bytes, so the patch tries to compute the exact number of bytes that are added after filtering.
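To make the accounting problem concrete, here is a minimal standalone sketch (plain C, not fluent-bit code; fs_chunks_size, apply_filter() and all sizes are illustrative stand-ins) of why charging the pre-filter buf_size instead of the post-filter size difference makes the per-output counter drift upward until the limit check rejects every new chunk:

#include <stdio.h>
#include <stddef.h>

/* Minimal illustrative sketch of the accounting bug, NOT fluent-bit code. */
static size_t fs_chunks_size = 0;   /* per-output byte counter */

/* Pretend a filter stage discards half of the newly appended records. */
static size_t apply_filter(size_t chunk_size, size_t appended)
{
    return chunk_size - appended / 2;
}

int main(void)
{
    size_t chunk_size = 0;
    size_t buf_size = 1000;

    size_t pre_size = chunk_size;          /* chunk size before the write */
    chunk_size += buf_size;                /* raw write of the new data */
    chunk_size = apply_filter(chunk_size, buf_size);

    /* Old behaviour: charge the raw buffer size (1000 bytes). */
    fs_chunks_size += buf_size;

    /* Patched behaviour: charge only what actually landed (500 bytes). */
    size_t diff = chunk_size - pre_size;

    printf("charged: %zu, actual growth: %zu, leak per append: %zu\n",
           buf_size, diff, buf_size - diff);

    /* On destroy, only the real chunk size is subtracted, so the old
     * accounting leaks bytes on every append until the limit check
     * starts rejecting every new chunk. */
    return 0;
}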
@edsiper I tried this patch in my cluster environment and verified that the bytes we add to fs_chunks_size match the bytes we remove from fs_chunks_size for a specific input chunk. From my view, the wrong calculation of fs_chunks_size was the root cause of this issue, and now fs_chunks_size is updated correctly. Thank you so much for providing the patch!
Thanks everyone. The fix c27896fd is already in place and will be part of the v1.6.2 release today.
fluent-bit version: 1.6.2 and 1.6.3
I don't think this issue is fixed. I'm still getting the same error, less frequently now, but after some time. I set storage.total_limit_size to 20G and still see the same error: fluent-bit-4cgwq fluent-bit [2020/11/04 03:53:14] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1604461994.357757503.flb in plugin es.0.
@JeffLuoo @edsiper
@Cyanopus Are you seeing this behaviour?
@rjshrjndrn pls share your configuration
@edsiper Please find the config attached
Config
filter-kubernetes.conf: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
# Merge application log to kubernetes structure
Merge_Log On
# Disabling merge key as all the fields will be in the root
# Else all merged content will be under
# log_processed:
# {
# }
# Merge_Log_Key log_processed
#
# Adding a custom parser that does not do any JSON parsing
Merge_Parser catchall
# Don't need merged logs
# Moreover, we're using a custom parser which catches everything and tags it as log
Keep_Log Off
# Annotate pod to select the parser
# annotations:
# fluentbit.io/parsers: nginx2
K8S-Logging.Parser On
# Disable fluent-bit from parsing logs
# annotations:
# fluentbit.io/exclude: true
K8S-Logging.Exclude On
[FILTER]
# Discard all health debug and info logs
Name grep
Match kube.*
exclude message /^.*(?:debug|info|GET (\/service)?\/health).*$
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level WARNING
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /mnt/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 120M
# How many chunks from the filesystem queue (waiting to flush to es) are kept up in memory (warm)
storage.max_chunks_up 50
@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-elasticsearch.conf
input-kubernetes.conf: |
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /mnt/log/flb_kube.db
Mem_Buf_Limit 120MB
# Set the initial buffer size to read files data.
# This value is also used to increase the buffer size.
Buffer_Chunk_Size 256k
# Set the limit of the buffer size per monitored file.
# When a buffer needs to be increased (e.g: very long lines),
# this value is used to restrict how much the memory buffer can grow.
# If reading a file exceeds this limit, the file is removed from the monitored file list
Buffer_Max_Size 512k
Skip_Long_Lines On
Refresh_Interval 1
storage.type filesystem
# Ignore logs older than 30m
Ignore_Older 30m
output-elasticsearch.conf: |
[OUTPUT]
Name es
Match *
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
Logstash_Format On
Replace_Dots On
Retry_Limit 10
Type _doc
# Networking Setup
net.connect_timeout 10
# net.source_address 127.0.0.1
net.keepalive on
net.keepalive_idle_timeout 100
net.keepalive_max_recycle 200
#
# Limit the maximum number of Chunks in the filesystem for the current output logical destination.
# After this, LRU will kick in; the least recently used chunks will get deleted
# 1 chunk =~ 2MB
# This config has some issues; see https://github.com/fluent/fluent-bit/issues/2688
storage.total_limit_size 20G
parsers.conf: |
[PARSER]
Name apache
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache2
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache_error
Format regex
Regex ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name nginx2
Format regex
# https://rubular.com/r/ZSLRpLS1B4XFbC
Regex ^(?<n_cip>[^ ]*) (?<n_host>[^ ]*) (?<n_uid>[^ ]*) \[(?<n_time>[^\]]*)\] "(?<n_verb>\S+)(?: +(?<n_path>[^\"]*?)(?: +\S*)?)?" (?<n_rscode>[^ ]*) (?<n_rqsize>[^ ]*) (?<n_rssize>[^ ]*) (?<n_rqtime>[^ ]+) (?<n_rstime>[^ ]+) (?<n_pipe>\.*) (?:"(?<n_referer>[^\"]*)" "(?<n_agent>[^\"]*)") "(?<n_tid>[^\"]*)" "(?<n_did>[^\"]*)" "(?<n_cid>[^\"]*)" "(?<n_aid>[^\"]*)" "(?<n_aver>[^\"]*)" "(?<n_sid>[^\"]*)"
Time_Key time
Types n_rstime:float n_rqtime:float n_rssize:integer n_rqsize:integer
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name syslog
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name catchall
Format regex
Regex ^(?<message>.*)$
Daemonset
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
creationTimestamp: "2020-11-04T05:39:35Z"
generation: 4
labels:
app.kubernetes.io/instance: fluent-bit
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: fluent-bit
app.kubernetes.io/version: 1.6.2
helm.sh/chart: fluent-bit-0.1.0
k8s-app: fluent-bit-logging
kubernetes.io/cluster-service: "true"
version: v1
name: fluent-bit
namespace: logging
resourceVersion: "19410581"
selfLink: /apis/extensions/v1beta1/namespaces/logging/daemonsets/fluent-bit
uid: 219612f9-3a1f-4c7f-a3d0-d1c94660494d
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app.kubernetes.io/instance: fluent-bit
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: fluent-bit
app.kubernetes.io/version: 1.6.2
helm.sh/chart: fluent-bit-0.1.0
k8s-app: fluent-bit-logging
kubernetes.io/cluster-service: "true"
version: v1
template:
metadata:
annotations:
fleuntbit.io/exlude: "true"
fluent-bitRolloutID: VtS9j
creationTimestamp: null
labels:
app.kubernetes.io/instance: fluent-bit
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: fluent-bit
app.kubernetes.io/version: 1.6.2
helm.sh/chart: fluent-bit-0.1.0
k8s-app: fluent-bit-logging
kubernetes.io/cluster-service: "true"
version: v1
spec:
containers:
- env:
- name: FLUENT_ELASTICSEARCH_HOST
value: log-es.logging.svc.cluster.local
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
image: fluent/fluent-bit:1.6.3
imagePullPolicy: Always
name: fluent-bit
ports:
- containerPort: 2020
name: http-metrics
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /mnt/log
name: datapath
- mountPath: /var/log
name: varlog
- mountPath: /var/lib/docker/containers
name: varlibdockercontainers
readOnly: true
- mountPath: /fluent-bit/etc/
name: fluent-bit-config
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: fluent-bit
serviceAccountName: fluent-bit
terminationGracePeriodSeconds: 10
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
- effect: NoExecute
operator: Exists
- effect: NoSchedule
operator: Exists
volumes:
- hostPath:
path: /mnt/log
type: ""
name: datapath
- hostPath:
path: /var/log
type: ""
name: varlog
- hostPath:
path: /var/lib/docker/containers
type: ""
name: varlibdockercontainers
- configMap:
defaultMode: 420
name: fluent-bit-config
name: fluent-bit-config
templateGeneration: 4
updateStrategy:
rollingUpdate:
maxUnavailable: 25%
type: RollingUpdate
Hi @edsiper, were you able to reproduce the issue? Do you need any additional info?
Hi @rjshrjndrn, I saw a similar problem just now when I _updated and applied_ only the configuration change. However, if I delete the pod completely and recreate it with the new configuration, the problem goes away.
I would appreciate seeing whether you can reproduce the problem in a stable way.
@erain I tried the solution with version 1.6.2 and after some time the issue popped up, so I tried again with 1.6.3, which also didn't resolve it. One thing I noticed is that the issue doesn't appear at the beginning, but only after some time.
BTW, I don't know whether the backlog I had caused the issue; in that case the LRU should kick in, not the error.
What's the log ingestion rate you're seeing? I have around 30 MBps.
I am seeing this issue as well, and I also have a feeling that the backlog caused it. Here is why: with a pre-existing backlog on disk and storage.total_limit_size 200M in the configuration, error logs as in https://github.com/fluent/fluent-bit/issues/2688#issuecomment-721498312 were immediately observed. One possible way to reproduce this issue: accumulate a filesystem backlog larger than the limit, then restart fluent-bit with storage.total_limit_size <size>.
@edsiper @JeffLuoo: FYI, I think we should re-open this issue.
Hi @erain, thank you for the information! You mentioned that "if I delete the pod completely and recreate it with the new configuration, the problem goes away", so my guess at the root cause of the issue was also the backlog logs on the container. If you run a new pod (without backlog logs) with storage.total_limit_size 200M, did you notice errors like fluent-bit-4cgwq fluent-bit [2020/11/04 03:53:14] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1604461994.357757503.flb in plugin es.0?
Hi @JeffLuoo, no, I could not reproduce with an empty buffer folder.
In fact, I've tried to reproduce the bug in vanilla fluent-bit (a non-containerized environment), but so far I have failed:
1. Run the 1.5 branch and get some offline buffer files.
2. Set storage.total_limit_size 1M on master and start it with those offline buffer files.
This leads me to think the problem has something to do with a containerized environment / k8s. I will keep digging.
Just a guess, but what about counter adjustments when a retry exceeds its limit and the chunk is no longer available for that output?
I think I found something that is not right, specifically here upon initialization:
ref: https://github.com/fluent/fluent-bit/blob/master/src/flb_output.c#L567
ins->total_limit_size = (size_t) limit;
The struct flb_output_instance field total_limit_size has type size_t, but limit can be -1 when the flag is turned off; assigning -1 sets total_limit_size to 18446744073709551615. There are a couple of places where the code checks whether the struct field is -1.
I am not saying the data type is to blame, since in most cases the compiler handles this conversion consistently. Anyway, I think the internal logic for total_limit_size should be: zero means no limit is set, and to check whether a limit exists just use > 0.
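As a side note, the unsigned wrap-around described above is easy to demonstrate in isolation; a minimal sketch (plain C, not fluent-bit code):

#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

int main(void)
{
    ssize_t limit = -1;                       /* "limit disabled" flag */
    size_t total_limit_size = (size_t) limit; /* wraps to SIZE_MAX */

    /* On a 64-bit platform this prints 18446744073709551615,
     * exactly the value quoted above. */
    printf("%zu\n", total_limit_size);

    /* Comparisons against -1 still "work" only because the -1 is
     * implicitly converted to SIZE_MAX on the other side too. */
    if (total_limit_size == (size_t) -1) {
        printf("reads as unlimited, by accident\n");
    }

    /* The convention suggested above removes the ambiguity:
     * 0 means no limit; a limit exists only when the value is > 0. */
    size_t safe_limit = 0;
    printf(safe_limit > 0 ? "limit enforced\n" : "no limit set\n");
    return 0;
}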
Now @JeffLuoo's comment might make sense: what if the storage directory already has data queued for a given destination? That would limit appending new data. We need to confirm whether this happened. cc: @rjshrjndrn
Note: we need to verify whether, upon starting with a full storage queue, the oldest chunks get deleted.
> @erain I tried the solution with version 1.6.2 and after some time the issue popped up, so I tried again with 1.6.3, which also didn't resolve it. One thing I noticed is that the issue doesn't appear at the beginning, but only after some time.
> BTW, I don't know whether the backlog I had caused the issue; in that case the LRU should kick in, not the error.
@edsiper As I mentioned earlier, I had a backlog from 1.6; I did an upgrade to 1.6.2 and then to 1.6.3. Since I have a hostPath attached to that path, I had previous pending data.
> Just a guess, but what about counter adjustments when a retry exceeds its limit and the chunk is no longer available for that output?
@edsiper I did not modify the logic that when a retry reaches the limit, flb_task_retry_destroy (1) is called, followed by a call to flb_sched_request_invalidate (2). However, I am not sure whether the call to flb_task_retry_destroy (1) results in the task being destroyed. If the task is not destroyed, then I might be missing this case when updating the chunk bytes counter.
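For context, the invariant under discussion can be written down as a small sketch (illustrative stand-ins only, not the fluent-bit API): fs_chunks_size must be decremented exactly once on every path a chunk takes out of an output, including exhausted retries, or the counter leaks and the limit check fires spuriously:

#include <stddef.h>

/* Illustrative sketch of the accounting invariant, NOT the fluent-bit
 * API: the names and the enum are hypothetical stand-ins. */
enum chunk_exit {
    CHUNK_FLUSHED,            /* delivered to the output */
    CHUNK_RETRIES_EXHAUSTED,  /* Retry_Limit reached, task dropped */
    CHUNK_EVICTED             /* removed to honor total_limit_size */
};

static size_t fs_chunks_size;

/* Called once when a chunk is appended and routed to an output. */
static void chunk_charge(size_t bytes)
{
    fs_chunks_size += bytes;
}

/* Must run exactly once per chunk on whichever exit path it takes.
 * The path being questioned above is CHUNK_RETRIES_EXHAUSTED: if the
 * task-destroy path skipped this call, the counter would never shrink
 * and the "no enough space" check would start firing spuriously. */
static void chunk_release(size_t bytes, enum chunk_exit reason)
{
    (void) reason;
    fs_chunks_size -= bytes;
}

int main(void)
{
    chunk_charge(32);
    chunk_release(32, CHUNK_RETRIES_EXHAUSTED);
    return fs_chunks_size != 0;   /* invariant: counter returns to 0 */
}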
@JeffLuoo any idea how to get a reproducible case? We have an out_retry plugin, currently disabled, that just triggers a retry N times.
@edsiper Do you mean a reproducible case to verify the adjustments when a retry exceeds its limit and the chunk is no longer available for that output?
Edit:
@edsiper
In order to force a retry, I ran fluent-bit with
bin/fluent-bit -i random -o http://0.0.0.0 -m "*" -p storage.total_limit_size=10000 -p Retry_Limit=0 -f 1 -v
I believe it wasn't the best way to test the case. According to the fluent-bit logs, the retry created for task 0 (the first task) reaches the limit after the first retry. After reaching the limit, it invalidates the retry and destroys the task as well. Below is the part of the logs with the information for task 0:
[2020/11/12 04:48:17] [debug] [retry] new retry created for task_id=7 attempts=1
[2020/11/12 04:48:17] [ warn] [engine] failed to flush chunk '29912-1605156496.340157658.flb', retry in 10 seconds: task_id=7, input=random.0 > output=http.0
[2020/11/12 04:48:18] [debug] [task] created task=0x7fc2ac006690 id=1 OK
[2020/11/12 04:48:18] [debug] [input chunk] chunk 29912-1605156498.339987799.flb required 32 bytes and 9744 bytes left in plugin http.0
[2020/11/12 04:48:18] [debug] [input chunk] update plugin http.0 fs_chunk_size by 32 bytes, the current fs_chunk_size is 288 bytes
[2020/11/12 04:48:18] [error] [io] TCP connection failed: 0.0.0.0:80 (Connection refused)
[2020/11/12 04:48:18] [debug] [upstream] connection #31 failed to 0.0.0.0:80
[2020/11/12 04:48:18] [error] [output:http:http.0] no upstream connections available to 0.0.0.0:80
[2020/11/12 04:48:18] [error] [io] TCP connection failed: 0.0.0.0:80 (Connection refused)
[2020/11/12 04:48:18] [debug] [upstream] connection #24 failed to 0.0.0.0:80
[2020/11/12 04:48:18] [error] [output:http:http.0] no upstream connections available to 0.0.0.0:80
[2020/11/12 04:48:18] [debug] [task] task_id=0 reached retry-attempts limit 1/0
[2020/11/12 04:48:18] [ warn] [engine] chunk '29912-1605156489.339756951.flb' cannot be retried: task_id=0, input=random.0 > output=http.0
[2020/11/12 04:48:18] [debug] [task] destroy task=0x7fc2ac006210 (task_id=0)
[2020/11/12 04:48:18] [debug] [task] Delete chunk 29912-1605156489.339756951.flb of the task with bytes 32, fs_chunks_size after it is 256
Confirmed that I am able to reproduce the issue. It only occurs when there are backlog logs on the container. Troubleshooting it now.
@edsiper
I have provided a patch, #2771. PTAL when available. Thank you!
I've built a new image and I'm testing it:
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967257035.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967302301.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967358640.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967404287.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967452877.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967509009.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967557220.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:15:37] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610014537.967604588.flb in plugin kafka.0
[2021/01/07 10:15:37] [error] [input chunk] no available chunk
[2021/01/07 10:51:00] [debug] [input chunk] chunk 1-1610016660.50582922.flb required 0 bytes and -54 bytes left in plugin kafka.0
[2021/01/07 10:51:00] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1610016660.50582922.flb in plugin kafka.0
[2021/01/07 10:51:00] [error] [input chunk] no available chunk