Running on Azure Kubernetes Service (Kubernetes v1.11.3) as a DaemonSet using the fluent/fluent-bit:0.14.6 image. The nodes are quite small, with each one running roughly 15 containers that send JSON logs over TCP. The pod memory limit is currently set to 200Mi and fluent-bit keeps hitting this and restarting. Any suggestions? Here is the config:
[SERVICE]
    Flush        5
    Log_Level    info
    Daemon       off
    Parsers_File parsers.conf
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name   tcp
    Listen 0.0.0.0
    Port   5170

[OUTPUT]
    Name  null
    Match *

parsers.conf:

[PARSER]
    Name        json
    Format      json
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z
Is this even excessive memory usage? Are there any recommendations for what the resource limits/requests should be?
If all Pods send around 200MB of data within 5 seconds, yeah, it will be killed.
While Fluent Bit receives data, it does not deliver the logs until the Flush interval expires. My suggestion is to set Flush to 1 (one second) and add a Mem_Buf_Limit option to the TCP input plugin just for protection. You can read more about memory handling here:
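For example, a minimal sketch of that suggestion applied to the config above (the 10MB cap is only an illustrative value - pick something comfortably below the pod's 200Mi limit):

[SERVICE]
    Flush 1

[INPUT]
    Name          tcp
    Listen        0.0.0.0
    Port          5170
    # Illustrative cap; tune it to stay below the container memory limit
    Mem_Buf_Limit 10MB

With Flush 5, everything received in a five-second window sits in memory until the next flush, so a burst close to the 200Mi limit is enough to get the pod OOM-killed; Flush 1 shortens that window and Mem_Buf_Limit caps the input buffer.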
@edsiper It takes about 10-15 minutes for the fluentbit pod to be killed - it just has a nice straight memory graph that looks like it never cleans up any memory:

The drop in memory usage is when the pod gets killed:

I have tried various settings for the Mem_Buf_Limit but none of them make any difference.
Did you try Flush 1?
Yeah - those graphs are with it set to Flush 1
Looks like the issue is with our app - it never closed the TCP connection to fluentbit and instead just reused it for each batch of logs. Now we close the connection after a batch of logs and it has fixed the issue.
@tomstreet I am curious to learn more about the issue. My expectation is that Fluent Bit will protect itself from that scenario. Would you please share some steps to reproduce the problem?
Sure.. so the config is above, and here is the DaemonSet YAML:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit-logging
    kubernetes.io/cluster-service: "true"
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: fluent-bit-logging
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.7
        imagePullPolicy: Always
        ports:
        - containerPort: 2020
        - containerPort: 5170
          hostPort: 5170
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        resources:
          limits:
            cpu: 2
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
Our app is written in C#, and here is a simplified version of the log emitter:
public class Emitter
{
    private TcpClient _client;
    private FluentBitSettings _settings;

    private async Task Connect()
    {
        // Reuse the existing connection if it is still open, otherwise rebuild it.
        if (_client != null)
        {
            if (_client.Connected)
            {
                return;
            }
            _client.Dispose();
            _client = null;
        }
        _client = new TcpClient();
        await _client.ConnectAsync(_settings.Host, _settings.Port);
    }

    private void Disconnect()
    {
        _client?.Dispose();
        _client = null;
    }

    public async Task Emit(byte[] logsBatch)
    {
        try
        {
            await Connect();
            var tcpStream = _client.GetStream();
            await tcpStream.WriteAsync(logsBatch);
            await tcpStream.FlushAsync();
        }
        finally
        {
            // Closing the connection after every batch is the change that
            // fixed the Fluent Bit memory growth.
            Disconnect();
        }
    }
}
If we remove the Disconnect() call in the finally block of the Emit method, the TCP connection is reused without ever being closed between calls to Emit - this is what causes the memory issue in Fluent Bit. Adding it back not only stopped the issue in Fluent Bit but also reduced the memory usage of our own service.
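For context, a minimal caller sketch under the same assumptions: FluentBitSettings is just a host/port holder, Emitter is assumed to have a constructor that stores the settings (the simplified snippet above omits it), and the payload is illustrative newline-delimited JSON matching the "JSON logs over TCP" setup described at the top.

using System.Text;
using System.Threading.Tasks;

// Assumed shape; matches the _settings.Host / _settings.Port usage above.
public class FluentBitSettings
{
    public string Host { get; set; } = "localhost";
    public int Port { get; set; } = 5170;
}

public static class EmitterExample
{
    public static async Task Main()
    {
        // Assumes Emitter has a constructor that stores the settings.
        var emitter = new Emitter(new FluentBitSettings());

        // One JSON event per line, aimed at the TCP input on port 5170.
        var batch = Encoding.UTF8.GetBytes(
            "{\"level\":\"info\",\"message\":\"hello\"}\n" +
            "{\"level\":\"info\",\"message\":\"world\"}\n");

        // Each Emit call connects, writes the batch, then disconnects in the
        // finally block, so Fluent Bit sees a short-lived connection per batch.
        await emitter.Emit(batch);
    }
}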
Similar problem.
We are observing very high memory usage on the Fluent Bit pod - around 10GB. We have not specified a resource limit on the pod for this testing.
kubectl top po -n logging fluent-bit-gmnrt
NAME               CPU(cores)   MEMORY(bytes)
fluent-bit-gmnrt   46m          9861Mi
When Elasticsearch is heavily loaded it returns HTTP 429 errors to Fluent Bit, and Fluent Bit keeps the unsent logs in memory for retry. Fluent Bit retries X times (as configured by the output plugin's Retry_Limit setting), and after X retries it should discard the message. I am not sure whether it is actually discarding the records or keeping them in memory.
Mem_Buf_Limit is also set to 5MB, but Fluent Bit is still using 10GB.
To Reproduce
Start the application and Fluent Bit while Elasticsearch is heavily loaded.
Expected behavior
Once the retry limit is reached, Fluent Bit should not keep the record in memory.
Your Environment
Kubernetes version is v1.12.2
Fluent Bit version 0.14.7
Snippet of the Fluent Bit configuration
[INPUT]
    Name             tail
    Tag              kube.*
    Path             /var/log/containers/*.log
    Parser           docker
    DB               /var/log/flb_kube.db
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    ignore_older     1d

[OUTPUT]
    Name            es
    Match           *
    Host            ${FLUENT_ELASTICSEARCH_HOST}
    Port            ${FLUENT_ELASTICSEARCH_PORT}
    Logstash_Format On
    Retry_Limit     2
    Buffer_Size     False

[FILTER]
    Name       record_modifier
    Match      *
    Remove_key time

[FILTER]
    Name  grep
    Match *
    Regex log [a-zA-Z1-9]*SOME_STRING[a-zA-Z1-9]*