Fluent-bit: [Loki] Out-of-order errors in Fluent Bit 1.6.3 on retries

Created on 5 Nov 2020 · 3 comments · Source: fluent/fluent-bit

First of all, thank you for the native Loki support in Fluent Bit. However,
there is still a problem with the retry mechanism that causes chunks to be delivered out of order.
As I understand it, chunks are created based on input tags. If there is a network error,
or the receiving side is busy, the chunks go into a retry queue with an exponential backoff mechanism.
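To make the ordering hazard concrete, here is a minimal sketch (hypothetical Python, not Fluent Bit's actual scheduler code) of how independent, randomized exponential backoff timers decouple the order in which chunks are finally delivered from the order in which they were created:

    import heapq
    import random

    def retry_delay(attempt, base=2):
        # randomized exponential backoff, as a stand-in for the engine's scheduler
        return random.uniform(1, base ** attempt)

    chunks = ["chunk-1", "chunk-2", "chunk-3"]   # created in this order
    events = []                                  # (next_retry_time, chunk)

    # all three chunks fail their first flush while Loki is down
    for i, chunk in enumerate(chunks):
        heapq.heappush(events, (i + retry_delay(1), chunk))

    delivered = []
    while events:
        now, chunk = heapq.heappop(events)
        if random.random() < 0.5:    # flush fails again, reschedule
            heapq.heappush(events, (now + retry_delay(2), chunk))
        else:                        # Loki is back, flush succeeds
            delivered.append(chunk)

    print(delivered)  # frequently not ["chunk-1", "chunk-2", "chunk-3"]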

If you follow the scenario below, you can reproduce the behavior.

  1. Set up Fluent Bit and Loki. Everything works so far.
  2. Have at least one Pod writing a lot of logs, so that chunks accumulate during the downtime.
  3. Make Loki unavailable (e.g., scale it down to 0 replicas).
  4. The chunks go into the retry queue, as the following logs demonstrate:
[2020/11/04 12:46:53] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:54] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:55] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:56] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:47:27] [error] [http_client] broken connection to one-eye-loki.default.svc:3100 ?
[2020/11/04 12:47:27] [error] [output:loki:loki.1] could not flush records to one-eye-loki.default.svc:3100 (http_do=-1)
[2020/11/04 12:47:27] [ warn] [engine] failed to flush chunk '1-1604494016.580698566.flb', retry in 10 seconds: task_id=0, input=tail.0 > output=loki.1
[2020/11/04 12:49:36] [error] [io] connection #65 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:49:36] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:51:47] [error] [io] connection #69 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:51:47] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:51:47] [ warn] [engine] failed to flush chunk '1-1604494047.217812847.flb', retry in 7 seconds: task_id=1, input=tail.0 > output=loki.1
[2020/11/04 12:51:47] [ warn] [engine] failed to flush chunk '1-1604494016.580698566.flb', retry in 74 seconds: task_id=0, input=tail.0 > output=loki.1
[2020/11/04 12:53:58] [error] [io] connection #65 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:53:58] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:56:10] [error] [io] connection #69 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:56:10] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:56:10] [ warn] [engine] failed to flush chunk '1-1604494307.884198524.flb', retry in 11 seconds: task_id=2, input=tail.0 > output=loki.1
[2020/11/04 12:56:10] [ warn] [engine] failed to flush chunk '1-1604494047.217812847.flb', retry in 87 seconds: task_id=1, input=tail.0 > output=loki.1
[2020/11/04 12:56:10] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:10] [ info] [engine] flush chunk '1-1604494016.580698566.flb' succeeded at retry 2: task_id=3, input=tail.0 > output=loki.1
[2020/11/04 12:56:11] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:12] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:13] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:14] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:15] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
  5. When Loki comes back up, the chunks are sent out of order:

1-1604494016.580698566.flb (in queue)
1-1604494047.217812847.flb (in queue)
1-1604494307.884198524.flb (in queue)

1-1604494016.580698566.flb (retry succeeded) (2nd retry)
1-1604494307.884198524.flb (retry succeeded) (1st retry)
1-1604494047.217812847.flb (failed, goes back to queue) (2nd retry)

  6. The older chunk is now permanently out of order from Loki's point of view, so it will stay in the buffer forever (its timestamp is decoded below):
drwxr-xr-x    3 root     root          4096 Nov  3 12:49 ..
-rw-------    1 root     root        250541 Nov  4 12:47 1-1604494047.217812847.flb
-rw-------    1 root     root          4096 Nov  4 13:31 1-1604496685.581031644.flb
-rw-------    1 root     root          4096 Nov  4 13:31 1-1604496685.604853202.flb
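The reordering is also visible in the chunk file names themselves, which appear to encode the creation time as seconds.nanoseconds; decoding them (a throwaway Python snippet) confirms that the chunk stuck in the buffer is older than one that was already delivered:

    from datetime import datetime, timezone

    # chunk names appear to embed the creation time as <seconds>.<nanoseconds>
    for name in ["1-1604494016.580698566.flb",
                 "1-1604494047.217812847.flb",
                 "1-1604494307.884198524.flb"]:
        secs = int(name.split("-")[1].split(".")[0])
        print(name, datetime.fromtimestamp(secs, tz=timezone.utc))

    # 1-1604494016... 2020-11-04 12:46:56+00:00  (delivered at 2nd retry)
    # 1-1604494047... 2020-11-04 12:47:27+00:00  (stuck, yet older than the chunk below)
    # 1-1604494307... 2020-11-04 12:51:47+00:00  (delivered at 1st retry)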

Fluent Bit configuration

 [SERVICE]
     Flush        1
     Daemon       Off
     Log_Level    info
     Parsers_File parsers.conf
     HTTP_Server  On
     HTTP_Listen  0.0.0.0
     HTTP_Port    2020
     storage.backlog.mem_limit  256KB
     storage.path  /buffers

 [INPUT]
     Name         tail
     DB  /tail-db/tail-containers-state.db
     Mem_Buf_Limit  256KB
     Parser  cri
     Path  /var/log/containers/*.log
     Refresh_Interval  5
     Skip_Long_Lines  On
     Tag  kubernetes.*
     storage.type  filesystem

 [FILTER]
     Name        kubernetes
     Kube_CA_File  /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     Kube_Tag_Prefix  kubernetes.var.log.containers
     Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
     Kube_URL  https://kubernetes.default.svc:443
     Match  kubernetes.*
     Merge_Log  On

 [OUTPUT]
     Name          forward
     Match         *
     Host          one-eye-fluentd.default.svc
     Port          24240

     Retry_Limit   False

 [OUTPUT]
     name                   loki
     match                  *
     host                   one-eye-loki.default.svc
     port                   3100
     labels                 job=fluentbit
     auto_kubernetes_labels off
     label_keys             $kubernetes['host'], $kubernetes['pod_name'], $kubernetes['container_name']
     Retry_Limit   False
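As a stop-gap (my own workaround suggestion, not a fix for the ordering itself), replacing Retry_Limit False with a finite value in the loki output would make the engine eventually discard a chunk that Loki keeps rejecting, instead of keeping it in the filesystem buffer forever:

 [OUTPUT]
     name          loki
     match         *
     host          one-eye-loki.default.svc
     port          3100
     # give up after 5 failed flushes instead of retrying forever
     Retry_Limit   5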
Label: troubleshooting

All 3 comments

@tarokkk

thanks for raising this issue.

Yeah, the retry logic is the problem here; I will work on a solution.

If we can help with this, just let me know.

@edsiper We still see this issue with fluent-bit 1.6.6 & Loki 2:

[2020/11/27 03:50:34] [error] [output:loki:loki.1] loki.loki.svc.cluster.local:3100, HTTP status=400
entry with timestamp 2020-11-27 03:50:21.883715355 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="\"fluent-bit\"", kubernetes_host="ip-10-2-2-208.us-west-2.compute.internal", namespace="XXXXX", pod="master-XXXXXXweb-88fb66984-86fgd"},
total ignored: 1 out of 1
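For what it's worth, the rejection is easy to reproduce directly against Loki's push API (a sketch assuming a Loki instance reachable at localhost:3100): pushing two entries for the same stream with decreasing timestamps makes the second request fail with HTTP 400 and the same 'entry out of order' reason:

    import json
    import time
    import urllib.request

    LOKI_URL = "http://localhost:3100/loki/api/v1/push"  # assumed local Loki

    def push(ts_ns, line):
        body = json.dumps({
            "streams": [{
                "stream": {"job": "fluentbit"},
                "values": [[str(ts_ns), line]],
            }]
        }).encode()
        req = urllib.request.Request(
            LOKI_URL, data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

    now_ns = time.time_ns()
    print(push(now_ns, "newer entry"))   # 204 No Content
    push(now_ns - 10**9, "older entry")  # raises HTTPError: 400, entry out of order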