Fluent-bit: [Loki] Out-of-order errors in Fluent Bit 1.6.3 on retries

Created on 5 Nov 2020 · 3 comments · Source: fluent/fluent-bit

First of all, thank you for the native Loki support in Fluent Bit. However,
there is still a problem with the retry mechanism that causes chunks to be delivered out of order.
As I understand it, chunks are created based on input tags. If there is a network error,
or the receiving side is busy, the chunks go into a retry queue with an exponential backoff mechanism.
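To make the ordering hazard concrete, here is a minimal sketch (hypothetical Python, not Fluent Bit's actual scheduler code) of how independent, randomized exponential backoff timers decouple the order in which chunks are finally delivered from the order in which they were created:

    import heapq
    import random

    def retry_delay(attempt, base=2):
        # randomized exponential backoff, as a stand-in for the engine's scheduler
        return random.uniform(1, base ** attempt)

    chunks = ["chunk-1", "chunk-2", "chunk-3"]   # created in this order
    events = []                                  # (next_retry_time, chunk)

    # all three chunks fail their first flush while Loki is down
    for i, chunk in enumerate(chunks):
        heapq.heappush(events, (i + retry_delay(1), chunk))

    delivered = []
    while events:
        now, chunk = heapq.heappop(events)
        if random.random() < 0.5:    # flush fails again, reschedule
            heapq.heappush(events, (now + retry_delay(2), chunk))
        else:                        # Loki is back, flush succeeds
            delivered.append(chunk)

    print(delivered)  # frequently not ["chunk-1", "chunk-2", "chunk-3"]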

If you follow the scenario below, you can reproduce the behavior.

  1. Set up Fluent Bit and Loki. Everything works so far.
  2. Have at least one Pod writing a lot of logs, so that chunks accumulate during the downtime.
  3. Make Loki unavailable (e.g., scale it down to 0 replicas).
  4. The chunks go into the retry queue, as the following logs demonstrate:
[2020/11/04 12:46:53] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:54] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:55] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:56] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:47:27] [error] [http_client] broken connection to one-eye-loki.default.svc:3100 ?
[2020/11/04 12:47:27] [error] [output:loki:loki.1] could not flush records to one-eye-loki.default.svc:3100 (http_do=-1)
[2020/11/04 12:47:27] [ warn] [engine] failed to flush chunk '1-1604494016.580698566.flb', retry in 10 seconds: task_id=0, input=tail.0 > output=loki.1
[2020/11/04 12:49:36] [error] [io] connection #65 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:49:36] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:51:47] [error] [io] connection #69 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:51:47] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:51:47] [ warn] [engine] failed to flush chunk '1-1604494047.217812847.flb', retry in 7 seconds: task_id=1, input=tail.0 > output=loki.1
[2020/11/04 12:51:47] [ warn] [engine] failed to flush chunk '1-1604494016.580698566.flb', retry in 74 seconds: task_id=0, input=tail.0 > output=loki.1
[2020/11/04 12:53:58] [error] [io] connection #65 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:53:58] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:56:10] [error] [io] connection #69 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:56:10] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:56:10] [ warn] [engine] failed to flush chunk '1-1604494307.884198524.flb', retry in 11 seconds: task_id=2, input=tail.0 > output=loki.1
[2020/11/04 12:56:10] [ warn] [engine] failed to flush chunk '1-1604494047.217812847.flb', retry in 87 seconds: task_id=1, input=tail.0 > output=loki.1
[2020/11/04 12:56:10] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:10] [ info] [engine] flush chunk '1-1604494016.580698566.flb' succeeded at retry 2: task_id=3, input=tail.0 > output=loki.1
[2020/11/04 12:56:11] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:12] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:13] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:14] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:15] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
  5. When Loki comes back up, the chunks are sent out of order:

1-1604494016.580698566.flb (in queue)
1-1604494047.217812847.flb (in queue)
1-1604494307.884198524.flb (in queue)

1-1604494016.580698566.flb (retry succeeded) (2nd retry)
1-1604494307.884198524.flb (retry succeeded) (1st retry)
1-1604494047.217812847.flb (failed, goes back to queue) (2nd retry)

  6. The older chunk is now permanently out of order from Loki's point of view, so it will stay in the buffer forever (its timestamp is decoded below):
drwxr-xr-x    3 root     root          4096 Nov  3 12:49 ..
-rw-------    1 root     root        250541 Nov  4 12:47 1-1604494047.217812847.flb
-rw-------    1 root     root          4096 Nov  4 13:31 1-1604496685.581031644.flb
-rw-------    1 root     root          4096 Nov  4 13:31 1-1604496685.604853202.flb
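The reordering is also visible in the chunk file names themselves, which appear to encode the creation time as seconds.nanoseconds; decoding them (a throwaway Python snippet) confirms that the chunk stuck in the buffer is older than one that was already delivered:

    from datetime import datetime, timezone

    # chunk names appear to embed the creation time as <seconds>.<nanoseconds>
    for name in ["1-1604494016.580698566.flb",
                 "1-1604494047.217812847.flb",
                 "1-1604494307.884198524.flb"]:
        secs = int(name.split("-")[1].split(".")[0])
        print(name, datetime.fromtimestamp(secs, tz=timezone.utc))

    # 1-1604494016... 2020-11-04 12:46:56+00:00  (delivered at 2nd retry)
    # 1-1604494047... 2020-11-04 12:47:27+00:00  (stuck, yet older than the chunk below)
    # 1-1604494307... 2020-11-04 12:51:47+00:00  (delivered at 1st retry)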

Fluent Bit configuration

 [SERVICE]
     Flush        1
     Daemon       Off
     Log_Level    info
     Parsers_File parsers.conf
     HTTP_Server  On
     HTTP_Listen  0.0.0.0
     HTTP_Port    2020
     storage.backlog.mem_limit  256KB
     storage.path  /buffers

 [INPUT]
     Name         tail
     DB  /tail-db/tail-containers-state.db
     Mem_Buf_Limit  256KB
     Parser  cri
     Path  /var/log/containers/*.log
     Refresh_Interval  5
     Skip_Long_Lines  On
     Tag  kubernetes.*
     storage.type  filesystem

 [FILTER]
     Name        kubernetes
     Kube_CA_File  /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     Kube_Tag_Prefix  kubernetes.var.log.containers
     Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
     Kube_URL  https://kubernetes.default.svc:443
     Match  kubernetes.*
     Merge_Log  On

 [OUTPUT]
     Name          forward
     Match         *
     Host          one-eye-fluentd.default.svc
     Port          24240

     Retry_Limit   False

 [OUTPUT]
     name                   loki
     match                  *
     host                   one-eye-loki.default.svc
     port                   3100
     labels                 job=fluentbit
     auto_kubernetes_labels off
     label_keys             $kubernetes['host'], $kubernetes['pod_name'], $kubernetes['container_name']
     Retry_Limit   False
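As a stop-gap (my own workaround suggestion, not a fix for the ordering itself), replacing Retry_Limit False with a finite value in the loki output would make the engine eventually discard a chunk that Loki keeps rejecting, instead of keeping it in the filesystem buffer forever:

 [OUTPUT]
     name          loki
     match         *
     host          one-eye-loki.default.svc
     port          3100
     # give up after 5 failed flushes instead of retrying forever
     Retry_Limit   5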
Label: troubleshooting

All 3 comments

@tarokkk

thanks for raising this issue.

Yeah, the retry logic is the problem here; I will work on a solution.

If we can help with this, just let me know.

@edsiper We still see this issue with fluent-bit 1.6.6 & Loki 2:

[2020/11/27 03:50:34] [error] [output:loki:loki.1] loki.loki.svc.cluster.local:3100, HTTP status=400
entry with timestamp 2020-11-27 03:50:21.883715355 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="\"fluent-bit\"", kubernetes_host="ip-10-2-2-208.us-west-2.compute.internal", namespace="XXXXX", pod="master-XXXXXXweb-88fb66984-86fgd"},
total ignored: 1 out of 1
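For what it's worth, the rejection is easy to reproduce directly against Loki's push API (a sketch assuming a Loki instance reachable at localhost:3100): pushing two entries for the same stream with decreasing timestamps makes the second request fail with HTTP 400 and the same 'entry out of order' reason:

    import json
    import time
    import urllib.request

    LOKI_URL = "http://localhost:3100/loki/api/v1/push"  # assumed local Loki

    def push(ts_ns, line):
        body = json.dumps({
            "streams": [{
                "stream": {"job": "fluentbit"},
                "values": [[str(ts_ns), line]],
            }]
        }).encode()
        req = urllib.request.Request(
            LOKI_URL, data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

    now_ns = time.time_ns()
    print(push(now_ns, "newer entry"))   # 204 No Content
    push(now_ns - 10**9, "older entry")  # raises HTTPError: 400, entry out of order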