First of all, thank you for the native Loki support in Fluent Bit. However, there is still a problem with the retry mechanism that causes chunks to arrive out of order.

As I understand it, chunks are created based on input tags. If there is a network error, or the receiving side is busy, the chunks go to a retry queue with an exponential backoff mechanism. If you follow the same scenario, you can reproduce the behavior shown in the logs below.
[2020/11/04 12:46:53] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:54] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:55] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:46:56] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:47:27] [error] [http_client] broken connection to one-eye-loki.default.svc:3100 ?
[2020/11/04 12:47:27] [error] [output:loki:loki.1] could not flush records to one-eye-loki.default.svc:3100 (http_do=-1)
[2020/11/04 12:47:27] [ warn] [engine] failed to flush chunk '1-1604494016.580698566.flb', retry in 10 seconds: task_id=0, input=tail.0 > output=loki.1
[2020/11/04 12:49:36] [error] [io] connection #65 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:49:36] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:51:47] [error] [io] connection #69 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:51:47] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:51:47] [ warn] [engine] failed to flush chunk '1-1604494047.217812847.flb', retry in 7 seconds: task_id=1, input=tail.0 > output=loki.1
[2020/11/04 12:51:47] [ warn] [engine] failed to flush chunk '1-1604494016.580698566.flb', retry in 74 seconds: task_id=0, input=tail.0 > output=loki.1
[2020/11/04 12:53:58] [error] [io] connection #65 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:53:58] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:56:10] [error] [io] connection #69 failed to: one-eye-loki.default.svc:3100
[2020/11/04 12:56:10] [error] [output:loki:loki.1] no upstream connections available
[2020/11/04 12:56:10] [ warn] [engine] failed to flush chunk '1-1604494307.884198524.flb', retry in 11 seconds: task_id=2, input=tail.0 > output=loki.1
[2020/11/04 12:56:10] [ warn] [engine] failed to flush chunk '1-1604494047.217812847.flb', retry in 87 seconds: task_id=1, input=tail.0 > output=loki.1
[2020/11/04 12:56:10] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:10] [ info] [engine] flush chunk '1-1604494016.580698566.flb' succeeded at retry 2: task_id=3, input=tail.0 > output=loki.1
[2020/11/04 12:56:11] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:12] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:13] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:14] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
[2020/11/04 12:56:15] [ info] [output:loki:loki.1] one-eye-loki.default.svc:3100, HTTP status=204
To summarize the chunk lifecycle from the log above:

1-1604494016.580698566.flb (in queue)
1-1604494047.217812847.flb (in queue)
1-1604494307.884198524.flb (in queue)
1-1604494016.580698566.flb (retry succeeded on 2nd retry)
1-1604494307.884198524.flb (retry succeeded on 1st retry)
1-1604494047.217812847.flb (failed, went back to the queue after its 2nd retry)

The middle chunk is still pending while a newer chunk has already been delivered, so whenever it eventually succeeds, its entries will arrive out of order.
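To make this concrete, here is a minimal, self-contained Go sketch of the failure mode (this is not Fluent Bit's actual scheduler, and the chunk names and failure counts are invented for illustration): because every chunk retries on its own exponential-backoff schedule, a newer chunk can finish before an older one.

// Sketch only: three chunks are created in order, each retries
// independently with exponential backoff, and the completion order
// ends up different from the creation order.
package main

import (
	"fmt"
	"sync"
	"time"
)

// chunk models a buffered chunk with its own retry counter.
type chunk struct {
	id       string
	failures int // how many flush attempts fail before success
}

func main() {
	// Failure counts loosely mirror the retry counts in the summary above.
	chunks := []chunk{
		{"chunk-0", 2},
		{"chunk-1", 3},
		{"chunk-2", 1},
	}

	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c chunk) { // each chunk retries on its own schedule
			defer wg.Done()
			backoff := 100 * time.Millisecond
			for attempt := 0; ; attempt++ {
				if attempt >= c.failures { // flush finally succeeds
					fmt.Printf("%s flushed at retry %d\n", c.id, attempt)
					return
				}
				time.Sleep(backoff) // wait, then try again
				backoff *= 2        // exponential backoff
			}
		}(c)
	}
	wg.Wait()
	// Typical output: chunk-2 flushes before chunk-0 and chunk-1,
	// so the stream reaches Loki out of chronological order.
}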
The failed chunk is still sitting in the filesystem buffer directory:

drwxr-xr-x 3 root root 4096 Nov 3 12:49 ..
-rw------- 1 root root 250541 Nov 4 12:47 1-1604494047.217812847.flb
-rw------- 1 root root 4096 Nov 4 13:31 1-1604496685.581031644.flb
-rw------- 1 root root 4096 Nov 4 13:31 1-1604496685.604853202.flb
Fluent Bit configuration (including the Loki output):
[SERVICE]
    Flush 1
    Daemon Off
    Log_Level info
    Parsers_File parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    storage.backlog.mem_limit 256KB
    storage.path /buffers

[INPUT]
    Name tail
    DB /tail-db/tail-containers-state.db
    Mem_Buf_Limit 256KB
    Parser cri
    Path /var/log/containers/*.log
    Refresh_Interval 5
    Skip_Long_Lines On
    Tag kubernetes.*
    storage.type filesystem

[FILTER]
    Name kubernetes
    Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Tag_Prefix kubernetes.var.log.containers
    Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_URL https://kubernetes.default.svc:443
    Match kubernetes.*
    Merge_Log On

[OUTPUT]
    Name forward
    Match *
    Host one-eye-fluentd.default.svc
    Port 24240
    Retry_Limit False

[OUTPUT]
    name loki
    match *
    host one-eye-loki.default.svc
    port 3100
    labels job=fluentbit
    auto_kubernetes_labels off
    label_keys $kubernetes['host'], $kubernetes['pod_name'], $kubernetes['container_name']
    Retry_Limit False
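Note that Retry_Limit False means there is no cap on retries, so an old chunk can keep retrying and land arbitrarily long after newer chunks have been delivered. Since the config also enables the built-in HTTP server on port 2020, the retry counters can be watched while reproducing. A small Go sketch of that (the /api/v1/metrics route is Fluent Bit's standard JSON metrics endpoint; the host and port here are assumptions to adjust for your deployment):

// Watch Fluent Bit's retry counters while reproducing the issue.
// Assumes the built-in HTTP server enabled above (HTTP_Port 2020).
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:2020/api/v1/metrics")
		if err != nil {
			fmt.Println("metrics fetch failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// The JSON includes per-output counters such as
			// "retries" and "retries_failed" for loki.1.
			fmt.Println(string(body))
		}
		time.Sleep(5 * time.Second)
	}
}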
@tarokkk
Thanks for raising this issue.
Yeah, the retry logic is the problem here; I will work on a solution.

If we can help with this, just let me know.
@edsiper We still see this issue with fluent-bit 1.6.6 and Loki 2:
[2020/11/27 03:50:34] [error] [output:loki:loki.1] loki.loki.svc.cluster.local:3100, HTTP status=400
entry with timestamp 2020-11-27 03:50:21.883715355 +0000 UTC ignored, reason: 'entry out of order' for stream: {job="\"fluent-bit\"", kubernetes_host="ip-10-2-2-208.us-west-2.compute.internal", namespace="XXXXX", pod="master-XXXXXXweb-88fb66984-86fgd"},
total ignored: 1 out of 1
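That 400 comes from Loki's per-stream ordering rule: the ingester remembers the newest timestamp it has accepted for each stream and rejects anything older, which is exactly what a retried chunk delivers. A rough illustrative sketch of that rule (not Loki's actual code; the types and names are invented):

// Illustrative sketch of Loki's per-stream ordering check: each
// stream remembers the timestamp of the last accepted entry, and
// anything older is rejected as "entry out of order".
package main

import (
	"fmt"
	"time"
)

// stream tracks the newest accepted timestamp for one label set.
type stream struct {
	labels   string
	lastSeen time.Time
}

// push accepts an entry only if it does not move backwards in time.
func (s *stream) push(ts time.Time, line string) error {
	if ts.Before(s.lastSeen) {
		return fmt.Errorf("entry with timestamp %s ignored, reason: 'entry out of order' for stream: %s", ts, s.labels)
	}
	s.lastSeen = ts
	fmt.Printf("accepted %q at %s\n", line, ts)
	return nil
}

func main() {
	s := &stream{labels: `{job="fluentbit"}`}
	now := time.Now()

	// A fresh chunk is delivered first on its initial attempt...
	_ = s.push(now, "new chunk, delivered on first try")

	// ...then an older chunk finally succeeds after backoff and is rejected.
	if err := s.push(now.Add(-2*time.Minute), "old chunk, delivered after retries"); err != nil {
		fmt.Println(err)
	}
}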