Describe the bug
We use Fluent Bit to forward logs to a centralized location. When writing 100,000 lines, all of them arrive, but they arrive out of order (apparently reordered on the sender side?).
To Reproduce
On the sender:
$ for n in $(seq 1 100000); do echo "$n qqq" >> lines; done
$ wc -l lines
100000 lines
$ grep '1234 qqq' lines
1234 qqq
11234 qqq
21234 qqq
31234 qqq
41234 qqq
51234 qqq
61234 qqq
71234 qqq
81234 qqq
91234 qqq
On the receiving machine:
$ wc -l lines
100000 lines
$ grep '1234 qqq' lines
1234 qqq
11234 qqq
71234 qqq
81234 qqq
51234 qqq
61234 qqq
41234 qqq
21234 qqq
31234 qqq
91234 qqq
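A quick whole-file check, rather than sampling with grep: since every line begins with its sequence number, sort can verify the order and report the first out-of-order line:
$ sort -n -c lines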
Expected behavior
I expected the lines to arrive in the same order as they were written.
Your Environment
$ curl -s 127.0.0.1:2020 | jq
{
  "fluent-bit": {
    "version": "1.3.2",
    "edition": "Community",
    "flags": [
      "FLB_HAVE_PARSER",
      "FLB_HAVE_RECORD_ACCESSOR",
      "FLB_HAVE_STREAM_PROCESSOR",
      "FLB_HAVE_TLS",
      "FLB_HAVE_SQLDB",
      "FLB_HAVE_TRACE",
      "FLB_HAVE_METRICS",
      "FLB_HAVE_HTTP_SERVER",
      "FLB_HAVE_SYSTEMD",
      "FLB_HAVE_VALGRIND",
      "FLB_HAVE_FORK",
      "FLB_HAVE_GMTOFF",
      "FLB_HAVE_UNIX_SOCKET",
      "FLB_IS_TD_AGENT",
      "FLB_HAVE_PROXY_GO",
      "FLB_HAVE_SYSTEM_STRPTIME",
      "FLB_HAVE_JEMALLOC",
      "FLB_HAVE_LIBBACKTRACE",
      "FLB_HAVE_REGEX",
      "FLB_HAVE_LUAJIT",
      "FLB_HAVE_C_TLS",
      "FLB_HAVE_ACCEPT4",
      "FLB_HAVE_INOTIFY"
    ]
  }
}
[SERVICE]
    Flush                     5
    Daemon                    Off
    Log_Level                 info
    Parsers_File              parsers.conf
    Config_Watch              On
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.backlog.mem_limit 50M

[INPUT]
    Name              tail
    Tag               traces
    Multiline         On
    Skip_Long_Lines   On
    Parser_Firstline  traces
    Path_Key          filename
    Path              /path/to/all/logs/*/*
    DB                /var/log/.logs.db
    Mem_Buf_Limit     100MB
    Buffer_Chunk_Size 128k
    Buffer_Max_Size   16384k
    Refresh_Interval  5
    Exclude_Path      *.gz,*.zip,*.backup

[FILTER]
    Name   record_modifier
    Match  *
    Record var1 ${VAR1}
    Record var2 ${VAR2}

[OUTPUT]
    Name        forward
    Match       *
    Host        fluentd
    Port        60399
    Retry_Limit 5
Debug log from the sender:
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=70 lines=10
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=738 lines=92
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=9362 lines=1026
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=6050 lines=605
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10270 lines=1027
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10020 lines=1002
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10170 lines=1017
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=20320 lines=2032
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10670 lines=1067
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10400 lines=1040
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10070 lines=1007
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11200 lines=1025
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11528 lines=1048
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11473 lines=1043
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11693 lines=1063
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11297 lines=1027
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11165 lines=1015
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11550 lines=1050
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11517 lines=1047
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11451 lines=1041
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11286 lines=1026
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11275 lines=1025
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11231 lines=1021
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11231 lines=1021
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=12012 lines=1092
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=12111 lines=1101
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11935 lines=1085
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=10318 lines=938
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11187 lines=1017
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=23045 lines=2095
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=25531 lines=2321
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=14234 lines=1294
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11781 lines=1071
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11539 lines=1049
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=21340 lines=1940
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11990 lines=1090
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11660 lines=1060
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11473 lines=1043
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11550 lines=1050
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11836 lines=1076
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11561 lines=1051
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11792 lines=1072
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11374 lines=1034
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11572 lines=1052
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11396 lines=1036
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11561 lines=1051
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11572 lines=1052
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11319 lines=1029
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11429 lines=1039
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11638 lines=1058
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11099 lines=1009
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=22682 lines=2062
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=12210 lines=1110
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11748 lines=1068
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11693 lines=1063
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11671 lines=1061
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11715 lines=1065
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11671 lines=1061
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11407 lines=1037
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11275 lines=1025
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11286 lines=1026
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11341 lines=1031
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11693 lines=1063
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11476 lines=1043
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11723 lines=1066
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11517 lines=1047
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11517 lines=1047
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=12089 lines=1099
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11946 lines=1086
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11682 lines=1062
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11352 lines=1032
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11583 lines=1053
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11737 lines=1067
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=22704 lines=2064
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=12793 lines=1163
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11605 lines=1055
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11396 lines=1036
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11638 lines=1058
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11836 lines=1076
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11825 lines=1075
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11693 lines=1063
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11682 lines=1062
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11572 lines=1052
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11880 lines=1080
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11715 lines=1065
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11715 lines=1065
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=12001 lines=1091
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11528 lines=1048
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11858 lines=1078
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11913 lines=1083
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=11550 lines=1050
[2019/11/18 12:47:22] [debug] [in_tail] file=lines event
[2019/11/18 12:47:22] [debug] [in_tail] file=lines read=3092 lines=281
Additional context
People reading log files expect the lines to arrive in order.
Can you check if this problem is specific to the forward output plugin?
How would I go about doing that?
Checked with the stdout plugin:

[OUTPUT]
    Name  stdout
    Match *

The ordering seems to be preserved:
$ /opt/td-agent-bit/bin/td-agent-bit -c /etc/fluentbit/fluentbit.conf | grep '1234 qqq'
[1253] traces: [1574231224.589101536, {"filename"=>"/path/to/the/lines", "log"=>"1234 qqq", "var1"=>"1", "var2"=>"2"}]
[11274] traces: [1574231224.853998168, {"filename"=>"/path/to/the/lines", "log"=>"11234 qqq", "var1"=>"1", "var2"=>"2"}]
[4598] traces: [1574231224.943185502, {"filename"=>"/path/to/the/lines", "log"=>"21234 qqq", "var1"=>"1", "var2"=>"2"}]
[14598] traces: [1574231225.053945177, {"filename"=>"/path/to/the/lines", "log"=>"31234 qqq", "var1"=>"1", "var2"=>"2"}]
[7949] traces: [1574231225.214258257, {"filename"=>"/path/to/the/lines", "log"=>"41234 qqq", "var1"=>"1", "var2"=>"2"}]
[1306] traces: [1574231225.348539263, {"filename"=>"/path/to/the/lines", "log"=>"51234 qqq", "var1"=>"1", "var2"=>"2"}]
[11306] traces: [1574231225.445014308, {"filename"=>"/path/to/the/lines", "log"=>"61234 qqq", "var1"=>"1", "var2"=>"2"}]
[4665] traces: [1574231225.635418289, {"filename"=>"/path/to/the/lines", "log"=>"71234 qqq", "var1"=>"1", "var2"=>"2"}]
[14665] traces: [1574231225.711601283, {"filename"=>"/path/to/the/lines", "log"=>"81234 qqq", "var1"=>"1", "var2"=>"2"}]
[9150] traces: [1574231226.829944456, {"filename"=>"/path/to/the/lines", "log"=>"91234 qqq", "var1"=>"1", "var2"=>"2"}]
I don't have much familiarity with fluentd, but my broad impression from other systems such as SumoLogic is that at scale, timestamps are the organising principle for data fragments, not order of creation or order of delivery. SumoLogic, for example, typically supports using either "source" timestamps or "arrival" timestamps. If there is any buffering, the "source" timestamps may be more accurate, but data from multiple sources then depends on those various clocks being synchronised, using the same timezone, and so on.
@eplodn may I ask what your fluentd config looks like?
@eplodn also, what do you see if you use a 1MB Mem_Buf_Limit? In my measurements, if Fluent Bit sends more data than fluentd can write out, there are issues just like the ones you reported.
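For reference, a minimal sketch of that test (untested; only Mem_Buf_Limit differs from the reporter's [INPUT] section above):

[INPUT]
    Name          tail
    Tag           traces
    Path          /path/to/all/logs/*/*
    Mem_Buf_Limit 1MB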
Order is not guaranteed when multiple chunks of data are generated (each chunk holds N records): if an output has to retry a chunk at flush/delivery time, that chunk will arrive later. Forcing an order would slow things down and add backpressure.
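A hypothetical timeline of that retry scenario (illustrative, not taken from the logs above):

t0  chunk A (records 1-1000) is flushed; delivery fails; a retry is scheduled
t1  chunk B (records 1001-2000) is flushed and delivered
t2  the retry of chunk A succeeds

The receiver then stores records 1001-2000 before records 1-1000.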
@edsiper is this similar to flush threads in fluentd? Can this behaviour be controlled?
With normal fluentd you can use the buffer chunk key configuration to buffer logs by timestamp (and/or by source) before sending, so you can get some level of ordering depending on how long you wait for logs: https://docs.fluentd.org/configuration/buffer-section#time
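For example, something along these lines in a fluentd output (a sketch based on that buffer documentation; the host and port are placeholders):

<match **>
  @type forward
  <buffer time>
    timekey      60   # group records into 1-minute chunks by their timestamp
    timekey_wait 10s  # wait for late records before flushing each chunk
  </buffer>
  <server>
    host fluentd-upstream
    port 24224
  </server>
</match>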
I'm currently investigating how to deal with this also (as seen in my most recent commit in https://github.com/grafana/loki/issues/898#issuecomment-567722481)
@edsiper there are situations where order is more important than throughput. Is there a flag to control this (i.e. disable concurrent chunk processing) for a given output?
There is no such control at the moment, but there is a plan to implement it (Loki needs it).
@edsiper is this now implemented with the latest version?