Fluent-bit: Filesystem buffer isn't working as expected / design issues?

Created on 16 Mar 2020  ·  12 comments  ·  Source: fluent/fluent-bit

We're doing a load test at 10K logs per second from syslog and found at least two possible design problems with the filesystem buffer, which (I hope) is supposed to relieve backpressure:

  1. As soon as the input memory limit is hit, each new chunk is immediately put down after a single log record is appended, effectively creating one file per record. This is easily solvable by putting down every locked (full) chunk instead, which bounds the worst-case behavior (a sketch follows this list).
  2. The engine seems to create a new output task as soon as an input chunk arrives, making the memory limit on inputs rather useless when the filesystem buffer is used, since inputs are then never paused. It keeps opening new output connections, and memory keeps growing far beyond the limit (due to busy chunks, i.e. those in the process of being sent to the output).
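
A minimal sketch of the fix for item 1, reusing the chunkio calls from the patch further down this thread (the placement inside the append path is illustrative, not the exact upstream code):

    /* when the input is over its memory limit, put down every chunk
     * that is already locked (full), not just the current one */
    struct mk_list *head;
    struct flb_input_chunk *c;

    if (flb_input_chunk_is_overlimit(in) == FLB_TRUE) {
        mk_list_foreach(head, &in->chunks) {
            c = mk_list_entry(head, struct flb_input_chunk, _head);
            if (cio_chunk_is_locked(c->chunk) &&
                cio_chunk_is_up(c->chunk) == CIO_TRUE) {
                cio_chunk_down(c->chunk);   /* unmap to the filesystem */
            }
        }
    }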

PS: Due to the nature of syslog forwarding, pausing the input doesn't work well; in our tests it always results in massive losses.

Labels: troubleshooting, waiting-for-user

All 12 comments

Can you please share your Fluent Bit configuration and the version you are using?


Version 1.3.9, with this configuration:

[SERVICE]
    Flush 5
    Daemon False
    Log_Level debug
    Parsers_File parsers.conf
    Storage.Path /data/
    Storage.Max_Chunks_Up 1024
    Storage.Backlog.Mem_Limit 32M

[INPUT]
    Name syslog
    Alias syslog_debug
    Parser syslog-rfc5424
    Listen 127.0.0.1
    Port 5140
    Mode tcp
    Mem_Buf_Limit 64M
    Buffer_Chunk_Size 16M
    Buffer_Max_Size 16M
    Storage.Type filesystem

[OUTPUT]
    Name forward
    Alias forward
    Match *
    Host somewhere
    Port 1234
    tls on
    tls.verify off
    KeepAlive on
    KeepAlive_Timeout 60
    Retry_Limit off

The hardcoded input chunk limit of 2048000 bytes has been raised to 6 MB. The first problem can be observed by listing the storage/queue directory: at first we get large files slightly over the chunk limit, then it's all single-record files of around 1-2 KB (we're feeding about 10 GB / several million logs).
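
For reference, a quick way to locate that hardcoded constant in the source tree (the 2048000 value comes from the paragraph above; the exact symbol name varies across versions):

    $ grep -rn 2048000 src/ include/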

The second problem was detected with netstat and with my failed patch that attempts to unload (down) previous chunks rather than the current/last one when the limit is hit. It showed that most of the time all but 1-2 chunks are busy.

At the bottom of flb_input_chunk_append_raw() in src/flb_input_chunk.c:

    if (flb_input_chunk_is_overlimit(in) == FLB_TRUE &&
        si->type == CIO_STORE_FS) {
        if (cio_chunk_is_locked(ic->chunk)) {
            if (cio_chunk_is_up(ic->chunk) == CIO_TRUE) {
                flb_warn("[input chunk] unload current");
                cio_chunk_down(ic->chunk);
            }
        }
        else {
            /* walk the chunk list backwards and put down every idle,
             * in-memory chunk until we are back under the limit */
            struct mk_list *head;
            struct mk_list *tmp;
            struct flb_input_chunk *ic_prev;
            ssize_t prev_chunk_size;
            int i = 0;

            flb_warn("[input chunk] try to unload previous chunks...");
            mk_list_foreach_safe_r(head, tmp, &ic->_head) {
                i -= 1;
                ic_prev = mk_list_entry(head, struct flb_input_chunk, _head);
                flb_warn("[input chunk] walking %i: busy=%i, up=%i",
                         i, ic_prev->busy, cio_chunk_is_up(ic_prev->chunk));
                if (ic_prev->busy == FLB_TRUE) {
                    /* busy chunks have an output task in flight; stop */
                    break;
                }
                if (cio_chunk_is_up(ic_prev->chunk) == CIO_TRUE) {
                    prev_chunk_size = cio_chunk_get_content_size(ic_prev->chunk);
                    in->mem_chunks_size -= prev_chunk_size;
                    flb_warn("[input chunk] unload %i, size=%zd",
                             i, prev_chunk_size);
                    cio_chunk_down(ic_prev->chunk);
                    if (flb_input_chunk_is_overlimit(in) == FLB_FALSE) {
                        flb_warn("[input chunk] finished unloading previous chunks");
                        return 0;
                    }
                }
            }
            flb_warn("[input chunk] unloading previous chunks failed, "
                     "consider increasing the memory buffer limit");
        }
        return 0;
    }
    // flb_input_chunk_protect ... for memory buffer type input

PS: the patch caused corrupted saves. No idea why, but unloading all locked chunks seems to work well enough.

If fluent-bit is restarted after all inputs are finished, the in_storage_backlog and buffer limit do function normally.

I'll try to limit busy chunks in flb_engine_dispatch(); maybe it can help with the second problem. A rough sketch of the idea follows.
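
A rough sketch of what I mean, reusing the structures from the patch above (the placement inside flb_engine_dispatch() and the MAX_BUSY_CHUNKS knob are hypothetical):

    /* hypothetical cap on concurrent output tasks per input */
    #define MAX_BUSY_CHUNKS 8

    static int busy_chunks(struct flb_input_instance *in)
    {
        int n = 0;
        struct mk_list *head;
        struct flb_input_chunk *ic;

        /* count chunks that already have an output task in flight */
        mk_list_foreach(head, &in->chunks) {
            ic = mk_list_entry(head, struct flb_input_chunk, _head);
            if (ic->busy == FLB_TRUE) {
                n++;
            }
        }
        return n;
    }

    /* inside flb_engine_dispatch(), before creating a task for chunk
     * 'ic': skip the chunk for now; it stays queued and is retried on
     * the next flush cycle */
    if (busy_chunks(in) >= MAX_BUSY_CHUNKS) {
        continue;
    }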

I am troubleshooting

troubleshooting

  • I did a test with the null output; with a flush time of 5 seconds things look bad:
$ bin/flb-tcp-writer -c 1 -d ../data/syslog/syslog.log -r 100000 -p `pidof fluent-bit` -o 127.0.0.1:5150

 records   write (b)     write   secs |  % cpu  user (ms)  sys (ms)  Mem (bytes)      Mem
--------  ----------  --------  ----- + ------  ---------  --------  -----------  -------
  100000    18524763    17.67M   1.20 |  19.12        150        80     68616192   65.44M
  100000    18524763    17.67M   1.20 |  20.00        130       110     70975488   67.69M
  100000    18524763    17.67M   1.25 |  17.63        160        60     68857856   65.67M
  100000    18524763    17.67M   1.25 |  18.43        120       110     70635520   67.36M
  100000    18524763    17.67M   1.19 |  17.62        130        80     68538368   65.36M
  100000    18524763    17.67M   1.18 |  19.43        130       100     70590464   67.32M
  100000    18524763    17.67M   1.20 |  15.00        120        60     68493312   65.32M
  100000    18524763    17.67M   1.19 |  20.12        150        90     70778880   67.50M
  100000    18524763    17.67M   1.21 |  19.05        150        80     68681728   65.50M
  100000    18524763    17.67M   1.26 |  18.31        110       120     70909952   67.62M
       0           0       0 b   1.00 |  20.00        120        80     69259264   66.05M
       0           0       0 b   1.00 |  20.00        120        80     71237632   67.94M
       0           0       0 b   1.00 |  22.00        140        80     69664768   66.44M
       0           0       0 b   1.00 |  20.00        150        50     68091904   64.94M
       0           0       0 b   1.00 |  21.00        150        60     70787072   67.51M
       0           0       0 b   1.00 |  17.00        100        70     69214208   66.01M
       0           0       0 b   1.00 |  21.00        120        90     67567616   64.44M
       0           0       0 b   1.00 |  21.00        140        70     70262784   67.01M
       0           0       0 b   1.00 |  21.00        130        80     68616192   65.44M
       0           0       0 b   1.00 |  21.00        150        60     71311360   68.01M
       0           0       0 b   1.00 |  20.00        120        80     69640192   66.41M
       0           0       0 b   1.00 |  20.00        110        90     68038656   64.89M

  • Setting flush to 1 second (less contention in the queue), things go well:
$ bin/flb-tcp-writer -c 1 -d ../data/syslog/syslog.log -r 100000 -p `pidof fluent-bit` -o 127.0.0.1:5150

 records   write (b)     write   secs |  % cpu  user (ms)  sys (ms)  Mem (bytes)      Mem
--------  ----------  --------  ----- + ------  ---------  --------  -----------  -------
  100000    18524763    17.67M   1.20 |  32.62        370        20     10452992    9.97M
  100000    18524763    17.67M   1.20 |  34.17        390        20     10293248    9.82M
  100000    18524763    17.67M   1.24 |  33.07        390        20     10518528   10.03M
  100000    18524763    17.67M   1.21 |  34.68        390        30     10518528   10.03M
  100000    18524763    17.67M   1.21 |  33.80        390        20     10518528   10.03M
  100000    18524763    17.67M   1.22 |  31.86        370        20     10584064   10.09M
  100000    18524763    17.67M   1.21 |  34.65        370        50     10727424   10.23M
  100000    18524763    17.67M   1.22 |  33.72        390        20     10727424   10.23M
  100000    18524763    17.67M   1.18 |  35.59        400        20     10727424   10.23M
  100000    18524763    17.67M   1.22 |  33.61        390        20     10727424   10.23M
       0           0       0 b   1.00 |   0.00          0         0     10727424   10.23M
       0           0       0 b   1.00 |   0.00          0         0     10727424   10.23M
       0           0       0 b   1.00 |   0.00          0         0     10727424   10.23M

- Summary
  - Process     : fluent-bit
  - PID         : 16076
  - Elapsed Time: 12.11 seconds
  - Avg Memory  : 10.12M
  - Avg CPU     : 33.78%
  - Avg Rate    : 11.69M/sec

I will continue troubleshooting contention with a higher flush interval, then I will troubleshoot the behavior with retries and queued tasks.

update

I have made good progress on this; this time I made a couple of improvements for a more intensive scenario:

  • Flush time: 10 seconds
  • Syslog messages, parsed with regex backend
  • Ingest 200000 records (10 times = 2 million total)

Performance report

Report after code improvements

$ bin/flb-tcp-writer -c 1 -d ../data/syslog/syslog.log -r 200000 -p `pidof fluent-bit` -o 127.0.0.1:5150

 records   write (b)     write   secs |  % cpu  user (ms)  sys (ms)  Mem (bytes)      Mem
--------  ----------  --------  ----- + ------  ---------  --------  -----------  -------
  200000    37041308    35.33M   1.60 |  46.93        710        40     55701504   53.12M
  200000    37041308    35.33M   3.08 | 100.01       2950       130     91095040   86.88M
  200000    37041308    35.33M   9.79 |  99.06       9300       400    135757824  129.47M
  200000    37041308    35.33M  11.16 |  99.32      10640       440    164794368  157.16M
  200000    37041308    35.33M  11.03 |  99.84      10620       390    197013504  187.89M
  200000    37041308    35.33M  13.57 |  99.40      12990       500    231346176  220.63M
  200000    37041308    35.33M  15.04 |  97.67      14020       670    261799936  249.67M
  200000    37041308    35.33M  14.23 |  99.91      13730       490    293003264  279.43M
  200000    37041308    35.33M  14.55 |  93.91      13050       610    334270464  318.79M
  200000    37041308    35.33M  13.88 |  99.19      13270       500    368586752  351.51M
       0           0       0 b   1.00 |  99.99        940        60    370651136  353.48M
       0           0       0 b   1.00 |  69.77        640        60    372703232  355.44M
       0           0       0 b   1.00 |  25.00        250         0    375934976  358.52M
       0           0       0 b   1.00 |  58.99        580        10    374755328  357.39M
       0           0       0 b   1.00 |  45.99        380        80     18169856   17.33M
       0           0       0 b   1.00 |   0.00          0         0     18169856   17.33M
       0           0       0 b   1.00 |   0.00          0         0     18169856   17.33M
       0           0       0 b   1.00 |   0.00          0         0     18169856   17.33M

- Summary
  - Process     : fluent-bit
  - PID         : 7935
  - Elapsed Time: 112.93 seconds
  - Avg Memory  : 196.04M
  - Avg CPU     : 82.33%
  - Avg Rate    : 3.05M/sec

new dump report

To make it easier to troubleshoot the internals, I've implemented a simple dump that can be triggered with the SIGCONT signal (similar to Fluentd); sending the signal, for example:
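
    $ kill -CONT `pidof fluent-bit`

produces this output: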

===== Input =====
syslog_debug (syslog)
|- status
   |- overlimit     : yes
      |- mem size   : 286.6M (300553227 bytes)
      |- mem limit  : 61.0M (64000000 bytes)

|- tasks
   |- total tasks   : 143
   |- status
      |- new        : 0
      |- running    : 143
   |- coroutines    : 143
|- chunks
   |- total chunks  : 150
      |- up chunks  : 150
      |- down chunks: 0
      |- busy chunks: 143
         |- size    : 273.0M (286235640 bytes)
         |- size err: 0

===== Storage Layer =====
total chunks     : 150
|- mem chunks    : 0
|- fs chunks     : 150
   |- up         : 150
   |- down       : 0

status

The code improvements are not yet on GitHub; I will make sure to push the changes to a special branch so you can play with them.

Definitely, another improvement is to handle the scenario where an incoming data chunk is bigger than our chunk capacity; I will work on that tomorrow (a sketch of one possible shape follows below).

Note: before this improvement we would usually see more than 100k small files; that is now fixed.
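
One possible shape for that oversized-chunk handling (purely a hedged sketch of one option, not the actual implementation; flb_input_chunk_append_raw() is the real append entry point, while CHUNK_CAP and the helper below are hypothetical):

    /* hypothetical helper: split an oversized payload into chunk-sized
     * slices before appending, so no single chunk exceeds its capacity */
    #define CHUNK_CAP (2 * 1024 * 1024)

    static int append_sliced(struct flb_input_instance *in,
                             const char *tag, int tag_len,
                             const char *buf, size_t size)
    {
        size_t off = 0;
        size_t len;

        while (off < size) {
            len = size - off;
            if (len > CHUNK_CAP) {
                len = CHUNK_CAP;
            }
            /* NOTE: raw msgpack may only be sliced on record boundaries;
             * a real fix must compute 'len' per record, not byte-wise */
            if (flb_input_chunk_append_raw(in, tag, tag_len,
                                           buf + off, len) != 0) {
                return -1;
            }
            off += len;
        }
        return 0;
    }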

Here's my latest patch for throttling concurrent output tasks and unloading locked chunks immediately, to try to keep the number of up chunks low: https://gist.github.com/AqD/6fce5fc0c26c81d014878504e609f3d0

With a 1-second flush interval and the null output, it still performs oddly as soon as the memory limit is hit on 10k logs/s syslog input. Could the processing of incoming input (we always hit 100% CPU at 10k logs/s) be interfering with output tasks? Debug logs showed that, even with the null output, some tasks take seconds to be destroyed.

We're moving to a new setup with a dedicated fluent-bit instance as the input/log processor and a dedicated output forwarder that sends those filesystem chunks.

@AqD

I've pushed several improvements based on your feedback. I did high-stress tests and the results look good here, but I would like to get your feedback.

Details about the fix and optimizations here: 9420ed9fe8cbd5413313e6bb59459fdb516bc61f

There are other patches involved; also, I've included a new 'dump' feature on GIT master that can be used to dump internal stats to stdout for debugging/dev purposes, e.g.:

$ kill -CONT `pidof fluent-bit`

output:

[engine] caught signal (SIGCONT)
[2020/03/18 18:21:49] Fluent Bit Dump

===== Input =====
syslog_debug (syslog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 61.0M (64000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

storage_backlog.1 (storage_backlog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 17
│  ├─ new           : 16
│  ├─ running       : 1
│  └─ size          : 32.2M (33714420 bytes)
│
└─ chunks
   └─ total chunks  : 17
      ├─ up chunks  : 17
      ├─ down chunks: 0
      └─ busy chunks: 17
         ├─ size    : 32.2M (33714420 bytes)
         └─ size err: 0

===== Storage Layer =====
total chunks     : 26
├─ mem chunks    : 0
└─ fs chunks     : 26
   ├─ up         : 17

Please test GIT master and let me know if any issue persists.

Did you try the latest v1.3.11 that comes with the improvements?

ping @AqD

@edsiper Thanks! The first problem disappeared with 1.3.11; the second problem remains, and it quickly runs out of file descriptors at around 20K logs/s from syslog.

However, 1.3.11 is good enough for our forwarder to work with (sending fluent-bit chunk files via a separate service), so I'll stop here.

What is running out of file descriptors? syslog in tcp mode?

Yes. I guess it's due to too many output tasks, because memory kept increasing and there were more and more output connections (up to hundreds, until fluent-bit segfaulted).
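
For anyone reproducing this, a quick way to watch the descriptor count grow on Linux (plain procfs, nothing Fluent Bit specific):

    $ ls /proc/`pidof fluent-bit`/fd | wc -l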
