Vector: File source slows down on many completed files

Created on 31 Dec 2019 · 6 comments · Source: timberio/vector

Reading performance decreases dramatically when Vector is watching many files that are already at their EOF position.

vector 0.6.0 (v0.6.0 x86_64-unknown-linux-musl 2019-12-12)

Ubuntu 18.04.3 LTS / Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Test config:

```toml
data_dir = "data-dir/"

[sources.infile]
  type = "file"
  include = ["logs/*.log"]

[sinks.outfile]
  type = "blackhole"
  inputs = ["infile"]
  print_amount = 10000
```

My test (a reproduction sketch follows the list):

  1. Add N - 1 small (~200 KB) files, run Vector to process them, then shut down the daemon.
  2. Add a 100 MB / 500k-line test file.
  3. Start Vector and collect the blackhole sink traces.
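
For concreteness, here is a minimal sketch of a generator for this file layout (a hypothetical helper, not part of Vector; the counts and line widths only approximate the sizes described above):

```rust
use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let n: usize = 50; // total number of watched files (worst row in the table below)
    create_dir_all("logs")?;

    // N - 1 small files that Vector fully reads (to EOF) in step 1.
    for i in 0..n - 1 {
        let mut f = BufWriter::new(File::create(format!("logs/small-{i}.log"))?);
        for line in 0..1_000 {
            // ~200 bytes per line x 1000 lines ~= 200 KB per file.
            writeln!(f, "small file {i} line {line} {}", "x".repeat(180))?;
        }
    }

    // The ~100 MB / 500k-line file added in step 2, before restarting Vector.
    let mut f = BufWriter::new(File::create("logs/test.log")?);
    for line in 0..500_000 {
        writeln!(f, "test line {line} {}", "x".repeat(180))?;
    }
    Ok(())
}
```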

Results:

Num of files | events/sec | bytes/sec
-- | -- | --
1 | 312 899 | 64 903 829
2 | 8 608 | 1 785 591
3 | 4 363 | 905 046
5 | 2 177 | 448 335
50 | 178 | 36 594

I don't speak Rust, but according to strace it makes tons of read syscalls with sleeps in between, all the time.

Maybe you should poll "done" files less often, or use inotify and friends instead of polling files.

Labels: performance, file, enhancement

All 6 comments

Thanks @anton-ryzhov, this summary is great.

Logs from one of my test runs:

```
Dec 31 01:32:14.202  INFO source{name=infile type=file}:file_server: file_source::file_server: Found file to watch. path="logs/xbj.log" file_position=222912
Dec 31 01:32:14.202  INFO source{name=infile type=file}:file_server: file_source::file_server: Found file to watch. path="logs/xbi.log" file_position=214922
Dec 31 01:32:14.202  INFO source{name=infile type=file}:file_server: file_source::file_server: Found file to watch. path="logs/test.log" file_position=0
Dec 31 01:32:16.422  INFO sink{name=outfile type=blackhole}: vector::sinks::blackhole: Total events collected events=10000 raw_bytes_collected=1946283
...
Dec 31 01:34:08.725  INFO sink{name=outfile type=blackhole}: vector::sinks::blackhole: Total events collected events=500000 raw_bytes_collected=103585679
```

Agreed, this is very helpful. Thanks, @anton-ryzhov!

You're right, the basic model of the file source is that we maintain a list of all relevant files and iterate through it, calling read to check for new data. We have a backoff mechanism, but it works off of the total amount of data read across all files and sleeps between iterations of the whole list. This is definitely not ideal for the case you bring up, where we'll spend a lot of time reading from files with no new data.
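
To illustrate that shape (names and constants invented here, not Vector's actual internals), the loop looks roughly like this, with the backoff keyed to the global byte count, so a single active file keeps every EOF file being read on each pass:

```rust
use std::time::Duration;

struct Watcher { /* open handle, file position, ... */ }

impl Watcher {
    // Stand-in for reading one line from the current position.
    fn read_line(&mut self) -> Option<Vec<u8>> { None }
}

fn run(mut watchers: Vec<Watcher>) {
    let mut backoff_ms: u64 = 1;
    loop {
        let mut bytes_read = 0usize;
        for w in watchers.iter_mut() {
            // Every pass issues read() syscalls even for files stuck at EOF.
            while let Some(line) = w.read_line() {
                bytes_read += line.len();
                // ... turn the line into an event ...
            }
        }
        if bytes_read == 0 {
            // Backoff keyed to the *total* read across all files: one busy
            // file resets it, so the loop never slows down for quiet files.
            backoff_ms = (backoff_ms * 2).min(256);
            std::thread::sleep(Duration::from_millis(backoff_ms));
        } else {
            backoff_ms = 1;
        }
    }
}
```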

I think in order to address this issue, we would need to add some per-file bookkeeping that would let us back off reads to files that have not seen updates in a longer period of time. This is definitely something we can do; we'll just need to keep in mind things like fairness and how it would interact with the oldest_first feature.
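
A minimal sketch of what that per-file bookkeeping could look like (an illustration of the idea only, not the actual patch; the 10 s / 2 s thresholds are invented):

```rust
use std::time::{Duration, Instant};

struct FileState {
    last_data: Instant, // when this file last produced bytes
    next_read: Instant, // earliest time we will read() it again
}

impl FileState {
    fn due(&self, now: Instant) -> bool {
        now >= self.next_read
    }

    fn record_read(&mut self, now: Instant, bytes_read: usize) {
        if bytes_read > 0 {
            // Active file: keep it on every iteration of the main loop.
            self.last_data = now;
            self.next_read = now;
        } else if now.duration_since(self.last_data) > Duration::from_secs(10) {
            // Quiet for a while: only check it every couple of seconds
            // instead of on every pass, cutting the per-pass syscall count.
            self.next_read = now + Duration::from_secs(2);
        }
    }
}
```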

@anton-ryzhov I've opened #1476 with one way we could address this. Basically, after a certain amount of time with no new data, we will begin reading from files far less frequently.

@anton-ryzhov I'm going to close this since #1476 should have resolved this. Please reopen if that is not the case. We're planning to get a new release out this week.

I've tested the nightly build — works as it should. Thank you

