We have a centralised vector instance to which all vector agents send their logs via the vector sink. On the agent side, our vector sink uses the buffer option to hold logs until they are sent to the master. When configured with an on-disk buffer, vector uses an extreme amount of CPU on the system (about 80-90%). When we switch to a memory buffer or turn the buffer option off completely, CPU usage clocks in at about 0.5%. Is this an error in our configuration or a bug in the vector sink?
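To make the comparison concrete, here is a minimal sketch of the two buffer variants we compared on the agent's vector sink (option names and values as in our full config below; the memory variant is shown commented out):
[sinks.forward_syslog.buffer]
type = "disk"        # ~80-90% CPU
when_full = "block"
max_size = 104900
# memory variant, ~0.5% CPU:
# type = "memory"
# num_items = 500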
It is probably caused by LSM-tree compactions performed by LevelDB.
I think we can make LevelDB perform compactions less frequently and in larger batches by tuning its options and/or exposing them in Vector's config. In particular, it looks like increasing the block size could help.
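For illustration, if we exposed those knobs on the disk buffer, the config could look something like this (the leveldb_* option names are hypothetical and do not exist in Vector today):
[sinks.forward_syslog.buffer]
type = "disk"
max_size = 104900
# hypothetical knobs mapping straight to LevelDB options:
# leveldb_block_size = 65536            # larger blocks than LevelDB's 4 KiB default
# leveldb_write_buffer_size = 16777216  # bigger memtable than the 4 MiB default, so compactions run rarer and larger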
Thanks for this report @s1mps! Vector definitely shouldn't be using that much CPU, no matter how you configure the disk buffer.
Are you able to share more about your configuration and the data throughput in your deployment?
We are using vector as an agent to read logfiles and send them to a master, which parses them and sends them to a local elasticsearch instance.
Agent config:
data_dir = "/var/lib/vector"
[sources.local_syslog]
type = "file"
include = ["<FILES>" ]
glob_minimum_cooldown = 1000000
[transforms.local_syslog_add_hostname]
type = "add_fields"
inputs = ["local_syslog"]
[transforms.local_syslog_add_hostname.fields]
hostname = "${HOSTNAME}"
[sinks.forward_syslog]
inputs = ["local_syslog_add_hostname"]
type = "vector"
address = "<IP>:<PORT>"
healthcheck = true
[sinks.forward_syslog.buffer]
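# max_size is in bytes, i.e. a ~100 KB on-disk buffer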
type = "disk"
when_full = "block"
max_size = 104900
num_items = 500
Master config:
data_dir = "/var/lib/vector"
[sources.remote_syslog]
type = "vector"
address = "<IP>:<PORT>"
[transforms.add_time_field]
type = "lua"
inputs = ["remote_syslog", "local_syslog_add_hostname"]
source = """
event["log_timestamp"] = string.sub(event["file"], -10, -7) .. " " .. string.sub(event["message"], 1, 15)
"""
[transforms.parse_time_field]
type = "coercer"
inputs = ["add_time_field"]
[transforms.parse_time_field.types]
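# strptime format: 4-digit year, abbreviated month, space-padded day, time (e.g. "2019 Dec 12 10:33:01")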
log_timestamp = "timestamp|%Y %b %e %X"
[transforms.syslog_grok]
type = "grok_parser"
inputs = ["parse_time_field"]
pattern = "<PATTERN>"
field = "message"
drop_field = true
[transforms.syslog_grok.types]
host = "string"
program = "string"
pid = "int"
log_message = "string"
[sinks.es_syslog]
type = "elasticsearch"
inputs = ["syslog_grok"]
host = "<IP>:<PORT>"
index = "syslog-%Y-%m-%d"
healthcheck = true
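# batch_size is in bytes and batch_timeout in seconds, so batches flush at ~1 MB or after 1 s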
batch_size = 1049000
batch_timeout = 1
request_timeout_secs = 30
request_retry_attempts = 3
request_retry_backoff_secs = 5
[sinks.es_syslog.basic_auth]
user = "<USER>"
password = "<PASS>"
[sinks.es_syslog.buffer]
type = "disk"
when_full = "block"
max_size = 1049000
num_items = 500
Hope this helps!
Thank you! And just to be clear, you're seeing this on the agent side but not on the master?
We'll dig in and see what we can find.
Yes, it only affects the agents: the problem appears when the buffer on the vector sink is set to write to disk. The buffer on the elasticsearch sink seems to be fine.
@s1mps hey! Just for a complete picture, do you have any information on the configuration of the instances you are running the agent on? What type of disk? CPU size etc. Any info would be helpful :)
I also experienced this problem with the TCP and Vector sinks. HTTP works fine.
I'm using vector 0.6.0 (v0.6.0 x86_64-unknown-linux-musl 2019-12-12).
Minimal config which causes the problem:
data_dir = "data-dir/"
[sources.infile]
type = "file"
include = ["*.log"]
[sinks.tcp]
type = "tcp"
inputs = ["infile"]
address = "localhost:1234"
encoding = "json"
[sinks.tcp.buffer]
type = "disk"
max_size = 104900
when_full = "block"
I'm recreating data-dir for every test run and I don't have any matched logs to ship, yet right after start the process consumes 60% of all cores.
Same for the vector sink:
[sinks.vector]
type = "vector"
inputs = ["infile"]
address = "localhost:1234"
[sinks.vector.buffer]
type = "disk"
max_size = 104900
when_full = "block"
But not for
[sinks.http]
type = "http"
inputs = ["infile"]
uri = "http://localhost:8080"
encoding = "text"
[sinks.http.buffer]
type = "disk"
max_size = 104900
when_full = "block"
@anton-ryzhov thank you! Do you have any idea of how many logs you push through and what type of system is running vector? Any extra information is super helpful here.
Ubuntu 18.04.3 LTS / Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> I don't have any matched logs to ship
So there is no real workload, but vector is busy. The picture is the same with a small file (2 MB). It seems the issue doesn't depend on log shipping at all.
@anton-ryzhov @s1mps thanks for the repro configs! I have a fix open in #1465 :)