We have a centralised vector instance to which all vector agents send their logs via the vector sink. On the agent side, our vector sink uses the buffer option to hold logs until they are sent to the master. When configured with an on-disk buffer, vector uses an extreme amount of CPU on the system (about 80-90%). When we switch to a memory buffer or turn the buffer option off completely, CPU usage clocks in at about 0.5%. Is this an error in our configuration or a bug in the vector sink?
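To make the comparison concrete, here is a minimal sketch of the two buffer variants we compared on the agent's vector sink (option names and values as in our full config below; the memory variant is shown commented out):
[sinks.forward_syslog.buffer]
type = "disk"        # ~80-90% CPU
when_full = "block"
max_size = 104900
# memory variant, ~0.5% CPU:
# type = "memory"
# num_items = 500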
It is probably caused by LSM-tree compactions performed by LevelDB.
I think we can make LevelDB perform compactions less frequently and in larger batches by tuning its options and/or exposing them in Vector's config. In particular, it looks like increasing the block size could help.
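For illustration, if we exposed those knobs on the disk buffer, the config could look something like this (the leveldb_* option names are hypothetical and do not exist in Vector today):
[sinks.forward_syslog.buffer]
type = "disk"
max_size = 104900
# hypothetical knobs mapping straight to LevelDB options:
# leveldb_block_size = 65536            # larger blocks than LevelDB's 4 KiB default
# leveldb_write_buffer_size = 16777216  # bigger memtable than the 4 MiB default, so compactions run rarer and larger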
Thanks for this report @s1mps! Vector definitely shouldn't be using that much CPU, no matter how you configure the disk buffer.
Are you able to share more about your configuration and the data throughput in your deployment?
We are using vector as an agent to read logfiles and send them to a master, which parses them and sends them to a local elasticsearch instance.
Agent config:
data_dir = "/var/lib/vector"
[sources.local_syslog]
type = "file"
include = ["<FILES>" ]
glob_minimum_cooldown = 1000000
[transforms.local_syslog_add_hostname]
type = "add_fields"
inputs = ["local_syslog"]
[transforms.local_syslog_add_hostname.fields]
hostname = "${HOSTNAME}"
[sinks.forward_syslog]
inputs = ["local_syslog_add_hostname"]
type = "vector"
address = "<IP>:<PORT>"
healthcheck = true
[sinks.forward_syslog.buffer]
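# max_size is in bytes, i.e. a ~100 KB on-disk buffer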
type = "disk"
when_full = "block"
max_size = 104900
num_items = 500
Master config:
data_dir = "/var/lib/vector"
[sources.remote_syslog]
type = "vector"
address = "<IP>:<PORT>"
[transforms.add_time_field]
type = "lua"
inputs = ["remote_syslog", "local_syslog_add_hostname"]
source = """
event["log_timestamp"] = string.sub(event["file"], -10, -7) .. " " .. string.sub(event["message"], 1, 15)
"""
[transforms.parse_time_field]
type = "coercer"
inputs = ["add_time_field"]
[transforms.parse_time_field.types]
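# strptime format: 4-digit year, abbreviated month, space-padded day, time (e.g. "2019 Dec 12 10:33:01")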
log_timestamp = "timestamp|%Y %b %e %X"
[transforms.syslog_grok]
type = "grok_parser"
inputs = ["parse_time_field"]
pattern = "<PATTERN>"
field = "message"
drop_field = true
[transforms.syslog_grok.types]
host = "string"
program = "string"
pid = "int"
log_message = "string"
[sinks.es_syslog]
type = "elasticsearch"
inputs = ["syslog_grok"]
host = "<IP>:<PORT>"
index = "syslog-%Y-%m-%d"
healthcheck = true
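# batch_size is in bytes and batch_timeout in seconds, so batches flush at ~1 MB or after 1 s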
batch_size = 1049000
batch_timeout = 1
request_timeout_secs = 30
request_retry_attempts = 3
request_retry_backoff_secs = 5
[sinks.es_syslog.basic_auth]
user = "<USER>"
password = "<PASS>"
[sinks.es_syslog.buffer]
type = "disk"
when_full = "block"
max_size = 1049000
num_items = 500
Hope this helps!
Thank you! And just to be clear, you're seeing this on the agent side but not on the master?
We'll dig in and see what we can find.
Yes, it only affects the agents: the problem appears when the buffer on the vector sink is set to write to disk. The buffer on the elasticsearch sink seems to be fine.
@s1mps hey! Just for a complete picture, do you have any information on the configuration of the instances you are running the agent on? What type of disk? CPU size etc. Any info would be helpful :)
I also experienced this problem with the TCP and Vector sinks. HTTP works fine.
I'm using vector 0.6.0 (v0.6.0 x86_64-unknown-linux-musl 2019-12-12).
Minimal config which causes the problem:
data_dir = "data-dir/"
[sources.infile]
type = "file"
include = ["*.log"]
[sinks.tcp]
type = "tcp"
inputs = ["infile"]
address = "localhost:1234"
encoding = "json"
[sinks.tcp.buffer]
type = "disk"
max_size = 104900
when_full = "block"
I'm recreating data-dir for every test run and I don't have any matched logs to ship, yet right after start the process consumes 60% of all cores.
Same for the vector sink:
[sinks.vector]
type = "vector"
inputs = ["infile"]
address = "localhost:1234"
[sinks.vector.buffer]
type = "disk"
max_size = 104900
when_full = "block"
But not for
[sinks.http]
type = "http"
inputs = ["infile"]
uri = "http://localhost:8080"
encoding = "text"
[sinks.http.buffer]
type = "disk"
max_size = 104900
when_full = "block"
@anton-ryzhov thank you! Do you have any idea of how many logs you push through and what type of system is running vector? Any extra information is super helpful here.
Ubuntu 18.04.3 LTS / Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> I don't have any matched logs to ship
So there is no real workload, but vector is busy. The picture is the same with a small file (2 MB). It seems the issue doesn't depend on log shipping at all.
@anton-ryzhov @s1mps thanks for the repro configs! I have a fix open in #1465 :)