Vector: File source has problems reading files under high load

Created on 9 Jul 2020 · 9 comments · Source: timberio/vector

Hi,
I'm trying vector 0.9.1 in docker (image - timberio/vector:0.9.1-debian)
My docker-compose has 3 services - nginx, vector and kafka.

vector has the following configuration:

[sources.some_log]
  # General
  type = "file" # required
  include = ["/opt/logs/some.log"] # required
  start_at_beginning = true # optional, default
  host_key = "host"
  data_dir = "/tmp"

  # Priority
  oldest_first = true # optional, default
  max_read_bytes = 1048576 #1mb

[transforms.parse_json]
  type = "json_parser" # required
  inputs = ["hit_log"] # required
  drop_field = true # optional, default
  drop_invalid = true # required
  field = "message" # optional, default
  target_field = "src"

[transforms.remove_file]
  type = "remove_fields" # required
  inputs = ["parse_json"] # required
  fields = ["source_type"] # required

[sinks.hit_output_kafka]
  # General
  type = "kafka" # required
  inputs = ["remove_file"] # required
  bootstrap_servers = "kafka:9092" # required
  compression = "none" # optional, default
  healthcheck = true # optional, default
  key_field = "user_id" # required
  topic = "hits" # required

  encoding.codec = "json" # required

I ran a benchmark sending 20-30k QPS to nginx, which logs every request as JSON.
For the 2 million requests made in the benchmark I see 2 million lines in the nginx log but only 1999998 messages in kafka.
When I turn on debug logging in vector I see a few JSON parsing errors.

I've tried two more tests:

  1. Turn off vector while nginx is under benchmark load, and start it only after the bench is finished - then I see no errors and no messages lost.
  2. Restart vector with empty state after a few messages were lost. For example: under load I made 2 million messages and got 1999998 messages in kafka. I then restart vector with empty source offsets (by defining a new data_dir location in vector.toml), and in that case all messages are processed - I see 3999998 messages in kafka (1999998 from the first run and 2000000 from the second).

I've done some research and it looks like I found the cause of this behaviour:
https://github.com/timberio/vector/blob/master/lib/file-source/src/file_watcher.rs#L257-L261

It looks like under high load the writing process or the kernel doesn't flush buffers to disk within that 1ms interval, so the reader hits EOF in the middle of a line.
In my opinion vector should not treat EOF as a line delimiter, or this behavior should be configurable.
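
To illustrate, here is a minimal sketch (my own simplification, not vector's actual code) of how a read that stops at either a newline or EOF can surface half a line when the writer hasn't flushed the rest of the record yet:

  use std::fs::File;
  use std::io::{self, BufRead, BufReader};

  fn next_line(reader: &mut BufReader<File>) -> io::Result<Option<Vec<u8>>> {
      let mut buf = Vec::new();
      let n = reader.read_until(b'\n', &mut buf)?;
      if n == 0 {
          return Ok(None); // true EOF, nothing buffered
      }
      if buf.last() != Some(&b'\n') {
          // We hit EOF before a newline: `buf` holds a partial line.
          // Emitting it as-is is what splits one JSON record in two.
      }
      Ok(Some(buf))
  }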

Labels: performance, should, file, bug

All 9 comments

Thanks for the report @agnek! You're right that the short sleep there is not a full solution to the problem. The current design of our file source would make it difficult to do something smarter here, which is why we've discussed trying to evolve its architecture.

Would making that duration configurable be a suitable solution for your use case? The downside of a longer sleep would be for cases where you're tailing many files and would like to continue doing useful work while you wait for more data to show up. If you're only tailing a single busy file there wouldn't be much downside at all.
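
For example, a purely hypothetical knob (not an existing option) might look like:

[sources.some_log]
  type = "file"
  include = ["/opt/logs/some.log"]
  # Hypothetical option: how long to wait at EOF before treating the
  # buffered bytes as a complete line.
  eof_read_delay_ms = 50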

Actually I suppose that a longer sleep is not a solution - we cannot guarantee that log lines will not be lost even with a quite long sleep.

In my case I need to know that the log shipper will never drop any line; essentially I need at-least-once delivery semantics.
As I see it, the issue could be resolved with one configurable parameter - the number of loop iterations allowed while waiting for a data chunk containing an end of line.

Or, a more complicated solution: when the reader doesn't find an end of line, report that no lines were found, leave the file read offset unchanged, and try to fill the buffer on the next iteration.
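
A rough sketch of that first idea (a hypothetical helper, not vector's actual API): keep re-polling the file for a newline up to a configurable number of iterations before giving up on the line:

  use std::fs::File;
  use std::io::{self, BufRead, BufReader};
  use std::thread::sleep;
  use std::time::Duration;

  // Hypothetical bounded-retry reader: wait for a newline for up to
  // `max_retries` short sleeps. If the budget runs out, the caller can
  // still choose to emit the partial line or rewind the offset instead.
  fn read_line_bounded(
      reader: &mut BufReader<File>,
      max_retries: u32,
  ) -> io::Result<Option<Vec<u8>>> {
      let mut buf = Vec::new();
      for _ in 0..=max_retries {
          // read_until appends to `buf`, so each retry continues the line.
          reader.read_until(b'\n', &mut buf)?;
          if buf.last() == Some(&b'\n') {
              return Ok(Some(buf)); // complete line
          }
          sleep(Duration::from_millis(1)); // give the writer time to flush
      }
      Ok(if buf.is_empty() { None } else { Some(buf) })
  }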

Guys, it looks like we have a similar problem, and it's very critical for us. We built a large logging system and then found out that we lose some portion of our logs for no apparent reason.

I see a lot of logs like this:

Sep 10 08:22:18.808  WARN transform{name=logs_to_json_err_logs type=json_parser}: vector::internal_events::json: 1 "Event failed to parse as JSON" events were rate limited. rate_limit_secs=5
Sep 10 08:22:18.808  WARN transform{name=logs_to_json_err_logs type=json_parser}: vector::internal_events::json: Event failed to parse as JSON field=message self.error=expected value at line 1 column 1 rate_limit_secs=30

The files were double-checked - everything in them is pure JSON, so the problem is not in the data.

After some investigation and finding this issue, I understood that under load vector can fail to read a whole line and then sends the incomplete JSON to json_parser.

A couple of questions:
1) Why does drop_invalid not work in this case? It should just drop events that were read incompletely.
2) Can I disable rate_limit for this case?
2.1) If the rate limiter kicks in - will all other sources and transforms stop?

We rely on vector and would be glad to keep using it in the future, but with a bug like this we are not sure that is possible.

Thanks a lot for any help.

@lukesteensen Did we reach any conclusions here?

Hi @mikhno-s, sorry for the trouble you're having here!

Why drop_invalid does not work in this case?

You will see that log message regardless of your drop_invalid setting. It's just letting you know that it saw invalid JSON and it will handle it according to your config. If you have drop_invalid = false, it will still forward the message.

Can I disable rate_limit for this case?

The rate_limit here is only for the log messages themselves. The intention is to avoid filling your logs with repeated messages for things that can happen on each event process. The actual data processing within Vector is continuing at full speed.

We rely on vector and will be so glad to use it in future. But with bug like this we are not sure, that it is possible.

We understand and appreciate your help so far! Just to summarize, the problem we've encountered here is that Vector will sometimes return split lines when the filesystem is under load. These split lines then are not recognized as valid JSON, which can prevent them from being processed correctly.

This is a bit of a tricky problem because we can't wait forever to see a newline. There may not be one if the writer process was killed abruptly and we don't want to ignore the data in that situation. It also ties into how our file tailing code is written because we don't currently have a graceful way to wait for a newline without blocking progress on reading other files.

I've been digging into this today and I have a couple of ideas I think we could implement relatively quickly to make this situation better. High throughput file tailing is a tricky problem but it's one that's very important to Vector so I really do want to get this right. Again, thank you for your patience so far and hopefully we can get this resolved for you soon.

We have a parallel system running filebeat and it does not have this problem. Do you know how they make it work?

I'll also throw in an idea:
we could add an option complete_line that accepts a bool with the following behavior:

  • false: the file source is permitted to emit partial lines.
  • true: the file source is not permitted to emit partial lines and will wait indefinitely for them.

Case for complete_line

We have lines and the information they carry; let's call it content. Content can be categorized into two types:

  1. Content where at least partial meaning can be extracted from any subsection of a line. For this type we can minimize latency by slicing the line into parts when practical. An example of such content is plain text.
  2. Content where not even partial meaning can be extracted from a subsection of a line. For this type it makes sense to wait for the complete line, even indefinitely, since there is no value in emitting a partial line. An example of such content is JSON.

Then users can set the complete_line option according to which type of content they have.

I think this is a good compromise, since in most cases a file source will exclusively read one of these types, and if not, it can probably be solved by using two file sources.

One downside of complete_line = true is that users must emit a newline after the last line, otherwise it will never be picked up.

Implementation

For complete_line = true, the implementation can back off when EOF is reached.
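
If it landed, usage might look like this (a hypothetical sketch - complete_line is only a proposal here):

[sources.some_log]
  type = "file"
  include = ["/opt/logs/some.log"]
  # Proposed option, does not exist yet: never emit partial lines;
  # back off at EOF until the trailing newline shows up.
  complete_line = true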

Sounds good.
According to the docs (https://vector.dev/docs/reference/sources/file/#line-delimiters), "Each line is read until a new line delimiter (the 0xA byte) or EOF is found", so the default value for the complete_line param should be true.

@lukesteensen are you working on this? If so, let's unassign @ktff or vice versa.
