Hi,
I'm trying vector 0.9.1 in docker (image - timberio/vector:0.9.1-debian)
My docker-compose has 3 services - nginx, vector and kafka.
vector has the following configuration:

```toml
[sources.some_log]
# General
type = "file"                      # required
include = ["/opt/logs/some.log"]   # required
start_at_beginning = true          # optional, default
host_key = "host"
data_dir = "/tmp"

# Priority
oldest_first = true                # optional, default
max_read_bytes = 1048576           # 1 MB

[transforms.parse_json]
type = "json_parser"               # required
inputs = ["some_log"]              # required (the source defined above)
drop_field = true                  # optional, default
drop_invalid = true                # required
field = "message"                  # optional, default
target_field = "src"

[transforms.remove_file]
type = "remove_fields"             # required
inputs = ["parse_json"]            # required
fields = ["source_type"]           # required

[sinks.hit_output_kafka]
# General
type = "kafka"                     # required
inputs = ["remove_file"]           # required
bootstrap_servers = "kafka:9092"   # required
compression = "none"               # optional, default
healthcheck = true                 # optional, default
key_field = "user_id"              # required
topic = "hits"                     # required
encoding.codec = "json"            # required
```
I run a benchmark sending 20-30k qps to nginx, and nginx logs every request as JSON.
For the 2,000,000 requests made in the benchmark I see 2,000,000 lines in the nginx log but only 1,999,998 messages in Kafka.
When I turn on debug logging in vector I see a few JSON parsing errors.
I've tried two more tests:
I've done some research and it looks like I found the cause of this behaviour:
https://github.com/timberio/vector/blob/master/lib/file-source/src/file_watcher.rs#L257-L261
It looks like under high load another process (or the kernel) didn't flush its buffers to disk within the 1 ms sleep interval.
In my opinion vector should not treat EOF as a line delimiter, or at least this behavior should be configurable.
Thanks for the report @agnek! You're right that the short sleep there is not a full solution to the problem. The current design of our file source would make it difficult to do something smarter here, which is why we've discussed trying to evolve its architecture.
Would making that duration configurable be a suitable solution for your use case? The downside of a longer sleep would be for cases where you're tailing many files and would like to continue doing useful work while you wait for more data to show up. If you're only tailing a single busy file there wouldn't be much downside at all.
Actually I suspect a longer sleep is not a solution: we cannot guarantee that log lines will not be lost even with a quite long sleep.
In my case I need to know that the log shipper will never drop any line; I need at-least-once delivery semantics.
As I see it, the issue could be resolved with one configurable parameter: the number of loop iterations allowed while waiting for a data chunk containing an end-of-line.
Or a more complicated solution: when the reader doesn't find an end of line, emit that no lines were found, do not advance the file read offset, and try to fill the buffer on the next iteration.
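The second proposal could be sketched roughly like this (a minimal illustration, not Vector's actual code; `PartialLineReader` and `push_chunk` are invented names for the example). Bytes after the last newline stay buffered instead of being emitted, which is equivalent to not advancing the read offset past a partial line:

```rust
// Sketch: accumulate raw chunks and emit only complete, newline-terminated
// lines. A partial tail is retained for the next read instead of being
// emitted as a (broken) event.
struct PartialLineReader {
    pending: Vec<u8>, // bytes read so far without a trailing newline
}

impl PartialLineReader {
    fn new() -> Self {
        PartialLineReader { pending: Vec::new() }
    }

    // Feed a chunk as it arrives from the file; return only complete lines.
    // Anything after the last 0x0A delimiter stays in `pending`.
    fn push_chunk(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);
        let mut lines = Vec::new();
        while let Some(pos) = self.pending.iter().position(|&b| b == b'\n') {
            // Drain the line including its newline, then strip the delimiter.
            let line: Vec<u8> = self.pending.drain(..=pos).collect();
            lines.push(String::from_utf8_lossy(&line[..line.len() - 1]).into_owned());
        }
        lines
    }
}
```

With this shape, a write that lands mid-line simply delays the event until the rest of the line arrives, rather than producing two invalid JSON fragments.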
Guys, it looks like we have a similar problem and it's very critical for us. We built a huge logging system and then found out that we lose some portion of our logs for an unknown reason.
I see a lot of logs like this:

```
Sep 10 08:22:18.808 WARN transform{name=logs_to_json_err_logs type=json_parser}: vector::internal_events::json: 1 "Event failed to parse as JSON" events were rate limited. rate_limit_secs=5
Sep 10 08:22:18.808 WARN transform{name=logs_to_json_err_logs type=json_parser}: vector::internal_events::json: Event failed to parse as JSON field=message self.error=expected value at line 1 column 1 rate_limit_secs=30
```
The files were double-checked: everything in them is pure JSON, so the problem is not in the data.
After some investigation, and after finding this issue, I understood that under load vector can read less than the whole line and then send incomplete JSON to json_parser.
A couple of questions:
1) Why does drop_invalid not work in this case? It should just drop events that were read incompletely.
2) Can I disable rate_limit for this case?
2.1) If the rate limiter kicks in, will all other sources and transforms stop?
We rely on vector and would be glad to keep using it in the future. But with a bug like this we are not sure that is possible.
Thanks a lot for any help.
@lukesteensen Did we reach any conclusions here?
Hi @mikhno-s, sorry for the trouble you're having here!
> Why drop_invalid does not work in this case?
You will see that log message regardless of your drop_invalid setting. It's just letting you know that it saw invalid JSON and it will handle it according to your config. If you have drop_invalid = false, it will still forward the message.
> Can I disable rate_limit for this case?
The rate_limit here is only for the log messages themselves. The intention is to avoid filling your logs with repeated messages for things that can happen on each event process. The actual data processing within Vector is continuing at full speed.
> We rely on vector and will be so glad to use it in future. But with bug like this we are not sure, that it is possible.
We understand and appreciate your help so far! Just to summarize, the problem we've encountered here is that Vector will sometimes return split lines when the filesystem is under load. These split lines then are not recognized as valid JSON, which can prevent them from being processed correctly.
This is a bit of a tricky problem because we can't wait forever to see a newline. There may not be one if the writer process was killed abruptly and we don't want to ignore the data in that situation. It also ties into how our file tailing code is written because we don't currently have a graceful way to wait for a newline without blocking progress on reading other files.
I've been digging into this today and I have a couple of ideas I think we could implement relatively quickly to make this situation better. High throughput file tailing is a tricky problem but it's one that's very important to Vector so I really do want to get this right. Again, thank you for your patience so far and hopefully we can get this resolved for you soon.
We have a parallel system running filebeat and it does not have this problem. Do you know how they make it work?
I'll also throw in an idea:

We can add an option `complete_line` that accepts a bool with this behavior:

- `false`: the file source is permitted to emit partial lines.
- `true`: the file source is not permitted to emit partial lines and will wait indefinitely for a complete one.

Case for `complete_line`: we have lines and the information they carry, let's call it content. Content can be categorized into:

- text
- JSON

Then users can set the `complete_line` option according to which type of content they have.
I think this is a good compromise since in most cases the file source will exclusively read one of those types, and if not, it can probably be solved by using two file sources.
One minus of the `complete_line` behavior on `true` is that users must emit a newline on the last line, otherwise it will never be picked up.
For `complete_line = true`, the implementation can back off when EOF is reached.
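That EOF backoff could look roughly like this (a hedged sketch, not Vector's code; `read_complete_line` and `max_retries` are invented names, and a real `complete_line = true` implementation would retry indefinitely rather than cap retries, which is done here only to keep the sketch bounded):

```rust
use std::io::BufRead;
use std::time::Duration;

// Sketch: keep reading until a newline-terminated line is available.
// On EOF without a newline, sleep with increasing backoff and retry,
// instead of emitting the partial line as an event.
fn read_complete_line<R: BufRead>(reader: &mut R, max_retries: u32) -> Option<String> {
    let mut buf = Vec::new();
    let mut retries = 0;
    loop {
        // read_until appends everything up to and including b'\n', or up to EOF
        let n = reader.read_until(b'\n', &mut buf).ok()?;
        if buf.last() == Some(&b'\n') {
            buf.pop(); // strip the 0x0A delimiter
            return Some(String::from_utf8_lossy(&buf).into_owned());
        }
        if n == 0 && retries >= max_retries {
            // Still no newline after backing off; in this sketch we give up
            // (a real implementation would keep waiting or checkpoint the tail).
            return None;
        }
        retries += 1;
        // Exponential backoff, capped; gives the writer time to flush.
        std::thread::sleep(Duration::from_millis(1u64 << retries.min(6)));
    }
}
```

The backoff replaces the fixed 1 ms sleep linked earlier in the thread, so a writer that is slow to flush under load gets progressively more time before the reader decides what to do with the partial tail.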
Sounds good.
According to the docs (https://vector.dev/docs/reference/sources/file/#line-delimiters), "Each line is read until a new line delimiter (the 0xA byte) or EOF is found.", so the default value for the `complete_line` param should be true.
@lukesteensen are you working on this? If so, let's unassign @ktff or vice versa.