vector 0.10.0 (g0f0311a x86_64-unknown-linux-gnu 2020-07-22) on Arch Linux
data_dir = "/tmp"
[sources.file]
type = "file"
include = ["/var/log/nginx/*.log"]
start_at_beginning = false
[sinks.sink0]
healthcheck = false
inputs = ["file"]
type = "console"
encoding = "json"
According to the docs, I would expect only new data to be read from the files. However, when I start Vector, it initially reads every file in the /var/log/nginx directory from the beginning and produces log events for all existing lines. Additionally, if I drop a new log file into the directory, the file is discovered and also read from the beginning, which does not match what the docs state.
All files are read from the beginning.
127.0.0.1 - - [25/Feb/2020:14:09:40 +0100] "GET / HTTP/1.1" 502 494 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:12:37 +0100] "GET / HTTP/1.1" 502 494 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:12:39 +0100] "GET / HTTP/1.1" 502 494 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:13:08 +0100] "GET / HTTP/1.1" 502 494 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:13:30 +0100] "GET / HTTP/1.1" 502 494 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:13:53 +0100] "GET / HTTP/1.1" 404 19 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:21:07 +0100] "GET / HTTP/1.1" 404 21 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:21:32 +0100] "GET / HTTP/1.1" 404 21 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:23:04 +0100] "GET / HTTP/1.1" 404 19 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:24:53 +0100] "GET / HTTP/1.1" 499 0 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:35:03 +0100] "GET / HTTP/1.1" 504 494 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:38:54 +0100] "GET / HTTP/1.1" 404 31 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:39:39 +0100] "GET / HTTP/1.1" 404 31 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:41:09 +0100] "GET / HTTP/1.1" 404 31 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:41:26 +0100] "GET / HTTP/1.1" 404 31 "-" "curl/7.64.0"
127.0.0.1 - - [25/Feb/2020:14:41:26 +0100] "GET / HTTP/1.1" 404 31 "-" "curl/7.64.0"
If this is indeed a bug, it would be a na$ty $urpri$e for anyone who starts "tailing" large log files and shipping them to a log management service. @binarylogic, should this be added to the current milestone/release?
Hi @rabbitstack, what you describe appears to be the correct behavior, but we'll have an engineer chime in to verify. Vector will ingest log data for files it discovers and checkpoint along the way. It will not read the same data twice due to the checkpoints.
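To make the checkpointing idea concrete, here is a minimal Python sketch of offset-based checkpoints. It is not Vector's implementation; `read_new_lines` and the in-memory `checkpoints` dict are hypothetical names used only for illustration, and it assumes files are identified by path and only ever appended to.

```python
def read_new_lines(path, checkpoints):
    """Return complete lines appended to `path` since the last call.

    `checkpoints` maps a file path to the byte offset already consumed.
    (Hypothetical helper for illustration, not Vector's actual logic.)
    """
    offset = checkpoints.get(path, 0)  # a file never seen before starts at byte 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only complete lines; a partial trailing line waits for the next call.
    end = data.rfind(b"\n") + 1
    checkpoints[path] = offset + end
    return data[:end].decode("utf-8", errors="replace").splitlines()
```

Calling it twice in a row on an unchanged file returns the existing lines once and then an empty list, which is the "never read the same data twice" property described above. It also shows why a newly discovered file, which has no checkpoint yet, is read from byte 0 in this model.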
This reminds me of #1020; you can see the truth table in that issue. I admit it's confusing, and #3972 plans to discuss improvements to this option.
Once we have an engineer take a look, and we determine this cannot be solved through configuration, we'll identify a short-term change and expedite that.
As a Vector user I would like to be able to choose when I want Vector to _only_ tail the log file, meaning "don't read any bytes previously written to the log file, just read/tail whatever gets written from now on", and when I want it to read/tail from the very beginning of the file. Do I have both of these options available to me today?
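For clarity, the two modes I am asking about could be sketched like this in Python. This is only an illustration of the desired semantics, not how Vector works internally; `initial_offset` and `tail_once` are hypothetical names, and the `start_at_beginning` parameter simply mirrors the config option above.

```python
import os


def initial_offset(path, start_at_beginning):
    """Pick where to start reading a file that has no checkpoint yet."""
    # Beginning of file, or the current end of file ("tail only" mode).
    return 0 if start_at_beginning else os.path.getsize(path)


def tail_once(path, offset):
    """Read everything after `offset`; return the lines and the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data.decode("utf-8", errors="replace").splitlines(), offset + len(data)
```

With `start_at_beginning=False`, bytes already in the file when it is first seen are skipped and only later writes are returned; with `True`, the whole file is read.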
Our services are distributed across multiple machines, and the log files on each machine are very large. If Vector reads each log file in full, the initial network traffic is very large, so we want to start collecting from the end of the file. However, this does not currently appear to be supported, and we look forward to a solution.