Under some circumstances, checksumming a fixed number of bytes is very suboptimal.
To work around this, we could implement a checksumming process that reads the first line - that is, up to the first `\n` character (or a specified ceiling in bytes) - and checksums that. This is a very reasonable measure for line-separated logs, and it solves the major issues with the purely bytes-oriented approach.
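As a sketch of what that could look like (all names here are hypothetical, not Vector's actual API; FNV-1a stands in for whatever checksum function the real implementation would use, and returning `None` until a complete line is present is one possible answer to the partial-line question, not a settled design):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Read};

// Read up to the first `\n` or `max_line_length` bytes, whichever comes
// first, and checksum those bytes. Illustrative sketch only.
fn first_line_checksum(path: &str, max_line_length: u64) -> std::io::Result<Option<u64>> {
    let mut reader = BufReader::new(File::open(path)?).take(max_line_length);
    let mut line = Vec::new();
    reader.read_until(b'\n', &mut line)?;
    // No terminating newline yet: the first line may still be mid-write,
    // so decline to produce a fingerprint for now.
    if line.last() != Some(&b'\n') {
        return Ok(None);
    }
    // FNV-1a as a stand-in hash; any stable checksum works here.
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in &line {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    Ok(Some(hash))
}
```

Note that the fingerprint depends only on the first line, so appending more data to the file never changes it, while a file with no complete first line yet is simply skipped until one appears.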
I figured this would help significantly with the proper implementation of the k8s source.
Current code for reference:
https://github.com/timberio/vector/blob/c293c492e97a7249822be4907f6bab84414dae7d/lib/file-source/src/file_server.rs#L391-L409
Ref #2701.
> Under some circumstances, checksumming a fixed number of bytes is very suboptimal.
As far as I understand, there is exactly one circumstance where this does not work. Requiring files to be a certain minimum size is a known tradeoff. If there are other things you consider "suboptimal" please explain further.
> we could implement a checksumming process that would read the first line - that is up to the first `\n` character (or a specified ceiling in bytes) - and checksum that
What happens if you read and there is no newline character yet? What if the first character is a newline? Should we require a minimum number of bytes before the newline to be considered valid? What about files with common headers?
> solves the major issues with the pure bytes-oriented approach
Again, please explain these major issues so that we can evaluate your proposal against them.
> I figured this would help significantly with the proper implementation of the k8s source.
Is there something specific about k8s that is creating many files smaller than the default 256B limit? Please explain further so that we can have an informed discussion before proposing specific changes.
> As far as I understand, there is exactly one circumstance where this does not work. Requiring files to be a certain minimum size is a known tradeoff. If there are other things you consider "suboptimal" please explain further.
Correct.
The smallest message in k8s is 40 bytes in one log format and 62 bytes in the other. Both log formats allow files smaller than 256 bytes. This is precisely the case I hit in the E2E tests, and thus the `kubernetes_logs` source behaves incorrectly (or "correctly within the known tradeoff").
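For context, the two formats in question look roughly like this (illustrative lines, not byte-exact minimal examples) - Docker's JSON-file format followed by the CRI format:

```
{"log":"hi\n","stream":"stdout","time":"2020-06-23T10:00:00.000000001Z"}
2020-06-23T10:00:00.000000001Z stdout F hi
```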
My solution - reading only the first line - has very similar properties to the checksum solution we already have, but it doesn't have that tradeoff. It does require another set of tradeoffs, but as long as its conditions are met, it will work properly every time. Those conditions are known to hold in the `kubernetes_logs` source.
Now, more about the properties of checksumming the first line (and the required conditions).
Effectively, most of the properties are shared with the current length-based checksum algorithm, but without the downside of ignoring small non-empty files.
This solution would be good for the `kubernetes_logs` source, but there are definitely other applications too. Thinking about my real-life use cases, I'd probably pick this over the current checksum implementation almost every time. In the majority of my real-world cases, when this approach stumbles upon an edge case, the length-based checksums would not do a significantly better job - but they definitely do a lot of harm in the common case. That's not a good trade-off imo, and I just want to offer a different one.
In the use cases I don't see that often in real life (to me they exist "on paper"), length-based checksums may do better. Maybe I'm one of the (un)lucky ones, and other people have real-world applications where that's way more important. I personally just don't understand how it can make sense.
I hope this message is not too long, and that it explains my point of view on the matter reasonably well.
There are a few other improvements that, I figured, we can apply to the fingerprinting design. This one is critical for achieving better trade-offs for the `kubernetes_logs` source though, so I have to get into it while I'm still busy with the k8s PR.
To summarize all of that, it sounds like @MOZGIII just wants to guarantee that the timestamp is included in the fingerprint since that is likely to be unique. Two solutions:
- If the timestamp prefixes the line then you can reduce the bytes to the smallest you listed (40 bytes). That'll always include the timestamp.
- If the timestamp does not prefix the line then I can see why he wants to grab the first line.
You're absolutely right! I already attempted the first approach, but in the docker log format the `log` field comes first. Because of this, when I switched the fingerprinter to checksums (https://github.com/timberio/vector/pull/2653/commits/ecf28a3da7e702ff197a5d6c9846287713a76f90), the E2E tests started failing, so I rolled the change back for now (https://github.com/timberio/vector/pull/2653/commits/d987ee2c7f01f7025c92db051ab36ee566a507f3).
I prepared a concrete implementation of this proposal at #2904. There's a test suite that demonstrates and ensures the properties we'd want from a checksum.
I thought about it more, and the summary isn't quite correct.
I also want this important property:
This is, in essence, the tradeoff of the proposed checksum implementation. While the current one has a slightly different one:
In practice, it means that if a Pod crashes at startup with a tiny error (for instance, an ENV_VAR was not set), it will never be picked up by Vector. If I were considering deploying Vector in my recent projects, I know for sure this would be a deal-breaker.
> Both log formats allow files to be less than 256 bytes. This case is precisely what I have at E2E tests, and thus `kubernetes_logs` source behaves incorrectly (or "correctly within known tradeoff").
I understand that there are valid files smaller than 256 bytes and that it is your position that ignoring those makes our implementation incorrect.
> My solution - reading only the first line - has very similar properties to the checksum solution that we already have - but _it doesn't have that tradeoff_. It does require another set of tradeoffs, but as long as the conditions are met - it will just work properly every time.
Every solution will work every time if you assume that all of its required conditions are met :smile:
> The said conditions are _known to be true_ in the `kubernetes_logs` source.
This needs to be made more clear. Are we talking about a solution specifically for the k8s logs source? Are we talking about something that's intended to be more general-purpose? Is it intended to replace the existing checksum algorithm?
Without a clear understanding of the intended scope, it's impossible to evaluate changes like this.
> The file server isn't for binary files, so why do we throw that property away when doing the fingerprinting?
This ignores our existing support for gzipped files.
> they definitely do a lot of harm in the common case
To be clear, you're claiming that log files smaller than 256 bytes are the common case? We all understand that they are valid and possible, but our tradeoff was made on the assumption that it is extremely uncommon for real files to remain that small for any meaningful amount of time. If you disagree, it's far more helpful to explain why than to repeatedly claim that we're ignoring a "common case".
> I personally just don't understand how it can make sense.
Statements like this are not helpful. Please be willing to accept that our position is the result of logical reasoning from a reasonable set of assumptions. If you can't understand why something is the way it is, ask for clarification. The implication that we've made unreasonable decisions to this point is a serious distraction from your actual proposal.
> I already attempted to use the first approach, but in docker log format the `log` field comes first.
This is where I see the largest technical problem with your proposal. The combination of a maximum line length and the fact that timestamps do not come first in these formats means you need to set the max larger than the largest log message you expect to see. How do you make that decision without meaningful correctness tradeoffs?
Another practical problem is when the first event is too long:
`{"log":"... more than 256 symbols, easily possible with nested json ...", "time": "...", ... }`
This demonstrates a case where the current checksum wouldn't read far enough into the file to cover the timestamp, meaning the checksum is generated solely from the payload of the log message. This can be very problematic: in k8s it's common for multiple Pods from the same ReplicaSet to start on the same node at the same time and print the same startup messages. This will trick Vector into thinking the file was moved, rather than that those are two different files. If the user is lucky, the inner json event will also have a high-precision timestamp that lands in the checksummed portion, which would mitigate the issue for them. That's far from reliable though.
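To make the failure mode concrete, here's a sketch (made-up data; the field order matches the docker JSON log format, with the payload before the timestamp) showing two files whose fixed 256-byte prefixes collide while their full first lines differ:

```rust
// Build a docker-format log line with a long identical payload and a
// varying timestamp at the end. Illustrative data only.
fn make_line(ts: &str) -> String {
    let payload = "starting up... ".repeat(20); // ~300 bytes of identical payload
    format!(r#"{{"log":"{payload}","stream":"stdout","time":"{ts}"}}"#)
}

// Returns (prefix_equal, full_equal) for two pods' first lines: the fixed
// 256-byte prefix is identical (so a length-based checksum collides), while
// the complete first lines differ (so a first-line checksum does not).
fn demo() -> (bool, bool) {
    let a = make_line("2020-06-23T10:00:01.111111111Z");
    let b = make_line("2020-06-23T10:00:02.222222222Z");
    (&a[..256] == &b[..256], a == b)
}
```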
> This ignores our existing support for gzipped files.
Oh, I missed this, true. :thinking: It explains a lot. :smile:
> Is it intended to replace the existing checksum algorithm?
No, definitely not a replacement for the current checksum - not without more confidence and discussion.
> Are we talking about a solution specifically for the k8s logs source? Are we talking about something that's intended to be more general-purpose?
Initially, just for the k8s source. Once we discuss it in detail, maybe we can expose it as an option to users at the file source too.
> This is where I see the largest technical problem with your proposal. The combination of a maximum line length and the fact that timestamps do not come first in these formats means you need to set the max larger than the largest log message you expect to see. How do you make that decision without meaningful correctness tradeoffs?
`max_line_bytes` at the file source.

I've left my system running the k8s E2E tests with the implementation from https://github.com/timberio/vector/pull/2904 in a loop, and so far no failures :tada:
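As a sketch, the bound could be expressed via that existing option (source name and values are illustrative):

```toml
[sources.pod_logs]
  type = "file"
  include = ["/var/log/pods/**/*.log"]
  # 32 KiB: comfortably above the 16 KiB payload cap discussed in this
  # thread for the kubernetes log formats.
  max_line_bytes = 32768
```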
Thanks @MOZGIII, you should have led with that example :smile:. Vector aims to "just work" for the common observability use case. That is:
That's it. We are not concerned about solving strange edge cases by default, such as:
Solutions for you:
Hope that helps organize this discussion and move it forward.
Real world data point - we have a number of Python data processing jobs that chuck out a single short "Starting work" message at the start before spending a few hours doing data processing without (hopefully!) outputting anything until "Done" at the end. Getting that initial message to come through is reassuring. I suspect _most_ would reach the 256-byte minimum by the expected final message, but I wouldn't like to insist on it, and always getting something through by the end would be mandatory.
While we could work around this by printing "Lorem ipsum" filler at the start, I think everyone would regard this as really a bit rubbish, and it would lead to a poor impression of the new system. I was surprised the messages weren't coming through until I read the Vector output logs at INFO/DEBUG level. My colleagues, who didn't just implement a new logging system, will have no such clue and will waste time looking for bugs that aren't there.
Thanks, @tyrken, that's a useful data point. We definitely want use cases like that to just work as much as possible.
@MOZGIII
> In the ordinary case, the first line payload is usually about 128 bytes on average.
Where are you getting this data?
> The value is 32 KiB. It's the next meaningful size value after 16 KiB, which is the maximum length of the _payload_ at kubernetes log formats.
If we're confident that k8s implementations enforce that upper limit, then it seems like we can be reasonably confident we'll avoid situations where a long line means we don't include a timestamp in our checksum.
I'm not opposed to adding an internal mode specifically for k8s if we can really rely on these properties to avoid making equivalent tradeoffs to the current checksumming implementation. I just wish you'd made this clearer earlier in the conversation :smile:
As I understand it, the best argument for a new mode is roughly as follows:
If that's all true, then the result is that we can reliably cover all k8s logs (even small ones) by checksumming the first line and I'm fine to proceed with adding an internal-only mode for that (if/how to present it to users is another discussion).
The biggest question I'm unsure of is around compressed files. My limited reading of k8s docs implies that files are rotated via normal systems means (e.g. logrotate) and not via some k8s-specific behavior. Does our implementation somehow ensure that we won't encounter rotated files that have been compressed?
> I just wish you'd made this clearer earlier in the conversation
Sorry about that - I was thinking it would make sense even as a generic solution, so I didn't focus on the k8s use cases initially, even though I need it specifically for the k8s use cases.
> If that's all true, then the result is that we can reliably cover all k8s logs (even small ones) by checksumming the first line and I'm fine to proceed with adding an internal-only mode for that (if/how to present it to users is another discussion).
Yep, that's the case.
> The biggest question I'm unsure of is around compressed files. My limited reading of k8s docs implies that files are rotated via normal systems means (e.g. logrotate) and not via some k8s-specific behavior. Does our implementation somehow ensure that we won't encounter rotated files that have been compressed?
This is a really good question. The rotated logs may or may not be gzipped. More specifically:
To offer compression support in the first line checksum, we can just uncompress the gzip stream and read the line of the compressed file (in streaming mode).
However, compressed rotated files might be one of the things we don't need to bother supporting: neither kubectl nor fluentd (the "official" log aggregation solution) supports compressed files. It looks like it's generally viewed as a non-issue to be unable to pick up data from gzipped files - which makes sense to some degree, since all compressed data was uncompressed at some point and would've been handled then.
Either way, I suggest we count this as something we can add later if we see need for it.
Our implementation currently only picks up `.log` files for processing; `.gz` files will be ignored.
@lukesteensen please take a look at the proposed implementation when you have time: https://github.com/timberio/vector/pull/2904
> It looks like it's generally viewed as a non-issue to not be able to pick up data from the gzipped files
The issue would be if we tried to read from a compressed file as if it were an uncompressed file. I agree we don't need to worry about full support for compression right now.
> Our implementation currently only picks up `.log` files for processing, `.gz` files will be ignored.
This seems sufficient to ensure we won't attempt to read any compressed file. If we're confident in that I think we can move forward here.
Thanks everyone! Closing this as done, the next step is #2926.