Vector: The ability to checksum by the first line at the file server

Created on 23 Jun 2020 · 14Comments · Source: timberio/vector

Under some circumstances, doing a checksum just by the fixed amount of bytes becomes very suboptimal.
To work around this, we could implement a checksumming process that would read the first line - that is up to the first \n character (or a specified ceiling in bytes) - and checksum that. It is a very reasonable measure for line-separated logs, and solves the major issues with the pure bytes-oriented approach.

I figured this would help significantly with the proper implementation of the k8s source.

Current code for reference:
https://github.com/timberio/vector/blob/c293c492e97a7249822be4907f6bab84414dae7d/lib/file-source/src/file_server.rs#L391-L409

Ref #2701.

idea approval file enhancement

Source

MOZGIII

Most helpful comment

Thanks, @tyrken, that's a useful data point. We definitely want use cases like that to just work as much as possible.

@MOZGIII

In the ordinary case, the first line payload is usually about 128 bytes on average.

Where are you getting this data?

The value is 32 KiB. It's the next meaningful size value after 16 KiB, which is the maximum length of the _payload_ at kubernetes log formats.

If we're confident that k8s implementations enforce that upper limit, then it seems like we can be reasonably confident we'll avoid situations where a long line means we don't include a timestamp in our checksum.

I'm not opposed to adding an internal mode specifically for k8s if we can really rely on these properties to avoid making equivalent tradeoffs to the current checksumming implementation. I just wish you'd made this clearer earlier in the conversation :smile:

As I understand it, the best argument for a new mode is roughly as follows:

Kubernetes log formats do not reliably put timestamps at the beginning of each line, which means the default 256 bytes have a meaningfully high likelihood of not including a timestamp.
Kubernetes log formats do enforce a size limit per line, so we do not have to worry about setting our own limit that may be smaller than some users' longest lines.
We don't have to worry about all the other edge cases normal checksumming was designed to handle (e.g. headers, compressed files, arbitrary formats, etc)

If that's all true, then the result is that we can reliably cover all k8s logs (even small ones) by checksumming the first line and I'm fine to proceed with adding an internal-only mode for that (if/how to present it to users is another discussion).

The biggest question I'm unsure of is around compressed files. My limited reading of k8s docs implies that files are rotated via normal systems means (e.g. logrotate) and not via some k8s-specific behavior. Does our implementation somehow ensure that we won't encounter rotated files that have been compressed?

lukesteensen on 29 Jun 2020

👍2

All 14 comments

Under some circumstances, doing a checksum just by the fixed amount of bytes becomes very suboptimal.

As far as I understand, there is exactly one circumstance where this does not work. Requiring files to be a certain minimum size is a known tradeoff. If there are other things you consider "suboptimal" please explain further.

we could implement a checksumming process that would read the first line - that is up to the first \n character (or a specified ceiling in bytes) - and checksum that

What happens if you read and there is no newline character yet? What if the first character is a newline? Should we require a minimum number of bytes before the newline to be considered valid? What about files with common headers?

solves the major issues with the pure bytes-oriented approach

Again, please explain these major issues so that we can evaluate your proposal against them.

I figured this would help significantly with the proper implementation of the k8s source.

Is there something specific about k8s that is creating many files smaller than the default 256B limit? Please explain further so that we can have an informed discussion before proposing specific changes.

lukesteensen on 23 Jun 2020

As far as I understand, there is exactly one circumstance where this does not work. Requiring files to be a certain minimum size is a known tradeoff. If there are other things you consider "suboptimal" please explain further.

Correct.

The smallest message at k8s consists of 40 bytes with one log format, and 62 bytes with the other log format. Both log formats allow files to be less than 256 bytes. This case is precisely what I have at E2E tests, and thus kubernetes_logs source behaves incorrectly (or "correctly within known tradeoff").

My solution - reading only the first line - has very similar properties to the checksum solution that we already have - but it doesn't have that tradeoff. It does require another set of tradeoffs, but as long as the conditions are met - it will just work properly every time. The said conditions are known to be true in the kubernetes_logs source.

Now, more about the properties of checksumming the first line (and the required conditions).

Files without the first logs line can be safely ignored until the first line appears, the same as with the existing checksums.
In k8s, every line has a timestamp (in both log formats that k8s has) and thus is very likely to have a unique checksum. Again, this is not different from the regular checksum.
In k8s, the first character will never be a newline - it's not allowed with the supported log formats; every line will at least have a timestamp.
No headers in k8s formats.
After the first line is obtained, the checksum remains unchanged, since lines aren't rewritten unless the file is truncated.
No matter how long the first line is, the file will be picked up and for processing effectively at the same time the first event from it ready, thus eliminating this nasty case of small log files just being ignored. At the same time, log lines have a timestamp, and thus even small we can re reasonably certain they'll have unique checksums (chances are even arguably better than in the regular checksums case).
With the current checksum, we can't pick an optimal size because the file format we're working with is not known at the time we specify that setting; it would still be worse since we won't be capturing the data as elegantly.
The file server is oriented toward reading line-delimited text files, not binary files, so it doesn't make a lot of sense to think in bytes in the first place. Thinking in terms of lines, on the other hand, is more natural, as it exactly corresponds with the payload format the rest of the code there is designed to work with. The file server isn't for binary files, so why do we throw that property away when doing the fingerprinting?

Effectively, most of the properties are common with the current length-based checksum algorithm. But there's no downside of ignoring small non-empty files.

This solution would be good for the kubernetes_logs source, but there are definitely other applications too. Thinking about my real-life use cases, I'd probably pick this over the current checksum implementation almost every time. In the majority of my real-world cases, when this approach stumbles upon an edge case, the length-based checksums would not do a significantly better job. But they definitely do a lot of harm in the common case, that's not a good trade-off imo, and I just want to offer a different one.
In the use cases, I don't see that often in real life (to me - "on paper") line-based checksums may do better. Maybe I'm one of the (un)lucky ones, and other people have real-world applications where that's way more important. I personally just don't understand how it can make sense.

I hope this message is not too big, and at the same time explains my point of view on the matter reasonably.

There are a few other improvements that, I figured, we can apply to the fingerprinting design. This one is critical for achieving better trade-offs for kubernetes_logs source though, so I have to get into it while I'm still busy with the k8s PR.

MOZGIII on 24 Jun 2020

To summarize all of that, it sounds like @MOZGIII just wants to guarantee that the timestamp is included in the fingerprint since that is likely to be unique. Two solutions:

If the timestamp prefixes the line then you can reduce the bytes to the smallest you listed (40 bytes). That'll always include the timestamp.
If the timestamp does not prefix the line then I can see why he wants to grab the first line.

binarylogic on 24 Jun 2020

To summarize all of that, it sounds like @MOZGIII just wants to guarantee that the timestamp is included in the fingerprint since that is likely to be unique. Two solutions:

If the timestamp prefixes the line then you can reduce the bytes to the smallest you listed (40 bytes). That'll always include the timestamp.

If the timestamp does not prefix the line then I can see why he wants to grab the first line.

You're absolutely right! I already attempted to use the first approach, but in docker log format the log field comes first. Because of this, when I switched the fingerprinter to checksums (https://github.com/timberio/vector/pull/2653/commits/ecf28a3da7e702ff197a5d6c9846287713a76f90), E2E tests started failing. So I rolled the change back for now (https://github.com/timberio/vector/pull/2653/commits/d987ee2c7f01f7025c92db051ab36ee566a507f3).

I prepared a concrete implementation of this proposal at #2904. There's a test suite that demonstrates and ensure the properties that we'd want from a checksum.

MOZGIII on 25 Jun 2020

I thought about it more, and the summary isn't quite correct.
I also want this important property:

If the file no meaningful data to parse as an event - we don't care, but if it has at least one event - we should take it into account.

This is, in essence, the tradeoff of the proposed checksum implementation. While the current one has a slightly different one:

If the file is smaller than the predefined threshold (regardless is there are parsable events in it) - we don't care, and take it into account only after it reaches a certain size.

In practice, it means that if a Pod crashes at startup with a tiny error (for instance EVN_VAR was not set) - it will never be picked up by Vector. If I would be considering to deploy Vector in my recent projects, I know for sure this would just be a deal-breaker.

MOZGIII on 27 Jun 2020

Both log formats allow files to be less than 256 bytes. This case is precisely what I have at E2E tests, and thus kubernetes_logs source behaves incorrectly (or "correctly within known tradeoff").

I understand that there are valid files smaller than 256 bytes and that it is your position that ignoring those makes our implementation incorrect.

My solution - reading only the first line - has very similar properties to the checksum solution that we already have - but _it doesn't have that tradeoff_. It does require another set of tradeoffs, but as long as the conditions are met - it will just work properly every time.

Every solution will work every time if you assume that all of its required conditions are met :smile:

The said conditions are _known to be true_ in the kubernetes_logs source.

This needs to be made more clear. Are we talking about a solution specifically for the k8s logs source? Are we talking about something that's intended to be more general-purpose? Is it intended to replace the existing checksum algorithm?

With a clear understanding of the intended scope, it's impossible to evaluate changes like this.

The file server isn't for binary files, so why do we throw that property away when doing the fingerprinting?

This ignores our existing support for gzipped files.

they definitely do a lot of harm in the common case

To be clear, you're claiming that log files smaller than 256 bytes are the common case? We all understand that they are valid and possible, but our tradeoff was made on the assumption that it is extremely uncommon for real files to remain that small for any meaningful amount of time. If you disagree, it's far more helpful to explain why than to repeatedly claim that we're ignoring a "common case".

I personally just don't understand how it can make sense.

Statements like this are not helpful. Please be willing to accept that our position is the result of logical reasoning from a reasonable set of assumptions. If you can't understand why something is the way is it, ask for clarification. The implication that we've made unreasonable decisions to this point is a serious distraction from your actual proposal.

I already attempted to use the first approach, but in docker log format the log field comes first.

This is where I see the largest technical problem with your proposal. The combination of a maximum line length and the fact that timestamps do not come first in these formats means you need to set the max larger than the largest log message you expect to see. How do you make that decision without meaningful correctness tradeoffs?

lukesteensen on 29 Jun 2020

Another practical problem is when the first event is too long:

{"log":"... more than 256 symbols, easily possible with nested json ...", "time": "...", ... }

This demonstrates the case where current checksum wouldn't read long enough into the file, that the timestamp is covered, meaning the checksum will be generated solely on the payload of the log message. This may be very problematic, as in k8s it's a usual case when multiple Pods from a same ReplicaSet start up on the same node at the same time, and print the same startup messages. This will trick Vector into thinking the file was moved, rather than that those are two different files. If user is lucky, the inner json event will also have a timestamp with high precision, and that timestamp is included in the checksum - as that would mitigate the issue for them. That's far from reliable though.

MOZGIII on 29 Jun 2020

This ignores our existing support for gzipped files.

Oh, I missed this, true. :thinking: It explains a lot. :smile:

Is it intended to replace the existing checksum algorithm?

No, definitely not replace the current checksum, not without more confidence and discussions.

Are we talking about a solution specifically for the k8s logs source? Are we talking about something that's intended to be more general-purpose?

Initially - just for the k8s source. Once we discuss it in detail - maybe we can expose is as an option to the users at the file source too.

This is where I see the largest technical problem with your proposal. The combination of a maximum line length and the fact that timestamps do not come first in these formats means you need to set the max larger than the largest log message you expect to see. How do you make that decision without meaningful correctness tradeoffs?

In the kubernetes case, we'll be setting the max line length to the same value we use for max_line_bytes at the file source.
The value is 32 KiB. It's the next meaningful size value after 16 KiB, which is the maximum length of the payload at kubernetes log formats.
Messages with longer payloads are split into parts and encoded as separate lines. The effective max size of the raw log line that we'll observe is slightly bigger than 16 KiB - so 32 KiB is a safe bet.
In the ordinary case, the first line payload is usually about 128 bytes on average. This means that we won't be reading whole 32 KiB of data for every file - sometimes even less than the current checksum would.

I've left my system running k8s E2E tests with the implementation from https://github.com/timberio/vector/pull/2904 in a loop, and so far no failures :tada:

MOZGIII on 29 Jun 2020

Thanks @MOZGIII, you should have lead with that example 😄. Vector aims to "just work" for the common observability use case. That is:

Small files are rare.
If a file is small, it is unlikely to remain that way.
Checksumming works reliably across all platforms (ex: Windows).

That's it. We are not concerned about solving strange edge cases by default, such as:

Test environments that produce very small files that do not represent real-world usage.
Rapidly rotating files. This indicates that the user is using Vector in a non-observability capacity.

Solutions for you:

Have your tests create larger files that are more realistic.
Explore improving our checkpointing strategy beyond introducing a special internal mode. For example, if a file is only temporarily small why can't we include the file name in the fingerprint until we get enough bytes? Once we are able to satisfy the byte limit we use the file contents exclusively.

Hope that helps organize this discussion and move it forward.

binarylogic on 29 Jun 2020

😕1

Real world data point - we have a number of Python data processing jobs that chuck out a single short "Starting work" message at the start before spending a few hours doing data processing without (hopefully!) outputting anything until "Done" at the end. Getting that initial message come through is reassuring. I suspect _most_ would manage the 256 minimum with the expected final message, but wouldn't like to insist on it, and always getting something by the end would be mandatory.

While we could work around this by printing out "Lorem ipsum" filler at the start I think everyone would regard this at really a bit rubbish, and lead to a poor impression of the new system. I was surprised as to why they weren't coming through until I read the Vector output logs at INFO/DEBUG level. My colleagues who didn't just implement a new logging system will have no such clue & will waste time looking at for bugs that aren't there.

tyrken on 29 Jun 2020

👍1

Thanks, @tyrken, that's a useful data point. We definitely want use cases like that to just work as much as possible.

@MOZGIII

In the ordinary case, the first line payload is usually about 128 bytes on average.

Where are you getting this data?

The value is 32 KiB. It's the next meaningful size value after 16 KiB, which is the maximum length of the _payload_ at kubernetes log formats.

As I understand it, the best argument for a new mode is roughly as follows:

Kubernetes log formats do not reliably put timestamps at the beginning of each line, which means the default 256 bytes have a meaningfully high likelihood of not including a timestamp.
Kubernetes log formats do enforce a size limit per line, so we do not have to worry about setting our own limit that may be smaller than some users' longest lines.
We don't have to worry about all the other edge cases normal checksumming was designed to handle (e.g. headers, compressed files, arbitrary formats, etc)

lukesteensen on 29 Jun 2020

👍2

I just wish you'd made this clearer earlier in the conversation

Sorry about that, I was thinking it would make sense even as a generic solution, thus I didn't focus on the k8s use cases initially. Even though I need it specifically for the k8s use cases.

If that's all true, then the result is that we can reliably cover all k8s logs (even small ones) by checksumming the first line and I'm fine to proceed with adding an internal-only mode for that (if/how to present it to users is another discussion).

Yep, that's the case.

The biggest question I'm unsure of is around compressed files. My limited reading of k8s docs implies that files are rotated via normal systems means (e.g. logrotate) and not via some k8s-specific behavior. Does our implementation somehow ensure that we won't encounter rotated files that have been compressed?

This is a really good question. The rotated logs may or may not be gzipped. More specifically:

CRI gzips old rotated files - that means the current log file and the most recent rotated files aren't gzipped, older files are;
docker by default doesn't gzip the rotated files, but can be enabled to do so.

To offer compression support in the first line checksum, we can just uncompress the gzip stream and read the line of the compressed file (in streaming mode).

However, compressed rotated files might be one of the things that we don't need to bother to support: neither kubectl nor fluentd (the "official" log aggregation solution) don't support compressed files. It looks like it's generally viewed as a non-issue to not be able to pick up data from the gzipped files - it makes sense to some degree, as all compressed data was at some point uncompressed, and would've been handled.

Either way, I suggest we count this as something we can add later if we see need for it.

Our implementation currently only picks up .log files for processing, .gz files will be ignored.

@lukesteensen please take a look at the proposed implementation when you have time: https://github.com/timberio/vector/pull/2904

MOZGIII on 30 Jun 2020

It looks like it's generally viewed as a non-issue to not be able to pick up data from the gzipped files

The issue would be if we tried to read from a compressed file as if it were an uncompressed file. I agree we don't need to worry about full support for compression right now.

Our implementation currently only picks up .log files for processing, .gz files will be ignored.

This seems sufficient to ensure we won't attempt to read any compressed file. If we're confident in that I think we can move forward here.

lukesteensen on 30 Jun 2020

👍1

Thanks everyone! Closing this as done, the next step is #2926.

2926 isn't a blocker for the k8s stuff, so we can deal with it at a slower pace.

MOZGIII on 1 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Define how metrics are converted to logs

binarylogic · 3Comments

Benefits and downsides of `Atom`s

a-rodin · 3Comments

New `clickhouse` sink

jhgg · 4Comments

[aws_s3] vector is unable to start if ec2 instance does not have rw to S3 bucket.

kaarolch · 3Comments

Improve on-disk buffer benchmarks and collect more data

binarylogic · 3Comments