We need a high-level plan/specification for how we want to complete this project. Afterwards, we'll create all of the issues representing this work.
I plan to break this project down into a handful of individual parts:
Basics

To be valuable, we need to get at least to item 3 (data validation). At this stage, we'd be able to set up a configuration and leave it running, knowing that we'd get alerted about any issues with the output data. There would still be a lot of gaps, but it'd help us smoke out some simple problems.
Environment

The goal here will be to build off of the existing test harness work to make it easy to spin up one or more boxes with Vector installed and ready to accept a given config. We want to be able to have multiple of these small test envs up and running at a time, ideally as independent as possible.
We also want to be able to configure upstream services like CloudWatch groups/streams, S3 buckets, etc.
Data generation

This will probably take the shape of a new tool that generates log data designed to be easy to check for consistency. For example, incrementing integers on every line makes it easy to identify gaps. At first, it will probably just write out to files. It can manually implement a number of different rotation strategies and intentionally try to seek out edge cases there. We'll also want to make sure that the throughput is configurable (maybe dynamically to induce occasional spikes).
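To make that concrete, here's a minimal sketch of what such a generator could look like (the file names and JSON shape are made up for illustration): each line carries an incrementing seq field, and the output file is "rotated" every N lines.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let lines_per_file: u64 = 10_000;
    let files: u64 = 10;
    let mut seq: u64 = 0;

    for generation in 0..files {
        // Naive "new file per generation" rotation; a real tool would also
        // exercise rename and copy/truncate strategies and vary throughput.
        let mut out = BufWriter::new(File::create(format!("generated-{generation}.log"))?);
        for _ in 0..lines_per_file {
            // The sequence number is the only field validation strictly needs;
            // extra fields just make the payload more realistic.
            writeln!(out, "{{\"seq\":{seq},\"msg\":\"synthetic log line\"}}")?;
            seq += 1;
        }
        out.flush()?;
    }
    Ok(())
}
```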
Data validation

This will consist of a "source" for every Vector sink being tested that can fetch the output data for comparison. It will then scan through it, using whatever scheme the generator used, to find and alert on any missing or reordered data.
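As a rough illustration, assuming output lines can be reduced back to the generator's sequence numbers, the core check could look something like this:

```rust
use std::collections::HashSet;

// Given the sequence numbers in the order they appeared in the output,
// report which ones are missing and how many adjacent pairs are out of order.
fn check_sequence(seqs: &[u64]) -> (Vec<u64>, usize) {
    let reordered = seqs.windows(2).filter(|w| w[1] < w[0]).count();

    let seen: HashSet<u64> = seqs.iter().copied().collect();
    let start = seqs.first().copied().unwrap_or(0);
    let max = seqs.iter().copied().max().unwrap_or(0);
    let missing = (start..=max).filter(|n| !seen.contains(n)).collect();

    (missing, reordered)
}

fn main() {
    let (missing, reordered) = check_sequence(&[0, 1, 3, 2, 5]);
    // missing: [4], out-of-order pairs: 1
    println!("missing: {:?}, out-of-order pairs: {}", missing, reordered);
}
```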
Next steps

Once we have those basics in place, there are a couple more things that should dramatically increase how much value we get out of the whole thing.
Fault injection

Adding fault injection should be pretty straightforward and would dramatically decrease the amount of time it takes to flush out edge cases. To start, we can add a network fault injection proxy like toxiproxy or muxy and configure sinks to run through it. If we have more time, we can explore more advanced things like namazu or krf. Some of these tools need a driver written to tell them what to do when, and we'll also want to collect logs against which we can correlate any data loss detected by validation.
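For the toxiproxy route, the driver could be a small program that talks to toxiproxy's HTTP control API. This is only a sketch against the default localhost:8474 endpoint; the exact endpoints, field names, and the proxy/toxic names used here should be checked against the toxiproxy version in use.

```rust
use serde_json::json;
use std::{thread, time::Duration};

fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    let api = "http://localhost:8474";

    // Route a downstream service through toxiproxy; the sink under test would
    // be pointed at 127.0.0.1:21212 instead of the real endpoint.
    client
        .post(format!("{api}/proxies"))
        .json(&json!({
            "name": "sink_proxy",
            "listen": "127.0.0.1:21212",
            "upstream": "127.0.0.1:9200"
        }))
        .send()?;

    for _ in 0..10 {
        // Inject 500ms of latency for a minute, then heal for a minute.
        client
            .post(format!("{api}/proxies/sink_proxy/toxics"))
            .json(&json!({
                "name": "slow",
                "type": "latency",
                "stream": "downstream",
                "toxicity": 1.0,
                "attributes": { "latency": 500, "jitter": 100 }
            }))
            .send()?;
        thread::sleep(Duration::from_secs(60));

        client
            .delete(format!("{api}/proxies/sink_proxy/toxics/slow"))
            .send()?;
        thread::sleep(Duration::from_secs(60));
    }
    Ok(())
}
```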
Tuning observability

With a steady stream of failures from fault injection, we should be in a good place to iterate on how Vector reacts to those failures. The first step will probably be better logs around these edge cases. After that, we can focus on situations like backpressure where there isn't necessarily one obvious place to log, but we still want to make the problem apparent to users. Finally, we can go through each of the types of failures (or partial failures or slowdowns) and make sure we're exposing enough metrics that they would show up on a dashboard at a glance.
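For the backpressure case specifically, one cheap signal is how long sends into a bounded channel block. This is not Vector's internal API, just a self-contained sketch of the idea; in a real build the timing would feed a histogram metric rather than a log line.

```rust
use std::time::{Duration, Instant};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u64>(8);

    // Deliberately slow consumer so the producer hits backpressure.
    tokio::spawn(async move {
        while let Some(_event) = rx.recv().await {
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    });

    for i in 0..100u64 {
        let start = Instant::now();
        if tx.send(i).await.is_err() {
            eprintln!("receiver dropped; event {i} lost");
            break;
        }
        let blocked_for = start.elapsed();
        // Surfacing slow sends makes backpressure visible even when there is
        // no single obvious error to log.
        if blocked_for > Duration::from_millis(5) {
            println!("send of event {i} blocked for {blocked_for:?}");
        }
    }
}
```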
It'd be neat to do some dogfooding here and monitor Vector with Vector, but that's not the first priority.
This seems sensible to me! In addition to verifying correctness, it might be interesting to measure and report:
This looks great!
One of the things I was thinking about is how we might be able to integrate this into our workflow. How do you see us reviewing reports and ensuring new PRs don't change behavior in unexpected ways?
This looks interesting and useful!
In addition, I think it might be worth fuzzing Vector with AFL/libFuzzer to find possible crashes or vulnerabilities on malformed input data.
For example, incrementing integers on every line makes it easy to identify gaps.
It could be interesting to monitor not only gaps/reordering, but also how often the same data happens to be sent two or more times according to Vector's "at least once" delivery guarantees.
@LucioFranco
One of the things I was thinking about is how we might be able to integrate this into our workflow. How do you see us reviewing reports and ensuring new PRs don't change behavior in unexpected ways?
I don't think we should expect to be able to gate all PRs on "passing" this environment. Ideally, issues found in this environment would be translated to the normal test suite to protect against regressions (similar to fuzzing or property-based tests). We can definitely have the ability to run these envs with arbitrary versions of Vector (e.g. one built from a PR), but that'd probably be something we decide to do for particularly relevant PRs only.
@a-rodin
In addition, I think it might be worth fuzzing Vector with AFL/libFuzzer to find possible crashes or vulnerabilities on malformed input data.
Definitely agree this would be interesting. The requirements are a bit different (probably simpler), but we should be able to reuse most, if not all, of the same infrastructure.
Fun fact: we've already done some targeted fuzzing of the tokenizer parser and added some unit tests based on crashes it found. This is another approach that I think can be super effective (i.e. more focused fuzzing of individual parser components).
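For reference, a focused fuzz target in the cargo-fuzz/libFuzzer style looks roughly like this; parse_tokenized here is a stand-in for whichever parser component is under test, not an actual Vector function.

```rust
// Lives under fuzz/fuzz_targets/ in the cargo-fuzz layout and is run with
// `cargo fuzz run <target>` rather than as a normal binary.
#![no_main]
use libfuzzer_sys::fuzz_target;

// Placeholder for the real parser; the fuzzer only cares that it doesn't
// panic or hang on arbitrary input.
fn parse_tokenized(input: &str) -> Vec<String> {
    input.split_whitespace().map(str::to_owned).collect()
}

fuzz_target!(|data: &[u8]| {
    if let Ok(s) = std::str::from_utf8(data) {
        let _ = parse_tokenized(s);
    }
});
```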
It could be interesting to monitor not only gaps/reordering, but also how often the same data happens to be sent two or more times according to Vector's "at least once" delivery guarantees.
Agreed! The goal will be a general purpose diff of the input and output, so duplicates should be caught as well.
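Counting duplicates falls out of the same sequence-number scheme, e.g. something along these lines:

```rust
use std::collections::HashMap;

// How many sequence numbers were delivered more than once (occasional
// duplicates are expected under at-least-once delivery).
fn count_duplicates(seqs: &[u64]) -> usize {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &s in seqs {
        *counts.entry(s).or_insert(0) += 1;
    }
    counts.values().filter(|&&c| c > 1).count()
}

fn main() {
    // Sequence 7 arrived twice, so one value was duplicated.
    println!("{}", count_duplicates(&[5, 6, 7, 7, 8])); // prints 1
}
```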
This is good! Maybe we can add test coverage reporting into the plan?
@loony-bean I'm not too sure we would get much out of coverage reporting; I've seen it be very flaky in Rust projects and hard to tune well enough to be a productive addition, imo.
@lukesteensen I think this is done, but we never created issues for the work here. Let's do this on Monday and close this out.