We need a high-level plan/specification for how we want to complete this project. Afterwards, we'll create all of the issues representing this work.
I plan to break this project down into a handful of individual parts:
Basics

To be valuable, we need to get at least to item 3 (data validation). At this stage, we'd be able to set up a configuration and leave it running, knowing that we'd get alerted about any issues with the output data. There would still be a lot of gaps, but it'd help us smoke out some simple problems.
Environment

The goal here will be to build off of the existing test harness work to make it easy to spin up one or more boxes with Vector installed and ready to accept a given config. We want to be able to have multiple of these small test envs up and running at a time, ideally as independent as possible.
We also want to be able to configure upstream services like CloudWatch groups/streams, S3 buckets, etc.
Data generation

This will probably take the shape of a new tool that generates log data designed to be easy to check for consistency. For example, incrementing integers on every line makes it easy to identify gaps. At first, it will probably just write out to files. It can manually implement a number of different rotation strategies and intentionally try to seek out edge cases there. We'll also want to make sure that the throughput is configurable (maybe dynamically to induce occasional spikes).
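To make that concrete, here's a minimal sketch of what such a generator could look like (the file names and JSON shape are made up for illustration): each line carries an incrementing seq field, and the output file is "rotated" every N lines.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let lines_per_file: u64 = 10_000;
    let files: u64 = 10;
    let mut seq: u64 = 0;

    for generation in 0..files {
        // Naive "new file per generation" rotation; a real tool would also
        // exercise rename and copy/truncate strategies and vary throughput.
        let mut out = BufWriter::new(File::create(format!("generated-{generation}.log"))?);
        for _ in 0..lines_per_file {
            // The sequence number is the only field validation strictly needs;
            // extra fields just make the payload more realistic.
            writeln!(out, "{{\"seq\":{seq},\"msg\":\"synthetic log line\"}}")?;
            seq += 1;
        }
        out.flush()?;
    }
    Ok(())
}
```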
Data validation

This will consist of a "source" for every Vector sink being tested that can fetch the output data for comparison. It will then scan through it, using whatever scheme the generator used, to find and alert on any missing or reordered data.
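As a rough illustration, assuming output lines can be reduced back to the generator's sequence numbers, the core check could look something like this:

```rust
use std::collections::HashSet;

// Given the sequence numbers in the order they appeared in the output,
// report which ones are missing and how many adjacent pairs are out of order.
fn check_sequence(seqs: &[u64]) -> (Vec<u64>, usize) {
    let reordered = seqs.windows(2).filter(|w| w[1] < w[0]).count();

    let seen: HashSet<u64> = seqs.iter().copied().collect();
    let start = seqs.first().copied().unwrap_or(0);
    let max = seqs.iter().copied().max().unwrap_or(0);
    let missing = (start..=max).filter(|n| !seen.contains(n)).collect();

    (missing, reordered)
}

fn main() {
    let (missing, reordered) = check_sequence(&[0, 1, 3, 2, 5]);
    // missing: [4], out-of-order pairs: 1
    println!("missing: {:?}, out-of-order pairs: {}", missing, reordered);
}
```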
Next steps

Once we have those basics in place, there are a couple more things that should dramatically increase how much value we get out of the whole thing.
Fault injection

Adding fault injection should be pretty straightforward and would dramatically decrease the amount of time it takes to flush out edge cases. To start, we can add a network fault injection proxy like toxiproxy or muxy and configure sinks to run through it. If we have more time, we can explore more advanced things like namazu or krf. Some of these tools need a driver written to tell them what to do when, and we'll also want to collect logs against which we can correlate any data loss detected by validation.
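For the toxiproxy route, the driver could be a small program that talks to toxiproxy's HTTP control API. This is only a sketch against the default localhost:8474 endpoint; the exact endpoints, field names, and the proxy/toxic names used here should be checked against the toxiproxy version in use.

```rust
use serde_json::json;
use std::{thread, time::Duration};

fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    let api = "http://localhost:8474";

    // Route a downstream service through toxiproxy; the sink under test would
    // be pointed at 127.0.0.1:21212 instead of the real endpoint.
    client
        .post(format!("{api}/proxies"))
        .json(&json!({
            "name": "sink_proxy",
            "listen": "127.0.0.1:21212",
            "upstream": "127.0.0.1:9200"
        }))
        .send()?;

    for _ in 0..10 {
        // Inject 500ms of latency for a minute, then heal for a minute.
        client
            .post(format!("{api}/proxies/sink_proxy/toxics"))
            .json(&json!({
                "name": "slow",
                "type": "latency",
                "stream": "downstream",
                "toxicity": 1.0,
                "attributes": { "latency": 500, "jitter": 100 }
            }))
            .send()?;
        thread::sleep(Duration::from_secs(60));

        client
            .delete(format!("{api}/proxies/sink_proxy/toxics/slow"))
            .send()?;
        thread::sleep(Duration::from_secs(60));
    }
    Ok(())
}
```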
Tuning observability

With a steady stream of failures from fault injection, we should be in a good place to iterate on how Vector reacts to those failures. The first step will probably be better logs around these edge cases. After that, we can focus on situations like backpressure where there isn't necessarily one obvious place to log, but we still want to make the problem apparent to users. Finally, we can go through each of the types of failures (or partial failures or slowdowns) and make sure we're exposing enough metrics that they would show up on a dashboard at a glance.
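For the backpressure case specifically, one cheap signal is how long sends into a bounded channel block. This is not Vector's internal API, just a self-contained sketch of the idea; in a real build the timing would feed a histogram metric rather than a log line.

```rust
use std::time::{Duration, Instant};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u64>(8);

    // Deliberately slow consumer so the producer hits backpressure.
    tokio::spawn(async move {
        while let Some(_event) = rx.recv().await {
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    });

    for i in 0..100u64 {
        let start = Instant::now();
        if tx.send(i).await.is_err() {
            eprintln!("receiver dropped; event {i} lost");
            break;
        }
        let blocked_for = start.elapsed();
        // Surfacing slow sends makes backpressure visible even when there is
        // no single obvious error to log.
        if blocked_for > Duration::from_millis(5) {
            println!("send of event {i} blocked for {blocked_for:?}");
        }
    }
}
```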
It'd be neat to do some dogfooding here and monitor Vector with Vector, but that's not the first priority.
This seems sensible to me! In addition to verifying correctness, it might be interesting to measure and report:
This looks great!
One of the things I was thinking about is how we might be able to integrate this into our workflow. How do you see us reviewing reports and ensuring new PRs don't change behavior in unexpected ways?
This looks interesting and useful!
In addition, I think it might be worth fuzzing Vector with AFL/libFuzzer to find possible crashes or vulnerabilities on malformed input data.
For example, incrementing integers on every line makes it easy to identify gaps.
It could be interesting to monitor not only gaps/reordering, but also how often the same data happens to be sent two or more times according to Vector's "at least once" delivery guarantees.
@LucioFranco
One of the things I was thinking about is how we might be able to integrate this into our workflow. How do you see us reviewing reports and ensuring new PRs don't change behavior in unexpected ways?
I don't think we should expect to be able to gate all PRs on "passing" this environment. Ideally, issues found in this environment would be translated to the normal test suite to protect against regressions (similar to fuzzing or property-based tests). We can definitely have the ability to run these envs with arbitrary versions of Vector (e.g. one built from a PR), but that'd probably be something we decide to do for particularly relevant PRs only.
@a-rodin
In addition, I think it might be worth fuzzing Vector with AFL/libFuzzer to find possible crashes or vulnerabilities on malformed input data.
Definitely agree this would be interesting. The requirements are a bit different (probably simpler), but we should be able to reuse most, if not all, of the same infrastructure.
Fun fact: we've already done some targeted fuzzing of the tokenizer parser and added some unit tests based on crashes it found. This is another approach that I think can be super effective (i.e. more focused fuzzing of individual parser components).
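For reference, a focused fuzz target in the cargo-fuzz/libFuzzer style looks roughly like this; parse_tokenized here is a stand-in for whichever parser component is under test, not an actual Vector function.

```rust
// Lives under fuzz/fuzz_targets/ in the cargo-fuzz layout and is run with
// `cargo fuzz run <target>` rather than as a normal binary.
#![no_main]
use libfuzzer_sys::fuzz_target;

// Placeholder for the real parser; the fuzzer only cares that it doesn't
// panic or hang on arbitrary input.
fn parse_tokenized(input: &str) -> Vec<String> {
    input.split_whitespace().map(str::to_owned).collect()
}

fuzz_target!(|data: &[u8]| {
    if let Ok(s) = std::str::from_utf8(data) {
        let _ = parse_tokenized(s);
    }
});
```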
It could be interesting to monitor not only gaps/reordering, but also how often the same data happens to be sent two or more times according to Vector's "at least once" delivery guarantees.
Agreed! The goal will be a general purpose diff of the input and output, so duplicates should be caught as well.
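Counting duplicates falls out of the same sequence-number scheme, e.g. something along these lines:

```rust
use std::collections::HashMap;

// How many sequence numbers were delivered more than once (occasional
// duplicates are expected under at-least-once delivery).
fn count_duplicates(seqs: &[u64]) -> usize {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &s in seqs {
        *counts.entry(s).or_insert(0) += 1;
    }
    counts.values().filter(|&&c| c > 1).count()
}

fn main() {
    // Sequence 7 arrived twice, so one value was duplicated.
    println!("{}", count_duplicates(&[5, 6, 7, 7, 8])); // prints 1
}
```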
This is good! Maybe we can add test coverage reporting into the plan?
@loony-bean I'm not too sure we would get much out of coverage reporting; I've seen it be very flaky in Rust projects and hard to tune well enough to be a productive addition, imo.
@lukesteensen I think this is done, but we never created issues for the work here. Let's do this on Monday and close this out.