Vector: Proposal: Pipelines V2

Created on 3 Feb 2020 · 11 Comments · Source: timberio/vector

We've already had a proposal for a new pipelines directive (https://github.com/timberio/vector/issues/1447) which allows a user to simply list component names in order to create a topology:

[pipelines]
  p1 = ["tfm1", "tfm2"]
  p2 = ["src1", "p1", "tfm3", "snk1"]
  p3 = ["src2", "tfm1", "snk2"]

And based on these pipelines the inputs of each component would be implicitly populated in order to create the described topology.

Problems

Leaks

With the original spec the above snippet would create unexpected side effects:

Events from src1 would leak into snk2.
Events from src2 would leak into tfm2, tfm3 and snk1.
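To make the leak concrete, here is a minimal Python sketch of how the original spec might populate inputs. It is not Vector's implementation: the `expand` and `reachable` helpers are hypothetical, and it assumes every name in a pipeline refers to a single shared global component.

```python
from collections import defaultdict

# The example pipelines from above, as plain Python data.
PIPELINES = {
    "p1": ["tfm1", "tfm2"],
    "p2": ["src1", "p1", "tfm3", "snk1"],
    "p3": ["src2", "tfm1", "snk2"],
}

def expand(pipelines):
    """Derive each component's inputs from the ordered pipeline lists.

    Assumption: every name refers to one shared global component,
    which is what makes the leak possible.
    """
    def flatten(name, seen=()):
        # A pipeline name is replaced by its members, recursively.
        if name not in pipelines:
            return [name]
        if name in seen:
            raise ValueError(f"pipeline {name!r} refers to itself")
        chain = []
        for member in pipelines[name]:
            chain.extend(flatten(member, seen + (name,)))
        return chain

    inputs = defaultdict(set)
    for name in pipelines:
        chain = flatten(name)
        # Each component consumes from the one listed before it.
        for upstream, downstream in zip(chain, chain[1:]):
            inputs[downstream].add(upstream)
    return inputs

def reachable(inputs, start):
    """Return every component that events from `start` can reach."""
    downstream = defaultdict(set)
    for node, upstreams in inputs.items():
        for upstream in upstreams:
            downstream[upstream].add(node)
    seen, stack = set(), [start]
    while stack:
        for nxt in downstream[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(reachable(expand(PIPELINES), "src1")))
print(sorted(reachable(expand(PIPELINES), "src2")))
```

Because tfm1 ends up with both src1 and src2 as inputs, everything downstream of tfm1 in any pipeline receives events from both sources.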

Multiplexing

One of the key strengths of our current spec is that it supports a wide range of topologies whilst retaining a flat configuration spec. If a new spec is to replace the current inputs way of life then it needs to support multiplexing sources and sinks.

However, expanding pipelines to support multiple inputs opens us up to situations where there's no clear behavior to expect. Given the following config:

[pipelines]
  p1 = ["src1", "tfm1", "tfm2"]
  p2 = ["src2", "p1", "snk1"]

It would make sense for Vector to construct the following topology:

src1 -> tfm1 -> tfm2 -> snk1
src2 -----------------> snk1

But it could also look like this:

src1 -> tfm1 -> tfm2 -> snk1
src2 ----^

Or this:

src2 -> tfm1 -> tfm2 -> snk1

Even if we have a clear spec, can we expect a user seeing this config for the first time to grok it? Multiple sinks have the same issue.

Source

Jeffail

All 11 comments

Sub Proposal 1

Based on the simpler spec for a compose transform (https://github.com/timberio/vector/issues/1653) we introduce a concept of component copies (versus references). This allows us to refer to the configuration of a component but create a copy for our pipeline rather than mutate the inputs of the global component. This gives us a mechanism for avoiding the problem of leaks.

Next, we expand the pipelines spec from before to explicitly state that a pipeline is a list that comes in three stages:

An optional list of sources, or pipelines that contain sources and no sinks. These are always REFERENCES.
A list of transforms, or pipelines that contain neither sources nor sinks. These are always COPIES.
An optional list of sinks, or pipelines that contain sinks and no sources. These are always REFERENCES.

If a pipeline does not follow this spec then we are able to deliver a clear error message.
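As a rough illustration of how that check could work, here is a Python sketch. The `KINDS` table and the helper names are hypothetical, it assumes pipelines don't refer to themselves, and it only checks the ordering of the three stages.

```python
# Hypothetical component kinds; in Vector these would come from the config.
KINDS = {
    "src1": "source", "src2": "source",
    "tfm1": "transform", "tfm2": "transform", "tfm3": "transform",
    "snk1": "sink",
}

PIPELINES = {
    "p1": ["src2", "tfm1", "tfm2"],
    "p2": ["src1", "p1", "tfm3", "snk1"],
}

def contents(name, kinds, pipelines):
    """Return the set of component kinds inside a component or pipeline."""
    if name in pipelines:
        found = set()
        for member in pipelines[name]:
            found |= contents(member, kinds, pipelines)
        return found
    return {kinds[name]}

def stage(name, kinds, pipelines):
    """Return 0 (source stage), 1 (transform stage) or 2 (sink stage)."""
    found = contents(name, kinds, pipelines)
    has_sources, has_sinks = "source" in found, "sink" in found
    if has_sources and has_sinks:
        raise ValueError(f"{name!r} contains both sources and sinks")
    if has_sources:
        return 0
    if has_sinks:
        return 2
    return 1

def validate(pipeline, kinds, pipelines):
    """Check that the stages appear in order; return the stage of each element."""
    stages = [stage(member, kinds, pipelines) for member in pipelines[pipeline]]
    if stages != sorted(stages):
        raise ValueError(
            f"pipeline {pipeline!r} must list sources, then transforms, then sinks"
        )
    return stages

print(validate("p2", KINDS, PIPELINES))
```

Here p1 counts as a source-stage element of p2 because it contains a source and no sinks.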

With this spec I'm fairly confident that we can support all of the same topologies as we currently do without any unintended side effects. However, if we decide to go ahead with this proposal we need to investigate further.

Remaining Issues

I still feel as though this spec is somewhat hostile to new users. This is mostly down to the fact that you're looking at a linear list of component names as if they're all equivalent:

[pipelines]
  p1 = [ "src2", "tfm1", "tfm2" ]
  p2 = [ "src1", "p1", "tfm3", "snk1" ]

Whereas in reality you're looking at a combined list of three different element types, more clearly represented as:

[pipelines.p1]
  inputs = [ "src2" ]
  transforms = [ "tfm1", "tfm2" ]

[pipelines.p2]
  inputs = [ "src1", "p1" ]
  transforms = [ "tfm3" ]
  outputs = [ "snk1" ]

(NOTE: I'm NOT suggesting this as a spec)

We're pushing the spec in favor of writing speed at the cost of readability, and I'm not 100% convinced the sacrifices we're making aren't going to sting new users trying to grok Vector configs (e.g. does src1 feed into p1?).

Jeffail on 4 Feb 2020

Sub Proposal 2

Exactly the same as proposal 1 except we make it more explicit:

An optional array element of inputs, which can be sources, transforms or pipelines. These are always REFERENCES.
A list of transforms, or pipelines that contain neither sources nor sinks. These are always COPIES.
An optional array element of sinks, or pipelines that contain sinks and no sources. These are always REFERENCES.

The change being that source and sink lists within a pipeline must themselves be in an array. The purpose of this requirement is purely for the sake of distinguishing the tiers:

[pipelines]
  p1 = [ [ "src2" ], "tfm1", "tfm2" ]
  p2 = [ [ "src1", "p1" ], "tfm3", [ "snk1" ] ]
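For illustration, a parser could pull the tiers apart along these lines. This is only a Python sketch over already-parsed lists: `split_tiers` is a hypothetical helper, and treating a lone array as inputs is an arbitrary choice that the proposal doesn't specify.

```python
def split_tiers(elements):
    """Split a pipeline into (inputs, transforms, sinks).

    A leading array is read as inputs and a trailing array as sinks.
    Assumption: a pipeline holding a single array is read as inputs.
    """
    items = list(elements)
    inputs = items.pop(0) if items and isinstance(items[0], list) else []
    sinks = items.pop() if items and isinstance(items[-1], list) else []
    if any(isinstance(item, list) for item in items):
        raise ValueError("arrays are only allowed as the first or last element")
    return inputs, items, sinks

print(split_tiers([["src2"], "tfm1", "tfm2"]))
print(split_tiers([["src1", "p1"], "tfm3", ["snk1"]]))
```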

Advantages

One key technical advantage over proposal 1 is that because we are explicitly declaring which components are inputs and which are simply transformations of the pipeline, we are now able to specify a transform as an input (and therefore a reference). This makes it possible to add the pipeline syntax into existing configs with transforms in the topology.

From the usability perspective, a user familiar with the pipeline syntax is now able to glance at a pipeline and tell immediately how many sources and sinks it's connecting. A user not yet familiar with this syntax at least has _some_ indication that the first and last elements are special cases.

Remaining Issues

This syntax still doesn't provide a full picture, but merely a hint of what's going on. Adding brackets also adds more opportunities for typos to break the topology.

There's also the (unlikely) problem of pipelines that are only a list of sinks. Imagine if we were to create a group of sinks that all want to consume data from the same range of pipelines. For convenience we might group them in their own pipeline with something like:

[pipelines]
  p1 = [ [ "src1" ], "tfm1", "tfm2" ]
  p2 = [ [ "src2" ], "tfm3" ]
  p3 = [ [ "snk1", "snk2", "snk3" ] ]
  p4 = [ [ "p1" ], [ "p3" ] ]
  p5 = [ [ "p2" ], "tfm4", [ "p3" ] ]

There's a more concise way of expressing these pipelines, but assuming that this were the best way to structure it then p3 could easily be mistaken at a glance for a pipeline of inputs.

Jeffail on 4 Feb 2020

Sub Proposal 3

Roughly the same as sub proposal 2 except in a structured format, with three fields:

[pipelines.p1]
  inputs = [ "src1" ]
  pipe = [ "tfm1", "tfm2" ]
  outputs = [ "snk1" ]

The field inputs is an optional array element of inputs, which can be sources, transforms or pipelines. These are always REFERENCES.
The field pipe is an optional array of transforms, or pipelines that contain neither sources nor sinks, to execute within the pipeline. These are always COPIES.
The field outputs is an optional array element of sinks, or pipelines that contain sinks and no sources. These are always REFERENCES.

The name pipe was chosen here because it's short. It's a little odd because of the stuttering (pipelines.p1.pipe), but it makes the syntax for a simple pipeline of transforms _almost_ as short as proposal 1:

[pipelines]
  p1.pipe = [ "tfm1", "tfm2" ]

  p2.inputs = [ "src1" ]
  p2.pipe = [ "tfm3", "tfm4" ]

  p3.inputs = [ "src2", "p2" ]
  p3.pipe = [ "p1", "tfm5" ]
  p3.outputs = [ "snk1" ]

Other name candidates are do, exec, transforms, run, and literally anything else you lot can come up with.

Advantages

This has all of the advantages of proposal 2 along with clear naming in order to distinguish the three tiers of the pipeline even further. A new user not necessarily familiar with pipeline syntax is likely able to fully comprehend the topology expressed here.

Remaining Issues

It's more words.

Jeffail on 4 Feb 2020

👍1

Sub Proposal 4

I'd like to throw another proposal into the mix. One that explicitly uses & as a reference identifier and nested arrays as a fan-out/in technique:

[pipelines]
  p1 = ["tfm1", ["tfm2", "tfm3"], "tfm4"]
  p2 = ["&src1", "p1", "tfm3", "&snk1"]
  p3 = ["&src2", "tfm1", "&snk2"]

If an identifier is prefixed with & it is a _reference_.
If an identifier is _not_ prefixed with & then it is _copied_ and given a unique identifier (ex: p1.tfm1).
Arrays allow users to fan-out/in.
Sources and sinks _MUST_ be referenced, they cannot be copied. An error will be thrown otherwise.

Identifiers and observability

It's worth noting that a _copied_ component will get a unique ID that is used in logs, metrics, etc.

p1.tfm1
p2.p1.tfm1
src1 <- since a reference is used
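A rough Python sketch of how those IDs might be derived follows. The `component_ids` helper is hypothetical, it handles only one level of nested arrays, and how fan-out branches are wired together is left out entirely.

```python
PIPELINES = {
    "p1": ["tfm1", ["tfm2", "tfm3"], "tfm4"],
    "p2": ["&src1", "p1", "tfm3", "&snk1"],
    "p3": ["&src2", "tfm1", "&snk2"],
}

def component_ids(name, pipelines, prefix=""):
    """List the IDs of the components that pipeline `name` would create or use."""
    scope = f"{prefix}{name}."
    ids = []
    for element in pipelines[name]:
        # A nested array is a fan-out/in group; treat it as several identifiers.
        group = element if isinstance(element, list) else [element]
        for ident in group:
            if ident.startswith("&"):
                # Reference: keep the global ID.
                ids.append(ident[1:])
            elif ident in pipelines:
                # Nested pipeline: copy its members under this pipeline's scope.
                ids.extend(component_ids(ident, pipelines, scope))
            else:
                # Copy: prefix with the enclosing pipeline name(s).
                ids.append(scope + ident)
    return ids

print(component_ids("p1", PIPELINES))
print(component_ids("p2", PIPELINES))
```

With this scheme tfm3 shows up twice inside p2, as p2.p1.tfm3 and p2.tfm3, and the two are independent copies.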

Advantages

This borrows from the Go pointer syntax (and possibly other languages), so it doesn't feel as foreign.
The syntax is cleaner. Nested arrays more clearly communicate the topology.
The default of copying for transforms should solve the common use cases without the user having to understand all of the copy/reference edge-cases.
It allows for advanced use cases where a transform should be referenced.

Remaining issues

I dislike exposing the pointer/copy syntax at all to the user, but these are developers and I don't think this concept is too advanced. Alternatively, we could just "make it work" by assuming users want to copy transforms and reference sources/sinks.

binarylogic on 4 Feb 2020

These are all really interesting! I appreciate the time and effort spent trying to munge TOML into a useful graph language :smile:

My biggest question around all of these proposals is whether we're making our TOML complex enough that we lose the benefits of using TOML in the first place (simplicity, familiarity, etc). Because if that's the case, we'll end up with the worst of both worlds: an awkward and unnatural language for expressing graphs, and a config format that's difficult for new users to pick up.

I know writing our own config language is the nuclear option, but it is at least a valuable strawman to compare these proposals against.

lukesteensen on 4 Feb 2020

Having written a config language for something like this, I would strongly advise against "the nuclear option". I found it far preferable to embed a scripting language (like Lua or JS, since we already use them) and let it deal with the complexities.

bruceg on 4 Feb 2020

Ideally, I'd like to avoid conflating the _format_ of our config (TOML, YAML, DOT, custom, etc) with the _structure_ (pipelines versus inputs, vs something else) because they aren't necessarily the same and one can't be used to solve the other.

For example, we could explore DOT (https://github.com/timberio/vector/issues/1699) as an alternative to pipelines. However, in terms of structure it actually puts us in the same situation as the original pipelines spec, where we need to add more syntax (or assumptions) on top in order to distinguish between references/copies of a subgraph, otherwise we can't support snippet reuse.

This digression leads us into the exploration of syntax alone which I don't think is helpful unless we're committed to a certain structure. Vector components aren't generalised nodes on a graph, they have different _types_ (source, transform, sink), which each have their own rules. So when we create a structure for expressing chains of components we need to take that into account somehow. We also want to support snippet reuse without causing unexpected side effects.

If we can defer the decision of our config format then it allows us to choose the right structure for Vector, and then afterwards select a format that suits it well, instead of confusing the two and using one as a crutch for the other.

With that said I think it's worth doing a review of the structure concepts we currently have so that we're not comparing apples with oranges. I'm picking arbitrary names for these:

Flattened

This is what we currently have. Each component is defined globally and selects the global siblings it wishes to consume from. This results in a flat list of components where the way in which they interact isn't immediately clear, and changing that often requires editing multiple places, giving ample opportunity for errors.

The compose transform proposal (https://github.com/timberio/vector/issues/1653) is an attempt to mitigate some of the pain points of writing and maintaining lists of transforms with this structure, but is a complement to the spec rather than a solution.

Advantages

Limited indentation, or nesting.
Complex fan-out topologies with multiplexing are much easier to write.

Disadvantages

As topologies scale they become increasingly difficult to decipher.

Pipelines

Stemming from the pipelines proposals, taking a lot of inspiration from graph syntaxes. Topologies are defined as linkable lists of component names. This allows the definition of complex graphs from linear arrays, making them easy to parse for both humans and machines.

Advantages

Limited indentation, or nesting.
Topologies are much easier to write and edit.
Topologies are (potentially) easier to read.
Is complementary to our existing flattened (global components) structure.

Disadvantages

Linear lists of components have the potential to obfuscate component _types_, which are still relevant for understanding a Vector topology.
Doesn't necessarily lend itself well to graph snippet reuse; it requires odd syntaxes or "magic" behavior to avoid bleeding and other side effects.

Hierarchical

This is something we haven't really explored yet as it's pretty much the opposite of the existing flattened structure, and is therefore the most extreme change. In a hierarchical structure there aren't necessarily any global components, just pipelines themselves, where each one specifies its sources, transforms, and sinks:

pipeline:
  sources:
    - type: foo
      some: field
    - type: bar
      some_other: dumb_field

  transforms:
    - type: a_thing
      do_it: "like this"
    - type: a_fork
      if: "field.type in [ doc, article, comment ]"
      then:
        - type: do_this
          wat: "this is another transform"
      else:
        - type: do_this_instead
          huh: "this is yet another transform"

  sinks:
    - type: baz
    - type: shared_channel
      called: foo

Pipelines can be linked to each other, which is how we might decide to handle content based multiplexing:

pipeline:
  sources:
    - type: shared_channel
      which: foo

  transforms:
    - type: remove_stuff_i_dont_want
      like: "field.message contains 'nah m8'"

  sinks:
    - type: boo

Note that this may seem very similar to sub proposal 3, but in fact it also requires the ability to inline transforms in order to have forked processing. This also means transforms themselves as part of their spec need to be able to define their children, so in reality this is still a far cry from pipelines.
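To illustrate what "transforms define their children" could mean, here is a small Python sketch that models transforms as plain functions. Everything in it is hypothetical: `chain`, `fork` and `tag` are made-up helpers, events are modelled as dicts, and the fork condition is simplified to a check on a `type` key.

```python
def tag(name):
    """Stand-in transform: records its name on the event without mutating it."""
    def run(event):
        return [{**event, "seen": event.get("seen", []) + [name]}]
    return run

def chain(*steps):
    """Run transforms in order; each step may emit zero or more events."""
    def run(event):
        events = [event]
        for step in steps:
            events = [out for current in events for out in step(current)]
        return events
    return run

def fork(predicate, then, otherwise):
    """A transform that owns its children and picks a branch per event."""
    def run(event):
        return then(event) if predicate(event) else otherwise(event)
    return run

# Mirrors the shape of the transforms section in the first example.
transforms = chain(
    tag("a_thing"),
    fork(
        lambda event: event.get("type") in {"doc", "article", "comment"},
        chain(tag("do_this")),
        chain(tag("do_this_instead")),
    ),
)

print(transforms({"type": "doc"}))
print(transforms({"type": "log"}))
```

The branches live inside the fork itself, so there are no global component names to wire up and nothing for other pipelines to leak into.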

Advantages

What you see is what you get, very easy to understand and easy to write.
Components are connected by composition and therefore can be referenced and reused however you like without confusion or side effects.
Transforms are expressed as actual functions rather than connected nodes with their own inputs.

Disadvantages

Lots of nesting (this is where format might come in).
For us specifically it's a stretch from what we have already, and doesn't necessarily provide the value that the effort would require in order to be justified.

Jeffail on 5 Feb 2020

Just noting that we've decided to defer this change, once again, because it is not obvious that this is a clear win. A couple of reasons:

It's possible that other tangential changes will make this moot (#1327 and #1328). I'd like to see more progress on these issues before moving forward with this.
We need more feedback from users.

It'll be obvious a few months from now if we want to do this. It should continue to pop up in conversations.

binarylogic on 6 Feb 2020

After implementing a pipeline longer than "hello-world" (2 sources, 8 transforms), I can confirm that this proposal looks very promising.

anton-ryzhov on 7 Feb 2020

👍5

Thanks @anton-ryzhov, I'm curious, which one of the syntaxes would you prefer? Or do you have a different proposal?

binarylogic on 7 Feb 2020

Closing via https://github.com/timberio/vector/pull/4427

binarylogic on 12 Nov 2020
