We've already had a proposal for a new pipelines directive (https://github.com/timberio/vector/issues/1447) which allows a user to simply list component names in order to create a topology:
[pipelines]
p1 = ["tfm1", "tfm2"]
p2 = ["src1", "p1", "tfm3", "snk1"]
p3 = ["src2", "tfm1", "snk2"]
And based on these pipelines the inputs of each component would be implicitly populated in order to create the described topology.
With the original spec the above snippet would create unexpected side effects:
src1 would leak into snk2.
src2 would leak into tfm2, tfm3 and snk1.
One of the key strengths of our current spec is that it supports a wide range of topologies whilst retaining a flat configuration spec. If a new spec is to replace the current inputs way of life then it needs to support multiplexing sources and sinks.
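To make the leak concrete, here is a minimal Python sketch of the implicit population described above. It assumes the original spec simply links each listed component to the one before it and mutates the shared global component; that reading is an assumption, and the function names (expand, populate_inputs, reachable_from) are made up for illustration and are not part of Vector.

```python
def expand(name, pipelines):
    """Flatten a pipeline into component names, inlining nested pipelines."""
    chain = []
    for item in pipelines[name]:
        if item in pipelines:
            chain.extend(expand(item, pipelines))
        else:
            chain.append(item)
    return chain


def populate_inputs(pipelines):
    """Assumed behavior: each component consumes from its predecessor in the list."""
    inputs = {}
    for name in pipelines:
        chain = expand(name, pipelines)
        for upstream, downstream in zip(chain, chain[1:]):
            # Components are global, so every pipeline adds to the same input set.
            inputs.setdefault(downstream, set()).add(upstream)
    return inputs


def reachable_from(start, inputs):
    """Every component that ends up receiving events from `start`."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for component, upstreams in inputs.items():
            if node in upstreams and component not in seen:
                seen.add(component)
                frontier.append(component)
    return seen


pipelines = {
    "p1": ["tfm1", "tfm2"],
    "p2": ["src1", "p1", "tfm3", "snk1"],
    "p3": ["src2", "tfm1", "snk2"],
}
inputs = populate_inputs(pipelines)
leaks_src1 = reachable_from("src1", inputs)
leaks_src2 = reachable_from("src2", inputs)
```

Because tfm1 is shared, it ends up with both src1 and src2 as inputs, which is where the two leaks come from.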
However, expanding pipelines to support multiple inputs opens us up to situations where there's no clear behavior to expect. Given the following config:
[pipelines]
p1 = ["src1", "tfm1", "tfm2"]
p2 = ["src2", "p1", "snk1"]
It would make sense for Vector to construct the following topology:
src1 -> tfm1 -> tfm2 -> snk1
src2 -----------------> snk1
But it could also look like this:
src1 -> tfm1 -> tfm2 -> snk1
src2 ----^
Or this:
src2 -> tfm1 -> tfm2 -> snk1
Even if we have a clear spec, can we expect a user seeing this config for the first time to grok it? Multiple sinks have the same issue.
Based on the simpler spec for a compose transform (https://github.com/timberio/vector/issues/1653) we introduce a concept of component copies (versus references). This allows us to refer to the configuration of a component but create a copy for our pipeline rather than mutate the inputs of the global component. This gives us a mechanism for avoiding the problem of leaks.
Next, we expand the pipelines spec from before to explicitly state that a pipeline is a list that comes in three stages, in this order: inputs (sources or other pipelines), then transforms, then outputs (sinks).
If a pipeline does not follow this spec then we are able to deliver a clear error message.
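As a rough illustration of the kind of error this enables, the sketch below checks that a pipeline lists its stages in order. It assumes a hypothetical `kinds` lookup from component name to source, transform or sink, and it leaves out nested pipelines entirely. The function name and the error wording are invented here, not taken from Vector.

```python
STAGE_RANK = {"source": 0, "transform": 1, "sink": 2}


def validate_pipeline(name, items, kinds):
    """Raise ValueError unless items are ordered sources, transforms, sinks."""
    highest = 0
    previous = None
    for item in items:
        kind = kinds[item]
        rank = STAGE_RANK[kind]
        if rank < highest:
            # An earlier stage showed up after a later one.
            raise ValueError(
                f"pipeline '{name}': {kind} '{item}' cannot follow '{previous}'; "
                "expected sources first, then transforms, then sinks"
            )
        highest, previous = rank, item


kinds = {"src1": "source", "tfm3": "transform", "snk1": "sink"}
validate_pipeline("ok", ["src1", "tfm3", "snk1"], kinds)  # passes silently

try:
    validate_pipeline("bad", ["tfm3", "src1", "snk1"], kinds)
except ValueError as err:
    message = str(err)
```

The point is only that a fixed stage order lets the error name the offending component rather than silently building a surprising topology.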
With this spec I'm fairly confident that we can support all of the same topologies as we currently do without any unintended side effects. However, if we decide to go ahead with this proposal we need to investigate further.
I still feel as though this spec is somewhat hostile to new users. This is mostly down to the fact that you're looking at a linear list of component names as if they're all equivalent:
[pipelines]
p1 = [ "src2", "tfm1", "tfm2" ]
p2 = [ "src1", "p1", "tfm3", "snk1" ]
Whereas in reality you're looking at a combined list of three different element types, more clearly represented as:
[pipelines.p1]
inputs = [ "src2" ]
transforms = [ "tfm1", "tfm2" ]
[pipelines.p2]
inputs = [ "src1", "p1" ]
transforms = [ "tfm3" ]
outputs = [ "snk1" ]
(NOTE: I'm NOT suggesting this as a spec)
We're pushing the spec in favor of writing speed at the cost of readability, and I'm not 100% convinced the sacrifices we're making aren't going to sting new users trying to grok Vector configs (i.e. does src1 feed into p1?)
Exactly the same as proposal 1, except that we make the tiers more explicit: source and sink lists within a pipeline must themselves be arrays. This requirement exists purely to distinguish the tiers:
[pipelines]
p1 = [ [ "src2" ], "tfm1", "tfm2" ]
p2 = [ [ "src1", "p1" ], "tfm3", [ "snk1" ] ]
One key technical advantage over proposal 1 is that because we are explicitly declaring which components are inputs and which are simply transformations of the pipeline, we are now able to specify a transform as an input (and therefore a reference). This makes it possible to add the pipeline syntax into existing configs with transforms in the topology.
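A small Python sketch of how the nested-array form could be split into its tiers, assuming a leading array means inputs and a trailing array means outputs. The helper name split_tiers is hypothetical, and the handling of a pipeline that contains only one nested array is a guess rather than anything the proposal specifies.

```python
def split_tiers(items):
    """Split a pipeline list into (inputs, transforms, outputs)."""
    body = list(items)
    inputs, outputs = [], []
    if body and isinstance(body[0], list):
        inputs = body.pop(0)   # leading array: referenced inputs
    if body and isinstance(body[-1], list):
        outputs = body.pop()   # trailing array: referenced outputs
    if any(isinstance(item, list) for item in body):
        raise ValueError("nested arrays are only allowed first or last")
    return inputs, body, outputs


p1 = split_tiers([["src2"], "tfm1", "tfm2"])
p2 = split_tiers([["src1", "p1"], "tfm3", ["snk1"]])
lone = split_tiers([["snk1", "snk2", "snk3"]])
```

Note that this sketch treats a lone nested array as inputs, so a real implementation would need an explicit rule for that case.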
From the usability perspective a user familiar with the pipeline syntax is now able to glance at a pipeline and tell immediately how many sources and sinks it's connecting. A user not yet familiar with this syntax at least has _some_ indication that the first and last elements are special cases.
This syntax still doesn't provide a full picture, but merely a hint of what's going on. Adding brackets also adds more opportunities for typos to break the topology.
There's also the (unlikely) problem of pipelines that are only a list of sinks. Imagine if we were to create a group of sinks that all want to consume data from the same range of pipelines. For convenience we might group them in their own pipeline with something like:
[pipelines]
p1 = [ [ "src1" ], "tfm1", "tfm2" ]
p2 = [ [ "src2" ], "tfm3" ]
p3 = [ [ "snk1", "snk2", "snk3" ] ]
p4 = [ [ "p1" ], [ "p3" ] ]
p5 = [ [ "p2" ], "tfm4", [ "p3" ] ]
There's a more concise way of expressing these pipelines, but assuming that this were the best way to structure it then p3 could easily be mistaken at a glance for a pipeline of inputs.
Roughly the same as sub proposal 2 except in a structured format, with three fields:
[pipelines.p1]
inputs = [ "src1" ]
pipe = [ "tfm1", "tfm2" ]
outputs = [ "snk1" ]
inputs is an optional array of inputs, which can be sources, transforms or pipelines. These are always REFERENCES.
pipe is an optional array of transforms, or pipelines that contain neither sources nor sinks, to execute within the pipeline. These are always COPIES.
outputs is an optional array of sinks, or pipelines that contain sinks and no sources. These are always REFERENCES.
The name pipe was chosen here because it's short. It's a little odd because of the stuttering, but it makes the syntax for a simple pipeline of transforms _almost_ as short as proposal 1:
[pipelines]
p1.pipe = [ "tfm1", "tfm2" ]
p2.inputs = [ "src1" ]
p2.pipe = [ "tfm3", "tfm4" ]
p3.inputs = [ "src2", "p2" ]
p3.pipe = [ "p1", "tfm5" ]
p3.outputs = [ "snk1" ]
Other name candidates are do, exec, transforms, run, and literally anything else you lot can come up with.
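To show how the reference/copy split could avoid leaks, here is a hedged Python sketch that lowers the structured form into the existing flat inputs model. Several things are assumptions: the `<pipeline>.<component>` naming for copies, the rule that a pipeline used as an input resolves to its last transform, and the function name flatten. Pipelines inside pipe and cyclic references are deliberately not handled.

```python
def flatten(pipelines):
    """Return {component_id: [input ids]} for the existing flat inputs model."""
    flat = {}
    tails = {}  # pipeline name -> ids that carry its output

    def build(name):
        if name in tails:
            return tails[name]
        spec = pipelines[name]
        upstream = []
        for ref in spec.get("inputs", []):
            # References: a pipeline resolves to its tail, anything else to itself.
            upstream.extend(build(ref) if ref in pipelines else [ref])
        for tfm in spec.get("pipe", []):
            if tfm in pipelines:
                raise NotImplementedError("pipelines inside pipe are not sketched")
            copy_id = f"{name}.{tfm}"  # copies get a per-pipeline id
            flat[copy_id] = list(upstream)
            upstream = [copy_id]
        for sink in spec.get("outputs", []):
            flat.setdefault(sink, []).extend(upstream)  # sinks stay global
        tails[name] = upstream
        return upstream

    for name in pipelines:
        build(name)
    return flat


pipelines = {
    "p1": {"inputs": ["src1"], "pipe": ["tfm1", "tfm2"], "outputs": ["snk1"]},
    "p2": {"inputs": ["src2", "p1"], "pipe": ["tfm1"], "outputs": ["snk2"]},
}
flat = flatten(pipelines)
```

Both pipelines use tfm1, but each gets its own copy (p1.tfm1 and p2.tfm1), so src2 never reaches snk1.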
This has all of the advantages of proposal 2 along with clear naming in order to distinguish the three tiers of the pipeline even further. A new user not necessarily familiar with pipeline syntax is likely able to fully comprehend the topology expressed here.
It's more words.
I'd like to throw another proposal into the mix. One that explicitly uses & as a reference identifier and nested arrays as a fan-out/in technique:
[pipelines]
p1 = ["tfm1", ["tfm2", "tfm3"], "tfm4"]
p2 = ["&src1", "p1", "tfm3", "&snk1"]
p3 = ["&src2", "tfm1", "&snk2"]
If a component name is prefixed with & it is a _reference_.
If it is not prefixed with & then it is _copied_ and given a unique identifier (ex: p1.tfm1).
It's worth noting that a _copied_ component will get a unique ID that is used in logs, metrics, etc. For example:
p1.tfm1
p2.p1.tfm1
src1 <- since a reference is used
I dislike exposing the pointer/copy syntax at all to the user, but these are developers and I don't think this concept is too advanced. Alternatively, we could just "make it work" by assuming users want to copy transforms and reference sources/sinks.
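The ID scheme can be sketched in a few lines of Python. This only derives the identifiers, not the wiring between components, and it assumes nested pipelines prefix their own name onto the copies they contain. The function name component_ids is hypothetical.

```python
def component_ids(name, pipelines, prefix=""):
    """List the component IDs a pipeline would produce, in declaration order."""
    scope = f"{prefix}{name}."
    ids = []

    def visit(item):
        if isinstance(item, list):
            # Nested array: fan-out/in branches, each resolved the same way.
            for branch in item:
                visit(branch)
        elif item.startswith("&"):
            ids.append(item[1:])  # reference: keep the global name
        elif item in pipelines:
            ids.extend(component_ids(item, pipelines, scope))
        else:
            ids.append(scope + item)  # copy: scoped, unique ID

    for item in pipelines[name]:
        visit(item)
    return ids


pipelines = {
    "p1": ["tfm1", ["tfm2", "tfm3"], "tfm4"],
    "p2": ["&src1", "p1", "tfm3", "&snk1"],
    "p3": ["&src2", "tfm1", "&snk2"],
}
p1_ids = component_ids("p1", pipelines)
p2_ids = component_ids("p2", pipelines)
```

With this scheme the tfm3 copied via p1 (p2.p1.tfm3) and the tfm3 listed directly in p2 (p2.tfm3) stay distinct.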
These are all really interesting! I appreciate the time and effort spent trying to munge TOML into a useful graph language :smile:
My biggest question around all of these proposals is whether we're making our TOML complex enough that we lose the benefits of using TOML in the first place (simplicity, familiarity, etc). Because if that's the case, we'll end up with the worst of both worlds: an awkward and unnatural language for expressing graphs, and a config format that's difficult for new users to pick up.
I know writing our own config language is the nuclear option, but it is at least a valuable strawman to compare these proposals against.
Having written a config language for something like this, I would strongly advise against "the nuclear option". I found it far preferable to embed a scripting language (like Lua or JS, since we already use them) and let it deal with the complexities.
Ideally, I'd like to avoid conflating the _format_ of our config (TOML, YAML, DOT, custom, etc) with the _structure_ (pipelines versus inputs, vs something else) because they aren't necessarily the same and one can't be used to solve the other.
For example, we could explore DOT (https://github.com/timberio/vector/issues/1699) as an alternative to pipelines. However, in terms of structure it actually puts us in the same situation as the original pipelines spec, where we need to add more syntax (or assumptions) on top in order to distinguish between references/copies of a subgraph, otherwise we can't support snippet reuse.
This digression leads us into the exploration of syntax alone which I don't think is helpful unless we're committed to a certain structure. Vector components aren't generalised nodes on a graph, they have different _types_ (source, transform, sink), which each have their own rules. So when we create a structure for expressing chains of components we need to take that into account somehow. We also want to support snippet reuse without causing unexpected side effects.
If we can defer the decision of our config format then it allows us to choose the right structure for Vector, and then afterwards select a format that suits it well, instead of confusing the two and using one as a crutch for the other.
With that said I think it's worth doing a review of the structure concepts we currently have so that we're not comparing apples with oranges. I'm picking arbitrary names for these:
This is what we currently have. Each component is defined globally and selects the global siblings it wishes to consume from. This results in a flat list of components where the way in which they interact isn't immediately clear, and changing that often requires editing multiple places, giving ample opportunity for errors.
The compose transform proposal (https://github.com/timberio/vector/issues/1653) is an attempt to mitigate some of the pain points of writing and maintaining lists of transforms with this structure, but is a complement to the spec rather than a solution.
Stemming from the pipelines proposals, taking a lot of inspiration from graph syntaxes. Topologies are defined as linkable lists of component names. This allows the definition of complex graphs from linear arrays, making them easy to parse for both humans and machines.
This is something we haven't really explored yet as it's pretty much the opposite of the existing flattened structure, and is therefore the most extreme change. In a hierarchical structure there aren't necessarily any global components, just pipelines themselves, where each one specifies its sources, transforms, and sinks:
pipeline:
  sources:
    - type: foo
      some: field
    - type: bar
      some_other: dumb_field
  transforms:
    - type: a_thing
      do_it: "like this"
    - type: a_fork
      if: "field.type in [ doc, article, comment ]"
      then:
        - type: do_this
          wat: "this is another transform"
      else:
        - type: do_this_instead
          huh: "this is yet another transform"
  sinks:
    - type: baz
    - type: shared_channel
      called: foo
Pipelines can be linked to each other, which is how we might decide to handle content based multiplexing:
pipeline:
  sources:
    - type: shared_channel
      which: foo
  transforms:
    - type: remove_stuff_i_dont_want
      like: "field.message contains 'nah m8'"
  sinks:
    - type: boo
Note that this may seem very similar to sub proposal 3, but in fact it also requires the ability to inline transforms in order to have forked processing. This also means transforms themselves as part of their spec need to be able to define their children, so in reality this is still a far cry from pipelines.
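For the linking part, a minimal Python sketch of how pipelines might be matched up through shared channels. It assumes a sink with `called: X` feeds any source with `which: X`, as in the two examples above. The pipeline names ingest and filter, and the function name channel_links, are hypothetical, since the examples leave the pipelines unnamed.

```python
def channel_links(pipelines):
    """Return (upstream, downstream, channel) for every shared_channel match."""
    links = []
    for up_name, up in pipelines.items():
        for sink in up.get("sinks", []):
            if sink.get("type") != "shared_channel":
                continue
            for down_name, down in pipelines.items():
                for source in down.get("sources", []):
                    if (
                        source.get("type") == "shared_channel"
                        and source.get("which") == sink.get("called")
                    ):
                        links.append((up_name, down_name, sink["called"]))
    return links


# The two example pipelines, reduced to their sources and sinks.
pipelines = {
    "ingest": {
        "sources": [{"type": "foo"}, {"type": "bar"}],
        "sinks": [{"type": "baz"}, {"type": "shared_channel", "called": "foo"}],
    },
    "filter": {
        "sources": [{"type": "shared_channel", "which": "foo"}],
        "sinks": [{"type": "boo"}],
    },
}
links = channel_links(pipelines)
```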
Just noting that we've decided to defer this change, once again, because it is not obvious that this is a clear win. A couple of reasons:
It'll be obvious a few months from now if we want to do this. It should continue to pop up in conversations.
After implementing a pipeline longer than "hello-world" (2 sources, 8 transforms), I can confirm that this proposal looks very promising.
Thanks @anton-ryzhov, I'm curious, which one of the syntaxes would you prefer? Or do you have a different proposal?
Closing via https://github.com/timberio/vector/pull/4427