Vector: Add new metadata for all of Vector's internal components

Created on 3 Oct 2020 · 17 Comments · Source: timberio/vector

As part of transitioning to the new metadata system, we'll be submitting PRs for each component. Each PR will contain the base definition file, but it is not complete! We'll need humans to finish off the final changes.

New metadata system

Cue

The new metadata files are written in Cue, a new configuration language from Google that is largely inspired by Jsonnet. You can read more about the differences here, but the big difference is the typing and validation. I actually started the transition with Jsonnet but it felt very error-prone, especially since a lot of this can become mind-numbing. It's very easy to forget to add an attribute, make a typo, etc.

Setting aside the strict validation, something I probably don't need to sell to a bunch of Rust developers, the language itself is thoughtful and elegant. The schema/policy/data tree-like organization described here is very powerful, and it's a pattern I tried to adopt. This is largely enabled by definitions. I would also recommend running through the basics tutorial; you can learn the language very quickly by browsing through it.

Validating

  1. Install cue.
  2. Run make check-docs.

What you need to do

  • [ ] Review the initial base data, change as necessary
  • [ ] Add one or more output definitions
  • [ ] Add one or more how_it_works definitions

Review base data, change as necessary

The base data was derived from the current metadata, but we still want to review it for opportunities to improve it.

Add one or more output definitions

Logs

Log components output events, and events have a schema. Currently, every component outputs a single event, but in the future, we will be adding components that output multiple events. For now, name your event something generic, like "line" for the file source, and define the schema accordingly.
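
For illustration, a log output definition for the file source might look roughly like the sketch below. The `output.logs.<event>.fields.<field>.type` path follows the schema discussed later in this thread; the `description` and `required` keys, the `file` field, and the example path are assumptions, so check them against docs/reference/components.cue before copying.

```cue
components: sources: file: output: logs: {
	// "line" is the generic event name suggested above.
	line: {
		description: "An individual line from a file." // assumed key
		fields: {
			// Hypothetical field, for illustration only.
			file: {
				description: "The path of the file the line was read from." // assumed key
				required:    true                                           // assumed key
				type: "string": examples: ["/var/log/app.log"]
			}
		}
	}
}
```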

Metrics

Metric components output metric events. Unlike log components, metric components almost always output multiple events.

Example

Add how_it_works sections

Finally, we want to move the "How it works" sections into the definitions.

Each component type automatically includes default sections based on the component type. Do not add these sections. See sources, transforms, and sinks sections.
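
As a rough sketch, a component-specific section could be added like this. The `checkpointing` key, its text, and the `title`/`body` field names are assumptions made for illustration; the authoritative shape is whatever the `how_it_works` schema in docs/reference/components.cue requires.

```cue
components: sources: file: how_it_works: {
	// Hypothetical section; do not repeat the defaults that the
	// component type already provides.
	checkpointing: {
		title: "Checkpointing" // assumed field name
		body: """
			Describe the behavior here, as it appeared in the old
			"How it works" section.
			"""
	}
}
```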

Example


All 17 comments

@binarylogic the transforms and sinks links at the bottom don't link to anything yet. I did check transforms (https://github.com/timberio/vector/blob/master/docs/reference/components/transforms.cue), but it doesn't include any default sections yet.

@binarylogic am I correct in assuming that the output section is only relevant to sources?

@JeanMertz will fix now.

@jszwedko it's for both source and transforms. I've updated the issue description to include a transform example.

We can use default values, for example for supported platforms:

--- a/docs/reference/components.cue
+++ b/docs/reference/components.cue
@@ -150,12 +150,12 @@ components: close({
       // The platforms that this component is available in. It is possible for
       // Vector to disable some components on a per-platform basis.
       platforms: {
-        "aarch64-unknown-linux-gnu": bool
-        "aarch64-unknown-linux-musl": bool
-        "x86_64-apple-darwin": bool
-        "x86_64-pc-windows-msv": bool
-        "x86_64-unknown-linux-gnu": bool
-        "x86_64-unknown-linux-musl": bool
+        "aarch64-unknown-linux-gnu": bool | *true
+        "aarch64-unknown-linux-musl": bool | *true
+        "x86_64-apple-darwin": bool | *true
+        "x86_64-pc-windows-msv": bool | *true
+        "x86_64-unknown-linux-gnu": bool | *true
+        "x86_64-unknown-linux-musl": bool | *true
       }

https://cuelang.org/docs/tutorials/tour/types/defaults/

@kirillt yep, although I stayed away from defaults for most options because I want us to address them. A problem with the last system was exactly that. We'd forget to address options because they weren't required. This makes them discoverable and explicit.
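
To make the trade-off concrete, here is a minimal sketch with a hypothetical two-platform schema trimmed from the diff above. One key has no default and one has a default of `true`; the `#Platforms` and `example` names are made up for this illustration.

```cue
#Platforms: {
	"x86_64-apple-darwin":      bool         // no default: must be set explicitly
	"x86_64-unknown-linux-gnu": bool | *true // default: true when omitted
}

// Only the key without a default has to be addressed here.
example: #Platforms & {
	"x86_64-apple-darwin": false
}
```

If the `"x86_64-apple-darwin"` line is removed from `example`, the field stays a non-concrete `bool`, and evaluating with `-c` (as in the `cue eval` command quoted later in this thread) should report it as incomplete. That is the "discoverable and explicit" behavior: a forgotten option surfaces as an error instead of silently inheriting a default.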

There's no "how it works" component reference for sinks yet? I see sources.cue and transforms.cue but not sinks.cue

We probably want to make some things stricter.

For instance, the _.output.logs._.fields.type:

type: {
    "*": {}
    "string"?: {
        examples: [string, ...string]
    }
    "timestamp"?: {
        examples: ["2020-11-01T21:15:47.443232Z"]
    }
}

The way it's currently defined allows a field to be any combination of the types. For example:

type: {
    string: {
        examples: ["qwe"]
    }
    timestamp: {}
}

is accepted, but that is clearly not intended. Our semantics are different: we really want exactly one of the possible values, not any combination of them.

Otoh, if we write the constraint as:

type:
    close({"*": {}}) |
    close({
        "string":
            close({
                examples: [string, ...string]
            })
    }) |
    close({
        timestamp: close({
            examples: ["2020-11-01T21:15:47.443232Z"]
        })
    })

The sample above is rejected as expected!

I think there are other places like this I didn't encounter yet.

This neat trick makes use of two properties of the language: closed structs and disjunctions of structs. It is important that closed structs are used; a simple disjunction of open structs is not enough. If any of the structs in the disjunction are open, the evaluation can't conclude that a particular struct will be the _only one_ to satisfy the requested disjunction, as an open struct may be extended to also satisfy a condition. Using closed structs prevents this possibility and allows the evaluation to determine which single struct of the disjunction currently satisfies the value.
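
A stripped-down way to try this out, using hypothetical names (`_kind`, `text`, `time`, `ok`, `bad`) rather than the real schema:

```cue
// Each alternative is closed, so a value can match only one of them.
_kind: close({text: {examples: [string, ...string]}}) |
	close({time: {examples: [string, ...string]}})

// Accepted: only the first alternative permits the "text" key.
ok: _kind & {text: examples: ["qwe"]}

// Uncommenting the next lines should make evaluation fail, because no
// single closed alternative permits both "text" and "time":
// bad: _kind & {
// 	text: examples: ["qwe"]
// 	time: examples: ["2020-11-01T21:15:47.443232Z"]
// }
```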

I hope this was educational, and not too long :D

Docs:

Also, I find this form of command really useful when working on the docs PRs:

find docs -name '*.cue' | xargs cue eval --all-errors -e 'components.sources.docker' -c

Might be useful to someone.

Ah! Yes, I like that, I'll add.

@binarylogic The relevant_when field defined in reference/components.cue is defined as a string:

      relevant_when?: string

Is that intentional? The old way would define this field more as a map, eg:

relevant_when = {mode = ["tcp", "udp"]}

Yep, it is. I'm leaving it open-ended since it can be much more complex than that.

Cool. So is something like:

relevant_when = "mode is tcp or udp"

the way to go?

Exactly, much easier to represent complex conditions. I would do:

relevant_when = "`mode` = `tcp` or `udp`"
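
Note that the snippets above use the old `key = value` notation. In a Cue file the same condition would be written as a field, roughly as sketched here; the `configuration: address` path is a hypothetical option chosen for illustration, and only the string type of `relevant_when` comes from this thread.

```cue
// Hypothetical option definition.
configuration: address: {
	// Free-form string, so complex conditions can be described in prose.
	relevant_when: "`mode` = `tcp` or `udp`"
}
```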

Hi @binarylogic, can you clarify what the encoding options mean? For example, what does json: null mean?

@juchiast it signals if the json encoding option is supported or not.

Then it seems like the encoding option isn't relevant, right? So I'd just do:

encoding: enabled: false