We should have first-class support for using Vector in infrastructures running k8s. This will involve a combination of good documentation and potentially a few k8s-specific parsers/transforms/sources/etc. This should include both ingesting and processing data from applications running in k8s, as well as best practices for running Vector itself in k8s.
Assumptions:
- Vector is available in some repository.
- A vector.yaml file is served on some URL. Let's call it yaml_url.
- kubectl is installed.

To install/run Vector with the default configuration, run:
kubectl apply -f yaml_url
To configure Vector, download or copy-paste the vector.yaml file, for example with:
wget yaml_url
Edit the TOML part of vector.yaml to configure Vector.
Let the path to the edited vector.yaml be yaml_path, then run:
kubectl apply -f yaml_path
which will install/run Vector with the edited configuration.
To update the configuration, edit the TOML part of vector.yaml and run:
kubectl apply -f yaml_path
vector.yaml

All Kubernetes and Vector configuration is in this one file. vector.toml, which is usually a separate file, is embedded, and clearly documented, inside vector.yaml.
Benefits of a single YAML file:
- http endpoint. This will also empower users in the same way.

kubernetes source

The kubernetes source ingests log data from the local Kubernetes node and outputs log events.
[sources.my_source_id]
# REQUIRED - General
type = "kubernetes" # must be: "kubernetes"
(EDIT: out of scope)
# Collect logs from kubernetes node components: kubelet, container runtime, kube-proxy.
# And also from master components, if kubernetes is configured to run user containers on master machine.
log_system = false # by default false
# OPTIONAL
(EDIT: covered by #1059)
# Collect logs from these pods.
named = ["pod_name"]
(EDIT: more info in todo section)
# And collect logs from pods with all of these requirements.
# Requirements are defined as in https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
# Example: ["environment = production","tier notin (frontend, backend)"]
match = ["kubernetes_requirement"]
If named and match are empty, the kubernetes source will collect logs from all applicable pods, except from itself.
Kubernetes has the CRI (Container Runtime Interface), which all container runtimes for Kubernetes should implement. Docker implements it fully, while OCI, rkt, Frakti, Containerd, and Singularity are an active work in progress.
The CRI defines how and where log files are to be stored. The kubernetes source can read those files to get logs from all containers on its node. This can be done with the file source, which has already been demonstrated to work by @LucioFranco.
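As a sketch, this approach could be approximated today with the existing file source. The glob below is an assumption based on the default kubelet layout, where each container's CRI log file is symlinked under /var/log/containers; the source name is illustrative only.

```toml
# Hypothetical sketch: approximating the kubernetes source with the
# existing file source. The glob assumes the default kubelet layout,
# where container logs are symlinked as
# /var/log/containers/<pod>_<namespace>_<container>-<id>.log.
[sources.kubernetes_logs]
type = "file"
include = ["/var/log/containers/*.log"]
```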
The Kubernetes documentation defines where Kubernetes node components keep their logs. These are also collectable with the file source and the journald source.
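For node components that log to journald (for example, kubelet on systemd-based distributions), a sketch could use the journald source. The unit names here are assumptions; they depend on how the node was provisioned.

```toml
# Hypothetical sketch: collecting node component logs with the
# journald source. Unit names vary between distributions/installers.
[sources.kubernetes_system]
type = "journald"
units = ["kubelet", "kube-proxy", "docker"]
```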
Applicable pods, that is, pods from which this implementation is capable of collecting logs, are those that have logging to a file configured. Docker has this as the default, and Kubernetes highly recommends it, as it also uses those logs for its own features. Therefore, if this implementation doesn't have access to some logs, then neither does Kubernetes. And since Kubernetes assumes that logging is then done in some other way by the user, this implementation assumes the same.
Communication between Vector nodes can be done with the vector source/sink pair.
Besides:
- message
- timestamp
- stream
- pod_uid
- container_name
- instance_number
- ~labels~ (Edit: will be part of later enrichment issue)

which are almost freely available, other information could be pulled over the Kubernetes API to enrich the Event. But I would delay this for now, as it can be added later. My main reason for this is that I expect testing this properly will take most of the time, and adding/testing things after that will be much easier once the base is already added/tested.
There are two base topologies that are to be supported from the start by having a dedicated vector.yaml file.
Matches Distributed topology in Vector Docs.
vector.yaml file for this topology would have:
- A DaemonSet with a template of the Vector agent. The TOML configuration inside of it would have a pre-added default kubernetes source configuration.

This configuration is a base for almost all other configurations/deployments.
Matches Centralized topology in Vector Docs.
This is an upgrade of the Distributed topology, with Vector also being on the downstream end of things. As such, vector.yaml for this is based on the Distributed version, with these additions:
- The TOML configuration of the Vector agent would have a pre-added vector sink configuration.
- A Deployment with a template of the Vector master. The TOML configuration of the Vector master would have a pre-added vector source configuration.

This configuration is a base for all configurations/deployments with the Centralized topology.
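The pre-added pieces described above could look roughly like the sketch below. The component names, address, and port are assumptions for illustration; in practice the address would come from Kubernetes Service discovery (for example via injected environment variables), as the full example later in this thread shows.

```toml
# Hypothetical sketch of the pre-added configuration.

# Agent side: forward collected events to the master.
[sinks.to_master]
type = "vector"
inputs = ["kubernetes_logs"]
address = "master-vector:9000" # assumed Service name and port

# Master side (a separate Vector instance): receive events from agents.
[sources.agents]
type = "vector"
address = "0.0.0.0:9000"
```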
- The kubernetes source implementation could spin up Vector sources dedicated to each present container runtime and aggregate logs from them. This could also be a fallback for the original implementation.
- The kubernetes source implementation could have only one master agent, which would collect logs from pods over the Kubernetes API logs command.

logging operator

This specification is compatible with the idea of a logging operator. So if it were ever to be implemented, it could be built upon this specification.
This section should describe:
- kubectl [apply|create] command.
- app=nginx).
- match filtering. Requires #1072. But where to put it? In that transform?

@lukesteensen as for this, do we think leveldb and rdkafka support is needed for this? I imagine we can produce a very small container that can run as a sidecar in kube without those two.
Our release binary is 25MB with those two and 19MB without them, so I don't think there's any good reason to exclude them. If anything it'd be confusing to have a k8s build that supported fewer features.
I would expect it to work the same way that fluentbit does in terms of log enrichment via the k8s api.
https://docs.fluentbit.io/manual/installation/kubernetes/
Created this issue as well: https://github.com/timberio/vector/issues/768 , I believe having an operator would be the way to go.
@ktff I've assigned this issue to you. As a first step, I'd like to finish the spec. Could you fill in the "Behavior" and "Requirements" sections above? Feel free to expand out as much as you'd like, whatever you need to describe how this will work.
Note: the first version of this can be simple; it does not need to include every feature. We are big fans of shipping in small incremental changes. Ex: maybe it makes sense to separate out the metadata enrichment as a follow-up PR.
I have filled in the Behavior section.
Prior art also includes Filebeat, which has a processor that adds K8s metadata: https://www.elastic.co/guide/en/beats/filebeat/master/add-kubernetes-metadata.html
@ktff thank you for writing this up!
I think this approach sounds generally pretty good!
vector.yaml

All Kubernetes and Vector configuration is in this one file. vector.toml, which is usually a separate file, is embedded, and clearly documented, inside vector.yaml.
Do you have an example of what this would look like? This kinda sounds a bit messy and something we may not want to do. Even for IDE's this will make the formatting harder.
Easy to create/share configurations. This would allow supporting a lot of common use cases out of the box with minimal effort.
I think actually embedding the toml within the yaml will make it less sharable since many users will share their configs as direct toml files, not as yaml.
The other option is to provide some packing tool that will generate a daemon set yaml with the provided toml embedded within it.
This also leads me to think that we should 100% provide a way to load a config via http and/or grpc. This would even allow in a centralized setup to only need one config since the master/primary can then supply a subset of that config to the agents. This also would allow us to uncouple the deployment of vector with its config. Aka introduces a kinda control layer. I will defer on this for now but its something we should think of as we introduce more complex setups like k8.
# Collect logs from kubernetes node components: kubelet, container runtime, kube-proxy.
# And also from master components, if kubernetes is configured to run user containers on master machine.
log_system = false # by default false
Does it make sense for the initial version to just support the container runtime api and defer this extra collecting to either a transform or a second version of the kube api? This seems _somewhat_ out of scope.
# Collect logs from these pods.
named = ["pod_name"]
Do we want to think about possibly supporting the kube selector api? I'm not sure how much work this would be but it could add a lot of value.
which are almost freely available, other information could be pulled over Kubernetes API to enrich the Event.
This is :+1: I think we would probably want to do this as a separate component anyways.
Topologies
There are two base topologies that are to be supported from the start by having a dedicated vector.yaml file.
As for the topologies, we should 100% start with the decentralized version. I think there are still many questions about how we will do the centralized version. Like do we support service discovery via the k8 api? etc
Overall, I think this approach is good! We should also think about supporting pulling the logs via file and supporting pulling logs via the docker.sock.
@LucioFranco thank you for the detailed feedback.
Do you have an example of what this would look like? This kinda sounds a bit messy and something we may not want to do. Even for IDE's this will make the formatting harder.
Here is an example of how it looks:
# Vector master
apiVersion: apps/v1
kind: Deployment
metadata:
  name: master-vector
  namespace: default
spec:
  selector:
    matchLabels:
      name: master-vector
  template:
    metadata:
      labels:
        name: master-vector
    spec:
      containers:
        - name: vector
          image: timberio/vector
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: CONFIG
              value: |
                # VECTOR.TOML
                # Set global options
                data_dir = "/var/lib/vector"

                [sources.agents]
                type = "vector"
                address = "0.0.0.0:$(MASTER_VECTOR_SERVICE_PORT)"
                shutdown_timeout_secs = 30 # default, seconds
# This line is not in VECTOR.TOML
---
# Vector master service
apiVersion: v1
kind: Service
metadata:
  name: master-vector
spec:
  selector:
    name: master-vector
  ports:
    - protocol: TCP
      port: 9000
---
# Vector agent
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vector
  namespace: default
spec:
  selector:
    matchLabels:
      name: vector
  template:
    metadata:
      labels:
        name: vector
    spec:
      containers:
        - name: vector
          image: timberio/vector
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: CONFIG
              value: |
                # VECTOR.TOML
                # Set global options
                data_dir = "/var/lib/vector"

                # Ingest logs from Kubernetes
                [sources.kubernetes_logs]
                type = "kubernetes"
                log_system = false
                match = ["environment = production","tier notin (frontend, backend)"]

                [sinks.my_sink_id]
                # REQUIRED - General
                type = "vector" # must be: "vector"
                inputs = ["kubernetes_logs"]
                address = "$(MASTER_VECTOR_SERVICE_HOST):$(MASTER_VECTOR_SERVICE_PORT)"
                # OPTIONAL - General
                healthcheck = true # default
# This line is not in VECTOR.TOML
Obviously there are things missing, but this is only an example.
IDEs should format the YAML correctly, but yes, the TOML part probably won't have special syntax highlighting. I have tried it out in online YAML formatters, and they deal with it nicely: they recognize the TOML as a string. Editing is also nice; try it out in an online YAML formatter. Visual Studio Code has almost the same behavior.
I think actually embedding the toml within the yaml will make it less sharable since many users will share their configs as direct toml files, not as yaml.
Generally in the Vector ecosystem, yes. But among those using Kubernetes, I suspect YAML will be more convenient, since various vector.toml configurations also require some coordinating configuration in the Kubernetes YAML file. Examples are:
- vector source/sink
- http sink

The other option is to provide some packing tool that will generate a daemon set yaml with the provided toml embedded within it.
The third option is to use ConfigMap feature like Fluentbit logging operator does. But in that case, the above example would require three files in total.
This also leads me to think that we should 100% provide a way to load a config via http and/or grpc. This would even allow in a centralized setup to only need one config since the master/primary can then supply a subset of that config to the agents. This also would allow us to uncouple the deployment of vector with its config. Aka introduces a kinda control layer. I will defer on this for now but its something we should think of as we introduce more complex setups like k8.
In any setup, only one YAML file is necessary. The reason is YAML's ability to have multiple documents in one file. The above configuration example has this.
Not all configurations can be achieved by only changing the TOML. For example: if any sink that serves data is added where there was none, a public IP address needs to be associated with the pod, and that requires configuration through Kubernetes, which can be done through YAML.
Does it make sense for the initial version to just support the container runtime api and defer this extra collecting to either a transform or a second version of the kube api? This seems somewhat out of scope.
I agree. This seems out of scope. I am for this being a separate feature of kube api.
Do we want to think about possibly supporting the kube selector api? I'm not sure how much work this would be but it could add a lot of value.
Do you mean label selectors? If yes, they are present.
# And collect logs from pods with all of these requirements.
# Requirements are defined as in https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
# Example: ["environment = production","tier notin (frontend, backend)"]
match = ["kubernetes_requirement"]
As for the topologies, we should 100% start with the decentralized version. I think there are still many questions about how we will do the centralized version. Like do we support service discovery via the k8 api? etc
Yes, it is doable with a Service. And fetching its IP is just a matter of using the right env var. The above configuration example has this.
We should also think about supporting pulling the logs via file and supporting pulling logs via the docker.sock.
I agree. And this is addable after, so this can be a separate issue for each container runtime source that Vector supports/will support.
I don't have all of the context here, but embedding TOML in the YAML is perfectly fine as a first step. I've seen this done before (ex: Elastic Beanstalk configuration). I don't think it's a blocker for the first version of this unless we have a lightweight alternative.
I don't have all of the context here, but embedding TOML in the YAML is perfectly fine as a first step.
Agreed. A potential next step that would be pretty simple could be a very basic "fetch config over http" feature.
@ktff
Ok, I think the embedded is fine for now but it seems like we will have to build a way to load the config via env var as well? Which I think is totally fine for now!
As for the config coordination, how do you expect that a vector to vector sink might find each other? I assume in a centralized setup we would have many agents to one server/master/primary. This master would live as some sort of pod that is discoverable through the k8 service discovery api. It looks like it can inject env vars for the destination so we should be able to set that up via env var injection into the config. 👍
In any setup, only one YAML file is necessary. The reason is YAML's ability to have multiple documents in one file. The above configuration example has this.
👍 This actually follows k8 config guidelines so that is good.
I agree. And this is addable after, so this can be a separate issue for each container runtime source that Vector supports/will support.
Agreed, this should be pretty easy to do!
I am on board with all this, thanks for explaining!
@LucioFranco
Ok, I think the embedded is fine for now but it seems like we will have to build a way to load the config via env var as well?
Yes.
It looks like it can inject env vars for the destination so we should be able to set that up via env var injection into the config.
Yes, in the above example that is visible as the $(MASTER_VECTOR_SERVICE_HOST) and $(MASTER_VECTOR_SERVICE_PORT) env vars. That feature will save us from a lot of issues.
It sounds like we're in agreement on the above spec. Nice work @ktff! I think we're ready to proceed with the work, unless you have any outstanding issues you'd like to discuss?
Before we dive, how do you want to break this up across pull requests? Do you want to address this in a single PR or break it up into steps?
Excellent.
There are a lot of moving parts in the specification, and around it. So going with smaller steps is the way.
I see three PRs:
- kubernetes source optional configuration.
- kubernetes source optional configuration.

@ktff 1. sounds 👍 to me, 2. I think maybe we can do last, I do feel like it is one area we have not spent much time on anyways. 3. Curious what you see this containing? Is this more related to adding additional k8 metadata to events or is there something else?
@LucioFranco 3. will need to fetch additional info on pods it encounters in the log folder. More specifically, name and label-value pairs. They are needed to support kubernetes source optional configuration.
Alright, we will do 2. last. So 1. 3. 2. is the order.
@ktff sounds good 👍, do you know if this pod level info is available on disk, will it require hooking into k8's api, or is it fetchable via env var?
@LucioFranco I know that it's available via k8's api, and how to hook into it. That's the worst-case scenario, but it's doable. I haven't encountered any better ways of getting them, but I also didn't specifically search for that. That was enough for the specification, and I plan to address it when it's 3.'s turn to be implemented.
A note:
There is a peculiarity around testing new Kubernetes features.
That is, since testing is conducted in a Kubernetes cluster, an image of Vector is necessary. That image should be made from the branch with the new Kubernetes features being tested. The PR can be merged with that, but after its merge a separate PR should change the pulled image from the custom one to timberio/vector:latest-alpine once a version with the change has been released.
Is there currently a way to use stable version of Vector within a kubernetes cluster and have at least basic info as event attributes (at least pod/container name, and ideally namespace)?
I'd love to switch from fluent* stack (or at least test it) since it gave me quite some headache lately. I don't really need advanced filtering for kubernetes logs, most of my filtering is based on log entry itself.
@Alexx-G there is. Current stable Vector contains the kubernetes source. It's alpha, but the only thing that remains to be changed is field naming. Its documentation is in the works, but here is the initial guide, which works.
Superseded by #2222.