We should have first-class support for using Vector in infrastructures running k8s. This will involve a combination of good documentation and potentially a few k8s-specific parsers/transforms/sources/etc. This should include both ingesting and processing data from applications running in k8s, as well as best practices for running Vector itself in k8s.
Assumptions:
- Vector is available in some repository.
- A vector.yaml file is served on some URL. Let's call it yaml_url.
- kubectl is installed.

To install/run Vector with the default configuration, run:
kubectl apply -f yaml_url
To configure Vector, download or copy-paste the vector.yaml file, for example with:
wget yaml_url
Edit the TOML part of vector.yaml to configure Vector.
Let the path to the edited vector.yaml be yaml_path, then run:
kubectl apply -f yaml_path
which will install/run Vector with the edited configuration.
To update the configuration, edit the TOML part of vector.yaml and run:
kubectl apply -f yaml_path
vector.yaml

All Kubernetes and Vector configuration is in this one file. vector.toml, which is usually a separate file, is embedded, and clearly documented, inside vector.yaml.
Benefits of a single YAML file:
- http endpoint. This will also empower users in the same way.

kubernetes source

The kubernetes source ingests log data from the local Kubernetes node and outputs log events.
[sources.my_source_id]
# REQUIRED - General
type = "kubernetes" # must be: "kubernetes"
(EDIT: out of scope)
# Collect logs from kubernetes node components: kubelet, container runtime, kube-proxy.
# And also from master components, if kubernetes is configured to run user containers on master machine.
log_system = false # by default false
# OPTIONAL
(EDIT: covered by #1059)
# Collect logs from these pods.
named = ["pod_name"]
(EDIT: more info in todo section)
# And collect logs from pods with all of these requirements.
# Requirements are defined as in https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
# Example: ["environment = production","tier notin (frontend, backend)"]
match = ["kubernetes_requirement"]
If named and match are empty, the kubernetes source will collect logs from all applicable pods, except from itself.
Kubernetes has the CRI (Container Runtime Interface), which all container runtimes for Kubernetes should implement. Docker implements it fully, while OCI, rkt, Frakti, Containerd, and Singularity are an active work in progress.
The CRI defines how and where log files are to be stored. The kubernetes source can read those files to get logs from all containers on its node. This can be done with the file source, which has already been demonstrated to work by @LucioFranco.
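As a sketch, this approach could be approximated today with the existing file source. The glob below is an assumption based on the default kubelet layout, where each container's CRI log file is symlinked under /var/log/containers; the source name is illustrative only.

```toml
# Hypothetical sketch: approximating the kubernetes source with the
# existing file source. The glob assumes the default kubelet layout,
# where container logs are symlinked as
# /var/log/containers/<pod>_<namespace>_<container>-<id>.log.
[sources.kubernetes_logs]
type = "file"
include = ["/var/log/containers/*.log"]
```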
The Kubernetes documentation defines where Kubernetes node components keep their logs. These are also collectable with the file source and the journald source.
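For node components that log to journald (for example, kubelet on systemd-based distributions), a sketch could use the journald source. The unit names here are assumptions; they depend on how the node was provisioned.

```toml
# Hypothetical sketch: collecting node component logs with the
# journald source. Unit names vary between distributions/installers.
[sources.kubernetes_system]
type = "journald"
units = ["kubelet", "kube-proxy", "docker"]
```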
Applicable pods, that is, pods from which this implementation is capable of collecting logs, are those that have logging to a file configured. Docker has this as the default, and Kubernetes highly recommends it, as it also uses those logs for its own features. Therefore, if this implementation doesn't have access to some logs, then neither does Kubernetes. And since Kubernetes assumes that logging is then done in some other way by the user, this implementation assumes the same.
Communication between Vector nodes can be done with the vector source/sink pair.
Besides:
- message
- timestamp
- stream
- pod_uid
- container_name
- instance_number
- ~labels~ (Edit: will be part of later enrichment issue)

which are almost freely available, other information could be pulled over the Kubernetes API to enrich the Event. But I would delay this for now, as it can be added later. My main reason for this is that I expect testing this properly will take most of the time, and adding/testing things after that will be much easier once the base is already added/tested.
There are two base topologies that are to be supported from the start by having a dedicated vector.yaml file.
Matches Distributed topology in Vector Docs.
vector.yaml file for this topology would have:
- A DaemonSet with a template of the Vector agent. The TOML configuration inside of it would have a pre-added default kubernetes source configuration.

This configuration is a base for almost all other configurations/deployments.
Matches Centralized topology in Vector Docs.
This is an upgrade of the Distributed topology, with Vector also being on the downstream end of things. As such, vector.yaml for this is based on the Distributed version, with these additions:
- The TOML configuration of the Vector agent would have a pre-added vector sink configuration.
- A Deployment with a template of the Vector master. The TOML configuration of the Vector master would have a pre-added vector source configuration.

This configuration is a base for all configurations/deployments with the Centralized topology.
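The pre-added pieces described above could look roughly like the sketch below. The component names, address, and port are assumptions for illustration; in practice the address would come from Kubernetes Service discovery (for example via injected environment variables), as the full example later in this thread shows.

```toml
# Hypothetical sketch of the pre-added configuration.

# Agent side: forward collected events to the master.
[sinks.to_master]
type = "vector"
inputs = ["kubernetes_logs"]
address = "master-vector:9000" # assumed Service name and port

# Master side (a separate Vector instance): receive events from agents.
[sources.agents]
type = "vector"
address = "0.0.0.0:9000"
```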
- The kubernetes source implementation could spin up Vector sources dedicated to each present container runtime and aggregate logs from them. This could also be a fallback for the original implementation.
- The kubernetes source implementation could have only one master agent, which would collect logs from pods over the Kubernetes API logs command.

logging operator

This specification is compatible with the idea of a logging operator. So if it were ever to be implemented, it could be built upon this specification.
This section should describe:
- kubectl [apply|create] command.
- app=nginx).
- match filtering. Requires #1072. But where to put it? In that transform?

@lukesteensen as for this, do we think leveldb and rdkafka support is needed for this? I imagine we can produce a very small container that can run as a sidecar in kube without those two.
Our release binary is 25MB with those two and 19MB without them, so I don't think there's any good reason to exclude them. If anything it'd be confusing to have a k8s build that supported fewer features.
I would expect it to work the same way that fluentbit does in terms of log enrichment via the k8s api.
https://docs.fluentbit.io/manual/installation/kubernetes/
Created this issue as well: https://github.com/timberio/vector/issues/768 , I believe having an operator would be the way to go.
@ktff I've assigned this issue to you. As a first step, I'd like to finish the spec. Could you fill in the "Behavior" and "Requirements" sections above? Feel free to expand out as much as you'd like, whatever you need to describe how this will work.
Note: the first version of this can be simple; it does not need to include every feature. We are big fans of shipping in small incremental changes. Ex: maybe it makes sense to separate out the metadata enrichment as a follow-up PR.
I have filled in the Behavior section.
Prior art also includes Filebeat, which has a processor that adds K8s metadata: https://www.elastic.co/guide/en/beats/filebeat/master/add-kubernetes-metadata.html
@ktff thank you for writing this up!
I think this approach sounds generally pretty good!
vector.yaml

All Kubernetes and Vector configuration is in this one file. vector.toml, which is usually a separate file, is embedded, and clearly documented, inside vector.yaml.
Do you have an example of what this would look like? This kinda sounds a bit messy and something we may not want to do. Even for IDE's this will make the formatting harder.
Easy to create/share configurations. This would allow supporting a lot of common use cases out of the box with minimal effort.
I think actually embedding the toml within the yaml will make it less sharable since many users will share their configs as direct toml files, not as yaml.
The other option is to provide some packing tool that will generate a daemon set yaml with the provided toml embedded within it.
This also leads me to think that we should 100% provide a way to load a config via http and/or grpc. This would even allow in a centralized setup to only need one config since the master/primary can then supply a subset of that config to the agents. This also would allow us to uncouple the deployment of vector with its config. Aka introduces a kinda control layer. I will defer on this for now but its something we should think of as we introduce more complex setups like k8.
# Collect logs from kubernetes node components: kubelet, container runtime, kube-proxy.
# And also from master components, if kubernetes is configured to run user containers on master machine.
log_system = false # by default false
Does it make sense for the initial version to just support the container runtime api and defer this extra collecting to either a transform or a second version of the kube api? This seems _somewhat_ out of scope.
# Collect logs from these pods.
named = ["pod_name"]
Do we want to think about possibly supporting the kube selector api? I'm not sure how much work this would be but it could add a lot of value.
which are almost freely available, other information could be pulled over Kubernetes API to enrich the Event.
This is :+1: I think we would probably want to do this as a separate component anyways.
Topologies
There are two base topologies that are to be supported from the start by having a dedicated vector.yaml file.
As for the topologies, we should 100% start with the decentralized version. I think there are still many questions about how we will do the centralized version. Like do we support service discovery via the k8 api? etc
Overall, I think this approach is good! We should also think about supporting pulling the logs via file and supporting pulling logs via the docker.sock.
@LucioFranco thank you for the detailed feedback.
Do you have an example of what this would look like? This kinda sounds a bit messy and something we may not want to do. Even for IDE's this will make the formatting harder.
Here is an example of how it looks:
# Vector master
apiVersion: apps/v1
kind: Deployment
metadata:
  name: master-vector
  namespace: default
spec:
  selector:
    matchLabels:
      name: master-vector
  template:
    metadata:
      labels:
        name: master-vector
    spec:
      containers:
        - name: vector
          image: timberio/vector
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: CONFIG
              value: |
                # VECTOR.TOML
                # Set global options
                data_dir = "/var/lib/vector"

                [sources.agents]
                type = "vector"
                address = "0.0.0.0:$(MASTER_VECTOR_SERVICE_PORT)"
                shutdown_timeout_secs = 30 # default, seconds
# This line is not in VECTOR.TOML
---
# Vector master service
apiVersion: v1
kind: Service
metadata:
  name: master-vector
spec:
  selector:
    name: master-vector
  ports:
    - protocol: TCP
      port: 9000
---
# Vector agent
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vector
  namespace: default
spec:
  selector:
    matchLabels:
      name: vector
  template:
    metadata:
      labels:
        name: vector
    spec:
      containers:
        - name: vector
          image: timberio/vector
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: CONFIG
              value: |
                # VECTOR.TOML
                # Set global options
                data_dir = "/var/lib/vector"

                # Ingest logs from Kubernetes
                [sources.kubernetes_logs]
                type = "kubernetes"
                log_system = false
                match = ["environment = production","tier notin (frontend, backend)"]

                [sinks.my_sink_id]
                # REQUIRED - General
                type = "vector" # must be: "vector"
                inputs = ["kubernetes_logs"]
                address = "$(MASTER_VECTOR_SERVICE_HOST):$(MASTER_VECTOR_SERVICE_PORT)"
                # OPTIONAL - General
                healthcheck = true # default
# This line is not in VECTOR.TOML
Obviously there are things missing, but this is only an example.
IDEs should format the YAML correctly, but yes, the TOML part probably won't have special syntax highlighting. I have tried it out in online YAML formatters, and they deal with it nicely: they recognize the TOML as a string. Editing is also nice; try it out in an online YAML formatter. Visual Studio Code has almost the same behavior.
I think actually embedding the toml within the yaml will make it less sharable since many users will share their configs as direct toml files, not as yaml.
Generally in the Vector ecosystem, yes. But among those using Kubernetes, I suspect YAML will be more convenient, since various vector.toml configurations also require some coordinating configuration in the Kubernetes YAML file. Examples are:
- vector source/sink
- http sink

The other option is to provide some packing tool that will generate a daemon set yaml with the provided toml embedded within it.
The third option is to use ConfigMap feature like Fluentbit logging operator does. But in that case, the above example would require three files in total.
This also leads me to think that we should 100% provide a way to load a config via http and/or grpc. This would even allow in a centralized setup to only need one config since the master/primary can then supply a subset of that config to the agents. This also would allow us to uncouple the deployment of vector with its config. Aka introduces a kinda control layer. I will defer on this for now but its something we should think of as we introduce more complex setups like k8.
In any setup, only one YAML file is necessary. The reason is YAML's ability to have multiple documents in one file. The above configuration example has this.
Not all configurations can be achieved by only changing the TOML. For example: if any sink that serves data is added where there was none, a public IP address needs to be associated with the pod, and that requires configuration through Kubernetes, which can be done through YAML.
Does it make sense for the initial version to just support the container runtime api and defer this extra collecting to either a transform or a second version of the kube api? This seems somewhat out of scope.
I agree. This seems out of scope. I am for this being a separate feature of kube api.
Do we want to think about possibly supporting the kube selector api? I'm not sure how much work this would be but it could add a lot of value.
Do you mean label selectors? If yes, they are present.
# And collect logs from pods with all of these requirements.
# Requirements are defined as in https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
# Example: ["environment = production","tier notin (frontend, backend)"]
match = ["kubernetes_requirement"]
As for the topologies, we should 100% start with the decentralized version. I think there are still many questions about how we will do the centralized version. Like do we support service discovery via the k8 api? etc
Yes, it is doable with a Service. And fetching its IP is just a matter of using the right env var. The above configuration example has this.
We should also think about supporting pulling the logs via file and supporting pulling logs via the docker.sock.
I agree. And this is addable after, so this can be a separate issue for each container runtime source that Vector supports/will support.
I don't have all of the context here, but embedding TOML in the YAML is perfectly fine as a first step. I've seen this done before (ex: Elastic Beanstalk configuration). I don't think it's a blocker for the first version of this unless we have a lightweight alternative.
I don't have all of the context here, but embedding TOML in the YAML is perfectly fine as a first step.
Agreed. A potential next step that would be pretty simple could be a very basic "fetch config over http" feature.
@ktff
Ok, I think the embedded is fine for now but it seems like we will have to build a way to load the config via env var as well? Which I think is totally fine for now!
As for the config coordination, how do you expect that a vector to vector sink might find each other? I assume in a centralized setup we would have many agents to one server/master/primary. This master would live as some sort of pod that is discoverable through the k8 service discovery api. It looks like it can inject env vars for the destination so we should be able to set that up via env var injection into the config. 👍
In any setup, only one YAML file is necessary. The reason is YAML's ability to have multiple documents in one file. The above configuration example has this.
👍 This actually follows k8 config guidelines so that is good.
I agree. And this is addable after, so this can be a separate issue for each container runtime source that Vector supports/will support.
Agreed, this should be pretty easy to do!
I am on board with all this, thanks for explaining!
@LucioFranco
Ok, I think the embedded is fine for now but it seems like we will have to build a way to load the config via env var as well?
Yes.
It looks like it can inject env vars for the destination so we should be able to set that up via env var injection into the config.
Yes, in the above example that is visible as the $(MASTER_VECTOR_SERVICE_HOST) and $(MASTER_VECTOR_SERVICE_PORT) env vars. That feature will save us from a lot of issues.
It sounds like we're in agreement on the above spec. Nice work @ktff! I think we're ready to proceed with the work, unless you have any outstanding issues you'd like to discuss?
Before we dive, how do you want to break this up across pull requests? Do you want to address this in a single PR or break it up into steps?
Excellent.
There are a lot of moving parts in the specification, and around it. So going with smaller steps is the way.
I see three PRs:
- kubernetes source optional configuration.
- kubernetes source optional configuration.

@ktff 1. sounds 👍 to me, 2. I think maybe we can do last, I do feel like it is one area we have not spent much time on anyways. 3. Curious what you see this containing? Is this more related to adding additional k8 metadata to events or is there something else?
@LucioFranco 3. will need to fetch additional info on pods it encounters in the log folder. More specifically, name and label-value pairs. They are needed to support kubernetes source optional configuration.
Alright, we will do 2. last. So 1. 3. 2. is the order.
@ktff sounds good 👍, do you know if this pod level info is available on disk, will it require hooking into k8's api, or is it fetchable via env var?
@LucioFranco I know that it's available via k8's api, and how to hook into it. That's the worst-case scenario, but it's doable. I haven't encountered any better ways of getting them, but I also didn't specifically search for that. That was enough for the specification, and I plan to address it when it's 3.'s turn to be implemented.
A note:
There is a peculiarity around testing new Kubernetes features.
That is, since testing is conducted in a Kubernetes cluster, an image of Vector is necessary. That image should be made from the branch with the new Kubernetes features being tested. The PR can be merged with that, but after its merge a separate PR should change the pulled image from the custom one to timberio/vector:latest-alpine once a version with the change has been released.
Is there currently a way to use stable version of Vector within a kubernetes cluster and have at least basic info as event attributes (at least pod/container name, and ideally namespace)?
I'd love to switch from fluent* stack (or at least test it) since it gave me quite some headache lately. I don't really need advanced filtering for kubernetes logs, most of my filtering is based on log entry itself.
@Alexx-G there is. Current stable Vector contains the kubernetes source. It's alpha, but the only thing that remains to be changed is field naming. Its documentation is in the works, but here is the initial guide, which works.
Superseded by #2222.