Vector: Add new `source_type` field to Vector's log schema

Created on 25 Mar 2020 · 14 Comments · Source: timberio/vector

As discussed in https://github.com/timberio/vector/issues/2142#issuecomment-603864777, we should add a new field to our log schema: source_type. This field is a Splunk concept that I rather like, and it ensures that our splunk_hec sink includes all of the proper Splunk fields (#2268). I'd like to deviate slightly from Splunk's implementation, though:

source_type - should be the official type of Vector's source component (http, socket, file, etc).

Action items include:

[x] Add a new log_schema.source_type_key option.
[x] For all of the below, only set these keys if they do not already exist. They should not overwrite existing values.
[ ] In the splunk_hec sink, map these fields properly, in the same way we handle the timestamp_key, host_key, and message_key.
[ ] In the docker source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to docker.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the http source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to http.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the journald source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to journald.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the kafka source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to kafka.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the kubernetes source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to kubernetes.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the logplex source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to logplex.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the socket source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to socket.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the splunk_hec source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to splunk_hec.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the stdin source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to stdin.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the syslog source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to syslog.
- [ ] Setting these options to "" should disable them and not add the keys.
[ ] In the file source,
- [ ] Add a new source_type_key option and default it to "source_type".
- [x] The value of that key should be set to file.
- [ ] Setting these options to "" should disable them and not add the keys.
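
The rules in the checklist above (only set the key when it is absent, never overwrite an existing value, and treat "" as disabled) can be sketched roughly as follows. This is a minimal illustration in Python, not Vector's actual implementation (Vector is written in Rust); `tag_source_type` and `DEFAULT_SOURCE_TYPE_KEY` are hypothetical names, and events are modeled as plain dictionaries.

```python
from typing import Any, Dict

# Assumed default key name, taken from the checklist above.
DEFAULT_SOURCE_TYPE_KEY = "source_type"


def tag_source_type(
    event: Dict[str, Any],
    source_type: str,
    source_type_key: str = DEFAULT_SOURCE_TYPE_KEY,
) -> Dict[str, Any]:
    """Add the source type to an event without clobbering existing data.

    Hypothetical helper: it mutates and returns the same dictionary.
    """
    # An empty key name disables the feature, so no key is added.
    if source_type_key == "":
        return event
    # setdefault writes only when the key is absent, so a value that is
    # already present on the event is preserved.
    event.setdefault(source_type_key, source_type)
    return event


fresh = tag_source_type({"message": "hello"}, "file")
existing = tag_source_type({"message": "hello", "source_type": "custom"}, "file")
disabled = tag_source_type({"message": "hello"}, "file", source_type_key="")
```

Here `fresh` gains the new key, `existing` keeps its original value, and `disabled` is left untouched.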

Labels: sources, should, blocked, approval, enhancement

binarylogic

👍5

Most helpful comment

Hi @Alexx-G, in that sense I think the mapping should move into the splunk_hec sink directly, like our other sinks. I've opened #2268 to represent this work, and we can adjust the details as necessary. This will require us to add additional pieces of context in our sources, which I'll open other issues for.

But I agree with @lukesteensen that it feels a little awkward to shoehorn multiple pieces of context into a single ambiguous source field. Vector is not Splunk-specific, so anything that is should move into the Splunk-specific components.

What do you think?

binarylogic on 8 Apr 2020

👍2

All 14 comments

And for kubernetes source the source_key would be set to {{ pod_name }}, right?

Alexx-G on 25 Mar 2020

Ah, yep! Just added it. Left out that source since we haven't officially announced it yet.

binarylogic on 25 Mar 2020

👍1

I'm curious how this will interact with things like https://github.com/timberio/vector/issues/1150. Should we only do one and not the other?

I worry a little bit about trying to shoehorn all types of "source" data into a single field when it can differ quite a bit across different systems.

lukesteensen on 8 Apr 2020

That's a good point. I think source_type is fine, but source is questionable. You'll notice that I deprecated fields like file in the file source, but I'm not sure that's the best move. If we agree, I can remove the source field. It's also worth noting that users can set this to "file" if they want.

binarylogic on 8 Apr 2020

I'm just not sure there's a good reason to try to abstract over these things. I definitely agree source_type should be kafka or something like that, and then users can switch on that to decide which source-specific fields to look for.

lukesteensen on 8 Apr 2020
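
The idea of switching on source_type to decide which source-specific fields to read could look something like the sketch below. It is only an illustration under stated assumptions: the mapping of source types to field names (`file`, `topic`, `pod_name`) and the helper `origin_of` are hypothetical and are not taken from Vector's code.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from a source type to the source-specific field
# that identifies where the event came from. The field names are
# illustrative assumptions only.
SOURCE_SPECIFIC_FIELD = {
    "file": "file",
    "kafka": "topic",
    "kubernetes": "pod_name",
}


def origin_of(
    event: Dict[str, Any], source_type_key: str = "source_type"
) -> Optional[Any]:
    """Return the source-specific origin of an event, if one is known."""
    source_type = event.get(source_type_key)
    # Ignore events with a missing or non-string source type.
    if not isinstance(source_type, str):
        return None
    field = SOURCE_SPECIFIC_FIELD.get(source_type)
    if field is None:
        return None
    return event.get(field)


from_file = origin_of({"source_type": "file", "file": "/var/log/app.log"})
from_stdin = origin_of({"source_type": "stdin", "message": "hi"})
```

With this approach each source keeps its own field names, and source_type acts only as the discriminator.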

Sounds good. I'll remove source then.

binarylogic on 8 Apr 2020

@binarylogic Do you plan to remove source from this issue altogether, or just for the kafka sink?
I ask because source is still relevant for the splunk sink.

Alexx-G on 8 Apr 2020

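Hi @Alexx-G, in that sense I think the mapping should move into the splunk_hec sink directly, like our other sinks. I've opened #2268 to represent this work, and we can adjust the details as necessary. This will require us to add additional pieces of context in our sources, which I'll open other issues for.

But I agree with @lukesteensen that it feels a little awkward to shoehorn multiple pieces of context into a single ambiguous source field. Vector is not Splunk-specific, so anything that is should move into the Splunk-specific components.

What do you think?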

binarylogic on 8 Apr 2020

👍2

Yeah, this totally makes sense. As long as we can set both fields (source and source type) for the splunk sink it fits our particular use cases. And I totally agree that Vector's log schema should be as independent and unambiguous as possible.

Alexx-G on 8 Apr 2020

@binarylogic This makes sense. Then, should we remove the log_schema.source_key, as it isn't needed anymore?

ktff on 9 Apr 2020

I think using log_schema.source_type_key as the default instead of "source_type", and having the splunk_hec sink also use log_schema.source_type_key, would make it easier for users to change the default key without breaking its use in the splunk_hec sink, which would by extension also break the use of source in #2268.

@binarylogic Also, I'm curious as to why we are adding a source_type_key option to every source?

ktff on 10 Apr 2020
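
A rough sketch of that proposal: the sink looks up the key name from the global log schema rather than hard-coding "source_type", so renaming the key in one place keeps the sink mapping intact. This is a simplified Python illustration, not the splunk_hec sink's real code; `LogSchema` and `to_hec_payload` are hypothetical names, and only the `host`, `sourcetype`, and `event` members of the Splunk HEC payload are modeled.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class LogSchema:
    """Hypothetical stand-in for the global log_schema settings."""

    host_key: str = "host"
    source_type_key: str = "source_type"


def to_hec_payload(event: Dict[str, Any], schema: LogSchema) -> Dict[str, Any]:
    """Map schema-defined fields onto Splunk HEC metadata keys."""
    remaining = dict(event)  # copy so the caller's event is not mutated
    payload: Dict[str, Any] = {}

    host = remaining.pop(schema.host_key, None)
    if host is not None:
        payload["host"] = host

    # The key name comes from the schema, so a user-renamed key still maps
    # to Splunk's sourcetype field.
    source_type = remaining.pop(schema.source_type_key, None)
    if source_type is not None:
        payload["sourcetype"] = source_type

    payload["event"] = remaining
    return payload


custom_schema = LogSchema(source_type_key="kind")
payload = to_hec_payload(
    {"message": "hi", "host": "web-1", "kind": "file"}, custom_schema
)
```

Because both the sources and the sink would read the same schema value, changing `source_type_key` does not silently drop the Splunk field.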

> Then, should we remove the log_schema.source_key, as it isn't needed anymore?

Yes, please. Sorry for the back and forth on that.

> I think making log_schema.source_type_key

Correct, the value of this should be the default.

> Also, I'm curious as to why we are adding a source_type_key option to every source?

It was mainly for consistency, but I'm realizing we aren't actually consistent on this (yet). For example, sources do not provide a message_key or host_key option consistently. I think they should, but let's address this separately. For now, you can forgo adding a source_type_key option to each source. Does that make sense?

binarylogic on 10 Apr 2020

It does. The source_type_key option can then be added later, perhaps in bulk with other such options.

ktff on 11 Apr 2020

@ktff we want to pause work on this temporarily. We plan to solve this but we want to address it in #2414 before doing so. This data is metadata, in my opinion, and should be treated as such. #2414 is being worked on now.

binarylogic on 22 Apr 2020
