As discussed in https://github.com/timberio/vector/issues/2142#issuecomment-603864777, we should add a new field to our log schema: source_type. This field is a Splunk concept that I rather like and it ensures that our splunk_hec sink is including all of the proper Splunk fields (#2268). I'd like to deviate slightly from Splunk's implementation though:
source_type - should be the official type of Vector's source component (http, socket, file, etc).

Action items include:

- Add a log_schema.source_type_key option.
- In the splunk_hec sink, map these fields properly, in the same way we handle the timestamp_key, host_key, and message_key.
- docker source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to docker.
  - "" should disable them and not add the keys.
- http source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to http.
  - "" should disable them and not add the keys.
- journald source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to journald.
  - "" should disable them and not add the keys.
- kafka source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to kafka.
  - "" should disable them and not add the keys.
- kubernetes source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to kubernetes.
  - "" should disable them and not add the keys.
- logplex source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to logplex.
  - "" should disable them and not add the keys.
- socket source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to socket.
  - "" should disable them and not add the keys.
- splunk_hec source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to splunk_hec.
  - "" should disable them and not add the keys.
- stdin source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to stdin.
  - "" should disable them and not add the keys.
- syslog source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to syslog.
  - "" should disable them and not add the keys.
- file source:
  - Add a source_type_key option and default it to "source_type".
  - Set the value to file.
  - "" should disable them and not add the keys.

And for kubernetes source the source_key would be set to {{ pod_name }}, right?
Ah, yep! Just added it. Left out that source since we haven't officially announced it yet.
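For reference, the per-source behavior in the action items of the issue description (a configurable key that defaults to "source_type", with "" disabling the field) could look roughly like the sketch below. This is not Vector's actual code: the `Event` alias and the `add_source_type` function are hypothetical names used only for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a log event: a flat map of field name to value.
type Event = BTreeMap<String, String>;

/// Adds the source type under `key`.
/// An empty key means the field is disabled, so nothing is inserted.
fn add_source_type(event: &mut Event, key: &str, source_type: &str) {
    if !key.is_empty() {
        event.insert(key.to_string(), source_type.to_string());
    }
}

fn main() {
    // Default key, as proposed in the action items.
    let mut event = Event::new();
    event.insert("message".to_string(), "hello".to_string());
    add_source_type(&mut event, "source_type", "docker");
    println!("{:?}", event);

    // Empty key: the event is left untouched.
    let mut untouched = Event::new();
    add_source_type(&mut untouched, "", "docker");
    println!("{:?}", untouched);
}
```

Treating the empty string as "disabled" keeps the option a plain string, at the cost of making "" a reserved value.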
I'm curious how this will interact with things like https://github.com/timberio/vector/issues/1150. Should we only do one and not the other?
I worry a little bit about trying to shoehorn all types of "source" data into a single field when it can differ quite a bit across different systems.
That's a good point. I think source_type is fine, but source is questionable. You'll notice that I deprecated fields like file in the file source, but I'm not sure that's the best move. If we agree, I can remove the source field. It's also worth noting that users can set this to "file" if they want.
I'm just not sure there's a good reason to try to abstract over these things. I definitely agree source_type should be kafka or something like that, and then users can switch on that to decide which source-specific fields to look for.
Sounds good. I'll remove source then.
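To illustrate the "switch on source_type" idea from the comment above, here is a minimal sketch of a consumer picking a source-specific field based on the source type. The `Event` alias and `origin_field` are hypothetical, and the "topic" field name for kafka is an assumption made for the example, not something confirmed in this thread.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a log event: a flat map of field name to value.
type Event = BTreeMap<String, String>;

/// Returns the name of the source-specific field to look at,
/// chosen by the event's `source_type` value.
fn origin_field(event: &Event) -> Option<&'static str> {
    match event.get("source_type").map(String::as_str) {
        Some("file") => Some("file"),
        // Assumed field name, for illustration only.
        Some("kafka") => Some("topic"),
        // Unknown or missing source type: no source-specific field.
        _ => None,
    }
}

fn main() {
    let mut event = Event::new();
    event.insert("source_type".to_string(), "file".to_string());
    event.insert("file".to_string(), "/var/log/app.log".to_string());

    if let Some(field) = origin_field(&event) {
        println!("{} = {:?}", field, event.get(field));
    }
}
```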
@binarylogic Do you plan to remove source from this issue altogether, or just for the kafka sink? I ask because source is still relevant for the splunk sink.
Hi @Alexx-G, in that sense I think the mapping should move into the splunk_hec sink directly, like our other sinks. I've opened #2268 to represent this work, and we can adjust the details as necessary. This will require us to add additional pieces of context in our sources, which I'll open other issues for.
But I agree with @lukesteensen that it feels a little awkward to shoehorn multiple pieces of context into a single ambiguous source field. Vector is not Splunk-specific, so anything that is should move into the Splunk-specific components.
What do you think?
Yeah, this totally makes sense. As long as we can set both fields (source and source type) for the splunk sink it fits our particular use cases. And I totally agree that Vector's log schema should be as independent and unambiguous as possible.
@binarylogic This makes sense. Then, should we remove the log_schema.source_key as it isn't needed anymore?
I think using log_schema.source_type_key as the default instead of "source_type", and making the splunk_hec sink also use log_schema.source_type_key, would make it easier for the user to change the default key without breaking its use in the splunk_hec sink, which would by extension also break the use of source in #2268.
@binarylogic Also, I'm curious why we are adding a source_type_key option to every source?
Then, should we remove the log_schema.source_key as it isn't needed anymore?
Yes, please. Sorry for the back and forth on that.
I think making log_schema.source_type_key
Correct, the value of this should be the default.
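A rough sketch of the point being confirmed here: if both the sources and the splunk_hec sink read the key from one shared schema, a user can rename the key without the two falling out of sync. `LogSchema`, `source_emit`, and `sink_source_type` are hypothetical names for this example and do not reflect Vector's actual types.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a log event: a flat map of field name to value.
type Event = BTreeMap<String, String>;

/// Hypothetical global schema holding the configurable key name.
struct LogSchema {
    source_type_key: String,
}

impl Default for LogSchema {
    fn default() -> Self {
        LogSchema {
            source_type_key: "source_type".to_string(),
        }
    }
}

/// Source side: writes the source type under the schema's key.
fn source_emit(schema: &LogSchema, source_type: &str) -> Event {
    let mut event = Event::new();
    event.insert(schema.source_type_key.clone(), source_type.to_string());
    event
}

/// Sink side: reads the source type back using the same schema key.
fn sink_source_type<'a>(schema: &LogSchema, event: &'a Event) -> Option<&'a str> {
    event.get(&schema.source_type_key).map(String::as_str)
}

fn main() {
    // The user renames the key once; source and sink both follow.
    let schema = LogSchema {
        source_type_key: "type".to_string(),
    };
    let event = source_emit(&schema, "kafka");
    println!("{:?}", sink_source_type(&schema, &event));
}
```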
Also, I'm curious why we are adding a source_type_key option to every source?
It was mainly for consistency, but I'm realizing we aren't actually consistent on this (yet). For example, sources do not provide a message_key or host_key option consistently. I think they should, but let's address this separately. For now, you can forgo adding a source_type_key option to each source. Does that make sense?
It does. The source_type_key option can then be added later, perhaps in bulk with other such options.
@ktff we want to pause work on this temporarily. We plan to solve this but we want to address it in #2414 before doing so. This data is metadata, in my opinion, and should be treated as such. #2414 is being worked on now.