Vector: Add new `source_type` field to Vector's log schema

Created on 25 Mar 2020  ·  14 Comments  ·  Source: timberio/vector

As discussed in https://github.com/timberio/vector/issues/2142#issuecomment-603864777, we should add a new field to our log schema: source_type. This field is a Splunk concept that I rather like and it ensures that our splunk_hec sink is including all of the proper Splunk fields (#2268). I'd like to deviate slightly from Splunk's implementation though:

  • source_type - should be the official type of Vector's source component (http, socket, file, etc).

Action items include:

  • [x] Add a new log_schema.source_type_key option.
  • [x] For all of the sources below, only set these keys if they do not already exist. They should not overwrite existing values.
  • [ ] In the splunk_hec sink, map these fields properly, in the same way we handle the timestamp_key, host_key, and message_key.
  • [ ] In the docker source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to docker.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the http source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to http.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the journald source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to journald.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the kafka source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to kafka.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the kubernetes source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to kubernetes.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the logplex source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to logplex.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the socket source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to socket.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the splunk_hec source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to splunk_hec.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the stdin source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to stdin.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the syslog source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to syslog.

    • [ ] Setting these options to "" should disable them and not add the keys.

  • [ ] In the file source,

    • [ ] Add a new source_type_key option and default it to "source_type".

    • [x] The value of that key should be set to file.

    • [ ] Setting these options to "" should disable them and not add the keys.
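A minimal sketch of the per-source behaviour described in the checklist above, written in Python for illustration only (it is not Vector's implementation). The function name `tag_source_type` and the dict-based event are assumptions; the three rules (default key `"source_type"`, never overwrite an existing value, `""` disables the field) come from the action items.

```python
from typing import Any, Dict

# Default key name, as proposed in the action items.
DEFAULT_SOURCE_TYPE_KEY = "source_type"


def tag_source_type(
    event: Dict[str, Any],
    source_type: str,
    key: str = DEFAULT_SOURCE_TYPE_KEY,
) -> Dict[str, Any]:
    """Hypothetical helper: stamp an event with the emitting source's type.

    - An empty key disables the feature, so no field is added.
    - An existing value under the key is left untouched.
    """
    if key == "":
        return event
    # setdefault only writes when the key is absent.
    event.setdefault(key, source_type)
    return event


if __name__ == "__main__":
    print(tag_source_type({"message": "hello"}, "file"))
    print(tag_source_type({"message": "hello", "source_type": "custom"}, "file"))
    print(tag_source_type({"message": "hello"}, "file", key=""))
```

The same function covers every source in the list; only the `source_type` argument (docker, http, journald, and so on) changes.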

sources should blocked approval enhancement

All 14 comments

And for the kubernetes source, the source_key would be set to {{ pod_name }}, right?

Ah, yep! Just added it. Left out that source since we haven't officially announced it yet.

I'm curious how this will interact with things like https://github.com/timberio/vector/issues/1150. Should we only do one and not the other?

I worry a little bit about trying to shoehorn all types of "source" data into a single field when it can differ quite a bit across different systems.

That's a good point. I think source_type is fine, but source is questionable. You'll notice that I deprecated fields like file in the file source, but I'm not sure that's the best move. If we agree, I can remove the source field. It's also worth noting that users can set this to "file" if they want.

I'm just not sure there's a good reason to try to abstract over these things. I definitely agree source_type should be kafka or something like that, and then users can switch on that to decide which source-specific fields to look for.
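To illustrate the "switch on source_type" idea, here is a rough Python sketch, not Vector code. The lookup table and the function `origin_of` are hypothetical. The `file` and `pod_name` field names appear elsewhere in this thread, while `topic` for kafka is purely an assumed placeholder.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from source_type to the source-specific field that
# identifies where an event came from. "topic" is an assumed field name.
SOURCE_SPECIFIC_FIELDS = {
    "file": "file",
    "kubernetes": "pod_name",
    "kafka": "topic",
}


def origin_of(event: Dict[str, Any]) -> Optional[Any]:
    """Return the source-specific origin of an event, or None if unknown."""
    field = SOURCE_SPECIFIC_FIELDS.get(event.get("source_type"))
    if field is None:
        return None
    return event.get(field)


if __name__ == "__main__":
    print(origin_of({"source_type": "file", "file": "/var/log/app.log"}))
    print(origin_of({"source_type": "stdin", "message": "hi"}))
```

Keeping source-specific fields separate means nothing has to be squeezed into one ambiguous source field; consumers pick the field that matches the source type.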

Sounds good. I'll remove source then.

@binarylogic Do you plan to remove source from this issue altogether, or just for the kafka sink?
I ask because source is still relevant for the splunk sink.

Hi @Alexx-G, in that sense I think the mapping should move into the splunk_hec sink directly, like our other sinks. I've opened #2268 to represent this work, and we can adjust the details as necessary. This will require us to add additional pieces of context in our sources, which I'll open other issues for.

But I agree with @lukesteensen that it feels a little awkward to shoehorn multiple pieces of context into a single ambiguous source field. Vector is not Splunk specific, so anything that is should move into the Splunk specific components.

What do you think?

Yeah, this totally makes sense. As long as we can set both fields (source and source type) for the splunk sink it fits our particular use cases. And I totally agree that Vector's log schema should be as independent and unambiguous as possible.

@binarylogic This makes sense. Then, should we remove log_schema.source_key, as it isn't needed anymore?

I think making log_schema.source_type_key the default, instead of the hard-coded "source_type", and having the splunk_hec sink also use log_schema.source_type_key, would make it easier for users to change the default key without breaking its use in the splunk_hec sink, which by extension would also break the use of source in #2268.
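A hedged sketch of that key resolution, in Python for illustration only. The function `to_hec_payload`, the dict-shaped `log_schema`, and the choice to move the field out of the event body are all assumptions; the only external fact relied on is that Splunk HEC event metadata uses the key `sourcetype`.

```python
from typing import Any, Dict, Optional


def to_hec_payload(
    event: Dict[str, Any],
    log_schema: Dict[str, str],
    source_type_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical sink-side mapping of an event to a Splunk HEC payload.

    The sink-level key, when given, wins; otherwise the global
    log_schema.source_type_key is used, falling back to "source_type".
    """
    if source_type_key is not None:
        key = source_type_key
    else:
        key = log_schema.get("source_type_key", "source_type")

    fields = dict(event)  # copy so the caller's event is not mutated
    payload: Dict[str, Any] = {"event": fields}
    # An empty key disables the mapping; a missing field is simply skipped.
    if key and key in fields:
        payload["sourcetype"] = fields.pop(key)
    return payload


if __name__ == "__main__":
    print(to_hec_payload({"message": "hi", "source_type": "file"}, {}))
    print(to_hec_payload({"message": "hi", "st": "kafka"}, {"source_type_key": "st"}))
```

Because the sink reads the key from the global schema by default, renaming the key in one place keeps sources and the sink in agreement.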

@binarylogic Also, I'm curious why we are adding a source_type_key option to every source?

Then, should we remove log_schema.source_key, as it isn't needed anymore?

Yes, please. Sorry for the back and forth on that.

I think making log_schema.source_type_key

Correct, the value of this should be the default.

Also I'm curious as to why are we adding source_type_key option to every source?

It was mainly for consistency, but I'm realizing we aren't actually consistent on this (yet). For example, sources do not provide a message_key or host_key option consistently. I think they should, but let's address this separately. For now, you can forgo adding a source_type_key option to each source. Does that make sense?

It does. The source_type_key option can then be added later, perhaps in bulk with other such options.

@ktff we want to pause work on this temporarily. We plan to solve this but we want to address it in #2414 before doing so. This data is metadata, in my opinion, and should be treated as such. #2414 is being worked on now.
