Vector: New `merge` transform

Created on 7 Jan 2020  路  16Comments  路  Source: timberio/vector

I'm opening this issue to explicitly represent the work of a merge transform (I'm open to better names). The purpose of this transform is to merge log lines that were split upstream. For example, the docker source caps logs at 16kb and therefore splits logs that exceed this size. We need a transform that allows users to merge these lines together to form the single log line they are meant to represent.

Use cases

  • [x] Ability to merge docker log events that were split due to size
  • [x] Ability to merge journald log events that were split due to size
  • [ ] Ability to merge file log events based on a start and stop pattern (the current message_start_indicator option)

Spec

This spec is largely based on the comments from @MOZGIII. See https://github.com/timberio/vector/issues/1436#issuecomment-569457769, https://github.com/timberio/vector/issues/1436#issuecomment-569458437, and https://github.com/timberio/vector/issues/1436#issuecomment-570084944).

  • [x] This should be built as a separate configurable transform, enabling users merge logs from any source and cover a wide variety of use cases that we are not thinking about.
  • [x] The docker source should come pre-configured with this transform since we know Docker will split lines.
  • [ ] The journald source should come pre-configured with this transform since we know Journald splits lines.
  • [ ] message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

Let me know if these requirements are not accurate.

feature

All 16 comments

message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?

More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.

Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup? Not sure it should belong to the main vector repo.
I plan to replicate the docker line splitting behavior - so I'd need a specially crafted Dockerfile with long echos and a vector config.

Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup?

This could fit as a correctness test in our test harness /cc @binarylogic

Great. I created a repo in the meantime: https://github.com/MOZGIII/vector-merge-test-setup
I'll see if I can migrate it to the test harness repo.

Yeah, that would be ideal. There's a little more involved there since the test harness involves terraform and ansible scripts. I'd say keep moving forward with you repo and then we can pair on transitioning it to our test hardness when you're ready.

Sounds good! I've used Ansible and Terraform before, but still, I'd appreciate guidance! Before we can move it to test harness, I need to figure out a way to programmatically assert the behavior. With my current setup that'd be problematic, however, I imagine it can be done better if we can spare a full VM for it.

Yep. For example, here's a very simple nested JSON test:

https://github.com/timberio/vector-test-harness/blob/master/cases/wrapped_json_correctness/ansible/run/vector.yml

You can read about the test here:

https://github.com/timberio/vector-test-harness/tree/master/cases/wrapped_json_correctness

I don't want to distract you with this stuff right now, but I think it's good for you to see how we're using Ansible to run test cases. When you're ready we can pair and get it set up.

message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?

I'm also not sure about the deprecation since I don't fully understand yet if we're covering all the use cases of the message_start_indicator with the merge functionality. It might be that we don't. I'll think about it tomorrow.

More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.

The main beauty of having this a separate transform is it won't be limited to just merging the raw input from the sources. For instance in the JSON-in-JSON case, one would be able to unwrap the "JSON envelope" and merge over a field from the wrapped document. I think that'd be cool.

If anyone happen to have a VM dump with journald configured for testing with vector - please share it with me :smile: I need to replicate the message splitting behavior for journald.
UPD: I'll be working on my own setup tomorrow if nothing comes up by then.

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE.
This is covered by adding a merge transform manually. I doubt we want this docker-specific logic to be built-in to a journald source (which would be an odd coupling, since journald source is currently generic, and not specific to docker). I'd consider this use-case covered by just the addition of the transform, and not provide any specific details.
We can hint to this issue with docker in the documentation for the journald source and suggest adding a merge transform there.

I also realized that we should probably take a look at k8s source as well. It's implemented separately, but the design goal there, as noted in the comments, is for it to be interchangeable with docker source. So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.

Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).

So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.

Make sense to me.

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.

Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).

Hmm, that's interesting. To my knowledge, there's no spec that rules container runtime to add CONTAINER_PARTIAL_MESSAGE for the partial messages. In practice, rkt doesn't do it, but cri-o added it recently: https://github.com/cri-o/cri-o/commit/17cbc3a9cae03b70800afe0d1feb7162c8829f07. So, we might as well consider it the emerging standard. But still, this would be a container-runtime specific thing. Do you think it's a good idea to make it enabled by default? Though, most likely it won't hurt, as long as we provide means to opt-out...

I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:

{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}

It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.

I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:

{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}

It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.

Actually, we can special casing for \\n and make it sound this on our end. I'll create an issue for this.

So, the summary of the progress on this issue:

  • the merge transform is implemented at #1504
  • automatic merging of partial events for docker source is also implemented at #1504
  • defaults for journald and kubernetes sources are not yet covered; journald worth a separate issue and discussion, for kubernetes - I just did the research - and the same logic we use for docker is applicable
  • a message_start_indicator for file source is still there, let's create a separate issue for it too; it might be that lua script is enough, otherwise we'll have to add a dedicated partial message marker transform to get rid of message_start_indicator
  • test harness integration - we can add a dedicated issue for this too to track the progress and have a place to discuss; maybe at the test harness repo even
  • as mentioned at https://github.com/timberio/vector/issues/1407#issuecomment-574876272, it'd be great to conduct a research to test the limits of the current design; the end goal is to provide a set of vector configs (that we could add to the documentation as examples) to cover various use cases - with journald, other sources, sinks that can benefit from merging the events together and so on; this will also help with the issues mentioned above to enable partial message merging for journald - it'd be easier to implement with if we've known what to aim for
Was this page helpful?
0 / 5 - 0 ratings

Related issues

binarylogic picture binarylogic  路  4Comments

valyala picture valyala  路  3Comments

LucioFranco picture LucioFranco  路  3Comments

binarylogic picture binarylogic  路  3Comments

Hoverbear picture Hoverbear  路  3Comments