I'm opening this issue to explicitly represent the work of a merge transform (I'm open to better names). The purpose of this transform is to merge log lines that were split upstream. For example, the docker source caps logs at 16kb and therefore splits logs that exceed this size. We need a transform that allows users to merge these lines together to form the single log line they are meant to represent.
docker log events that were split due to sizejournald log events that were split due to sizefile log events based on a start and stop pattern (the current message_start_indicator option)This spec is largely based on the comments from @MOZGIII. See https://github.com/timberio/vector/issues/1436#issuecomment-569457769, https://github.com/timberio/vector/issues/1436#issuecomment-569458437, and https://github.com/timberio/vector/issues/1436#issuecomment-570084944).
docker source should come pre-configured with this transform since we know Docker will split lines.journald source should come pre-configured with this transform since we know Journald splits lines.message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.Let me know if these requirements are not accurate.
message_start_indicatorshould be removed from thefilesource in favor of using this new transform. We can provide guides that educate the users on this.
I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?
More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.
Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup? Not sure it should belong to the main vector repo.
I plan to replicate the docker line splitting behavior - so I'd need a specially crafted Dockerfile with long echos and a vector config.
Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup?
This could fit as a correctness test in our test harness /cc @binarylogic
Great. I created a repo in the meantime: https://github.com/MOZGIII/vector-merge-test-setup
I'll see if I can migrate it to the test harness repo.
Yeah, that would be ideal. There's a little more involved there since the test harness involves terraform and ansible scripts. I'd say keep moving forward with you repo and then we can pair on transitioning it to our test hardness when you're ready.
Sounds good! I've used Ansible and Terraform before, but still, I'd appreciate guidance! Before we can move it to test harness, I need to figure out a way to programmatically assert the behavior. With my current setup that'd be problematic, however, I imagine it can be done better if we can spare a full VM for it.
Yep. For example, here's a very simple nested JSON test:
You can read about the test here:
https://github.com/timberio/vector-test-harness/tree/master/cases/wrapped_json_correctness
I don't want to distract you with this stuff right now, but I think it's good for you to see how we're using Ansible to run test cases. When you're ready we can pair and get it set up.
message_start_indicatorshould be removed from thefilesource in favor of using this new transform. We can provide guides that educate the users on this.I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?
I'm also not sure about the deprecation since I don't fully understand yet if we're covering all the use cases of the message_start_indicator with the merge functionality. It might be that we don't. I'll think about it tomorrow.
More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like
message_start_indicatorthat use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.
The main beauty of having this a separate transform is it won't be limited to just merging the raw input from the sources. For instance in the JSON-in-JSON case, one would be able to unwrap the "JSON envelope" and merge over a field from the wrapped document. I think that'd be cool.
If anyone happen to have a VM dump with journald configured for testing with vector - please share it with me :smile: I need to replicate the message splitting behavior for journald.
UPD: I'll be working on my own setup tomorrow if nothing comes up by then.
Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE.
This is covered by adding a merge transform manually. I doubt we want this docker-specific logic to be built-in to a journald source (which would be an odd coupling, since journald source is currently generic, and not specific to docker). I'd consider this use-case covered by just the addition of the transform, and not provide any specific details.
We can hint to this issue with docker in the documentation for the journald source and suggest adding a merge transform there.
I also realized that we should probably take a look at k8s source as well. It's implemented separately, but the design goal there, as noted in the comments, is for it to be interchangeable with docker source. So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.
Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.
Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).
So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.
Make sense to me.
Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.
Sure, but the
CONTAINER_PARTIAL_MESSAGEis an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).
Hmm, that's interesting. To my knowledge, there's no spec that rules container runtime to add CONTAINER_PARTIAL_MESSAGE for the partial messages. In practice, rkt doesn't do it, but cri-o added it recently: https://github.com/cri-o/cri-o/commit/17cbc3a9cae03b70800afe0d1feb7162c8829f07. So, we might as well consider it the emerging standard. But still, this would be a container-runtime specific thing. Do you think it's a good idea to make it enabled by default? Though, most likely it won't hurt, as long as we provide means to opt-out...
I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:
{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}
It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.
I just realized the docker approach to partial lines (partial message marker by the presence or absence of the
\nat the end) is unsound. Consider the following example:{"log":"{ \"long_value\": \"long value ... \\n", ...} {"log":" ... \" }", ...}It might so happen that a long log message is split such that the end of the first chunk is
\\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.
Actually, we can special casing for \\n and make it sound this on our end. I'll create an issue for this.
So, the summary of the progress on this issue:
merge transform is implemented at #1504 docker source is also implemented at #1504journald and kubernetes sources are not yet covered; journald worth a separate issue and discussion, for kubernetes - I just did the research - and the same logic we use for docker is applicablemessage_start_indicator for file source is still there, let's create a separate issue for it too; it might be that lua script is enough, otherwise we'll have to add a dedicated partial message marker transform to get rid of message_start_indicatorjournald, other sources, sinks that can benefit from merging the events together and so on; this will also help with the issues mentioned above to enable partial message merging for journald - it'd be easier to implement with if we've known what to aim for