We delimit our log messages using s-expressions:
("2020-01-21 14:51:39" 2 :MESSAGE (:ACTION SUBMIT-QUOTE) "Error occurred causing transaction to roll back : Database error 23503: insert or update on table \"sales_invoice\" violates foreign key constraint \"sales_invoice_sales_account_fkey\"
DETAIL: Key (sales_account)=(9000) is not present in table \"sales_account\".
QUERY: INSERT INTO \"sales_invoice\" (\"debtor_debit_transaction\", \"sales_account\") VALUES ($$2$$, $$9000$$) RETURNING \"id\";")
The file source is currently unable to parse this message, as it spans multiple lines and there is no fixed character that marks the start of the next log message. Using ( won't work, as it sometimes appears within the log message itself.
It would be handy if there was a way to incorporate a more advanced parser that can keep track of all the quotes and parens (including taking into account when they are escaped) to determine when a message has been fully loaded.
It would probably be handy to share this logic with the socket source as well.
@MOZGIII this seems somewhat related to the work you're doing. I'm curious if what you're doing will solve this, or if it's closely related?
Yes, I guess this should be possible to implement using the merge transform I'm currently working on and a custom lua transform for a partial message marker.
It doesn't look like it'd be possible to do the partial message marking with a regexp, but with custom logic it's doable.
Thanks for this @FungusHumungus. We have a discussion brewing about how to best solve this and will follow up as we get more clarity on our solution.
@MOZGIII I'm assigning this to you as part of your overall merge work. It probably makes sense to see how we can modify the existing file source behavior to cleanly support this. Again, my hope is that after we have a few sources with this functionality we can start to extract common patterns.
> It would be handy if there was a way to incorporate a more advanced parser that can keep track of all the quotes and parens (including taking into account when they are escaped) to determine when a message has been fully loaded.
On second thought, this case is very tricky.
First of all, s-expressions are a context-free grammar, so we can't parse them with regexps.
Then, our lua implementation is currently stateless - it resets the lua context each time an event is processed. This means that the state we'd need in order to mark events as partial or non-partial (and to leverage the merge transform) cannot be stored.
The good news is that this is about to change in a way that will (presumably) support this use case.
That said, to solve this issue 100% correctly with a lua transform and a merge transform, you'd need to implement a tokenizer in lua for the s-expression variant you're using - because you'd want not just to count ( and ), but also to detect embedded string values (delimited by ") and ignore any (/) inside them, as well as handle the escapes \" and \\.
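A grammar-aware balance check of this kind could look like the following minimal sketch (written in Python for illustration - the same logic would be ported into the lua transform; the function name and signature are mine, not an existing API). It walks one chunk, skips parens inside quoted strings, honors the \" and \\ escapes, and returns state the caller can carry into the next chunk:

```python
def sexp_balance(chunk: str, depth: int = 0, in_string: bool = False) -> tuple[int, bool]:
    """Scan one chunk of an s-expression stream.

    Returns the updated paren depth and whether we ended inside a quoted
    string, so the caller can thread this state across chunks.
    """
    i = 0
    while i < len(chunk):
        c = chunk[i]
        if in_string:
            if c == "\\":
                i += 2          # skip the escaped character (\" or \\)
                continue
            if c == '"':
                in_string = False
        else:
            if c == '"':
                in_string = True
            elif c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
        i += 1
    return depth, in_string
```

A depth of zero (while not inside a string) means the message is complete; any parens inside string values, even unmatched ones, are correctly ignored.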
There is a simpler way, of course: simply count the ( and ) symbols in the string without any tokenization, and hope that the inner log messages never contain an unmatched bracket.
In both scenarios (with and without the tokenizer), if the difference between the counts of opening and closing brackets is non-zero, the message is partial, so you add the partial event marker field (_partial: true) to the event. If it's zero, the message is complete and should be passed through as-is (without the marker). This properly marks events for the merge transform, which will handle the rest of the heavy lifting - merging huge partial messages together may be a non-trivial task for a lua transform, but the merge transform, being implemented in Rust, can do it optimally, with less overhead than lua would have.
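The naive (no-tokenizer) variant of this marking step might look like the sketch below - Python for illustration, to be ported to the lua transform; the `message` field and `_partial` flag follow the comment above, while the class name is hypothetical:

```python
class PartialMarker:
    """Naive marker: count '(' and ')' without tokenizing (hoping the
    payload never contains an unmatched bracket), and flag events that
    are still incomplete so a downstream merge step can join them."""

    def __init__(self):
        self.balance = 0  # opening minus closing parens seen so far

    def mark(self, event: dict) -> dict:
        line = event.get("message", "")
        self.balance += line.count("(") - line.count(")")
        if self.balance != 0:
            event["_partial"] = True   # the merge transform buffers these
        return event
```

Feeding the example log above line by line, the first lines come out flagged `_partial` and the line containing the closing paren does not, at which point the merge transform emits one joined event.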
This case generalizes to the problem vector currently has with parsing multiline messages. For instance, if the case was not with s-expressions, but with multi-line JSON, the problem would be the same.
The merge transform that we implemented does not solve this general case, though it does help. For the particular case where messages are properly marked, it merges them independently of the inner grammar of the contents, and it does so very efficiently - for some scenarios this is a huge win compared to the alternative.
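For intuition, the contract between the marker and the merge step can be sketched as follows (Python pseudocode of the observable behavior, not vector's actual Rust internals; the field names are assumed from the discussion above):

```python
class Merger:
    """Buffer messages flagged `_partial` and emit one joined event
    when the final (unflagged) piece arrives."""

    def __init__(self):
        self.parts = []

    def push(self, event: dict):
        self.parts.append(event["message"])
        if event.get("_partial"):
            return None                        # still waiting for the rest
        merged = {"message": "\n".join(self.parts)}
        self.parts = []
        return merged
```

Note that the merger never inspects the contents - it relies entirely on the marker upstream, which is why it works the same for s-expressions, JSON, or anything else.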
However, the parsing of the multi-line messages is, in fact, a different thing.
We can approach this problem from two angles.
Let there be some text that spans multiple messages (or any other kind of "chunk", if you will): either we treat the whole text as an opaque top-level string and just glue the pieces back together, or we parse the text with a grammar-aware parser to determine where each value ends.
The merge transform helps with the first scenario, but it's useless for the second. The first scenario is also a special case of the second - the case where the value we extract is a top-level string.
To sum up, the most flexible way of implementing this whole thing would be a streaming tokenizer/parser with pluggable grammars: JSON / top-level string / user-supplied pattern / user-supplied grammar. This way we'd be able to properly support incoming data streams without resorting to workarounds for "read framing".
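The "pluggable grammars" idea could be reduced to an interface like this hypothetical sketch (Python for illustration; none of these names are from vector): each framer consumes raw chunks and yields only complete messages, so a new grammar plugs in by implementing `feed`.

```python
from abc import ABC, abstractmethod

class Framer(ABC):
    """One pluggable grammar: feed it chunks, get back complete messages."""
    @abstractmethod
    def feed(self, chunk: str) -> list[str]: ...

class LineDelimited(Framer):
    """The trivial 'top-level string' grammar: one message per line."""
    def __init__(self):
        self.buf = ""
    def feed(self, chunk):
        self.buf += chunk
        *lines, self.buf = self.buf.split("\n")
        return lines

class BalancedSexp(Framer):
    """Naive s-expression framing: emit when the paren depth returns
    to zero (assumes well-formed input, no tokenization of strings)."""
    def __init__(self):
        self.buf, self.depth = "", 0
    def feed(self, chunk):
        out = []
        for c in chunk:
            self.buf += c
            if c == "(":
                self.depth += 1
            elif c == ")":
                self.depth -= 1
                if self.depth == 0:
                    out.append(self.buf.strip())
                    self.buf = ""
        return out
```

The source would then hold one framer per file or connection, which is exactly the state that a regex-based "next line starts a new message" heuristic cannot express.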
Makes sense. Could there be a case for calling into Lua to do this parsing?
We've improved multi-line handling support; please check out https://github.com/timberio/vector/pull/1852.
Since that doesn't directly address this issue, I'm not going to close it, but I'll rename it to better represent the contents.
After the internal discussion, we deduced it's unlikely we'll be providing explicit support for the s-expressions in the near future. Our new multi-line parsing capabilities might help with your problem though!
> Makes sense. Could there be a case for calling into Lua to do this parsing?
After our lua transform is upgraded to be able to persist data across invocations - this should definitely be doable!
Here's a tutorial on merging multi-line logs with lua: https://vector.dev/guides/advanced/merge-multiline-logs-with-lua/