We delimit our log messages using s-expressions:
("2020-01-21 14:51:39" 2 :MESSAGE (:ACTION SUBMIT-QUOTE) "Error occurred causing transaction to roll back : Database error 23503: insert or update on table \"sales_invoice\" violates foreign key constraint \"sales_invoice_sales_account_fkey\"
DETAIL: Key (sales_account)=(9000) is not present in table \"sales_account\".
QUERY: INSERT INTO \"sales_invoice\" (\"debtor_debit_transaction\", \"sales_account\") VALUES ($$2$$, $$9000$$) RETURNING \"id\";")
The file source is currently unable to parse this message, as it spans multiple lines and there is no fixed character that marks the start of the next log message. Using ( won't work, as it sometimes appears within the log message itself.
It would be handy if there was a way to incorporate a more advanced parser that can keep track of all the quotes and parens (including taking into account when they are escaped) to determine when a message has been fully loaded.
It would probably be handy to share this logic with the socket source as well.
@MOZGIII this seems somewhat related to the work you're doing. I'm curious if what you're doing will solve this, or if it's closely related?
Yes, I guess this should be possible to implement using the merge transform I'm currently working on and a custom lua transform for a partial message marker.
It doesn't look like it'd be possible to do the partial message marking with a regexp, but with custom logic it's doable.
Thanks for this @FungusHumungus. We have a discussion brewing about how to best solve this and will follow up as we get more clarity on our solution.
@MOZGIII I'm assigning this to you as part of your overall merge work. It probably makes sense to see how we can modify the existing file source behavior to cleanly support this. Again, my hope is that after we have a few sources with this functionality we can start to extract common patterns.
> It would be handy if there was a way to incorporate a more advanced parser that can keep track of all the quotes and parens (including taking into account when they are escaped) to determine when a message has been fully loaded.
On second thought, this case is very tricky.
First of all, s-expressions are a context-free grammar, so we can't parse them with regexps.
Then, our lua implementation is currently stateless - it resets the lua context each time an event is processed. This means that the state we'd need in order to mark events as partial or non-partial (and to leverage the merge transform) cannot be stored.
The good news is that this is about to change in a way that will (presumably) support this use case.
That said, to solve this issue 100% correctly with a lua transform and a merge transform, you'd need to implement a tokenizer in lua for the s-expression variant you're using - because you'd want not just to count ( and ), but also to detect embedded string values (delimited by ") and ignore any (/) inside them, as well as handle the escapes \" and \\.
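A grammar-aware balance check of this kind could look like the following minimal sketch (written in Python for illustration - the same logic would be ported into the lua transform; the function name and signature are mine, not an existing API). It walks one chunk, skips parens inside quoted strings, honors the \" and \\ escapes, and returns state the caller can carry into the next chunk:

```python
def sexp_balance(chunk: str, depth: int = 0, in_string: bool = False) -> tuple[int, bool]:
    """Scan one chunk of an s-expression stream.

    Returns the updated paren depth and whether we ended inside a quoted
    string, so the caller can thread this state across chunks.
    """
    i = 0
    while i < len(chunk):
        c = chunk[i]
        if in_string:
            if c == "\\":
                i += 2          # skip the escaped character (\" or \\)
                continue
            if c == '"':
                in_string = False
        else:
            if c == '"':
                in_string = True
            elif c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
        i += 1
    return depth, in_string
```

A depth of zero (while not inside a string) means the message is complete; any parens inside string values, even unmatched ones, are correctly ignored.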
There is a simpler way, of course: simply count the ( and ) symbols in the string without any tokenization, and hope that the inner log messages never contain an unmatched bracket.
In both scenarios (with and without the tokenizer), if the difference between the counts of opening and closing brackets is non-zero, the message is partial, so you add the partial event marker field (_partial: true) to the event. If it's zero, the message is complete and should be passed through as-is (without the marker). This properly marks events for the merge transform, which will handle the rest of the heavy lifting - merging huge partial messages together may be a non-trivial task for a lua transform, but the merge transform, being implemented in Rust, can do it optimally, with less overhead than lua would have.
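The naive (no-tokenizer) variant of this marking step might look like the sketch below - Python for illustration, to be ported to the lua transform; the `message` field and `_partial` flag follow the comment above, while the class name is hypothetical:

```python
class PartialMarker:
    """Naive marker: count '(' and ')' without tokenizing (hoping the
    payload never contains an unmatched bracket), and flag events that
    are still incomplete so a downstream merge step can join them."""

    def __init__(self):
        self.balance = 0  # opening minus closing parens seen so far

    def mark(self, event: dict) -> dict:
        line = event.get("message", "")
        self.balance += line.count("(") - line.count(")")
        if self.balance != 0:
            event["_partial"] = True   # the merge transform buffers these
        return event
```

Feeding the example log above line by line, the first lines come out flagged `_partial` and the line containing the closing paren does not, at which point the merge transform emits one joined event.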
This case generalizes to the problem vector currently has with parsing multiline messages. For instance, if the case was not with s-expressions, but with multi-line JSON, the problem would be the same.
The merge transform that we implemented does not solve this general case, though it does help. For the particular case where messages are properly marked, it merges them independently of the inner grammar of the contents, and it does so very efficiently - for some scenarios this is a huge win compared to the alternative.
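For intuition, the contract between the marker and the merge step can be sketched as follows (Python pseudocode of the observable behavior, not vector's actual Rust internals; the field names are assumed from the discussion above):

```python
class Merger:
    """Buffer messages flagged `_partial` and emit one joined event
    when the final (unflagged) piece arrives."""

    def __init__(self):
        self.parts = []

    def push(self, event: dict):
        self.parts.append(event["message"])
        if event.get("_partial"):
            return None                        # still waiting for the rest
        merged = {"message": "\n".join(self.parts)}
        self.parts = []
        return merged
```

Note that the merger never inspects the contents - it relies entirely on the marker upstream, which is why it works the same for s-expressions, JSON, or anything else.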
However, the parsing of the multi-line messages is, in fact, a different thing.
We can approach this problem from two angles.
Let there be some text that spans multiple messages (or any other kind of "chunk", if you will): either we treat the whole text as an opaque top-level string and just glue the pieces back together, or we parse the text with a grammar-aware parser to determine where each value ends.
The merge transform helps with the first scenario, but it's useless for the second. The first scenario is also a special case of the second - the case where the value we extract is a top-level string.
To sum up, the most flexible way of implementing this whole thing would be a streaming tokenizer/parser with pluggable grammars: JSON / top-level string / user-supplied pattern / user-supplied grammar. This way we'd be able to properly support incoming data streams without resorting to workarounds for "read framing".
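The "pluggable grammars" idea could be reduced to an interface like this hypothetical sketch (Python for illustration; none of these names are from vector): each framer consumes raw chunks and yields only complete messages, so a new grammar plugs in by implementing `feed`.

```python
from abc import ABC, abstractmethod

class Framer(ABC):
    """One pluggable grammar: feed it chunks, get back complete messages."""
    @abstractmethod
    def feed(self, chunk: str) -> list[str]: ...

class LineDelimited(Framer):
    """The trivial 'top-level string' grammar: one message per line."""
    def __init__(self):
        self.buf = ""
    def feed(self, chunk):
        self.buf += chunk
        *lines, self.buf = self.buf.split("\n")
        return lines

class BalancedSexp(Framer):
    """Naive s-expression framing: emit when the paren depth returns
    to zero (assumes well-formed input, no tokenization of strings)."""
    def __init__(self):
        self.buf, self.depth = "", 0
    def feed(self, chunk):
        out = []
        for c in chunk:
            self.buf += c
            if c == "(":
                self.depth += 1
            elif c == ")":
                self.depth -= 1
                if self.depth == 0:
                    out.append(self.buf.strip())
                    self.buf = ""
        return out
```

The source would then hold one framer per file or connection, which is exactly the state that a regex-based "next line starts a new message" heuristic cannot express.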
Makes sense. Could there be a case for calling into Lua to do this parsing?
We've improved multi-line handling support; please check out https://github.com/timberio/vector/pull/1852.
Since that doesn't directly address this issue, I'm not going to close it, but I'll rename it to better represent the contents.
After the internal discussion, we deduced it's unlikely we'll be providing explicit support for the s-expressions in the near future. Our new multi-line parsing capabilities might help with your problem though!
> Makes sense. Could there be a case for calling into Lua to do this parsing?
After our lua transform is upgraded to be able to persist data across invocations - this should definitely be doable!
Here's a tutorial on merging multi-line logs with lua: https://vector.dev/guides/advanced/merge-multiline-logs-with-lua/