Vector: New `udp` source

Created on 10 May 2019 · 13 Comments · Source: timberio/vector

A simple udp source. This is already supported in our syslog source so I assume we can reuse a lot of that code, and the tcp source is very similar.

I'm wondering if there are opportunities to share this code, but I would not include that in the scope of this change.

Labels: good first issue, feature


binarylogic

👍4

Most helpful comment

So implementation-wise I think it's totally sane to use https://github.com/timberio/vector/blob/master/src/sources/tcp.rs#L72 and support multiple messages per datagram. As for network saturation, UDP doesn't provide congestion control or flow control, so this is 100% up to the sender and should not be a concern of Vector.

I think the best path forward here is actually to do something similar to the TcpSource trait but for UdpSource. https://docs.rs/tokio/0.1.22/tokio/net/struct.UdpFramed.html can be used to provide the same functionality that FramedRead does for TCP. From this we can then provide a syslog implementation and a base udp source. Happy to think through this some more. I think there is some more unification we can do with the TcpSource trait, but we can punt on that for now.
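A minimal sketch of what such a shared abstraction could look like, assuming a synchronous, std-only setting rather than tokio's UdpFramed. Every name here (`DatagramDecoder`, `NewlineDecoder`, `WholeDatagramDecoder`, `handle_datagram`) is hypothetical and does not exist in Vector; the point is only that a base udp source and a syslog-style source could differ in how one datagram is turned into messages while sharing the rest of the receive path.

```rust
// Hypothetical sketch: none of these names exist in Vector.

/// Turns the payload of one UDP datagram into zero or more raw messages.
trait DatagramDecoder {
    fn decode<'a>(&self, datagram: &'a [u8]) -> Vec<&'a [u8]>;
}

/// Base `udp` behavior: newline-delimited messages within one datagram.
struct NewlineDecoder;

impl DatagramDecoder for NewlineDecoder {
    fn decode<'a>(&self, datagram: &'a [u8]) -> Vec<&'a [u8]> {
        datagram
            .split(|b| *b == b'\n')
            .filter(|m| !m.is_empty()) // skip empty pieces, e.g. after a trailing newline
            .collect()
    }
}

/// Syslog-style behavior: the whole datagram is a single message.
struct WholeDatagramDecoder;

impl DatagramDecoder for WholeDatagramDecoder {
    fn decode<'a>(&self, datagram: &'a [u8]) -> Vec<&'a [u8]> {
        if datagram.is_empty() {
            Vec::new()
        } else {
            vec![datagram]
        }
    }
}

/// Shared path: whatever the decoder yields is copied into owned raw-byte messages.
fn handle_datagram<D: DatagramDecoder>(decoder: &D, datagram: &[u8]) -> Vec<Vec<u8>> {
    decoder
        .decode(datagram)
        .into_iter()
        .map(|m| m.to_vec())
        .collect()
}
```

With this shape, `handle_datagram(&NewlineDecoder, b"a\nb\n")` yields two messages, while `handle_datagram(&WholeDatagramDecoder, b"a\nb\n")` yields one message containing the full payload.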

LucioFranco on 9 Aug 2019

👍2

All 13 comments

As I see it, the udp source should work on a per-datagram basis: each datagram is split on newlines into messages, as the tcp source does, with the last message extending to the end of the datagram.
The messages are then decoded into events as Value::RawBytes, as the tcp source does.

Alternatively, the whole datagram can be considered as one message.

Then:

~~MTU will serve as max_length.~~ (EDIT: I just remembered that jumbo packets exist, so max_length is still needed, and even if not for that, then for configurability and configuration consistency.)
Out-of-order packets are not a problem.
Missing packets cause only the messages they contained to be lost.
Duplicate packets cause the same messages to be forwarded more than once.
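The per-datagram semantics listed above can be sketched as follows. This is an illustration under assumptions, not Vector's code: `decode_datagram` and `decode_all` are hypothetical names, and dropping messages longer than `max_length` is an assumed policy chosen for the sketch.

```rust
/// Hypothetical helper: splits one datagram into newline-delimited messages.
/// The last message may end at the end of the datagram without a newline.
/// Assumption: messages longer than `max_length` bytes are dropped.
fn decode_datagram(datagram: &[u8], max_length: usize) -> Vec<Vec<u8>> {
    datagram
        .split(|b| *b == b'\n')
        .filter(|m| !m.is_empty() && m.len() <= max_length)
        .map(|m| m.to_vec())
        .collect()
}

/// Each datagram is decoded independently, in arrival order. No state is kept
/// between datagrams, so reordering only reorders messages, a lost datagram
/// loses only its own messages, and a duplicated datagram is forwarded twice.
fn decode_all(datagrams: &[&[u8]], max_length: usize) -> Vec<Vec<u8>> {
    datagrams
        .iter()
        .flat_map(|d| decode_datagram(d, max_length))
        .collect()
}
```

Because no message spans two datagrams, the decoder needs no reassembly buffer, which is what makes out-of-order delivery harmless in this design.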

ktff on 8 Aug 2019

@binarylogic
Do correct me if I am off somewhere.

ktff on 8 Aug 2019

@ktff everything you outlined sounds correct to me, and clearly you have more knowledge of this protocol than I do 😄. It's safe to proceed with the requirements you outlined.

binarylogic on 8 Aug 2019

Before specifying this too thoroughly, a little thought should perhaps go into throughput and security.

Sending one log message per datagram is acceptable for an initial implementation, but may have difficulty scaling once the rate of incoming messages grows, particularly if this could be used in a fan-in type of configuration.

Making the transport more secure (with something like DTLS) only exacerbates the problem, as there are some potentially expensive operations that happen for each packet. Of course, this may only be possible by adding some reliability to the transport, so it all may be out of scope.

bruceg on 8 Aug 2019

Thanks @bruceg. Do you have any suggested changes you'd make to the requirements? And I assume these same caveats are currently present in the syslog source as well?

In general, this first iteration should be very close to that of the tcp and syslog sources. If we find ways to improve this, I think follow-up issues are best, so we can improve all sources.

binarylogic on 9 Aug 2019

I agree with @bruceg that having one message per packet is inefficient and would have low throughput. That's why I suggested accepting multiple messages in the same packet, separated by newlines. That way, the udp source would have throughput slightly greater than the tcp source.
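On the sending side, the batching idea could look roughly like this. It is a sketch under assumptions: `pack_messages` is a hypothetical helper, the payload limit is whatever the sender chooses, and messages that cannot fit in any datagram are simply skipped here.

```rust
/// Hypothetical sender-side helper: packs newline-terminated messages into
/// datagram payloads of at most `max_payload` bytes each.
fn pack_messages(messages: &[&str], max_payload: usize) -> Vec<Vec<u8>> {
    let mut datagrams = Vec::new();
    let mut current: Vec<u8> = Vec::new();

    for msg in messages {
        let needed = msg.len() + 1; // message bytes plus the newline delimiter
        if needed > max_payload {
            continue; // assumption: a message that fits in no datagram is skipped
        }
        if current.len() + needed > max_payload {
            // The current datagram is full; start a new one.
            datagrams.push(std::mem::take(&mut current));
        }
        current.extend_from_slice(msg.as_bytes());
        current.push(b'\n');
    }

    if !current.is_empty() {
        datagrams.push(current);
    }
    datagrams
}
```

For example, packing `["aa", "bb", "cc"]` with a 6-byte limit produces two payloads, `"aa\nbb\n"` and `"cc\n"`, instead of three separate datagrams.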

However, with udp the sending side would have difficulty adjusting its sending rate, which could cause the sender to inadvertently saturate the network subsystem. But since that would de facto be a DDoS attack, which can happen anyway, its mitigation is out of scope, and it is more a question of whether the udp source should be supported at all.

If users want udp with reliability, then they would use tcp.
If they want to send and forget messages as simply as possible, and are ok with missing and duplicate messages, then udp is a good choice.
If there is a need for some protocol in between tcp and udp, then that's a separate issue.

As @bruceg said, security is an issue, but I also agree with @binarylogic that it's a separate issue.

ktff on 9 Aug 2019

LucioFranco on 9 Aug 2019

👍2

@binarylogic The syslog source is constrained by the syslog protocol, and so cannot be adapted in the same way. If I am reading the RFC right, it can only do one message per packet.

I don't have any immediate suggestions for changes to the spec; I just wanted to put on the table some issues that should be considered before nailing down a protocol.

bruceg on 9 Aug 2019

👍1

We actually already implement syslog udp: https://github.com/timberio/vector/blob/master/src/sources/syslog.rs#L131

LucioFranco on 9 Aug 2019

~~Then I'll start implementing udp source with multiple messages per datagram, and see on the way what code can be shared, is what I would like to say, but.~~

I have dug deeper into syslog udp. Isn't the address in StatsdConfig useless, since UdpFramed::poll calls UdpSocket::poll_recv_from underneath, which accepts udp packets from any address? Is that intended, or is it a bug? The documentation for the statsd source never explicitly says that it accepts only packets from the specified address, but it implies that it does.

EDIT: I mistook the local address for the remote address.
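The local/remote distinction can be illustrated with the standard library's blocking `UdpSocket` (not the tokio 0.1 API used in Vector). In this sketch, `recv_one` and `loopback_demo` are hypothetical names: the address passed to `bind` is the local address the socket listens on, while the remote address of each sender is only learned from `recv_from`.

```rust
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Receives one datagram and returns its payload together with the remote
/// (sender) address reported by the OS. Payloads longer than `max_length`
/// are truncated by the receive buffer.
fn recv_one(socket: &UdpSocket, max_length: usize) -> std::io::Result<(Vec<u8>, SocketAddr)> {
    let mut buf = vec![0u8; max_length];
    let (len, remote) = socket.recv_from(&mut buf)?;
    buf.truncate(len);
    Ok((buf, remote))
}

/// Loopback demo: returns (sender's local address, remote address seen by the
/// receiver, payload). The receiver never names the sender when binding.
fn loopback_demo() -> std::io::Result<(SocketAddr, SocketAddr, Vec<u8>)> {
    // Local address: where the receiver listens. Port 0 lets the OS pick one.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    receiver.set_read_timeout(Some(Duration::from_secs(2)))?;

    let sender = UdpSocket::bind("127.0.0.1:0")?;
    sender.send_to(b"hello\n", receiver.local_addr()?)?;

    let (payload, remote) = recv_one(&receiver, 1024)?;
    Ok((sender.local_addr()?, remote, payload))
}
```

A bound socket accepts datagrams from any sender that can reach it, so restricting input to particular remote addresses would need an explicit check on the address returned by `recv_from`.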

ktff on 10 Aug 2019

I suggest we hold off implementing this at least until Monday to see if anyone else has something to add.

Then, as it currently stands, I'll start implementing the udp source with multiple messages per datagram, and see along the way what code can be shared.

ktff on 10 Aug 2019

> I suggest we hold off implementing this at least until Monday to see if anyone else has something to add.
> Then, as it currently stands, I'll start implementing the udp source with multiple messages per datagram, and see along the way what code can be shared.

That's perfectly fine. And apologies, this issue really should have been spec'd out a little more. Unfortunately this is not an area of expertise for me and the engineer we use to review these issues is on vacation this week.

In general though, we _strongly_ encourage people to start small and simple. It helps to focus the discussion and makes PR reviews higher quality. So please feel free to implement a basic version of this first; then we can create follow-up issues and enhance it separately.

binarylogic on 10 Aug 2019

I think that this conversation is not over, but as @binarylogic said, it can extend into follow up issues. So I will close this one.

ktff on 14 Aug 2019

