A simple udp source. This is already supported in our syslog source so I assume we can reuse a lot of that code, and the tcp source is very similar.
I'm wondering if there are opportunities to share this code, but I would not include that in the scope of this change.
As I see it, the udp source should work on a per-datagram basis: each datagram is split on newlines into messages, as in the tcp source, with the last message extending to the end of the datagram.
The messages are then decoded into events as Value::RawBytes, as tcp source does.
Alternatively, the whole datagram can be considered as one message.
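To make the two options concrete, here is a minimal sketch in plain Rust. The function names split_datagram and whole_datagram are hypothetical and not taken from the Vector codebase, and skipping empty messages is an assumption rather than something settled in this thread.

```rust
/// Splits one UDP datagram into newline-delimited messages.
/// The last message runs to the end of the datagram, so a trailing
/// newline is optional. Empty messages are skipped (an assumption).
fn split_datagram(datagram: &[u8]) -> Vec<&[u8]> {
    datagram
        .split(|&b| b == b'\n')
        .filter(|msg| !msg.is_empty())
        .collect()
}

/// The alternative: treat the whole datagram as a single message,
/// newlines included.
fn whole_datagram(datagram: &[u8]) -> Vec<&[u8]> {
    vec![datagram]
}
```

With the first function, a datagram containing "first\nsecond\nthird" yields three messages; with the second, it yields one message that still contains the newlines.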
Then:
@binarylogic
Do correct me if I am off somewhere.
@ktff everything you outlined sounds correct to me, and clearly you have more knowledge of this protocol than me 😄. It's safe to proceed with the requirements you outlined.
Before specifying this too thoroughly, a little thought should perhaps go into throughput and security.
Sending one log message per datagram is acceptable for an initial implementation, but may have difficulty scaling once the rate of incoming messages grows, particularly if this could be used in a fan-in type of configuration.
Making the transport more secure (with something like DTLS) only exacerbates the problem, as there are some potentially expensive operations that happen for each packet. Of course, this may only be possible by adding some reliability to the transport, so it all may be out of scope.
Thanks @bruceg. Do you have any suggested changes you'd make to the requirements? And I assume these same caveats are currently present in the syslog source as well?
In general, this first iteration should be very close to that of the tcp and syslog source. If we find ways to improve this, I think follow up issues are best, so we can improve all sources.
I agree with @bruceg that having one message per one packet is inefficient, and would have low throughput. That's why I suggested accepting multiple messages in the same packet which would be separated by newline. That way, udp source would have throughput slightly greater than tcp source.
With udp, though, the sending side would have difficulty adjusting its sending rate, which could cause the sender to inadvertently saturate the network subsystem. But since that would be a de facto ddos attack that can happen anyway, its mitigation is out of scope, and is more a question of whether the udp source should be supported at all.
If users want udp with reliability then they would use tcp.
If they want to send and forget messages as simply as possible, and are ok with missing and duplicate messages, then udp is a good choice.
If there is a need for some protocol in between tcp and udp, then that's a separate issue.
As @bruceg said, security is an issue, but I also agree with @binarylogic that it's a separate issue.
So implementation-wise I think it's totally sane to use https://github.com/timberio/vector/blob/master/src/sources/tcp.rs#L72 and support multiple messages per datagram. As for network saturation, UDP doesn't provide congestion control or flow control, so this is 100% up to the sender and should not be a concern of vector.
I think the best path forward here is actually to do something similar to the TcpSource trait but for UdpSource. https://docs.rs/tokio/0.1.22/tokio/net/struct.UdpFramed.html can be used to provide the same functionality that FramedRead does for TCP. From this we can then provide a syslog implementation and a base udp source. Happy to think through some more. I think there is some more unification we can do with the TcpSource trait, but we can punt on that for now.
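As a rough illustration of that shape, here is a sketch of what a UdpSource-style trait could look like. Everything in it is hypothetical: the trait, the two implementing types, and handle_datagram are invented for illustration, messages are plain Strings rather than Vector events, and tokio's UdpFramed is deliberately left out so the example stays standard-library only.

```rust
/// Hypothetical trait: each UDP-based source only decides how one
/// datagram is turned into messages. Receiving is shared elsewhere.
trait UdpSource {
    fn decode(&self, datagram: &[u8]) -> Vec<String>;
}

/// Base udp source: several newline-delimited messages per datagram.
struct LineUdpSource;

/// Syslog over udp: one message per datagram.
struct SyslogUdpSource;

impl UdpSource for LineUdpSource {
    fn decode(&self, datagram: &[u8]) -> Vec<String> {
        String::from_utf8_lossy(datagram)
            .lines()
            .filter(|line| !line.is_empty())
            .map(str::to_owned)
            .collect()
    }
}

impl UdpSource for SyslogUdpSource {
    fn decode(&self, datagram: &[u8]) -> Vec<String> {
        // The whole datagram is one message; only trailing whitespace is dropped.
        vec![String::from_utf8_lossy(datagram).trim_end().to_owned()]
    }
}

/// Shared entry point: the receive loop would call this for every
/// datagram, regardless of which source is configured.
fn handle_datagram(source: &dyn UdpSource, datagram: &[u8]) -> Vec<String> {
    source.decode(datagram)
}
```

The point of the split is that the socket handling lives in one place, while the syslog and base udp sources differ only in their decode step.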
@binarylogic The syslog source is constrained by the syslog protocol, and so cannot be adapted in the same way. If I am reading the RFC right, it can only do one message per packet.
I don't have any immediate suggestions for changes to the spec, I just wanted to put on the table some issues that should be considered before nailing down a protocol.
We actually already implement syslog over udp: https://github.com/timberio/vector/blob/master/src/sources/syslog.rs#L131
I would like to say that I'll start implementing the udp source with multiple messages per datagram and see along the way what code can be shared, but first:
I have dug deeper into syslog udp. Isn't the address in StatsdConfig useless, since UdpFramed::poll will underneath call UdpSocket::poll_recv_from, which accepts udp packets from any address? Is that intended, or is it a bug? The documentation for the statsd source never explicitly says that it will accept only packets from the specified address, but it is implied.
EDIT: I mistook the local address for the remote address.
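The local/remote distinction can be seen with the standard library's UdpSocket alone. The sketch below is not Vector code and receive_from_any_sender is a hypothetical name. It binds a receiver to a local address, sends it one datagram from a second socket over loopback, and returns both addresses along with the payload.

```rust
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Returns (receiver's local address, sender's address, payload).
/// The bound address only selects where the receiver listens; it does
/// not restrict which remote addresses may send to it.
fn receive_from_any_sender() -> std::io::Result<(SocketAddr, SocketAddr, Vec<u8>)> {
    // Port 0 asks the OS for any free port on the loopback interface.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    // Time out instead of blocking forever if the datagram is lost.
    receiver.set_read_timeout(Some(Duration::from_secs(2)))?;
    let local = receiver.local_addr()?;

    // A second, unrelated socket plays the role of the remote sender.
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    sender.send_to(b"hello", local)?;

    let mut buf = [0u8; 1500];
    // recv_from reports the remote address of whoever sent the datagram.
    let (len, remote) = receiver.recv_from(&mut buf)?;
    Ok((local, remote, buf[..len].to_vec()))
}
```

The configured address corresponds to local here. Filtering by remote would be an extra step applied to the address that recv_from returns.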
I suggest we hold off implementing this at least until Monday to see if anyone else has something to add.
Then, as it currently stands, I'll start implementing the udp source with multiple messages per datagram, and see along the way what code can be shared.
That's perfectly fine. And apologies, this issue really should have been spec'd out a little more. Unfortunately this is not an area of expertise for me and the engineer we use to review these issues is on vacation this week.
In general though, we _strongly_ encourage people to start small and simple. It helps to focus the discussion and makes PR reviews higher quality. So please feel free to implement a basic version of this first, then we can create follow up issues and enhance it separately.
I think that this conversation is not over, but as @binarylogic said, it can extend into follow up issues. So I will close this one.