When using the NetworkTarget with KeepConnection=true, the connection to the server is kept open so that it can be reused. With the TCP protocol, if the server is configured to close the connection after a period of inactivity (a common configuration to avoid DoS), we end up with a NetworkSender holding a closed Socket. The problem is that the next log message we try to send will apparently not be received by the server (executing https://github.com/NLog/NLog/blob/master/src/NLog/Targets/NetworkTarget.cs#L291).
One workaround would be to set KeepConnection=false, but for a TCP connection this can become expensive once the application starts writing more logs, since every message then pays the cost of a new connection handshake.
Another solution would be to use RetryingTargetWrapper, but it seems a bit odd to configure a retry for that particular case, since no error has really occurred.
So do you think it is possible to check the status of the socket before using it to send the message, and to open a new one in case the old one was cleanly closed by the server?
I guess it would make sense, in the KeepConnection=true situation, that if the Socket has been used successfully before, the target should perform an implicit connection-retry before failing hard.
Pull Requests are welcome.
That would be a nice feature to add. PR would be great indeed!
Created #3124 to allow faster detection of broken socket connections.
Probably need some extra logic to perform a zero-byte Send-call before using an existing socket:
https://docs.microsoft.com/en-us/dotnet/api/system.net.sockets.socket.connected#examples
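The linked .NET example probes `Socket.Connected` with a non-blocking zero-byte `Send` and treats "would block" as "still connected". A roughly equivalent portable probe, sketched here in Python since the underlying BSD-socket semantics are the same, is a non-blocking `MSG_PEEK` receive: 0 bytes means the peer performed an orderly shutdown, while "would block" means the connection is still alive. The function name and shape are illustrative, not NLog's API:

```python
import socket

def peer_closed(sock: socket.socket) -> bool:
    """Probe a TCP socket without consuming any pending data.

    Returns True when the peer has gracefully closed the connection
    (a peeked recv yields 0 bytes) or the socket has errored out;
    returns False when the connection is still usable.
    """
    was_blocking = sock.getblocking()
    try:
        sock.setblocking(False)
        # MSG_PEEK leaves any pending data in the kernel buffer.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except BlockingIOError:
        return False          # nothing to read yet -> still connected
    except OSError:
        return True           # reset / broken pipe -> treat as closed
    finally:
        sock.setblocking(was_blocking)
```

Note that any probe like this is inherently racy: the server can still close the connection between the check and the next send, so send errors have to be handled regardless.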
This is actually pretty bad. If the TCP connection gets closed (which can be quite normal), NLog has no internal mechanism for observing that the connection has been gracefully closed. In that case, .NET invokes the receive callback with 0 received bytes, but NLog registers no such receive handler. I will see if I can write a patch for this in the near future.
@martinmine You are more than welcome to create a PR. Maybe make it so you only perform the connection check when the socket has been unused for X seconds (-1 = check disabled, the default; 0 = always check).
> I will see if I can try and write a patch for this in the near future
That would be great :+1:
@snakefoot I wasn't thinking of performing any connection checking, but simply of adding support for listening for the connection being gracefully closed. All in all, I simply register a receive handler, and if that handler receives 0 bytes, the connection gets marked as closed.
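The idea above can be sketched as follows (a Python stand-in for an async BeginReceive; names are made up). A log sender normally never receives inbound data, so a receive that completes with 0 bytes can only mean the peer closed the connection:

```python
import socket
import threading

def register_close_listener(sock: socket.socket, closed: threading.Event) -> None:
    """Start a background receive whose only job is to observe a graceful
    close: recv() returning b'' means the peer sent a FIN. This mirrors an
    async receive callback firing with 0 received bytes."""
    def listen() -> None:
        try:
            # MSG_PEEK avoids consuming data in the (unusual) case the
            # server does send something; the call blocks until data
            # arrives, the peer closes (b''), or the socket errors out.
            if sock.recv(1, socket.MSG_PEEK) == b"":
                closed.set()
        except OSError:
            closed.set()      # a hard error also counts as "connection gone"
    threading.Thread(target=listen, daemon=True).start()
```

The sender would then consult the `closed` flag before reusing the socket, instead of probing it on demand.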
@martinmine Like this: https://stackoverflow.com/a/53315852/193178 (Sync) ? or like this https://stackoverflow.com/a/32244584/193178 + https://stackoverflow.com/a/2015520/193178 (Async) ?
@snakefoot Actually, neither. I don't want to mess with any socket parameters, since I want a general solution to this problem. Different systems (routers/endpoints/whatever) may behave differently with the keep-alive flag, for example, and the root cause of this problem is a lack of error handling, which is what I want to improve. Give me a few hours and I might have something going on a proof-of-concept branch :)
@martinmine Been able to decipher the NLog-code for NetworkTarget? Or maybe it is easier for you to present your idea as sample-code here in this issue?
@snakefoot Sorry for the delay, I got quite busy the past few days ... :( I've managed to scramble something together; I haven't tested it yet, but I think it's a start: https://github.com/NLog/NLog/commit/39d39cff047e83a3e18be823804b0d8e530a3c03
@martinmine Ok, so you are doing an async Receive, and if it completes with no data returned, you consider the connection broken. Just like the stackoverflow articles are doing.
Some of the stackoverflow articles improve on this logic by enabling socket KeepAlive, so they also react to a network-cable disconnect (and not just a correct socket-close handshake). But I think it is a good idea to keep these two features separate.
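For reference, the separate keep-alive feature looks roughly like this at the socket level (sketched in Python; `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` are Linux-specific options and the default values here are illustrative, not anything NLog uses):

```python
import socket

def enable_tcp_keepalive(sock: socket.socket,
                         idle: int = 60,
                         interval: int = 10,
                         probes: int = 3) -> None:
    """Ask the kernel to probe an idle connection, so a vanished peer
    (e.g. a pulled network cable) is eventually detected even though no
    FIN close-handshake ever arrives."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-only fine-tuning knobs
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

This detects a dead peer, whereas the zero-byte-receive approach only detects an orderly close; the two are complementary.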
But I guess some logic is needed for better handling of _pendingRequests. Maybe, instead of closing the socket and throwing it all away, try a reconnect first?
@snakefoot I don't see any reason to implement it synchronously, as the APIs used in NLog are asynchronous. I went back and forth a bit on how to approach it, since I haven't used this specific async Socket API before, and it also had to fit the context in which NLog uses it.
I agree it would make sense to implement these two separately (imo gracefully handling socket close is the bare minimum). I don't think attempting to reconnect would be appropriate. Let's say the connection gets closed on the server due to inactivity; it would appear odd if NLog then just opened connections that don't send any data. The reason I close it at this stage (_pendingRequests) is that I don't have a way of signaling to the NetworkTarget class when the close event fires. Since a socket connection is so tightly coupled with how the class is written (the properties in the NetworkSender and TcpNetworkSender classes reflect much of the current socket state), I would argue it is more reasonable to throw it away.
> It would appear odd if NLog would just open connections that don't send any data.
Well, let's try this story with your new logic:
1) NetworkTarget sends 100 LogEvents to address IPA (creates NetworkSender etc.)
2) NetworkSender enters a disconnected state after having sent 50 LogEvents (still 50 events in the pending queue)
3) NetworkTarget sends 100 new LogEvents to address IPA (looks up the NetworkSender, sees it is not connected, disconnects it, and then creates a new NetworkSender to the same address)
Now we have lost 50 LogEvents, and the NetworkTarget will happily reconnect anyway on the next batch. I understand it would not make sense to reconnect when there is nothing to send. But when there are items in the pending queue, a simple retry mechanism with a randomly growing retry delay would be "smart".
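The "randomly growing retry delay" mentioned here is essentially exponential backoff with jitter; a minimal sketch (function name and defaults are made up for illustration):

```python
import random

def retry_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Delays (in seconds) for successive reconnect attempts.

    The upper bound doubles on each attempt (capped), and the actual
    delay is drawn uniformly at random below that bound, so many
    clients don't reconnect in lockstep after a server restart."""
    return [random.uniform(0.0, min(cap, base * (2 ** i)))
            for i in range(attempts)]
```

The randomization matters for a logging target in particular: after a log server restarts, all its clients discover the broken connection at roughly the same moment.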
@snakefoot I don't think that would be a good idea, since you then face the issue of the server possibly receiving the same data multiple times between reconnections. To deal with that you would have to implement some acknowledgement mechanism on top of TCP/IP, and I don't think that is something NLog should come with out of the box. Maybe behind a feature flag, perhaps? :)
@martinmine Not sure I understand. Data lost on the wire will not be sent again, but sending data that resides in _pendingRequests (stuff not yet sent) after a reconnect should not cause the same data to be sent twice.
There is also this comment from the NLog syslog-target project. They are talking about both UDP and TCP, and argue that one should only react to send failures instead of trying to perform a connection check using receive/poll. But their arguments seem to apply mostly to UDP, not TCP.
@snakefoot If the server partially receives a log message and the connection dies, and you then reconnect and send the whole thing again, the server could end up with parts (or all) of the same log statement twice, depending on where the connection got closed. There is also the risk that the server receives parts of the log message while the client cannot observe that it arrived, so the client re-sends it, again resulting in the same data being received multiple times.
@martinmine
> If you partially receive a log message (at the server) and the connection dies, you reconnect and send the whole thing again
When a TCP connection is lost, there is no auto-reconnect by default. Data on the wire (or in socket buffers) is lost.
I suggest that the NLog NetworkTarget should reconnect when it sees it still has unsent data that never reached the socket (items still existing in _pendingRequests).
Your suggestion of a handshake protocol between server and client after reconnect, to recognize what was lost on the wire (or in socket buffers), is a very different project that might never be completed.
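The reconnect-and-flush idea for _pendingRequests can be sketched like this (hypothetical names, not NLog's actual NetworkSender; just the shape of the logic). The key point is that an item leaves the pending queue only after a successful send, so a reconnect retries only data that never reached the socket and nothing is sent twice:

```python
import collections
import socket

class ReconnectingSender:
    """Keep not-yet-sent payloads in a pending queue; on a broken
    connection, reconnect and resend only what never reached the socket."""

    def __init__(self, address, max_reconnects: int = 3):
        self.address = address
        self.max_reconnects = max_reconnects
        self.pending = collections.deque()
        self.sock = socket.create_connection(address)

    def send(self, payload: bytes) -> None:
        self.pending.append(payload)
        self._flush()

    def _flush(self) -> None:
        reconnects = 0
        while self.pending:
            try:
                self.sock.sendall(self.pending[0])
            except OSError:
                if reconnects >= self.max_reconnects:
                    raise        # hard failure only after retrying
                reconnects += 1
                try:
                    self.sock.close()
                except OSError:
                    pass
                self.sock = socket.create_connection(self.address)
                continue         # retry the same (never-sent) item
            self.pending.popleft()  # drop only after a successful send
```

A real implementation would combine this with the backoff delays discussed earlier instead of reconnecting immediately.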
@snakefoot Sorry, my bad, I misread your comment. Considering _pendingRequests, I agree it would make sense to reconnect to the server on a lost connection and send what is left in the queue until it is empty.