We are observing nodes running v2.6.0 having connection problems with mainnet. This causes nodes to fall behind in their highest block frequently (roughly every 1-2 hours), after which they need to be restarted.
While debugging, I noted the following when a node is stalled (0 peers):
TcpRemoteNode.cs ConnectAsync() line 42 is catching a SocketException from the remote node: "No connection could be made because the target machine actively refused it [IP ADDRESS HERE]"
I suspect the problems are caused by the await Task.Yield() added to RemoteNode and LocalNode in this commit: https://github.com/neo-project/neo/commit/b88846ea997ba3fe8d4a58bf2dbd0f970d4a2e8f
On a privatenet using only v2.6.0 everything appears to operate fine, so perhaps the problem only shows when v2.6.0 is trying to connect to nodes with earlier versions.
+#if !NET47
+ //There is a bug in .NET Core 2.0 that blocks async method which returns void.
+ await Task.Yield();
+#endif
Any info on this bug, @erikzhang?
We really shouldn't be using async void methods at all; they prevent callers from catching exceptions. See https://msdn.microsoft.com/en-us/magazine/jj991977.aspx
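To illustrate the point about exceptions, here is a minimal sketch. All type and method names are hypothetical and none of this is the actual neo code. It only shows that a caller can catch an exception from an awaited Task-returning method, while an async void method gives the caller nothing to await.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical demo type; none of these names come from the neo code base.
public static class AsyncVoidDemo
{
    // Returning Task gives the caller something to await, so the exception
    // thrown after the yield is delivered to the awaiting caller.
    public static async Task ConnectAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("simulated connection failure");
    }

    // Same body, but async void returns nothing to await. A try/catch around
    // a call to this method cannot observe the exception; it is rethrown on
    // the synchronization context (or the thread pool) and typically
    // terminates the process. Deliberately never called in this sketch.
    public static async void ConnectFireAndForget()
    {
        await Task.Yield();
        throw new InvalidOperationException("simulated connection failure");
    }

    // Awaiting the Task-returning version lets an ordinary catch block run.
    public static async Task<bool> TryConnectAsync()
    {
        try
        {
            await ConnectAsync();
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }
}
```

Calling AsyncVoidDemo.TryConnectAsync() and awaiting the result yields false, because the simulated failure is caught inside the method.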
I don't think it's caused by async void.
I don't know whether it fixes the issue, but I recommend always returning Task instead of using async void ;)
@carlosrfernandez Yes, we will change that in a separate PR, but it is not causing the particular problems we have seen in this issue.
We have seen the following when a node fails...
System.Net.Sockets.SocketException: Too many open files in system
at System.Net.Sockets.Socket..ctor(AddressFamily addressFamily, SocketType socketType, ProtocolType protocolType)
at Neo.Network.TcpRemoteNode..ctor(LocalNode localNode, IPEndPoint remoteEndpoint)
at Neo.Network.LocalNode.<ConnectToPeerAsync>d__55.MoveNext()
On Linux, each open TCP socket consumes a file descriptor. If we are not closing sockets properly, the process will eventually hit its open-file limit and trigger this exception.
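As a rough sketch of what "closing properly" could look like on the connect path (hypothetical names, not the TcpRemoteNode implementation), the idea is that any failed attempt disposes its socket, so refused connections do not accumulate descriptors:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper; it does not mirror the real TcpRemoteNode API.
public static class SocketCleanupDemo
{
    // Returns a connected socket that the caller owns and must dispose,
    // or null when the connection attempt fails with a SocketException.
    public static async Task<Socket> ConnectOrNullAsync(IPEndPoint endpoint)
    {
        var socket = new Socket(endpoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        bool connected = false;
        try
        {
            await socket.ConnectAsync(endpoint);
            connected = true;
            return socket;
        }
        catch (SocketException)
        {
            // For example "connection refused"; cleanup happens in finally.
            return null;
        }
        finally
        {
            // Release the file descriptor on every failure path, including
            // exception types that are not caught above.
            if (!connected)
            {
                socket.Dispose();
            }
        }
    }
}
```

The flag-plus-finally shape is just one way to do this; the point is that no exit path leaves an undisposed socket behind.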
The sockets will be closed at https://github.com/neo-project/neo/blob/master/neo/Network/TcpRemoteNode.cs#L55
Do you know how many clients were connected at that precise moment?
Maybe we should add a Disconnect call on these lines; any exception must disconnect the client:
https://github.com/neo-project/neo/blob/master/neo/Network/TcpRemoteNode.cs#L78
https://github.com/neo-project/neo/blob/master/neo/Network/TcpRemoteNode.cs#L79
Something like that
@AshRolls Have we tried increasing the limits in /etc/security/limits.conf and /etc/sysctl.conf as a workaround?
On my Ubuntu 16.04 nodes neo-cli is running under supervisord which defaults to a max of 1024 files. I have added minfds=65356 to the configuration file to see if that helps.
@belane I have also added fs.file-max = 100000 to sysctl.conf
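For reference, the two settings mentioned above would look roughly like this. The values are the ones quoted in this thread. The supervisord file location varies by installation, and I am assuming the default layout where minfds belongs in the [supervisord] section.

```ini
; supervisord configuration file (path varies by installation)
[supervisord]
minfds=65356
```

```
# /etc/sysctl.conf
fs.file-max = 100000
```

The sysctl change can be loaded with sysctl -p. supervisord most likely needs to be restarted before the new minfds applies to neo-cli. The effective limit of a running process can be checked in /proc/<pid>/limits.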
@erikzhang Yep, I have been looking at that code. We should really be calling socket.Shutdown(SocketShutdown.Both) before we dispose of the socket, but when I add this to the disconnect method it causes exceptions in the Task.WaitAll(...) in ConnectToPeersLoop().
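One possible explanation, which is an assumption and not something confirmed in this thread, is that Shutdown can throw when the socket never connected or has already been disposed. A guarded, idempotent disconnect might look like the sketch below. The wrapper type is hypothetical and is not the existing Disconnect method.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

// Hypothetical wrapper; not the actual RemoteNode/TcpRemoteNode code.
public sealed class SafeConnection : IDisposable
{
    private readonly Socket socket;
    private int disposed; // 0 = live, 1 = disposed

    public SafeConnection(Socket socket)
    {
        this.socket = socket;
    }

    public bool IsDisposed => Volatile.Read(ref disposed) == 1;

    public void Dispose()
    {
        // Only the first caller performs the cleanup, even when several
        // threads disconnect at the same time.
        if (Interlocked.Exchange(ref disposed, 1) == 1) return;
        try
        {
            // Shutdown is only attempted on a connected socket.
            if (socket.Connected)
            {
                socket.Shutdown(SocketShutdown.Both);
            }
        }
        catch (SocketException)
        {
            // The peer may already have dropped the connection.
        }
        catch (ObjectDisposedException)
        {
            // Another code path disposed the socket first.
        }
        finally
        {
            socket.Dispose();
        }
    }
}
```

With this shape, disconnecting a socket that never connected, or disconnecting twice, completes without throwing, so exceptions from Shutdown would not propagate into tasks that a caller waits on.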
This should be fixed in 2.7.3. Closing.