Neo: v2.6.0 Connection problems - mainnet

Created on 19 Jan 2018 · 12 Comments · Source: neo-project/neo

We are observing nodes running v2.6.0 having connection problems with mainnet. This causes nodes to fall behind the highest block frequently (roughly every 1-2 hours), after which they need to be restarted.

While debugging, I noted the following when a node is stalled (0 peers):
In TcpRemoteNode.cs, ConnectAsync() (line 42) is catching a SocketException from the remote node: "No connection could be made because the target machine actively refused it [IP ADDRESS HERE]"

I suspect the problems are caused by the await Task.Yield() calls added to RemoteNode and LocalNode in this commit: https://github.com/neo-project/neo/commit/b88846ea997ba3fe8d4a58bf2dbd0f970d4a2e8f

On a privatenet using only v2.6.0, everything appears to operate fine, so perhaps the problem only shows up when v2.6.0 is trying to connect to nodes running earlier versions.

bug critical

All 12 comments

+#if !NET47
+            //There is a bug in .NET Core 2.0 that blocks async method which returns void.
+            await Task.Yield();
+#endif

Any info on this bug, @erikzhang?

We really shouldn't be using async void methods at all: exceptions thrown from them cannot be caught by the caller. See https://msdn.microsoft.com/en-us/magazine/jj991977.aspx
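To illustrate the point about exceptions, here is a minimal sketch, not taken from the neo codebase. `AsyncTaskSketch`, `ConnectAsync` and `TryConnectAsync` are hypothetical names, and the thrown exception only stands in for a failed connection attempt. Because the method returns Task, the caller can await it and handle the failure in an ordinary try/catch. With an async void signature there is nothing to await, so the same try/catch would not see the exception.

```csharp
using System;
using System.Threading.Tasks;

public static class AsyncTaskSketch
{
    // Hypothetical stand-in for a connection attempt that fails.
    // Returning Task (not void) lets the caller observe the exception.
    public static async Task ConnectAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("connection refused");
    }

    // The awaited Task rethrows the exception here, so it can be caught.
    public static async Task<bool> TryConnectAsync()
    {
        try
        {
            await ConnectAsync();
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }
}
```

Calling `await AsyncTaskSketch.TryConnectAsync()` completes normally and reports the failure through its return value instead of letting the exception escape.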

I don't think it's caused by async void.

I don't know whether it fixes the issue, but I recommend always returning Task instead of using async void ;)

@carlosrfernandez Yes, we will change that in a separate PR, but it is not causing the particular problems we have seen in this issue.

We have seen the following when a node fails...

System.Net.Sockets.SocketException: Too many open files in system
   at System.Net.Sockets.Socket..ctor(AddressFamily addressFamily, SocketType socketType, ProtocolType protocolType)
   at Neo.Network.TcpRemoteNode..ctor(LocalNode localNode, IPEndPoint remoteEndpoint)
   at Neo.Network.LocalNode.<ConnectToPeerAsync>d__55.MoveNext()

On Linux, every open TCP socket holds a file descriptor. If we are not closing the sockets properly, the process will eventually run out of descriptors and trigger this exception.

Do you know how many clients were connected at that precise moment?

Maybe we should add a Disconnect call on these lines; any exception must disconnect the client.

https://github.com/neo-project/neo/blob/master/neo/Network/TcpRemoteNode.cs#L78
https://github.com/neo-project/neo/blob/master/neo/Network/TcpRemoteNode.cs#L79

Something like this:

https://github.com/neo-project/neo/pull/157/files

@AshRolls Have we tried raising the limits in /etc/security/limits.conf and /etc/sysctl.conf as a workaround?

On my Ubuntu 16.04 nodes, neo-cli runs under supervisord, which defaults to a maximum of 1024 file descriptors. I have added minfds=65356 to the configuration file to see if that helps.

@belane I have also added fs.file-max = 100000 to sysctl.conf
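For reference, the two settings mentioned above would look roughly like this. The values are the ones quoted in this thread. The file locations are assumptions: supervisord's main configuration file is often /etc/supervisor/supervisord.conf on Ubuntu, but that depends on how it was installed. `minfds` belongs in the existing `[supervisord]` section, not in the program section for neo-cli.

```ini
; supervisord main configuration file (path is an assumption)
[supervisord]
minfds=65356
```

```ini
# /etc/sysctl.conf
fs.file-max = 100000
```

supervisord has to be restarted to pick up the new `minfds` value, and the sysctl change takes effect after `sysctl -p` or a reboot. Raising the limits only delays the failure if sockets are still leaking.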

@erikzhang Yep, I have been looking at that code. We should really call socket.Shutdown(SocketShutdown.Both) before we dispose of the socket, but when I add this to the disconnect method it causes exceptions in the Task.WaitAll(...) in ConnectToPeersLoop().
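A minimal sketch of the shutdown-then-dispose order described above. `SocketCleanupSketch` and `CloseQuietly` are hypothetical names, not part of the neo codebase. The helper swallows the exceptions that Shutdown can throw on a socket that is already closed or was never connected, so Dispose always runs and the file descriptor is released. It does not address the Task.WaitAll(...) exceptions mentioned above.

```csharp
using System;
using System.Net.Sockets;

public static class SocketCleanupSketch
{
    // Hypothetical helper: shut the connection down if there is one,
    // then always dispose the socket so its file descriptor is freed.
    public static void CloseQuietly(Socket socket)
    {
        try
        {
            if (socket.Connected)
            {
                socket.Shutdown(SocketShutdown.Both);
            }
        }
        catch (SocketException)
        {
            // The peer may already have dropped the connection.
        }
        catch (ObjectDisposedException)
        {
            // Another code path already disposed the socket.
        }
        finally
        {
            socket.Dispose();
        }
    }
}
```

Calling the helper twice on the same socket is harmless, which matters if several failure paths can all end up disconnecting the same peer.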

This should be fixed in 2.7.3. Closing.