I and @giuliavezzani have been investigating the functionality of the interrupt method called on the client side of a rpc communication.
From the tests we carried out (code available on Gist) it came out that the functionality works fine on Windows, whereas we've been experiencing deterministic deadlock on Linux.
cc @drdanz
After a quick investigation it looks like that the interrupt() call is stuck on a stateMutex.wait(); in PortCore::interrupt(). Removing this wait() fixes the deadlock. Nonetheless it still behaves in the wrong way... after the execution of interrupt() the program is supposed to be exit, but instead I believe it still waits for the server to reply.
We should at the same time address any possible issue related to the dual method resume.
After some investigation, besides the mutex issue, it appears that the interrupt method does not call interrupt for the InputStreams for each PortCoreUnit. As a consequence, the units waiting for input are stuck in a blocking read() call until they get a reply from the sender.
Interrupting all the units works, and the client is killed almost instantaneously, but unfortunately the InputStream used for RPC Connections closes the connection and resume() will not work.
Side note, there is no InputStream::resume() method.
It would be interesting to investigate what makes the rpc port special -- with respect to this bug -- and why the problem happens only on Linux.
Rpc is "special" because the protocol is all request => reply and I think that if the request is successful, there is no timeout in the read call that waits for the reply.
For the other protocols, there is an exchange of messages, but they are all handled by YARP and each message is sent immediately as soon as the previous message is received.
By contrast, for RPC protocol, the reply depends on the server implementation, so if the server does not reply, the client is stuck on a blocking read.
BTW, that's really evident if you change the delay to a very large number in @pattacini's test.
The same bug should happen for blocking read then?
I'm not sure for normal reads, we'd have to test it and to check the implementation, but I think that the issue is at a lower level...
In RPC, a single unit of the message exchange protocol includes the request and the replay; therefore, between the request and the replay, the underlying protocol is in the middle of a message exchange.
On the contrary, for a normal read the message protocol is not started yet.
I think the same issue should happen on the "writer" side, when the "reader" reads setting willReply to true.
Hi guys, I'm resuming this old post.. The problem is still the same, but it affects my win32 system too.
Additionally, in my RFModule::close(), I have a sequence of:
rpcPort.interrupt();
rpcPort.close();
The presence of rpcPort.interrupt() hangs the module forever when the subsequent close() method is called. If I comment rpcPort.interrupt() then rpc.close() is executed successfully. Note that in the meanwhile another module is contiguously communicating with my module through this rpcPort.
No matter what the interrupt() method is supposed to do (I cannot find exhaustive documentation about it), a safe system should handle it properly, eventually ignoring the call if it was unnecessary, so I consider this a severe bug.
This might be considered one of the top-list issues to address in the upcoming YARP releases.
A similar issue is recently been fixed in pr #1838 and some unit test have been added. @Nicogene can you check if that commit fixes also this issue?
I'll look at it, but I'm afraid that it is a different problem
I confirm that the problem is still there :disappointed:
As @drdanz said once interrupted, the RpcClient waits forever a response that will never come.
In PortCore.cpp there is written:
// Since interruptible is set, it is possible that the user
// may be blocked on a read. We send an empty message,
// which is reserved for giving blocking readers a chance to
// update their state.
just before the semaphore on which it gets stuck.
Who's we? :sweat_smile:
Looking deep in PortCore.cpp, that stateSema is mainly used to change these booleans:
bool listening; ///< is the port server listening on the network?
bool running; ///< is the port server thread running?
bool starting; ///< is the port in its startup phase?
bool closing; ///< is the port in its closing phase?
bool finished; ///< is the port server thread finished running?
bool finishing; ///< is the port server thread trying to finish?
bool waitBeforeSend; ///< should we wait for all current writes to complete before writing more?
bool waitAfterSend; ///< should we wait for writes to complete immediately after we start them?
bool controlRegistration; ///< should the port unregister its name when shutting down?
bool interruptible; ///< is the port in an interruptible state?
bool interrupted; ///< is the port interrupted?
Following #1754 , what about cleanup it and replace them with std::atomic_bool?
Are we brave enough @drdanz ? :grimacing:
I do not know the actual logic behind the stateSema, but in general @Nicogene note that a mutex providing mutual exclusion for a set of flags is logically different w.r.t. to a set of mutexes, in which each one is guaranteed mutual exclusion for a single flag.
Uhm you are right :thinking:
BTW I don't think that this would solve this issue, just make the code cleaner.
As @drdanz said the problem is the inputstream that get stuck waiting a response :disappointed:
Perhaps this is could be solved just by setting a timeout in the read...
Perhaps this is could be solved just by setting a timeout in the read...
Related also to this: #1542.
I completely agree :+1:
馃槶
Most helpful comment
A similar issue is recently been fixed in pr #1838 and some unit test have been added. @Nicogene can you check if that commit fixes also this issue?