Mount a global Python plugin and run the unit tests, or run an application using the Python binding.
Expected: it works as before.
Actual: the child process hangs:
markus 9853 0.7 0.1 219624 25592 pts/14 S+ 16:41 0:01 python ./application.py
markus 9854 0.0 0.0 0 0 pts/14 Z+ 16:41 0:00 [python] <defunct>
Relevant backtrace:
#8 elektraDumpGet (returned=0x55c2db1e13c0, parentKey=0x55c2db1e1930) at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/plugins/dump/dump.cpp:225
#9 0x00007f04364dc3e8 in elektraPluginProcessSend (pp=pp@entry=0x55c2db1e0e20, command=command@entry=ELEKTRA_PLUGINPROCESS_OPEN, originalKeySet=originalKeySet@entry=0x0, key=key@entry=0x55c2dacf5750)
at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/libs/pluginprocess/pluginprocess.c:278
#10 0x00007f04364dd2c3 in elektraPluginProcessOpen (pp=pp@entry=0x55c2db1e0e20, errorKey=errorKey@entry=0x55c2dacf5750)
at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/libs/pluginprocess/pluginprocess.c:509
#11 0x00007f04366e320b in libelektra_Python_LTX_elektraPluginOpen (handle=0x55c2db1e0da0, errorKey=0x55c2dacf5750)
at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/plugins/python/python.cpp:225
So it seems to hang in elektraPluginProcessOpen. Maybe there is no child, or the parent never receives what is sent? I think it must be a bug in pluginprocess; elektraPluginProcessOpen should never hang.
Might be related to #2162
Any idea what it could be or do I need to find a reproducible test case?
Hmm. Is it possible that there are "cascading openings", e.g. a plugin using kdbGet? I had similar issues when I tried to load a part of the KDB via kdbGet inside a Haskell plugin.
The issue was this: the child process executing a Haskell plugin, with the Haskell runtime already loaded, called kdbGet (or something like that) while a Haskell plugin was already mounted (globally or locally). That kdbGet caused the Haskell plugin to run again, which forked the process again, and this second child process then failed to open the Haskell runtime because it was already open in the parent. As far as I remember it did not fail with a meaningful error message, it just hung. I resorted to using invoke instead to load what I needed. Could that be the case here? Having a global plugin makes me suspicious for the reason described above. In that case we would need to think about how to treat such cases with pluginprocess in general; I currently have no quick and easy idea.
No, only the binding does a kdbGet(), not the plugin.
Did you check if the Haskell plugin works as a global plugin? The list plugin loads the plugin multiple times before it finally uses it. My suspicion is that pluginprocess cannot deal with these invocations.
So one way to reproduce the problem is:
kdb gmount python script=/path/to/Elektra/src/plugins/python/python/python_filter.py
ctest -V -R test_kdb.py
But many other tests hang, too. I added the filter in eb24b1bb682a566acc01d8cd8fe970d709ffc581.
Will check it out today, but if it's really this "cascading child process" thing, we'd need some way to handle that. I had the idea of having some kind of "daemon" running that spawns the child processes, so they always get forked from a "clean" parent, but that would be quite a lot of work to implement, and having a daemon running is not ideal either. The other thing I had in mind is some kind of global flag that subprocesses can check so they don't attempt to initialize runtimes again.
Thank you for looking into it!
I think there is some communication problem and the child dies while the parent tries to receive something.
Maybe it is caused by API misuse in the Python plugin. Maybe you can find a way to improve the API a bit further. Ideally, plugins would only call a single method and all the parent/child situations would be handled within the API.
And looking at your Haskell plugin, it seems like basic functionality is broken: in line 80 you return with SUCCESS, but the Haskell plugin had no chance to override the contract with its own data. So every Haskell plugin is bound to the contract of src/plugins/haskell/README.md, even though it is unlikely that this is correct.
In the Python plugin I fixed this quite recently in 143d26d2764502af15da9301f4419121a4329933, and it is possible that this triggers the defunct-process problem. But the fix is necessary, otherwise the plugins cannot do much.
If the problem is what I think it is, a possible fix might be to call waitpid() with WNOHANG before you try to send or fetch data from the child. Then you would find out if it died during the previous operation. There might be better ways; in any case, the protocol should be safe even if the child dies in the middle of the communication.
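A minimal sketch of that check, assuming the parent keeps the child's pid around (childIsAlive is just an illustrative helper, not part of the pluginprocess API):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>

/* Non-blocking liveness check before attempting to read from / write to the child. */
static int childIsAlive (pid_t childPid)
{
	int status;
	pid_t ret = waitpid (childPid, &status, WNOHANG);
	if (ret == 0) return 1;          /* child still running */
	if (ret == childPid) return 0;   /* child exited; status holds the reason */
	return errno == ECHILD ? 0 : -1; /* already reaped, or a real error */
}
```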
With your above command, macOS reports that Python quit unexpectedly, thus the parent process never stops reading from the pipe as the child process never sends anything. Trying to work around that issue.
Thank you for looking into it! Yes, in general the parent process should detect that the child exited and report an error.
I think the cleanest way to handle this is to use pipes instead of named pipes. If a child process dies while pipes are open, the OS delivers EOF over the pipe, so pluginprocess won't get stuck on read. However, I don't think I will finish this today, as I need a way to teach the dump plugin to output its data into a pipe that has no file name. In particular, I don't see a way to open a C++ stream on a pipe file descriptor; there seem to be only non-portable solutions for that, which makes this quite hard. Boost would provide something like that, but having Boost as a dependency for dump seems overkill. libstdc++ also seems to offer something like that. I think that was the initial reason to use named pipes instead, but it has obviously led to the deadlock issue.
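A standalone sketch of why anonymous pipes help (not taken from pluginprocess itself): the only writer dies before sending anything, and the parent's read() returns 0 (EOF) instead of blocking forever.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main (void)
{
	int fd[2];
	if (pipe (fd) != 0) return 1;

	pid_t pid = fork ();
	if (pid == 0)
	{
		/* child: pretend to crash before sending anything */
		close (fd[0]);
		_exit (1); /* its write end is closed implicitly on exit */
	}

	close (fd[1]); /* parent must close its copy of the write end, otherwise no EOF */
	char buf[64];
	ssize_t n = read (fd[0], buf, sizeof (buf));
	printf ("read returned %zd (%s)\n", n, n == 0 ? "EOF, child is gone" : "data");
	waitpid (pid, NULL, 0);
	return 0;
}
```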
The alternative, where we stick to named pipes, is to use select (http://man7.org/linux/man-pages/man2/select.2.html): in theory we could set some timeout (e.g. 1 or 2 s), and if no data has been received in that timespan we check whether the child process has died using waitpid. If it hasn't died, we try to read again until we receive some data or the child is dead. This has the advantage that we don't have to modify the dump plugin, though it seems slightly more "hacky".
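Roughly what that could look like, as an illustrative standalone helper (waitForDataOrDeath is hypothetical, not pluginprocess code):

```c
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Returns 1 if data is ready on readFd, 0 if the child died without sending
 * anything, -1 if select() itself failed. */
static int waitForDataOrDeath (int readFd, pid_t childPid)
{
	for (;;)
	{
		fd_set readSet;
		FD_ZERO (&readSet);
		FD_SET (readFd, &readSet);
		struct timeval timeout = { .tv_sec = 2, .tv_usec = 0 };

		int ready = select (readFd + 1, &readSet, NULL, NULL, &timeout);
		if (ready > 0) return 1;  /* data available, safe to read */
		if (ready < 0) return -1; /* select failed */

		/* timeout: check whether the child died (or can no longer be waited for) */
		if (waitpid (childPid, NULL, WNOHANG) != 0) return 0;
	}
}
```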
What do you think is better?
It is better to use pipes; then the OS handles all the corner cases for us. And this should also fix the problem we have in the homepage build (which fails at the named-pipe creation).
It should be easy to teach the dump plugin: simply pass it the file name /proc/self/fd/<file descriptor returned by pipe>.
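A small standalone sketch of that trick (not the actual dump/pluginprocess code): create an anonymous pipe and build a pathname that refers to its write end, so stream-based code that expects a file name keeps working unchanged.

```c
#include <stdio.h>
#include <unistd.h>

int main (void)
{
	int fd[2];
	if (pipe (fd) != 0) return 1;

	char writePath[32];
	snprintf (writePath, sizeof (writePath), "/proc/self/fd/%d", fd[1]);
	/* writePath can now be passed wherever a file name is expected,
	 * e.g. to a plugin that opens its output as a regular file. */
	printf ("%s\n", writePath);
	return 0;
}
```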
Btw., src/plugins/crypto/gpg.c done by @petermax2 already uses pipe().
Is it portable to do it that way? But it's fine for me; I agree that it is much cleaner than periodically checking whether the child is dead. I will try to get this improvement done today, though I can't promise it fixes the Python plugin issue, as Python seems to segfault. It's still an important improvement for pluginprocess. Thanks a lot for the hint about using the proc filesystem, I didn't think about that!
Unfortunately, /proc/self/fd does not exist on macOS. I've used /dev/fd instead of /proc/self/fd, which seems a bit more portable and less Linux-specific.
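For illustration, only the path prefix in the sketch above changes (fdToPath is a hypothetical helper, not part of pluginprocess):

```c
#include <stdio.h>

/* Build a path that names an already-open file descriptor.
 * /dev/fd exists on both Linux and macOS, unlike /proc/self/fd. */
static void fdToPath (int fd, char * buffer, size_t bufferSize)
{
	snprintf (buffer, bufferSize, "/dev/fd/%d", fd);
}
```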
The gpg.c file doesn't help in my situation, because it does not need a file name for its pipe the way the dump plugin does.