Mount a global Python plugin and run the unit tests, or run an application using the Python binding.
Expected: it works as before.
Actual: the child process hangs:
markus 9853 0.7 0.1 219624 25592 pts/14 S+ 16:41 0:01 python ./application.py
markus 9854 0.0 0.0 0 0 pts/14 Z+ 16:41 0:00 [python] <defunct>
Relevant backtrace:
#8 elektraDumpGet (returned=0x55c2db1e13c0, parentKey=0x55c2db1e1930) at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/plugins/dump/dump.cpp:225
#9 0x00007f04364dc3e8 in elektraPluginProcessSend (pp=pp@entry=0x55c2db1e0e20, command=command@entry=ELEKTRA_PLUGINPROCESS_OPEN, originalKeySet=originalKeySet@entry=0x0, key=key@entry=0x55c2dacf5750)
at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/libs/pluginprocess/pluginprocess.c:278
#10 0x00007f04364dd2c3 in elektraPluginProcessOpen (pp=pp@entry=0x55c2db1e0e20, errorKey=errorKey@entry=0x55c2dacf5750)
at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/libs/pluginprocess/pluginprocess.c:509
#11 0x00007f04366e320b in libelektra_Python_LTX_elektraPluginOpen (handle=0x55c2db1e0da0, errorKey=0x55c2dacf5750)
at /home/jenkins/workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/libelektra/src/plugins/python/python.cpp:225
So it seems to hang in elektraPluginProcessOpen. Maybe there is no child, or the parent never receives what is sent? I think it must be a bug in pluginprocess; elektraPluginProcessOpen should never hang.
Might be related to #2162
Any idea what it could be or do I need to find a reproducible test case?
Hmm. Is it possible that there are "cascading openings", e.g. a plugin using kdbGet? I had similar issues when I tried to load a part of the KDB via kdbGet inside a Haskell plugin.
The issue was this: the child process executing a Haskell plugin, with the Haskell runtime already loaded, called kdbGet (or something like that) while a Haskell plugin was already mounted (globally or locally). That kdbGet caused the Haskell plugin to run again, which forked the process again, and this second child process then failed to open the Haskell runtime because it was already open in the parent. As far as I remember it did not fail with a meaningful error message, it just hung. I resorted to using invoke instead to load what I needed. Could that be the case here? Having a global plugin makes me suspicious for the reason described above. In that case we would need to think about how to treat such cases with pluginprocess in general; I currently have no quick and easy idea.
No, only the binding does a kdbGet(), not the plugin.
Did you check if the Haskell plugin works as a global plugin? The list plugin loads the plugin multiple times before it finally uses it. My suspicion is that pluginprocess cannot deal with these invocations.
So one way to reproduce the problem is:
kdb gmount python script=/path/to/Elektra/src/plugins/python/python/python_filter.py
ctest -V -R test_kdb.py
But many other tests hang, too. I added the filter in eb24b1bb682a566acc01d8cd8fe970d709ffc581.
Will check it out today, but if it's really this "cascading child process" thing, we'd need some way to handle that. I had the idea of having some kind of "daemon" running that spawns the child processes, so they always get forked from a "clean" parent, but that would be quite a lot of work to implement, and having a daemon running is not ideal either. The other thing I had in mind is some kind of global flag that subprocesses can check so they don't attempt to initialize runtimes again.
Thank you for looking into it!
I think there is some communication problem and the child dies while the parent tries to receive something.
Maybe it is caused by API misuse in the Python plugin. Maybe you can find a way to improve the API a bit further. Ideally, plugins would only call a single method and all the parent/child situations would be handled within the API.
And looking at your Haskell plugin, it seems like basic functionality is broken: in line 80 you return with SUCCESS, but the Haskell plugin had no chance to override the contract with its own data. So every Haskell plugin is bound to the contract of src/plugins/haskell/README.md, even though it is unlikely that this is correct.
In the Python plugin I fixed this quite recently in 143d26d2764502af15da9301f4419121a4329933, and it is possible that this triggers the defunct-process problem. But the fix is necessary, otherwise the plugins cannot do much.
If the problem is what I think it is, a possible fix might be to call waitpid() with WNOHANG before you try to send or fetch data from the child. Then you would find out if it died during the previous operation. There might be better ways; in any case, the protocol should be safe even if the child dies in the middle of the communication.
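A minimal sketch of that check, assuming the parent keeps the child's pid around (childIsAlive is just an illustrative helper, not part of the pluginprocess API):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>

/* Non-blocking liveness check before attempting to read from / write to the child. */
static int childIsAlive (pid_t childPid)
{
	int status;
	pid_t ret = waitpid (childPid, &status, WNOHANG);
	if (ret == 0) return 1;          /* child still running */
	if (ret == childPid) return 0;   /* child exited; status holds the reason */
	return errno == ECHILD ? 0 : -1; /* already reaped, or a real error */
}
```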
With your above command, macOS reports that Python quit unexpectedly, thus the parent process never stops reading from the pipe as the child process never sends anything. Trying to work around that issue.
Thank you for looking into it! Yes, in general the parent process should detect that the child exited and report an error.
I think the cleanest way to handle this is to use pipes instead of named pipes. If a child process dies while pipes are open, the OS delivers EOF over the pipe, so pluginprocess won't get stuck on read. However, I don't think I will finish this today, as I need a way to teach the dump plugin to output its data into a pipe that has no file name. In particular, I don't see a way to open a C++ stream on a pipe file descriptor; there seem to be only non-portable solutions for that, which makes this quite hard. Boost would provide something like that, but having Boost as a dependency for dump seems overkill. libstdc++ also seems to offer something like that. I think that was the initial reason to use named pipes instead, but it has obviously led to the deadlock issue.
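A standalone sketch of why anonymous pipes help (not taken from pluginprocess itself): the only writer dies before sending anything, and the parent's read() returns 0 (EOF) instead of blocking forever.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main (void)
{
	int fd[2];
	if (pipe (fd) != 0) return 1;

	pid_t pid = fork ();
	if (pid == 0)
	{
		/* child: pretend to crash before sending anything */
		close (fd[0]);
		_exit (1); /* its write end is closed implicitly on exit */
	}

	close (fd[1]); /* parent must close its copy of the write end, otherwise no EOF */
	char buf[64];
	ssize_t n = read (fd[0], buf, sizeof (buf));
	printf ("read returned %zd (%s)\n", n, n == 0 ? "EOF, child is gone" : "data");
	waitpid (pid, NULL, 0);
	return 0;
}
```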
The alternative, where we stick to named pipes, is to use select (http://man7.org/linux/man-pages/man2/select.2.html): in theory we could set some timeout (e.g. 1 or 2 s), and if no data has been received in that timespan we check whether the child process has died using waitpid. If it hasn't died, we try to read again until we receive some data or the child is dead. This has the advantage that we don't have to modify the dump plugin, though it seems slightly more "hacky".
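Roughly what that could look like, as an illustrative standalone helper (waitForDataOrDeath is hypothetical, not pluginprocess code):

```c
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Returns 1 if data is ready on readFd, 0 if the child died without sending
 * anything, -1 if select() itself failed. */
static int waitForDataOrDeath (int readFd, pid_t childPid)
{
	for (;;)
	{
		fd_set readSet;
		FD_ZERO (&readSet);
		FD_SET (readFd, &readSet);
		struct timeval timeout = { .tv_sec = 2, .tv_usec = 0 };

		int ready = select (readFd + 1, &readSet, NULL, NULL, &timeout);
		if (ready > 0) return 1;  /* data available, safe to read */
		if (ready < 0) return -1; /* select failed */

		/* timeout: check whether the child died (or can no longer be waited for) */
		if (waitpid (childPid, NULL, WNOHANG) != 0) return 0;
	}
}
```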
What do you think is better?
It is better to use pipes; then the OS handles all the corner cases for us. And this should also fix the problem we have in the homepage build (which fails at the named-pipe creation).
It should be easy to teach the dump plugin: simply pass it the file name /proc/self/fd/<file descriptor returned by pipe>.
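A small standalone sketch of that trick (not the actual dump/pluginprocess code): create an anonymous pipe and build a pathname that refers to its write end, so stream-based code that expects a file name keeps working unchanged.

```c
#include <stdio.h>
#include <unistd.h>

int main (void)
{
	int fd[2];
	if (pipe (fd) != 0) return 1;

	char writePath[32];
	snprintf (writePath, sizeof (writePath), "/proc/self/fd/%d", fd[1]);
	/* writePath can now be passed wherever a file name is expected,
	 * e.g. to a plugin that opens its output as a regular file. */
	printf ("%s\n", writePath);
	return 0;
}
```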
Btw., src/plugins/crypto/gpg.c done by @petermax2 already uses pipe().
Is it portable to do it that way? But it's fine for me; I agree that it is much cleaner than periodically checking whether the child is dead. I will try to get this improvement done today, though I can't promise it fixes the Python plugin issue, as Python seems to segfault. It's still an important improvement for pluginprocess. Thanks a lot for the hint about using the proc filesystem, I didn't think about that!
Unfortunately, /proc/self/fd does not exist on macOS. I've used /dev/fd instead of /proc/self/fd, which seems a bit more portable and less Linux-specific.
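For illustration, only the path prefix in the sketch above changes (fdToPath is a hypothetical helper, not part of pluginprocess):

```c
#include <stdio.h>

/* Build a path that names an already-open file descriptor.
 * /dev/fd exists on both Linux and macOS, unlike /proc/self/fd. */
static void fdToPath (int fd, char * buffer, size_t bufferSize)
{
	snprintf (buffer, bufferSize, "/dev/fd/%d", fd);
}
```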
The gpg.c file doesn't help in my situation, because it does not need a file name for its pipe the way the dump plugin does.