Godot: Segfault or hang on closing in OS::print_error when resources are leaked (tries to access logger singleton after freeing it)

Created on 25 Apr 2018  路  33Comments  路  Source: godotengine/godot

Godot version: 3.0.2 Steam

OS/device including version: Solus Linux 3.999

Issue description: Every time I close Godot I cannot relaunch it through Steam without first opening Gnome System Monitor and killing the process "godot.x11.opt.tools.64". The process uses 25% CPU (probably max usage of a single thread) and 1GB+ of memory.

Steps to reproduce: Launch the Steam version of Godot. Close it. Process remains in Gnome System Monitor and prevents launching the Steam version until manually killed.

Minimal reproduction project: No project required. Happens with a fresh, new project.

bug confirmed hero wanted! core

Most helpful comment

This is a fun one.

Reduz was right, the initial error is caused by a static variable destructor (in this case a non-empty SelfList::List). These get called after main, so the OS instance has already been destroyed and it's singleton pointer set to null.

In the case of SelfList::List the destructor looks like this: ~List() { ERR_FAIL_COND(_first != NULL); } which results in a OS::get_singleton()->print_error call, dispatching on null, giving us a regular old SIGSEGV segfault.

Next up, we arrive in a crash handler (which we should have really unset by this point) which has set this little trap for us:

static void handle_crash(int sig) {
    if (OS::get_singleton() == NULL)
        return;
...

giving us the infinite loop.


Working back up the chain, things to do:

  • Crash handlers probably need to do something simple and fatal if the OS singleton is already gone.
  • Crash handlers should probably disable themselves in their destructor.
  • Decide what we want to do regarding error output in destructors and it's interaction with static lifetime objects.

I can PR the first two this evening if I get time, which would turn this issue into a "normal" segfault.
Part three requires some more thought/discussion.

All 33 comments

I ran Godot from the console and on close saw these lines:

WARNING: cleanup: ObjectDB Instances still exist!
At: core/object.cpp:1989.
ERROR: clear: Resources Still in use at Exit!
At: core/resource.cpp:418.
ERROR: cleanup: There are still MemoryPool allocs in use at exit!
At: core/dvector.cpp:70.

Running it directly, not from steam, does the same? and closing the Project manager only (without opening a project)?

Closing the project manager without opening a project does not cause this error. The non-Steam version has the same issue as the Steam version.

Can't reproduce it on Mageia 6 x86_64 with the current master branch.

Do you have Godot configured to run adb and connect to your phone for one-click deploy? Maybe it doesn't manage to kill the adb child process.

No, I haven't done anything with phones in Godot yet.

I can confirm this. Running latest master compile under Linux, if I quit the editor the CPU usage goes to 100% for that CPU core.

A strace of the process shows the following:

rt_sigreturn({mask=[]}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x58} ---
rt_sigreturn({mask=[]}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x58} ---
rt_sigreturn({mask=[]}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x58} ---
rt_sigreturn({mask=[]}) = 0
(Repeated forever)

This is the symptom of an infinite loop.

A gdb of the process provides the following information when I quit:

> WARNING: cleanup: ObjectDB Instances still exist!
>    At: core/object.cpp:1986.
> [Thread 0x7fa6cdd72700 (LWP 13086) exited]
> ERROR: clear: Resources Still in use at Exit!
>    At: core/resource.cpp:417.
> ERROR: cleanup: There are still MemoryPool allocs in use at exit!
>    At: core/dvector.cpp:70.
> 
> Thread 1 "godot.x11.tools" received signal SIGSEGV, Segmentation fault.
> 0x0000000002e519ed in OS::print_error (this=0x0, p_function=0x3b2d84b <SelfList<DynamicFont>::List::~List()::__FUNCTION__> "~List", p_file=0x3b2d1bd "core/self_list.h", 
>     p_line=111, p_code=0x3b2d458 "Condition ' _first != __null ' is true.", p_rationale=0x3bdd82d "", p_type=Logger::ERR_ERROR) at core/os/os.cpp:87
> 87              _logger->log_error(p_function, p_file, p_line, p_code, p_rationale, p_type);
> 

And keeps on segfaulting there over and over.

Would you be able to get a full backtrace? (Can't reproduce it myself on Linux :( ).

Its interesting. I've opened a few demo projects and they are ok. Its just this one project that its doing it on so it must be something unique to how I've configured it. Its interesting that its having an issue in the DynamicFonts list.

I'll try a full backtrack now.

Opened the editor. Chose my project from the list, which creates a new thread. I then attach to it.

$ sudo gdb -p 17967
[sudo] password for bundabrg: 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 17967
[New LWP 17968]
[New LWP 17969]
[New LWP 17970]
[New LWP 17971]
[New LWP 17972]
[New LWP 17976]
[New LWP 17979]
[New LWP 17982]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f6178f96c1d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.

For some reason it hits a breakpoint here. I assume because of the debug build. I hit continue here.

(gdb) c
Continuing.
[New Thread 0x7f6165efa700 (LWP 18055)]
[Thread 0x7f6165efa700 (LWP 18055) exited]
[Thread 0x7f616f45d700 (LWP 17976) exited]
[Thread 0x7f6165f7c700 (LWP 17979) exited]
[Thread 0x7f6165f3b700 (LWP 17982) exited]

I then clicked close on the editor

ERROR: ~List: Condition ' _first != __null ' is true.
   At: core/self_list.h:111.
ERROR: ~List: Condition ' _first != __null ' is true.
   At: core/self_list.h:111.
[Thread 0x7f6170ab4700 (LWP 17971) exited]
[Thread 0x7f6170a73700 (LWP 17972) exited]
[Thread 0x7f6170af5700 (LWP 17970) exited]
[Thread 0x7f616fdff700 (LWP 17969) exited]
WARNING: cleanup: ObjectDB Instances still exist!
   At: core/object.cpp:1986.
[Thread 0x7f617ada4700 (LWP 17968) exited]
ERROR: clear: Resources Still in use at Exit!
   At: core/resource.cpp:417.
ERROR: cleanup: There are still MemoryPool allocs in use at exit!
   At: core/dvector.cpp:70.

Thread 1 "godot.x11.tools" received signal SIGSEGV, Segmentation fault.
0x0000000002e519ed in OS::print_error (this=0x0, p_function=0x3b2d84b <SelfList<DynamicFont>::List::~List()::__FUNCTION__> "~List", p_file=0x3b2d1bd "core/self_list.h", 
    p_line=111, p_code=0x3b2d458 "Condition ' _first != __null ' is true.", p_rationale=0x3bdd82d "", p_type=Logger::ERR_ERROR) at core/os/os.cpp:87
87              _logger->log_error(p_function, p_file, p_line, p_code, p_rationale, p_type);

Then issued a backtrack

(gdb) bt
#0  0x0000000002e519ed in OS::print_error (this=0x0, p_function=0x3b2d84b <SelfList<DynamicFont>::List::~List()::__FUNCTION__> "~List", 
    p_file=0x3b2d1bd "core/self_list.h", p_line=111, p_code=0x3b2d458 "Condition ' _first != __null ' is true.", p_rationale=0x3bdd82d "", p_type=Logger::ERR_ERROR)
    at core/os/os.cpp:87
#1  0x0000000002e33b74 in _err_print_error (p_function=0x3b2d84b <SelfList<DynamicFont>::List::~List()::__FUNCTION__> "~List", p_file=0x3b2d1bd "core/self_list.h", 
    p_line=111, p_error=0x3b2d458 "Condition ' _first != __null ' is true.", p_type=ERR_HANDLER_ERROR) at core/error_macros.cpp:84
#2  0x000000000281c716 in SelfList<DynamicFont>::List::~List (this=0x458ff30 <DynamicFont::dynamic_fonts>, __in_chrg=<optimised out>) at core/self_list.h:111
#3  0x00007f6178150ff8 in __run_exit_handlers (status=0, listp=0x7f61784db5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#4  0x00007f6178151045 in __GI_exit (status=<optimised out>) at exit.c:104
#5  0x00007f6178137837 in __libc_start_main (main=0xfab886 <main(int, char**)>, argc=4, argv=0x7ffc92c3b608, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=0x7ffc92c3b5f8) at ../csu/libc-start.c:325
#6  0x0000000000fab7b9 in _start ()
(gdb)

Same in #17024

I've removed all Custom Dynamic Fonts from scenes(one by one) and project works normaly. Only one scenes with custom font in label or richtext has caused the error.

I am almost certain that this bug was fixed, @endragor or someone else I feel did?

Still happening to me with v3.0.6 from Steam.

@UltimaN3rd can you try with the alpha or a daily (or master)?
if is fixed on dev branch it may be possible to cherry pick the fix for stable

@eon-s Still happening (master)

If remove all Custom Dynamic Fonts, not happen.

screenshot from 2018-09-20 21-41-30

Minimal project

var c = preload("res://Control.tscn")

Aparentely this line in .gd file causes bug if have Dynamic Fonts

New Game Project.zip

Confirmed and it also happens on macOS and Windows.

I'm pretty sure that #21060 is the same issue, though I don't know yet if only DynamicFont triggers it or if there's a deeper issue with destructors running after the OS destructor.

IRC log of a discussion with @reduz:

15:28 <Akien> reduz: I'm trying to debug #18429, but I don't understand when ~OS() is being triggered
15:28 <IssueBot> #18429: Segfault or hang on closing in OS::print_error when resources are leaked (tries to access logger singleton after freeing it) | https://git.io/fAFqw
<Akien> (the problem is that `unregister_core_types()` in Main::cleanup calles `StringName::cleanup()` which uses `print_line`, thus OS's logger, apparently after free
<Akien> I guess the destructor is called when `singleton = NULL;` is called?
15:32 <Akien> Well no, it seems to be the other way around
15:38 <Akien> I guess I can ask gdb :D
15:41 <Akien> That's weird, the destructor is called when `OS_X11 os;` goes out of scope in `godot_x11.cpp:main`, but that should be after `Main::cleanup()` has happened.
15:42 <reduz> yes, os destrucor is the last piece of engine code run
15:50 <Akien> Then I don't get how `OS::print_error` can crash with `_logger` being null, since it's freed in ~OS.
<Akien> reduz: Hm yeah ~OS seems to happen before some prints :/ https://paste.fedoraproject.org/paste/-r7siCdaB7PVKgdVon~82Q
15:55 <reduz> i can imagine this may be due to global objects not being destructed
15:55 <reduz> i would check backtrace for this
15:56 <reduz> may show where those objects that are not being uninitialized in globals happen
15:58 <Akien> Well currently I'm checking that with a closed source project (https://koder.itch.io/dv-rings-of-saturn-demo ) where it's obviously a user error where resources are preloaded and not properly freed
15:58 <Akien> But it should not crash anyway
15:59 <reduz> no that should not be a problem
16:00 <reduz> I mean, its bad, the debugger will tell you about it, but its not fatal because those are heap allocations and are not going to be destructed when the engine exists
16:00 <Akien> Backtraces https://paste.fedoraproject.org/paste/AjzXV3aq7xb7EJRfo-ji-A
16:00 <reduz> the only chance for code to run after OS destructor is a global variable not uninitialized
16:00 <reduz> not heap
16:00 <reduz> i mean a C++ global
16:00 <Akien> Ah ok
16:01 <reduz> heap is just discarded, no destructor is called
16:01 <Akien> I'll see if I can reproduce with a simpler project
16:01 <Akien> Might still be this DynamicFont bug
16:02 <reduz> https://www.irccloud.com/pastebin/1cnv75PU/
16:02 <reduz> there clearly are missing stack frames in there
16:02 <reduz> but i am not experienced enough to tell you how to find them
16:02 <reduz> i am afraid you will have to ask tmm[m]
16:03 <reduz> i mean, something must be calling _err_print_error
16:03 <reduz> and its not a lib stdc function
16:04 <reduz> so it sounds like an inlined destructor
16:05 <reduz> oh it is the poolvector destructor
16:05 <reduz> so yeah, there is a poolvector created as a global and not freed beforehand
16:05 <reduz> this seems like engine error to me
16:06 <reduz> as a global or static variable, either of them
16:08 <reduz> in fact, i think poolvector just cant be declared as static variable, even if you clear it, this is probably an overseeing from me
16:09 <Akien> I managed to make a minimal reproducer that triggers it
16:12 <Akien> reduz: https://framadrop.org/r/pXzGZ6pruN#SiPLOA0LKc7ZaaQt3FawOseTYu74vFF7Ti9TkANCbSk=
16:13 <Akien> What causes the segfault in this project is to preload (and leak) a scene using a RTL with a dynamic font
16:14 <Akien> It crashes in release mode. In debug mode, it seems to stall endlessly
16:14 <Akien> (but crashes in gdb)
16:14 <Akien> Thread 1 "godot-git" received signal SIGSEGV, Segmentation fault.
16:14 <Akien> 0x000000000421cf11 in OS::print_error (this=0x0, p_function=0x509ecab <SelfList<DynamicFont>::List::~List()::__FUNCTION__> "~List", p_file=0x509e48d "./core/self_list.h", p_line=111, 
16:14 <Akien>     p_code=0x509e8d0 "Condition ' _first != __null ' is true.", p_rationale=0x5151110 "", p_type=Logger::ERR_ERROR) at core/os/os.cpp:88
16:14 <Akien> 88              _logger->log_error(p_function, p_file, p_line, p_code, p_rationale, p_type);

A minimal sample project to reproduce the bug, or at least one part of it (related to DynamicFont): CleanupSegfault-DynamicFont.zip

See also #21060 for more details (it was closed as duplicate of this issue). As mentioned there it also affects Godot 3.0.6.

Some more input from IRC suggesting to debug this with valgrind:

[16:13:24] <Akien> What causes the segfault in this project is to preload (and leak) a scene using a RTL with a dynamic font
[16:14:23] <Akien> It crashes in release mode. In debug mode, it seems to stall endlessly
[16:14:57] <Akien> (but crashes in gdb)
[16:42:58] <spongie> Akien: that is because gdb initializes everything to zero
[16:43:32] <spongie> this = 0x0 which is a certain segfault, but when running outside of gdb this would lead to undefined behavior
[16:43:46] <spongie> (if it turns out that this is indeed uninitialized)
[16:44:14] <spongie> or, what i mean to say is, if you have different behavior in gdb thats usually one cause
[16:44:38] <spongie> Akien: valgrind is great for spotting memory errors like that
[16:45:01] <spongie> it will tell you if uninitialized memory was accessed with --leak-check=full
[16:49:38] <spongie> valgrind is such an excellent tool

This is a fun one.

Reduz was right, the initial error is caused by a static variable destructor (in this case a non-empty SelfList::List). These get called after main, so the OS instance has already been destroyed and it's singleton pointer set to null.

In the case of SelfList::List the destructor looks like this: ~List() { ERR_FAIL_COND(_first != NULL); } which results in a OS::get_singleton()->print_error call, dispatching on null, giving us a regular old SIGSEGV segfault.

Next up, we arrive in a crash handler (which we should have really unset by this point) which has set this little trap for us:

static void handle_crash(int sig) {
    if (OS::get_singleton() == NULL)
        return;
...

giving us the infinite loop.


Working back up the chain, things to do:

  • Crash handlers probably need to do something simple and fatal if the OS singleton is already gone.
  • Crash handlers should probably disable themselves in their destructor.
  • Decide what we want to do regarding error output in destructors and it's interaction with static lifetime objects.

I can PR the first two this evening if I get time, which would turn this issue into a "normal" segfault.
Part three requires some more thought/discussion.

22415 deals with the crash handlers disabling themselves on destruction.

As far as I can tell only the X11 and OSX crash handlers needed changing. The Windows crash handler seems to already deal well with a missing OS object and the other platforms don't seem to have anything similar going on. (However I could easily be wrong and I don't have access to test those.)

@ibrahn I don't understand too much about the problem, but gave it some tests on Windows using the sample project above.

I couldn't really discern anything aside from a few randomish error messages in the master I had built from this morning. I cherry-picked your PR and rebuilt the editor, and even less in the way of errors. Running some template builds for it, so tomorrow I can take a look at an export and see if anything happens there.

In c37442ef0037ccbb63d3cce4caa918512193a06c had this in the editor.

cmd_2018-09-25_02-17-17

Made a debug export of the project and had this error when closing the window.

cmd_2018-09-25_02-19-37

Follow-up

The debug export of the PR build, has the same error on exit in the sample project.

cmd_2018-09-25_09-46-00

@avencherus thanks for testing this :)

22415 should not have changed any of the behaviour on windows.

As long as it's outputting an error and possibly crashing on close, rather than hanging forever we're in the same place.

Options/Ideas for Static Object Destructor Output Problem

  • Ban static objects with non-trivial destructors (specifically any that ultimately call into globals like OS) and either change their destructors or heap allocate them and manage the lifetime.

    • Simple rule. Feels like the most "correct" solution. Possibly a lot of work to replace existing uses.

  • Add a hacky if (this == nullptr) { /* do something sensible and return */ } to OS methods that become a problem.

    • Not at all pretty and might be undefined behavior.

  • Make the OS singleton an "immortal" object, never running is destructor.

    • Super lazy and pretty much guaranteed to somehow break something somewhere sometime.

  • Just make sure all the static SelfList::Lists are empty before the end of main and worry about this stuff another time.

Looking for opinions on these and any other ideas.

23169 turns the segfault into an error message.

Still need to track down what's actually 'leaking' a dynamic font object.

I can confirm that this tree https://github.com/mjtorn/godot/commits/branch.3.0.6-stable.fontshadow.videofix.spine.crashfix2 looks like it works. Basically I just cherry-picked the two commits from @ibrahn and it doesn't appear to crash anymore.

Further testing is warranted when I find the time, but this feels better than commenting out the error check in the destructor ;)

I am convinced this is fixed now. Can anyone verify? It was likely the DynamicFonts not being properly freed problem that was fixed recently.

@reduz the fonts still aren't freed properly, but this shouldn't segfault anymore at the SelfList destruction step which prints an error for them.

We still need to get to the bottom of why they're hanging around, #23661 might be related.

Suppose we might want to rename the issue or start another though.

Suppose we might want to rename the issue or start another though.

Might be worth opening a new issue describing the current situation for clarity, could you do that? And referencing this one too for the sake of completeness, as there's been a lot of debugging done here and in linked duplicates.

If this fix was applied before the release of 3.1-alpha2 then I can confirm that my testing still shows this bug is still present. It appears to be an issue with preload(). Changing preload() to load fixed the issue as a workaround.

Closing as superseded by #23827 which summarizes what is still wrong in the current master branch.

@FeralBytes The fix was applied a few hours after alpha2's commit, so that's normal.

Was this page helpful?
0 / 5 - 0 ratings