Godot version:
197b65f32ac811f79bc5599fbfe8cf83914b6873
(latest master at the moment of writing this issue)
OS/device including version:
Arch Linux
Linux kernel:
5.1.5-arch1-2-ARCH #1 SMP PREEMPT Mon May 27 03:37:39 UTC 2019
(not graphics related)
Issue description:
GDScriptFunctionState can be (mysteriously) invalidated without being resumed even once.
This contradicts documentation (and breaks my game)
It should be invalidated only on resume.
Maybe it would be understandable to break over long time but this is not the case either.
Steps to reproduce:
(it is packed into minimal reproduction project for convenience)
extends Node2D
var scene_state=null
func _ready():
alt_start()
advance()
func alt_start():
scene_state=init_dialog() # if you move this to ready instead bug disappears
print ("start wrote ", scene_state, " valid? ", scene_state.is_valid())
func start2():
scene_state=null # if you don't nullify bug disappears
scene_state=antlers()
print ("start2 wrote ", scene_state, " valid? ", scene_state.is_valid())
func advance():
print("advancing from ", scene_state)
var ss = scene_state.resume()
# scene_state might have been overwritten
# directly by the function that we called
# so we check first
if ss:
print ("ss ", ss, " valid? ", ss.is_valid())
else:
print("ss null")
if scene_state:
print ("scenest ", scene_state, " valid? ", scene_state.is_valid())
else:
print("scenest null")
print ("advance ended")
func init_dialog():
print("starting")
yield()
start2()
func antlers():
print("some important print")
yield()
print("unimportant print")
starting
start wrote [GDScriptFunctionState:1177] valid? True
advancing from [GDScriptFunctionState:1177]
some important print
start2 wrote [GDScriptFunctionState:1178] valid? True
ss null
scenest [GDScriptFunctionState:1178] valid? False
advance ended
resume() exists only in the start of advance() and advance wasn't called between [GDScriptFunctionState:1178] being valid and invalid.
This can only mean that compliance with specification was broken at some point.
And that it also happened very quickly.
The dependency on the length of the call stack in internals of alt_start looks very strange.
Minimal reproduction project:
I'm a new Godot user that's been looking at the engine code for the GDScript module to understand it in order to implement a feature of yield. This looks like an interesting test case.
Even if I change start2() to:
func antlers():
antlers()
print ("start2 wrote ", scene_state, " valid? ", scene_state.is_valid())
'ss' comes out null. Even though antlers() never finished ("unimportant print" was never printed), we cannot resume it because antlers() yielded automatically to start2() which completed causing scene_state.resume() to return null. Godot's yield() only goes up one call level, like a "return". What you'd really want is for antlers() to yield back to advance() so that calling ss.resume() causes "start2 wrote" to be printed.
@busy8 in this test case i sacrificed in-game logic to drive this example to a minimum so it isn't actually what i was doing, but rather a boiled down seed that exhibits a bug.
It isn't really a problem that ss comes out null for this particular advance, because there were originally checks to not let a null ss overwrite a valid global scene_state which was written by start function (which originally was a universal start-any-scenario type of function).
The problem for in-game logic was the invalidation of scene_state that was written by a start function without ever resuming it.
Which is exactly what this test case demonstrates.
Additional note:
One might suppose that the problem is that when coming out of resume() EVERY GDScriptFunctionState obtained inside this call would be erroneusly invalidated alongside the original state that was resumed.
However that is not the case as this barebones example demonstrates:
var state1
var state2
func advance():
print("advancing: state1 = ", state1, " valid? ", state1.is_valid())
print("resuming state1")
state1.resume()
print ("resumed")
func second_tier_caller():
state2 = bar()
print("second tier: state2 = ", state2, " valid? ", state2.is_valid())
func foo():
print("i am foo!!")
yield()
second_tier_caller()
func bar():
print("i am barrr!")
yield()
print("this is never printed")
func _ready():
state1=foo()
advance()
print("end: state1 = ", state1, " valid? ", state1.is_valid())
print("end: state2 = ", state2, " valid? ", state2.is_valid())
it prints out
i am foo!!
advancing: state1 = [GDScriptFunctionState:1177] valid? True
resuming state1
i am barrr!
second tier: state2 = [GDScriptFunctionState:1178] valid? True
resumed
end: state1 = [GDScriptFunctionState:1177] valid? False
end: state2 = [GDScriptFunctionState:1178] valid? True
Just how documentation prescribes, [GDScriptFunctionState:1178] doesn't get invalidated in this example.
So a two-tiered structure itself doesn't automatically create problems.
I should also notice that while deducing my first testcase prints blanking out or plain crashes without errors reported were a common apperance which might indicate low-level side effects causing memory corruption which in turn means that it is a potential CVE when it comes to multiplayer servers.
@Houkime
From what I'm seeing, with the way yield() is currently implemented, doing anything other than resuming from the the direct caller of the function that yielded can easily get the system into a weird state. I hope to be able to help to improve this aspect of GDScript once I grok the module. Until then my personal policy is to be very conservative with yield() (perhaps even avoiding it and managing state explicitly) in my Godot projects.
In the example in your original post I suspected that reentrancy issues were responsible for some of what you were seeing. The function resume() is a method of the reference counted GDScriptFunctionState object (e.g scene_state) that is returned by yield(). When scene_state.resume() in advance() is executed, start2() is run which then sets scene_state to null. This removes the extant reference and causes scene_state to be destroyed.
However, the function start2() is executed from the resumption point in init_dialog() as part of a call tree that has the scene_state.resume() as an ancestor. Put another way, scene_state.resume() calls init_dialog() from the "frozen state" contained inside the GDScriptFuntionState (i.e scene_state) which causes it to restart and call start2() which essentially deletes scene_state. When start2() returns (followed by init_dialog() returning), the rest of code in resume() is executed on a dead object. That's obviously problematic.
To get a quick idea whether the hunch might be right I modified the GDScriptFunctionState destructor to print a message when it is called and rebuilt Godot. I also made resume() indicate when it is entered and left. This was the output:
starting
start wrote [GDScriptFunctionState:1176] valid? True
advancing from [GDScriptFunctionState:1176]
------->Entering GDScriptFunctionState::resume()
~~~~~~~GDScriptFunctionState destructor called
start2 wrote [GDScriptFunctionState:1177] valid? True
<-------Leaving GDScriptFunctionState::resume()
ss null
scenest [GDScriptFunctionState:1177] valid? False
advance ended
As you can see, the state is destroyed even before antlers() runs and outputs "some important print" [for some reason only a blank line is outputted by antlers() when I print "Entering" at the start of resume() but that's not important now]. Needless to say, all bets are off when control returns to resume() and it starts modifying members of its destroyed object. The program's state is corrupted. It's possible that the return value of antlers() happens to be created in the same address as the original scene_state. Since resume() invalidates the state it is called on, resume() would invalidate the wrong object and cause the symptom you're seeing.
After that I modified your original example in two ways. Under var scene_state = null I added 'var original_scene_state = null'. Then in alt_start() I added original_scene_state = scene_state after scene_state=init_dialog(). The following output was produced:
starting
start wrote [GDScriptFunctionState:1176] valid? True
advancing from [GDScriptFunctionState:1176]
------->Entering GDScriptFunctionState::resume()
some important print
start2 wrote [GDScriptFunctionState:1177] valid? True
<-------Leaving GDScriptFunctionState::resume()
ss null
scenest [GDScriptFunctionState:1177] valid? True
advance ended
In this case the original_scene_state variable was an additional reference that kept the state alive after scene_state = null.
TBH, I'm a bit unsettled by godot's current yield except in the simplest use cases. I would love to know which use cases were intended and which were not but are possible but may hide pitfalls.
I personally ended up entirely revamping my dialogue system and got rid of every single yield because these guys were driving me crazy already.
And rewritten system worked the first time. "Surprise".
I hope that @busy8 or someone else can give cpp implementation of yield() and resume() some love and clarity at least so that its usage can't corrupt program's state no matter what.
@Houkime I did the same thing last night. I am now using a thread and a semaphore. Seems to be working fine.
FWIW the addition below prevents the state from being deleted while resume() is running.
That would prevent the extremely inconsistent behavior you were seeing in the first example.
Would the developers be interested in a patch?
It would be my first.
Variant GDScriptFunctionState::resume(const Variant &p_arg) {
//Add a reference to *this GDScriptFunctionState while resume runs
//This will ensure that it is not destroyed before resume() returns
//(which can happen if function->call() removes a solitary reference)
//The reference will be removed when resume() returns
class StatePinner {
private:
Reference *ref;
public:
StatePinner(const StatePinner &) = delete;
StatePinner(Reference *r) :
ref{ r } {
ref->reference();
}
~StatePinner() {
ref->unreference();
}
} pin(this);
//... Rest of function snipped
return ret;
}
Make a pull request referencing this issue and find out.
A developer may ask for some work to be done before merging etc.
I am not sure though if current devs are not overflown by PRs that they already have, but it shouldn't matter since this is an established procedure anyway even if merge might be some time away in the future.
@Houkime
I was just about to say that while working on other aspects of yield (for example I want yield(value) to be possible and work as expected) I have modified my approach a bit in line with those changes. I was thinking of opening a new feature request in the issues list for discussion before creating a pull request. However, if protocol is to go straight to a pull request I guess I'll do that when I'm farther along and want to submit a patch.
I don't know how often the issue list is checked but this is just a general notification that I am still working on fixing some issues relating to yield and harmonizing how it works with the rest of the language. Hopefully I'll be able to get it merged eventually but discussion can wait for a PR.
Changes will include:
1) Preventing a resume()-ed function from deleting it's own state and causing problems when there is only one extant reference as was encountered in this issue.
2) Preventing resume() from forcing a coroutine to resume even though it's blocked waiting on a signal with yield(object, "signal") as was reported in another issue. Instead resume() will fail.
3) Completing symmetric communication by allowing yield(value) to hand a value back to the caller. Currently godot allows resume(value) to pass an argument to the coroutine that yielded. I want the other way to work as well.
4) Enhancing interaction between yield(object, "signal") and yield(). Currently it's very difficult to resume interaction with a function that used yield(object, "signal") because the state the caller held gets invalidated; you won't know when it has yielded normally and won't have a valid handle to resume() it. This will be addressed.
5) To facilitate all of the above, the current behavior of resume() will change: it will not invalidate the function state therefore it will not need to return it. This will allow it to return the value yielded instead. A persistent handle to the coroutine will make many things simpler. Of course, existing GDScript that uses yield() will not work without change. (The reassignments of resume()'s return value would have to be removed)
6) Finally, I recommend introducing yield_await(object, "signal") [or possibly just await] as an easily searchable synonym [at least] for yield(object, "signal"). A function that does yield(object, "signal") actually "goes to sleep" until it is awoken by the signal. During that time it is logical that it cannot be directly resumed using state.resume(). OTOH when a function yields normally with yield() or yield(value) it CAN be resumed right away. The behavior is different but the existing syntax will hide very easily among three versions of the command. A different keyword that is easy to search for and makes clear what's happening would be nice. Interestingly, the implementation has separate opcodes for ordinary yield and the signal version (OPCODE_YIELD and OPCODE_YIELD_SIGNAL) even though the keyword is overloaded. yield_signal would usable as a keyword too.
Point 5) is actually key to getting most of this to work and I have already done this. It's also the reason I can't submit separate PRs for 1-4 as they all need 5). Part 6, however, is orthogonal and it may not be included in the final PR as I may want feedback on the rest first.
Anyway, just an update and heads up that a PR will be on the way with more details (in a few days at the most).
I don't have crash but usage freed memory:
==17571==ERROR: AddressSanitizer: heap-use-after-free on address 0x61400001bc50 at pc 0x000001a56a17 bp 0x7ffd78d0ea00 sp 0x7ffd78d0e9f0
READ of size 8 at 0x61400001bc50 thread T0
#0 0x1a56a16 in GDScriptFunctionState::resume(Variant const&) modules/gdscript/gdscript_function.cpp:1862
#1 0x1a832ce in MethodBind1R<Variant, Variant const&>::call(Object*, Variant const**, int, Variant::CallError&) core/method_bind.gen.inc:961
#2 0xe612728 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:921
#3 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#4 0x1a395f5 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1079
#5 0x187173b in GDScriptInstance::call(StringName const&, Variant const**, int, Variant::CallError&) modules/gdscript/gdscript.cpp:1164
#6 0xe612295 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:900
#7 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#8 0x1a39676 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1082
#9 0x187248c in GDScriptInstance::_ml_call_reversed(GDScript*, StringName const&, Variant const**, int) modules/gdscript/gdscript.cpp:1195
#10 0x187273b in GDScriptInstance::call_multilevel_reversed(StringName const&, Variant const**, int) modules/gdscript/gdscript.cpp:1202
#11 0x93b1f3e in Node::_notification(int) scene/main/node.cpp:149
#12 0x172b047 in Node::_notificationv(int, bool) scene/main/node.h:46
#13 0x172d4f1 in CanvasItem::_notificationv(int, bool) scene/2d/canvas_item.h:166
#14 0xaddb97f in Node2D::_notificationv(int, bool) scene/2d/node_2d.h:38
#15 0xe612c92 in Object::notification(int, bool) core/object.cpp:931
#16 0x93b3edb in Node::_propagate_ready() scene/main/node.cpp:196
#17 0x93b3732 in Node::_propagate_ready() scene/main/node.cpp:188
#18 0x9402f0a in Node::_set_tree(SceneTree*) scene/main/node.cpp:2550
#19 0x94c9269 in SceneTree::init() scene/main/scene_tree.cpp:464
#20 0x14a18f4 in OS_X11::run() platform/x11/os_x11.cpp:3246
#21 0x141c36c in main platform/x11/godot_x11.cpp:56
#22 0x7ff17d3f01e2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x271e2)
#23 0x141bf4d in _start (/usr/bin/godots+0x141bf4d)
0x61400001bc50 is located 16 bytes inside of 408-byte region [0x61400001bc40,0x61400001bdd8)
freed by thread T0 here:
#0 0x7ff17e9146ef in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d6ef)
#1 0xebacf55 in Memory::free_static(void*, bool) core/os/memory.cpp:181
#2 0x14c93c2 in void memdelete<Reference>(Reference*) core/os/memory.h:119
#3 0x14ba7c3 in Ref<Reference>::unref() core/reference.h:279
#4 0xe6dc452 in RefPtr::unref() core/ref_ptr.cpp:90
#5 0xe894bad in Variant::clear() core/variant.cpp:1118
#6 0xe8903bf in Variant::reference(Variant const&) core/variant.cpp:920
#7 0xe8b17e5 in Variant::operator=(Variant const&) core/variant.cpp:2532
#8 0x1a2adba in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:734
#9 0x187173b in GDScriptInstance::call(StringName const&, Variant const**, int, Variant::CallError&) modules/gdscript/gdscript.cpp:1164
#10 0xe612295 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:900
#11 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#12 0x1a39676 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1082
#13 0x1a56484 in GDScriptFunctionState::resume(Variant const&) modules/gdscript/gdscript_function.cpp:1848
#14 0x1a832ce in MethodBind1R<Variant, Variant const&>::call(Object*, Variant const**, int, Variant::CallError&) core/method_bind.gen.inc:961
#15 0xe612728 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:921
#16 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#17 0x1a395f5 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1079
#18 0x187173b in GDScriptInstance::call(StringName const&, Variant const**, int, Variant::CallError&) modules/gdscript/gdscript.cpp:1164
#19 0xe612295 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:900
#20 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#21 0x1a39676 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1082
#22 0x187248c in GDScriptInstance::_ml_call_reversed(GDScript*, StringName const&, Variant const**, int) modules/gdscript/gdscript.cpp:1195
#23 0x187273b in GDScriptInstance::call_multilevel_reversed(StringName const&, Variant const**, int) modules/gdscript/gdscript.cpp:1202
#24 0x93b1f3e in Node::_notification(int) scene/main/node.cpp:149
#25 0x172b047 in Node::_notificationv(int, bool) scene/main/node.h:46
#26 0x172d4f1 in CanvasItem::_notificationv(int, bool) scene/2d/canvas_item.h:166
#27 0xaddb97f in Node2D::_notificationv(int, bool) scene/2d/node_2d.h:38
#28 0xe612c92 in Object::notification(int, bool) core/object.cpp:931
#29 0x93b3edb in Node::_propagate_ready() scene/main/node.cpp:196
previously allocated by thread T0 here:
#0 0x7ff17e914ae8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dae8)
#1 0xebabf4f in Memory::alloc_static(unsigned long, bool) core/os/memory.cpp:85
#2 0xebabe4e in operator new(unsigned long, char const*) core/os/memory.cpp:42
#3 0x1a3e798 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1257
#4 0x187173b in GDScriptInstance::call(StringName const&, Variant const**, int, Variant::CallError&) modules/gdscript/gdscript.cpp:1164
#5 0xe612295 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:900
#6 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#7 0x1a395f5 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1079
#8 0x187173b in GDScriptInstance::call(StringName const&, Variant const**, int, Variant::CallError&) modules/gdscript/gdscript.cpp:1164
#9 0xe612295 in Object::call(StringName const&, Variant const**, int, Variant::CallError&) core/object.cpp:900
#10 0xe900974 in Variant::call_ptr(StringName const&, Variant const**, int, Variant*, Variant::CallError&) core/variant_call.cpp:1096
#11 0x1a39676 in GDScriptFunction::call(GDScriptInstance*, Variant const**, int, Variant::CallError&, GDScriptFunction::CallState*) modules/gdscript/gdscript_function.cpp:1082
#12 0x187248c in GDScriptInstance::_ml_call_reversed(GDScript*, StringName const&, Variant const**, int) modules/gdscript/gdscript.cpp:1195
#13 0x187273b in GDScriptInstance::call_multilevel_reversed(StringName const&, Variant const**, int) modules/gdscript/gdscript.cpp:1202
#14 0x93b1f3e in Node::_notification(int) scene/main/node.cpp:149
#15 0x172b047 in Node::_notificationv(int, bool) scene/main/node.h:46
#16 0x172d4f1 in CanvasItem::_notificationv(int, bool) scene/2d/canvas_item.h:166
#17 0xaddb97f in Node2D::_notificationv(int, bool) scene/2d/node_2d.h:38
#18 0xe612c92 in Object::notification(int, bool) core/object.cpp:931
#19 0x93b3edb in Node::_propagate_ready() scene/main/node.cpp:196
#20 0x93b3732 in Node::_propagate_ready() scene/main/node.cpp:188
#21 0x9402f0a in Node::_set_tree(SceneTree*) scene/main/node.cpp:2550
#22 0x94c9269 in SceneTree::init() scene/main/scene_tree.cpp:464
#23 0x14a18f4 in OS_X11::run() platform/x11/os_x11.cpp:3246
#24 0x141c36c in main platform/x11/godot_x11.cpp:56
#25 0x7ff17d3f01e2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x271e2)
GDScriptFunctionState was removed in 4.0 and yield is replaced with await, so this can be closed as no longer valid.