Godot version:
3.1.2 stable
OS/device including version:
Windows 10
Issue description:
I've built a Godot project that basically displays static 2D images/slideshows on request. It has multiple threads written in GDScript, with a UDP server handling requests to show images or slideshows.
After the project had been running for ~8 days, the Timer seemed to stop working all of a sudden.
Steps to reproduce:

```gdscript
while active:
    emit_signal("update_image", ...) # it's a CONNECT_DEFERRED signal
    slideshow_cur_img += 1 # what will happen if this overflows?
    OS.delay_msec(1000 * slideshow_interval)
```

`_on_timer_timeout` was never run again. The timer is set up like this:

```gdscript
timer = Timer.new()
timer.set_wait_time(sync_interval_sec)
timer.connect("timeout", self, "_on_timer_timeout")
add_child(timer)
timer.start()
```

A SceneTree timer is also used elsewhere:

```gdscript
yield(get_tree().create_timer(wait_sec), "timeout")
```
While the issue was happening, all other parts of the Godot project continued working.
Minimal reproduction project:
I don't have it now.
Does your application happen to be running at 24 FPS? I think I know the cause. #38880 should fix this, but I haven't tested it since that would take a long time.
Thanks @aaronfranke. However, I didn't set the FPS specifically, but that fix makes sense!
To be specific, the starting time of my application was:
2020-05-01 17:35:39
And the last successful timer was triggered at:
2020-05-09 03:07:49
The timer wait_time was set to 3 min and the next timeout never came (that is about 7 days and 9.5 hours, or roughly 639,000 seconds, after starting).
I will have a custom build myself and hopefully I can report back a few weeks later :P
I'm not sure #38880 would fix it; from what I understand, @starryalley was not using a Timer with a wait time higher than FLT_MAX (~3.4e+38), but instantiating new Timers each time with a wait time likely within reasonable float bounds.
I think the problem is here, since `idle_process_time` is still a float: https://github.com/godotengine/godot/blob/00949f0c5fcc6a4f8382a4a97d5591fd9ec380f8/scene/main/scene_tree.h#L255
> I'm not sure #38880 would fix it; from what I understand, @starryalley was not using a Timer with a wait time higher than FLT_MAX (~3.4e+38), but instantiating new Timers each time with a wait time likely within reasonable float bounds.
For more information: both the existing repeating Timer (instantiated only once at program start) and the newly instantiated Timers stopped working. The wait time for the repeating timer is 3 min, and 1 second for the newly created ones.
@akien-mga 99.99% of the time, issues with floats aren't issues with being out of the absolute range, they are issues with being out of the accurate range. After 16777216, single-precision floats lose the ability to store all integers (16777215 -> 16777216 -> 16777218, since 16777217 is not a valid float), so if a program depends on incrementing by 1, it will fail at 16777216.
Similarly, if a program depends on incrementing by a smaller number, say, 1/24 per frame at 24 FPS, then it will fail earlier, approximately 24x as early, though in reality it depends on how the number rounds to powers of two. If you increment by 1 millisecond, 1000 times per second, it fails after a few hours (this is the same problem as losing millimeter precision after a few km with vectors).
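As a concrete illustration of that failure mode (just a sketch of a fixed-delta accumulation loop, not Godot's actual Timer/SceneTree code): accumulating 1/24 of a second per frame into a 32-bit float stalls once the float's spacing grows past the delta; a double accumulator is shown alongside for comparison.

```cpp
#include <cstdio>

int main() {
    // Accumulate 1/24 s per frame, as a 24 FPS loop might.
    const float delta = 1.0f / 24.0f;
    float t_float = 0.0f;
    double t_double = 0.0;

    for (long frame = 0; frame < 30000000; ++frame) {
        float next = t_float + delta;
        if (next == t_float) {
            // Adding delta no longer changes the float at all.
            printf("float accumulator stalled at %f s (frame %ld)\n", t_float, frame);
            printf("double accumulator is at %f s\n", t_double);
            return 0;
        }
        t_float = next;
        t_double += delta;
    }
    return 0;
}
```

The float accumulator also drifts noticeably before it stalls, because each addition is rounded to the float's current spacing.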
I don't know for sure if #38880 fixes the problem, but I know that using float for time is borked, so it should be replaced with double (or some other implementation). I'll update the PR to also do this for SceneTree get_idle_process_time().
There are a lot of things going on in your description. This really needs a minimal reproduction project (even though I accept one is hard to make; maybe just isolate one of the things that breaks in a single project). Then we can modify the code to pretend it has been running for a long time.
This _could_ simply be a QueryPerformanceCounter wraparound:
https://stackoverflow.com/questions/5296990/queryperformancecounter-and-overflows
```cpp
uint64_t OS_Windows::get_ticks_usec() const {
    uint64_t ticks;
    uint64_t time;
    // This is the number of clock ticks since start
    if (!QueryPerformanceCounter((LARGE_INTEGER *)&ticks))
        ticks = (UINT64)timeGetTime();
    // Divide by frequency to get the time in seconds
    time = ticks * 1000000L / ticks_per_second;
    // Subtract the time at game start to get
    // the time since the game started
    time -= ticks_start;
    return time;
}
```
LARGE_INTEGER is a signed 64-bit value. If the clock frequency is high you _could_ get wraparound. It depends on the frequency and how long the system has been running, I guess (I don't know what is typical on Windows).
Also, the expression `time = ticks * 1000000L / ticks_per_second;` could be problematic. I wasn't sure of the operator precedence, so I tried it in godbolt, and sure enough it does the multiply before the divide, so that adds to the chance of an overflow. (It must do, thinking about it, otherwise you'd end up with fractions getting rounded away as integers! :grin:)
You could gain another factor of 1,000,000 on the range by doing this step at 128 bits.
- 2^63 ≈ 9.22×10^18
- Resolution dropped by the multiply: 2^63 / 1,000,000 ≈ 9.22×10^12
- Typical clock frequency: 1.4 million ticks per second
- 9.22×10^12 / 1.4 million ≈ 6,588,000 seconds ≈ 109,800 minutes ≈ 1,830 hours ≈ 76 days
So depending on the clock frequency, and how long the PC has been on (does QueryPerformanceCounter reset at boot?), this could be an issue. Whether it is responsible for the bug seen here, I don't know.
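To make the multiply-before-divide concern concrete, here is a small standalone sketch (not the actual Godot code or fix; the helper names are hypothetical, the 1.4 MHz figure is just the example used above, and `unsigned __int128` is a GCC/Clang extension). It shows the naive conversion wrapping around once `ticks * 1000000` exceeds 64 bits, while a 128-bit intermediate keeps the full product:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helpers: convert performance-counter ticks to microseconds.

// Naive version: the intermediate product "ticks * 1000000" can exceed
// the 64-bit range and wrap around, giving a bogus result.
uint64_t ticks_to_usec_naive(uint64_t ticks, uint64_t freq) {
    return ticks * 1000000ULL / freq;
}

// Same conversion with a 128-bit intermediate, so the full product survives.
uint64_t ticks_to_usec_wide(uint64_t ticks, uint64_t freq) {
    return (uint64_t)((unsigned __int128)ticks * 1000000ULL / freq);
}

int main() {
    const uint64_t freq = 1400000; // the ~1.4 MHz example used above
    // A tick count just past the point where ticks * 1000000 wraps 64 bits.
    const uint64_t ticks = UINT64_MAX / 1000000ULL + 1;
    printf("naive: %llu us\n", (unsigned long long)ticks_to_usec_naive(ticks, freq));
    printf("wide:  %llu us\n", (unsigned long long)ticks_to_usec_wide(ticks, freq));
    return 0;
}
```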
> I think the problem is here, since `idle_process_time` is still a float:
`idle_process_time` being a float isn't necessarily a problem, provided it is promoted to a double in calculations where it needs to be a double. And of course provided that `idle_process_time` itself isn't being wrongly calculated...
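For illustration of that trade-off (again just a sketch, not Godot code): promoting to double inside an expression is exact for that one calculation, but if the running total is stored back into a float the extra precision is discarded again, so an accumulator that is itself a float still loses small increments.

```cpp
#include <cstdio>

int main() {
    float elapsed = 16777216.0f; // 2^24 seconds, roughly 194 days
    // Promoted to double for the calculation: the tiny increment survives.
    double promoted = (double)elapsed + 0.001;
    printf("as double:            %.6f\n", promoted);

    // Stored back into the float accumulator: the increment is rounded away,
    // because the float's spacing at this magnitude is 2.0.
    elapsed = (float)promoted;
    printf("stored back to float: %.6f\n", elapsed);
    return 0;
}
```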
Can anyone check if it is fixed?
The related PR was already merged and cherry-picked.