RetroArch menu freezes with gl/glcore and xmb

Created on 6 Jul 2019 · 34 Comments · Source: libretro/RetroArch

Description

When moving right of the favorites tab with xmb, RetroArch freezes, forcing the process to be killed. Sometimes it gets farther, but it always freezes before looping through all the tabs. This seems to affect the gl and glcore video drivers, but does not affect vulkan.

Expected behavior

RetroArch should not freeze.

Actual behavior

RetroArch freezes.

Steps to reproduce the bug

Start RetroArch.
Select the gl or glcore video drivers.
Select the xmb menu driver.
Hold the right key to make RetroArch scroll right through the tabs.
It usually freezes right of the favorites tab; sometimes it gets farther, but it never freezes sooner.
I loop through all the tabs 3 times before deciding if commits are good or not.

Bisect Results

ff297e72e78a99cad6218e0f4192353798011903 is the first bad commit
commit ff297e72e78a99cad6218e0f4192353798011903
Author: jdgleaver <[email protected]>
Date:   Tue May 28 12:55:31 2019 +0100

    (task_image) Make image loading/processing non-blocking on non-threaded systems

 libretro-common/formats/image_transfer.c |  3 ---
 libretro-common/formats/png/rpng.c       | 16 +++++-------
 tasks/task_image.c                       | 45 ++++++++++++++++++--------------
 3 files changed, 33 insertions(+), 31 deletions(-)

ff297e72e78a99cad6218e0f4192353798011903

Version/Commit

You can find this information under Information/System Information

RetroArch: https://github.com/libretro/RetroArch/commit/0b1ee7d00a27f8f2f8f1b36f1b1f54245c6bfb99

Environment information

OS: Slackware64-current
Compiler: clang-8.0.0
mesa: https://github.com/mesa3d/mesa/commit/9b116173b6a5e96c54ef3962546aabd505e00cfb
libxcb: 1.13.1

bisected glcore major opengl xmb

Source

orbea

All 34 comments

@jdgleaver Could you look at this when you have time?

orbea on 6 Jul 2019

It additionally does not occur if threaded video is enabled, but disabling threaded video will segfault. Not sure if this issue is related or not.

Edit: The crash with disabling threaded video is an older or external issue.

orbea on 6 Jul 2019

@orbea This almost certainly has nothing to do with my RPNG PR...

I cannot reproduce this on my dev machine (nvidia), but I just tried it on my laptop with an intel broadwell GPU, and yes, the hang is there (OpenSUSE 15.1 in both cases). This is the backtrace:

Thread 1 (Thread 0x7ffff7fa3dc0 (LWP 11272)):
#0  0x00007fffef97719b in poll () from /lib64/libc.so.6
#1  0x00007ffff3e85307 in ?? () from /usr/lib64/libxcb.so.1
#2  0x00007ffff3e8702a in xcb_wait_for_special_event () from /usr/lib64/libxcb.so.1
#3  0x00007fffe281b2af in ?? () from /usr/lib64/libGLX_mesa.so.0
#4  0x00007fffe2814301 in ?? () from /usr/lib64/libGLX_mesa.so.0
#5  0x0000000000700ffa in gfx_ctx_x_swap_buffers ()
#6  0x000000000070e865 in gl2_frame ()
#7  0x000000000046b9f3 in video_driver_frame ()
#8  0x000000000046c085 in video_driver_cached_frame ()
#9  0x00000000006704d7 in menu_display_libretro ()
#10 0x0000000000672c3e in menu_driver_render ()
#11 0x000000000046de2d in runloop_check_state.constprop ()
#12 0x000000000046f7f5 in runloop_iterate ()
#13 0x000000000056ecb1 in ui_application_qt_run(void*) ()
#14 0x000000000045634e in rarch_main ()
#15 0x00007fffef8a9f8a in __libc_start_main () from /lib64/libc.so.6
#16 0x000000000045338a in _start () at ../sysdeps/x86_64/start.S:120

Note this: xcb_wait_for_special_event ()

If you google it, you'll find that many programs are subject to the same error - and it would appear to be a driver issue.

In my case, if I disable my desktop compositor (compton), the issue goes away.

This is a very serious problem, and I don't know how to fix it... (I think we may need the help of a driver expert...)

jdgleaver on 7 Jul 2019

Would you be willing to ask about this in #dri-devel @ freenode? A lot of driver devs hang out there and they may be able to offer good advice (However I would try on a weekday). I also use compton and found that disabling it will postpone the freeze, but it still occurs.

I double-checked and I am certain that your commit is related; if it didn't cause the issue, then it exposed it.

orbea on 7 Jul 2019

I double-checked and I am certain that your commit is related; if it didn't cause the issue, then it exposed it.

Hmm... Yeah, I can see how this might have exposed the issue. Before, all PNG thumbnails were loaded 'instantly' (i.e. within a single task iteration) - now they load in 'chunks' over several iterations. This means we have a task thread running in the background while switching playlist tabs - and I guess the more threads we have, the more likely it's going to hang while switching between them.
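
To illustrate the difference described above, here is a minimal sketch in C of a job that is processed in bounded chunks across several iterations instead of in a single call. The names `chunked_job_t`, `job_iterate` and `job_count_iterations` are hypothetical and are not taken from tasks/task_image.c; the sketch also budgets by a fixed unit count for simplicity, which is an assumption rather than what the real task does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an image decode job made of 'total' work units. */
typedef struct
{
   size_t done;
   size_t total;
} chunked_job_t;

/* Perform at most 'max_units' of work, then return to the caller.
 * Returns true once the whole job is finished. */
static bool job_iterate(chunked_job_t *job, size_t max_units)
{
   size_t remaining;

   assert(job != NULL && max_units > 0);

   remaining  = job->total - job->done;
   job->done += (remaining < max_units) ? remaining : max_units;

   return job->done == job->total;
}

/* Count how many calls to job_iterate() a job of 'total' units needs.
 * A real frontend would render one menu frame between calls. */
static size_t job_count_iterations(size_t total, size_t max_units)
{
   chunked_job_t job = { 0, total };
   size_t calls      = 1;

   while (!job_iterate(&job, max_units))
      calls++;

   return calls;
}
```

With a budget of 4 units, a 10-unit job needs three iterations, so the menu keeps rendering while the thumbnail is still loading. With a budget at least as large as the job, it completes in one call, which corresponds to the old 'instant' behaviour.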

It would be interesting to check whether other tasks cause this - i.e. disable thumbnail display, start a playlist thumbnail download task, then switch rapidly between playlists tabs while it's running (I can't test this myself today - don't have access to a PC at present...)

Would you be willing to ask about this in #dri-devel @ freenode? A lot of driver devs hang out there and they may be able to offer good advice (However I would try on a weekday).

I don't use IRC, so no. But also, GFX driver code is outside my comfort zone...

Anyway, I'll see if I can find anything out tomorrow...

jdgleaver on 7 Jul 2019

It would be interesting to check whether other tasks cause this - i.e. disable thumbnail display, start a playlist thumbnail download task, then switch rapidly between playlists tabs while it's running

I have experienced no freezes attempting this.

orbea on 7 Jul 2019

@orbea Would you mind testing this diff? It won't solve the underlying issue, but TSAN showed up a data race condition, and I have a hunch it could be related to the lock-ups. Sadly, I won't have access to an intel machine for a few days... (can't reproduce any kind of hanging with an nvidia card)

diff --git a/tasks/task_image.c b/tasks/task_image.c
index cbe7fe41e9..c48026a540 100644
--- a/tasks/task_image.c
+++ b/tasks/task_image.c
@@ -192,7 +192,7 @@ static int cb_nbio_image_thumbnail(void *data, size_t len)
    nbio_handle_t *nbio             = (nbio_handle_t*)data;
    struct nbio_image_handle *image = nbio  ? (struct nbio_image_handle*)nbio->data : NULL;
    void *handle                    = image ? image_transfer_new(image->type)       : NULL;
-   float refresh_rate;
+   float refresh_rate              = 0.0f;

    if (!handle)
       return -1;
@@ -209,8 +209,8 @@ static int cb_nbio_image_thumbnail(void *data, size_t len)
    image->size                     = len;

    /* Set task iteration duration */
-   rarch_environment_cb(RETRO_ENVIRONMENT_GET_TARGET_REFRESH_RATE, &refresh_rate);
-   if (refresh_rate == 0.0f)
+   /*rarch_environment_cb(RETRO_ENVIRONMENT_GET_TARGET_REFRESH_RATE, &refresh_rate);
+   if (refresh_rate == 0.0f)*/
       refresh_rate = 60.0f;
    image->frame_duration = (unsigned)((1.0 / refresh_rate) * 1000000.0f);
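
The last two lines of the hunk turn a refresh rate into a per-iteration time budget in microseconds. A minimal standalone sketch of that arithmetic follows, assuming a hypothetical helper name `frame_duration_us` (not a RetroArch function) and treating any non-positive rate as unknown:

```c
#include <assert.h>

/* One second is 1000000 us, so the result needs more than 16 bits. */
static_assert(sizeof(unsigned) >= 4, "unsigned must hold a microsecond count");

/* Hypothetical helper: microseconds per frame for a given refresh rate.
 * Falls back to 60 Hz when the rate is zero, negative or NaN. */
static unsigned frame_duration_us(float refresh_rate)
{
   if (!(refresh_rate > 0.0f))
      refresh_rate = 60.0f;

   return (unsigned)((1.0 / refresh_rate) * 1000000.0);
}
```

At the 60 Hz fallback this gives 16666 us per iteration, since the cast truncates 16666.67.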

It's pretty rough that we have to deal with driver bugs like this...

I could only find one example of someone adding a workaround to their own project for this intel xcb_wait_for_special_event() issue (in all other cases it was fixed at the mesa level): https://github.com/tildearrow/kwin-lowlatency/commit/73f09f6e11ce806064d0360bef1383e0779d9fa4. Don't think that's very useful for us here.

I can also confirm that the issue only started happening on my laptop with intel graphics after I updated to OpenSUSE Leap 15.1 last week (so perhaps it's a regression in recent intel drivers)

jdgleaver on 8 Jul 2019

Your workaround helps and it does not freeze anymore.

orbea on 8 Jul 2019

Thanks for testing.

So now I have to find a way to get the refresh rate without calling any video driver code. Hmm...

I think I can probably work around this for now, but it's almost certainly going to come back and bite us at some point. Oh well. Anyway, I'll try to get a PR put together before the end of today.

(BTW - apologies if I was abrupt with you earlier. I worked almost 40 hours over the weekend, and getting called out on this issue was the last thing I needed! But it's all good now :slightly_smiling_face: )

jdgleaver on 8 Jul 2019

I have probably had the same issue as you, @orbea, for weeks. I have since renamed my thumbnails folder, as I found that thumbnails made the freezes more frequent.

I tried applying @jdgleaver's patch, but I'm still getting freezes when thumbnails are available (Linux, Arch, Xfce with xfwm4 enabled, on an Intel GPU).

ghost on 8 Jul 2019

@retro-wertz Oh damn... So you still get infrequent hangs with thumbnails disabled? If so, then this is pretty much an unfixable driver bug... (or at least, intel mesa needs fixing upstream)

jdgleaver on 8 Jul 2019

With thumbnails disabled, it does not freeze as fast as when thumbnails are enabled (or at least I can press left or right for longer before it freezes...)

ghost on 8 Jul 2019

OK, that's what I needed to know - if it freezes without thumbnails then the RPNG/task_image code is not at fault. It all comes back to the xcb_wait_for_special_event() issue at the driver level.

I guess I should still make the task_image PR, though, due to the TSAN data race condition (which again doesn't seem to be our fault, but another driver issue...)

jdgleaver on 8 Jul 2019

It's possible that the intel and amd issues are slightly different? Can someone try confirming my bisect results?

BTW - apologies if I was abrupt with you earlier.

I didn't notice and no worries. :)

If so, then this is pretty much an unfixable driver bug...

It would really help if someone could discuss this with mesa devs, I suggested irc, but...

orbea on 8 Jul 2019

@orbea Oh, are you using AMD gfx? I did see similar libxcb xcb_wait_for_special_event() hang issues reported for AMD as well - e.g. https://github.com/ValveSoftware/steam-for-linux/issues/4368

Probably AMD is slightly more robust than intel, which is why the hang frequency differs.

It would really help if someone could discuss this with mesa devs, I suggested irc, but...

Thing is, the person who talks to the mesa devs needs to know what they're talking about :)

As I said, GFX driver stuff is not my field...

jdgleaver on 8 Jul 2019

For what it's worth, here's the task_image PR... #9079

jdgleaver on 8 Jul 2019

From reading the steam-for-linux issue I tried LIBGL_DRI3_DISABLE=true and found it does not occur with DRI2. I suspect this is an issue in the xorg-server or the xorg ddx rather than in mesa. I'll look into reporting this to the xorg developers later.
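
For anyone wanting to repeat the DRI2 test from a small launcher rather than from the shell, here is a hedged C sketch. `LIBGL_DRI3_DISABLE` is the variable named above; the function names `disable_dri3` and `exec_with_dri2` are hypothetical, and the sketch assumes a POSIX system and that the GL library reads the variable when it initialises in the launched process.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Set LIBGL_DRI3_DISABLE=true for this process and anything it execs.
 * Returns 0 on success, -1 on failure (as setenv does). */
int disable_dri3(void)
{
   return setenv("LIBGL_DRI3_DISABLE", "true", 1);
}

/* Hypothetical launcher: replace the current process with argv[0],
 * searched on PATH, with the variable already in its environment.
 * Only returns (with -1) if setting the variable or the exec fails. */
int exec_with_dri2(char *const argv[])
{
   assert(argv != NULL && argv[0] != NULL);

   if (disable_dri3() != 0)
      return -1;

   execvp(argv[0], argv);
   return -1;
}
```

Calling `exec_with_dri2` with an argument vector such as `{ "retroarch", NULL }` would be the equivalent of prefixing the command with the variable in a shell.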

Thing is, the person who talks to the mesa devs needs to know what they're talking about :)

I've talked to mesa and xorg developers in the past and received a lot of help. I certainly don't know very much about the GFX drivers. :)

It's usually more productive if someone more knowledgeable about the program's code talks to them directly...

orbea on 8 Jul 2019

From reading the steam-for-linux issue I tried LIBGL_DRI3_DISABLE=true and found it does not occur with DRI2. I suspect this is an issue in the xorg-server or the xorg ddx rather than in mesa. I'll look into reporting this to the xorg developers later.

Ah, that's good to know - hopefully narrows things down a little. Keep us posted on what the xorg devs say :)

jdgleaver on 8 Jul 2019

I don't know if this matters, but later versions of the Intel driver (at least on Arch) have been using DRI3. Ubuntu, which still uses DRI2, is problematic with RA, and I had to change that whenever I needed to run Ubuntu or any Debian derivative.

ghost on 8 Jul 2019

I made a backtrace with mesa and libxcb symbols.

Thread 1 "retroarch" received signal SIGINT, Interrupt.
0x00007ffff3a32a19 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff3a32a19 in poll () from /lib64/libc.so.6
#1  0x00007ffff5f131bc in _xcb_conn_wait (c=0x19cd290, cond=0x1dec908, 
    vector=0x0, count=0x0) at xcb_conn.c:479
#2  0x00007ffff5f15e2d in xcb_wait_for_special_event (c=0x19cd290, se=0x1dec8e0)
    at xcb_in.c:795
#3  0x00007ffff58c01a1 in loader_dri3_wait_for_msc (draw=0x1a3ac58, 
    target_msc=50465947, divisor=0, remainder=0, ust=0x19cbb50, msc=0x19cbb58, 
    sbc=0x19cbb60) at ../src/loader/loader_dri3_helper.c:582
#4  0x00007ffff58b4999 in dri3_wait_for_msc (pdraw=0x1a3ac20, 
    target_msc=50465947, divisor=0, remainder=0, ust=0x19cbb50, msc=0x19cbb58, 
    sbc=0x19cbb60) at ../src/glx/dri3_glx.c:417
#5  0x00007ffff5895696 in __glXWaitForMscOML (dpy=0x19d6040, drawable=50331651, 
    target_msc=50465947, divisor=0, remainder=0, ust=0x19cbb50, msc=0x19cbb58, 
    sbc=0x19cbb60) at ../src/glx/glxcmds.c:2245
#6  0x000000000081135c in gfx_ctx_x_swap_buffers (data=0x19cbb50, 
    data2=0x7fffffffd640) at gfx/drivers_context/x_ctx.c:360
#7  0x000000000087a0fe in gl_core_frame (data=0x1a3df50, 
    frame=0x11a1aa0 <video_driver_init_internal.dummy_pixels>, frame_width=4, 
    frame_height=4, frame_count=150, pitch=8, 
    msg=0x1191ae0 <video_driver_frame.video_driver_msg> "", 
    video_info=0x7fffffffd640) at gfx/drivers/gl_core.c:1734
#8  0x0000000000465248 in video_driver_frame (
    data=0x11a1aa0 <video_driver_init_internal.dummy_pixels>, width=4, height=4, 
    pitch=8) at retroarch.c:10192
#9  0x00000000004648fb in video_driver_cached_frame () at retroarch.c:9178
#10 0x0000000000745022 in menu_display_libretro (is_idle=false, 
    rarch_is_inited=true, rarch_is_dummy_core=true) at menu/menu_driver.c:644
#11 0x000000000074997f in menu_driver_render (is_idle=false, 
    rarch_is_inited=true, rarch_is_dummy_core=true) at menu/menu_driver.c:2053
#12 0x0000000000470a0a in runloop_check_state (settings=0x7fffef1c9010, 
    input_nonblock_state=false, runloop_is_paused=false, fastforward_ratio=5, 
    sleep_ms=0x7fffffffe060) at retroarch.c:16109
#13 0x000000000046f37d in runloop_iterate (sleep_ms=0x7fffffffe060)
    at retroarch.c:16689
#14 0x00000000006077a1 in ui_application_qt_run (args=0x0)
    at ui/drivers/qt/ui_qt_application.cpp:164
#15 0x0000000000455573 in rarch_main (argc=1, argv=0x7fffffffe1e8, data=0x0)
    at frontend/frontend.c:171
#16 0x00000000006072d6 in main (argc=1, argv=0x7fffffffe1e8)
    at ui/drivers/qt/ui_qt_application.cpp:188

GDB log - retroarch.log

orbea on 8 Jul 2019

I asked in #dri-devel @ freenode.

12:45 <orbea> I'm experiencing a freeze in retroarch somewhere in mesa/libxcb code using DRI3 (DRI2 seems unaffected). I got a backtrace - https://termbin.com/zv0m and it was exposed by this RetroArch commit - https://github.com/libretro/RetroArch/commit/ff297e72e78a99cad6218e0f4192353798011903 and then hidden again in this commit - https://github.com/libretro/RetroArch/commit/9093c9feb84479ff3d92a9f5706c746b95ae16e3 Is
12:45 <orbea> this potentially an issue in mesa or somewhere other than retroarch?
12:48 <ZeZu> o.O
12:49 <ZeZu> loader->xcb event->wait() dies in poll
12:49 <daniels> no? in that commit, refresh_rate could be used uninitialised unless explicitly set
12:49 <daniels> so the fix was the right fix
12:50 <imirkin_> orbea: iirc there's some xcb shenanigans with hangs in dri3 ... something with event queues
12:50 <imirkin_> or maybe it was in some other library and xcb was the fix?
12:50 <imirkin_> as you can tell, i'm a bit weak on the details, but perhaps this jogs someone else's mind
12:50 <orbea> daniels: the problem is that fix works for me with amd, but reportedly not for others with intel
12:52 <orbea> i guess I might have to pressure someone who can still reproduce it to get debuggin symbols...

I suppose @jdgleaver's fix was correct and since I can't reproduce the issue with amd anymore I think someone that has intel and can still reproduce this issue will need to debug further on their own.
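
The point daniels makes about `refresh_rate` can be sketched in isolation. In this minimal C example, `query_refresh_rate` is a hypothetical stand-in for a call that may return without writing its output (it is not the RetroArch environment callback, whose behaviour is not shown in this thread), and `resolve_refresh_rate` is an equally hypothetical caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical query: writes *rate only when a value is available,
 * mimicking a callback that can fail without touching its output. */
static bool query_refresh_rate(bool available, float *rate)
{
   assert(rate != NULL);

   if (!available)
      return false;

   *rate = 59.94f;
   return true;
}

/* The local is initialised first, so a failed query leaves a known
 * value and the fallback check never reads an indeterminate float. */
static float resolve_refresh_rate(bool available)
{
   float rate = 0.0f;

   if (!query_refresh_rate(available, &rate) || rate <= 0.0f)
      rate = 60.0f;

   return rate;
}
```

Without the `= 0.0f` initialiser, the failed-query path would compare an indeterminate value against zero, which is the kind of read the `float refresh_rate = 0.0f;` line in the diff guards against.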

orbea on 8 Jul 2019

@orbea Great, thanks for following this up. So it looks like the issue you originally reported is now fixed, and the intel lock-ups are something else (dri3-related?).

I'll have access to an intel machine later in the week. I'll try to see if I can reproduce the latter issue.

jdgleaver on 9 Jul 2019

Just to add: on an Intel GPU, resizing or maximizing the window when the Qt UI is enabled also freezes RetroArch.

ghost on 13 Jul 2019

I can't reproduce that with amd.

orbea on 13 Jul 2019

Ugh... I'll try running in DRI2 for a while and see if it's a bit more stable. I know that on Debian/Ubuntu, DRI2 (which is the default there) causes problems as well. It might be different here in Arch...

ghost on 13 Jul 2019

What's the status on this?

twinaphex on 16 Jul 2019

@twinaphex This issue was correctly fixed for me and amdgpu with PR https://github.com/libretro/RetroArch/pull/9079 as confirmed when I asked about this issue on #dri-devel. See the irc log in a previous comment.

However as reported by @retro-wertz this or a similar issue still occurs with intel and maybe other hardware. As I can't reproduce it I can't do anything more here...

orbea on 16 Jul 2019

@orbea, are you still running without a compositor? If you are not, can you run RetroArch (gl or glcore) in windowed mode and resize the window using the mouse? This should instantly freeze RA.

For me, I can't run with any compositor enabled (mutter, marco or xfwm) without the above test freezing RA.

Running on Vulkan, despite my lowly Ivy Bridge having incomplete support, I can run with compositors fine.

ghost on 21 Jul 2019

@retro-wertz Yes I am still running compton. I tried resizing a floating RetroArch window with the master using xmb, gl and glcore. I was unable to reproduce any freezes.

I think the possibilities could be a few things.

There is some setting hiding the issue which you have enabled and I have disabled (or vice versa).
There is something in the environment which is either triggering the issue for you or hiding it for me.
It's fixed in recent mesa or xorg commits (I run the mesa and xorg masters).
This is intel only and I can't help unfortunately.

orbea on 21 Jul 2019

I think updating to the dev version of mesa fixed my issue (at least the window resizing). I'm now on 19.2.0 from the AUR (official is 19.1.2-1).

ghost on 21 Jul 2019

@jdgleaver Have you had a chance to try this further? Given the above comment I am inclined to think this is fixed.

orbea on 23 Jul 2019

Ah, sorry - haven't had the chance yet.

I can get access to an intel machine tomorrow, so can test in the morning. But it does look like this was a mesa bug fixed in the latest version...

jdgleaver on 23 Jul 2019

@orbea Okay, on OpenSUSE Leap 15.1 with Intel(R) HD Graphics 5500 (Broadwell GT2) graphics and Mesa 18.3.2 I cannot reproduce any crashes when switching tabs, scrolling lists, resizing windows or toggling full screen (Qt UI enabled). This is with the gl driver, with desktop compositing enabled (compton).

So I think this issue can be closed now :)

jdgleaver on 24 Jul 2019

Thanks for testing!

orbea on 24 Jul 2019
