Retroarch: [Vulkan Regression] Black Screen on Win10, NVidia

Created on 8 Jun 2018 · 26 Comments · Source: libretro/RetroArch

With the latest nightly (9271a7505faeeb5abe53e93fd0c039416fa5da43) I'm getting a black screen with sound after launching content in full screen mode when using the Vulkan video driver. I'm on the 397.64 Nvidia driver, so it's pretty recent, but not the latest. I've seen reports on Reddit of bad issues with the latest driver, so I'm a bit skittish about updating to it right now, but if someone on the 398.xx drivers could test, that would help.

I'm assuming 16c797f05773cfc643f9740c7d03c58c4e6caae2 is what caused it. I tested with swap interval 1 and 2 and exclusive and windowed fullscreen and all had the black screen. If I switch to windowed mode it works fine. Fast forward in Vulkan windowed mode is fixed in this nightly, which is great. Though I still get a graphical artifact that shows up for a frame if swap interval is set to 2 after I let off the fast forward hold hotkey:
[Screenshot: clipboard 10]

In full screen if I fast forward I hear the sound stutter and slow down.

Awakened0

Most helpful comment

Many thanks for those instructions. I was able to reproduce the black screen. I also reproduced another issue where I Alt-Tabbed out and in again and it seemed Cave Story lost its vsync throttling and started playing at fast-forward speed.

I'll dig into these problems. Thanks again.

pdaniell-nv on 29 Jun 2018


All 26 comments

I can confirm that black screen fullscreen issue on Windows 10 as well, with driver 398.11 on my GTX 970. As Awakened0 suspected, https://github.com/libretro/RetroArch/commit/16c797f05773cfc643f9740c7d03c58c4e6caae2 is the culprit, more specifically, the if (old_swapchain != VK_NULL_HANDLE) and resulting action that it takes afterward. If I comment out the if and its corresponding action, fullscreen works fine, with no need to alt tab or play in windowed mode.

That being said, I don't know much about Vulkan, so I don't have any suggestions for a proper fix that will keep the validation layer happy, while also not breaking things.

thedax on 26 Jun 2018

@Themaister Looks like there are issues with that last commit you sent, specifically with Nvidia drivers.

I get the feeling Nvidia doesn't care too much about Vulkan for whatever reason to patch these issues properly.

twinaphex on 26 Jun 2018

I played around with it for a while, and found that if you do it like this, it doesn't exhibit the fullscreen issue (the single frame artifact during fast forward should probably be its own issue, since it doesn't seem to be affected by this), at least for me:

   old_swapchain               = vk->swapchain;

   // snipped out info struct stuff for brevity
   info.oldSwapchain           = VK_NULL_HANDLE;

   if (old_swapchain != VK_NULL_HANDLE)
      vkDestroySwapchainKHR(vk->context.device, old_swapchain, NULL);

   if (vkCreateSwapchainKHR(vk->context.device,
            &info, NULL, &vk->swapchain) != VK_SUCCESS)
   {
      RARCH_ERR("[Vulkan]: Failed to create swapchain.\n");
      return false;
   }

However, I have no way of checking the validation layer (mostly because I have no idea how to set that stuff up), so this might not be a solution at all.

thedax on 26 Jun 2018

Thanks for testing and looking into a potential fix!

Awakened0 on 28 Jun 2018

Sounds like yet another broken nVidia driver then. :( Not sure why this isn't working.

Setting info.oldSwapchain to NULL should be fine. It was vkDestroySwapchainKHR which was missing before, and that needs to remain to avoid leaking memory.

Themaister on 28 Jun 2018

Btw, in my checkout, vkCreateSwapchainKHR is called after vkDestroySwapchainKHR. Did you change that somehow?

Themaister on 29 Jun 2018

(Disregard the last comment that I deleted)

From what I can see in the commit history of vulkan_common.c, it seems to have always been Create -> Destroy. Did something get mixed up somewhere along the way?

As for the snippet I posted, I did swap from Create -> Destroy to Destroy -> Create, otherwise the issue persisted, even if info.oldSwapchain was set to a null handle.

thedax on 29 Jun 2018

I would like to investigate this bug, but as a complete n00b to RetroArch it isn't obvious to me how to recreate the failure. I installed the latest RetroArch nightly binary and fired it up. I see the blue menu screen and modified the graphics driver to be Vulkan. But then I got stuck because I wasn't sure how to run this specific content. Looks like something called "Mednafen PCE Fast". Maybe it doesn't matter what specific content I should use, but any tips on what core and content I could use would be appreciated. Thanks.

pdaniell-nv on 29 Jun 2018

The content doesn't appear to matter, and neither does the core (I've done all my testing with NES and N64 emulators, as well as NXEngine, and not PC Engine), so the easiest way to reproduce is probably:

1. Install RetroArch as normal.
2. Start it and go fullscreen with F on the keyboard.
3. On the "Main Menu", pick Online Updater and download Cave Story (NXEngine) from the Core Updater submenu, then push backspace to go back, pick "NXEngine" from the Content Downloader submenu, and download "Cave Story (En).zip".
4. Push backspace until you're back at the main menu again, then go to Load Content -> Downloads -> Cave Story (En) and pick Doukutsu.exe, and in its submenu, choose "Cave Story (NXEngine)".

The screen should go black until you alt-tab out and then back in.

Edit: of course, your renderer needs to be set to Vulkan (which I believe you said you switched it to, but just making sure. :P)

thedax on 29 Jun 2018

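Many thanks for those instructions. I was able to reproduce the black screen. I also reproduced another issue where I Alt-Tabbed out and in again and it seemed Cave Story lost its vsync throttling and started playing at fast-forward speed.

I'll dig into these problems. Thanks again.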

pdaniell-nv on 29 Jun 2018


No problem, and thank you for looking into it.

thedax on 29 Jun 2018

Yes, this issue renders RetroArch's Vulkan video driver unusable. Any core booted into fullscreen appears black. Alt-tabbing works as a temporary fix, though that seems to reveal another issue where vsync becomes disabled. Just confirming the behavior reported previously :)

Windows 10 1803 (Build 17134.112)
Nvidia driver 398.36

theoldsport on 1 Jul 2018

This PR should work around it and generally improve performance for toggling fullscreen: https://github.com/libretro/RetroArch/pull/6933
I avoid using oldSwapchain now on Windows ... @twinaphex said it worked on nVidia now.

@pdaniell-nv any idea why toggling vsync on Windows triggers a lot of black frames or display mode changes (press space while in a game to toggle this)? We want seamless change between vsync and non-vsync. Linux can do this, but not Windows for some reason.

We're also seeing crashes on nVidia when windows are minimized for some reason.

Themaister on 1 Jul 2018

> @pdaniell-nv any idea why toggling vsync on Windows triggers a lot of black frames or display mode changes (press space while in a game to toggle this)? We want seamless change between vsync and non-vsync. Linux can do this, but not Windows for some reason.

Yeah, my 120hz monitor gets forced to 60hz mode when I fast forward (which toggles vsync) with the new nightly build. Then if I fast forward again I get a black screen. The GL and D3D drivers have never had this type of issue.

Awakened0 on 2 Jul 2018

This is still being investigated as an NVIDIA driver bug, but when I looked at the RetroArch WSI code, I noticed a couple of things:

- It doesn't do anything to ensure the old swapchain images are idle before destroying them. The application is still responsible for ensuring there is no outstanding rendering to any acquired swapchain image before destroying the swapchain, even if the swapchain is out of date. Not doing so could result in DEVICE_LOST errors, corrupted rendering, etc. RetroArch may be avoiding this by luck, or because the rendering is currently simple enough that it always completes by the time vkDestroySwapchainKHR is called, but even if we fix our driver bugs, this might still sporadically result in the same symptoms seen here.
- While there's code to handle re-creating the swapchain if AcquireNextImage() fails, it loops at most one time. There's no guarantee the new swapchain won't immediately be out of date as well. Correct code loops indefinitely when it gets OUT_OF_DATE, and exits immediately on other errors. Generally, repeated OUT_OF_DATE errors only happen when a user is rapidly resizing a window, but there's always the possibility a popup window quickly appears then goes away and triggers the same race condition.

cubanismo on 2 Jul 2018
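The second point above (loop on OUT_OF_DATE, exit on anything else) can be sketched roughly as follows. This is only an illustration with simulated types, not RetroArch's actual code: `wsi_result`, `acquire_fn`, `recreate_fn`, `acquire_with_retry` and the `fake_*` helpers are all hypothetical names. In a real backend the callbacks would presumably wrap vkAcquireNextImageKHR and the swapchain re-creation routine.

```c
#include <stdbool.h>

/* Hypothetical result codes standing in for the relevant VkResult values. */
typedef enum {
    WSI_SUCCESS,
    WSI_OUT_OF_DATE,
    WSI_ERROR
} wsi_result;

/* Hypothetical callbacks; a real backend would wrap its acquire call
 * and its swapchain re-creation routine here. */
typedef wsi_result (*acquire_fn)(void *ctx);
typedef bool (*recreate_fn)(void *ctx);

/* Re-create the swapchain for as long as acquire reports OUT_OF_DATE.
 * Any other result (success or a different error) ends the loop at once. */
static wsi_result acquire_with_retry(void *ctx, acquire_fn acquire,
                                     recreate_fn recreate)
{
    for (;;) {
        wsi_result res = acquire(ctx);
        if (res != WSI_OUT_OF_DATE)
            return res;
        if (!recreate(ctx))
            return WSI_ERROR; /* re-creation itself failed: give up */
    }
}

/* Simulated surface: the next N acquires report OUT_OF_DATE. */
typedef struct {
    int stale_acquires_left;
    int recreate_count;
} fake_surface;

static wsi_result fake_acquire(void *ctx)
{
    fake_surface *s = ctx;
    if (s->stale_acquires_left > 0) {
        s->stale_acquires_left--;
        return WSI_OUT_OF_DATE;
    }
    return WSI_SUCCESS;
}

static bool fake_recreate(void *ctx)
{
    fake_surface *s = ctx;
    s->recreate_count++;
    return true;
}
```

With a `fake_surface` that is stale three times in a row, `acquire_with_retry(&s, fake_acquire, fake_recreate)` re-creates three times before succeeding, whereas a loop capped at one retry would have given up.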

@Themaister The black frames or display mode changes when toggling vsync could be a result of transitions into and out of "fullscreen exclusive" mode in Windows. I'm having an active discussion on this issue internally so several folks are aware and we're digging into it.

Also, the crash you're seeing on minimization might be similar to an issue that has also affected other apps. Several months ago we modified the NVIDIA driver to follow the spec more correctly with our VK_ERROR_OUT_OF_DATE_KHR behavior. The Vulkan spec itself has also changed around this area to try to make this clearer. This caused a problem for some apps that weren't handling VK_ERROR_OUT_OF_DATE_KHR correctly. With RetroArch I observed that minimizing it did indeed cause our implementation to return VK_ERROR_OUT_OF_DATE_KHR when vkQueuePresentKHR was called. I then noticed a follow-on call to vkAcquireNextImageKHR with the swapchain parameter set to NULL. This is probably the crash you observed.

pdaniell-nv on 2 Jul 2018

@cubanismo, @pdaniell-nv Thanks. So this is what the spec says:

The application must not destroy a swapchain until after completion of all outstanding operations
on images that were acquired from the swapchain. swapchain and all associated VkImage handles are
destroyed, and must not be acquired or used any more by the application.

The way I read it, you can destroy a swapchain after vkDeviceWaitIdle, basically the VkImages can go "poof" ala vkDestroyImage, and we do make sure we don't render to stale images.

One thing we do however, is to make sure an image is always acquired and ready to go. This might be a weird quirk, but it is intended to make the swapping logic as similar to GL as possible (present + acquire in the same place).

Something which can occur is:

vkAcquireNextImageKHR(fence = sync_fence);
vkWaitFences(sync_fence);
vkDestroySwapchainKHR

as long as I don't actually render into the acquired image, I should be good though?

As for the minimization issue, that seems like the problem. I don't think we loop on OUT_OF_DATE, creating new swapchains until we can successfully acquire. I'll probably need to sort that out on my end. There are some checks for min/max extent being 0 as per the spec, but I never trigger that on my machines, so I never hit this issue first hand.

Do I need to pump the Windows messaging loop when creating new swapchains, or can I just spin until it succeeds?

As for fullscreen exclusive, we try to toggle between FIFO and MAILBOX if present, but it sounds like both modes should be able to use fullscreen exclusive?

Themaister on 4 Jul 2018

Another PR: https://github.com/libretro/RetroArch/pull/6945. Seems to have resolved the minimization issues.

Themaister on 4 Jul 2018

@Themaister The specific case I was worried about, which your PR does seem to address at the expense of any oldSwapchain usage, was something like:

-AcquireNextImage() == SUCCESS
-Render to acquired image
-QueuePresent() == OUT_OF_DATE
-DestroySwapchain() <-- Might destroy images with rendering in flight.

Because I didn't see any CPU-side synchronization/waits happening in this chain of events.

Pumping any message queues whenever you're looping indefinitely would probably be wise, but I don't know enough about Windows internals to say for sure what exactly is necessary. If you're trying to create a swapchain of a particular size, it should be sufficient to re-query the Vulkan surface and adjust the swapchain size within the loop unless some action is needed in the native message loop to apply the window resize natively or something.

I think we might only report IMMEDIATE in fullscreen on Windows. Not sure. On Linux we always report only IMMEDIATE and FIFO. For portable code, re-query the available present modes at startup and whenever you get OUT_OF_DATE. All VkSurface state is volatile and can potentially change when you get OUT_OF_DATE, including available present modes.

cubanismo on 5 Jul 2018
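The ordering concern in the comment above (present returns OUT_OF_DATE, then the swapchain is destroyed while rendering may still be in flight) can be illustrated with a small simulation. Everything here is hypothetical: `idle_device`, `sim_wait_idle`, `sim_destroy_swapchain` and `on_present_out_of_date` are made-up stand-ins, with the wait standing in for something like vkDeviceWaitIdle or per-frame fence waits and the destroy standing in for vkDestroySwapchainKHR.

```c
#include <stdbool.h>

/* Simulated device state; all names are hypothetical. */
typedef struct {
    bool swapchain_alive;
    int  frames_in_flight;      /* submitted work not yet known to be done */
    bool destroyed_while_busy;  /* set if teardown ran with work pending   */
} idle_device;

/* Stand-in for a CPU-side wait such as vkDeviceWaitIdle. */
static void sim_wait_idle(idle_device *dev)
{
    dev->frames_in_flight = 0;
}

/* Stand-in for vkDestroySwapchainKHR that records misuse. */
static void sim_destroy_swapchain(idle_device *dev)
{
    if (dev->frames_in_flight > 0)
        dev->destroyed_while_busy = true;
    dev->swapchain_alive = false;
}

/* Handling for an out-of-date present: wait first, then destroy. */
static void on_present_out_of_date(idle_device *dev)
{
    sim_wait_idle(dev);
    sim_destroy_swapchain(dev);
}
```

Calling `sim_destroy_swapchain` directly on a device with pending frames sets `destroyed_while_busy`, which is the hazard described; going through `on_present_out_of_date` does not.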

Right, I pick either IMMEDIATE or MAILBOX depending on what's supported, with MAILBOX being preferred. This is queried every time, so should be fine. (If neither MAILBOX nor IMMEDIATE is supported, FIFO fallback).

I also changed the handling of Acquire/QueuePresent failures to just stall and delete the swapchain right away. There is little value in using oldSwapchain when I hit errors or resize anything, since there is no image to actually reuse, and it should greatly improve robustness. oldSwapchain should theoretically help when I want to switch present mode without any error having occurred.

The actual swapchain size being created will depend on which sizes I get from min/maxExtent and currentExtent. This is typically driven by the window itself (currentExtent != -1), so sometimes I end up creating a swapchain with a different size than what I anticipate. This is fine and expected. I just wondered if you have to pump the event loop to break out of the OUT_OF_DATE loop somehow.

I'm generally a bit concerned with how this min/maxExtent == 0 query is supposed to work. I mean, after I query min/maxExtent, I'm not guaranteed that whenever I call vkCreateSwapchainKHR, those parameters are even valid anymore, since apparently it can just magically change behind the scenes. If the user managed somehow to minimize the window after I queried those values (i.e. race condition), I'll likely end up breaking something.

Themaister on 5 Jul 2018
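The present-mode preference described in the comment above (MAILBOX preferred, then IMMEDIATE, with FIFO as the fallback) might look something like this. It is a sketch, not the RetroArch implementation: `present_mode_t` and `choose_present_mode` are hypothetical names mirroring the VkPresentModeKHR values, and the `vsync` parameter that selects FIFO outright is an assumption based on the earlier remark about toggling between FIFO and MAILBOX. The mode list is assumed to come from a fresh query each time, as the thread recommends.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical enum mirroring the relevant VkPresentModeKHR values. */
typedef enum {
    PRESENT_IMMEDIATE,
    PRESENT_MAILBOX,
    PRESENT_FIFO
} present_mode_t;

/* Pick a present mode from the freshly queried list.
 * Assumption: with vsync requested, FIFO is used directly, since the
 * Vulkan spec requires FIFO to be supported on every surface. */
static present_mode_t choose_present_mode(const present_mode_t *modes,
                                          size_t count, bool vsync)
{
    bool has_mailbox   = false;
    bool has_immediate = false;
    size_t i;

    if (vsync)
        return PRESENT_FIFO;

    for (i = 0; i < count; i++) {
        if (modes[i] == PRESENT_MAILBOX)
            has_mailbox = true;
        else if (modes[i] == PRESENT_IMMEDIATE)
            has_immediate = true;
    }

    if (has_mailbox)
        return PRESENT_MAILBOX;   /* preferred non-vsync mode */
    if (has_immediate)
        return PRESENT_IMMEDIATE;
    return PRESENT_FIFO;          /* neither available: fall back */
}
```

On a surface reporting only IMMEDIATE and FIFO (as described for Linux above), a non-vsync request would resolve to IMMEDIATE.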

OK, that all sounds good. For your remaining concerns:

- Yes, min/maxExtent == 0 is a problematic case (IIRC, creating a swapchain of size 0 will always fail on our drivers as required by the spec, though some drivers from other vendors will let it succeed anyway). Currently applications should handle this corner case specially. We're looking for better solutions at the spec level here, but it's difficult due to all the interactions with other parts of the spec.
- The query + create operation is a race by design. A new swapchain may fail creation if the surface changed between when the application queried its properties and swapchain creation time, or it might succeed and immediately return out of date on first use. With OpenGL+EGL/GLX/WGL, drivers effectively handle this race internally with visible and not-entirely-predictable side effects for applications, so the handling was pushed up to the application for Vulkan WSI to make the behavior well-defined in all cases, at the expense of application complexity. Unfortunately, I believe we neglected to document exactly what failure mode to expect in this case in the spec, but I believe that oversight is being tracked via an internal issue in Khronos already.

cubanismo on 5 Jul 2018
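One way to handle the extent cases discussed above (currentExtent driven by the window, the "-1" sentinel, and min/maxExtent == 0 while minimized) is sketched below. The struct and function names (`extent_t`, `surface_caps_t`, `choose_swapchain_extent`) are hypothetical stand-ins for VkExtent2D and the relevant fields of VkSurfaceCapabilitiesKHR. As the comment notes, the result can still be stale by the time the swapchain is created, so this narrows the race rather than removing it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirrors of VkExtent2D and part of VkSurfaceCapabilitiesKHR. */
typedef struct { uint32_t width, height; } extent_t;
typedef struct { extent_t current, min, max; } surface_caps_t;

static uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Decide the swapchain size from freshly queried capabilities.
 * Returns false when no swapchain should be created right now,
 * e.g. because the window is minimized and the extents are zero. */
static bool choose_swapchain_extent(const surface_caps_t *caps,
                                    extent_t wanted, extent_t *out)
{
    if (caps->max.width == 0 || caps->max.height == 0)
        return false;

    if (caps->current.width != UINT32_MAX) {
        /* The window dictates the size; the requested size is ignored. */
        *out = caps->current;
    } else {
        /* Sentinel value: the application chooses, within min/max. */
        out->width  = clamp_u32(wanted.width,  caps->min.width,
                                caps->max.width);
        out->height = clamp_u32(wanted.height, caps->min.height,
                                caps->max.height);
    }

    return out->width != 0 && out->height != 0;
}
```

A caller that gets `false` back would skip swapchain creation for now and try again later, instead of passing a zero-sized extent to vkCreateSwapchainKHR.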

Right, if vkCreateSwapchainKHR can spuriously fail due to races (e.g. min/maxExtent suddenly becoming 0 behind the scenes), there needs to be some kind of error code which can signal this so I can spin on that, otherwise I have a fatal error and probably need to take the application down, ala EBUSY/EAGAIN for nonblocking file descriptors.

That said, RetroArch should now be quite robust against swapchain creation failing as well. Frames will simply be dropped if there is no swapchain. It will try again next frame to create a swapchain, which is something I need in order to deal with the min/maxExtent == 0 case.

This goes back again into recent Vulkan Ecosystem discussions around WSI platform specific details not being well documented.

Themaister on 6 Jul 2018
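The "drop the frame and try again next frame" behavior described in the comment above can be modelled like this. It is a simplified simulation with hypothetical names (`video_state`, `try_create_swapchain`, `run_frame`), not the RetroArch code; `surface_usable` stands in for whatever makes swapchain creation succeed, such as the window no longer being minimized.

```c
#include <stdbool.h>

/* Simulated video driver state; all names are hypothetical. */
typedef struct {
    bool has_swapchain;
    bool surface_usable;   /* e.g. false while the window is minimized */
    int  frames_presented;
    int  frames_dropped;
} video_state;

/* Stand-in for swapchain creation: succeeds only on a usable surface. */
static bool try_create_swapchain(video_state *v)
{
    v->has_swapchain = v->surface_usable;
    return v->has_swapchain;
}

/* One frame: if there is no swapchain and it cannot be created yet,
 * drop the frame instead of treating the failure as fatal. */
static void run_frame(video_state *v)
{
    if (!v->has_swapchain && !try_create_swapchain(v)) {
        v->frames_dropped++;
        return;
    }
    v->frames_presented++;
}
```

Running two frames with `surface_usable` false and then one with it true yields two dropped frames followed by a presented one, with no error propagated to the caller.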

I tested RetroArch with a GSync display today. With GSync enabled, there is about a 2 second freeze when VSync disengages when I use fast forward. I don't seem to get the flickers or other issues after fast forward is disengaged anymore though. With GSync the freeze happens with the GL driver too. Enabling Threaded Video in RetroArch will fix it in both drivers, and that option seems to run smoother with GSync than it does with VSync. I think Threaded Video adds more latency though...

Awakened0 on 2 Aug 2018

Fast forward is much better in Vulkan for me now with @Themaister 's recent workaround (https://github.com/libretro/RetroArch/pull/7191). However, randomly (maybe 1/4th of the time on average) when disengaging FF I get a "stale frame" for a second. Sometimes that frame will stay up for two seconds or so before the game starts playing normally again. I haven't noticed any issue when engaging it.

I'm testing with Gsync enabled and "Sync To Exact Content Framerate" enabled in RetroArch's Frame Throttle settings though. It probably works fine with normal Vsync.

Awakened0 on 9 Sep 2018

Further commits fixed the fast forwarding bugs completely with Vsync or Gsync. Thanks again for the workaround, @Themaister 👍

Awakened0 on 13 Sep 2018

Feels good to hear positive feedback about this! I'm sure @Themaister is glad about this as well!

twinaphex on 13 Sep 2018

