RPCS3: Ninja Gaiden Sigma Windows and Linux performance gap

Created on 7 May 2019 · 4 Comments · Source: RPCS3/rpcs3

In https://github.com/RPCS3/rpcs3/issues/5913 I estimated that the reason for the greatly reduced performance in windows was related to the sys_timer_usleep call, but before I opened an issue on it, I wanted to first see if that was what was really causing the slowdown.

I tried a whole lot of ultra hacky workarounds but none of them had any impact on performance. I tried reverting the change to make usleeps accurate, and that reduced cpu time spent in rpcs3, but performance was unchanged. I tried making threads return 10% earlier than they request, which had no effect on performance. I even accidentally made all threads return instantly from the call, which apparently ninja gaiden sigma doesn't crash from.
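
For illustration, the "return 10% earlier" experiment could look roughly like the sketch below. This is not RPCS3's code: `shortened_sleep_us` and `hacked_usleep` are hypothetical names, and `std::this_thread::sleep_for` stands in for whatever the emulator's sys_timer_usleep path really does.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: shave a percentage off a requested sleep duration.
// Note: requested_us * percent_early can overflow for absurdly large requests;
// that is ignored here because this is only a sketch.
inline std::uint64_t shortened_sleep_us(std::uint64_t requested_us, unsigned percent_early = 10)
{
    assert(percent_early <= 100);
    return requested_us - (requested_us * percent_early) / 100;
}

// Hypothetical stand-in for the emulator's sleep path (NOT the real sys_timer_usleep):
// sleeps for 90% of what the guest thread asked for.
inline void hacked_usleep(std::uint64_t requested_us)
{
    std::this_thread::sleep_for(std::chrono::microseconds(shortened_sleep_us(requested_us)));
}
```

A guest request of 1000 µs would then sleep for about 900 µs, subject to the host scheduler's own granularity.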

Finally, I tried setting the min quantum time to 1000, which led to two threads yielding at the same time, but there was no performance impact, only higher cpu usage.

Booting back into linux, I tried setting the min quantum time to 500, matching windows, and cpu usage was now the same as windows, but performance was still locked to 60fps.

It's clear then that whatever is slowing the game down on windows isn't easily profiled. The only 2 things left that I could still test would be using a windows build of rpcs3 compiled with clang or gcc, or using the proprietary amd vulkan driver on linux. But I doubt that either of those would explain the gap.

At this point I can only guess where the gap is coming from. I speculated that usleep was responsible because it was visible in the profiler, but now there's nothing else that stands out to me in the profiler.

As a reminder this is what rpcs3 usage looks like in ninja gaiden sigma on windows:
[image: ninja_gaiden_windows]

PPU usage is fairly high, but it's not responsible for slowing the game down.
SPU usage is low.
RSX usage is fairly high, but it's pretty far from being maxed out.

Discussion AMD Windows

All 4 comments

The thing that made me open this issue was the fact that performance was low, even though PPU, SPU, and rsx usage were all low. It didn't make sense to me that there was no clear bottleneck appearing anywhere.

When I started installing more games on windows to benchmark them, they weren't working. I then deleted my config files, and with the new config files they worked. Ninja gaiden sigma also was no longer exhibiting the strange behavior where it wouldn't max anything. It still doesn't reach 60fps in my benchmarking areas, but it's now clearly limited by the RSX thread, and I think that 98% of that can be attributed to compiler and driver differences.

I don't know what cursed shit was in my config files, but that install wasn't touched for over 2 years so it's not surprising I guess.

At least I learned a whole lot.

I did some additional research and decided to reopen the issue.

I tried using vtune to profile the game on windows and I noticed that something in the amd vulkan driver was taking a whopping 6 seconds (out of a 30 second capture) to complete.
[image: amdvlk64]

It seemed impossible to figure out what was taking so long in the driver, so I built the amdvlk driver on linux with symbols enabled. (https://github.com/GPUOpen-Drivers/AMDVLK)
The top function for amdvlk.so was this:
[image: sampler]
I commented out the code that emitted the samplers around line 1590 in VKGSRender.cpp and noted a performance increase. I tried the same on windows and the performance was noticeably better.
[image: ninjagaidensampler]
We are almost at 60fps now!

However, this breaks the visuals in obvious ways. I tried reverting https://github.com/RPCS3/rpcs3/commit/af1b13550b8b775f5ccbbcad7b7d303097a4ff75 assuming that the patch was made without knowing how slow creating samplers is on the amd driver, but performance degraded.

Creating samplers in amdvlk is just unreasonably fucking slow. Radv is legitimately 50 times faster than amdvlk at this. I don't think there's going to be any sane way to work around this so I'll file an issue for amdvlk later. For most things radv seems on average 2-4 times faster. (It's generally worse in terms of gpu utilization though.)

Finally, I'd also like to mention that even though amdvlk is so much slower than radv, I'm still getting 60fps with amdvlk at all times on linux. I'm almost entirely sure that this just comes down to compiler differences. I spent a lot of time chasing things that had higher cpu usage on windows, but I've concluded that they're just symptoms of the rsx thread running too slow.

The next thing I'll try is a windows build with clang or gcc, and then I'd like to start contributing to rpcs3 in other ways.


notes

I think it's pretty funny that I earlier said

"The only 2 things left that I could still test would be using a windows build of rpcs3 compiled with clang or gcc, or using the proprietary amd vulkan driver on linux. But I doubt that either of those would explain the gap."

but that ended up being exactly the issue.

I first noticed that get_system_time was taking longer on windows, so I wrote a version that simply reads rdtsc and then divides by a constant to get microseconds. It was 10 times faster, but there was no performance increase so I dropped that. (It did save 1% cpu usage on linux though)
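
A minimal sketch of the "read rdtsc and divide by a constant" idea follows. It is not the version described above: `fast_system_time_us`, `ticks_to_us` and the 3000 ticks-per-microsecond constant (i.e. an assumed 3.0 GHz invariant TSC) are all hypothetical, and a real implementation would need to calibrate the TSC frequency for the machine it runs on. On non-x86 targets the sketch falls back to `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_HAVE_RDTSC 1
#elif defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#define SKETCH_HAVE_RDTSC 1
#else
#define SKETCH_HAVE_RDTSC 0
#endif

// ASSUMPTION: TSC ticks per microsecond. 3000 corresponds to a 3.0 GHz invariant TSC;
// the real value differs per CPU and would have to be measured at startup.
constexpr std::uint64_t k_ticks_per_us = 3000;

// Pure conversion step: integer division of a tick count by a constant.
inline std::uint64_t ticks_to_us(std::uint64_t ticks, std::uint64_t ticks_per_us)
{
    assert(ticks_per_us != 0);
    return ticks / ticks_per_us;
}

// Hypothetical replacement for a get_system_time-style call.
// Returns microseconds since an arbitrary origin, not wall-clock time.
inline std::uint64_t fast_system_time_us()
{
#if SKETCH_HAVE_RDTSC
    return ticks_to_us(__rdtsc(), k_ticks_per_us);
#else
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}
```

The appeal is that the hot path is one instruction plus a division, with no system call. The cost is that the result is only as accurate as the assumed tick rate.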

Then I wasted my time looking at usleep again. I've concluded that it uses usleep only when the ppu and rsx have both finished their frame and have nothing else to do. So it's natural that messing with usleep would never confer a performance increase, since it only gets called when the game is already performing perfectly.

I also got distracted by "NtWaitForKeyedEvent". I tracked it down to the ppu thread calling sys_cond_wait and concluded that it only happens when rsx is over budget and ppu is waiting on rsx so that it can start the next frame.

What a waste of time.

It really makes a big difference. :D
[Windows 10 / RX 570 ]
master / custom build
[image: 1]

@Whatcookie Thanks! I reverted the sampler pool approach after profiling on NVIDIA+windows; I did not realize things were that bad on AMD drivers. Keep the issue open; I have some ideas that could help with this situation and have acquired a polaris card for optimization targeting.
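
One general way to avoid paying for sampler creation repeatedly is to cache samplers by their state, so each unique configuration is created once and reused. The sketch below only illustrates that idea and makes no claim about how RPCS3's sampler pool worked: `SamplerDesc`, `SamplerHandle` and `SamplerCache` are hypothetical, the descriptor is heavily simplified, and the creation callback stands in for the real (expensive) driver call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <initializer_list>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified sampler state; a real Vulkan sampler has many more fields.
struct SamplerDesc
{
    std::uint32_t min_filter;
    std::uint32_t mag_filter;
    std::uint32_t address_mode;
    std::uint32_t max_anisotropy;

    bool operator==(const SamplerDesc& o) const
    {
        return min_filter == o.min_filter && mag_filter == o.mag_filter &&
               address_mode == o.address_mode && max_anisotropy == o.max_anisotropy;
    }
};

// Combines the field hashes so SamplerDesc can be used as an unordered_map key.
struct SamplerDescHash
{
    std::size_t operator()(const SamplerDesc& d) const
    {
        std::size_t h = 0;
        for (std::uint32_t v : {d.min_filter, d.mag_filter, d.address_mode, d.max_anisotropy})
            h ^= std::hash<std::uint32_t>{}(v) + 0x9e3779b9u + (h << 6) + (h >> 2);
        return h;
    }
};

using SamplerHandle = std::uint64_t; // stand-in for a driver handle such as VkSampler

class SamplerCache
{
public:
    using CreateFn = std::function<SamplerHandle(const SamplerDesc&)>;

    explicit SamplerCache(CreateFn create) : m_create(std::move(create))
    {
        assert(m_create != nullptr);
    }

    // Returns the cached handle for this state, creating it only on first use.
    SamplerHandle get(const SamplerDesc& desc)
    {
        auto it = m_cache.find(desc);
        if (it != m_cache.end())
            return it->second;

        SamplerHandle handle = m_create(desc); // the expensive driver call happens here
        m_cache.emplace(desc, handle);
        return handle;
    }

    std::size_t size() const { return m_cache.size(); }

private:
    CreateFn m_create;
    std::unordered_map<SamplerDesc, SamplerHandle, SamplerDescHash> m_cache;
};
```

With a cache like this, the creation cost scales with the number of distinct sampler states rather than the number of draws. A real version would also need to destroy the handles and bound the cache size.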
