RPCS3: [Optimization] Use 2 hw threads for PPUs

Created on 24 Mar 2019 · 3 comments · Source: RPCS3/rpcs3

Currently, the scheduler works by waking another thread and going to sleep every time a thread task ends. The maximum number of running threads is 2; the rest are asleep.
This is very inefficient on operating systems such as Windows, where there's a noticeable delay between the time a thread is notified and the time it is actually running, resulting in performance issues and, in some cases, unutilized CPU time.
To solve this, thread tasks need to run on 2 hardware threads alone: every time a task ends, instead of notifying a different thread, the current thread can switch context immediately and execute the next task (see the sketch below).
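A minimal sketch of the proposed scheme, assuming a shared task queue (the names `ppu_task`, `g_queue`, and `worker` are hypothetical, not RPCS3's actual identifiers): a finishing thread pops the next task and runs it in place, so the two hardware threads only block when no work is left.

```cpp
// Hypothetical sketch, not RPCS3 code: two worker threads drain a
// shared task queue in place instead of waking a sleeping thread
// for every finished task.
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using ppu_task = std::function<void()>; // stand-in for one PPU task

std::mutex g_mutex;
std::condition_variable g_cv;
std::deque<ppu_task> g_queue;
bool g_shutdown = false;

void worker()
{
    for (;;)
    {
        ppu_task task;
        {
            std::unique_lock lock(g_mutex);
            // Block only when the queue is empty; while tasks remain,
            // the same hw thread takes the next one immediately and
            // never pays the notify-to-wake latency.
            g_cv.wait(lock, [] { return g_shutdown || !g_queue.empty(); });
            if (g_queue.empty())
                return; // shutdown requested and no work left
            task = std::move(g_queue.front());
            g_queue.pop_front();
        }
        task(); // execute on the current hw thread
    }
}

int main()
{
    // Exactly two hw threads, matching the "max 2 running" constraint.
    std::vector<std::thread> pool;
    for (int i = 0; i < 2; i++)
        pool.emplace_back(worker);

    for (int i = 0; i < 8; i++)
    {
        {
            std::lock_guard lock(g_mutex);
            g_queue.push_back([i] { std::printf("task %d\n", i); });
        }
        g_cv.notify_one();
    }

    {
        std::lock_guard lock(g_mutex);
        g_shutdown = true;
    }
    g_cv.notify_all();
    for (auto& t : pool)
        t.join();
}
```

Running tasks back to back like this is exactly the "switch context immediately" behaviour described above; the condition variable is only touched when the queue runs dry.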

CPU · Discussion · Enhancement

All 3 comments

Do you have any real measurements that show this inefficiency?

First, you can't measure CPU time that isn't being used in the profiler.
But from what is measurable, sys_timer_usleep alone takes about 10 percent of CPU time in the TLoU profile, most of which comes from yield. That is CPU time which could have been avoided completely if the thread had switched to another task instead (the size of g_ppu is 6).
This is of course only part of the issue, but it's the part that is measurable.
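To illustrate where that yield time goes, here is a hedged guess at the pattern being described (a generic sketch, not RPCS3's actual sys_timer_usleep): a sleep implemented as a yield loop keeps the hardware thread spinning in the scheduler for the whole duration instead of running another PPU task.

```cpp
// Generic sketch of a yield-based sleep; the hw thread stays busy
// spinning for the whole duration, which shows up as CPU time in a
// profiler even though no useful work is done.
#include <chrono>
#include <thread>

void usleep_by_yield(std::chrono::microseconds usec)
{
    const auto end = std::chrono::steady_clock::now() + usec;
    while (std::chrono::steady_clock::now() < end)
        std::this_thread::yield();
}

int main()
{
    usleep_by_yield(std::chrono::microseconds(100));
}
```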

I remember experimenting with something similar a while back, using 'virtual cores' to run tasks, but I was doing this for SPU, not PPU. In the end, it was not worth the overhead (it ran much worse) and I ended up with a much simpler solution (the SPU concurrency choker). It is an interesting idea on paper; unfortunately, it's hard to determine its efficiency until an implementation is already done. In the end, since other OSes are less affected by this, could we maybe gather some comparative figures for this problem (Windows vs Linux performance, for example, with the same-ish compiler)? If it's only going to give 1% more performance for that much code complexity, it may not be worth the effort.
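One way to gather such figures: a minimal standalone sketch (not RPCS3 code) that times the gap between notify_one and the woken thread actually running. Building it with the same-ish compiler on Windows and Linux would give a first-order comparison of the wake-up latency being blamed here.

```cpp
// Minimal sketch: time the gap between notifying a sleeping thread
// and that thread actually running again.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    std::chrono::steady_clock::time_point notified;

    std::thread sleeper([&]
    {
        std::unique_lock lock(m);
        cv.wait(lock, [&] { return ready; });
        const auto woke = std::chrono::steady_clock::now();
        std::printf("wake latency: %lld us\n",
            static_cast<long long>(
                std::chrono::duration_cast<std::chrono::microseconds>(
                    woke - notified).count()));
    });

    // Give the sleeper time to block on the condition variable.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    {
        std::lock_guard lock(m);
        ready = true;
        notified = std::chrono::steady_clock::now();
    }
    cv.notify_one();
    sleeper.join();
}
```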

