Pcsx2: [Feature request] GSdx: detect supported instruction sets on runtime.

Created on 27 Aug 2015 · 75 Comments · Source: PCSX2/pcsx2

The current implementation creates five different forks of the same code using sse2, ssse3, sse4.1, avx, avx2, which are selected at build time. The user can then choose one of these plugins at runtime. Unfortunately, most users don't know anything about instruction sets and are puzzled about which of these five plugins is the best.

Would it be possible to merge all forks into one and move only the instruction-set-dependent part, exposing the necessary functions through an interface, into something that gets loaded dynamically at runtime?

I don't know how gsdx is structured or where the instruction sets make a difference, but I would believe OOP should enable such a possibility. I would even believe that, for maintainability reasons, you should already have every implemented...

The advantages are:

  • Simplification of plugin selection.
  • Possibly easier introduction of new/other promising instruction sets.

The disadvantages are most probably:

  • Increased code complexity.
  • Taking developer time away from more important stuff.

Enhancement / Feature Request GS

All 75 comments

Yes and no, I think.

As it's currently set up, the profiles let the compiler use the enhanced instruction sets on the standard code, in addition to having the segmented code for the specific software renderers. If it were to detect on the fly, the best GSdx could be compiled with is SSE2, essentially losing some of the benefits of the higher instruction sets.

As explained in:
https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties
http://stackoverflow.com/questions/7839925/using-avx-cpu-instructions-poor-performance-without-archavx

It's not fun. It would be a bit easier if there was no AVX/AVX2 code. AVX/AVX2 uses 256-bit registers, but modern CPUs only read/write 128 bits at a time, so the difference between SSE4 and AVX should not be much. I don't have an AVX CPU so I can't test it for PCSX2, but for some applications SSE2 code runs faster than the AVX equivalent.

I could be mistaken, but don't some instruction sets work better than others in different setups?

SSE4.1 works better than AVX and vice versa in certain setups? Maybe instead of having different plugins for each instruction set, just have one plugin and let the user pick which set? I guess that wouldn't help with the confusion, but at least it would all be in one plugin.

@tsunami2311
This was more or less the idea. Better to have a hidden advanced option where you select the instruction set than a forced selection choice before you even see one ingame scene.

@micove
The second source was quite promising. After AVX usage, just zero all upper bits to avoid the transition penalty. You can obviously do this at the end of the functions that use AVX. So the question is again how the GS plugin is written and how well the AVX code is encapsulated.

Actually, /arch:AVX also explains why AVX seems to be slower than SSE4.1 in hardware mode. Apparently AVX is not only used as an extension but rather replaces all occurrences of SSE. So better management could possibly lead to better results as well (if the switching penalty is avoided or sufficiently lowered).

Edit: revert stupid spell corrections of android

@refractionpcsx2
I would suggest putting the instruction-set-dependent code into separate DLLs, each compiled with its proper settings. The GS plugin itself contains the SSE2 code as the default implementation. At runtime the plugin looks for extension DLLs that override the SSE2 code. So in the end you have the same number of DLLs, but only one of them is recognized as a PCSX2 plugin (implementing the plugin interface), while all the others are only recognized by GSdx as instruction set extensions (implementing the extension interface). I would just create plugins within the GSdx plugin.

@willkuer the only problem there is PCSX2 will attempt to recognise them as plugins and cause errors every time, which may cause unnecessary threads on the support forums. I'm also not sure on the overhead that would be associated with that.

The warning I would actually only display in verbose logging. I think it doesn't help anyone at all and confuses new users even in the current situation. But this is a different issue.

I can not comment on overhead. If there is overhead we don't need to further discuss this.
I am programming all the time in C# where a scenario as I've described is kind of easy. And 'overhead' would only happen on dll load which would only happen once on gsinit. But I am not putting any inline asm in my code and C# is by itself kind of slow so I really can not judge.

I actually expected something like a postponed label. And if there will be somebody at some point interested in modular programming he/she could tackle it.

> I would suggest to put the instruction set dependent code into separate dll's compiled using their proper environment.

You don't need to put them in different DLLs; it's more a matter of putting everything that needs AVX in its own units/cpp files and compiling only those files with /arch:AVX (MSVC) or -mavx (GCC/Clang). Then use cpudetect functions to call either the SSE functions from files compiled with /arch:SSE2 or the AVX functions from files compiled with /arch:AVX. Also use zeroupper to guard the AVX code. Splitting all that and properly zeroing out the upper bits is the not-fun part.
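The dispatch described above could look roughly like the following. This is only a sketch under several assumptions: the function names (`sum_baseline`, `sum_avx`, `select_sum`) are hypothetical and not from GSdx, the GCC/Clang `target("avx")` attribute stands in for compiling a separate .cpp file with -mavx, and `__builtin_cpu_supports` stands in for whatever cpudetect helper the project would actually use.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

// Baseline path: built with the translation unit's default flags.
static float sum_baseline(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if HAVE_X86_DISPATCH
// AVX path. The target attribute is an assumption that replaces
// "put this in its own file and compile it with -mavx".
__attribute__((target("avx")))
static float sum_avx(const float* p, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (float v : lanes) s += v;
    for (; i < n; ++i) s += p[i];   // scalar tail
    _mm256_zeroupper();             // guard: clear upper halves before returning to SSE code
    return s;
}
#endif

using SumFn = float (*)(const float*, std::size_t);

// Picks the implementation once, based on what the CPU reports.
static SumFn select_sum() {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) return sum_avx;
#endif
    return sum_baseline;
}
```

A caller would store the result of `select_sum()` in a function pointer at init time and call through it afterwards, so the detection cost is paid only once.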

> Actually arch:avx explains as well why avx seems to be slower than sse4.1 in hardware mode.

I think the AVX plugin may be gimped since the core/deps are not compiled with /arch:AVX. There is stuff already like this in the code that supports that hypothesis:

    #if 0//_M_SSE >= 0x501

    // TODO: something isn't right here, this makes other functions slower (split load/store? old sse code in 3rd party lib?)

The Intel Compiler Suite tests for this but it's expensive. Well there is a 30 days trial for windows while I think it's free on Linux.

> This was more or less the idea. Better to have a hidden advanced option where you select the instruction set than a forced selection choice before you even see one ingame scene.

If nothing else comes of this issue, I'd like this to be explored. I think it would make PCSX2 much less intimidating for new users. Choosing the right GSdx plugin is basically the only plugin decision PCSX2 users need to make. All the other default plugins are fine. So if we could remove that decision, they could just move on from that window without having to do anything. Heck we might not even need to show them that window on first-time setup, maybe go straight into the BIOS window.

Honestly, there is code everywhere in gsdx. It would be painful to split the code properly.
1/ Drop avx1 and sse3. The former is not really faster and the latter is used by few CPUs (they can still download an old version). It will reduce the choice for the users.
2/ Maybe add a mechanism in plugin selection to select by priority avx2/sse4/sse2.

I think you mean ssse3 ;) but no, there's a reason that's there: unless you have a super new AMD chip you have to use that or sse2, as they didn't support sse4; they had something like sse4a.

Yes ssse3. Are you sure that bulldozer arch doesn't support avx? I think Phenom 2 is limited to sse2

By the way is there any benchmark of the different sse/avx. I think most of the speed impact, if any, is in GSState.cpp handler. This part could be dynamic because handlers are function pointers.

I know a lot of people use the FX6300 chip which only supports ssse3, beyond that I'm not sure where the change is.

I quickly looked at the code. The big changes are in the GSVector.h file. SSE code is inlined everywhere. It would be very hard to split the code.

> I know a lot of people use the FX6300 chip which only supports ssse3

FX-6300 also has support for SSE4 / SSE4.1 / SSE4.2 /AVX.

My bad :P I'm sure there was a bunch of chips that don't tho.

Core2Duo's and Core2Quad's support ssse3 but not sse4.1 iirc.

Maybe not that dominant anymore but most of older system have a Core2 installed. I could possibly benchmark ssse3 vs sse2 on such a system using games that are not that cpu hungry. Possibly ffx, ffxii, kh, kh2 and GoW. I think I don't need to check snowblind games and SotC.

But I would only do it if you consider Core2 hw to be relevant.

> Core2Duo's and Core2Quad's support ssse3 but not sse4.1 iirc.

Only the Conroe (65 nm) Core 2 Duos supported up to SSSE3; the Wolfdale (45 nm) Core 2 Duos supported SSE4.1.

https://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors#.22Conroe.22_.2865_nm.2C_1333_MT.2Fs.29

Not mentioned yet, but avx also has the vex encoding with arbitrary destination register, it is much less likely to spill data into temp registers on the stack. That alone could help a lot in pcsx2's recompiler, too.

Well the pcsx2.exe is currently compiled with

<EnableEnhancedInstructionSet>NoExtensions</EnableEnhancedInstructionSet>

In Debug/Devel/Release. Not sure if it's related to the AVX transition penalties or if there is another reason for it.

Most likely no penalties, because it's per thread and pcsx2 runs gsdx on its own thread; memcpy may use SSE instructions only.

> In Debug/Devel/Release. Not sure if it's related to the AVX transition penalties or if there is another reason for it.

To keep builds easily portable between users/devs? Or to avoid strange crashes due to missing stack alignment in the recompiler.

Well, I meant that all the versions of the core executable are compiled with something like -mno-sse -mno-sse2, etc under windows which is sort of odd and explains why some of the assembly code claims 10x speedups which made no sense unless the MS libraries were totally broken at some point. If I remember correctly in linux SSE/SSE2 is always enabled in the core so it uses non-gimped loops, memcpy and friends.

Well, the project is very old; it has come a long way. Various compiler options were not safe. Potentially SSE can be enabled safely. This way we could remove the remaining memcmp (and memset, which is mostly removed anyway).

An idea: the new instructions are mostly useful for the SW renderer. The SW renderer is a JIT compiler, so maybe the instruction set for the JIT could be detected at runtime.

Other ideas perhaps.

@mirh it isn't related. It is bad to mix SSE/AVX (AVX is a superset of SSE). However it means that you need to port all SSE code to AVX, if you want to use the latter properly. So there is the zero* trick to keep SSE code around.

Here a more interesting idea:
https://gcc.gnu.org/wiki/FunctionMultiVersioning

Anyway, I think the best place to start is the SW code generator. Overhead will be small. You just need plain C++ inheritance, nothing fancy.

Won't read through every comment, just want to mention that in the past I tried really hard with VTune to find any code penalized by this SSE/AVX mixing, and I could not find any. Old non-VEX-encoded code has to be mixed with AVX to trigger it, but the compiler encodes intrinsics with VEX, the generated code is also done that way, and the GS runs on its own thread (hopefully), so pcsx2 cannot interfere much. There is only one case: if one of the SSE builds were run on AVX-enabled CPUs, but why would anyone do that?

I still use SSE4.1 on my 6700K. When I was testing between them I saw graphical glitches with the AVX/AVX2 versions compared to SSE4.1 in some games, and no speed difference.

Though I think there should only be one plugin instead of the 5 there are now, and it should use what is appropriate.

> Though I think there should only be one plugin instead the 5 there is now and it should use what is appropriate

The problem is that the code is chosen at compile time, according to which intrinsics are compiled in. Having it selected on the fly will probably bring a speed penalty with it, unless it can all be done in classes that are instantiated under a single name, kind of like we do with the interpreter and recompilers in the main emu.
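The "classes instantiated under a single name" idea could be sketched like this. Everything here is hypothetical (the `Rasterizer` interface, the three subclasses, and `make_rasterizer` are illustration names, not GSdx classes), and the `fill` bodies are identical placeholders; in a real build each subclass would live in its own file compiled with the matching instruction set flags.

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class Isa { SSE2, SSE41, AVX2 };

// Single interface the rest of the plugin talks to.
class Rasterizer {
public:
    virtual ~Rasterizer() = default;
    virtual std::string isa_name() const = 0;
    virtual void fill(unsigned* dst, int count, unsigned color) const = 0;
};

// Placeholder loop shared by the sketch; real variants would differ per ISA.
static void fill_plain(unsigned* dst, int count, unsigned color) {
    assert(dst != nullptr || count == 0);
    for (int i = 0; i < count; ++i) dst[i] = color;
}

class RasterizerSSE2 final : public Rasterizer {
public:
    std::string isa_name() const override { return "SSE2"; }
    void fill(unsigned* d, int n, unsigned c) const override { fill_plain(d, n, c); }
};

class RasterizerSSE41 final : public Rasterizer {
public:
    std::string isa_name() const override { return "SSE4.1"; }
    void fill(unsigned* d, int n, unsigned c) const override { fill_plain(d, n, c); }
};

class RasterizerAVX2 final : public Rasterizer {
public:
    std::string isa_name() const override { return "AVX2"; }
    void fill(unsigned* d, int n, unsigned c) const override { fill_plain(d, n, c); }
};

// Created once at init; callers only ever see the base class.
std::unique_ptr<Rasterizer> make_rasterizer(Isa best) {
    switch (best) {
        case Isa::AVX2:  return std::unique_ptr<Rasterizer>(new RasterizerAVX2());
        case Isa::SSE41: return std::unique_ptr<Rasterizer>(new RasterizerSSE41());
        case Isa::SSE2:  break;
    }
    return std::unique_ptr<Rasterizer>(new RasterizerSSE2());
}
```

The cost of this design is one virtual call per entry point, so it only pays off if the virtual boundary sits around coarse operations rather than around individual vector operations, which is the concern raised about inlined SSE code.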

I don't know; I'll take your word for it, as I don't know software development. I just know that from a user standpoint having 5 different plugins can be confusing to people.

Core uses SSE2 everywhere. But recompiled code is generated based on auto-detection (but always SSEx). However, it misses lots of opportunities to use AVX operations in standard code (such as memcpy).

GSdx uses a JIT compiler for the SW renderer, so it would be possible to do the same here. However, it would be annoying for other paths that aren't recompiled. Hmm, it will be a nightmare with the various intrinsics. I'm not sure you could use AVX intrinsics if you don't enable AVX optimization.

In conclusion, the most efficient and easiest way is to build several versions of the plugin. However, maybe we could limit ourselves to 3 versions only: SSE2 (compatible with all), AVX2 (all optimizations), and AVX or SSE4?

Holy shit, Steam updated their survey to include AVX (at least 1) info too, I'm very happy :)

| ISA | total | increase |
| --- | --- | --- |
| SSE2 | 99.99% | 0 |
| SSSE3 | 91.37% | +0.27% |
| SSE4.1 | 85.51% | +0.23% |
| SSE4.2 | 82.19% | +0.22% |
| AVX | 69.28% | +1.26% |

If AMD doesn't screw up with zen, potentially AVX will be supported by ~75-80% start of next year.

Well, either we have a single dll, or we can live just fine with 5; either way there would still be a choice to make.

It would be nice if AMD pulled their heads out of their asses and made something that lived up to their hype. It might force Intel to rethink its prices.

Is there a speed increase between them? Like I said, when I tested AVX2 vs. the SSE4.2 that I've been using forever, I didn't see much if any increase. Then again, all I care about is that my games run at full speed, and most if not all do now. The only difference I saw was graphical corruption. This was in DQ8: when I was testing SW mode speeds I also tested AVX2/SSE4.2 in HW mode. AVX2 had graphical flashing/corruption; SSE4.2 did not. I was changing the plugin with the game still running, though; I didn't do a fast reboot, I used pause/resume.

And like I said, choice is nice, but the majority of people aren't going to know what these are; they are probably just letting it use whatever it defaults to.

If anybody's still interested in a solution; I've implemented CPU detection into the installer that selects the corresponding GSdx dll based on the highest supported instruction set of the user's CPU. Any thoughts?

IMHO, first we can drop the SSSE3 and AVX1 builds. The SW renderer will automatically use the best ISA (except AVX2, which is reserved for the AVX2 build). This way it is much easier to select the right plugin (note that it would be nice to compare the speed of the 3 remaining builds, SSE2/SSE4/AVX2). Low-power CPUs, where AVX2 is emulated with SSE operations, could impact the choice.

Can't we just put a check on the first time wizard then? If [Filenames]GS is not set in the ini.

Also, if dropping avx and ssse3 is just because their additional instructions are ~pointless I guess it can be only fair. Otherwise if it's just for low cpu counts.. I dunno, I have mixed feelings.

Only a couple of SSSE3 defines remain in the code. I doubt that they really increase performance (again, the SW renderer will use them automatically). Potentially it impacts the intrinsic/compiler optimization; again, a benchmark would be nice. AVX1 is even worse, as the compiler adds the extra vzeroupper instruction everywhere, which could explain why AVX1 was slower on AMD hardware.

So if you look at the code.

  • You have dedicated SSE4/AVX2 code to handle texture conversion (and 1 SSSE3 path, search _M_SSE)
  • You have either AVX2 or SSE2->AVX1 SW renderer
  • You have extra potential optimization that the compiler can add (dedicated AVX memcpy for example)

Compiler optimizations are really bad on vector code. So my guess is that the SSSE3/AVX1 speed boost is close to 0.

TIL about benchmarks, so that makes sense.

I stand by the idea of the check inside the UI then.

The bench was done before the automatic selection of the ISA for the SW renderer. That being said, you can see that SSSE3/AVX1 is useless on the HW renderer.

> The bench was done before the automatic selection of the ISA for the SW renderer.

Because we do have it now?

> That being said, you can see that SSSE3/AVX1 is useless on the HW renderer.

Yes indeed.

> Because we do have it now?

Yes we do. https://github.com/PCSX2/pcsx2/blob/master/plugins/GSdx/GSDrawScanlineCodeGenerator.cpp#L25
Search m_cpu.has in the project, for example in this file : https://github.com/PCSX2/pcsx2/blob/master/plugins/GSdx/GSDrawScanlineCodeGenerator.x86.cpp

We don't have a full runtime detection.

  • AVX2 build will only use AVX2 at run time.
  • SSEn/AVX1 build will select the SSEn/AVX1 SW renderer at run time.
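The partial runtime detection described in the two bullets above can be written down as a small selection rule. This is a sketch of the policy only: the `Isa` enum, the `CpuCaps` struct and both function names are hypothetical, and the real code queries the CPU through `m_cpu.has` in the code generator rather than through a struct like this. It also assumes that a CPU reporting AVX2 always supports AVX.

```cpp
#include <cassert>

// Ordered from least to most capable so values can be compared.
enum class Isa { SSE2 = 0, SSE41 = 1, AVX = 2, AVX2 = 3 };

// Hypothetical capability flags filled in by some CPU detection code.
struct CpuCaps { bool sse41; bool avx; bool avx2; };

// Highest instruction set the CPU reports.
Isa best_cpu_isa(const CpuCaps& c) {
    if (c.avx2)  return Isa::AVX2;
    if (c.avx)   return Isa::AVX;
    if (c.sse41) return Isa::SSE41;
    return Isa::SSE2;
}

// build: the instruction set the plugin binary itself was compiled for.
// Returns the instruction set the SW renderer code generator should emit.
Isa pick_sw_codegen(Isa build, const CpuCaps& cpu) {
    const Isa best = best_cpu_isa(cpu);
    // The binary must be runnable on this CPU in the first place.
    assert(static_cast<int>(build) <= static_cast<int>(best));
    // AVX2 build: only ever uses AVX2.
    if (build == Isa::AVX2) return Isa::AVX2;
    // SSEn/AVX1 builds: best SSEn/AVX1 generator available, never AVX2.
    return best == Isa::AVX2 ? Isa::AVX : best;
}
```

Under this rule an SSE2 build running on an AVX2-capable CPU would still get the AVX1 software renderer, which is why dropping the separate AVX1 build costs the SW renderer nothing.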

The only annoying thing is the potential penalty on the AVX/SSE transition. I don't know if we have it, as we use the new VEX encoding, which already zeroes the upper 128 bits. And the code (the scanline part) is already running in a separate thread.

Oh, right you had mentioned that.
Anyway, again (regardless of whatever you'll end up with avx) can't we just have a check in the UI when gsdx plugin is for the first time queried/assigned?

@refractionpcsx2 @turtleli ok to remove SSSE3/AVX1 ? I can't do it.

@mirh dunno, it is a Windows issue ;)

Sure, I can remove them if there aren't any major objections.

I have no real objections, SSE4 is available for those chips which support SSSE3, not sure about AVX1, it depends how much the earthmover range of AMD processors support

@refractionpcsx2 do you mean a CPU that supports AVX but not SSE4 ?

Based on https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html it might not exist.

> @refractionpcsx2 do you mean a CPU that supports AVX but not SSE4 ?

that's what I'm thinking yea, they support SSE4a but not SSE4 and they have AVX I believe.

Oh crap I need to double check I completely forgot it

So Bulldozer and later are fine. Previous CPUs were already limited to sse2. So in short, the impact for the HW renderer:

  • Core2 is limited to sse2
  • Sandy/Ivy/Bulldozer+ are limited to sse4

Speed impact is between none and faster (based on blyss benchmarks)

Many Core2 CPUs support SSE4.1 too.

Ah yes you're right, only the 65 nm version (2006-2007) is impacted. The "limited" core2 are around 5% of the market (4.75% based on steam stat).

Actually just checking the code again. It seems SSSE3 is used on the GSBlock stuff. But the gains seem to be small. It could still be bigger than 0 on some games.

Could the ssse3 code be assumed to work for cpu's with sse4? As it was an Intel thing those could be easily under the same umbrella

???

CPUs that support SSE4 will have both SSSE3 and SSE4 optimizations. The "minor" issue is that some C code (7 ifdefs) is used for a couple of functions, so SSSE3-only CPUs would need a dedicated build to use them. They will miss a couple of optimizations on some texture formats.

Right, I was just making sure :p

On Wednesday, January 25, 2017 8:35:27 AM EST refractionpcsx2 wrote:

> I have no real objections, SSE4 is available for those chips which support
> SSSE3, not sure about AVX1, it depends how much the earthmover range of AMD
> processors support

According to cat /proc/cpuinfo, CPU flags: "fpu vme de pse tsc msr pae mce
cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt
tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi
flexpriority ept vpid tsc_adjust smep erms dtherm ida arat".

SSSE3, SSE4_1, and SSE4_2 are clearly supported - is the difference between
SSE4_1/SSE4_2 and SSE4 great enough that the CPU doesn't support SSE4?

It doesn't quite work like that. Each version doesn't improve on the previous by enhancing his well it works, but just provides a few extra instructions. For our SSE4 plugin to work you need SSE4 support. The instructions added in 4.1 and 4.2 were pretty useless to us

Enhancing how it works*

sorry, on my mobile, no edit button (come on git hub!)

Actually I wrote 4 as a short name of 4.1
4a and 4.2 are useless. Anyway AMD is all or nothing. Intel is new instructions by gen

Cool #1790
Can we get back to the OP issue now? :s

> Unfortunately most user don't know anything about instruction sets and are puzzled which of these five plugins is the best.

Just a random idea, now that we have only 3 plugins (and 2 for ~50% of the market). Maybe we could just rename them with a fast adjective.

gsdx.so
gsdx-fast-sse4.so
gsdx-faster-avx2.so

-.-
The point is: no user tinkering is required at all.
Not "try to make stuff (that only advanced users would already touch/know in the first place) easier".

It was just an idea (that could have been implemented in 5 minutes).

But the idea will be deprecated once the plugins are automatically detected for the corresponding CPU's, which is the goal of this issue.

So is having the installer detect the CPU still a good option, or would we rather install everything/let PCSX2 detect the CPU? An installer solution would at least solve the issue on stable, but I don't know how much we care about this issue on dev builds.

Not like dev builds aren't the stable builds by now 😛
..
Jokes aside, even assuming dev builds weren't important, I can't understand why you should overcomplicate the installer when (I guess?) CPUID flag checking and selection would just be a few lines of code.

EDIT: now that I think about it, though, I'm not sure how (after the check) the UI should know that the dll "named this way" is sse4, while this one is sse2.

Well, as far as over-complicating the installer goes, I'm _way_ ahead of you 😛 (see #1699). The CPU check is relatively simple.

    ${If} ${CPUSupports} "AVX2"
        File /nonfatal ..\bin\Plugins\gsdx32-avx2.dll
        Goto FinishedCpuCheck
    ${ElseIf} ${CPUSupports} "SSE4"
        File /nonfatal ..\bin\Plugins\gsdx32-sse4.dll
        Goto FinishedCpuCheck
    ${Else}
        File /nonfatal ..\bin\Plugins\gsdx32-sse2.dll
    ${EndIf}

Lol.
Still feels really odd for something like this at installer time.

It's almost like we shipped xbox default controls with a pre-defined ini.

Otherwise (I didn't like it at first, but my opinion isn't fixed), when assigning a plugin for the first time (i.e. no GS plugin has been configured yet), search for AVX2 or SSE4 in the file name and assign a good default based on CPU detection.
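The first-time default described above, matching on the file name and on detected CPU features, might be sketched as follows. The function and struct names are hypothetical and not part of PCSX2; the tags `avx2`, `sse4` and `sse2` are taken from the dll names in the installer snippet, and the caller is assumed to have filled in `CpuCaps` from whatever CPU detection the emulator already has.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical capability flags supplied by the caller.
struct CpuCaps { bool sse41; bool avx2; };

// Returns the first file whose name contains `tag`, or an empty string.
static std::string find_by_tag(const std::vector<std::string>& files,
                               const std::string& tag) {
    assert(!tag.empty());
    for (const std::string& f : files)
        if (f.find(tag) != std::string::npos) return f;
    return std::string();
}

// Picks the best GS plugin file the CPU can run, falling back step by step.
std::string pick_default_gs_plugin(const std::vector<std::string>& files,
                                   const CpuCaps& cpu) {
    if (cpu.avx2) {
        const std::string f = find_by_tag(files, "avx2");
        if (!f.empty()) return f;
    }
    if (cpu.sse41) {
        const std::string f = find_by_tag(files, "sse4");
        if (!f.empty()) return f;
    }
    return find_by_tag(files, "sse2");  // empty if no plugin was found at all
}
```

Because the choice would only be made when no GS plugin is configured yet, a user who later picks a different build by hand would not be overridden.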

I'd like to request partially reverting pull #1790 to re-add AVX.

gregory38 stated in #1790 :

> Nobody complain (yet) let's go. Thanks for the update.

But I believe not many people complained because extracting orphis builds over your existing plugins keeps the old AVX one from when it existed.
I almost didn't notice it was not being updated until I looked at the timestamps and all the other plugins had newer dates.

I am on build 2067 and I get in Drakengard:
48 fps with SSE4.1/AVX GSdx 20170503164203 and
59-60 fps with AVX/AVX GSdx 20170127110023

I don't know if it matters, but it is running OpenGL Hardware - Nvidia 555m

There are also a lot of Sandybridge motherboards with i7 processors that only have AVX and not AVX2.

Sandybrige i7 my computer

You need to compare the same build version. There isn't any specific AVX code for the HW renderer. The SW renderer is still running in AVX anyway. So compiler may better optimize some code when AVX is enabled but so far you're the only one to report a real performance difference. It is way too big to be related to AVX vs SSE4.

> https://gcc.gnu.org/wiki/FunctionMultiVersioning

https://www.phoronix.com/scan.php?page=news_item&px=GCC-Clear-make-fmv-patch
But unless #2683 happens, the best that could happen in windows-land would still be the "first time wizard" thing

Honestly, I won't bother. So far 3 instruction sets remain. Eventually sse2 will die. It is 5% (3% Phenom2 and 2% old Intel CPUs) of the market nowadays (end of 2018). Meanwhile we could just report an error message for an SSE2 build that runs on an SSE4-capable CPU.

https://github.com/PCSX2/pcsx2/pull/3013 improves this so I guess we can close this now.
