Dolphin's emitter is more advanced and far better optimized than the one PCSX2 has. I think it should be possible to copy Dolphin's emitter to PCSX2 because PPSSPP is using Dolphin's emitter despite emulating a different architecture.
https://github.com/MoochMcGee/Emotionless/commit/b54812ae320beb37ec41303a84619a1ae678aefa
Not to mention it's 64bit.
Wow, amazing work. I think it should be a good one for AMD CPUs.
@ADormant I'm curious as to why you think the PCSX2 emitter isn't very optimised or advanced? It's very optimised and flexible. Also, the emitter is only really required for the compilation stage; once the code is compiled the emitter is not used. As long as it tells the recompiler to use the right instructions at compilation time (which is generally less than 200ms), how optimised it is doesn't matter.
Having a 64bit emitter will be needed in time, but it would be more advantageous (and easier to maintain) if we expand the functionality of the current one rather than trying to splice in an entirely new one.
They're probably being enthusiastic. Although the Dolphin emitters are actually used by multiple projects, and there is even an ARM64 one if you ever want to port.
There is that, although there are no plans to port to android or anything at the moment. We're not even completely convinced upgrading to 64bit is actually worth all the effort that will be required to update the current codebase to be compatible with it.
I'm curious as to why you think the PCSX2 emitter isn't very optimised or advanced?
Dolphin's emitter has support for the newest instructions like AVX2, FMA3/4, BMI1/2 and ABM. @refractionpcsx2
@ADormant This is true, however we only implement instructions which are useful to the emulation. There may be some AVX instructions which might be useful for the VU's (for the MADD instructions for example), but I'm pretty sure it's not as easy as just implementing the instructions and then using them, due to the different nature of AVX instructions. What also makes this more difficult is I have no idea how the emitter works :P
I think Dolphin is using these instructions mainly to increase performance, although AVX and FMA instructions can probably be used for more accurate FPU, ALU and VU emulation as well. From what I heard, floating point emulation especially is inaccurate in PCSX2. Perhaps Dolphin devs or someone else skilled at JIT can advise? @refractionpcsx2
I don't know about the GameCube or PSP, but the PS2 doesn't follow IEEE standards, so x86 rounding and float limits are different from those of the PS2.
Yeah I think somebody who knows about this on the dolphin side would be helpful
Looks like GC and Wii aren't fully IEEE compliant either.
https://code.google.com/p/dolphin-emu/issues/detail?id=6936
Yeah I think somebody who knows about this on the dolphin side would be helpful
@ mention whoever made the files then, most of the Dolphin devs are pretty active on GH.
@karasuhebi Okay I'll try.
Could someone offer some assistance regarding porting Dolphin's Emitter/JIT to PCSX2?
@hrydgard
@unknownbrackets
@FioraAeterna
@unknownbrackets
@Sonicadvance1
@degasus
@Tilka
@phire
@MoochMcGee
It's definitely an emitter. That is true.
lol @Sonicadvance1 thanks for clearing that up :p
It's an emitter, you don't need amazing JIT recompiler skills to understand it.
Whoa, why was I mentioned? The most I did was port the emitter to my own PS2 emulator! I haven't even started on a JIT yet. Most of the work was just porting it to my own logging system tho.
An emitter is an emitter, it just facilitates the writing of jitted code to memory.
All actual performance/accuracy issues are the problem of the JIT itself.
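To make the emitter/JIT distinction concrete, here is a minimal sketch of what "writing jitted code to memory" amounts to. The class and function names (`ToyEmitter`, `MovEaxImm32`, `Ret`, `EmitReturnConstant`) are invented for this illustration and are not taken from Dolphin or PCSX2; only the two x86 encodings (`B8 imm32` for `mov eax, imm32` and `C3` for `ret`) are real.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy emitter: all it does is append x86 machine-code bytes
// to a buffer. Deciding WHICH instructions to emit is the JIT's job.
class ToyEmitter {
public:
    // Encodes "mov eax, imm32": opcode 0xB8, then the immediate in
    // little-endian byte order.
    void MovEaxImm32(std::uint32_t imm) {
        code_.push_back(0xB8);
        for (int shift = 0; shift < 32; shift += 8)
            code_.push_back(static_cast<std::uint8_t>(imm >> shift));
    }

    // Encodes a near "ret": the single byte 0xC3.
    void Ret() { code_.push_back(0xC3); }

    const std::vector<std::uint8_t>& Code() const { return code_; }

private:
    std::vector<std::uint8_t> code_;
};

// Stand-in for a "recompiler" decision: build a routine returning a constant.
inline std::vector<std::uint8_t> EmitReturnConstant(std::uint32_t value) {
    ToyEmitter emit;
    emit.MovEaxImm32(value);
    emit.Ret();
    return emit.Code();
}
```

Calling `EmitReturnConstant(0x12345678)` yields six bytes. A real emitter would write into executable memory and cover far more encodings, but the quality of the generated code still depends on what the JIT asks for.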
@phire is right.
Also, @refractionpcsx2, AVX-512 may also be useful, because it extends the SIMD register set to 32 registers. There's also operand masking on most if not all instructions.
@refractionpcsx2 Please, think of all those users on 64-bit Linux. Those poor, poor souls, with no user-friendly way to play their beloved PS2 games.
@phire That's what I've been trying to tell them.
@MoochMcGee Does God kill a kitten every time we don't let them play? :P
But yes, extra registers would be useful to keep live registers for longer. AVX would be useful in some cases, but as I said I don't know how it's used, or whether you have to mess about like when going between MMX and SSE registers.
@refractionpcsx2 No, not really. He just angrily shakes his fist.
Well, we can't be having angry fist shaking, that may need to be worked on at some point, if people can be arsed!
AFAIK the registers are aliased although I don't know if there's some perf impact to changing pipelines or whatnot. See MOVUPD that shows different encodings and how they affect YMM regs.
Aren't the PS2's regs 64-bit (technically 128 but really it's just a pair)? Does PCSX2 just use SSE regs even for GPRs or something to work around this? I have not looked much at PCSX2's JIT, but last time I did it seemed like this was the case. I do feel like 64-bit would inarguably generate less code with more mappable regs, even if the code didn't have better timings (which it might), and less code still matters (we've seen this in PPSSPP even on x64 when we improved imm encoding in some instructions.)
Nevertheless, yeah, emitter probably doesn't matter. And the ARM64/ARM32/x64 emitters are not _all_ that similar, so it's not like not using the x64 one excludes the idea of using the ARM64 sometime in the future.
The floating point thing seems painful. From my ps2autotests, I seem to recall that it's basically using the full range of the exponent. If that's the bottleneck, then yeah, no emitter is really gonna fix that situation...
-[Unknown]
Yeah, it's just an emitter. PCSX2's one is different but not really less powerful. Extending an x86 emitter to x64 is not a big job, the only advantage of using Dolphin's instead would be that it would be easier for contributors used to Dolphin's emitter to work on PCSX2, but only marginally.
However I am 99% convinced that porting PCSX2 to 64-bit will be worth the effort, if anyone undertakes it. If nothing else, having 16 registers instead of 8 makes it easier to emit efficient code, and wider registers match the PS2 better. But as unknown says, it won't make a difference for the non-IEEE behaviour.
@ADormant Please don't spam so many random people. I guess you have mixed "emitter" with "jit". The emitter has no performance impact, and is /much/ less work compared with the jit. So on switching the emitter, you almost have to rewrite the jit.
In my opinion, it's nice to share as much code as possible, but it's not worth rewriting the JIT just to share the emitter. However, sharing the JIT is likely not wise, as the PS2 has a MIPS CPU and the Wii a PPC. Maybe we're able to share some IL->host assembly backends (e.g. the Dolphin JITIL), but this would require starting from scratch on both sides :/
Thank you very much for your input guys, I appreciate you all taking the time to reply to this. It is pretty much the result I expected.
@unknownbrackets the PS2 registers are very similar to SSE but can be seen as 64bit pairs, although generally the instructions are 32bit and the upper 32bits is just an extension of the sign. Apart from MMI instructions the upper 64bits generally goes untouched, but then MMI is pretty much SSE in 95% of all cases. The wider registers could be nice but they may also cause problems (with how the sign extending works etc).
@hrydgard I agree the 16 registers instead of 8 would be very much an advantage. It would cut down a lot of swapping in and out of the live CPU registers, making instructions execute much quicker, especially in the case of the VU's where it's handling lots of floats at the same time. I suspect AVX would be very nice for the VU's as well, although that doesn't require 64bit as such.
But beyond the extra registers, I'm not sure if there will be much in the way of advantages in a 64bit JIT
You also get extra address space, which can be very useful.
Dolphin uses 4 GB of sparsely allocated address space to implement fastmem, which provides a large speedup.
@phire I have yet to have an explanation of what fastmem actually is beyond what comes across as a buzzword. If you could explain how it functions (you don't have to go into massive detail, just the basic functionality of it) that would be interesting to read.
@refractionpcsx2 It's a fast path to access common memory. We allocate 4 GB of virtual memory and just try to access it on every load/store. If the address is not within common memory, this raises a segfault, which we're able to detect and then emit the slow path instead. The slow path, however, has to check for MMIO and different ranges.
Just read page 9 on https://www.exploit-db.com/docs/pocorgtfo06.pdf
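The reservation half of that idea can be sketched as follows, assuming a POSIX host. This is not Dolphin's code: the function names and the size parameters are hypothetical, and the segfault handler that would patch in the slow path is deliberately left out. The sketch reserves address space with no access rights, opens up only the part that stands for guest RAM, and then turns a guest address into a host pointer with a single add.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

// Reserves `guest_space` bytes of address space with no access rights and
// makes only the first `ram_size` bytes readable/writable. Touching any
// other part of the range faults, which is the event a fastmem
// implementation would catch in order to fall back to its slow path.
// Returns nullptr on failure. (Hypothetical helper for illustration.)
inline std::uint8_t* ReserveGuestSpace(std::size_t guest_space,
                                       std::size_t ram_size) {
    void* base = mmap(nullptr, guest_space, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return nullptr;
    if (mprotect(base, ram_size, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, guest_space);
        return nullptr;
    }
    return static_cast<std::uint8_t*>(base);
}

// Fast path: no range checks, just base + guest address.
inline std::uint32_t FastRead32(const std::uint8_t* base,
                                std::uint32_t guest_addr) {
    std::uint32_t value;
    std::memcpy(&value, base + guest_addr, sizeof(value));
    return value;
}

inline void FastWrite32(std::uint8_t* base, std::uint32_t guest_addr,
                        std::uint32_t value) {
    std::memcpy(base + guest_addr, &value, sizeof(value));
}
```

On a 64-bit host the first argument could be the full 4 GB mentioned above; smaller values work the same way and are easier to experiment with. Because the reserved pages are never committed, they do not consume physical memory.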
@degasus awesome thanks for the info, i'll have a read :)
But beyond the extra registers, I'm not sure if there will be much in the way of advantages in a 64bit JIT
As a Linux person, having to install a plethora of 32 bit libraries is a major turnoff.
@Tilka This is true, but the Linux market is a very small share of our users (about 1-3%), so focus on this has been very limited. A large majority of our users are on either 32bit or 64bit Windows, which is happy to run a 32bit app. The remaining Linux users might grumble about having to install the extra libraries, but generally do it anyway. Until very recently the performance of PCSX2 on Linux has been extremely poor, something which introducing a 64bit version wouldn't have fixed or even helped in the slightest, but now OpenGL is running pretty well on both sides, so who knows what the future may bring.
@degasus I had a look at that doc, we do something kind of similar, however probably not so fast or elegant. We do guess where most things are based on the address given, we can pretty much assume where everything is in memory without having to do an expensive TLB lookup, but in cases where we can't do this, it does check the TLB. We don't have any sort of "just go for it" type system in place however as far as I'm aware so that side might speed things up :)
Mhh.. speaking of 64 bit advantages they were discussed a week ago on pcsx2 forums
And bugs aside, you shouldn't ever need more than a gigabyte of RAM
@mirhl it depends. If you implement a virtual address space like Dolphin does, then you need more than 1 GB. We just store the base memory sizes of each of the components, the recompiler's compiled code space and of course cached textures, which are the biggest cause of us ending up using more than 1 GB. PCSX2 used to do a virtual address space thing, but it was massively unreliable as we had to provide a base address at compile time, and if that was in use on anybody's machine, PCSX2 refused to work. Strangely, the biggest problem we had was people running a Brazilian copy of Windows XP!
I'm sure there's other ways of doing it tho
Another option to consider would be https://github.com/herumi/xbyak which is used by https://github.com/benvanik/xenia
GSdx uses xbyak to obscure the software renderer. It's completely unreadable now but it does run faster :p
@refractionpcsx2 Why do you care about the linux user market share? For dolphin, up to half of the developers use linux. I think this is a very important share ;)
@mirhl Vram :P
@degasus I don't particularly :P In our case only 2 developers use Linux, I realise at the moment that's like 80% of our developers lol, but when everybody else is actually doing coding, it's less than 1/4 of us :)
Userwise there aren't many Linux users in comparison to Windows
@refractionpcsx2 Now it's time to guess why it's needed to be more attractive to linux devs :P
@degasus You're welcome to forward the effort by starting 64bit recs for the EE, IOP and VU if you like :P
@unknownbrackets the PS2 registers are very similar to SSE but can be seen as 64bit pairs, although generally the instructions are 32bit and the upper 32bits is just an extension of the sign. Apart from MMI instructions the upper 64bits generally goes untouched, but then MMI is pretty much SSE in 95% of all cases. The wider registers could be nice but they may also cause problems (with how the sign extending works etc).
PPSSPP is MIPS, although MIPS32. I feel confident that if we used SSE/MMX regs for addiu etc., it would be significantly slower. Things like this just look like crazy amounts of bloat to me.
I think the overhead of sign extending would not be huge, and theoretically the regcache could cache this. For example, it could have a state LOC_HOSTREG and a separate state LOC_HOSTREG32 (or just a bool flag.) Then whenever you need to map it as a 64-reg, you'd call MapReg(rt) and it would sign-extend there, and if you were clobbering it could ignore it. Plus if you only needed the lower 32-bits it could ignore the flag for now - MapReg(rt, MAP_LOWER32). This could allow you to "propagate" the unextended value through potentially a few instructions (I'm not sure how often games use 32 vs 64 bit arith.) It might never need to apply the fixup, if all it does with the 32-bit value is use it as a load/store addr or something (maybe?)
-[Unknown]
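The lazy sign-extension idea in the comment above can be modelled in a few lines. This is only a sketch of the bookkeeping, not PPSSPP or PCSX2 code: `RegState`, `CachedReg`, `WriteLower32`, `MapLower32` and `MapFull64` are invented names that loosely mirror the `LOC_HOSTREG` / `LOC_HOSTREG32` and `MapReg(rt, MAP_LOWER32)` ideas from the text, and the `fixups` counter exists purely to show when the extension actually happens.

```cpp
#include <cstdint>

// Hypothetical register-cache state: either the full 64-bit value is valid,
// or only the low 32 bits are and the sign extension is still pending.
enum class RegState { Extended64, Lower32Only };

struct CachedReg {
    std::uint64_t host_value = 0;
    RegState state = RegState::Extended64;
    int fixups = 0;  // number of sign extensions actually performed
};

// Result of a 32-bit operation: store it and defer the sign extension.
inline void WriteLower32(CachedReg& reg, std::uint32_t value) {
    reg.host_value = value;
    reg.state = RegState::Lower32Only;
}

// Consumer that needs only the low half (e.g. an address calculation):
// no fixup required.
inline std::uint32_t MapLower32(const CachedReg& reg) {
    return static_cast<std::uint32_t>(reg.host_value);
}

// Consumer that needs all 64 bits: apply the pending sign extension once.
inline std::uint64_t MapFull64(CachedReg& reg) {
    if (reg.state == RegState::Lower32Only) {
        // uint32 -> int32 wraps modulo 2^32 on mainstream compilers
        // (and is guaranteed to since C++20).
        const auto low = static_cast<std::int32_t>(
            static_cast<std::uint32_t>(reg.host_value));
        reg.host_value =
            static_cast<std::uint64_t>(static_cast<std::int64_t>(low));
        reg.state = RegState::Extended64;
        ++reg.fixups;
    }
    return reg.host_value;
}
```

If a value is produced by a 32-bit op and then only used through `MapLower32`, `fixups` stays at zero, which is the "it might never need to apply the fixup" case. Whether this pays off depends on how often games mix 32-bit and 64-bit arithmetic, which the thread leaves open.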
I have wanted to replace the x86 with SSE for a while, but I can never get myself in a headspace to get my head around all the handling ;p
Conditional operations (if/test stuff) are pure 64 bits on the EE. Going 64 bits will really reduce branching code. The real issue with 64 bits is the removal of 32 bits. On Linux, issues with 32-bit libraries are very rare on the forum nowadays, so it isn't an issue (and don't tell me that 800 MB of data is too much).
However I'm sure memory emulation could be faster. Current code uses 6/7 instructions for each mem access (kind of a TLB lookup). There was a plan in the past to support a "fast mem" feature like Dolphin's.
The mem access cost:
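xMOV( eax, ecx );                           // eax = guest address (passed in ecx)
xSHR( eax, VTLB_PAGE_BITS );                // eax = guest page index
xMOV( eax, ptr[(eax*4) + vtlbdata.vmap] );  // eax = vmap entry for that page
xMOV( ebx, 0xcdcdcdcd );                    // placeholder immediate, patched later through 'writeback'
uptr* writeback = ((uptr*)xGetPtr()) - 1;   // points at the immediate of the MOV just emitted
xADD( ecx, eax );                           // ecx = guest address + vmap entry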
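For readers who don't read emitter calls fluently, here is a plain C++ model of the direct-access part of the lookup that the sequence above generates, as I understand it. It is a simplification under stated assumptions: `ToyVtlb`, `MakeToyVtlb`, `MapPage`, `Translate` and the 4 KiB page size are hypothetical, host memory is modelled as an index into a vector rather than a real pointer, and the handler path (the `ebx` placeholder and `writeback` patching) is not modelled at all.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed page size for the sketch; stands in for VTLB_PAGE_BITS.
constexpr unsigned kPageBits = 12;
constexpr std::uint32_t kPageSize = 1u << kPageBits;

struct ToyVtlb {
    // One entry per guest page: (host offset of the page) - (guest page base).
    // Adding the entry to a guest address therefore gives the host offset.
    std::vector<std::int64_t> vmap;
    std::vector<std::uint8_t> host;  // backing storage
};

inline ToyVtlb MakeToyVtlb(std::size_t guest_pages, std::size_t host_bytes) {
    ToyVtlb t;
    t.vmap.assign(guest_pages, 0);
    t.host.assign(host_bytes, 0);
    return t;
}

inline void MapPage(ToyVtlb& t, std::uint32_t guest_page_base,
                    std::uint32_t host_offset) {
    t.vmap[guest_page_base >> kPageBits] =
        static_cast<std::int64_t>(host_offset) -
        static_cast<std::int64_t>(guest_page_base);
}

inline std::uint8_t& Translate(ToyVtlb& t, std::uint32_t guest_addr) {
    const std::uint32_t page = guest_addr >> kPageBits;  // xMOV + xSHR
    const std::int64_t delta = t.vmap[page];             // table load
    return t.host[static_cast<std::size_t>(guest_addr + delta)];  // xADD
}
```

Mapping two guest pages onto the same host page gives a mirror for free in this scheme, at the cost of a shift, a table load and an add on every access. That per-access cost is what a page-fault based scheme tries to avoid.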
The long term future :p
// --------------------------------------------------------------------------------------
// Future-Planned VTLB pagefault scheme!
// --------------------------------------------------------------------------------------
// When enabled, the VTLB will use a large-area reserved memory range of 512megs for EE
// physical ram/rom access. The base ram will be committed at 0x00000000, and ROMs will be
// at 0x1fc00000, etc. All memory ranges in between will be uncommitted memory -- which
// means that the memory will *not* count against the operating system's physical memory
// pool.
//
// When the VTLB generates memory operations (loads/stores), it will assume that the op
// is addressing either RAM or ROM, and by assuming that it can generate a completely efficient
// direct memory access (one AND and one MOV instruction). If the access is to another area of
// memory, such as hardware registers or scratchpad, the access will generate a page fault, the
// compiled block will be cleared and re-compiled using "full" VTLB translation logic.
//
// Note that support for this feature may not be doable under x86/32 platforms, due to the
// 2gb/3gb limit of Windows XP (the 3gb feature will make it slightly more feasible at least).
//
However, I'm missing the part about how to support address mirroring properly. The 'AND' is not enough: the issue is separating the 0x3 range (Ram) from the 0x1 range (Reg). Off the top of my head the mapping is
Msb => Mem
0x0 => Ram
0x1 => Reg
0x2 => Ram
0x3 => Ram
0x7 => Scratch pad
0x8 => Ram
0xA => Ram
0xB => Rom
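The collision described above can be shown with a small sketch. It takes the table exactly as listed (recalled from memory in the comment, so not authoritative) and assumes the planned single AND would use a 512 MB mask of `0x1FFFFFFF`, which is a guess based on the "512megs" figure in the source comment. `Region`, `ClassifyByTopNibble` and `MaskToPhysical` are hypothetical names.

```cpp
#include <cstdint>

enum class Region { Ram, Reg, ScratchPad, Rom, Unknown };

// Classifies an address by its top hex digit, following the table above.
inline Region ClassifyByTopNibble(std::uint32_t addr) {
    switch (addr >> 28) {
        case 0x0:
        case 0x2:
        case 0x3:
        case 0x8:
        case 0xA:
            return Region::Ram;
        case 0x1:
            return Region::Reg;
        case 0x7:
            return Region::ScratchPad;
        case 0xB:
            return Region::Rom;
        default:
            return Region::Unknown;
    }
}

// Assumed form of the "one AND": keep the low 29 bits (512 MB window).
inline std::uint32_t MaskToPhysical(std::uint32_t addr) {
    return addr & 0x1FFFFFFFu;
}
```

With this mask, `0x30000100` (a RAM mirror according to the table) becomes `0x10000100`, which lands in the register range, so the masked access would hit the wrong region. That is the reason a plain AND cannot tell the 0x3 range from the 0x1 range.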
I think one of the main problems are those clamping and rounding modes.
http://forums.pcsx2.net/Thread-blog-Whats-clamping-And-why-do-we-need-it
http://forums.pcsx2.net/Thread-blog-Nightmare-on-Floating-Point-Street
http://pcsx2.net/developer-blog/232-nightmare-on-floating-point-street.html
http://pcsx2.net/developer-blog/209-whats-clamping-why-do-we-need-it.html
http://forums.pcsx2.net/Thread-About-rounding-mode-clamping
Perhaps the Play! author can offer an opinion too, since Play! is already 64bit. @jpd002
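As a rough illustration of what "clamping" means in this context: the PS2's floating point units do not produce infinities or NaNs the way an IEEE host does, so an emulator can replace such host results with the largest finite value of the same sign. The function below is a simplified sketch of that idea only. `ClampToFinite` is a hypothetical name, and this is not PCSX2's implementation, which works on SSE registers and offers several clamping modes.

```cpp
#include <cfloat>
#include <cmath>

// Replaces host-side Inf/NaN results with +/-FLT_MAX, preserving the sign bit.
// Finite values pass through unchanged.
inline float ClampToFinite(float x) {
    if (std::isnan(x) || std::isinf(x))
        return std::signbit(x) ? -FLT_MAX : FLT_MAX;
    return x;
}
```

Applying a check like this after every floating point operation costs time, which is why clamping is a speed versus accuracy trade-off rather than something an emitter can fix.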
Ya know, @ADormant, Dolphin has issues just like those, and it usually has a fast path and a slow path. If certain checks about the inputs pass, it's able to use the less accurate fast path with no accuracy qualms. If they don't, Dolphin just uses the more accurate slow path.
@gregory38 you can just do real memory mirrors for a 4GB virtual address space (make the host operating system do all the work for you, since it's doing it anyway after all.) On 64 bit, such an address space isn't really an issue. That's what we do in PPSSPP (although we don't have to support hardware registers so we don't bother with the segfaulting.)
We only mask on 32 bit, but it's safe in our case. If 32-bit is not supported by fastmem (e.g. if this code is not used in the 32-bit build), then you don't need to worry about ANDing at all, right?
Edit: And just to say, we're able to optimize many loads and stores to a single instruction because of this. To handle a trampoline, I think it'll require some NOP padding, but definitely can be very fast for the common case. Fast mem is significantly faster than safe mem (when fast mem is disabled and checks are emitted.)
-[Unknown]
But how do you implement the mirroring? How do you store in 0x100 and read back in 0x3000_0100 for example? Why do you need 4GB to mirror 32MB 5 times?
You ask the operating system to map the same 32 MB of physical memory in 5 different locations.
Sure but how ;)
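One common way to do this on a POSIX host is to back the guest RAM with a shared file object and map that same object at several offsets inside a reserved range. The sketch below assumes POSIX `mmap` and uses `tmpfile()` as the backing object purely for portability; `MirroredSpace` and `MapMirrors` are hypothetical names, offsets and sizes must be page-aligned, and this is not claimed to be how PPSSPP or Dolphin implement it.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct MirroredSpace {
    std::uint8_t* base = nullptr;  // start of the reserved range, or nullptr
    std::size_t size = 0;
};

// Reserves `space_size` bytes, then maps the SAME `ram_size` bytes of backing
// storage at each of the given offsets. A write through one mirror is visible
// through all the others because they share the same physical pages.
inline MirroredSpace MapMirrors(std::size_t space_size, std::size_t ram_size,
                                const std::size_t* offsets, std::size_t count) {
    MirroredSpace out;
    FILE* backing = tmpfile();  // anonymous temporary file as shared backing
    if (backing == nullptr)
        return out;
    const int fd = fileno(backing);
    if (ftruncate(fd, static_cast<off_t>(ram_size)) != 0) {
        fclose(backing);
        return out;
    }
    void* base = mmap(nullptr, space_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) {
        fclose(backing);
        return out;
    }
    for (std::size_t i = 0; i < count; ++i) {
        void* want = static_cast<std::uint8_t*>(base) + offsets[i];
        // MAP_FIXED replaces the reserved pages at exactly this address.
        if (mmap(want, ram_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
            munmap(base, space_size);
            fclose(backing);
            return out;
        }
    }
    fclose(backing);  // the existing mappings keep the storage alive
    out.base = static_cast<std::uint8_t*>(base);
    out.size = space_size;
    return out;
}
```

For the example in the question, the offsets would include `0x0` and `0x30000000` inside a 4 GB reservation, so a store at `0x100` is read back at `0x30000100`. The 4 GB is address space rather than memory: only the 32 MB backing object is ever committed, and the gaps between mirrors stay inaccessible so stray accesses fault.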
Hi, I just got the mail about closing this "issue". It is nice to see those old discussions, and I wonder why it took that long to close this one here... However, afaik you do support 64bit now. Was there a blog post with a discussion about the outcome?
@degasus see #3451 and #3608
We've literally just added 64bit and it's not yet in a state we are completely happy to announce to the public. There will be a report highlighting it when the time is right.
As for this issue, I'd completely forgotten about it, one of the other developers brought it up for closing