Cxbx-reloaded: Solve RDTSC differences

Created on 1 Jan 2017 · 9 Comments · Source: Cxbx-Reloaded/Cxbx-Reloaded

Executing the RDTSC instruction (opcode 0F 31) on the host CPU without emulation probably gives different results than on the Xbox1.

We should measure the characteristics of the host RDTSC instruction:

  • if it behaves close enough to the Xbox, it requires no special handling
  • if it deviates too much from the Xbox, RDTSC must be emulated.

If the host RDTSC can be used, it's probably wise to base our KeQueryPerformanceCounter implementation on the host RDTSC instruction too (as that results in better interoperability between code using both this kernel API and the opcode).

If the RDTSC must be emulated, there are at least three possible methods:
1: Scan for the opcode and patch it (much like the patching of FS-accesses)
2: If possible, set the TSD (Time Stamp Disable) flag in CR4, which causes an exception when the RDTSC instruction is executed in user mode.
3: Run all code via a JIT engine, handling the RDTSC instruction specially

When using option 1, the patch must result in a value that increments with the same frequency as on the Xbox.

When using option 2, the RDTSC instruction causes an exception. Emulation of the Xbox characteristics can be done in our exception handler.

When using option 3, most code passing through the JIT engine can be executed as-is; only the RDTSC instruction needs special handling.

Be aware, though, that any kind of RDTSC emulation incurs some overhead, which reduces the granularity of the resulting value. This might become a problem in tight loops, but otherwise it's not a big deal.

For more details, see http://x86.renejeschke.de/html/file_module_x86_id_278.html and http://stackoverflow.com/questions/8322782/rdtsc-too-many-cycles

LLE cpu-emulation enhancement

All 9 comments

The RDTSC opcode is very short, only two bytes (0F 31).
This makes a search/replace patch infeasible due to the high chance of false positives.

This leaves options 2 and 3.
Option 2 could be good as a short-term goal.
Option 3 would be the desired end goal, but requires a complete JIT implementation.

Based on this, I would suggest looking into option 2 for now, and option 3 in the far future.

TSC scaling proposal for Option 2:

Scale the TSC only if the host TSC is faster than the Xbox TSC, i.e. when the scaling factor is greater than 1 host tick per 733 MHz tick.

Use prctl(PR_SET_TSC), or bit 2 (TSD) of register CR4, to make rdtsc executed by the Xbox code fault, and handle the fault in a sigaction(SIGSEGV) handler.

The built-in __int128 type is not supported in 32-bit builds, so you might have to emulate it to get the fixed-point precision you need.

My example code for Linux:

/*
 * $ gcc -m64 -g scaler.c -o scaler
 * $ ./scaler
 * host TSC frequency    : 2793889043
 * TSC scaling factor    : 3.809848696 per 733 MHz tick
 *
 * host tick count       : 496106460379494
 * emulated tick count   : 130216840628375
 * emulated tick diff    : 733357523
 */

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/prctl.h>

uint64_t tsc_scaler = 0;

int tsc_on() {
    return prctl(PR_SET_TSC, PR_TSC_ENABLE);
}

int tsc_off() {
    return prctl(PR_SET_TSC, PR_TSC_SIGSEGV);
}

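uint64_t tsc(int xbox) {
    unsigned __int128 t;
    register uint32_t a;
    register uint32_t d;

    /* Briefly allow RDTSC, read the host counter, then make it trap again. */
    tsc_on();
    __asm__ __volatile__("rdtsc" : "=a" (a), "=d" (d));
    tsc_off();

    if (xbox && tsc_scaler) {
        /* tsc_scaler is a fixed-point factor (scaled by 10^9), so
         * emulated = host * 10^9 / tsc_scaler. The product needs 128 bits. */
        t  = 1000000000;
        t *= ((uint64_t)d << 32) | a;
        t /= tsc_scaler;
        return (uint64_t)t;
    }

    /* Unscaled host tick count. */
    return ((uint64_t)d << 32) | a;
}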

uint64_t tsc_calibrate() {
    register uint64_t x;
    register uint64_t y;

    x  = tsc(0);
    sleep(1);
    y  = tsc(0);
    y -= x;

    fprintf(stderr, "host TSC frequency    : %lu\n", y);

    return y;
}

int main() {
    unsigned __int128 t;
    register unsigned i;
    register uint64_t x;

    t  = 1000000000;
    t *= tsc_calibrate();
    t /= 733333333;
    tsc_scaler = t;

    fprintf(stderr, "TSC scaling factor    : %lu.%.09lu per 733 MHz tick\n\n",
        tsc_scaler / 1000000000,
        tsc_scaler % 1000000000);

    for (t = tsc(1), i = 0; i < 10; ++i) {
        sleep(1);
        fprintf(stderr, "host tick count       : %lu\n", tsc(0));
        x = tsc(1);
        fprintf(stderr, "emulated tick count   : %lu\n", x);
        fprintf(stderr, "emulated tick diff    : %lu\n\n", x - (uint64_t)t);
        t = x;
    }

    return 0;
}

Wow! Nice! Did you mean t instead of tsc() in the last fprintf?

It'd be cool to have this working under Windows too. Any idea on how to get that? (We should start writing portable code from now on)

A great idea; however, this is not possible on Windows, as the CR4 register is not exposed to user-mode processes. To implement this, we would have to develop our own kernel-mode driver, and I'm against that for a few reasons.

1. There are strict security concerns for kernel-mode development; I'm not sure we are qualified for that.

2. As we don't have a Microsoft signing certificate, users would need to disable driver signature enforcement on their computers: this is a major security risk as it allows any unsigned kernel-mode code to run.

I don't believe we should expose users to these risks.

Simplified above code. tsc(1) should return the emulated tick count @ 733 MHz.

I've found out you cannot modify the CR4 register in user space, confirming @LukeUsher's doubt; writing CR4 directly is apparently possible from kernel space only. Fortunately, the Linux kernel exposes an interface to control the TSC.

One of the many reasons Linux is better to develop for ;)

I don't dislike Windows, I just think Linux treats developers a little better ;).

Even if we go 100% cross-platform and remove the Windows dependency, we still need to support Windows due to its huge market share, so things like this aren't really viable for this project, which is a real shame, because @haxar's solution is great and does exactly what we need.

As for option 1: Scan for the opcode and patch it (much like the patching of FS-accesses)
I have some quick results from my collection of xbes. I tested more than a dozen games.
Most games have the rdtsc instruction at these patterns within specific sections.

text:
0F 31 89 01 (1 occurrence)

D3D:
0F 31 C3 (1 occurrence)

BINK:
0F 31 89 01 (3 occurrences)
0F 31 8B 4C (2 occurrences)
0F 31 B9 XX XX XX XX 8B (1 occurrence)
0F 31 C7 05 (1 occurrence)

WMADEC:
0F 31 89 07 (2 occurrences)

Godzilla: Save the Earth has one additional occurrence at text: 0F 31 89 54
DOA3 has one additional occurrence at text: 0F 31 8B 54

There are also data matches for 0F 31; IDA Pro treats them as data, and the same data repeats in lots of games. I tried treating them as code, but the disassembly didn't look reasonable, so I prefer to treat them as data as well.
DSOUND: 0F 31 59 51
XMV: 0F 31 01 23

Two games, ExaSkeleton and MechAssault, use lots of rdtsc instructions, which can only be matched with the reduced patterns 0F 31 89 and 0F 31 8B. Most of these cases are in the text section.

I couldn't find any data collisions (false detections) with the patterns listed below, so here is a quick suggestion for finding all rdtsc instructions without false detections:

for all code sections:
search
0F 31 89
0F 31 8B

for D3D section:
search
0F 31 C3

for BINK section:
search (in addition to the patterns listed for all code sections)
0F 31 B9
0F 31 C7

games/xbe tested:
DOA3, ExaSkeleton, Godzilla: Save the Earth, Gunvalkyrie, Halo, MechAssault, Otogi, Otogi 2, Panzer Dragoon Orta, RalliSportChallenge, Smashing Drives, Sonic Riders, Steel Battalion

So if my patterns and assumptions are correct, what's the easiest way to patch the rdtsc instruction?
Because the instruction itself is only 2 bytes long, I am thinking of patching it to a specific interrupt. Would that be feasible?

update 2018/04/24: I will keep this post updated with every new xbe I find.

As generalized rdtsc detection isn't feasible in a simple way, I would suggest using the original pattern search method, with the patterns listed below.
We could either separate the patterns per section, or use these patterns for all whitelisted sections.

//rdtsc patterns. General is for patterns found in more than 3 titles; otherwise, the title that contains the pattern is noted.

known segments to be searched: .text, D3D, BINK, WMADEC

0F 31 89 (TEXT, BINK, WMADEC) General
0F 31 C3 (D3D) General
0F 31 8B (TEXT, BINK, D3D) General
0F 31 B9 (TEXT, BINK) General
0F 31 C7 (TEXT, BINK) General
0F 31 8D (TEXT) General
0F 31 68 (TEXT) ExaSkeleton, MechAssault
0F 31 5A (TEXT) TestDrive
0F 31 29 (TEXT) ExaSkeleton
0F 31 F3 (TEXT) ExaSkeleton
0F 31 E9 (TEXT) ExaSkeleton
0F 31 2B (TEXT) ExaSkeleton
0F 31 50 (TEXT) ExaSkeleton
0F 31 0F (TEXT) ExaSkeleton
0F 31 3B (TEXT) ExaSkeleton
0F 31 D9 (TEXT) ExaSkeleton
0F 31 57 (TEXT) ExaSkeleton
0F 31 B9 (TEXT) ExaSkeleton
0F 31 85 (TEXT) ExaSkeleton
0F 31 83 (TEXT) ExaSkeleton
0F 31 33 (TEXT) ExaSkeleton
0F 31 F7 (TEXT) ExaSkeleton

//false detections, excluded patterns (double-check against the positive patterns above)
0F 31 00 (TEXT) PDO, Burnout2
0F 31 F0 (TEXT) Burnout 3
0F 31 5E (TEXT) Burnout Rev
0F 31 14 (TEXT) Testdrive
0F 31 09 (TEXT) VCOP

known segments to be excluded:
rdata, data, DSOUND, XMV, XGRAPH, XONLINE, MDLPL

tested titles:

Blinx
Burnout
Burnout2
Burnout3
Burnout Revenge
DOA3
ExaSkeleton
Futurama
Godzilla Save the Earth
GunValkyrie
Halo
JSRF
MechAssault
Otogi
Otogi 2
Panzer Dragoon ORTA
Project Gotham Racing
RalliSport Challenge
Smashing Drives
Sonic Riders
Steel Battalion
Tak 2
Test Drive
Virtua Cop 3

Closing this, as it is solved as well as we can manage in the current form of Cxbx-R. Fixing this fully on Windows will require CPU emulation since, unlike Linux, Windows does not provide a way to hook rdtsc.

CPU emulation is on the roadmap anyway, so we'll deal with it then.
