vAmiga: Custom CPU implementation

Created on 1 Dec 2019  ·  73 Comments  ·  Source: dirkwhoffmann/vAmiga

Many of the recently reported bugs seem to be related to bus and interrupt timing. To improve the situation, I favour the idea of integrating a custom CPU implementation into vAmiga. To get this project done in a decent time frame, I will take a reference implementation approach based on two already existing cores: Musashi and portable68000. These cores are going to serve as my functional reference and temporal reference, respectively.

This is my roadmap:

  • Task 1: Write a CPU which is functionally equivalent to portable68000.
  • Task 2: Add a disassembler (portable68000 has none).
  • Task 3: Add cycle counting.
  • Task 4: Integrate the new core in vAmiga.

Task 4 will require some smart recording logic, because I cannot simply run both cores in a row (the first CPU will alter memory and cause side effects). To cope with that, the second core must run in a fake environment that intercepts all memory calls and compares them to what the first CPU did.

These are my corresponding milestones:

  • Milestone 1: Pass all unit tests of portable68000 functionally.
  • Milestone 2: Match Musashi’s disassembler output.
  • Milestone 3: Pass all unit tests of portable68000 temporally.
  • Milestone 4: Run the new core side by side with Musashi with matching output for each and every executed command.

Once all four milestones have been reached, the new core can take over and will hopefully bring vAmiga to the next level.

Milestones reached so far: None 🤭

Enhancement

All 73 comments

I am just reading about r68k, an m68k emulator written in Rust (https://github.com/marhel/r68k), and about its testing strategy.

And then I read this 🤔 ...

In effect, each instruction is compared thoroughly (with random values) to Musashi, using all combinations possible of the allowed source and destination addressing modes and registers. The number of clock cycles consumed is also reported by Musashi after execution, and is also compared to r68k.

Did you read it too? The last sentence says that Musashi reports the consumed clock cycles. 😳 But I thought it does not?! 😳 Until now I thought you wanted to have an own CPU implementation because Musashi is not counting cycles ... And I understood that we will get that part from portable68000...

This might also be useful for our bookshelf

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68000UM.pdf

Sections 7 and 8 list all instruction execution times in clock cycles.

Whereas https://www.nxp.com/docs/en/reference-manual/M68000PRM.pdf describes all possible opcodes...

Until now I thought you wanted to have an own CPU implementation because Musashi is not counting cycles ...

Musashi reports the number of elapsed cycles after each executed instruction, but doesn't report the intermediate cycle counts when memory is accessed. Let's say we execute a command that consumes 12 cycles and performs 4 memory accesses. In this case, we need something like this:

| Event | Cycle |
| ------------- |:-------------:|
| Mem access 1 | 2 |
| Mem access 2 | 4 |
| Mem access 3 | 8 |
| Mem access 4 | 10 |
| End | 12 |

Portable68000 and its successor Denise will provide us with that information.

Ah yes, I understand. That means, for example, when the PC in Chip RAM is at a

  mulu <ea>, Dn command  -> 70 cycles (1 read / 0 write)

and the CPU is blocked by bitplane/Copper or Blitter DMA access, then
we plan to stop the CPU only for that one read cycle when it is accessing the bus, is that right?

Luckily, the complex CPU is the last component, with the lowest priority in the chain of bus consumers...

we plan to stop the CPU only for that one read cycle when it is accessing the bus, is that right?

Yes, exactly. When the CPU tries to acquire the bus, I need to check if the bus is available. To do this, I need to know the exact cycle when the read happens (e.g., cycle 12 after instruction start). If the bus is in use, the CPU is halted until it is free again.

To understand the problem more deeply, I am trying to learn what vAmiga's Agnus controller does in the current implementation. I spotted the code partly in agnus.cpp and memory.cpp, but I cannot easily see the behaviour. Therefore I made the following little quiz 😎
In chipram when BLTPRI is set
Is it
a) executing the CPU with full speed like fastram without caring about the blitter nasty flag.
or
b) executing the CPU with some delay
or
c) stopping the CPU entirely
?

In chipram When BLTPRI is cleared
Is it
d) executing the CPU with full speed like fastram
or
e) executing the CPU for 1 bus cycle if the CPU already requested 3 consecutive memory cycles which were already denied
or
f) periodically stopping the CPU for some bus cycles (random staccato 🥳)
?

Should be c) and e) if I understand the docs correctly.

But then again when e) is not possible because Musashi does not care about bus cycles, what does vAmiga currently do when BLTPRI is not set?

The crucial function w.r.t. bus timing is Agnus::executeUntilBusIsFree() which is executed whenever the CPU accesses Chip or Slow Ram. I tried a lot of variants of which none really worked (every attempt is a hack, because Musashi doesn’t provide the exact cycle information). The current implementation looks like this (it is more primitive than the previous ones, but working best at the moment):

void
Agnus::executeUntilBusIsFree()
{
    int16_t oldpos;

    // Quick-exit if CPU runs at full speed during blit operations
    if (blitter.getAccuracy() == 0) return;

    // Tell the Blitter that the CPU wants the bus
    cpuRequestsBus = true;

    oldpos = pos.h > 0 ? pos.h - 1 : HPOS_MAX;

    // Wait until the bus is free
    while (busOwner[oldpos] != BUS_NONE) {

        // Add a wait state
        cpu.addWaitStates(DMA_CYCLES(1));

        // Emulate another Agnus cycle
        oldpos = pos.h;
        execute();
    }

    cpuRequestsBus = false;
    cpuDenials = 0;
}

The code checks if the bus is free by reading array busOwner[] at the preceding hpos. This array contains, e.g., BUS_COPPER if the Copper used it (it is the same array that is read by the DMA debugger for displaying bus usage). The BLTPRI flag is checked inside this function (the Blitter and the Copper call it to acquire the bus):

template <BusOwner owner> bool
Agnus::allocateBus()
{
    // Deny if the bus has been allocated already
    if (busOwner[pos.h] != BUS_NONE) return false;

    switch (owner) {

        case BUS_COPPER:

            // Assign bus to the Copper
            busOwner[pos.h] = BUS_COPPER;
            return true;

        case BUS_BLITTER:

            // Check if the CPU has precedence
            if (!bltpri() && cpuRequestsBus) {

                if (cpuDenials >= 3) {

                    // debug("Blitter leaves bus to the CPU\n");
                    return false;

                } else {

                    // debug("Blitter ignores the cpu request\n");

                    // The Blitter gets the bus
                    cpuDenials++;
                }
            }

            // Assign the bus to the Blitter
            busOwner[pos.h] = BUS_BLITTER;
            return true;
    }

    assert(false);
    return false;
}

If BLTPRI is true, the Blitter takes the bus whenever it can. If BLTPRI is false and the CPU wants the bus (indicated by cpuRequestsBus being true), the Blitter may overrule the CPU three times in a row (counted by cpuDenials). On the next request, it leaves the bus to the CPU.
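The counter logic can be looked at in isolation with a small toy model. This is only a sketch: ToyArbiter and simulate are hypothetical names, not vAmiga classes, and it assumes that the CPU asks for the bus in every slot.

```cpp
#include <cassert>
#include <vector>

// Toy model of the Blitter side of the arbitration shown above.
// ToyArbiter is a hypothetical stand-in, not the vAmiga Agnus class.
struct ToyArbiter {
    bool bltpri = false;         // BLTPRI ("Blitter nasty") flag
    bool cpuRequestsBus = false; // set while the CPU is waiting for the bus
    int cpuDenials = 0;          // how often the Blitter has overruled the CPU

    // Returns true if the Blitter gets the (otherwise free) bus slot
    bool blitterAllocates() {
        if (!bltpri && cpuRequestsBus) {
            if (cpuDenials >= 3) return false; // leave this slot to the CPU
            cpuDenials++;                      // overrule the CPU once more
        }
        return true;
    }

    // Mirrors the end of executeUntilBusIsFree(): the CPU got its slot
    void cpuGotBus() { cpuRequestsBus = false; cpuDenials = 0; }
};

// Runs 'slots' consecutive Blitter requests while the CPU keeps asking.
// Each entry of the result tells whether the Blitter won that slot.
std::vector<bool> simulate(bool bltpri, int slots) {
    ToyArbiter a;
    a.bltpri = bltpri;
    std::vector<bool> blitterWon;
    for (int i = 0; i < slots; i++) {
        a.cpuRequestsBus = true;
        bool won = a.blitterAllocates();
        blitterWon.push_back(won);
        if (!won) a.cpuGotBus();
    }
    return blitterWon;
}
```

With BLTPRI cleared, the Blitter wins three slots and then yields one. With BLTPRI set, it wins every slot.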

Please feel free to ask more about the code. I'm really happy if somebody looks at it (albeit this part of the code is probably the most ugly one).

OK, I will try to answer my own question in order to prove my understanding of the code above 🙋🏻‍♀️.... If I understand the current code correctly, then
if BLTPRI = 0 it
adds a CPU waitstate (probably 1 CPU cycle long??? ) every 4th DMA Cycle ... when there is a pending CPU bus request

With lots of move.l (ax), (an) instructions there will be more wait states than, for example,
with lots of mulus. So yes, I can imagine the current implementation should approximate the real world 🤗 "in theory".

Question1: why do we add a waitstate? Is it not better to block the Musashi CPU when it requests a memory word from bus ?

Question2: when the Musashi CPU requests memory via bus, why can't we count these requests and treat them as bus cycles ? Or in other words what is the advantage of portable68000's intermediate cycle count?

if BLTPRI = 0 it adds a CPU waitstate (probably 1 CPU cycle long??? ) every 4th DMA Cycle ... when there is a pending CPU bus request

If the bus is in use, the CPU gets delayed by 1 DMA cycle which is 2 CPU cycles:

cpu.addWaitStates(DMA_CYCLES(1));

DMA_CYCLES is a macro converting DMA cycles to master clock cycles (the master clock runs at 28 MHz). There are macros for the CPU clock and the CIA clock as well:

#define CPU_CYCLES(cycles) ((cycles) << 2)
#define CIA_CYCLES(cycles) ((cycles) * 40)
#define DMA_CYCLES(cycles) ((cycles) << 3)
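The relation between the three clocks can be checked at compile time. The sketch below restates the macros as constexpr functions; the function names and the Cycle typedef are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

typedef int64_t Cycle; // master clock cycles (assumption for this sketch)

// constexpr equivalents of the three macros quoted above
constexpr Cycle cpuCycles(Cycle n) { return n << 2; } // 1 CPU cycle = 4 master cycles
constexpr Cycle ciaCycles(Cycle n) { return n * 40; } // 1 CIA cycle = 40 master cycles
constexpr Cycle dmaCycles(Cycle n) { return n << 3; } // 1 DMA cycle = 8 master cycles

// One DMA cycle of wait states corresponds to two CPU cycles
static_assert(dmaCycles(1) == cpuCycles(2), "1 DMA cycle = 2 CPU cycles");

// One CIA cycle spans ten CPU cycles
static_assert(ciaCycles(1) == cpuCycles(10), "1 CIA cycle = 10 CPU cycles");
```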

Question1: why do we add a waitstate? Is it not better to block the Musashi CPU when it requests a memory word from bus ?

The code does exactly this. If the CPU wants to access memory and the bus is blocked, Agnus is emulated until the bus is free (which is the same as blocking the CPU). Of course, blocking the CPU has the effect that the currently executed instruction takes longer than usual. This is taken care of by adding the wait states.

Function addWaitStates is very simple (and might be inlined in future):

void
CPU::addWaitStates(Cycle number)
{
    waitStates += number;
}

The wait states are added in function CPU::executeInstruction():

Cycle
CPU::executeInstruction()
{
    ...
    advance(m68k_execute(1));

    if (waitStates) debug(CPU_DEBUG, "Adding %d wait states\n", waitStates);
    clock += waitStates;
    waitStates = 0;

    return clock; 
}

what is the advantage of portable68000's intermediate cycle count?

Portable68000 gives us the complete memory access pattern. E.g., if a 10 cycle instruction with 4 memory accesses is executed, this pattern could look like this:

C-C-C---C-

Now, assume that bitplane DMA is going on, with the following memory access pattern:

B-B-B-B-B-B-B-B-B-B-B-B-B

This would result in the following bus usage:

BCBCBCB-BCB-B-B-B-B-B

Because CPU instructions usually use every other bus cycle for memory access, the CPU runs at full speed in this example. However, if the number of bitplanes is increased, the bitplane DMA pattern could look like this:

B-BBB-B-BBB-B-BBB-B-BBB

Now, the CPU would be slowed down:

BCBBBCBCBBBCB-BBB-B-BBB

Bottom line: Simply counting the number of memory accesses doesn’t help. Whether the CPU is slowed down depends on the actual memory access pattern which is not provided by Musashi (unfortunately).
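The two examples can be reproduced with a few lines of code. This is a minimal sketch, assuming that a delayed access pushes all later accesses of the instruction back by the same amount and that the first access is wanted in slot 1. mergeBusUsage is a made-up helper, not vAmiga code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Merges a DMA pattern ('B' = slot taken, '-' = free) with a CPU access
// pattern ('C' = memory access, '-' = internal cycle). 'start' is the slot
// in which the CPU wants to perform its first access. Returns the merged
// bus usage; 'delay' receives the number of slots the CPU had to wait.
std::string mergeBusUsage(std::string dma, const std::string &cpu,
                          size_t start, size_t &delay) {
    // Positions of the CPU accesses relative to the first cycle
    std::vector<size_t> offsets;
    for (size_t i = 0; i < cpu.size(); i++)
        if (cpu[i] == 'C') offsets.push_back(i);

    delay = 0;
    for (size_t k = 0; k < offsets.size(); k++) {
        size_t slot = start + offsets[k] + delay;
        // Wait while the slot is taken
        while (slot < dma.size() && dma[slot] != '-') { slot++; delay++; }
        if (slot >= dma.size()) break; // ran out of DMA pattern
        dma[slot] = 'C';
    }
    return dma;
}
```

For the first DMA pattern the delay is 0. For the second one the CPU loses two slots, although the number of memory accesses is the same in both cases.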

I am still not getting it. 😌 May I ask more? 🙋

Bottom line: Simply counting the number of memory accesses doesn’t help. Whether the CPU is slowed down depends on the actual memory access pattern which is not provided by Musashi (unfortunately).

in vAmiga when the CPU reads or writes to the bus it is dispatched
as an example via
activeAmiga->mem.peek16(addr);
and
activeAmiga->mem.poke16(addr, value);

assuming the CPU is being stepped forward one by one in terms of CPU cycles via
m68k_execute(1)

Then I don't get why we cannot record the pattern in these bus dispatcher methods. The CPU certainly calls these in a pattern e.g. read first, some cycles computation, then write, no ? Why is this not the pattern we need then ?

May I ask more? 🙋

Definitely 👨🏻‍🏫.

The CPU certainly calls these in a pattern e.g. read first, some cycles computation, then write, no ? Why is this not the pattern we need then ?

If I understand correctly, you would build up the pattern step by step. Most likely like this:

Mem access 1: C
Mem access 2: C-C
Mem access 3: C-C-C

But if, e.g., MULU is executed, the pattern would look very different. It would be similar to this:

C- ..... -C- (with many cycles in-between where the multiplication happens).

The memory pattern can differ considerably between instructions and counting memory accesses would only approximate the real behaviour.

If I understand correctly, you would build up the pattern step by step.

yes.

But if, e.g., MULU is executed, the pattern would look very different. It would be similar to this:

C- ..... -C- (with many cycles in-between where the multiplication happens).

if that is the real pattern of MULU, yes that would be the expected behaviour.

The memory pattern can differ considerably between instructions and counting memory accesses would only approximate the real behaviour.

Now I get the problem you are thinking of. You think although we might step the CPU forward cycle by cycle, the memory requests from the emulated CPU would not happen at the same cycle step compared to a real physical CPU. That is mainly because you assume that the developer of Musashi did not measure the bus accesses, for example with a logic analyser, as the developer of portable68000 did. And therefore it would only be a guess or an approximation. But what proves this assumption?

You think although we might step the CPU forward cycle by cycle, the memory requests from the emulated CPU would not happen at the same cycle step compared to a real physical CPU.

No, the problem is that we cannot step the Musashi CPU cycle by cycle (we wouldn't have any problem if we could). We can only step the CPU instruction by instruction. When we call Musashi::m68k_execute(1), Musashi executes a single instruction (as a chunk) and returns the number of elapsed cycles. While executing m68k_execute, Musashi calls vAmiga::peek() and vAmiga::poke() a couple of times, but doesn't advance an internal clock between those calls. Hence, viewed from the vAmiga side, Musashi executes all memory accesses at the same time, which is the problem. The only thing we could do (besides improving Musashi or implementing our own CPU) is to pretend that a certain number of cycles (usually 2) did elapse between two memory accesses. This is what I have in mind when I call it an approximation. It would be correct for many instructions, but totally wrong for instructions such as MUL or DIV.

Ok, now I completely understand the problem and why it is not possible to do some hack in the peek and poke methods in order to fine tune the CPU bus access. Thank you!!

The most simple approach from the vAmiga side would be to use a CPU implementation which vAmiga can step forward cyclewise... And which would then call vAmiga's memory interface (e.g. peek & poke) at the correct cycle of the CPU instruction which is currently being processed.

The most simple approach from the vAmiga side would be to use a CPU implementation which vAmiga can step forward cyclewise...

This is how it is done in VirtualC64. The approach would be very slow though. Fortunately, we can do better by letting the CPU drive the whole thing. This means that the run loop will be something like this:

while (1) {
   cpu.executeInstruction();
   agnus.executeToCpuClock();
}

In other words: The CPU is giving pace and Agnus follows.

Function cpu.executeSingleInstruction will be structured similar to this:

CPU::executeSingleInstruction(...)
{
   clock += 4;
   value = mem->peek();

   clock += 2;
   value = mem->peek();
   …
}

And the peek handler will look like this:

Mem::peek() {
  agnus.executeToCpuClock();

  if (source == CHIP_RAM || source == SLOW_RAM) {
      int blockedCycles = agnus.executeUntilBusIsFree();
      cpu.clock += blockedCycles;
  }
  ...
}

This means that whenever a memory access occurs, Agnus is executed up to the cycle where the CPU already is. If Agnus used the bus in the last cycle, the CPU cannot have it immediately. In this case, Agnus continues to execute until the bus is free. The number of blocked cycles is added to the CPU clock and the memory access performed.
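To make the control flow concrete, here is a self-contained toy version of this scheme. Everything in it is an assumption for illustration: ToyAgnus, ToyCpu and run are invented names, time is counted in bus slots instead of master cycles, and the "instruction" is made up.

```cpp
#include <cassert>
#include <string>

// Toy model of "the CPU gives pace and Agnus follows".
struct ToyAgnus {
    std::string dma; // repeating DMA pattern: 'B' = slot in use, '-' = free
                     // (must contain at least one '-')
    long clock = 0;  // number of slots emulated so far

    ToyAgnus(const std::string &pattern) : dma(pattern) {}

    bool busy(long slot) const {
        return dma[static_cast<size_t>(slot) % dma.size()] == 'B';
    }

    // Catches up with the CPU
    void executeTo(long target) { if (clock < target) clock = target; }

    // Emulates slots until a free one is found; returns the slots lost
    long executeUntilBusIsFree() {
        long blocked = 0;
        while (busy(clock)) { clock++; blocked++; }
        clock++; // the free slot is consumed by the CPU
        return blocked;
    }
};

struct ToyCpu {
    ToyAgnus &agnus;
    long clock = 0;

    ToyCpu(ToyAgnus &a) : agnus(a) {}

    // Memory access handler: sync Agnus first, then wait for the bus
    void access() {
        agnus.executeTo(clock);
        clock += agnus.executeUntilBusIsFree();
    }

    // A made-up instruction with two accesses that are two slots apart
    void executeInstruction() {
        access();
        clock += 2;
        access();
        clock += 2;
    }
};

// Runs the CPU-driven loop and returns the final CPU clock
long run(const std::string &dma, int instructions) {
    ToyAgnus agnus(dma);
    ToyCpu cpu(agnus);
    for (int i = 0; i < instructions; i++) {
        cpu.executeInstruction();
        agnus.executeTo(cpu.clock);
    }
    return cpu.clock;
}
```

With a free bus ("-"), two instructions take 8 slots. With DMA in every other slot ("B-"), the CPU loses a single slot once and then runs in step with the free slots.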

True 🤤

seems "theirs" is all about cycles 😬...

for example cpu.js from SAE

function runNormal() { //m68k_run_2()
        var exit = false;

        while (!exit) {
            try {
                while (!exit) {
                    regs.instruction_pc = getPC();
                    //regs.opcode = getInst16_default(0);
                    regs.opcode = nextInst16_default();
                    SAER.events.do_cycles(cpu_cycles);
                    var orw_cycles = iTab[regs.opcode].f(iTab[regs.opcode].p);
                    cpu_cycles = orw_cycles[0] * cpucycleunit;
                    //cpu_cycles = adjust_cycles(orw_cycles[0] * cpucycleunit);
                    SAEV_CPU_cycles = cpu_cycles;

                    if (SAEV_spcflags) {
                        if (SAER.m68k.do_specialties(cpu_cycles))
                            exit = true;
                    }
                }
var orw_cycles = iTab[regs.opcode].f(iTab[regs.opcode].p);
cpu_cycles = orw_cycles[0] * cpucycleunit;
SAEV_CPU_cycles = cpu_cycles;

Sorry, I just don't get what they do 🙈. I can't get all those different cycles into my head 🤓.

Here is something simpler:

[Screenshot 2019-12-11 15:02:26]

... which leads to E = m c^2 with a few more derivations 😍.

I love simple relationships (which is the reason why UAE code has to stay out of vAmiga 😉).

This is so smart to let the CPU drive the thing. Honestly, at first it sounded a little paradoxical, because the CPU should be the one with the lowest priority in Chip RAM. The way you intercept the CPU bus access in the peek & poke memory interface reminds me a little of the fairy tale of the rabbit and the hedgehog. The speedy rabbit 🐇 with the big advantage always lost the race. The hedgehog 🦔 always replied to the rabbit: I am already here.
Again, thank you for sharing the concepts. It is very interesting how such a complicated computer like the Amiga is being emulated. I also never really fully understood UAE, only small parts of it. 😌 The code and concepts in vAmiga are much clearer and pretty cool to study and learn from.

Milestone 1 reached 🥳.

I'm sure that my "CPU" is still full of bugs though. Although the portable68000 unit test suite is pretty good, it can only check a tiny subset of all possible instruction / mode / argument combinations.

Next milestone is fixing the disassembler output. For this purpose, I am using a faked vAmiga app (on the dasm branch) that disassembles each executed instruction internally. It then compares the output of Musashi with my own disassembler and crashes the app once a mismatch has been found. Right now, it crashes almost immediately:

Disassembled instruction 262168 differs:
Musashi: dbra    D1, $fc0142
 vAmiga: dbf       D1, $0

Assertion failed: (false), function executeInstruction

Let's see how long it takes until I can see the hand & disk logo in this faked app 😬.

Wow, that is a big Christmas present for mankind, a new child is born 🧚🏻‍♀️ ...

How to pronounce Moira ?

"Moi" like French "mine" ... ?
"ra" like the "ra" in supra-molecular ?

Emphasis on the first syllable or the last syllable?

Is it male 👶🏽 or female 👧🏻 or it 👶?

Anyway, happy birthday 🎊🎂🎈.

EDIT:
https://en.wikipedia.org/wiki/Moira_(given_name)
https://en.wikipedia.org/wiki/Moirai
https://www.babycenter.com/baby-names-moira-3266.htm

In ancient Greek religion and mythology, the Moirai, often known in English as the Fates (Latin: Fata), Moirae or Mœræ (obsolete), were the white-robed incarnations of destiny

I'm not really an expert in ancient Greek mythology and Wikipedia is kind of technical about it. But what I understood is that the Moirae were essentially three cool girls with superpowers which I found very cool 😎.

What makes me a little suspicious is that they are not on the official super-power list 🤨:

https://marvel.fandom.com/wiki/Category:Powers

And their relationship to Zeus is also unclear:

Both gods and men had to submit to them, although Zeus's relationship with them is a matter of debate: some sources say he can command them (as Zeus Moiragetes "leader of the Fates"), while others suggest he was also bound to the Moirai's dictates.

How to pronounce Moira ?

I have no clue 🤭.

BTW. I have switched over to a formal approach to test the disassembler. I simply iterate over all opcodes and call the disassembler for them:

The following mismatch is very strange. What is Musashi trying to tell me with the *4? I've never seen this syntax... 🤔

Mismatch found: 30 0 7456

       Musashi: ori.b   #$0, ($56,A0,D7.w*4)
         Moira: ori.b   #$0, ($56,A0,D7.w)

Now I see it .... the "*4" is the scale. I cannot remember ever using that scale thing.

ori.b #$0, ($56,A0,D7.w*4)

bd=$56
An=A0
Xn=D7
scale=4

effectively this
[image]

where scale is
[image]

Oops, never heard about a scaling factor in this context 🙄.

Here is my current implementation of this addressing mode:

       case 6: // (d,An,Xi)
        {
            i8 d = (i8)irc;
            i32 xi = readR((irc >> 12) & 0b1111);
            ea = readA(n) + d + ((irc & 0x800) ? xi : (i16)xi);
            result = read<S>(ea);
            readExtensionWord();
            break;
        }

The bit format of this addressing mode is:

iiii sxxx dddd dddd

i = Index register
s = Size indicator
d = displacement

I bet the (unused) bits marked xxx contain the scaling factor 🤓.

Let's cheat and peek into the Musashi sources:

static char* get_ea_mode_str(uint instruction, uint size)
{
    ...
    if (EXT_INDEX_SCALE(extension))
        sprintf(mode+strlen(mode), "*%d", 1 << EXT_INDEX_SCALE(extension));
    ...
}

Here we go:

#define EXT_INDEX_SCALE(A)                (((A)>>9)&3)

This means we have two scaling bits right here:

xxxx xSSx xxxx xxxx

There is a bit remaining at position (1 << 8). Because I discovered it first, I have the right to name it. I call it the "mystery bit" M.

xxxx xxxM xxxx xxxx

What could be the purpose of M? 🤔

Oh no 🙈, I searched for the newly discovered M bit 🕵 and now I found this

[image]

Look at bits 10 and 9 of picture (a) ... there is no scale for the 68000!? maybe the instruction you have stumbled upon is only there to test the CPU, maybe in kickstart ? When the CPU does scaling, then the Kickstart code knows that it is a newer CPU?

If that should turn out to be true, then you should ignore the scaling to let Moira identify itself as a 68000 CPU ....

Also no description about the M Bit. It is always zero look! 👀

I see, it's a 68020+ feature 😄. Actually this explains everything, including why there isn't a single portable68000 unit test that applies a scaling factor.

I'll integrate the scaling thing into my disassembler to achieve compatibility with Musashi (this is key for rapid testing).

maybe the instruction you have stumbled upon is only there to test the CPU, maybe in kickstart ?

No, it's much simpler. My new code iterates over all possible bit patterns and calls both disassemblers on them. It's an artificially generated instruction that doesn't appear anywhere in Kickstart.

Also no description about the M Bit. It is always zero look!

So disappointing. It felt like I was close to a big discovery 😟.

In (d,An) addressing mode, Musashi switches between signed and unsigned format.

E.g., Musashi translates $28 $0 $8000 to:

ori.b   #$0, (-$8000,A0)

Contrarily, Musashi translates $108 $8000 $0 to:

movep.w ($8000,A0), D0

So mean 😖

Go ahead and just mock that Musashi behaviour in Moira for now ...

Once the disasm output is completely equal to Musashi's, we can let Moira consistently produce signed or unsigned format. Maybe we could invent 😏 a "SignedFormatOutputEnabled" switch for that in Moira...
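A minimal sketch of what such a switch could look like. formatDisp16 and its signedOutput parameter are hypothetical and not part of Moira or Musashi; snprintf is used only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats the 16-bit displacement of (d16,An) either as a signed or as an
// unsigned hex number. The 'signedOutput' switch is hypothetical.
std::string formatDisp16(uint16_t raw, int an, bool signedOutput) {
    char buf[32];
    // Reinterpret the word as a signed value only if requested
    int value = signedOutput ? static_cast<int16_t>(raw) : raw;
    if (value < 0)
        std::snprintf(buf, sizeof(buf), "(-$%x,A%d)", static_cast<unsigned>(-value), an);
    else
        std::snprintf(buf, sizeof(buf), "($%x,A%d)", static_cast<unsigned>(value), an);
    return buf;
}
```

Called with the displacement word $8000, the signed variant yields the ORI style output and the unsigned variant the MOVEP style output quoted above.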

The Mystery bit enables "Full Extension Word Format" 🤯 (68020+).

[Screenshot 2019-12-15 08:59:21]

The 68000 interprets this as [M]eaningless? Only the 68020+ knows about full mystery extension words? 👀 I am reading the specs...

Only the 68020+ knows about full mystery extension words?

Yes. The 68000/68010 ignores the [M]eaningless/[M]ystery bit as well as the scale bits. Those CPUs only support the brief extension word format.
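Putting the pieces of this discussion together, the extension word can be decoded as follows. This is a sketch based on the bit layout discussed in this thread; BriefExt, decodeBriefExt and ea68000 are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Fields of the brief extension word: iiii sSSM dddd dddd
struct BriefExt {
    int index;     // 0..7 = D0..D7, 8..15 = A0..A7
    bool longIdx;  // true: use Xn.l, false: use sign-extended Xn.w
    int scaleBits; // bits 10-9, only meaningful on a 68020+
    bool full;     // bit 8, selects the full format on a 68020+
    int8_t disp;   // signed 8-bit displacement
};

BriefExt decodeBriefExt(uint16_t ext) {
    BriefExt e;
    e.index = (ext >> 12) & 0xF;
    e.longIdx = (ext & 0x0800) != 0;
    e.scaleBits = (ext >> 9) & 3;
    e.full = (ext & 0x0100) != 0;
    e.disp = static_cast<int8_t>(ext & 0xFF);
    return e;
}

// Effective address of (d8,An,Xn) computed the 68000 way:
// the scale bits and bit 8 are ignored.
uint32_t ea68000(uint32_t an, uint32_t xn, uint16_t ext) {
    BriefExt e = decodeBriefExt(ext);
    int32_t idx = e.longIdx ? static_cast<int32_t>(xn)
                            : static_cast<int16_t>(xn & 0xFFFF);
    return an + static_cast<uint32_t>(static_cast<int32_t>(e.disp))
              + static_cast<uint32_t>(idx);
}
```

For the extension word $7456 from the mismatch above, this gives index register D7, word size, displacement $56, and scale bits 2, which Musashi prints as *4.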

Mismatch found: 13a 0 0

       Musashi: btst    D0, ($0,PC); ($10002)
         Moira: btst    D0, ($0,PC)

My compatibility counter has reached 0x13A out of 0xFFFF. This means that 0.48% of all disassembled strings already match. Just 99.52% to go. Piece of cake.

At opcode 0x13C, it's getting messy 😕.

According to the specs, there is no immediate addressing mode for BTST Dn,<ea>

[Screenshot 2019-12-15 12:49:13]

Accordingly, Moira treats the corresponding bit pattern as illegal. Musashi, however, disassembles it:

Mismatch found: 13c 0 1

       Musashi: btst    D0, #$0
         Moira: btst    D0, #$1

Maybe immediate addressing for BTST is a 68020+ feature? 🤔

According to

http://www.easy68k.com/paulrsm/doc/trick68k.htm

the addressing mode is supported:

Checking for membership in a small set. If you want to see if a number is in a set of several numbers, you can create a bit mask corresponding to the set. For instance, if the set is {0,1,3,5}, the mask has those bits set and the bit map is 00101011 (2B hexadecimal). You can test for membership in this set with

BTST D0,#$2B ;Is D0 in {0,1,3,5}?
If your set is composed of more than eight elements you have to move the mask into a data register first.
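The mask from the quoted trick can be reproduced with a small helper. maskOf and isMember are made-up names. The sketch assumes that BTST with a non-register destination is a byte operation that takes the bit number modulo 8, so the trick only gives correct answers when D0 is known to be in the range 0..7.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Builds the 8-bit mask for a set of numbers in the range 0..7
uint8_t maskOf(std::initializer_list<int> members) {
    uint8_t mask = 0;
    for (int m : members) mask |= static_cast<uint8_t>(1u << (m & 7));
    return mask;
}

// Models the test performed by BTST Dn,#mask: the bit number is taken
// modulo 8 because the destination is a byte operand
bool isMember(uint32_t dn, uint8_t mask) {
    return ((mask >> (dn & 7)) & 1) != 0;
}
```

Note that isMember(8, 0x2B) also reports a hit, because 8 modulo 8 selects bit 0.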

[image]

Should be valid...

No wait 🙊... this one is the correct one

[image]

The #<data> 111 / 100 mode is the mode in question. It's definitely supported then.

New high score reached 😎:

Mismatch found: 4180 0 0

       Musashi: chk.w   D0, D0
         Moira: chk     D0, D0

[Screenshot 2019-12-15 14:15:34]

Hmm, is this a bug in Musashi? 🤔

Mismatch found: 41bc 0 0

       Musashi: chk.w   #$0, D0
         Moira: dc.w $41bc; ILLEGAL

Immediate addressing should not be allowed:

[Screenshot 2019-12-15 14:21:23]

[image]

My documents are different 🧐

What does the value 100 in the register column mean?

Seems like you have the better docs 🤓. Which document did you use?

Look at the third entry of this issue. 😎

This might also be useful for our bookshelf

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68000UM.pdf

Sections 7 and 8 list all instruction execution times in clock cycles.

Whereas https://www.nxp.com/docs/en/reference-manual/M68000PRM.pdf describes all possible opcodes...

Look at the third entry of this issue.

Oh, I see. Yes, it's all there 🤓.

I thought for this expedition into the stone age we need some proper and excellent equipment. Well prepared for the mysteries and obstacles that are awaiting us there .... 👨🏻‍🚀

What does the value 100 in the register column of the 111 addressing mode of the chk operation mean? Why is it called the register column? Has it a meaning or is it just the combination code for chk with immediate addressing...

Has it a meaning

Yes. There is a general coding scheme:

The first seven addressing modes among

[Screenshot 2019-12-15 15:10:42]

need a register as a parameter. They are coded in the form MMM RRR where MMM is the binary representation of the mode number and RRR is the register number. The last five modes don't require a register as parameter. Because of that, the register field is used to store additional mode bits. I.e.,

111 000 for mode 7
111 001 for mode 8 etc.
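The coding scheme can be written down as a small decoder. The function name modeNumber is made up for illustration; the mode numbers follow the 0..11 numbering used above.

```cpp
#include <cassert>

// Maps the 6-bit effective address field MMM RRR to a mode number 0..11.
// Returns -1 for the unused combinations 111 101 to 111 111.
int modeNumber(int eaField) {
    int mmm = (eaField >> 3) & 7;
    int rrr = eaField & 7;
    if (mmm < 7) return mmm;        // modes 0..6: RRR is a register number
    return rrr <= 4 ? 7 + rrr : -1; // modes 7..11: RRR holds extra mode bits
}
```

The immediate mode #<data> from the CHK and BTST discussion is 111 100, which this function maps to mode 11.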

Milestone 2: Match Musashi’s disassembler output.

Reached 😎. Moira's disassembler output matches Musashi for all opcodes now.

I did test

  • all 65536 possible opcodes with 48 distinct extensions words,
  • instruction BTST with all 65536 possible extension words.

I cannot test all possible combinations, because of combinatorial explosion, but I am pretty confident that the disassembler is fine now. This also means that Moira's jump table is correct which is a big step forward.
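The test loop itself is simple. The following sketch shows the idea with two arbitrary disassembler callbacks. The Dasm signature, the Mismatch struct, and the choice to combine the extension words pairwise are assumptions of this example, not the actual test code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// A disassembler maps an opcode plus two extension words to a string.
// This signature is an assumption of the sketch.
using Dasm = std::function<std::string(uint16_t, uint16_t, uint16_t)>;

struct Mismatch { bool found; uint16_t opcode, ext1, ext2; };

// Feeds every opcode with every pair of extension words to both
// disassemblers and reports the first input where the outputs differ
Mismatch firstMismatch(const Dasm &ref, const Dasm &dut,
                       const std::vector<uint16_t> &exts) {
    for (uint32_t op = 0; op <= 0xFFFF; op++)
        for (uint16_t e1 : exts)
            for (uint16_t e2 : exts) {
                uint16_t o = static_cast<uint16_t>(op);
                if (ref(o, e1, e2) != dut(o, e1, e2))
                    return { true, o, e1, e2 };
            }
    return { false, 0, 0, 0 };
}
```

The number of calls grows with the square of the number of extension words, which is why only a small, carefully chosen set of extension words is practical.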

For Christmas, Moira had wished for a clock. I told her she might be too young for such a device, but she wouldn't listen 🤨. Anyway, enough for today...

Before giving Moira a clock, I decided to give her a sandbox. It works as follows: When a portable68000 unit test is executed, the sandbox intercepts all memory accesses and records them. When Moira runs the same test afterwards, her memory accesses are also intercepted and compared to the results on record. This enables automatic verification of all memory access patterns.

Using this brand new cutting edge VMAS(TM) technology (Virtual Memory Access Sandboxing, patent pending), the first mismatch can be found in no time 😎:

Instruction: add.l   D2, (A2)+

ACCESS 8 DOESN'T MATCH:
i:  8  Type: Poke16  Addr: 2000  Cycle: 0  

ACCESS RECORD:
i:  0  Type: Peek16  Addr:    0  Cycle: 0  
i:  1  Type: Peek16  Addr:    2  Cycle: 0  
i:  2  Type: Peek16  Addr:    4  Cycle: 0  
i:  3  Type: Peek16  Addr:    6  Cycle: 0  
i:  4  Type: Peek16  Addr:    8  Cycle: 0  
i:  5  Type: Peek16  Addr:    a  Cycle: 0  
i:  6  Type: Peek16  Addr: 2000  Cycle: 0  
i:  7  Type: Peek16  Addr: 2002  Cycle: 0  
i:  8  Type: Peek16  Addr:    c  Cycle: 0  
i:  9  Type: Poke16  Addr: 2000  Cycle: 0  
i: 10  Type: Poke16  Addr: 2002  Cycle: 0  

The output shows that Moira fails to read a word from memory before writing the result. How dare she 🤨.
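The record-and-compare idea can be sketched in a few lines. The class below is a hypothetical illustration, not Moira's actual sandbox; it compares access type and address only and merely stores the cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class AccessType { Peek8, Peek16, Poke8, Poke16 };

struct Access {
    AccessType type;
    uint32_t addr;
    int64_t cycle; // recorded, but not compared in this sketch
};

// Records the accesses of the reference core, then checks the accesses of
// a second core against the recording
class Sandbox {
    std::vector<Access> record;
    size_t next = 0;      // index of the next access to verify
    long mismatchAt = -1; // index of the first mismatch, -1 if none
public:
    // Called for each access of the reference core
    void recordAccess(const Access &a) { record.push_back(a); }

    // Switches from recording to verification
    void startReplay() { next = 0; mismatchAt = -1; }

    // Called for each access of the core under test
    void verifyAccess(const Access &a) {
        bool ok = next < record.size() &&
                  record[next].type == a.type && record[next].addr == a.addr;
        if (!ok && mismatchAt < 0) mismatchAt = static_cast<long>(next);
        next++;
    }

    long firstMismatch() const { return mismatchAt; }
};
```

Replaying the add.l example with the missing read at address c reports a mismatch at access 8, as in the log above.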

Just profiled the disassemblers of Musashi and Moira (65536 x 48 instructions):

Musashi: 44.8 sec
  Moira:  6.5 sec

Actually, it was easy to outperform Musashi, because it calls sprintf to assemble the strings whereas Moira utilises a template-based string writer. The picture will be different once Moira is ready enough to compare emulation speed (which is the important metric). I expect it to be rather impossible to outperform Musashi, so the question is how much slower Moira will be 😬.

Moira is born not only to emulate the states before and after a CPU instruction, but also to emulate the intermediate temporal states of the m68k, e.g., taking care of bus access times. With all those extra states and probably extra syncing, Moira is a much bigger beast from a state machine perspective. Due to its higher degree of complexity, we should expect it to be slower but more accurate 😎👍🏻.

Here is something puzzling:

When executing ori.b #$0, $8010.w, Musashi writes the result back to $ff8010 whereas Moira writes back to $8010. The discrepancy got trapped by my sandbox:

Instruction: ori.b   #$0, $8010.w

ACCESS 0 DOESN'T MATCH:
i:  0  Type: Poke8   Addr:   8010  Cycle: 18  Value:    0  

ACCESS RECORD:
i:  0  Type: Poke8   Addr: ff8010  Cycle:  0  Value:    0  

Here is the corresponding Musashi code:

static void m68k_op_ori_8_aw(void)
{
    uint src = OPER_I_8();
    uint ea = EA_AW_8();

    uint res = MASK_OUT_ABOVE_8(src | m68ki_read_8(ea));

    m68ki_write_8(ea, res);

    FLAG_N = NFLAG_8(res);
    FLAG_Z = res;
    FLAG_C = CFLAG_CLEAR;
    FLAG_V = VFLAG_CLEAR;
}

EA_AW_8 does sign-extension which means that $8010 becomes $FFFF8010. Because the 68000 has a 24-bit address bus, this is cropped to $FF8010.
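The address computation described here fits into a single helper. absShortAddress is a name made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Address that appears on the bus for absolute short addressing ($xxxx.w):
// the 16-bit word is sign-extended to 32 bits, and the 68000 only drives
// the lower 24 address bits.
uint32_t absShortAddress(uint16_t word) {
    int32_t extended = static_cast<int16_t>(word); // $8010 -> $FFFF8010
    return static_cast<uint32_t>(extended) & 0x00FFFFFF;
}
```

Words up to $7FFF map to themselves, whereas $8000 and above end up in the top 32 KB of the 24-bit address space.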

What is going on here? 🤔 When I use a word operand (####.w), is it really treated as a signed number? I can't really believe this...

Indeed, the 68000 assembler issues an error message. Apparently, word addresses are treated as signed numbers 🤭:

error 2033 in line 43 of "ori1.s": absolute short address out of range
>   ori.b #$C5,$8010.w

Also, lea $8010.w,a1 gives an "absolute short address out of range" error, whereas lea $7fff.w,a1 is accepted.

So it is apparently not specific to ori...
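The address computation discussed above can be sketched in a few lines. The function name `absShortAddress` is made up for this example; the logic simply sign-extends the 16-bit extension word and keeps the 24 address bits that exist on a 68000.

```cpp
#include <cstdint>

// Effective address of an absolute short operand (xxxx.w) as it appears
// on the 24-bit address bus of a 68000. Hypothetical helper name.
constexpr uint32_t absShortAddress(uint16_t ext) {
    // Sign-extend to 32 bits, then keep address lines A0..A23 only
    return ((ext & 0x8000) ? (0xFFFF0000u | ext) : ext) & 0x00FFFFFFu;
}

static_assert(absShortAddress(0x8010) == 0xFF8010, "upper half maps to the top of memory");
static_assert(absShortAddress(0x7FFF) == 0x007FFF, "lower half is unchanged");
```

This is why an assembler only accepts $0000 to $7FFF (and the sign-extended top 32 KB) as absolute short addresses.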

Somehow ori reminds me of Tolkien's dwarf names. Don't know why. Let's see what the dwarf names were ... Dwalin, Balin, Kili, Fili, Dori, Nori, Ori, Oin, Gloin, Bifur, Bofur, Bombur and Thorin.

Wow, there is even a fandom page for Ori! Did not know that!

https://lotr.fandom.com/wiki/Ori

Ori was born in the late Third Age ... and died 🙈 in TA2994 (does that mean TolkienAge?)

Completely fictional fact:
Much later, in modern days, they invented a microprocessor instruction and named it after him in his honor....

> Dwalin, Balin, Kili, Fili, Dori, Nori, Ori, Oin, Gloin, Bifur, Bofur, Bombur and Thorin.

Oh, I see, it's this guy Moira has trouble with. Anyway, I think she is too young to have a friend (nope, no co-processors yet, Moira!).

Bildschirmfoto 2019-12-23 um 13 21 11

Surprisingly, I can't find ORI's brother ANDI 😅. His opcode pattern is 0000 0010 xxxx xxxx.

> and has died 🙈 in TA2994

Hmmm, sounds like a number of a CPU trap to me, but maybe I am just coding too much these days 🤓.

What's new? Musashi is playing happily with Moira in her new sandbox. As expected, she still can't count properly 🙄, but I'm pretty confident she's going to improve over time...

Instruction: btst    D0, $80008010.l
Instruction: btst    D0, (-$8000,PC); ($ffff9002)
Instruction: btst    D0, (PC,A0.w)
Instruction: btst    D0, #$0
Instruction: bchg    D0, D0

MISMATCH FOUND (opcode $140 out of $FFFF):

Instruction: bchg    D0, D0

    Musashi: PC: 1002 Elapsed cycles:  8
      Moira: PC: 1002 Elapsed cycles:  6

Problems ... and more problems ... 🙈

According to the M68000 User's Manual, Ninth Edition, BCHG D0,D0 takes 12 cycles? No?

Bildschirmfoto 2019-12-23 um 14 17 29

In Musashi, however, the cycle count is hard coded to 8:

{m68k_op_bchg_32_r_d         , 0xf1f8, 0x0140, {  8,   8,   4,   4}},

In portable68000 and Denise, the cycle count varies depending on the bit number:

template<uint8_t Mode> auto M68000::cyclesBit(uint8_t bit) -> void {
    uint8_t cycles = 0;

    switch(Mode) {
        case Btst: cycles = 2; break;
        case Bclr: cycles = 2;
        case Bset:
        case Bchg:
            cycles += bit > 15 ? 4 : 2;
            break;
    }
    ctx->sync( cycles );
}

Looks like total anarchy here 🥺.

We could make a program which runs millions of bchg d0,d0 instructions, run it on the A500 MMSE and measure the time to know the exact number, no?

> We could make a program

Yes, I think we need to write a test-case and run it on the MMSE 😎.

There is no need to run a million BCHGs though. We can let the VSYNC interrupt handler start the execution, run BCHGs until the raster beam reaches the middle of the screen and change the background color.

I already tried to do that, but I screwed it up ... obviously 🤭

Bildschirmfoto 2019-12-23 um 15 00 30

It really can't be that difficult to write such a test-case 😖

Bildschirmfoto 2019-12-23 um 15 18 01

> There is no need to run a million BCHGs though. We can let the VSYNC interrupt handler start the execution, run BCHGs until the raster beam reaches the middle of the screen and change the background color.

Yes, that is even better... we see the result immediately 😃.

Assumption:
226 DMA cycles available in a horizontal scan line.
1 DMA cycle is 2 CPU cycles.
bchg d0,d0 is 12 CPU cycles = 6 DMA cycles

Then the CPU should be able to process 37 bchg d0,d0 instructions in a line

The plan, if I understand you correctly, is to start at the vertical blank and let the CPU execute the instruction 370 times. At the end of all the instructions, draw a color. Then, if we see the color at scan line 100, the cycle length of 12 for bchg d0,d0 was correct...
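The arithmetic behind this plan can be written down as a small sketch. It only uses the assumptions stated above (226 DMA cycles per line, 2 CPU cycles per DMA cycle); the function names are made up for the example, and DMA slots stolen from the CPU are ignored.

```cpp
// Assumptions taken from the discussion above
constexpr int dmaCyclesPerLine = 226;
constexpr int cpuCyclesPerDmaCycle = 2;
constexpr int cpuCyclesPerLine = dmaCyclesPerLine * cpuCyclesPerDmaCycle;

// How many back-to-back instructions fit into one scan line
constexpr int instructionsPerLine(int cyclesPerInstruction) {
    return cpuCyclesPerLine / cyclesPerInstruction;
}

// How many complete scan lines a run of identical instructions covers
constexpr int linesCovered(int count, int cyclesPerInstruction) {
    return count * cyclesPerInstruction / cpuCyclesPerLine;
}

static_assert(instructionsPerLine(12) == 37, "12-cycle instructions per line");
static_assert(linesCovered(37 * 100, 12) == 98, "37 x 100 instructions at 12 cycles");
static_assert(linesCovered(37 * 100, 8) == 65, "the same run at 8 cycles");
```

Under these assumptions, 37 x 100 instructions cover roughly 100 lines only if each one really takes 12 cycles; at 8 cycles the colored band would be about a third shorter.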

grafik

This is a test of my program in FS-UAE:
from scanline 0 to 64: red
100*37 bchg d0,d0: green
the rest of the scan lines: blue

Impossible that green spans 100 lines 🙈...

Oh I see, I tested it on an A1200 configuration.

Here is the program again on an A500:

grafik

Better, looks like a lot more lines ... but are these really 100, as it should be with bchg d0,d0 and 12 cycles? Looks like it takes fewer than 12 cycles. I have to test on the A1000 ...

grafik

The height of the 100 yellow lines proves that green is not 100 lines high... ahem, on FS-UAE.

You've managed to write a working test case. So cool 😎.

Mine is still buggy 😕:

Bildschirmfoto 2019-12-23 um 15 20 08

BTW, you don't need to count scanlines. Simply substitute a command for which we know how many cycles it needs and compare the images. I guess one of the dwarf instructions will do: ORI or ANDI 😅

Bildschirmfoto 2019-12-23 um 13 21 11

> Simply substitute a command for which we know how many cycles it needs and compare the images.

OK, let's execute ori instead of bchg and see how the CPU of FS-UAE times them...

grafikgrafik

left picture:
390 times ori #0,d0 execution time in green

right picture:
390 times bchg d0,d0 execution time in green

BTW: I made a mistake. I did not execute 37x100 instructions as I mentioned before, but 39x100.

So we know that FS-UAE (probably WinUAE as well) emulates the execution time of bchg d0,d0 as exactly twice the cycles of ori #0,d0.

I still have to create an ADF from it and throw that onto the A1000 ...

the program is here
timing_test.s.zip

new combined test is here
as ADF
bchg_ori_test.adf.zip
as source code
bchg_ori.s.zip

it produces 100 yellow lines
370 bchg d0,d0 executions in darker green
370 ori #0,d0 executions in lighter green

grafik

(picture FS-UAE setting high compatible CPU 68000)

Could you throw the adf onto the A500 MMSE and see what the correct timings are?

I found the sister of Ori !! Apparently Lea played in a completely different film genre though...
grafik

> I found the sister of Ori !!

😳 All of a sudden, I am losing interest in the princess being held prisoner in Defender of the Crown.

But wait, wasn't she the sister of Luke? 🤔 There is no LUKE instruction though. Just a LINK instruction. Maybe LINK Skywalker sounded so stupid that they changed his name for the movie. This could also be the reason why they did another Star Wars movie. They finally reveal his real name? No?

> Could you throw the adf onto the A500 MMSE and see what the correct timings are?

I'll do it in a minute...

In the meantime, I also managed to fix my test case, so we have two now. My test utilises the Copper to trigger interrupts and I am performing the tests in the interrupt handlers. I have set up 6 interrupt handlers (priority 1 to 6), so I can run multiple timing tests in parallel. Here is the result in UAE:

Bildschirmfoto 2019-12-24 um 08 27 27

Colors:

  • Blue: Copper wakes up to trigger the IRQ
  • Red: CPU enters the IRQ routine and sets up test case data
  • Yellow: The actual timing test

Test lines:
1: Running 12 NOPs, accounting for 48 cycles in total
2: Running 16 NOPs, accounting for 64 cycles in total
3: Running 8 BCHGs with shift value $00
4: Running 8 BCHGs with shift value $10
5, 6: Same as 3,4 with another destination register

Conclusion (for UAE):

  • The shift value does affect timing
  • For shift value $00, BCHG consumes 6 cycles
  • For shift value $10, BCHG consumes 8 cycles
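The UAE measurements above are consistent with the bit-dependent rule quoted earlier from portable68000 and Denise. Here is a minimal sketch of that model. The function name is made up, and the base cost of 4 cycles (one opcode prefetch) is an assumption chosen so that the totals reproduce the 6 and 8 cycles measured in UAE; whether real hardware agrees is still open at this point.

```cpp
#include <cstdint>

// Cycle model for BCHG Dn,Dm that reproduces the UAE measurement.
// Hypothetical helper; the 4-cycle base is an assumption of this sketch.
constexpr int bchgRegCycles(uint32_t bitNumber) {
    // With a data register as destination, the bit number is taken modulo 32.
    // Bits 16..31 cost 4 extra cycles, bits 0..15 cost 2.
    return 4 + (((bitNumber & 31) > 15) ? 4 : 2);
}

static_assert(bchgRegCycles(0x00) == 6, "shift value $00");
static_assert(bchgRegCycles(0x10) == 8, "shift value $10");
```

Note that neither value matches the 12 cycles read from the manual or the flat 8 cycles hard coded in Musashi.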

I'm curious what the real machine will do.
The bookmakers are now open. Please place your bets...

Here is a tricky one:

$4784: chk.w   D3, D3

ACCESS 2 DOESN'T MATCH:
i:  2  Type: Poke16  Addr: 7ffa  Cycle: 22  Value: 2700  

ACCESS RECORD:
i:  0  Type: Poke16  Addr: 7ffc  Cycle:   0  Value:    0  
i:  1  Type: Poke16  Addr: 7ffe  Cycle:   0  Value: 1002  
i:  2  Type: Poke16  Addr: 7ffa  Cycle:   0  Value: 2708  

The mismatch is caused by the N bit in the status register. Musashi sets it to 1 before pushing the status register to the stack and Moira leaves it at 0.

OK, let's RTFM:

Bildschirmfoto 2019-12-28 um 13 21 14

🤨 In our case, [Dn] < 0 and [Dn] > [<ea>] are both true, so the manual doesn't help. Note: In hardware design, "undefined" is usually another word for "we don't care" or "we don't know".

How can we figure out which one is correct? The command initiates exception processing which means that the next command in my program is not executed. We need to write an exception handler that verifies the N flag for us 😬. Has anybody written such a thing before? No? 🙄

> We need to write an exception handler that verifies the N flag for us

OK, trap handlers are as easy as interrupts... stay tuned 😎

Here is my exception handler:

chkHandler:
    bmi     chkHandler2
    move.w  #$0F0,$DFF180
    rte
chkHandler2:
    move.w  #$F00,$DFF180
    rte

UAE:

Bildschirmfoto 2019-12-28 um 14 03 52

vAmiga (Musashi):

Bildschirmfoto 2019-12-28 um 14 04 18

And the winner is ... 😴

And the winner is ... Musashi 👍

IMG_2239
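The behaviour that won here (N is set when the register is negative, even if the upper-bound test would also fire) can be sketched as below. The names `chk16`, `ChkOutcome` and `pushedSR` are made up for this example, and the sketch covers only the N flag, not the other flags or the exception stack frame.

```cpp
#include <cstdint>

struct ChkOutcome {
    bool trap;   // true if the CHK exception is taken
    bool n;      // N flag as it ends up in the pushed status register
};

// Word-sized CHK: the "Dn < 0" test is evaluated first, so N = 1 wins
// when both conditions hold. Hypothetical helper for illustration.
inline ChkOutcome chk16(int16_t dn, int16_t upperBound) {
    if (dn < 0) return {true, true};
    if (dn > upperBound) return {true, false};
    return {false, false};   // N is undefined here; false is an arbitrary choice
}

// N is bit 3 of the status register
inline uint16_t pushedSR(uint16_t sr, bool n) {
    return n ? static_cast<uint16_t>(sr | 0x0008)
             : static_cast<uint16_t>(sr & ~0x0008);
}
```

With a negative register value, `pushedSR(0x2700, chk16(-1, -1).n)` yields the value $2708 from the access record above, while leaving N clear yields the $2700 that Moira pushed.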

I was curious to see if vAmiga and Moira happen to like each other. Unfortunately, not so much yet 😕.

Bildschirmfoto 2020-01-01 um 18 50 47

So, what is going on here? We are right at the beginning of the Kickstart Boot ROM (the same place where we were exactly a year ago 😲):

```
; Set up the Exception Vector Table. Vectors 2 through 47
; (Bus Error through TRAP #15) are all set to the initial
; exception handler. If any exception occurs now, the screen
; will turn yellow, the power light will flash, and the computer
; will be reset.

FC0136 move.w #8,A0 Start at address 8 (vector #2).
FC013A move.w #$2D,D1 Do 46 vectors.
FC013E lea FC05B4(PC),A1 Address of initial exception handler.
FC0142 move.l A1,(A0)+ Set one vector
FC0144 dbra D1,FC0142(PC) Loop back.
```
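For readers less familiar with dbra, the loop above can be mirrored in C++. This is just a sketch of what the five instructions do; the function name and the 64-entry table are choices made for the example. The point is that dbra loops D1 + 1 times, so a counter of $2D fills 46 vectors.

```cpp
#include <array>
#include <cstdint>

// Returns a table indexed by vector number (vector n lives at address 4 * n).
// Hypothetical helper that mirrors the ROM loop shown above.
inline std::array<uint32_t, 64> initVectors(uint32_t handler) {
    std::array<uint32_t, 64> vectors{};   // all entries start at zero
    uint32_t a0 = 8;                      // move.w #8,A0
    int d1 = 0x2D;                        // move.w #$2D,D1
    do {
        vectors[a0 / 4] = handler;        // move.l A1,(A0)+
        a0 += 4;
    } while (d1-- != 0);                  // dbra: D1 + 1 iterations in total
    return vectors;
}
```

Calling `initVectors(0xFC05B4)` fills entries 2 through 47 and leaves the reset vectors (0 and 1) and everything from 48 upwards untouched.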

Seeing the screen turn yellow means that some exception has happened that should not have happened. At first, I was disappointed, but if I think about it, this is quite good. It means that Moira can already process exceptions 🥳 and she is not color blind (she wrote into the correct memory cell to change the background color). So the question is .... what kind of exception is going on here? 🤔 No, it's not interrupts, I've already checked that... 🤨

I have started to convert the test programs created by cputester into ADFs.
First instruction (in alphabetical order) is ABCD:

vAmiga with Musashi 🙈:

Bildschirmfoto 2020-01-05 um 08 36 23

vAmiga with Moira 😎:

Bildschirmfoto 2020-01-05 um 08 30 40

The time has come to use the big wrecking ball. With the latest check-in, Musashi is gone from the dev branch. I do feel a little sorry about it 😢, because I really liked that core and without its existence, I wouldn't have started the vAmiga project at all.

There is still a lot to do, because big portions of the old wrapper code need to be integrated into Moira (breakpoint support, instruction logging, etc.). To keep things simple, I plan to remove conditional breakpoints, because I never use them myself and a whole lot of code is needed to implement them. (A conditional breakpoint halts the CPU only when a certain condition holds, such as D0 == 42.)

To understand the problem more deeply, I am trying to learn what vAmiga's Agnus controller does in the current implementation. I found parts of the code in agnus.cpp and memory.cpp, but I cannot easily see the behaviour.

I have reimplemented bus sharing with Moira in hand. The code is much much cleaner now:

Here is the run loop (the outermost loop of the emulator thread):

```
do {

    // Emulate the next CPU instruction
    cpu.execute();

    // Check if special action needs to be taken
    if (runLoopCtrl) {
        ...
    }
} while (1);
```

Here is function `Moira::sync()`:

```
void
CPU::sync(int cycles)
{
    // Advance the CPU clock
    clock += cycles;

    // Emulate Agnus up to the same cycle
    agnus.executeUntil(CPU_CYCLES(clock));
}
```

Here is `Agnus::executeUntilBusIsFree()`:

```
void
Agnus::executeUntilBusIsFree()
{
    DMACycle delay = 0;

    // Return immediately if the bus is free
    if (busOwner[pos.h] == BUS_NONE) return;

    // Execute Agnus until the bus is free
    do {
        execute();
        delay++;
    } while (busOwner[pos.h] != BUS_NONE);

    // Add wait states to the CPU
    cpu.addWaitStates(AS_CPU_CYCLES(DMA_CYCLES(delay)));
}
```

I have to admit that the code is still completely untested 🤭. For now, I'm really happy that the code architecture has become so simple by replacing Musashi with Moira.
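To make the bus sharing idea easier to play with, here is a self-contained toy model of the waiting logic. It is not vAmiga code: `ToyAgnus`, the 8-slot line and the return value are simplifications invented for this sketch, and it assumes that 1 DMA cycle equals 2 CPU cycles as stated earlier in the thread.

```cpp
#include <array>
#include <cstdint>

enum BusOwner : uint8_t { BUS_NONE, BUS_DMA };

// Toy stand-in for Agnus with a very short scan line
struct ToyAgnus {
    std::array<BusOwner, 8> busOwner{};   // bus owner per DMA slot, all free initially
    int h = 0;                            // current horizontal position

    // Advances the toy chip by one DMA cycle
    void execute() { h = (h + 1) % static_cast<int>(busOwner.size()); }

    // Runs until the current slot is free and returns the resulting
    // CPU wait states. Never terminates if every slot is taken.
    int executeUntilBusIsFree() {
        int delay = 0;
        while (busOwner[h] != BUS_NONE) {
            execute();
            delay++;
        }
        return 2 * delay;   // assumption: 1 DMA cycle = 2 CPU cycles
    }
};
```

In the real emulator the wait states are handed to the CPU rather than returned, but the loop structure is the same as in the function shown above.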
