Tvm: [RFC] Register Relay VM design

Created on 27 Mar 2019 · 9Comments · Source: apache/tvm

Register VM Design

Current Relay VM RFC (#2810) proposes stack-based VM design which uses push and pop to maintain a unified stack. Though the design itself is simple to implement, it is cumbersome in tasks such as dataflow analysis and enforces certain orders in the execution.

I propose a register-based VM design as an alternative design. The main difference is that register-based VM uses registers to designate the operands and results instead of using the stack. Registers in the VM are virtual registers and each is a reference to a VMObject. We assume there are infinite registers so runtime doesn't need to worry about register spilling. Registers are in SSA form where its index is unique in each VMFunction.

Calling convention in register VM is also simple. Each function has its own local register file. The first k registers in the register file of callee function are arguments passed in by caller function, and the VMObject in return register is allocated by callee function instead of caller function to avoid unnecessary memory copy. Caller function then assigns the reference to VMObject pointed by return register into its own register (see detail in Invoke instruction).

I summarize some pros and cons for register VM compared to stack VM.

Pros:

Data dependency analysis is self-explaining as register reveals the data dependency. In comparison, stack VM requires you to simulate the program and recover the stack index in order to get data dependency. As a result, it'll be easier to implement a dataflow executor or async executor.
If/else scope is easier to handle. Previously stack VM needs to move the objects to the right place in the stack given it enters if or else branch. Now register VM only needs a new Phi instruction to select the result from either branch.
Tail recursion optimization should be easier since we only need to move the register to the correct slot in the register file.

Cons:

There is some memory overhead to keep the empty slots for registers that are never used.
Need to kill the register after its life cycle. This can be done by either annotating the life cycle of each register or inserting the kill instructions in the program during the life cycle analysis. Note that life cycle analysis pass is needed by both stack VM and register VM.

Instructions

We modify the current stack VM instructions to use registers as operands and results.
Registers are designated by $k where k is the register index. *reg indicates a list of registers.

AllocTensor

AllocTenosr $1, ndim, *shape, dtype  ; %1 = AllocTensor(ndim, *shape, dtype)

Allocate a tensor object and stores to register $1.

AllocDatatype

AllocDatatype $1, tag, num_fields, *regs  ; %1 = AllocDatatype(tag, num_fields, *regs)

Allocate a datatype object and stores to register $1. *regs is a list of registers containing fields in the datatype.

AllocClosure

AllocClosure $1, func_index, num_freevars, *regs  ; %1 = AllocClosure(func_index, num_freevars, *regs)

Allocate a closure object, where there are num_freevars in *regs.

LoadConst

LoadConst $1, const_index  ; %1 = LoadConst(const_index)

Load the constant at const_index from the constant pool.

Mov

Mov $2, $1  ; %2 = Mov(%1)

Create a reference to VMObject in register $1 and stores it in $2.

Note: No data copy happens in this instruction. The real data copy should be a PackedFunction.

Phi

Phi $3, $1, $2  ; %3 = Phi(%1, %2)

Takes the VMObject either in $1 or $2 and stores in $3.

Note: This instruction requires VMObject in register $1 and $2 having the same type, and only one of them should be valid during the runtime.

Ret

Ret $1

Returns the register $1.

GetField

GetField $2, $1, field_index  ; %2 = GetField(%1, field_index)

Get the field at field_index in $1.

If

If $1, true_offset, false_offset

Check if $1 is true or false. If true relative jump by true_branch, else relative jump by false_branch.

Goto

Goto pc_offset

Relative unconditional jump by pc_offset.

InvokePacked

InvokePacked packed_index, arity, output_size, *regs

Invoke the packed function denoted by packed_index

Note: Number of registers in *regs should be arity + output_size where first arity registers are arguments and rest are output registers.

Invoke

Invoke $1, func_index, arity, *regs  ; %1 = Invoke(func_index, arity, *regs)

Invoke VM function at func_index

Note: Number of registers in *regs should be arity. Register $1 will be a reference to the VMObject in the return register.

InvokeClosure

InvokeClosure $2, $1, arity, *regs  ; %2 = InvokeClosure(%1, arity, *regs)

Invoke closure function in register $2 and save the result into register $2.

Note: Register $2 must be a VMClosureObject.

Stack and State

In order to convert to register-based VM, we also need to adjust the data structure in the VM. We assume there are infinite registers available, and each function has its own register file. Each time an Invoke instruction is executed, the runtime creates a new VMFrame. We pre-allocate max number of registers used in the callee function (this number can be derived during VMCompile) and assigns the arguments to the first args slots in the registers.

struct VMFrame {
  // Current function
  size_t func_index;
  // Pre-allocate max number of registers used in the functions
  std::vector<VMObject> registers;
  // Number of arguments
  size_t args;
  // Pointer into the current function's instructions.
  const Instruction* code;
  // Current program counter
  size_t pc;
};

RFC

Source

icemelon9

👍3

Most helpful comment

Summary

@tqchen @icemelon9 @jroesch @zhiics @yongwww we discuss in person. Reached the following consensus:

Remove Phi instruction. Instead extend If to write the result to a new register.
Reuse the existing value stack as the register file. Have an anchor in the function frame to point to the start of each function's register region.
Try to do liveness on Relay AST to reuse the pass manager infrastructure, and to not introduce extra interfaces.

Let me know if I miss anything or said something wrong. I'll take out liveness analysis on opcodes from my branch and polish remaining stuff(register VM + linear scan + interfaces). Since we don't have liveness analysis on Relay AST now, I'll simply generate live interval for each register with the full opcodes range, so register allocator can assign unique slot for each register.

wweic on 2 Apr 2019

👍5

All 9 comments

I proposed an alternative VM design using registers instead of stack. We can discuss and compare which one works better.

cc @jroesch @tqchen @wweic @zhiics @MarisaKirisame @junrushao1994 @abergeron @ajtulloch @antinucleon @Ravenwater

icemelon9 on 27 Mar 2019

In the long tradition of competitive engineering, let's develop both VMs concurrently. That way we will generate actual engineering metrics of the difference between a stack-based and a register-based VM. The engineering dependencies are too complex to productively discuss without actually going through the implementation. The arguments that @jroesch presented, i.e. the stack VM simplifies the instruction management, and the fact that you would need to unroll to discover how operators interact was considered acceptable given the fact that most concurrency is expected to be in the operators themselves.

What the stack vs register organization does to the complexity of resulting compiler analysis code in the context of tensor compositions I think is an open question that can only be answered by doing both implementations. The beauty of that approach is that it will generate a level set of these questions and will offer many an opportunity to publish the results and go on to lucrative commercial engagements as one of the few people in the world that actually has the engineering know-how about the technical details.

Ravenwater on 27 Mar 2019

Thanks for the proposal. I think we can have more discussions on this topic. To make progress on discussions, let us try to divide our argument by points, so we can agree on some of them and see the major disagreements.

For example, instead of saying: register vm allows data flow discovery, but dataflow discovery don't have to be done at the VM level. We break it into two points:

(1) Register vm allows easier dataflow discovery
(2) Dataflow parallel discovery can be done at a higher or lower level and do not have things to do with VM

In our case, most people can agree on (1) but not necessarily (2). This way, we can find what we actually agree on more and move on from that point. I do think we should aim to have only one version of VM after we agree on things. Mainly because it is a better way to have the community pushing for a consistent technical direction.

tqchen on 27 Mar 2019

Some brain dumps from my side.

Design points to consider

Allocating a new frame for each function call can be not quite friendly for deep recursive calls.
- Possible alternative: simply use the stack as registers, e.g. the register number points to an offset to the current stack pointer
SSA vs register reuse
- SSA is easy to generate but can need a lot of registers, which may not bad for deep recursion
- Register alloca(with infinite registers) can provide reuse amount registers, would need
Register ISA is indeed more intuitive during the dump.
Control flow: the implementation of Phi seems to need some thoughts
- Need to be able to explicit reset the not required slot to nullptr, need to think about how that might interplay with register alloca.
- Note: if only relay primitive conditional is needed, do we need a generic Phi?

As mentioned above, one possible tradeoff is to simply use the stack as registers, so the register number indicates an offset to the begin ptr of the current function. This could unify quite a lot of things.

Cost of Register VM

The main argument against a register vm is the implementation complexity given that stack vm is a "fine and concise impl", the question really boils down to do we need these additional features and cost that comes with them, so far the costs are

Cost of new frame alloca for new call
- Can be resolved by using stack frame as register
Cost of larger register slot size
- Possibly need reg alloca, need to think about control flow handling
Control flow handling

tqchen on 30 Mar 2019

I explored @icemelon9 's register vm design in my branch: https://github.com/wweic/tvm/commits/relay-rts. Would like to share some data points.

We need to add special registers in VM for function arguments and return value. Return register is necessary because when callee function returns, its register file has gone. So return value must be persisted outside of register file.

Opcode generation in register VM is easier, compiler doesn't need to maintain stack offsets all along, just memorize variable to register mapping. Then we need register allocation, I tried liveness analysis + linear scan. It's pretty straightforward since opcode is relatively simple. I'm not sure I understand @tqchen 's concern about handing control flow in register allocation. Note that Haichen's register VM uses local register file per function. So we are doing register allocation per function, which shouldn't need to worry about interprocedural register allocations. As for branches inside a function, live interval of register can be the range from the earliest opcode to the last opcode that the register is live. I think linear scan would work fine under this setting.

In register VM we need to allocate a register file(vector<Object>) per call, equal to growing the stack in stack VM. I'm not sure if there will be much performance difference.

Overall register VM is slightly easier to work with, debugging is painless. But I'm not sure if this is convincing enough to favor register VM. Also register VM's opcode size is larger, it might be a concern for embedded devices.

The major selling point of register VM is easy to discover dataflow. But it begs the questions that is register VM better at leveraging the dataflow information? I think it requires more thought. Fundamentally both VMs have call stack and will have similar problems when integrating with async execution engines.

wweic on 1 Apr 2019

👍4

Summary

@tqchen @icemelon9 @jroesch @zhiics @yongwww we discuss in person. Reached the following consensus:

Remove Phi instruction. Instead extend If to write the result to a new register.
Reuse the existing value stack as the register file. Have an anchor in the function frame to point to the start of each function's register region.
Try to do liveness on Relay AST to reuse the pass manager infrastructure, and to not introduce extra interfaces.

wweic on 2 Apr 2019

👍5

I want to highlight that due to public archive principle. The summary of in person discussion only serve as summary information and suggestions instead of the final design decision.

The design decision should be make in this thread, allowing everyone to participate. So at this moment, the discussion is still open as RFC

tqchen on 2 Apr 2019

I propose two changes to instructions.

AllocTensor uses shape stored in the register instead of hardcode shape in the instruction. This can help support dynamic shape in VM in the future. We can store constant shapes in the constant lists, and use LoadConst to load them.
Change Phi to Select (defined below) that takes a condition register to select two values. This allows us to map SSA register number to fewer slots to reduce memory footage.

AllocTensor

AllocTenosr $2, $1, dtype  ; %2 = AllocTensor(%1, dtype)

Allocate a tensor object with shape stored in register $1 and save to register $2.

Select

Select $4, $1, $2, $3  ; %4 = (%1) ? %2 : %3

Select VMObject either in $2 or $3 based on condition register $1, and store in $4.

cc @tqchen @jroesch @wweic @zhiics @yongwww

icemelon9 on 10 Apr 2019

👍3

Seems we have reached consensus about VM design, close for now

tqchen on 16 May 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[Torch] Support aten::tensor, aten::empty and aten::numel

zhiqwang · 4Comments

Add tvm.contrib.sort.argsort to iOS RPC

KayneWest · 6Comments

Cross-Compilation error: error adding symbols: File in wrong format

ysh329 · 6Comments

[WINDOWS][AutoTVM] OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted and OSError: [WinError 10049] The requested address is not valid in its context

Coderx7 · 5Comments

[RPC] Make RPC fork safe on macOS?

eqy · 3Comments