Current Relay VM RFC (#2810) proposes stack-based VM design which uses push and pop to maintain a unified stack. Though the design itself is simple to implement, it is cumbersome in tasks such as dataflow analysis and enforces certain orders in the execution.
I propose a register-based VM design as an alternative design. The main difference is that register-based VM uses registers to designate the operands and results instead of using the stack. Registers in the VM are virtual registers and each is a reference to a VMObject. We assume there are infinite registers so runtime doesn't need to worry about register spilling. Registers are in SSA form where its index is unique in each VMFunction.
Calling convention in register VM is also simple. Each function has its own local register file. The first k registers in the register file of callee function are arguments passed in by caller function, and the VMObject in return register is allocated by callee function instead of caller function to avoid unnecessary memory copy. Caller function then assigns the reference to VMObject pointed by return register into its own register (see detail in Invoke instruction).
I summarize some pros and cons for register VM compared to stack VM.
Pros:
Phi instruction to select the result from either branch.Cons:
We modify the current stack VM instructions to use registers as operands and results.
Registers are designated by $k where k is the register index. *reg indicates a list of registers.
AllocTenosr $1, ndim, *shape, dtype ; %1 = AllocTensor(ndim, *shape, dtype)
Allocate a tensor object and stores to register $1.
AllocDatatype $1, tag, num_fields, *regs ; %1 = AllocDatatype(tag, num_fields, *regs)
Allocate a datatype object and stores to register $1. *regs is a list of registers containing fields in the datatype.
AllocClosure $1, func_index, num_freevars, *regs ; %1 = AllocClosure(func_index, num_freevars, *regs)
Allocate a closure object, where there are num_freevars in *regs.
LoadConst $1, const_index ; %1 = LoadConst(const_index)
Load the constant at const_index from the constant pool.
Mov $2, $1 ; %2 = Mov(%1)
Create a reference to VMObject in register $1 and stores it in $2.
Note: No data copy happens in this instruction. The real data copy should be a PackedFunction.
Phi $3, $1, $2 ; %3 = Phi(%1, %2)
Takes the VMObject either in $1 or $2 and stores in $3.
Note: This instruction requires VMObject in register $1 and $2 having the same type, and only one of them should be valid during the runtime.
Ret $1
Returns the register $1.
GetField $2, $1, field_index ; %2 = GetField(%1, field_index)
Get the field at field_index in $1.
If $1, true_offset, false_offset
Check if $1 is true or false. If true relative jump by true_branch, else relative jump by false_branch.
Goto pc_offset
Relative unconditional jump by pc_offset.
InvokePacked packed_index, arity, output_size, *regs
Invoke the packed function denoted by packed_index
Note: Number of registers in *regs should be arity + output_size where first arity registers are arguments and rest are output registers.
Invoke $1, func_index, arity, *regs ; %1 = Invoke(func_index, arity, *regs)
Invoke VM function at func_index
Note: Number of registers in *regs should be arity. Register $1 will be a reference to the VMObject in the return register.
InvokeClosure $2, $1, arity, *regs ; %2 = InvokeClosure(%1, arity, *regs)
Invoke closure function in register $2 and save the result into register $2.
Note: Register $2 must be a VMClosureObject.
In order to convert to register-based VM, we also need to adjust the data structure in the VM. We assume there are infinite registers available, and each function has its own register file. Each time an Invoke instruction is executed, the runtime creates a new VMFrame. We pre-allocate max number of registers used in the callee function (this number can be derived during VMCompile) and assigns the arguments to the first args slots in the registers.
struct VMFrame {
// Current function
size_t func_index;
// Pre-allocate max number of registers used in the functions
std::vector<VMObject> registers;
// Number of arguments
size_t args;
// Pointer into the current function's instructions.
const Instruction* code;
// Current program counter
size_t pc;
};
I proposed an alternative VM design using registers instead of stack. We can discuss and compare which one works better.
cc @jroesch @tqchen @wweic @zhiics @MarisaKirisame @junrushao1994 @abergeron @ajtulloch @antinucleon @Ravenwater
In the long tradition of competitive engineering, let's develop both VMs concurrently. That way we will generate actual engineering metrics of the difference between a stack-based and a register-based VM. The engineering dependencies are too complex to productively discuss without actually going through the implementation. The arguments that @jroesch presented, i.e. the stack VM simplifies the instruction management, and the fact that you would need to unroll to discover how operators interact was considered acceptable given the fact that most concurrency is expected to be in the operators themselves.
What the stack vs register organization does to the complexity of resulting compiler analysis code in the context of tensor compositions I think is an open question that can only be answered by doing both implementations. The beauty of that approach is that it will generate a level set of these questions and will offer many an opportunity to publish the results and go on to lucrative commercial engagements as one of the few people in the world that actually has the engineering know-how about the technical details.
Thanks for the proposal. I think we can have more discussions on this topic. To make progress on discussions, let us try to divide our argument by points, so we can agree on some of them and see the major disagreements.
For example, instead of saying: register vm allows data flow discovery, but dataflow discovery don't have to be done at the VM level. We break it into two points:
In our case, most people can agree on (1) but not necessarily (2). This way, we can find what we actually agree on more and move on from that point. I do think we should aim to have only one version of VM after we agree on things. Mainly because it is a better way to have the community pushing for a consistent technical direction.
Some brain dumps from my side.
As mentioned above, one possible tradeoff is to simply use the stack as registers, so the register number indicates an offset to the begin ptr of the current function. This could unify quite a lot of things.
The main argument against a register vm is the implementation complexity given that stack vm is a "fine and concise impl", the question really boils down to do we need these additional features and cost that comes with them, so far the costs are
I explored @icemelon9 's register vm design in my branch: https://github.com/wweic/tvm/commits/relay-rts. Would like to share some data points.
We need to add special registers in VM for function arguments and return value. Return register is necessary because when callee function returns, its register file has gone. So return value must be persisted outside of register file.
Opcode generation in register VM is easier, compiler doesn't need to maintain stack offsets all along, just memorize variable to register mapping. Then we need register allocation, I tried liveness analysis + linear scan. It's pretty straightforward since opcode is relatively simple. I'm not sure I understand @tqchen 's concern about handing control flow in register allocation. Note that Haichen's register VM uses local register file per function. So we are doing register allocation per function, which shouldn't need to worry about interprocedural register allocations. As for branches inside a function, live interval of register can be the range from the earliest opcode to the last opcode that the register is live. I think linear scan would work fine under this setting.
In register VM we need to allocate a register file(vector<Object>) per call, equal to growing the stack in stack VM. I'm not sure if there will be much performance difference.
Overall register VM is slightly easier to work with, debugging is painless. But I'm not sure if this is convincing enough to favor register VM. Also register VM's opcode size is larger, it might be a concern for embedded devices.
The major selling point of register VM is easy to discover dataflow. But it begs the questions that is register VM better at leveraging the dataflow information? I think it requires more thought. Fundamentally both VMs have call stack and will have similar problems when integrating with async execution engines.
@tqchen @icemelon9 @jroesch @zhiics @yongwww we discuss in person. Reached the following consensus:
Phi instruction. Instead extend If to write the result to a new register. Let me know if I miss anything or said something wrong. I'll take out liveness analysis on opcodes from my branch and polish remaining stuff(register VM + linear scan + interfaces). Since we don't have liveness analysis on Relay AST now, I'll simply generate live interval for each register with the full opcodes range, so register allocator can assign unique slot for each register.
I want to highlight that due to public archive principle. The summary of in person discussion only serve as summary information and suggestions instead of the final design decision.
The design decision should be make in this thread, allowing everyone to participate. So at this moment, the discussion is still open as RFC
I propose two changes to instructions.
AllocTensor uses shape stored in the register instead of hardcode shape in the instruction. This can help support dynamic shape in VM in the future. We can store constant shapes in the constant lists, and use LoadConst to load them.Phi to Select (defined below) that takes a condition register to select two values. This allows us to map SSA register number to fewer slots to reduce memory footage.AllocTenosr $2, $1, dtype ; %2 = AllocTensor(%1, dtype)
Allocate a tensor object with shape stored in register $1 and save to register $2.
Select $4, $1, $2, $3 ; %4 = (%1) ? %2 : %3
Select VMObject either in $2 or $3 based on condition register $1, and store in $4.
cc @tqchen @jroesch @wweic @zhiics @yongwww
Seems we have reached consensus about VM design, close for now
Most helpful comment
Summary
@tqchen @icemelon9 @jroesch @zhiics @yongwww we discuss in person. Reached the following consensus:
Phiinstruction. Instead extendIfto write the result to a new register.Let me know if I miss anything or said something wrong. I'll take out liveness analysis on opcodes from my branch and polish remaining stuff(register VM + linear scan + interfaces). Since we don't have liveness analysis on Relay AST now, I'll simply generate live interval for each register with the full opcodes range, so register allocator can assign unique slot for each register.