There is a drop instruction to drop a topmost value of the stack, but there's no dup instruction to duplicate the topmost value.
It would be very useful in cases like raising to power. Imagine we're going to square a value from memory. Currently it can be done like this:
(f32.load (get_local $ptr))
(f32.load (get_local $ptr))
(f32.mul)
or
(f32.load (get_local $ptr))
(tee_local $tmp)
(get_local $tmp)
(f32.mul)
It would be great to be able (in the _future_) to do something like this:
(f32.load (get_local $ptr))
(dup)
(f32.mul)
No duplicate reads, no temporary variables.
Other use case could be the validation of the topmost value:
;; ... lots of computation leaving a very important i32 on top ...
(dup) ;; duplicate the value so we can use it after the validity check
(i32.le_s (i32.const 0)) ;; is the value valid?
(if
(then (; ... do some useful stuff with that value ;) )
)
This is discussed some here: https://github.com/WebAssembly/multi-value/blob/master/proposals/multi-value/Overview.md#open-questions.
That said, when it came up in the most recent CG meeting, the consensus was to not provide a pick operator in the current proposal: https://github.com/WebAssembly/meetings/blob/master/2017/CG-07.md#proposal-multiple-return-values-and-generalised-block-types
I checked those links. And also a search turned up:
https://github.com/WebAssembly/spec/blob/5b3a5bb0a92a7935e621eb7b4452670d4f5f6b1e/document/core/appendix/algorithm.rst#validation-of-opcode-sequences
"It is an invariant under the current WebAssembly instruction set that an operand of Unknown type is never duplicated on the stack. This would change if the language were extended with stack operators like dup. Under such an extension, the above algorithm would need to be refined by replacing the Unknown type with proper type variables to ensure that all uses are consistent."
So it seems that we don't have dup/pick because we won't always know what type of value is at the top-of-stack. Is there any problem with having i32.dup i64.dup etc? Or is it that we don't want to pin down this in case it needs to change for some other reason? Is that what this is saying? https://github.com/WebAssembly/meetings/blob/master/2017/CG-07.md#proposal-multiple-return-values-and-generalised-block-types
Missing these stack manipulation instructions: dup, pick, swap, rot; is a big surprise.
No, simplifying typing wasn't the reason, that came up much later. The reason why there are no stack juggling instructions in Wasm simply is that it already provides local variables, which are more general and often easier to handle for producers. That makes dup and friends mostly redundant.
PS, for historic background: Wasm started out as an expression language, not a stack machine. So naturally it didn't have such instructions. Later we reinterpreted the existing binary format as a stack machine by simply dropping some restrictions. At that point we could have introduced new instructions, but nobody felt a particular need to do so.
Thanks @rossberg
No, simplifying typing wasn't the reason, that came up much later. The reason why there are no stack juggling instructions in Wasm simply is that it already provides local variables, which are more general and often easier to handle for producers. That makes dup and friends mostly redundant.
I find that for some things stack juggling is easier. Let's say you're generating code for let x = ... in ...
You can run the first expression, and then while executing the 2nd expression rememeber the depth of x. And just "pick" that depth to get a copy of x. Under-the-hood the engine would remember where it left x (eg which machine register) and retrive it. The lifetime of x is also clear (but would have to be analysed for a local & register allocation). I suppose it'd be good for a baseline compiler but less good for an optimising compiler.
Another use I have is when generating case x of c1 -> e1, c2 -> e2,... For each of the case tests you may need a fresh x on the top of the stack (so you can do some testing). If X is already a local then using get_local each time is great. If x is an expression then it's easiest just to do "dup" before each case test. This last option seems more direct and easier for a code generator to optimise.
There are probably other users.
PS, for historic background: Wasm started out as an expression language, not a stack machine. So naturally it didn't have such instructions. Later we reinterpreted the existing binary format as a stack machine by simply dropping some restrictions. At that point we could have introduced new instructions, but nobody felt a particular need to do so.
That is interesting. I can see how it came out that way and I had been wondering why have the expressions there. Thanks.
You seem to assume that there are significant differences in how a VM handles locals vs stack operands. But in practice, a VM like V8 converts both into the equivalent of SSA variables right away, so it doesn't actually care. They will go through the same register allocation and code generation path. I assume other optimising VMs are similar (though the situation may be somewhat different for baseline compilers).
That also means that there is no significant cost to using more locals -- you can just map every let-variable to a fresh Wasm local, which should be even easier for your codegen than tracking relative stack depths.
As for case, it seems natural to simply put the scrutinee into a fresh local. I don't believe that stack shuffling will be easier once you're dealing with more interesting patterns and decision graphs -- it's a good example of a technique that doesn't scale well. But if you must, you can always encode (dup) with (tee_local $x) (get_local $x) for some scratch local $x.
That also means that there is no significant cost to using more locals
For performance, perhaps not, though it will probably increase module size over clever use of the stack.
You seem to assume that there are significant differences in how a VM handles locals vs stack operands. But in practice, a VM like V8 converts both into the equivalent of SSA variables right away, so it doesn't actually care).
I figured something like that for optimising VMs, but for fast baseline VMs I assume they're used more directly without def/use analysis and such.
Regarding your other points. That locals become SSA variables and having many non-shared ones isn't a big deal & case analysis. I you're right, all these things end up being equivalent once you optimise them. I was just surprised because I attempted to generate a "dup" instruction assuming it was there and couldn't find it anywhere in the docs. I had come to expect it from stack machines.
If it and pick/swap/rot were added, it'd be a convenience to anyone generating wasm, and might be okay if it didn't add any cost/maintenance burden. But I'm not familiar enough with the wasm tools (like wabt) or the JITs in the various engines to judge this myself.
I think it'd be a mistake to optimize wasm for baseline compilation. It's good to ensure that one-pass baseline compilation (with reasonable backpatching) remains possible, but beyond that we should really worry about what a good compiler can do with the code.
Encoding efficiency is important, obviously. However most wasm functions have a small number of params+locals already, so we could introduce specialized one-byte opcodes a la Java if we really, really worried about that (get/set/teelocal0, get/set/teelocal1, ..., for the first eight or so) - we still have some one-byte opcodes left.
I'm not convinced by the convenience argument either. For a simple compiler from a language that matches wasm's semantics well it's probably true that stack operations are a small benefit. For all other compilers and languages the stack is probably not really all that interesting, it's an encoding optimization if you can manage to generate code for operands in stack order, otherwise it's just in the way.
Hi @lars-t-hansen
I don't think it's optimising wasm for baseline. If you did make compilation easier for baseline compilers it'd have no effect for an optimising compiler, that's going to turn each stack push into a new SSA variable anyway (probably, I guess) and do its thing.
I'd like to clarify what I was getting at WRT convenience. I don't mean making compiler writers' jobs easier. But if the wasm code is closer to the intentions of that compiler then hopefully that means simpler code that a baseline compiler can take advantage of. For example on a single pass you don't know which locals can share a register or stack slot. But variables on the wasm stack have a fixed lifetime and it's clear which overlap (some will have longer than necessary lifetimes and that's where converting it all to SSA would be optimal).
In case I was unclear, I don't want to remove locals. I was just assuming we'd have both so that we can use whichever fits best.
Just out of curiosity here, @eholk, would a dup instruction (or other traditional stack machine operations) have been useful for https://github.com/google/schism?
Most helpful comment
I checked those links. And also a search turned up:
https://github.com/WebAssembly/spec/blob/5b3a5bb0a92a7935e621eb7b4452670d4f5f6b1e/document/core/appendix/algorithm.rst#validation-of-opcode-sequences
So it seems that we don't have dup/pick because we won't always know what type of value is at the top-of-stack. Is there any problem with having i32.dup i64.dup etc? Or is it that we don't want to pin down this in case it needs to change for some other reason? Is that what this is saying? https://github.com/WebAssembly/meetings/blob/master/2017/CG-07.md#proposal-multiple-return-values-and-generalised-block-types
Missing these stack manipulation instructions: dup, pick, swap, rot; is a big surprise.