This came up occasionally in discussions but AFAIK there's no issue associated with it and IMO there should be due to the significant (negative) impact GT_ASG nodes have on RyuJIT's IR.
IMO it's one of the worst IR "features", if not plain and simple the worst. So far I've only encountered cons and no pros:
- GT_LCL_VAR on the RHS is a use but on the LHS is a def. The JIT uses various means to deal with this issue - GTF_VAR_DEF, GTF_IND_ASG_LHS, gtGetParent() etc.
- Finding the value assigned to a local is cumbersome: given the GT_LCL_VAR node on the LHS, you need to use gtGetParent() to get the GT_ASG node and then get the assignment's second operand. If you're lucky the LHS is adjacent to the assignment, otherwise gtGetParent() will need to traverse multiple nodes.
- It doesn't map well to actual machine code: mov eax, ebx is an assignment, but only because eax and ebx happen to be enregistered variables. Otherwise you'd be looking at mov [eax], ebx, which is more similar to a store.

Anyone know of any advantages?
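To illustrate the second point, here's a hedged sketch - `gtGetParent` and the `gtGetOp1`/`gtGetOp2` accessors are real GenTree APIs, but the surrounding code is hypothetical:

```c++
// With GT_ASG the LCL_VAR node on the LHS doesn't know the value being
// stored to it; you have to walk up to the assignment node first.
GenTree** parentLink;
GenTree*  user = lclVar->gtGetParent(&parentLink);
if ((user != nullptr) && user->OperIs(GT_ASG) && (user->gtGetOp1() == lclVar))
{
    GenTree* storedValue = user->gtGetOp2(); // the def's value, at last
}

// With a store oper, the def and the stored value live in a single node.
if (node->OperIs(GT_STORE_LCL_VAR))
{
    GenTree* storedValue = node->gtGetOp1();
}
```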
AFAIR Phoenix did have Assign but its IR was very different, looking more like assembly. I know next to nothing about LLVM but I don't think it has assignment.
category:implementation
theme:ir
skill-level:expert
cost:extra-large
Unfortunately getting rid of GT_ASG requires a ton of work so it's not clear when and if that may happen.
I'm going to take a closer look to see what needs to be done, if only for the sake of curiosity.
I cleaned up some GT_ASG_op leftovers in dotnet/coreclr#18205 and in the process got rid of OperIsAssignment, which should make searching for GT_ASG uses slightly easier. There are some 120 uses of GT_ASG in code, not including comments and asserts. And then there are some 40 uses of gtNewAssignNode (see https://gist.github.com/mikedn/cc0756edf8ae8eab66356d4275d62c16 for a list).
Doing it all at once may be difficult, but it's not clear whether doing it in smaller steps is feasible either. For a while I thought that it may be good to do this first for the SSA phases, but it's not so clear if that's actually simpler because these still involve morph, and assertion propagation has shared code for local and global propagation.
Another way to split work may be scalar/struct or lclvar/indir. Dealing with scalars first may be easier because it avoids potential first-class structs complications. Dealing with indirs first may be easier because the JIT doesn't do a lot of work with indirs, compared to lclvars.
There may be complications with lclvars written to via indirs. For some reason the JIT keeps checking for such cases, instead of simply converting IND(ADDR(LCL_VAR)) to LCL_VAR.
And finally, I've made one bone-headed attempt to generate a GT_STOREIND in the importer to see what happens. Rather surprisingly, it didn't quite crash and burn when compiling a trivial method, but the store was lost along the way. Adding GT_STOREIND to OperRequiresAsgFlag solved the problem. So I suppose it looks somewhat encouraging; I was expecting an interminable stream of asserts. Of course, attempting to compile anything more complex does produce just that.
@dotnet/jit-contrib
There's been a general desire to import "rationalized" IR and therefore remove the "rationalizer" phase (after which there is no GT_ASG). Removing GT_ASG might be the biggest part of that work. This phase was introduced to allow the new RyuJIT downstream phases to not deal with GT_ASG but leave the legacy codegen alone. Now that legacy backend is gone, it's "just work" to move it -- likely, teaching morph, VN, SSA, etc., about rationalized IR.
@mikedn thanks for opening this issue!
The RyuJIT overview mentions the desire to move the Rationalized IR earlier in the JIT: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md#rationalization
And actually, I think that eliminating COMMA nodes will be potentially more challenging, but I could be wrong.
The RyuJIT overview mentions the desire to move the Rationalized IR earlier in the JIT:
Thanks, I knew I saw some discussion somewhere but couldn't remember where, I thought it was in a PR/issue. That document also mentions GT_ADDR which has similar issues. Removing that is probably less work but it's not clear to me what is better to do first - remove GT_ASG or remove GT_ADDR, or maybe both at the same time. Or maybe remove IND(ADD(LCL)).
And actually, I think that eliminating COMMA nodes will be potentially more challenging, but I could be wrong.
Hmm, I'd put that in a different bucket - side effects and ordering. I used to think that moving the whole JIT to LIR could be a good idea. Now I think that, while it may be better than the current implementation, it's not the best solution - LIR is good for some things and bad for others (try doing expression reassociation, you'll either battle with maintaining the LIR links or "sequence" the whole thing and take a throughput hit). So I have this vague idea of a model where only side effects are maintained in linear order while other nodes roam freely and are referenced only via side effect nodes. But that's another issue, and my current impression is that if we do any of this, removing GT_ASG should be done first, as it may simplify things.
I was looking at lclvar nodes and noticed something that may explain how GT_ASG came to be: in the assignment model GenTreeLclVar probably was always a leaf. Now it is a unary op, to support GT_STORE_LCL_VAR. Perhaps it would be better to have a GenTreeStoreLclVar that's a special op and has a GenTree* operand in it.
In general, it may be useful to revisit the lclvar node hierarchy as it may be possible to improve some SSA related stuff - either replace the SSA number with a pointer to the SSA def (problematic now because on 64 bit targets it increases the node size) or find room to store an additional SSA number for GTF_USE_ASG to avoid the need for the side table.
If something like GenTreeStoreLclVar is to be added then it's probably best done before removing GT_ASG. After that the use of GT_STORE_LCL_VAR will increase significantly and changes around it might require more work.
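For illustration, such a node could look roughly like this - a sketch only, neither the type nor the constructor exists in the JIT today:

```c++
// Sketch: GT_LCL_VAR goes back to being a leaf, while the store variant
// carries its value operand directly. Names are illustrative.
struct GenTreeStoreLclVar : public GenTreeLclVarCommon
{
    GenTree* gtValue; // the stored value - what used to be the ASG's RHS

    GenTreeStoreLclVar(var_types type, unsigned lclNum, GenTree* value)
        : GenTreeLclVarCommon(GT_STORE_LCL_VAR, type, lclNum), gtValue(value)
    {
    }
};
```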
Removing assignment (and GT_ADDR) would require dealing with an interesting SSA issue - assigning to a struct field that is itself a struct produces IR like:
[000013] ------------ * STMT void (IL 0x002...0x00A)
[000010] -----+------ | /--* CNS_INT int 0
[000012] IA---+------ \--* ASG struct (init)
[000011] x----+-N---- \--* BLK(12) struct
[000009] -----+------ \--* ADDR byref
[000006] U----+-N---- \--* LCL_FLD struct V01 loc0 [+0] Fseq[s3]
Despite the fact that the assignment is "indirect", the LCL_FLD node has GTF_VAR_DEF set and acts as an SSA definition. If we remove GT_ASG and GT_ADDR we should end up with something like STORE_BLK(LCL_FLD_ADDR, CNS_INT). So which node is the SSA definition? It can't really be LCL_FLD_ADDR, so it has to be STORE_BLK. But making an indirect store act as an SSA definition sounds a bit weird.
Ideally there should be no indirect store needed. But how to avoid it, add STORE_LCL_FLD_BLK? I guess it would work but it has to be a large node.
Assignment to array elements produces somewhat strange trees:
( 18, 21) [000022] ------------ * STMT void (IL 0x006...0x00E)
N017 ( 1, 1) [000019] ------------ | /--* LCL_VAR int V04 loc1 u:3 $180
N018 ( 18, 21) [000021] -A-XG------- \--* ASG int $VN.Void
N014 ( 4, 4) [000020] a--XG--N---- | /--* IND int $180
N012 ( 1, 1) [000044] ------------ | | | /--* CNS_INT long 32 Fseq[#ConstantIndex, #FirstElem] $281
N013 ( 2, 2) [000045] -------N---- | | \--* ADD byref $340
N011 ( 1, 1) [000037] ------------ | | \--* LCL_VAR ref V06 tmp1 u:2 (last use) <l:$301, c:$201>
N015 ( 12, 15) [000046] ---XG------- | /--* COMMA int $VN.Void
N010 ( 8, 11) [000040] ---X-------- | | \--* ARR_BOUNDS_CHECK_Rng void <l:$30e, c:$30d>
N007 ( 1, 1) [000018] ------------ | | +--* CNS_INT int 4 $43
N009 ( 3, 3) [000039] ---X-------- | | \--* ARR_LENGTH int <l:$143, c:$142>
N008 ( 1, 1) [000036] ------------ | | \--* LCL_VAR ref V06 tmp1 u:2 <l:$301, c:$201>
N016 ( 16, 19) [000047] -A-XG--N---- \--* COMMA int $VN.Void
N004 ( 4, 4) [000017] ---XG------- | /--* IND ref <l:$304, c:$303>
N002 ( 1, 1) [000048] ------------ | | | /--* CNS_INT long 8 field offset Fseq[a] $280
N003 ( 2, 2) [000049] -------N---- | | \--* ADD byref $2c0
N001 ( 1, 1) [000016] ------------ | | \--* LCL_VAR ref V02 arg2 u:1 $c0
N006 ( 4, 4) [000035] -A-XG---R--- \--* ASG ref <l:$304, c:$303>
N005 ( 1, 1) [000034] D------N---- \--* LCL_VAR ref V06 tmp1 d:2 <l:$301, c:$201>
The LHS of the assignment isn't an IND node, it's a COMMA. The actual IND node is hidden behind the comma. To get rid of ASG we'll need to make it so that the COMMA tree produces the address (as a byref) and then ASG + IND can be easily changed to a STOREIND.
We should probably ensure that the ASG LHS is always a "location" node (LCL_VAR, LCL_FLD, IND etc.) before attempting to remove ASG nodes.
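Sketched in terms of the trees above - the GenTreeStoreInd constructor is real, but `comma`, `value` and `compiler` are assumed to be in scope and the rest is glossed over:

```c++
// Before: ASG(COMMA(ARR_BOUNDS_CHECK, IND(addr)), value)
//   - the COMMA produces the int element, the IND is buried inside it.
// After:  STOREIND(COMMA(ARR_BOUNDS_CHECK, addr), value)
//   - the COMMA now produces only the byref address.
GenTree* addr  = comma; // COMMA(ARR_BOUNDS_CHECK, ADD(array, offset)) : byref
GenTree* store = new (compiler, GT_STOREIND) GenTreeStoreInd(TYP_INT, addr, value);
```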
There are some COMMA patterns that aren't well handled by CSE, and in my (somewhat dated) experience of changing the block stores to assignments, I ran into some of those. When we transform these array element assignments we'll have to ensure that we preserve CSE's ability to reason about them.
Speaking of block stores/assignments, assigning a struct to an array element generates a different tree. The LHS of the ASG is a BLK (good), the COMMA produces an address (good), the address is produced by taking the address of a struct IND (kind of weird), the IND doesn't have GTF_DONT_CSE set (scary):
-A-XG------- * ASG struct (copy)
n--XG------- +--* BLK(24) struct
---XG------- | \--* COMMA byref $80
---X-------- | +--* ARR_BOUNDS_CHECK_Rng void
------------ | | +--* LCL_VAR int V01 arg1 u:1
---X-------- | | \--* ARR_LENGTH int
------------ | | \--* LCL_VAR ref V04 loc1 u:2
---XG------- | \--* ADDR byref $80
a--XG------- | \--* IND struct
-------N---- | \--* ADD byref
------------ | +--* LCL_VAR ref V04 loc1 u:2
-------N---- | \--* ADD long
-------N---- | +--* LSH long
------------ | | +--* MUL long
------------ | | | +--* CAST long <- int
i----------- | | | | \--* LCL_VAR int V01 arg1 u:1
------------ | | | \--* CNS_INT long 3
-------N---- | | \--* CNS_INT long 3
------------ | \--* CNS_INT long 16 Fseq[#FirstElem]
------------ \--* LCL_VAR struct V06 loc3 u:3 (last use)
I guess the only thing that prevents CSE from messing with the indir is the fact that it's a struct. Well, at least the assignment part is clear, it simply maps to a STORE_BLK.
@CarolEidt What do you think about adding a GenTreeLclFldBlk node? Unfortunately I don't see any other way to deal with https://github.com/dotnet/coreclr/issues/19412#issuecomment-420736412
Ideally we'd just add the block size to GenTreeLclFld but there's no room for it on 32 bit hosts (on 64 bit there's padding between gtLclOffs and gtFieldSeq).
Potential alternatives:
- Remove gtFieldSeq and recover it from the offset. I find the need for FieldSeq on LCL_FLD nodes to be rather unfortunate but I'm not sure if it's feasible to remove it.

I'm out of the office today and it's hard to look at the sources on my phone ;-) but I'd like to see us split off the variants of LclFld. One that represents an actual field, and from which one can recover the handle info (and therefore the size), and a variant that represents a field access for an opaque or possibly overlapping field access. I'm not crazy about gtFieldSeq and would love to see that reworked, but it's been a while since I've looked at that.
If necessary I think I'd prefer to constrain size and offset over using a large node for the common case.
I'm out of the office today and it's hard to look at the sources on my phone ;-) but I'd like to see us split off the variants of LclFld. One that represents an actual field, and from which one can recover the handle info (and therefore the size) and a variant that represents a field access for an opaque or possibly overlapping field access.
Hmm, I know you mentioned this in the past but I don't quite understand what would be the advantage of having a variant of LclFld. To me it seems simpler to do something like:
- If the node's type is not TYP_STRUCT then it's simple, the size/layout is known because it's a primitive type (or SIMD type)
- If the node's type is TYP_STRUCT then there should be another piece of information in the node that describes the layout of the struct. Not a handle, something more generic, something that can "point to" a handle, size, layout, GC info etc.

If necessary I think I'd prefer to constrain size and offset over using a large node for the common case.
Turns out that the offset is already constrained: https://github.com/dotnet/coreclr/blob/631407852a6332a659a6a57ede670d3e94c81efb/src/jit/morph.cpp#L13142-L13145
So an unsigned short offset and an unsigned short index into a "layout table" should do.
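Something like this - a sketch of how the real GenTreeLclFld could be repacked; gtLayoutNum is a hypothetical field name:

```c++
// Sketch: the current 32 bit offset field is split into two 16 bit
// pieces - the (already constrained) offset and a layout table index.
struct GenTreeLclFld : public GenTreeLclVarCommon
{
    unsigned short gtLclOffs;   // offset into the local, < 2^16 per the morph check above
    unsigned short gtLayoutNum; // index into a per-compiler "layout table", 0 = no struct layout
};
```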
@mikedn - I like the idea of using an offset and an index into a "layout table" (I'm sure there are other places where we separately allocate the layout table, and could instead index into a common table).
I think there are places that expect most struct values to return a handle (i.e. by calling gtGetStructHandle()), where all they really need to do is to get the gc layout and/or size - so there may be some places that will need to change to accommodate this, but also some opportunity to reduce the need to carry around and/or query the handles.
Oh, and to clarify - the reason that I thought it would be useful to have two variants of lclFld would be to distinguish the variant that needs to be handled conservatively (accessing a field through a different type and/or accessing a field of a struct with overlapping fields) from the one that is non-aliasing. While it may be moot for the front-end, where we rely on value numbers, it may be useful in reducing the places (e.g. in morph) where we treat all lclFlds conservatively, as well as downstream where we no longer have valid SSA/value numbers.
accessing a field through a different type
Hmm, some of these cases could be handled in a different manner. For example, accessing an int variable as float could be morphed to BITCAST<float>(LCL_VAR<int>) instead of LCL_FLD<float>. Combined with "reg optional" we could then generate movd xmm4, edx or movss xmm4, dword ptr [rbp-n].
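A minimal sketch of what that morph could look like - the helper itself is hypothetical, while GT_BITCAST and the gtNew* constructors are real JIT APIs:

```c++
// Sketch only: turn LCL_FLD<float> over a TYP_INT local into
// BITCAST<float>(LCL_VAR<int>) so the local can stay enregistered.
GenTree* MorphReinterpretedLocal(Compiler* comp, GenTreeLclFld* lclFld)
{
    LclVarDsc* varDsc = &comp->lvaTable[lclFld->gtLclNum];

    // Only the full-width int-as-float case, for brevity.
    if ((lclFld->gtLclOffs == 0) && (lclFld->TypeGet() == TYP_FLOAT) && (varDsc->lvType == TYP_INT))
    {
        GenTree* lclVar = comp->gtNewLclvNode(lclFld->gtLclNum, TYP_INT);
        return comp->gtNewOperNode(GT_BITCAST, TYP_FLOAT, lclVar);
    }
    return lclFld;
}
```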
I'm sure there are other places where we separately allocate the layout table, and could instead index into a common table
It looks to me that the best place to start is actually GT_BLK/OBJ. GT_OBJ in particular, because it has so much stuff in it that it ends up being a large node. Very early work, just enough to show the approach and survive some diffing (no diffs): https://github.com/mikedn/coreclr/commit/086c7fa1579295f46142457c4aea07c41d1ede35
Block size, class handle and gc pointer table are moved to a separate "layout" object and GenTreeBlk/Obj only maintains a pointer to that. The same "layout" object is shared by multiple nodes (by means of a hashtable). The "layout" object is immutable, in order to avoid nasty accidents (right now it's only partially immutable because SetGCInfo is called multiple times, it really should only be called once).
IMO it has a bunch of advantages:
- GenTreeObj becomes a small node. In fact we can probably get rid of it and GenTreeBlk and just put all the necessary data in GenTreeIndir.
- It can be used in LclVarDsc as well, to replace lvStructGcCount and lvGcLayout. While it doesn't make LclVarDsc smaller (it replaces one pointer with another pointer), we get a consistent representation of struct layout throughout the JIT.

The need for a hashtable might be a disadvantage, but the number of struct types you encounter in a method is usually very small so it shouldn't be a problem. And the hashtable lookup cost is likely to be smaller than the redundant VM calls we can avoid.
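For concreteness, here's a self-contained sketch of the interning scheme - the types and names are illustrative, not the actual change:

```c++
#include <cstdint>
#include <unordered_map>

// Sketch of the shared, immutable "layout" object described above.
// In the real change this would live in the compiler, keyed by class handle.
struct StructLayout
{
    unsigned       size;        // block size in bytes
    void*          classHandle; // CORINFO_CLASS_HANDLE in the JIT
    const uint8_t* gcPtrs;      // GC pointer map, set once at creation
};

class LayoutTable
{
    std::unordered_map<void*, StructLayout*> m_map; // one layout per handle

public:
    // Returns the canonical layout for a class handle, creating it on
    // first use. Multiple GenTreeObj/Blk nodes then share one pointer.
    StructLayout* GetLayout(void* classHandle, unsigned size, const uint8_t* gcPtrs)
    {
        auto it = m_map.find(classHandle);
        if (it != m_map.end())
        {
            return it->second;
        }
        StructLayout* layout = new StructLayout{size, classHandle, gcPtrs};
        m_map.emplace(classHandle, layout);
        return layout;
    }
};
```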
Once this is done it should be pretty easy to extend it to GT_LCL_FLD, we just need to somehow get a short table index instead of a pointer and modify some code here and there to support struct typed GT_LCL_FLD. Well, it's probably not that easy but it's a start.
some of these cases could be handled in a different manner.
Yes, I think that, in general, using BITCAST (or CAST in the case of a partial reference where we need to do a zero- or sign-extend) would be a better approach. AFAIK though the front-end doesn't really know how to handle BITCAST yet, and may not expect a CAST in some places where we're currently using LCL_FLD. But I would really like to see us eliminate the use of LCL_FLD for ABI handling purposes.
AFAIK though the front-end doesn't really know how to handle BITCAST yet, and may not expect a CAST in some places where we're currently using LCL_FLD
I'm more concerned about the backend. I'm not sure if and how we'll be able to handle something like BITCAST between long and double on x86. On x64 we have movd (and there's something similar on ARM64) but on x86 we need to pass the BITCAST node through decomposition somehow and the codegen will likely need a temporary memory location.
Now, if we allow LCL_FLD with type struct the question is where should they be generated. The importer naturally generates BLK(FIELD(ADDR(LCL_VAR))) trees and then morph converts them to BLK(ADDR(LCL_FLD)). And we want a single LCL_FLD node having type struct.
Looks like I'll end up trying to do this kind of stuff in LocalAddressVisitor, where this transform IMO belongs. But that means that we'll have to deal with struct promotion and effectively do "morph copy block" at the same time, in LocalAddressVisitor. Or, as @sandreenko proposed in another issue, move struct promotion after LocalAddressVisitor so all the promotion details, including block copy, are handled during morph. Kind of strange how things fit together sometimes.
@CarolEidt Since GT_BITCAST keeps coming up, here's a quick experiment that transforms *(int*)&float_var into GT_BITCAST: https://github.com/mikedn/coreclr/commit/730badc41e6995504471302afe66dcae5c7970a4
It's quite trivial. Saves ~1KB in corelib. The only problem will be adding support for long-double conversion on x86 (and ARM32 I guess).
It's quite trivial. Saves ~1KB in corelib
Would a similar optimization around Unsafe.As (when TFrom and TTo are the same size) be worth looking into?
Would a similar optimization around Unsafe.As (when TFrom and TTo are the same size) be worth looking into?
For what types?
For what types?
I wasn't thinking about hardcoding it to particular types, but rather seeing if a more generalized option would be possible (any two structs that are both blittable and the same size is likely a good basis).
S.P.Corelib (and CoreFX) use Unsafe.As in a number of places to effectively reinterpret-cast a given ref to another type. Some examples are code in System.Enum, System.BitConverter, System.Buffer, etc.
I'm just recalling a previous conversation around Unsafe.As and Unsafe.AsRef not always doing the most optimal thing.

any two structs that are both blittable and the same size is likely a good basis
Well, with some exceptions structs aren't enregistrable. Since they're already in memory, reinterpretation shouldn't have much impact on the generated code. That said, I don't remember looking at such cases so perhaps there are some improvements that could be made.
I'm just recalling a previous conversation around Unsafe.As and Unsafe.AsRef not always doing the most optimal thing
Perhaps that was about primitives and not structs? The problem with primitives is that the current handling of reinterpretation (by using GT_LCL_FLD) prevents the variable from being enregistered.
There is an interesting case that BITCAST (well, at least in its current form) can't handle and that came up in "real world" (Decimal) code:
```C#
struct Pair { public int lo, hi; }
…
long l = *(long*)&pair.lo;
```
Today this forces the struct into memory, which is not ideal. The JIT uses a `GT_LCL_FLD` of type `long`; while that avoids marking the struct var as address taken, it still forces the struct into memory. We need to make `GT_LCL_FLD` (or a new oper) capable of handling the case of dependently promoted struct fields by assembling the resulting value from the individual field variables using shifts and masking.
Trouble is, it's not that simple to decide when it's better to spill the struct to memory so you can simply load the `long` value and when it's better to assemble it from individual values. In this example, assembling from enregistered values would look something like:
```asm
mov ecx, eax ; eax contains lo
shl rdx, 32 ; edx contains hi
or rcx, rdx ; rcx contains the result
```
If the values aren't enregistered then the code gets a bit messier, as you may end up with 2 loads instead of the original single load:
```asm
mov ecx, dword ptr [lo]
mov edx, dword ptr [hi]
shl rdx, 32
or rcx, rdx
```
instead of just
```asm
mov rcx, qword ptr [lo]
```
There will be trade-offs to be made here.
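In C++ terms, the trade-off above is between the single 8-byte load and this kind of value composition (with the dual split for the store case discussed next) - a sketch of the value computation only:

```c++
#include <cstdint>

// Assembling the long from two promoted int fields (shifts and an or)...
int64_t AssembleFromFields(int32_t lo, int32_t hi)
{
    return (int64_t)(uint32_t)lo | ((int64_t)hi << 32);
}

// ...and splitting a long store back into the two fields.
void SplitToFields(int64_t l, int32_t* lo, int32_t* hi)
{
    *lo = (int32_t)l;
    *hi = (int32_t)(l >> 32);
}
```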
Now, all this is a bit orthogonal to this issue - GT_ASG. But you can also have:
```C#
*(long*)&pair.lo = l;
```
And end up with similar drawbacks:
```asm
mov dword ptr [lo], eax
shr rax, 32
mov dword ptr [hi], eax
```
instead of a single mov instruction.
And it gets even more complicated if you have more than 2 fields (e.g. you could read an int from 4 separate byte-sized fields).
Going back to struct typed GT_LCL_FLD nodes - I've made an experiment and it's definitely doable. It does require a fair amount of work but it's not prohibitive. I got it to the point that it generates IR (and the associated assembly code) like STORE_LCL_FLD<struct>(LCL_VAR<struct>) or STORE_LCL_FLD<struct>(IND<struct>). It only does unrolled copies so I don't know what impact, if any, the actual implementation will have on the code size.
The main problem is where exactly to produce such struct typed LCL_FLD nodes. I did it in LocalAddressVisitor, IMO that's where it belongs but it would likely work better once promotion is moved after LocalAddressVisitor. Attempting to handle copying from a promoted struct LCL_VAR to a struct LCL_FLD isn't fun. Generating LCL_FLD nodes and promotion should be done separately.
The example from Decimal is probably optimal with direct memory accesses because the structs are always in memory anyway (often passed by ref or in). Enregistering a Decimal looks unlikely as it's 4 ints, unless it's enregistered as two longs (might be too complicated).
The example from Decimal is probably optimal with direct memory accesses because the structs are always in memory anyway (often passed in ref or in).
Yes but the general case can be more complicated. You could use the lo and hi separately in 50 places and then have just one place where someone uses (long*)&lo and then performance goes down the drain. Though I suppose that then you can blame the developer, it's not the JIT's job to recover from poor optimization choices.
And then there's also the pesky problem of the SYS V calling convention that stuffs 2 int struct fields in a register. At least because of that it would be good to find a way to treat such cases better.
Enregistering a Decimal looks unlikely as it's 4 ints, unless it's enregistered as two longs (might be too complicated).
Actually a 4 int struct can be "promoted" - treated as 4 individual int variables. But if you use address tricks then those variables are no longer enregistered, even if they're still individual variables.
Hmm, that's actually something I should pay attention to. If you have 4 int fields and read only 2 as long then the other 2 shouldn't be affected. But I'm pretty sure that today they're all affected.
And then there's also the pesky problem of the SYS V calling convention that stuffs 2 int struct fields in a register.
And of course, this isn't actually SYS V specific. It happens on Windows as well, just only for structs of size <= 8 (e.g. on Linux Decimal is passed in 2 64 bit registers while on Windows it is passed on the stack).
The not so fortunate effects of the JIT's limitations in this area can be seen in the recently added Range's Equals method:
```asm
mov qword ptr [rsp+10H], rdx
mov eax, dword ptr [rsp+10H] ; GT_LCL_FLD
mov edx, dword ptr [rcx]
cmp eax, edx
jne SHORT G_M32203_IG04
mov eax, dword ptr [rsp+14H] ; GT_LCL_FLD
mov edx, dword ptr [rcx+4]
cmp eax, edx
```
The ideal codegen would be something like:
```asm
cmp dword ptr [rcx], edx
jne SHORT G_M32203_IG04
shr rdx, 32
cmp dword ptr [rcx+4], edx
```
@CarolEidt Do you know if there's any opened issue tracking such problems? Should there be?
This isn't directly related to GT_ASG removal, it's just another case where IR limitations cause problems so it's something to keep in mind.