Hi,
Basically, I want a method that executes some code based on a calculation over the generic type arguments of the method. The idea behind this is to avoid computing hashes and doing a dictionary lookup on every call, and instead use the JIT stage to cache the result.
To illustrate this mechanism, consider the following class:
```C#
class ExampleClass
{
    public int GetHashFor<T1, T2>()
    {
        if (typeof(T1) == typeof(T2))
            return 0;
        return 1;
    }
}
```
I want the jitted code to be equivalent to:
```C#
// illustrative only: one specialized body per instantiation
public int GetHashFor<int, int>()
{
    return 0;
}
public int GetHashFor<bool, int>()
{
    return 1;
}
```
So this raises some issues:
My questions are:
Also, in another thread about injecting IL code at build time (see https://github.com/dotnet/corefx/issues/30323), Fody was mentioned as a way to update the IL code at build time.
Maybe such a technology can be used during AOT compilation for this optimization?
Best regards,
Sure, RyuJIT is already doing this optimization. You can check your example at sharplab.io
With:
```C#
using System;
using System.Runtime.CompilerServices;

public static class ExampleClass
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int GetHashFor<T1, T2>()
    {
        if (typeof(T1) == typeof(T2))
            return 0;
        return 1;
    }
}

public static class Program {
    public static int Main() {
        return ExampleClass.GetHashFor<bool, int>();
    }
}
```
Basically resolving to this:
```asm
Program.Main()
    L0000: mov eax, 0x1
    L0005: ret
```
For some reason in Sharplab I still had to add the aggressive inlining option (it is the .NET CLR JIT there, not .NET Core). I haven't checked why; likely the type handle + equality check needs a bit of help here for the inlining to kick in.
@xoofx not only in Sharplab: if you put the same code into BenchmarkDotNet with netcoreapp2.1 you need to add that attribute, otherwise it won't inline. The IL size of that method is too big, maybe @mikedn can provide more info about why it does not inline by default.
Maybe such a technology can be used during AOT compilation for this optimization?
Conceptually here, the full method specialization could be done at compiler time (C# to IL). The problem doing this at this stage is that it complicates the compilation story: Imagine that your ExampleClass is being used by different libraries, with similar generics, each library would require to have a private pre-compiled copy of this generic specialization. A first problem is that it would result in code duplication, though the JIT compiler could detect them with some hint.. A second problem more annoying is that your compiled library would have a copy of the implementation... updating ExampleClass without recompiling your library could cause severe compatibility problems...
Though, letting the JIT/NGEN/AOT compiler do this only at compile time is also problematic. That's a long standing issue as well. A security, update on the .NET core libs for example would require to massively recompile all AOT applications (and recompile the .NET Core many times for each app which is not optimal). So there is also in this area some hard work to improve the code sharing story and minimize full rebuild to native code. This is a bit what is behind UWP native cloud compilation. In .NET CoreCLR OSS, it is called Ready To Run, though haven't followed what is its status...
The IL size of that method is too big,
Indeed, maybe that's just IL size actually 😉
Indeed, maybe that's just IL size actually
I double-checked; the IL size should be 0x1F, which is below 0x20.
The IL size of that method is too big, maybe @mikedn can provide more info about why it does not inline by default.
```
Inline candidate callsite is boring. Multiplier increased to 1.3.
calleeNativeSizeEstimate=409
callsiteNativeSizeEstimate=55
benefit multiplier=1.3
threshold=71
Native estimate for function size exceeds threshold for inlining 40.9 > 7.1 (multiplier = 1.3)
Inline expansion aborted, inline not profitable
```
Basically the JIT thinks that the method will end up generating a ton of native code when this isn't actually true. @AndyAyersMS could probably provide a more advanced explanation but the summary is that it's not trivial to estimate how much native code will be generated from some IL code BEFORE doing any optimizations.
The optimization should still kick in regardless of whether it's inlined, I believe (for value types, not for object types); there would just be a method call present.
... that it's not trivial to estimate how much native code will be generated from some IL code BEFORE doing any optimizations.
So this is a candidate for future tiered compilation improvements? Initial JIT without inlining, realize that it's small, and then patch the new code into the old big one?
So this is a candidate for future tiered compilation improvements? Initial JIT without inlining, realize that it's small, and then patch the new code into the old big one?
That's an interesting metric, @AndyAyersMS: use the results of the first compilation to inform inlining for the second compilation?
So this is a candidate for future tiered compilation improvements? Initial JIT without inlining, realize that it's small, and then patch the new code into the old big one?
Maybe, but it's a bit more complicated. Sometimes dead code elimination and simplification opportunities only arise as a result of inlining; the non-inlined version may still be a giant method. That said, tiered compilation could be used to recompile the inliner method (if it turns out to be hot) with more aggressive inlining policies. The low tier could even disable inlining completely (well, almost - there should be some cases where inlining is a no-brainer) and leave it to the high tier.
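(For context on the tiers discussed here: around the .NET Core 2.1/2.2 timeframe, tiered compilation was opt-in. The knobs below are the ones documented for that era; treat their exact names and defaults as an assumption for later versions.)

```shell
# Opt into tiered compilation on .NET Core 2.1 (off by default at the time).
export COMPlus_TieredCompilation=1

# Or per-project via an MSBuild property in the csproj:
#   <PropertyGroup>
#     <TieredCompilation>true</TieredCompilation>
#   </PropertyGroup>
```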
Inlining is not required to get the optimization in this example:
```
;; snippet from COMPlus_JitDump=GetHashFor
;; with no aggressive inline attribute
...
Importing Type.op_*Equality intrinsic
Folding call to Type:op_Equality to a simple compare via EQ
Optimizing compare of types-from-handles to instead compare handles
Asking runtime to compare 00007FF9B23CB5C0 (System.Int32)
 and 00007FF9B23C8FD8 (System.Boolean) for equality
Runtime reports comparison is known at jit time: 0

[ 1] 25 (0x019) brfalse.s

Folding operator with constant nodes into a constant:
[000009] ------------ /--* CNS_INT int 0
[000010] ------------ *  EQ int
[000008] ------------ \--* CNS_INT int 0

Bashed to int constant:
[000010] ------------ * CNS_INT int 1

The conditional jump becomes an unconditional jump to BB03
```
Inlining may be required in more general cases where the type parameters are reference types or you are fetching types from objects, as inlining can (sometimes) "unshare" the shared code and make those shared types concrete, or see how objects have been constructed.
As far as the inliner anticipating things like this and being more aggressive without being told -- it is on my (admittedly lengthy) todo list. The current inliner does not really allow in-depth examination of the callee before making a profitability assessment. In particular it does not resolve tokens so it has no idea there is a type equality check in this method. I had hoped to use some machine learning derived heuristics to try and better recognize potentially good inlines without doing the in-depth analysis, but hit various roadblocks and I have shelved (but not abandoned) the effort.
Tier0 today doesn't inline at all. We may or may not change this to inline, but if we do it will inline only the most trivial of methods (~ IL size 6 or less). So tier0 inlining can't tell us much about the opportunities offered by potential inlines we might try in tier1.
We potentially could do something @Tornhoof hinted at: recognize those opportunities in the callee at tier0 (even if we don't act on them), and somehow record them so that when later the caller is rejitted at tier1 we know more about the callee than we do now. A bit of engineering here to figure out what to store, how to represent it, where to put it, and how to guarantee it's always valid.
This lets tier0 act as a simple interprocedural analysis phase, though possibly not a very-effective one on its own.
One of the cruel realities of jitting is that we tend to jit callers before callees, which really limits opportunities for propagating facts about callees to callers. However, in many programs, callees will be invoked more often than callers. If so, tier1 will likely be asked to rejit in mostly reverse callgraph order.
So perhaps there's hope for this kind of "tiering as an interprocedural analysis" scheme after all -- provided tier1 also publishes this kind of info. We need to be careful about how we build this because both tier1 and tier0 code for a method can remain active indefinitely and tier1 methods complete jitting somewhat unpredictably. And profilers and such can get in and mess with method IL. But it seems like it has promise.
For instance, suppose we have a chain of calls A -> B -> C where each callee is too big to inline by default. When normally jitting or tier0 jitting, the jit will likely jit A, then jit B, then jit C. But when we get to tier 1, it's more likely that the rejit requests will come back as C, then B, then A. So the jitting of B at tier 1 can learn from the jitting of C at tier 0 and at tier 1 (if ready).
This may not be the best way to build things though. I have also suggested it might make sense to bypass rejit requests low in the call graph in favor of only rejitting higher up, because time spent rejitting C at tier1 is wasted if the jit then immediately decides to inline C into B -- the tier1 C may never get called. So we need to be thoughtful about all this.
Suffice to say there are many options to explore...
It seems that performing the optimization at JIT time is a real challenge (and would probably impact the performance of the JIT itself).
I'll add more detail about my actual use case, as it may provide additional information.
Basically, I have containers of entities (each represented by an int), and each entity can have a set of types linked to it; a container is also linked to a set of types.
A container can accept an entity only if the entity's set of types is a superset of the container's set.
In code:
```C#
using System.Collections.Generic;

// Use only value types to force JIT specialization
class EntityContainer<TC1, TC2> // the set of types accepted is (TC1, TC2)
    where TC1 : struct
    where TC2 : struct
{
    List<uint> m_IDs = new List<uint>();

    public void AddEntity<T1>(uint id)
        where T1 : struct
    {
        // JITted code should cache the branch
        if (TypeSetChecker<TC1, TC2>.IsSuperSet<T1>())
            m_IDs.Add(id);
    }

    public void AddEntity<T1, T2>(uint id)
        where T1 : struct
        where T2 : struct
    {
        // JITted code should cache the branch
        if (TypeSetChecker<TC1, TC2>.IsSuperSet<T1, T2>())
            m_IDs.Add(id);
    }

    public void AddEntity<T1, T2, T3>(uint id)
        where T1 : struct
        where T2 : struct
        where T3 : struct
    {
        // JITted code should cache the branch
        if (TypeSetChecker<TC1, TC2>.IsSuperSet<T1, T2, T3>())
            m_IDs.Add(id);
    }
}
```
```C#
class TypeSetChecker<TC1, TC2>
    where TC1 : struct
    where TC2 : struct
{
    public static bool IsSuperSet<T1>()
        where T1 : struct
    {
        return false; // we need at least 2 types
    }

    // JITted code should cache the result only
    public static bool IsSuperSet<T1, T2>()
        where T1 : struct
        where T2 : struct
    {
        var referenceHash = Hash<TC1>() | Hash<TC2>();
        var thisHash = Hash<T1>() | Hash<T2>();
        return (referenceHash & thisHash) == referenceHash;
    }

    // JITted code should cache the result only
    public static bool IsSuperSet<T1, T2, T3>()
        where T1 : struct
        where T2 : struct
        where T3 : struct
    {
        var referenceHash = Hash<TC1>() | Hash<TC2>();
        var thisHash = Hash<T1>() | Hash<T2>() | Hash<T3>();
        return (referenceHash & thisHash) == referenceHash;
    }

    // Perform a bloom filter on the type
    // JITted code should cache the result only
    static ulong Hash<T>()
        where T : struct
    {
        var result = 0Lu;
        result |= (1u << (types[i].FullName.GetHashCode() & 0x3F));
        result |= (1u << ((types[i].FullName.GetHashCode() >> 8) & 0x3F));
        result |= (1u << ((types[i].FullName.GetHashCode() >> 16) & 0x3F));
        result |= (1u << ((types[i].FullName.GetHashCode() >> 24) & 0x3F));
        return result;
    }
}
```
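The containment test in IsSuperSet is standard bit-mask logic: a candidate mask is a superset of the reference mask exactly when ANDing the two leaves the reference unchanged. A minimal standalone sketch with hard-coded masks (illustration only, not code from the thread; note that because the real masks come from a bloom-style hash, the actual check can report false positives):

```csharp
using System;

public static class BloomSupersetDemo
{
    // The candidate contains every bit of the reference iff AND leaves the reference intact.
    public static bool IsSuperSet(ulong reference, ulong candidate)
        => (reference & candidate) == reference;

    public static void Main()
    {
        ulong required = 0b0101;  // bits the container requires
        ulong superset = 0b1101;  // all required bits present, plus an extra one
        ulong missing  = 0b0110;  // bit 0 of the requirement is absent

        Console.WriteLine(IsSuperSet(required, superset)); // True
        Console.WriteLine(IsSuperSet(required, missing));  // False
    }
}
```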
The overall idea behind this is to have some kind of metaprogramming at compile/JIT time, which is pretty valuable for performance.
Another point that can be important is the build pipeline I intend to use:
Those methods will be in a library A, and I then use them in libraries B1, B2, and B3 to finally build a .NET Core app C.
The build pipeline will be:
So this way, the first step won't have to precompile multiple versions of the generics (if the DLL is used directly, it can be jitted, with whatever JIT runtime optimizations are available, which may not include this one).
But the second step takes all of the IL code, so the AOT compiler should be able to optimize all of the calls.
Also, as a side note, I am fine with writing specific code to "control" or give hints to the compiler so the code gets jitted/compiled like this, and also with this optimization being available only when linking a set of DLLs into a native app.
And this also implies that updating the code will probably result in running the whole build pipeline again, but that is fine in that case.
(So maybe this can be enabled by the end user, who can then choose between easy updates of the application or better runtime optimization.)
You really don't need any kind of metaprogramming or codegen story for your case. You can move the hash function (which could not be optimized correctly as-is by the JIT) to a generic struct outside of TypeSetChecker<TC1, TC2> (to avoid generating a double specialization that doesn't seem to be required), with a readonly computed field like this (assuming that types[i] in your code was actually typeof(T)):
```C#
struct HashHelper<T>
    where T : struct
{
    public static readonly ulong Value = ComputeHash();

    private static ulong ComputeHash()
    {
        var result = 0UL;
        var hashOfTypeOfT = typeof(T).FullName.GetHashCode();
        result |= (1UL << (hashOfTypeOfT & 0x3F));
        result |= (1UL << ((hashOfTypeOfT >> 8) & 0x3F));
        result |= (1UL << ((hashOfTypeOfT >> 16) & 0x3F));
        result |= (1UL << ((hashOfTypeOfT >> 24) & 0x3F));
        return result;
    }
}
```
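For reference, here is how the caching behaves once the struct is generic over T (a sketch under the assumption above; TypeMask is a hypothetical name used only to keep the example self-contained):

```csharp
using System;

// Hypothetical stand-in for the generic HashHelper<T> discussed above:
// the static readonly field is computed once per closed generic type by
// its type initializer, and later reads are plain field loads that the
// JIT can fold to a constant for value-type instantiations.
public struct TypeMask<T> where T : struct
{
    public static readonly ulong Value = Compute();

    private static ulong Compute()
    {
        int h = typeof(T).FullName.GetHashCode();
        ulong result = 0;
        result |= 1UL << (h & 0x3F);
        result |= 1UL << ((h >> 8) & 0x3F);
        result |= 1UL << ((h >> 16) & 0x3F);
        result |= 1UL << ((h >> 24) & 0x3F);
        return result;
    }
}

public static class TypeMaskDemo
{
    public static void Main()
    {
        // Stable within a process run: both reads see the same cached value.
        Console.WriteLine(TypeMask<int>.Value == TypeMask<int>.Value); // True
        // The hashing scheme always sets at least one bit.
        Console.WriteLine(TypeMask<byte>.Value != 0);                  // True
    }
}
```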
This is a proven technique used with generics and JIT/AOT optimizations. Going further, you could even route the entire IsSuperSet check through a single internal generic type (e.g. TypeSetChecker<TC1, TC2, T1, T2>) and cache the result in a readonly field, so your AddEntity() would be entirely optimized right away.
(Side note: if you really want to compute a 64-bit hash from a name, you should probably not use the 32-bit GetHashCode() to construct an unreliable pseudo-64-bit hash, but use a proper 64-bit hash directly.)
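To make that side note concrete (an illustration, not code from the thread): a standard 64-bit FNV-1a over the type's full name yields 64 bits directly, instead of stretching a 32-bit GetHashCode(). It is also deterministic across runs, unlike string.GetHashCode() on .NET Core, which is randomized per process:

```csharp
using System;

public static class Fnv1a64
{
    // Standard FNV-1a 64-bit parameters.
    private const ulong OffsetBasis = 14695981039346656037UL;
    private const ulong Prime = 1099511628211UL;

    public static ulong Hash(string s)
    {
        ulong h = OffsetBasis;
        foreach (char c in s)
        {
            h ^= c;     // fold in each UTF-16 code unit
            h *= Prime; // 64-bit multiply wraps, as FNV intends
        }
        return h;
    }

    public static void Main()
    {
        // Empty input returns the offset basis by construction.
        Console.WriteLine(Hash("") == 14695981039346656037UL); // True
        // Deterministic: same input, same 64-bit hash, every run.
        Console.WriteLine(Hash("System.Int32") == Hash("System.Int32")); // True
        Console.WriteLine($"0x{Hash("System.Int32"):x}");
    }
}
```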
Thanks for the answer @xoofx, I implemented it and it works perfectly.
So, considering performance, it seems that most of the optimizations are performed during JIT/AOT compilation. But it is not really clear (or I haven't found where to look) which design patterns are efficient for the JIT/AOT;
the only way to know whether an optimization is performed by the JIT is to produce JIT dumps with a debug build of CoreCLR.
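For reference, the dump quoted earlier in this thread is produced with environment variables along these lines (names are the coreclr 2.x era knobs; both required a checked/debug build of the JIT at the time):

```shell
# Full JIT dump for methods whose name matches (checked CoreCLR builds only).
export COMPlus_JitDump=GetHashFor

# Just the generated native code for matching methods.
export COMPlus_JitDisasm=GetHashFor
```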
I also looked at the JIT chapters in the Book of the Runtime; it is not straightforward to extract the efficient C# patterns from there.
However, considering code generation, I think I'll do a bit of C# code generation to generate all of the IsSuperSet<...> variants. But that's another story.
(Note: thanks for the hash info, I am looking into that)