Runtime: Performance drawback of type derived from generic

Created on 4 Feb 2015 · 7 comments · Source: dotnet/runtime

Following the Stack Overflow question, I've encountered a performance problem. Let's talk code; I've simplified it as much as I could to reproduce the issue.

Suppose we have a generic class. It has an empty list inside and does something with T in its constructor. It has a Run method that calls an IEnumerable<T> extension method on the list, e.g. Any():

public class BaseClass<T>
{
    private List<T> _list = new List<T>();

    public BaseClass()
    {
        Enumerable.Empty<T>();
        // or Enumerable.Repeat(new T(), 10);
        // or even new T();
        // or foreach (var item in _list) {}
    }

    public void Run()
    {
        for (var i = 0; i < 8000000; i++)
        {
            if (_list.Any())
            // or if (_list.Count() > 0)
            // or if (_list.FirstOrDefault() != null)
            // or if (_list.SingleOrDefault() != null)
            // or other IEnumerable<T> method
            {
                return;
            }
        }
    }
}

Then we have a derived class which is empty:

public class DerivedClass : BaseClass<object>
{
}

Let's measure the performance of running the BaseClass<T>.Run method through both classes. Accessing it through the derived type is about 4 times slower than through the base class, and I can't understand why that happens. The code is compiled in Release mode, and the result is the same with warm-up. It happens on .NET 4.5 only.

public class Program
{
    public static void Main()
    {
        Measure(new DerivedClass());
        Measure(new BaseClass<object>());
    }

    private static void Measure(BaseClass<object> baseClass)
    {
        var sw = Stopwatch.StartNew();
        baseClass.Run();
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}

Full listing on gist

I've got an answer on Microsoft Connect:

It is related to dictionary lookups in shared generics code. The heuristics in the runtime and JIT do not work well for this particular test. We will take a look at what can be done about it.

In the meantime, you can work around it by adding two dummy methods to BaseClass (they do not even need to be called). This will cause the heuristic to work as one would expect.
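
The workaround described in that answer can be sketched as follows. The dummy method names here are placeholders of my own choosing; per the answer, all that matters is that they increase the method count the VM's slot heuristic sees when it builds BaseClass<T>:

```csharp
using System.Collections.Generic;
using System.Linq;

public class BaseClass<T>
{
    private List<T> _list = new List<T>();

    public BaseClass()
    {
        Enumerable.Empty<T>();
    }

    public void Run()
    {
        for (var i = 0; i < 8000000; i++)
        {
            if (_list.Any())
            {
                return;
            }
        }
    }

    // Never called: these exist only to bump the class's method count so
    // the VM reserves extra generic dictionary slots for BaseClass<T>.
    private void DummySlotFiller1() { }
    private void DummySlotFiller2() { }
}
```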

area-CodeGen-coreclr optimization

Most helpful comment

Alexandr, thanks for the post regarding the performance observation about generics. I think there are two issues here: the first is to explain the current behavior you're observing in this program and the second is whether we can do anything to fix it.

Let me start by trying to explain the behavior. The issue arises because the same JITted code for generic methods is shared on behalf of different types at runtime. When shared methods are executed, any uses of generic type variables in their body have to be looked up to get the concrete runtime type; for your example it is List<object>, given that it's executing the shared method BaseClass<T>.Run. We refer to this process as runtime handle lookup. This process is the most important aspect of making shared generic code nearly as efficient as fully specialized methods.

Because of the critical performance needs of this feature, the JIT and runtime cooperate through a series of sophisticated techniques to reduce the overhead. The different timing behaviors you observe in the various versions of your PerformanceProblem.Program example reflect current implementation artifacts, and these artifacts might change over time as the code base evolves.

With that said, let's talk about how the runtime optimizes these lookups. There is essentially a series of caches to avoid the ultimately expensive lookup of types at runtime via the class loader. Without going into too much detail, you can abstractly rank the lookup costs like this:

  1. Inline lookup from a dictionary slot - an inline sequence of code that can look up the exact type within a few levels of indirection (think 10 clocks for a hit). These slots are accessible off the method table for the object.
  2. Helper global dictionary cache lookup - a call to a helper method that looks up the handle corresponding to the token in a global hash table (think about 30 clocks for a hit).
  3. Type hierarchy walk with global dictionary cache lookup - walk the inheritance tree and look up in the global hash table using the declaring type (think about 50-60 clocks for a hit).
  4. Class loader - (think 300 clocks).

So ideally all of the handle lookup scenarios get an inline lookup off the method table; this costs very little and hence gives performance near that of specialized generic methods. But like most cache performance, it comes down to determining how many slots are necessary and predicting where (in what methods) these lookups will be required. When a class is built at runtime, the VM makes a guess about how many slots are needed without having actually compiled all the methods (remember, this is lazy execution of code). It currently estimates how many slots to reserve based on the number of methods in the class. Then, as the methods are actually JITted, the slots are allocated for specific lookup points in each method. So whether you hit the ideal performance characteristics of #1 above depends on the number of methods in the class and the order in which the methods are JITted. In your example, if you add more methods to the base class, the class builder in the VM assumes those methods will have points where more runtime handle lookups are required and allocates more slots.

Let's come back to the above in a bit and talk about the JIT. Depending upon which JIT is being used, it may be able to optimize away some of these artifacts and speed up the code as well. As mentioned, the JIT allocates the slots as it compiles the methods. In some cases, the JIT might be able to determine that a method invocation is dead after inlining it, optimize away its runtime handle lookup, and thereby not need to allocate a dictionary slot for it. This then frees up a slot for use when compiling some later method. So in your example, at least one of our JITs will inline Enumerable.Empty<T>() in the BaseClass<T> constructor and remove the runtime handle lookup, leaving the dictionary slot available when compiling the Run() method, where it is used for the runtime lookup of the _list.Any() call. Our 4.5 64-bit JIT will do this, whereas the x86 JIT will not. For x86, if you remove the call to Enumerable.Empty<T>() from the source code, the slot can be used for the _list.Any() call instead, and you should see a speedup.

There is another aspect of the loop in BaseClass<T>.Run that impacts the performance of this example program. The runtime handle lookup for _list.Any() is invariant and can be hoisted out of the loop. The x86 JIT and the 64-bit JIT in 4.6/CoreCLR (but not 64-bit 4.5) will do this. Because this transformation is so effective at mitigating the runtime handle lookup cost, on 4.6 you will not measure a significant difference between the cost of the BaseClass and DerivedClass invocations. I had to rewrite your test example to put the loop structure up into Measure in order to block the invariant code hoisting and reproduce the difference on 4.6.
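
The exact rewrite isn't shown in the thread, but a sketch of that kind of change would look something like this, assuming a hypothetical single-iteration RunOnce() variant of Run() so that the loop, and hence each handle lookup, happens per call from Measure:

```csharp
using System;
using System.Diagnostics;

public class Program
{
    public static void Main()
    {
        Measure(new DerivedClass());
        Measure(new BaseClass<object>());
    }

    private static void Measure(BaseClass<object> baseClass)
    {
        var sw = Stopwatch.StartNew();
        // The loop now lives in Measure: each iteration re-enters the shared
        // generic method and repeats the runtime handle lookup, so the JIT
        // cannot hoist the lookup out of a loop it cannot see.
        for (var i = 0; i < 8000000; i++)
        {
            baseClass.RunOnce();   // hypothetical single-iteration variant of Run()
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}
```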

So the above explains why you see different performance characteristics from various kinds of minor code changes, as well as on different versions of the CLR JIT, for this example. I don't see any changes in our version control history to suggest that the VM's behavior has changed in the last 5 years. But it doesn't explain why the base and derived classes have different performance characteristics, so here's the punch line:

For this example, because the base class has 2 methods, it ends up getting 2 dictionary slots allocated, and both reserved slots (the preferred #1 above) are used to cache the lookups for "new List<T>()" and "Enumerable.Empty<T>()" when JITting the constructor. Obviously the constructor is JITted before the Run method. So by the time the Run method is JITted, the dictionary slots have overflowed, which means that _list.Any() will fall back to a runtime helper for the lookup (#2-#4).

It is this slower lookup path that the test case is measuring, and it shows the difference in lookup cost between base and derived. Basically, the base class lookup is hitting #2 above and the derived class is falling back to #3. I measure roughly a 3x slowdown for the derived class. I believe there is a bug in the global dictionary cache update for the derived class that is responsible for the difference. I have made the change locally and run some benchmarks; with this change, both class lookups give the same performance. So thanks for reporting this - I believe it is a real improvement for the runtime that I'll check in after more desktop testing and code review.

There is, however, still the issue that even with the above fix you get a hit in #2 (global dictionary cache) and not #1 (inline dictionary slot) for this case. While the heuristic for pre-allocating dictionary slots could be bumped up a little to allocate extra slots here, there is a trade-off between over-allocating unused dictionary slots (bad working set) and under-allocating used dictionary slots (slow lookups). I don't think we have enough justification to change the heuristic, given how much performance benchmarking we'd have to do on large server workloads to know whether changing it would lead to the right trade-offs. I expect the current heuristic works well for a large cross-section of programs using generics; of course, any single program may benefit more or less from it. We may still want to look into this in the future if it becomes a recurring source of performance problems for customers.

Hope this helps explain things. Thanks again Alexandr.

All 7 comments

@cmckinsey - we also have a TFS bug from the original report on Microsoft Connect.

@jkotas What's the status of the bug? Could you share some insights?

We did not have a chance to look into the bug.

The heuristic mentioned is implemented in MethodTableBuilder::AllocAndInitMethodDescs: https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L7260

The problem is that the number of generic dictionary slots guessed by MethodTableBuilder::AllocAndInitMethodDescs is less than what the JIT eventually asks for, which forces the JIT to use the slower path.

@jkotas Is there an action item here? This issue is aging -- what are the next steps?

The issue is assigned to @cmckinsey. I expect that he or somebody on his team will look into it once it gets to the top of their priority list, if nobody beats them to it.


@cmckinsey Thank you for the great explanation!!
