Runtime: Performance decrease when using interfaces

Created on 25 Jan 2017  ·  60Comments  ·  Source: dotnet/runtime

The general guidelines that I see for writing in C# usually say something like, "Accept as general of an object as you can for your parameters." For example, instead of requiring a List as a parameter when you don't actually use List specific methods, accept an IList, or even an IEnumerable if you can. This way you're not tied to a specific implementation. The basic concept of interfaces.

This works well in theory, but in practice, I've found that it's usually a bad idea when it comes to interfaces like IList, ICollection, or IEnumerable. In many of my projects, I've found that changing methods' parameter types from arrays or Lists to interfaces such as IList result in a 2-4x increase in execution time for the program.

Because virtual calls are used when using interfaces, making it so the JIT doesn't inline them, indexer access time is greatly increased. I was curious just how big the performance decrease was, so I put together a little demo program to try and benchmark this, and the results were worse than I expected:

Times are in milliseconds

18       Array
218      Array as IList
26       List
223      List as IList
19       ArrayWrapper
188      ArrayWrapper as IList

Using an IList resulted in about an order of magnitude longer execution time than not using an IList. I wondered if some of this was due to the extra stuff the CLR does when accessing an Array or List as an IList, so I made a simple wrapper class around an array. This resulted in some improvement, but not much.

This virtually negates the advantage of something like accepting an IList as a parameter instead of a T[] to decouple the code from any specific implementation. A new developer might expect that passing an object to a method accepting an IList would result in the same performance as if the method accepted that object's concrete type.

Is there any way to lessen this kind of performance loss? Or is it unavoidable without changing the JIT to be able to inline virtual calls?

category:cq
theme:devirtualization
skill-level:expert
cost:large

area-CodeGen-coreclr question

Most helpful comment

I agree with @Thealexbarney 10x increase is unreasonable. I very often use IReadOnlyList instead of List for safety and code correctness.

All 60 comments

The general guidelines that I see for writing in C# usually say something like, "Accept as general of an object as you can for your parameters."

Such guidelines need to be taken with a grain of salt. They usually make sense in public APIs but beyond that collection interfaces should be used only when needed. It makes no sense whatsoever to have a private method taking IList<T> when all its callers use List<T>.

Is there any way to lessen this kind of performance loss? Or is it unavoidable without changing the JIT to be able to inline virtual calls?

Not really. JIT improvements could help here and there but in the general case there's not much that can be done.

@Thealexbarney you can optimise specific cases with tight loops by casting; for example changing your SumIList method to the following would allow the the inlines when ilist is a list

private static int SumIList(IList<int> ilist)
{
    int result = 0;

    var list = ilist as List<int>;
    if (list != null)
    {
        var length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    else
    {
        var length = ilist.Count;
        for (int i = 0; i < length; i++)
            result += ilist[i];
    }

    return result;
}

In your example you are also doing a double interface dispatch per loop (for both count and item) and comparing that against a fully inlined bounds-check eliminated loop. The time should halve if you cache the count when going via the interface.

Abstractions have performance costs. Pretty much all runtime abstractions - interfaces, generics, async, linq, ... - have extra performance costs compared to their less abstract equivalents. This overhead does not matter for most of the code and it is often worth paying to get the productivity benefits that these abstractions provide. However, one has to be careful about this overhead for the performance critical parts.

@vancem wrote a great articles on this topic years ago. Links to the archived copies can be found here.

As Jan points out, abstractions often have costs. This is part of what makes design hard, because you have to make tradeoffs. We have problems with people going 'overboard' both ways. We often have people designing things without any concern for perf, often with really bad consequences. More rarely, we have people who try to optimize 'everything' and end up making hard-to-maintain code.

What we really want is to take advantage of a 90%-10% dynamic that typically happens with software. It is typically the case that WELL UNDER 10% of your code in is on high volume (HOT) code paths. Moreover, it is NOT THAT hard to anticipate this. In this code, there is a good chance you need to NOT be as abstract (e.g. using Arrays instead of IList), and think carefully about the APIs (make them chunky, so that they are not called often to do little work). The other 90% of your code you can skew toward all the good software abstractions etc that make the code understandable, reusable and maintainable.

In the article Jan mentions above I go into more detail, but it is not rocket science, it is al about exploiting the 90-10% characteristic. People who design general purpose libraries have the hardest since they really don't know what scenarios are the most important to optimize.

For what it is worth...

Yes, abstractions have costs, but a 10x increase in execution time for this example doesn't seem very reasonable to me. I'm not very knowledgeable about the inner workings of the CLR's code generation, so somebody correct me if I'm wrong, but isn't one of the advantages of compiling at run-time the fact that the JIT compiler knows the type behind the interface, and can optimize accordingly?

I've written code across various projects that only operated on an object's indexer and length. We wanted to be able to run Arrays Lists, and other types through this code. Using an IList seemed like a good solution, but wasn't feasible for our project because of the performance issues.

I was curious how the JVM handled this, so I did a similar benchmark in Java testing a method that took an interface as a parameter. Going through an interface there didn't have much, if any, noticeable overhead compared to using a plain array. Looking at the compiled assembly confirmed that the interface calls were being inlined and optimized. It looks like when a new type goes through the interface, the method is JIT compiled for the new base type. This doesn't happen with the CLR.

I guess what I'm trying to ask is why is this still a problem after years of the CLR being developed? I can't speak for others, but I've personally run into it a decent number of times. Is there something about the way the CLR is designed that makes it difficult to improve on? Is it not worth the effort to improve? Or maybe there's some other reason that I'm overlooking.

Yes, it would be nice for CLR to have the optimizations you are asking about. They are not easy to build, but it is certainly doable with enough effort. They have not been built yet because of people working on CLR did not find them to be the best bang for the buck when choosing areas to improve so far.

I agree with @Thealexbarney 10x increase is unreasonable. I very often use IReadOnlyList instead of List for safety and code correctness.

If anyone is interested in understanding 'why interface calls are more expensive', this BOTR page is an interesting read. Despite the title it is about interface dispatch, from the page:

Although it is possible for Virtual stub dispatching (VSD) to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.

Hi all, @fatalerrorx, @benaadams

Correct me if I'm wrong, but don't generic constraints alleviate this pressure to overload/specialize in a lot of cases?

private static int SumIList<TList>(TList list) where TList : IList<int>
{
    int result = 0;
    if (list != null)
    {
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    return result;
}

The JIT already knows to compile a different function for each type used, right?

Correct me if I'm wrong, but don't generic constraints alleviate this pressure to overload in a lot of cases?

Your example code still uses IList<int>.

The JIT already knows to compile a different function for each type used, right?

This only happens when TList is a struct type. If it's a reference type then the same code is used for arrays, List<int> etc.

I don't want to get into sound too defensive about optimization the .NET runtime does, but I do want to clarify two things that are important in deciding how important various optimization are.

Yes, abstractions have costs, but a 10x increase in execution time for this example doesn't seem very reasonable to me

I think you tested the worst case since the operation being done was literally one instruction. Thus I do think it is not fair to use the 10x as the representative increase. Of those cases where you are doing 'flyweight' things in hot loops, that actually have impact in overall performance, how problematic would it have been to simply use arrays?

I was curious how the JVM handled this

First, I would like to confirm that the Java experiment you did was 'representative' of your real code. What is going on is code specialization, and every system has its limits. There is a good chance that your microbenchmark lives within the limits (the whole function was inlined), but in more typical code, it is problematic (the methods are bigger and you have to decide which are worth it). Certainly 'hotspot' style dynamic re-compilation can do this (and that may be what is happening in your case), but that is a non-trivial investment.

My main point, however is that you should be careful about over-extrapolation of microbechmarks. If the benchmark REALLY represents your hot paths, then I can't argue with that, but real-world code has a surprising tendency to be both not as bad in some case (as in the first case above) or as good as (if the java JIT 'falls off' its optimization for your real code in the second case).

This is kind of tradeoff we have to make when the .NET Runtime decides how to invest in optimization. What we have found is that in 'real-world' code the performance problems are only very rarely related to JIT code quality issues of the type you suggest (but they do turn up in microbenchmarks all the time, which is what is happening here).

Anyway, what would be more useful to us is not this microbenchmark, but examples of 'hot kernels' in your code where you find this to be an actual problem. That would certainly add weight for us to do some optimization of those scenarios.

@mikedn, wow I assumed it did that work for us. I must have misinterpreted something I read somewhere. I just took @Thealexbarney 's demo program and tried my change; same as not using generic. I'm kind of disappointed.

Now it makes sense why the LINQ operators don't use this (non)-"trick". I wish they would have added that to the JIT compiler to support the LINQ implementation using that optimization.

Regarding the generic constraint comment (@NickStrupat) there are areas where using that could avoid (a) the interface invocation tax and (b) the struct-boxing tax incurred when an instance is used via its interface, particularly for things like IEnumerable and IEnumerator, which work really well as structs.

@markrendle absolutely. What I'm a bit disappointed about is that JIT doesn't emit a specialized method for classes as well. It could emit a method where all of the calls to the interface methods are resolved to call the methods on the class (at least if that class is sealed so it knows there won't be polymorphic behaviour).

I just tested @Thealexbarney 's demo program with an added SealedArrayWrapper and having all tests call a generic version of the Sum... method. The results are the same sealed or not.

Developing a library which reads the method body as IL and emits a DynamicMethod with the modified IL for sealed classes could be a workable solution.

Adding @cmckinsey

So in summary, RyuJIT does not take advantage of the optimization opportunity to output specialized native executable code for generic methods taking an interface/class-constrained parameter where T is a sealed class, despite value types receiving this optimization. The specialized code would in theory be calling the method directly instead of during a virtual table look-up call, assuming this is how it works for value types(?).

Is that correct? Does anyone have expertise on what changes that would require?

Scenarios like this where high performance abstractions can be achieved through using a sealed derived class combined with generic methods in hot loops or elsewhere could be considered very valuable for people with their eyes on the "performance as a feature" aspect of .NET development.

So in summary, RyuJIT does not take advantage of the optimization opportunity to output specialized native executable code for generic methods taking an interface/class-constrained parameter where T is a sealed class, despite value types receiving this optimization. The specialized code would in theory be calling the method directly instead of during a virtual table look-up call, assuming this is how it works for value types

It works that way for value types because there aren't any reasonable alternatives (e.g the code for List<double> and List<int> has to be different because the 2 value types have different sizes and require different instructions, the only way to avoid this is to do box those values like Java generics do).

Doing the same thing indiscriminately in the case of reference types would result in a code explosion and in many cases it would be useless or worse, have adverse effects. If such specialization is done then it should be done only when needed and that makes things more difficult. The JIT needs to somehow figure which code paths are worth optimizing either at runtime (like Java supposedly does) or by having profiling information collected in advance (e.g. via tools like mpgo). And each approach has its pros and cons (e.g. the Java approach wouldn't probably work in AOT runtimes such as .NET Native and CoreRT).

@NickStrupat the caller can choose to force the implementation for the generic by creating a non-generic struct wrapper

struct ListWrapper : IList<int>
{
    private List<int> _list;
    public ListWrapper(List<int> list)
    {
        _list = list;
    } 
    public int Count => _list.Count;
    //...
}

var result = SumIList(new ListWrapper(list));

It should also give you the benefits of inlining back

@vancem The microbenchmark is not completely representative of the hot spots in our code, but it's not _too_ far off. The most severely I run into this performance decrease is when working with code that uses interfaces such as IList or IReadOnlyList.

Here's one example of a possible situation. Let's say someone's writing some code that does some sort of data processing, and the functions to call this code accept an Array. This implementation is working fine, and performance is good.

Now this person needs to pass in portion of a large array to this function. They have two main options: Create a new, smaller array, allocating memory in the process, or create an object (Similar to ArraySegment or Span) that abstracts the access to the original array, requiring duplicated code to handle the object, incurring a slight increase in execution time, but without a large memory allocation.

I won't go into any more detail, but there are a few more memory abstractions that this person uses in different situations, and a common theme between all of them is that they can all implement IList. Currently there are multiple versions of the code that have to be maintained, but if a single version of the code accepted an IList, ease of maintenance and the size of the code base would be better. However, making the original code accept an IList instead of an Array results in about a 2-4x increase in execution time, depending on the situation.

I'm not that great at creating scenarios, but I hope this shows one situation where a developer could run into this issue.

@mikedn I'm not suggesting indiscriminately doing the same thing for reference types. Only for sealed classes.

@benaadams The wrapper struct can actually be generic and receive the same inlining treatment.

I'm not suggesting indiscriminately doing the same thing for reference types. Only for sealed classes.

How does doing it only for sealed classes help? List<T> and most collection classes aren't sealed anyway. The only commonly used "collection class" that is sealed is T[]. But even then wholesale generic method specialization doesn't necessarily solve this issue. The JIT needs to somehow figure out which specialization to call and at compile time this is only possible if the actual type of the collection is known at the callsite. As in:
C# void foo() { IList<int> nums = new int[42]; ... // the compiler knows that nums is int[] so it could call // a specialized version of SumIList SumIList(nums); }
Otherwise the best a compiler can do is something along the lines of @benaadams first example. And that kind of approach needs to be driven by profiling information otherwise you risk ending up with a larger and slower program.

@mikedn I think we've each been talking about slightly different scenarios, so I'll try to specify again.

If the calling code knows the IList<T> is T[] or some derived and sealed class of any IList<T>, the following method could and should perform the same "inlining" as value-type IList<T>s already receive.

private static int SumIList<TList>(TList list) where TList : IList<int>
{
    int result = 0;
    if (list != null)
    {
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    return result;
}

If TList is a value-type, we get the same performance as if SumIList were overloaded specifically for TList, but not if TList is a sealed class.

Currently, as @benaadams points out, we need to make a value-type wrapper to get the "inlining". If someone has a List<T> or some other standard IList<T> collection, they need to make a struct wrapper for that particular class.

@NickStrupat as @mikedn points out if that was done wholesale it would cause a binary size explosion especially as then it would start inlining code into these functions. It would be worthwhile in hot paths but disadvantageous to do in all paths.

There is a proposal in rosyln to indicate that you want specific handling for various generic types https://github.com/dotnet/roslyn/issues/15822 <-- Shameless self promotion 😉

@benaadams, @mikedn, I don't know how many other ways I can point this out... I'm not talking about doing it wholesale.

I'm talking about doing it only for sealed classes where TList is the sealed class. In this scenario, the JIT compiler knows just as much about that type as if it were a struct. Only if it's sealed is this a good idea. Since the vast majority of classes are unsealed, there would be no code explosion in most cases.

I fully recognize and understand that generating different native code methods for every and any T would result in an undesirable code explosive à la C++ class templates. Again, I'm not proposing that.

@benaadams, generic specialization proposals get a +1 from me! I loved designing C++ libs which took advantage of that.

@NickStrupat sealed is only relevant for virtual methods; otherwise it already has all the type information. It would also cause code bloat unnecessarily for everywhere a sealed type was used rather than just in hot paths. e.g. as string is a sealed type creating a Dictionary<string, object> would generate full unique assembly for that dictionary, as would Dictionary<string, string> and List<string> etc but there is likely no reason to do so.

Regarding non-generic devirtualization for sealed types that is being looked at as part of https://github.com/dotnet/coreclr/pull/9230; though that is different than this issue.

I think we've each been talking about slightly different scenarios, so I'll try to specify again.

I don't think so. Let's try again:

If the calling code knows the IList is T[] or some derived and sealed class of any IList, the following method could and should perform the same "inlining" as value-type ILists already receive.

Considering that doing this will invariably increase code size one can wonder why should the JIT be doing this in the absence of any indication that the code is performance sensitive. Does the fact that TList happens to be sealed implies that SumIList is performance sensitive? Clearly not.

If TList is a value-type, we get the same performance as if SumIList were overloaded specifically for TList, but not if TList is a sealed class.

And as I stated before value types works the way they work because there aren't any reasonable alternatives, not because the JIT is doing an optimization. If anything, the JIT could do better in the area of value types by avoiding generating separate code for similar structs (e.g. 2 structs that both contain a single int field)

After rechecking my assumptions, I take it back, alas a concrete struct wrapper doesn't inline via a generic. https://gist.github.com/benaadams/ba3c008a328f02c75663213cd9affa65

// incorrect numbers

The interface calls are slower, but closer to a non-inlined call (as above)

Ahh... had it wrong; they do inline, the wrappers were classes not structs, updated code, results are

 31      Array
237      Array as IList
 30      ArrayWrapper
149      ArrayWrapper Non-inlined        ** Non-inlined direct call
 45      List
207      List as IList
220      ArrayWrapper as IList
235      Array as Generic                ** via constraint
206      List as Generic                 ** via constraint
 34      ArrayWrapper as Generic         ** via constraint

Considering that doing this will invariably increase code size one can wonder why should the JIT be doing this in the absence of any indication that the code is performance sensitive.

@mikedn Generics in general already causes this "code explosion" relative to non-generic types/collections). What are the chances that the current implementation of generics and their surrounding code base (LINQ, etc.) represents exactly the right balance of native code size vs performance? There is no real reason other than performance to generate the specializations at all. The type arguments could just be used for type-safety to wrap a non-generic with casting. Clearly the performance gains were considered more important than the potentially large generated code size.

Does the fact that TList happens to be sealed implies that SumIList is performance sensitive? Clearly not.

Currently the mechanism for achieving the inlining is "wrap with a struct", which is no less cryptic than "seal your class". That said, I would definitely be interested in some fine-grained control of which inlined specializations are generated.

And as I stated before value types works the way they work because there aren't any reasonable alternatives, not because the JIT is doing an optimization.

There is definitely at least one reasonable alternative. You could just box the value type wrapper like many parts of the .NET BCL already/still do.

If anything, the JIT could do better in the area of value types by avoiding generating separate code for similar structs (e.g. 2 structs that both contain a single int field)

That would be very cool. So the easy case would be if the fields match exactly?

Generics in general already causes this "code explosion" relative to non-generic types/collections). What are the chances that the current implementation of generics and their surrounding code base (LINQ, etc.) represents exactly the right balance of native code size vs performance?

There is definitely at least one reasonable alternative. You could just box the value type wrapper like many parts of the .NET BCL already/still do.

There may be places where the cost of boxing is small enough that code sharing would be desirable but in general boxing is costly enough that code sharing isn't a reasonable option. I don't think anybody wants List<int> to be a list of objects rather than a list of ints.

Currently the mechanism for achieving the inlining is "wrap with a struct", which is no less cryptic than "seal your class".

IMO the problem is not that it is cryptic, the problem is that sealed can be and is used for other purposes. I see no reason why making a class sealed because I don't want 3rd parties to inherit from it would result in more code being generated, code that most of the time is not needed.

That would be very cool. So the easy case would be if the fields match exactly?

They need to have identical layout but they don't need to match exactly. For example, most List<int> and List<SomeInt32BasedEnum> methods should generate identical code.

There is definitely at least one reasonable alternative. You could just box the value type wrapper

There's still full backwards compat with .NET Framework 1.0 and 1.1 so you can use ArrayList for your ints if you want this behaviour.

I think that it makes a lot of sense to solve this problem for readonly fields of which types are interfaces. Many long-lived objects receive dependencies as interfaces in the constructor and store them in private readonly fields. The object is expected to invoke methods on this interface a very large number of times during its lifetime.

Because the fields are readonly, they are not supposed to change. It makes sense in this case to "cache" the addresses of the methods of the real type behind the interface. I know this needs to be per instance, so this might mean the we need someplace inside the object to store these addresses. Maybe the developer can opt-in into this by applying some attribute over the field. E.g.

public class MyObject
{
    [CacheMethodAddresses]
    private readonly IDependency dependency;

    public MyObject(IDependency dependency)
    {
        this.dependency = dependency;
    }

    //... rest of members here
}

One problem is that readonly fields can be changed via reflection. One potential fix is to make fields with the CacheMethodAddresses attribute applied truly readonly. Is there an existing concept of truly readonly field at the IL/CLR level?

I see no reason why invoking a call on an interface could not be at least as fast as invoking a virtual method. It must have something to do with how .NET represents interfaces (or interface mappings).

Based purely on the computational model of classes and interfaces (e.g. as described in compiler-design books such as the following: https://www.amazon.com/Modern-Compiler-Implementation-Andrew-Appel/dp/052182060X ), both should involve the same amount of reference lookups:

A virtual method call consists of looking up the vtable for an object (which each class-based object contains a reference to), and then looking up the correct method from the vtable, and then calling the method. That's 2 lookups. (By "lookup", I mean dereferencing from a compile-time determined offset of a static block of memory).

Each "instance" of an interface consists of a reference to the implementing object, and a reference to a vtable that specifically describes how an instance of a particular class implements that interface. The call consists of looking up the vtable from the interface instance, looking up the method in the vtable, and then calling the referenced method. That's also 2 lookups.

In theory, an interface method could be even FASTER, if interfaces are stored as fat-pointers (conceptually, a STRUCT that contains both the object-reference and the vtable reference). That is, for a virtual method, you must dereference the object, then the vtable, and then the method; but for an interface, you'd just dereference the vtable (from a compile-time determined offset) and then the method.

What is .NET doing that makes interface calls SLOWER?

It might not be "worth" the bang for the buck, but this is fundamental language-design stuff, and the result of this failure is that Microsoft (and others) encourage you to do the wrong thing, namely to prefer abstract classes over interfaces (unless you REALLY need to): https://msdn.microsoft.com/en-us/library/ms229013(v=vs.100).aspx . That stance is based purely on implementation choices of the CLR, and not on what makes for a better programming model.

... Even so, I'd expect a 2-4x hit AT MOST, but I see some benchmarks that indicate that the descrepency is 10x in some cases? What the heck?

C'mon Microsoft! Software design should cater to doing the right thing, and not the other way around.

Software design should cater to doing the right thing, and not the other way around.

Sometimes there's no such thing as the right thing, there are only trade-offs. Your interface implementation(s) trade memory for speed. Sometimes that can't even be called a trade-off but simply a deoptimization.

@d-cook

I see no reason why invoking a call on an interface could not be at least as fast as invoking a virtual method.

As far as I can see, nothing in this issue is about the difference between virtual calls and interface calls. It's all about the difference between interface calls and non-virtual calls.

Each "instance" of an interface consists of a reference to the implementing object

That's not how interfaces are implemented in the CLR. I think that changing that would require incredibly good reasons.

It might not be "worth" the bang for the buck, but this is fundamental language-design stuff, and the result of this failure is that Microsoft (and others) encourage you to do the wrong thing, namely to prefer abstract classes over interfaces (unless you REALLY need to): https://msdn.microsoft.com/en-us/library/ms229013(v=vs.100).aspx . That stance is based purely on implementation choices of the CLR, and not on what makes for a better programming model.

Keep in mind that you're linking to an outdated document. The current version of the same guidelines does not include that document at all.

I see some benchmarks that indicate that the descrepency is 10x in some cases? What the heck?

Can you give specific examples? Without seeing the code for a benchmark, it's really hard to talk about what it means. Especially since, as has been mentioned before, 10x difference in a microbenchmark might not mean much, if that microbenchmark is not representative of real code.

How easy would it have been to miss that information, if the goal is to justify past decisions, rather than to honestly seek the better path?

respond with a half-baked reason to be dismissive, rather than providing some reasonable explanation as to why this

I was considering writing a more detailed response but I see no point in wasting my time after such a reply. I'll also note that your initial reply didn't include this so it's clear that you considered this very important to add to deserve an edit.

I do not work for Microsoft and I have no obligation to do anything - respond to issues, do code reviews and send pull requests to improve existing code. Yet I have spent quite a bit of my spare time doing just that. You have provided no serious analysis of your implementations and it's also pretty obvious that you're not familiar with the existing implementation. I see no reason to spend a lot of time explaining the various additional issues (and there are quite a few issues) your implementations have when simply pointing out that such implementations result in increased memory usage should suffice. Every serious .NET developer knows that allocating more memory is not a good thing due to GC. Some more advanced developers will also know that's not a good thing due to the fact that accessing memory isn't exactly fast when the data doesn't fit in CPU caches.

Sitting in that armchair reading books and accusing companies and people about lack of honesty and half-backed responses must be really comfortable. I suggest changing the armchair. Or the attitude...

I agree and I apologize for my embarrassing overreaction.

Thanks for providing some insight into how that extra bit of memory (specifically when referencing an object) can have so many negative impacts within the CLR.

I'm still confused as to why interface calls are slower than virtual method calls (since they can be implemented with the same amount of indirection); but that's off topic here, and your response provides enough insight that I can see how that might be.

There is fairly good documentation on the current approach for interface dispatch along with some of the rationale behind it here: Virtual Stub Dispatch.

Also it is not hard to find writeups online about the older IVMap approach that VSD supplanted.

I'm still confused as to why interface calls are slower than virtual method calls

Note that while they are indeed slower the usual numbers are well below 10x, even below 2x.

The 10x number that appears in some posts above is about the cost of interface calls versus direct calls. Since direct calls can be inlined and interface/virtual calls usually cannot, the slowdown caused by interface/virtual calls can be very high.

And the difference between interface and virtual calls could probably be erased in many cases by JIT optimizations (e.g. dotnet/runtime#6558) that are likely simpler than the optimizations required to get rid of the interface/virtual calls completely.

For those who want to actually get data on things like this

First there are some performance articles. The 'Part 2' article is particularly relevant.

The article talks about MeasureIt, which you can get here

Note that measureit /usersGuide gives you USEFUL information on the care needed in collecting microbenchmark performance information. It is VERY easy to mislead yourself.

In the specific case of interface dispatch, how often the SAME code location goes to DISTICT targets matters. .NET Interface dispatch is TUNED for the MONOMORPHIC case (that is the case where at any particular code site, you tend to go to the SAME target (that is you make calls with object of the same type). This case is DEFINATELY not 10X slower than virtual calls (it is about the same, but the exact number depends on LOTS of details). However if your code side is very POLYMORPHIC, then it will not perform as well (I would be surprised by 10X, but again, the details of the benchmark matter)

Now if your code always does POLYMORPHIC interface dispatch, then .NET is not tuned for your case. But keep in mind we did not do that capriciously. We did that because the data we have suggests that most (like 99%) of the time interface dispatch in real world programs IS MONOMORPHIC. That, plus the idea that typically call dispatch is NOT the bottleneck (your methods actually do something, and THAT is what takes much of the CPU time), is what drove the current design.

We are however constantly testing our assumptions, as they may change over time. That is why we ARE interested in any cases where REAL code (or even microbenchmarks inspired by REAL code), perform poorly.

The most likely response to such data, would not be a wholesale re-architecting of basic things like Interface dispatch (although we have done it (twice) in the past), but rather seeing if we can improve the polymorphic case in the current design (we actually know we CAN, the question is how much of a difference it will make).

In summary: We DO care about Interface dispatch, but we have to take a very deliberate approach to any changes we make here. Thus the most valuable thing the community can do is GIVE US DATA, that shows where the 'pain points' are in a way that allows us to attack them.

While I can't speak for others, I originally opened the issue/question because we wanted to pass lists of values to a series of functions via an interface like IEnumerable

At first we simply used arrays, which worked perfectly fine. Later someone had a need to accept these values as something other than an array, such as a List

Maintaining a copy of the code for each type we wanted to be able to use would have been time-consuming and frustrating, but simple interfaces like IEnumerable

When using interfaces, I remember seeing overall, real-world processing times anywhere from 1.5-3x longer compared to using an object directly. Unfortunately I don't have any exact data anymore. In the isolated microbenchmark I posted above, this increased to ~10x.

Ideally the CLR might have done something like compile the functions for every type that passes through them, similar to generics. Feasibility aside, it certainly seems possible to improve these types of situations.

What I don't know is how much demand there is for improving this, how performance would change across the board, or the difficulties and drawbacks that would arise from any changes, especially without tiered compilation.

It is somewhat unfortunate that we don't have things that look more like the real world example. I looked at your microbenchmark, and there is a very good chance that what you are measuring is NOT what you think you are measuring. In particular it is not representative of interface dispatch.

The reason is that the objects you are operating on are Arrays (arrays of primitive types in fact). If C# had been Java, you would not be able to pass an array as anything, because an arrays are BUILT IN TYPES (there is no class definition for an array), and thus no place to declare that an array implements IList.

Yet it does. That is because of runtime 'magic' where arrays inherit from System.Array (which in turn declares that it implements IList). However doing this is RARE (most people use arrays as arrays, not as an IList) Because of this we did not spend a lot of time optimizing the 'magic' that the runtime does to make the implementations of the ILIst methods.

Now since your real world example really does want code like this, so we could take it as a request to make this magic fast. However there are alternatives. Your ArrayWrapper is a possibility, but an even simpler one is the following

  1. Pass an IList to your routines, but just before the 'expensive' loop do a cast to see if it really is an array, and if so do the loop over the array, otherwise do the normal foreach.

To be sure this is not a pretty (you are cloning code), but it DOES make it pay for play.

We could probably do something to improve this array magic, but we are not likely to get closer than 3X (down from 10X) in your microbenchmark. It is not clear how many users would benefit (even you may not since if you do (1) above you have already fixed it and get no benefit.

These are the kind of tradeoffs we struggle with all the time. It teaches us to be cautious and methodical, which of course DOES slow things down.

I will mention that the benchmark you showed tells us virtually NOTHING about 'normal' interface dispatch (that is not the magic kind the runtime injects on arrays).

Finally my mantra on things like this is to MEASURE. The PerfView tool would quickly validate or refute where the performance was lost (it would show the helpers being used to implement the implementions of the IList methods. In this case there was a bit more you needed to know (that these are magical and not representative of normal interface dispatch). but it would tell you quite a lot.

I don't really know where this leaves us. You can start a new issue about NORMAL interface dispatch, or we can try to make the case that this scenario (the IList helpers on arrays), are important, but to do the latter, we need several (or preferably more) examples of code that would benefit.

I guess using arrays wasn't the best way to test this out. The issue wasn't meant to be specifically about arrays, but I used them as an example because they were the simplest way to demonstrate the issue (Or so I thought).

The programs I worked on were related to data processing. One not-too-uncommon operation might be doing this operation on two vectors to produce a third: c = sqrt(a^2 + b^2)

I put together a quick benchmark with BenchmarkDotNet that does this using Lists, and the results were closer to what I remember seeing.

Method | Mean | Error | StdDev |
--------------- |---------:|----------:|----------:|
RunInterface | 60.91 ms | 0.5285 ms | 0.4685 ms |
RunNoInterface | 31.87 ms | 0.2153 ms | 0.1909 ms |

Since Lists are backed with arrays and I'm not too familiar with the runtime internals, is this doing normal interface dispatch (I'd assume that it is) and not the special handling used for arrays?

At first we simply used arrays, ... interfaces like IEnumerable<T>

I could see that scenario easily leading to a x10 perf drop; however you aren't directly comparing the cost of the calls.

The array version will behave like its completely inlined so there will be no calls and things like bounds checking will be hoisted to only be done once per loop; so it will be a negative call cost.

On the other hand IEnumerable is quite an expensive abstraction; there's one IEnumerator interface call for the loop; then two IEnumerator interface calls per iteration; with the MoveNext having a bounds check, finishing with an IDisposable interface call at the end.

Interface calls are about 2x the cost of a non-inlined non-vritual call (call stub->call method)

If the call actually does something (coarse grained api) the dispatch cost is dwarfed by the action; however IEnumerable is particularly expensive because it is a very fined grained api with the double dispatch cost on simple properties and two of them per loop.

There is a potential for monomorphic interface calls to be reduced to a single dispatch call as each call site is specialised https://github.com/dotnet/coreclr/issues/8819; though its probably not straight forward.

A shared generic interface (class rather than struct) call, is again worse because its an additional interface dispatch to invoke the interface method on the item in the call; so combining this with IEnumerable you'd be looking at x6 the call cost. So while the interface dispatch is only x2, it can add up.

td;lr the biggest cost for interface dispatch and virtual dispatch is loosing the inlines on simple properties/methods; and interface dispatch counts as x2 non-inlines; and IEnumerable counts as x4 non inlines.

Shared interface calls via generic constraints can inline (unlike raw interface dispatch); however when they don't they count as interface dispatch; you can use a concrete struct wrapper to devirtualize the calls.

We also ran into this problem when writing an economic simulation program that does a lot of math operations over arrays. I wrote extension methods for double arrays that would make it easier to express the intent of the calculation going on. So for example, we have 2 arrays of doubles, and I want to do the following:

```c#
for (int i=0; i arr[i] *= other[i];


This turned into the following extension method:
```c#
public static double[] MultiplyInPlace(this double[] arr, double[] other)
{
    int end = Math.Min(arr.Length, other.Length);
    for (int i = 0; i < end; i++)
        arr[i] *= other[i];

    return arr;
}

This would allow us to do multiple operations all at once and make it easy to see what was going on -- such as:

```c#
// Every variable is a double[], which represents a month of data. We want to apply this formula for all months
// AD_Val_Tax = (Gross_Rev - Sev_Tax) * AD_Val_Rate
adValTax
.Copy(gross_Rev)
.SubtractInPlace(sevTax)
.MultiplyInPlace(adValRate);


So for the most part we want the same operation to happen on the entire array -- or there are versions of the extension method that have a "startIndex" and "count" or "lastInclIndex" parameter.  But there is a huge part of the calculation where we might want to operate only on a year's worth of data at a time.  So we created a class called "ArraySlice" which is pretty much functionally equivalent to ArraySegment (but this was before ArraySegment existed -- or we have something custom in place to help with memory allocations).  It naturally made sense to change our extension method to look like this:

```c#
public static IList<double> MultiplyInPlace(this IList<double> arr, IList<double> other)
{
    int end = Math.Min(arr.Count, other.Count);
    for (int i = 0; i < end; i++)
        arr[i] *= other[i];

    return arr;
}

This does reduce performance by about half. We now have an extension method that operates on double[], ArraySlice, and every now and then we have a method that takes double[] as the first argument and ArraySlice as the second argument. So we may have 3 methods that all do exactly the same thing that could be replaced by a single method, except for the performance concerns. And that's just for "MultiplyInPlace." We also have AddInPlace, SubtractInPlace, DivideInPlace, Or_InPlace, And_InPlace, MinimumShouldBe, MaximumShouldBe, BetweenTheseValues, RoundInPlace, AccumulateInPlace.

So we have a lot of duplicated code overall. But, this is a calculation program -- so when these method calls take twice as long, it genuinely translates into the overall program taking twice as long (which can take it from running for 45 minutes to running for 90 minutes).

It would be nice if there wasn't the penalty for this. But we did find a way around it. I just thought about it again today and searched to see if anyone was addressing this in the future.

I just thought about it again today and searched to see if anyone was addressing this in the future.

Can change it to Span<double> now with the System.Memory package

    public static Span<double> MultiplyInPlace(this Span<double> arr, Span<double> other)
    {
        int end = Math.Min(arr.Length, other.Length);
        for (int i = 0; i < end; i++)
            arr[i] *= other[i];

        return arr;
    }

But there is a huge part of the calculation where we might want to operate only on a year's worth of data at a time. So we created a class called "ArraySlice" which is pretty much functionally equivalent to ArraySegment (but this was before ArraySegment existed -- or we have something custom in place to help with memory allocations).

Span<T> allows you to .Slice(offset, length) windows over the array so only they are visible to the method you pass them to; also a ReadOnlySpan<T> version to prevent modification to the elements.

Its far better than ArraySegment in this regard; and should be faster than IList<T> and on CoreClr just as fast as the array.

I had a bit of a crazy idea, and was wondering if it has any merit

Calling a method via a fat inteface pointer has similar performance overheads to virtual method dispatch. They are used in Go for example.

Unfortunately it requires interface pointers being double the size of normal pointers, which causes memory overheads, and means interface pointer reads/writes are not atomic.

On Arm64 (and X86_64) only 48 bits are used for the pointer. The other 16 bits go unused.

Since there's unlikely to be many codebases with more than 2^16 ≈ 64000 implementations of any given interface, how about those 16 unused bits are used to specify which implemention of an interface is used. Then accessing an interface ITable is a matter of bit masks and pointer arithmetic, so long as a static list of all implementations of an interface is available.

Essentially we compress a fat pointer down to the size of a normal pointer by making use of those unused 16 bytes on 64 bit systems.

@YairHalberstadt, I love this idea and have wondered about it a couple times in the past few years. They are known as tagged pointers if you want to read more about them.

Additionally, if memory is aligned you can use lower bits as well (3 bits on 64-bit aligned addresses).

One thing to note is this: https://lwn.net/Articles/717293/ (basically, a move to 57-bit addresses is coming eventually).

RyuJIT could query the system at run-time to discover what bits are available at the top and bottom of the pointer and emit tagged pointer instructions when available.

Thinking about it a bit more:

Tagged interface pointers cannot be used everywhere. Firstly, it won't work where a generic type is constrained to multiple interfaces. Secondly an interface may have more implementations than the number of free bits available can reference. This is most likely with interfaces like IEquatable or Incomparable.

As such virtual stub dispatch must be continue to be used where necessary.

However in almost all cases on a 64 bit architecture there will be some bits free on a pointer. Where a generic type is constrained to multiple interfaces the JIT would have to decide which interface to use the spare bits for.

Let us assume there are m free bits in a pointer. Then for the first 2^m-1 implementations of an interface to be instantiated, an itable pointer would be added to a list of such pointers for the interface. The free bits would reference the index in the list of the itable for that interface. Any further implementations to be instantiated would have 1s in all the free bits. This would indicate that virtual stub dispatch should be used. However I imagine that in practice this scenario will be rare.

Performance wise we have to add a mask to every interface pointer dereference on x86_64, and a mask and three levels of pointer indirection to an interface method call.

@YairHalberstadt

Performance wise we have to add a mask to every interface pointer dereference on x86_64, and a mask and three levels of pointer indirection to an interface method call.

That's a lot of indirection. What makes you think it would be better than the current implementation?

I don't. However virtual method dispatch has two levels of pointer indirection, and is more than double as fast as best case virtual stub dispatch in my measurements, so I think it may well be worth investigating.

Sorry I made a mistake.
It should have the same amount of pointer indirection as virtual method dispatch.

One level of pointer indirection to get a pointer to the ITable/VTable

Another level to get the correct function pointer.

Now that the tiered JIT will be enabled by default soon, are there any plans to inline indirect calls as a tiered optimization?

@AndyAyersMS the last question here was about whether its possible to inline indirect calls. Should this be moved to codegen area for future consideration?

Sure ... we are slowly building up the needed tech to improve here.

Didn't reread all the upthread but one of the things that can get overlooked is that optimizing the cost of the dispatch is not the real prize -- it's being able to inline the method being called, and all the knock-on optimizations that enables.

@Thealexbarney if you still have those original tests around I have some changes to try out (see #43607). On your simple test above I get numbers like:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-BMVYOC : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

EnvironmentVariables=COMPlus_TieredPgo=1,COMPlus_JitEnableGuardedDevirtualization=1,COMPlus_JitClassProfiling=1,COMPlus_TC_QuickJitForLoops=1  Toolchain=CoreRun  IterationCount=20

| Method | Mean | Error | StdDev | Code Size |
|--------------- |---------:|---------:|---------:|----------:|
| RunInterface | 54.36 ms | 2.801 ms | 3.113 ms | 572 B |
| RunNoInterface | 49.08 ms | 4.439 ms | 5.112 ms | 357 B |

The only tests I still have around now are the gists I posted on this issue. Gist 1 Gist 2

I've read your PGO docs over the past week or two and I've been interested in seeing the improvements it would bring. The numbers from your prototype are already looking pretty good.

Speed of List<T> indexer is not good in any case so taking it for performance critical test where virtual calls are taken into consideration is little awkward. (Array vs. List Performance)

@Thealexbarney You pointed out JVM capability that make these thing fine but you did can just make you way using abstract class.

public abstract class AC<T>
{
    public abstract T this[int index] { get; set; }
}

public class PerfAbstractedArray<T> : AC<T>
{
    private readonly T[] valuesMadeToArray;

    public PerfAbstractedArray(T[] valuesMadeToArray)
    {
        this.valuesMadeToArray = valuesMadeToArray;
    }

    public override T this[int index] 
    {
        get => valuesMadeToArray[index];
        set => valuesMadeToArray[index] = value;
    }
}

public class PerfAbstractedList<T> : AC<T>
{
    private readonly List<T> valuesMadeToList;

    public PerfAbstractedList(List<T> valuesMadeToList)
    {
        this.valuesMadeToList = valuesMadeToList;
    }

    public override T this[int index]
    {
        get => valuesMadeToList[index];
        set => valuesMadeToList[index] = value;
    }
}

FMPOV framework guidelines are fit for particular purpose so if you are interested in performance you are going to be tuner all the time. Even in the future. Other told well there are trade-offs, costs and priorities. On the other hand autogained performance for no research is not hurting anyone at all.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

iCodeWebApps picture iCodeWebApps  ·  3Comments

yahorsi picture yahorsi  ·  3Comments

omajid picture omajid  ·  3Comments

GitAntoinee picture GitAntoinee  ·  3Comments

omariom picture omariom  ·  3Comments