Runtime: Question: Why for loop is 1.3 slower over byte[] than foreach

Created on 28 Oct 2019  路  22Comments  路  Source: dotnet/runtime

Hi Guys!

Usually, people used to thing foreach version of loop is always slower than for and so I, but, when I benchmarked it I was surprised how much (1.3 times) for version is actually slower when iterating over bytes array (similar for int/uint). Is that something expected and ok? Here is the code.

using System;

public class C {
    private byte[] _byteData;

    public int byte_sum_ByIndex()
    {
        int sum = 0;
        for (int i = 0; i < _byteData.Length; ++i)
        {
            sum += _byteData[i];
        }

        return sum;
    }

    public int byte_sum_Foreach()
    {
        int sum = 0;
        foreach (var n in _byteData)
        {
            sum += n;
        }

        return sum;
    }
}
.class private auto ansi '<Module>'
{
} // end of class <Module>

.class public auto ansi beforefieldinit C
    extends [System.Private.CoreLib]System.Object
{
    // Fields
    .field private uint8[] _byteData

    // Methods
    .method public hidebysig 
        instance int32 byte_sum_ByIndex () cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 34 (0x22)
        .maxstack 3
        .locals init (
            [0] int32,
            [1] int32
        )

        IL_0000: ldc.i4.0
        IL_0001: stloc.0
        IL_0002: ldc.i4.0
        IL_0003: stloc.1
        // sequence point: hidden
        IL_0004: br.s IL_0015
        // loop start (head: IL_0015)
            IL_0006: ldloc.0
            IL_0007: ldarg.0
            IL_0008: ldfld uint8[] C::_byteData
            IL_000d: ldloc.1
            IL_000e: ldelem.u1
            IL_000f: add
            IL_0010: stloc.0
            IL_0011: ldloc.1
            IL_0012: ldc.i4.1
            IL_0013: add
            IL_0014: stloc.1

            IL_0015: ldloc.1
            IL_0016: ldarg.0
            IL_0017: ldfld uint8[] C::_byteData
            IL_001c: ldlen
            IL_001d: conv.i4
            IL_001e: blt.s IL_0006
        // end loop

        IL_0020: ldloc.0
        IL_0021: ret
    } // end of method C::byte_sum_ByIndex

    .method public hidebysig 
        instance int32 byte_sum_Foreach () cil managed 
    {
        // Method begins at RVA 0x2080
        // Code size 33 (0x21)
        .maxstack 2
        .locals init (
            [0] int32,
            [1] uint8[],
            [2] int32,
            [3] uint8
        )

        IL_0000: ldc.i4.0
        IL_0001: stloc.0
        IL_0002: ldarg.0
        IL_0003: ldfld uint8[] C::_byteData
        IL_0008: stloc.1
        IL_0009: ldc.i4.0
        IL_000a: stloc.2
        // sequence point: hidden
        IL_000b: br.s IL_0019
        // loop start (head: IL_0019)
            IL_000d: ldloc.1
            IL_000e: ldloc.2
            IL_000f: ldelem.u1
            IL_0010: stloc.3
            IL_0011: ldloc.0
            IL_0012: ldloc.3
            IL_0013: add
            IL_0014: stloc.0
            // sequence point: hidden
            IL_0015: ldloc.2
            IL_0016: ldc.i4.1
            IL_0017: add
            IL_0018: stloc.2

            IL_0019: ldloc.2
            IL_001a: ldloc.1
            IL_001b: ldlen
            IL_001c: conv.i4
            IL_001d: blt.s IL_000d
        // end loop

        IL_001f: ldloc.0
        IL_0020: ret
    } // end of method C::byte_sum_Foreach

    .method public hidebysig specialname rtspecialname 
        instance void .ctor () cil managed 
    {
        // Method begins at RVA 0x20ad
        // Code size 7 (0x7)
        .maxstack 8

        IL_0000: ldarg.0
        IL_0001: call instance void [System.Private.CoreLib]System.Object::.ctor()
        IL_0006: ret
    } // end of method C::.ctor

} // end of class C
; Core CLR v4.700.19.46205 (coreclr.dll) on x86.

C..ctor(Byte[])
    L0000: mov eax, edx
    L0002: lea edx, [ecx+0x4]
    L0005: call 0x6ed991d0
    L000a: ret

C.byte_sum_ByIndex()
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: push esi
    L0004: push ebx
    L0005: xor eax, eax
    L0007: xor edx, edx
    L0009: mov ecx, [ecx+0x4]
    L000c: cmp dword [ecx+0x4], 0x0
    L0010: jle L0026
    L0012: mov esi, ecx
    L0014: cmp edx, [esi+0x4]
    L0017: jae L002a
    L0019: movzx ebx, byte [esi+edx+0x8]
    L001e: add eax, ebx
    L0020: inc edx
    L0021: cmp [ecx+0x4], edx
    L0024: jg L0012
    L0026: pop ebx
    L0027: pop esi
    L0028: pop ebp
    L0029: ret
    L002a: call 0x6eec2680
    L002f: int3

C.byte_sum_Foreach()
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: push esi
    L0004: push ebx
    L0005: xor eax, eax
    L0007: mov edx, [ecx+0x4]
    L000a: xor ecx, ecx
    L000c: mov esi, [edx+0x4]
    L000f: test esi, esi
    L0011: jle L001f
    L0013: movzx ebx, byte [edx+ecx+0x8]
    L0018: add eax, ebx
    L001a: inc ecx
    L001b: cmp esi, ecx
    L001d: jg L0013
    L001f: pop ebx
    L0020: pop esi
    L0021: pop ebp
    L0022: ret

https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcDMACR2DC2A3utmdgA4BOAlgG5jACm2ARgJ7MDaAutgPodmAEUZgA3OlLks2GgDtgbTk34QArlH4AhdgEl5AEyYAPABQBKaWRJpy9uYuwao2ALzYADJLsOyAMwB7KmwzBSUady9xOWwAHgEhJlFgMAA6ABkmeQBzYAALGIBqIporXz9bPz8XbCKPQRUUsC4aHh9q7ABfawde+zgAdmdNDvIetF7ZcOVmNU1+ADFgpjAAY3zLXqrqmdqPb37yIKpVjdCGEPlHRKaxcs6dzrJa+ux5Mb8JzqOyIZGoJ9uugukA==

category:cq
theme:range-check
skill-level:intermediate
cost:medium

JitUntriaged area-CodeGen-coreclr question

Most helpful comment

What is maybe a bit unfortunate is that making the field readonly doesn't help while you would expect.

All 22 comments

If you copy _byteData to a local variable in both methods - the codegen will be the same.

If you copy _byteData to a local variable in both methods - the codegen will be the same.

Wow. Almost identical.

But, anyway, for is still slower without that trick. Could you pls hint why?

@yahorsi the answer is there in your sharplab link I guess 馃檪 (unsuga

image

It's better to work with a local variable in case of performance critical code in such cases, e.g. JIT doesn't eliminate bound checks for your for loop (anyone can modify the field while you iterate it)

JIT doesn't eliminate bound checks for your for loop (anyone can modify the field while you iterate it)

Got it, thanks

anyone can modify the field while you iterate it

Wait :) We just copy the reference, so, anyone still can change it :)

anyone can modify the field while you iterate it

Wait :) We just copy the reference, so, anyone still can change it :)

And that reference points to some old object, and the field now points to a different one.
Btw, here is an example how CoreCLR developers sometimes optimize range checks: https://github.com/dotnet/coreclr/pull/27340/files

And that reference points to some old object

So, you meant that not the "content" on the array might change but the reference. Ok, now clear. Almost. Except some weird benchmark results. (will experiment more and share)

The length of the array for the local variable cannot change (unless it was reassigned in the method; but the Jit could "see" that when compiling); even if the field is updated to something else.

However, the length of array in the field could change because it could be assigned to a different array in a different method of the class.

What is maybe a bit unfortunate is that making the field readonly doesn't help while you would expect.

I think JIT should unchain itself and make a local copy.
Unless the array reference is marked volatile.

I think JIT should unchain itself and make a local copy.
Unless the array reference is marked volatile.

Yeah, but that would be quite a breaking change, e.g if you are calling another instance method in the loop, which is changing the reference to the array (so not even in a multithreading case)... that could be completely on purpose, and you don't want the compiler to start to rewrite something that is intended

@xoofx
Never thought about that..
Then I will like your comment about readonly fields.

This is likely a CQ regression introduced by the fix for dotnet/runtime#9486, I'm pretty sure range check elimination was working in this case before that fix.

The problem is that JIT's value numbering correctly concluded that the 2 _byteData accesses will produce the same value, assuming that there is no thread interference. This assumption is valid for many purposes, since thread interference in this case implies that there's a race condition in the user code and race conditions are basically undefined behavior.

However, this assumption is not valid for range check elimination because said undefined behavior does not include out of range array access, that would compromise type safety.

The fix that was done is probably more conservative than it needs to be. At least in this case, the redundant field load is eliminated by CSE so the race condition effectively disappears. Unfortunately CSE is not something that can be guaranteed so performing the optimization correctly is a bit more complicated - it has to be done only after CSE did eliminate the load or it has to somehow force CSE to happen.

It may be possible to improve this. But at least for the presented example this is completely pointless, there's no reason not to use foreach in this case.

What is maybe a bit unfortunate is that making the field readonly doesn't help while you would expect.

Isn't completely straightforward; the method maybe called by constructor which means the value of the readonly can change https://github.com/dotnet/coreclr/issues/21951#issuecomment-464296078

Isn't completely straightforward; the method maybe called by constructor which means the value of the readonly can change dotnet/runtime#11797 (comment)

Assuming in a constructor you can't call a method that will change the readonly field, so only the constructor can change it. If you call a method, within the constructor, I would expect this to be safe otherwise for anything happening in the constructor the JIT could track correctly assignment to this field during the constructor. If the constructor is too complicated, you could also decide to disable this optimization when JITing the constructor. No?

What is maybe a bit unfortunate is that making the field readonly doesn't help while you would expect.

readonly instance fields can be changed via reflection, at least for now. And anyway it's not clear how much relying on readonly would actually help. While it's nice to make fields readonly it is not always possible and I have a feeling that it's even less possible/practical in the case of arrays. For example, think of all collection classes that use arrays under the cover.

Assuming in a constructor you can't call a method that will change the readonly field, so only the constructor can change it.

Taking the reference of a readonly field from a constructor produces a normal ref, so by-refs can point to readonly fields if a method was called from a constructor.

readonly instance fields can be changed via reflection

But not after the type is initialized, e.g.:

System.FieldAccessException: 'Cannot set initonly static field 'DirectorySeparatorChar' after type 'System.IO.Path' is initialized.'

But not after the type is initialized

@EgorBo, your error example was for a static field, not instance ;)

@stephentoub oops, indeed :) Well, at least it's safe to assume that static readonly fields never change once the host type is initialized

Taking the reference of a readonly field from a constructor produces a normal ref, so by-refs can point to readonly fields if a method was called from a constructor.

Good point, that's also unfortunate... never actually tried that... 馃槄
So yeah, that complicates the detection then... even if we could start to say if a ref is taken of an array, you can't do any optimization in the constructor, but then in the methods you call by passing this ref, you would not be able to access safely the readonly field as it is now aliased by the ref parameter...

Was this page helpful?
0 / 5 - 0 ratings