Fsharp: Striding loop performance

Created on 7 Feb 2016  路  13Comments  路  Source: dotnet/fsharp

System.numerics.vectors exposes a SIMD enhanced Vector classes. Using VS2015 Update 1, latest versions of .NET framework and F# and System.numerics.vectors the performance of System.Numerics is worse than not using it at all, for instance:

``` F#
let sumVectorLoop =
let mutable total = Vector.Zero
for i in 0 .. COUNT/8-1 do
total <- total + vecArray.[i]
total

Is slower than the same operation on an array of integers:

``` F#
     let sumsLoop =
            let mutable total = 0;
            for i in 0 .. COUNT - 1 do
                total <- total + numsArray.[i]
            total

I have confirmed that Vector.isHardwareAccelerated reports as true. I have confirmed that equivalent code in C# runs ~2x faster for the Vector approach. Interestingly, using Array.reduce on the vector array is faster than the imperative loop, which is the opposite of working with an array of ints, suggesting something may be amiss:

F# let sumVectorReduce = Array.reduce (fun a e -> a + e) vecArray

Area-Compiler Feature Improvement Tenet-Performance

Most helpful comment

I started to take a look at this, and it's not easy.

One problem is that the F# "FastIntegerLoop" TAST construct can't represent striding loops. It could be extended, but this has to be done with care since the construct can (and does) occur in optimization information and the representations of inlined functions. Ideally care should be taken that DLLs that generate this new construct be consumable by down-level F# compilers, but that's hard to arrange.

Another problem is that "F#-style loops" for x in n .. step .. m are currently generated using an "bne" branch-not-equals instruction at the end condition. This is done because m might be MaxInt. But this won't work for striding loops - a less-than operation is needed. But a less-than operation doesn't work when m is above MaxInt - step since a wrap-around occurs.

Perhaps we could just sacrifice semantics for striding loops near the maxint condition - though whatever we do parity with C# is really needed. Perhaps I need to look more closely at C# code generation for these cases

All 13 comments

Could you please give the exact code in C#? There must be a difference in F# and C# code generation which is tripping up the optimizations applied by the RyuJIT.

BTW does anyone know the right people to contact on the RyuJIT team? They may well want to fix this there rather than fiddling with F# codegen, as whatever needs fixing may lead to more general and more robust loop optimizations.

Here you go

``` C#
const int COUNT = 10000000;
int[] ints = new int[COUNT];
for (int i = 0; i < COUNT-1;i++)
{
ints[i] = i;
}

        Vector<int>[] vectors = new Vector<int>[COUNT / 8];
        for (int i = 0; i < COUNT / 8 - 1; i = i + 8)
        {
            vectors[i] = new Vector<int>(ints, i);
        }

        Stopwatch sw = new Stopwatch();
        sw.Start();
        Vector<int> sum = Vector<int>.Zero;

        for (int i = 0; i < COUNT/8-1;i++)
        {
            sum = sum + vectors[i];
        }

        sw.Stop();
        Console.WriteLine("Sum vector:" + sum + " time:" + sw.ElapsedMilliseconds);


        sw.Restart();
        int sumi = 0;
        for (int i = 0; i < COUNT-1; i++)
        {
            sumi = sumi + ints[i];
        }
        Console.WriteLine("Sum integers:" + sum + " time:" + sw.ElapsedMilliseconds);
        Console.ReadLine();

```

Here is the IL of each vector summing loop as reported by ILSpy:

C#:

IL_004e: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::get_Zero()
        IL_0053: stloc.2
        IL_0054: ldc.i4.0
        IL_0055: stloc.s i
        IL_0057: br.s IL_006e
        // loop start (head: IL_006e)
            IL_0059: ldloc.2
            IL_005a: ldloc.1
            IL_005b: ldloc.s i
            IL_005d: ldelem.any valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>
            IL_0062: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::op_Addition(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>)
            IL_0067: stloc.2
            IL_0068: ldloc.s i
            IL_006a: ldc.i4.1
            IL_006b: add
            IL_006c: stloc.s i

            IL_006e: ldloc.s i
            IL_0070: ldc.i4 1250000
            IL_0075: blt.s IL_0059
        // end loop

F#

IL_006a: ldloc.3
        IL_006b: stloc.1
        IL_006c: nop
        IL_006d: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::get_Zero()
        IL_0072: stloc.s total
        IL_0074: ldc.i4.0
        IL_0075: stloc.s i
        IL_0077: ldc.i4 10000000
        IL_007c: ldc.i4.8
        IL_007d: div
        IL_007e: stloc.2
        IL_007f: ldloc.2
        IL_0080: ldloc.s i
        IL_0082: blt.s IL_00a2
        // loop start (head: IL_0084)
            IL_0084: ldloc.s total
            IL_0086: ldloc.1
            IL_0087: ldloc.s i
            IL_0089: ldelem.any valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>
            IL_008e: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::op_Addition(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>)
            IL_0093: stloc.s total
            IL_0095: ldloc.s i
            IL_0097: ldc.i4.1
            IL_0098: add
            IL_0099: stloc.s i
            IL_009b: ldloc.s i
            IL_009d: ldloc.2
            IL_009e: ldc.i4.1
            IL_009f: add
            IL_00a0: bne.un.s IL_0084
        // end loop

@CarolEidt works on the SIMD feature on RyuJIT. Carol, do you recognize any differences in the C#/F# IL that would lead to significantly worse performance?

On the usage side, I did want to point out that we don't usually recommend folks actually store their data in an array of Vector<int>'s. The idiomatic way to use the library would be to keep everything stored in your original int[] array, and construct Vector items directly from that. So your vectorized loop would be something like:

for (int i = 0; i < COUNT / Vector<int>.Count; i += Vector<int>.Count)
{
    sum = sum + new Vector<int>(ints, i);
}

Another thing to note is that your scalar and vector passes aren't necessarily computing the same thing. To get the sum of a single array's values efficiently, we'd need to something like a HorizontalAdd method (which would compute the sum of an individual Vector<T>'s elements. Right now you are left with a Vector<int> at the end of your vector computation, which you'd need to manually add together to get the same result as the scalar version.

The idiomatic approach seems to make use of an enumerator for the loop which kills performance:

``` F#
let sumVectorLoop =
let mutable total = Vector.Zero
for i in 0 .. 8 .. COUNT-1 do
total <- total + new Vector(numsArray,i)
total

if I get rid of the `.. 8 ..` in the for loop and do that by hand, or use a for loop that counts by one and then multiply the index, a do while loop is generated instead and performance is in line with C#:

``` F#
 let sumVectorLoop =
        let mutable total = Vector<int>.Zero
        for i in 0  .. COUNT/8-1 do
            total <- total + new Vector<int>(numsArray,i*8)
        total    

If there are good reasons why that style of for loop becomes an enumerator and this is just reflecting my lack of practice with the language feel free to close. Thanks for the help everyone.

@jackmott There is no fundamentally good reason why loops like that use an enumerator, except that the necessary optimizations just aren't done in the F# compiler for striding loops. It would be awesome to have this fixed.

@dsyme,

Ha, beat me to it by a few minutes :-)

In DetectAndOptimizeForExpression there is a check that is only optimizing for step of 1 or -1.

And then an appropriate change to mkFastForLoop would be required.

@manofstick And then real care near MinInt and MaxInt. (If necessary a dynamic check on entry to the loop could be used in those cases)

I started to take a look at this, and it's not easy.

One problem is that the F# "FastIntegerLoop" TAST construct can't represent striding loops. It could be extended, but this has to be done with care since the construct can (and does) occur in optimization information and the representations of inlined functions. Ideally care should be taken that DLLs that generate this new construct be consumable by down-level F# compilers, but that's hard to arrange.

Another problem is that "F#-style loops" for x in n .. step .. m are currently generated using an "bne" branch-not-equals instruction at the end condition. This is done because m might be MaxInt. But this won't work for striding loops - a less-than operation is needed. But a less-than operation doesn't work when m is above MaxInt - step since a wrap-around occurs.

Perhaps we could just sacrifice semantics for striding loops near the maxint condition - though whatever we do parity with C# is really needed. Perhaps I need to look more closely at C# code generation for these cases

I hit this issue today. Below is a complete (.NET Core 3.1 + BenchmarkDotNet) console program illustrating the problem.

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

type ForLoopComparison ()=
    let n = 1000000

    [<Benchmark>]
    member this.ForTo() =
        let mutable acc = 0

        for i = 1 to (n / 2) do
            acc <- acc + (2 * i)
        acc

    [<Benchmark>]
    member this.ForIn() =
        let mutable acc = 0

        for i in 1 .. (n / 2) do
            acc <- acc + (2 * i)
        acc

    [<Benchmark>]
    member this.ForInStep() =
        let mutable acc = 0

        for i in 2 .. 2 .. n do
            acc <- acc + i
        acc

[<EntryPoint>]
let main argv =
    BenchmarkRunner.Run<ForLoopComparison>() |> ignore
    0

The three methods give identical results. Benchmarked times on my PC are 235碌s for the first two methods but 2728碌s for the last one.

@dsyme given the renewed focus on slicing (and its syntax) for .NET 5, what do you think of revisiting this? How common do you feel this scenario is for numeric programming in general?

@dsyme given the renewed focus on slicing (and its syntax) for .NET 5, what do you think of revisiting this? How common do you feel this scenario is for numeric programming in general?

Yes, we should fix this, definitely.

Was this page helpful?
0 / 5 - 0 ratings