I listened to the Microsoft Language & Runtime Community standup on Twitch,
where they said they like "aggressive", well-informed feedback from the community,
so I figured I'd bring up something that has been rubbing me the wrong way for a long
time in .NET: tail call performance.
I am sure this has been discussed before, but I am going to add more fuel to the debate.
In .NET it's possible to annotate calls with a .tail
attribute, allowing the
jitter to eliminate stack frames and thus avoid running out of stack space.
If you are a C# developer you have probably never encountered tail calls, as the C#
compiler as of now (AFAIK?) doesn't emit .tail
attributes (this might change with tail return
?).
It's different for F# developers, though, as the compiler does emit .tail
attributes.
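To make this concrete, here is a minimal sketch (the name countDown is mine, not from any library) of a call in tail position. F# handles it either by compiling the self-recursion into a loop or by emitting .tail; either way it runs in constant stack space:

```fsharp
// Hypothetical example: the recursive call is the very last operation,
// so nothing remains to do in the caller's frame after it returns.
let rec countDown (n : int) : int =
    if n = 0 then 0
    else countDown (n - 1)   // tail position: the frame can be reused

// Runs in constant stack space even for very large n,
// instead of overflowing the stack.
countDown 10_000_000 |> ignore
```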
Avoiding running out of stack space is a good thing, so what's the problem?
The problem is the abysmal performance of tail calls in the general case.
Let me demonstrate by implementing a simple data pipeline based on push rather than pull (IEnumerable<_>
is pull-based).
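A hedged sketch of the push-vs-pull distinction, independent of the PushStream type defined below (pullSum/pushSum are my own illustrative names):

```fsharp
// Pull: the consumer drives iteration and asks for the next element.
let pullSum (xs : seq<int>) =
    let mutable s = 0
    for x in xs do s <- s + x   // 'for' pulls from the IEnumerator
    s

// Push: the producer drives iteration and pushes each element into a
// receiver function supplied by the consumer.
let pushSum (produce : (int -> unit) -> unit) =
    let mutable s = 0
    produce (fun x -> s <- s + x)
    s
```

Both compute the same sum; the difference is which side owns the loop.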
// Minimalistic PushStream
// A PushStream accepts a receiver function that will be called
// with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit
module PushStream =
let inline zero () = LanguagePrimitives.GenericZero
let inline push r v = r v
// Creates a PushStream with all integers from b to e (inclusive)
let inline fromRange b e r = for i = b to e do push r i
// Maps all values in ps using mapping function f
let inline map f ps r = ps (fun v -> push r (f v))
// Filters all values in ps using filter function f
let inline filter f ps r = ps (fun v -> if f v then push r v)
// Sums all values in ps
let inline sum ps = let mutable s = zero () in ps (fun v -> s <- s + v); s
For a simple data pipeline designed to expose the overhead of the pipeline itself, the push stream shows promising performance compared to LINQ.
// Uses BenchmarkDotNet
type Benchmarks () =
[<Params (10000, 1000, 100)>]
member val public Count = 100 with get, set
[<Benchmark>]
member x.SimpleImperativeTest () =
let mutable i = x.Count
let mutable sum = 0L
while i >= 0 do
let v = int64 i
i <- i - 1
if (v &&& 1L) = 0L then
sum <- sum + (v + 1L)
sum
[<Benchmark>]
member x.SimpleLinqTest () =
Enumerable.Range(0, x.Count)
.Select(int64)
.Where(fun v -> (v &&& 1L) = 0L)
.Select((+) 1L)
.Sum()
[<Benchmark>]
member x.SimplePushStreamTest () =
PushStream.fromRange 0 x.Count
|> PushStream.map int64
|> PushStream.filter (fun v -> (v &&& 1L) = 0L)
|> PushStream.map ((+) 1L)
|> PushStream.sum
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.1.100
[Host] : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT DEBUG
DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
| Method | Count | Mean | Error | StdDev |
|--------------------- |------ |--------------:|-----------:|-----------:|
| SimpleImperativeTest | 100 | 73.36 ns | 0.259 ns | 0.243 ns |
| SimpleLinqTest | 100 | 1,550.18 ns | 7.837 ns | 7.331 ns |
| SimplePushStreamTest | 100 | 419.71 ns | 1.768 ns | 1.654 ns |
| SimpleImperativeTest | 1000 | 611.51 ns | 2.081 ns | 1.946 ns |
| SimpleLinqTest | 1000 | 13,677.24 ns | 47.074 ns | 44.033 ns |
| SimplePushStreamTest | 1000 | 3,576.24 ns | 11.202 ns | 10.478 ns |
| SimpleImperativeTest | 10000 | 5,996.41 ns | 27.344 ns | 25.578 ns |
| SimpleLinqTest | 10000 | 134,129.03 ns | 443.042 ns | 414.422 ns |
| SimplePushStreamTest | 10000 | 34,988.54 ns | 126.441 ns | 118.273 ns |
The imperative "pipeline" does best, as expected, since there's no pipeline
overhead, but the push stream does a lot better than LINQ. Great, right?
Not so fast! What happens if we add a second test which does the same thing
in a slightly different way?
[<Benchmark>]
member x.StructLinqStreamTest () =
Enumerable.Range(0, x.Count)
.Select(int64)
.Select(fun v -> struct ((v &&& 1L) = 0L, v))
.Select(fun struct (b, v) -> struct (b, v + 1L))
.Select(fun struct (b, v) -> if b then v else 0L)
.Sum()
[<Benchmark>]
member x.StructPushStreamTest () =
PushStream.fromRange 0 x.Count
|> PushStream.map int64
|> PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))
|> PushStream.map (fun struct (b, v) -> struct (b, v + 1L))
|> PushStream.map (fun struct (b, v) -> if b then v else 0L)
|> PushStream.sum
Suddenly the push stream performance is abysmal.
| Method | Count | Mean | Error | StdDev |
|--------------------- |------ |----------------:|-------------:|-------------:|
| StructLinqStreamTest | 100 | 3,060.52 ns | 15.855 ns | 14.831 ns |
| StructPushStreamTest | 100 | 60,241.31 ns | 256.031 ns | 239.492 ns |
| StructLinqStreamTest | 1000 | 27,813.86 ns | 99.601 ns | 93.167 ns |
| StructPushStreamTest | 1000 | 591,909.97 ns | 1,710.549 ns | 1,600.048 ns |
| StructLinqStreamTest | 10000 | 274,592.50 ns | 591.566 ns | 553.351 ns |
| StructPushStreamTest | 10000 | 5,908,867.08 ns | 9,791.398 ns | 9,158.880 ns |
Now LINQ performs about 20x better. What's going on?
Let's fix the problem using magic! Rewrite the push
function from this:
// Original push function, just invokes the function r with v
let inline push r v = r v
Replace it with this nonsense:
// New invoke. Seems just to add redundant code that does nothing
let inline push r v = match r v with () -> ()
Now the push stream compares favorably in all cases:
| Method | Count | Mean | Error | StdDev |
|--------------------- |------ |--------------:|-------------:|-------------:|
| SimpleImperativeTest | 100 | 73.52 ns | 0.290 ns | 0.271 ns |
| SimpleLinqTest | 100 | 1,552.31 ns | 4.984 ns | 4.418 ns |
| SimplePushStreamTest | 100 | 503.42 ns | 3.451 ns | 3.228 ns |
| StructLinqStreamTest | 100 | 3,079.51 ns | 18.761 ns | 17.549 ns |
| StructPushStreamTest | 100 | 2,531.61 ns | 9.913 ns | 9.273 ns |
| SimpleImperativeTest | 1000 | 612.66 ns | 1.832 ns | 1.713 ns |
| SimpleLinqTest | 1000 | 13,667.12 ns | 41.531 ns | 38.848 ns |
| SimplePushStreamTest | 1000 | 4,346.26 ns | 16.795 ns | 15.710 ns |
| StructLinqStreamTest | 1000 | 28,029.09 ns | 70.510 ns | 58.879 ns |
| StructPushStreamTest | 1000 | 22,131.33 ns | 63.323 ns | 59.232 ns |
| SimpleImperativeTest | 10000 | 6,007.44 ns | 19.996 ns | 18.704 ns |
| SimpleLinqTest | 10000 | 133,865.57 ns | 355.534 ns | 332.567 ns |
| SimplePushStreamTest | 10000 | 42,625.87 ns | 75.884 ns | 67.270 ns |
| StructLinqStreamTest | 10000 | 321,961.25 ns | 1,074.170 ns | 1,004.780 ns |
| StructPushStreamTest | 10000 | 242,181.19 ns | 419.269 ns | 392.185 ns |
What's going on?
With the original push
function the IL code to invoke the receiver function looks like this:
// let inline push r v = r v
// Tell the jitter that the call is a tail call
IL_000D: tail.
// Call invoke virtually
IL_000F: callvirt instance !1 class <>::Invoke
// For a tail call the call instruction has to be followed by ret
IL_0014: ret
For the modified push
function, which doesn't change the meaning of the program at all, the IL code looks like this:
// let inline push r v = match r v with () -> ()
// Call Invoke (no tail call)
callvirt instance !1 class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int64, class [FSharp.Core]Microsoft.FSharp.Core.Unit>::Invoke(!0)
// Throw away the result (which is the unit value anyway)
IL_0012: pop
// Push the result (the unit value is null)
IL_0013: ldnull
// Done
IL_0014: ret
Because the modified push
"looks" at the value (the match) and then loads a new result, the call no longer fits the pattern of a tail call. Thus the compiler doesn't inject the .tail
attribute.
Why do .tail calls sometimes go fast and sometimes really, really slow?
Let's look at the assembly code when a .tail call executes quickly.
Fast .tail calls (Yay!)
// This is the implementation of: PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))
// What's this? Leftovers from tiered compilation?
00007fff`90362010 0f1f440000 nop dword ptr [rax+rax]
00007fff`90362015 8bc2 mov eax,edx
// Check if the number is odd
00007fff`90362017 a801 test al,1
00007fff`90362019 7512 jne 00007fff`9036202d
// Number is even, pass value down the pipeline
// This is a virtual call so we find the jump address by looking
// up the value in the vtables
00007fff`9036201b 488b4908 mov rcx,qword ptr [rcx+8]
00007fff`9036201f 488b01 mov rax,qword ptr [rcx]
00007fff`90362022 488b4040 mov rax,qword ptr [rax+40h]
00007fff`90362026 488b4020 mov rax,qword ptr [rax+20h]
// rax now contains the address of the next step in the pipeline
// thanks to .tail call we do a jmp here not call
// this means that when the pipeline finally returns it will return
// directly to the top loop
00007fff`9036202a 48ffe0 jmp rax
// Number was odd, clear the result and return
00007fff`9036202d 33c0 xor eax,eax
00007fff`9036202f c3 ret
So apart from the odd nop at the beginning and the usual vtable dance for
virtual calls (what are the conditions for the jitter to succeed with
devirtualization?) it's quite OK. Surprisingly, it's only about 5x slower than the
fast imperative solution, considering how much more work happens in each step.
Let's look at the slow .tail call.
Slow .tail calls (Boo!)
// This is the implementation of: PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))
// Function prelude
00007ff7`d77725e0 56 push rsi
00007ff7`d77725e1 4883ec40 sub rsp,40h
00007ff7`d77725e5 c5f877 vzeroupper
00007ff7`d77725e8 4c8d442430 lea r8,[rsp+30h]
00007ff7`d77725ed c5f857c0 vxorps xmm0,xmm0,xmm0
00007ff7`d77725f1 c4c17a7f00 vmovdqu xmmword ptr [r8],xmm0
00007ff7`d77725f6 488b7108 mov rsi,qword ptr [rcx+8]
00007ff7`d77725fa 448bc2 mov r8d,edx
// Is number odd?
00007ff7`d77725fd 41f6c001 test r8b,1
00007ff7`d7772601 410f94c0 sete r8b
00007ff7`d7772605 450fb6c0 movzx r8d,r8b
// Save results
00007ff7`d7772609 4488442438 mov byte ptr [rsp+38h],r8b
00007ff7`d777260e 4889542430 mov qword ptr [rsp+30h],rdx
// Checks: Volatile<LONG> g_TrapReturningThreads;
// If true we need to suspend the thread (by calling coreclr!JIT_PollGC ())
00007ff7`d7772613 49b8784b6637f87f0000 mov r8,offset coreclr!g_TrapReturningThreads (00007ff8`37664b78)
00007ff7`d777261d 41833800 cmp dword ptr [r8],0
00007ff7`d7772621 752e jne 00007ff7`d7772651
00007ff7`d7772623 4c8bc6 mov r8,rsi
// Juggling with struct tuple values
00007ff7`d7772626 c5fa6f442430 vmovdqu xmm0,xmmword ptr [rsp+30h]
00007ff7`d777262c c5fa7f442420 vmovdqu xmmword ptr [rsp+20h],xmm0
00007ff7`d7772632 4c8d4c2420 lea r9,[rsp+20h]
00007ff7`d7772637 48b990d162d7f77f0000 mov rcx,7FF7D762D190h
// Loading vtable to find the address to call to
00007ff7`d7772641 498b10 mov rdx,qword ptr [r8]
00007ff7`d7772644 488b5240 mov rdx,qword ptr [rdx+40h]
00007ff7`d7772648 488b5220 mov rdx,qword ptr [rdx+20h]
// Do the call to next step through: coreclr!JIT_TailCall
00007ff7`d777264c e89ff6c35f call coreclr!JIT_TailCall (00007ff8`373b1cf0)
00007ff7`d7772651 e86ad6c35f call coreclr!JIT_PollGC (00007ff8`373afcc0)
00007ff7`d7772656 ebcb jmp 00007ff7`d7772623
So, there's a lot more setup here, but that's because this function actually needs
a stack frame to store intermediate results. I also suspect that because the stack frame
is needed, it has to go through coreclr!JIT_TailCall
at the end, and that is the CPU hog.
I don't know exactly what coreclr!JIT_TailCall
does, but it does a lot when
stepping through the assembly code. My suspicion is that its purpose is
to eliminate the stack frame and then call the next function. While it does eliminate the stack frame, it adds about 60x overhead to the PushStream
pipeline.
Finally, let's look at the code for the modified push
function that gets us back to
predictable performance:
// Function prelude
00007ff7`d77725e1 4883ec40 sub rsp,58h
00007ff7`d78625f4 c5f877 vzeroupper
00007ff7`d78625f7 33c0 xor eax,eax
00007ff7`d78625f9 4889442448 mov qword ptr [rsp+48h],rax
00007ff7`d78625fe 4889442450 mov qword ptr [rsp+50h],rax
00007ff7`d7862603 488d442438 lea rax,[rsp+38h]
00007ff7`d7862608 c5f857c0 vxorps xmm0,xmm0,xmm0
00007ff7`d786260c c5fa7f00 vmovdqu xmmword ptr [rax],xmm0
00007ff7`d7862610 8bc2 mov eax,edx
// Is number odd?
00007ff7`d7862612 a801 test al,1
00007ff7`d7862614 0f94c0 sete al
00007ff7`d7862617 0fb6c0 movzx eax,al
// Juggling with struct tuple values
00007ff7`d786261a 88442440 mov byte ptr [rsp+40h],al
00007ff7`d786261e 4889542438 mov qword ptr [rsp+38h],rdx
00007ff7`d7862623 c5fa6f442438 vmovdqu xmm0,xmmword ptr [rsp+38h]
00007ff7`d7862629 c5fa7f442448 vmovdqu xmmword ptr [rsp+48h],xmm0
00007ff7`d786262f 488b4908 mov rcx,qword ptr [rcx+8]
00007ff7`d7862633 c5fa6f442448 vmovdqu xmm0,xmmword ptr [rsp+48h]
00007ff7`d7862639 c5fa7f442428 vmovdqu xmmword ptr [rsp+28h],xmm0
00007ff7`d786263f 488d542428 lea rdx,[rsp+28h]
// Loading vtable to find the address to call to
00007ff7`d7862644 488b01 mov rax,qword ptr [rcx]
00007ff7`d7862647 488b4040 mov rax,qword ptr [rax+40h]
// Call the next step in the pipeline
00007ff7`d786264b ff5020 call qword ptr [rax+20h]
// The function returns, clear eax
00007ff7`d786264e 33c0 xor eax,eax
// Deallocate stackframe
00007ff7`d7862650 4883c458 add rsp,58h
// Return to previous chain
00007ff7`d7862654 c3 ret
Because juggling with the struct tuple is more involved than in the first example, the
code is more complex, but at least the next step in the pipeline is invoked without
the need for coreclr!JIT_TailCall
, which means the performance is reasonable and
predictable.
It turns out that the only overhead of match r v with () -> ()
ends up being a
xor eax,eax
, which is essentially free compared to everything else.
I believe most of us only have to make sure we avoid writing
code that has TRULY TERRIBLE PERFORMANCE. We can get away with merely bad performance
in 99% of the code we write.
However, if I need to write performant code in F# (and maybe future versions of
C#), then because tail calls can be really, really slow I have to be very careful
to write the code so that the F# compiler doesn't emit the .tail
attribute.
Sure, I have a pattern that allows me to do that in this case, but what if the F# compiler improves in a future release and eliminates the nonsense code I wrote to avoid tail calls? The F# compiler has options that allow me to suppress tail calls (--tailcalls-), but F# also supports true inlining, which means that even if my library is compiled without tail calls, when its functions are inlined into a calling assembly tail calls might well be injected there.
Further, tail calls sometimes do go faster, when no stack frame is needed.
This puts me in a frustrating spot; I want tail calls but I don't want to pay the
price of the worst case tail call performance.
Obviously the best solution would be that tail calls are always faster than normal
calls but I am sure that is tricky to implement (otherwise it would have been done already).
The second-best solution for my particular scenario would be: if the tail call
can be as fast as or faster than a normal call, do it; otherwise fall back to a normal call.
That is probably confusing, as it doesn't always preserve the semantics
of a tail call, but for this particular scenario it is what I want.
I realize F# is a small language, so I am hoping that the C# compiler will start
emitting .tail
attributes so that the big C# community notices the awkward
performance of tail calls.
Be sure to complain to the Microsoft Language & Runtime Community standup on Twitch
if you were bored by this rant and want to see less of it.
Regards.
// Turn off tiered compilation
// $env:COMPlus_TieredCompilation="0"
// dotnet core : 3.1.100
// FSharp.Core : 4.7.0
// BenchMarkDotNet: 0.12.0
module TailCall =
open System
open System.Linq
open System.Diagnostics
// let inline push r v = match r v with () -> ()
// let inline push r v = r v
// Minimalistic PushStream
// A PushStream accepts a receiver function that will be called
// with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit
module PushStream =
let inline zero () = LanguagePrimitives.GenericZero
let inline push r v = r v
// Creates a PushStream with all integers from b to e (inclusive)
let inline fromRange b e r = for i = b to e do push r i
// Maps all values in ps using mapping function f
let inline map f ps r = ps (fun v -> push r (f v))
// Filters all values in ps using filter function f
let inline filter f ps r = ps (fun v -> if f v then push r v)
// Sums all values in ps
let inline sum ps = let mutable s = zero () in ps (fun v -> s <- s + v); s
module Tests =
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
type Benchmarks () =
[<Params (10000, 1000, 100)>]
member val public Count = 100 with get, set
[<Benchmark>]
member x.SimpleImperativeTest () =
let mutable i = x.Count
let mutable sum = 0L
while i >= 0 do
let v = int64 i
i <- i - 1
if (v &&& 1L) = 0L then
sum <- sum + (v + 1L)
sum
[<Benchmark>]
member x.SimpleLinqTest () =
Enumerable.Range(0, x.Count)
.Select(int64)
.Where(fun v -> (v &&& 1L) = 0L)
.Select((+) 1L)
.Sum()
[<Benchmark>]
member x.SimplePushStreamTest () =
PushStream.fromRange 0 x.Count
|> PushStream.map int64
|> PushStream.filter (fun v -> (v &&& 1L) = 0L)
|> PushStream.map ((+) 1L)
|> PushStream.sum
[<Benchmark>]
member x.StructLinqStreamTest () =
Enumerable.Range(0, x.Count)
.Select(int64)
.Select(fun v -> struct ((v &&& 1L) = 0L, v))
.Select(fun struct (b, v) -> struct (b, v + 1L))
.Select(fun struct (b, v) -> if b then v else 0L)
.Sum()
[<Benchmark>]
member x.StructPushStreamTest () =
PushStream.fromRange 0 x.Count
|> PushStream.map int64
|> PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))
|> PushStream.map (fun struct (b, v) -> struct (b, v + 1L))
|> PushStream.map (fun struct (b, v) -> if b then v else 0L)
|> PushStream.sum
let now =
let sw = Stopwatch ()
sw.Start ()
fun () -> sw.ElapsedMilliseconds
let time o a =
let inline cc n = GC.CollectionCount n
let v = a ()
GC.Collect (2, GCCollectionMode.Forced)
GC.WaitForFullGCComplete () |> ignore
let bcc0, bcc1, bcc2 = cc 0, cc 1, cc 2
let before = now ()
for _ = 1 to o do
a () |> ignore
let after = now ()
let acc0, acc1, acc2 = cc 0, cc 1, cc 2
v, (after - before), (acc0 - bcc0, acc1 - bcc1, acc2 - bcc2)
let run argv =
let b = BenchmarkSwitcher [|typeof<Benchmarks>|]
let summary = b.Run argv
printfn "%A" summary
// BenchmarkDotNet is good, but the runs take too long for experimentation,
// so I rely on quickRun
let quickRun () =
let inline testCase n a = n, fun c -> string (a c)
let benchmarks = Benchmarks ()
let testCases =
[|
testCase "simple, imperative" benchmarks.SimpleImperativeTest
testCase "simple, linq" benchmarks.SimpleLinqTest
testCase "simple, pushstream" benchmarks.SimplePushStreamTest
// testCase "struct, linq" benchmarks.StructLinqStreamTest
// testCase "struct, pushstream" benchmarks.StructPushStreamTest
|]
let total = 100000000
let inners = [|1000000; 10000; 100|]
for inner in inners do
let outer = total / inner
benchmarks.Count <- inner
printfn "Performance test, total: %d, outer: %d, inner: %d" total outer inner
for n, a in testCases do
printfn " Running '%s'..." n
let v, r, cc = time outer a
printfn " result is %A, it took %d ms to produce with (%A) CC" v r cc
[<EntryPoint>]
let main argv =
// TailCall.Tests.quickRun ()
TailCall.Tests.run argv
0
@mrange thanks for the feedback.
We have been looking at tail calls recently and have a big change in the works that has not yet landed. It would be good to try these examples with #341 to see if it addresses your concerns.
cc @jakobbotsch @erozenfeld
Thanks, it sounds great that you are working on tail calls. The PR mentions that there is no need to change x86 tail calls because they are already fast. From my experience x86 tail calls might not be as bad as x64, but they are IMHO not fast. I find that x86 tail calls make the simple push stream pipeline take 3.6 sec in my quick benchmark vs 0.6 sec for x86 without tail calls. With that overhead I would still try to avoid tail calls in the push stream example.
It has been a while since I looked at x86 but my impression when I did was that there was not much that could be done to make the general case faster. This is because the calling convention in x86 means that we have to adjust stack pointers and move arguments and the return pointer. There is no easy workaround to make this faster; it is slower simply because it has more stuff to do.
The state of tailcalls is that they are not always an optimization, so compilers will simply not do the transformation when that is the case. But F# is very aggressive at emitting tail. because true tailcalls are sometimes _required_ for correct functional programs. Omitting the prefix simply makes the code incorrect, so that is not an option for the compiler. In your particular case the program with the match might be faster, but from the perspective of the F# compiler it is an incorrect transformation. And from the perspective of the runtime, the F# compiler has told it that it _needs_ to do a tailcall, so the runtime has to obey this, and unfortunately it needs to jump through hoops to do so in some cases; there simply is more work to do.
The work in #341 will make x64 quite a bit faster and add support for tailcalls for other platforms, but it will still be slower than not having tailcalls in the general case. However it will make programs relying on tail. correct.
So to sum up I do not think it is possible to make tail. calls always faster than normal calls. The best you can do is expose some way in F# to let the programmer tell the compiler that a tailcall is not required for a call site. But otherwise you are choosing between fast programs (without tail.) and correct programs (with tail.), and overhead in some cases is to be expected.
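To make the "required for correct functional programs" point concrete, here is a minimal sketch (isEven/isOdd are my own illustrative names) of the case that cannot be fixed by an accumulator or a per-function loop:

```fsharp
// Mutually recursive functions whose call depth grows with the input.
// A compiler cannot rewrite this into a simple loop inside each
// function, so without real tail calls a large input overflows the
// stack; with tail. it runs in constant stack space.
let rec isEven (n : int) : bool =
    if n = 0 then true else isOdd (n - 1)   // tail call into isOdd
and isOdd (n : int) : bool =
    if n = 0 then false else isEven (n - 1) // tail call into isEven
```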
I understand that from the jitter's perspective it has to implement tail. correctly, but as for correctness, isn't the problem to prevent long tail call chains from running out of stack space? Or are there other differences I am not aware of?
I guess I should register an issue in the F# compiler to allow us to suppress .tail attributes reliably.
It is a subtle area, with much misunderstanding and confusion.
I guess, I think, there should be two tail call operators:
1 a hint to JIT, hey, please try a little, you probably can figure it out, and I don't mind the debugging impact
2 a demand to JIT, hey, please try super duper hard, even if the result is slow.
Presently there is only #2 which I believe F# outputs aggressively, and C# outputs never.
The JIT does usually figure things out, and #1 cannot be trusted, like for passing addresses of locals, so #1 maybe is moot. Maybe F# should be less aggressive with #2 and maybe C# should offer #2 as an option, but not clear.
Maybe the effort is better spent on the compiler frontends noticing near-tailcall opportunities and advising the developer to change his code a little. I realize these compilers are not that sophisticated.
Maybe the JIT could issue such warnings, but it feels like the wrong time.
Even C++ compilers don't do this, but maybe they should.
Also tailcall serves two purposes, from the two ideas above.
Maybe the JIT could issue such warnings, but it feels like the wrong time.
Right, it is not easy for the jit to give feedback to the developer. There are tools like the Inlining Analyzer that show some promise here. I am not sure how hard it would be to pull something like this off in F#.
The jit relays tail call outcomes back to the runtime, and these end up as ETL records if you enable the jit tracing keyword (say, by checking the normally unchecked "Jit Inlining" checkbox in perfview). There isn't a nice display for this that I know of; perhaps in the short run we can get something added in perfview or a tail call diagnoser added to BenchmarkDotNet.
The issue I see for a LINQ-like library that relies on tail calls is that users of the library will sometimes get really good performance and sometimes really bad performance. They might not be the kind of users that check ETL records for what went wrong, and even if they do, they can't really do much to eliminate the tail calls, as those are part of the library.
This means that as a LINQ-library developer I need to protect my users from the bad outliers, and I have to remove the tail calls somehow.
However, as a library developer I find the ETL records interesting; thanks for sharing.
Perhaps it's too much to ask, but perhaps you could enlighten me a bit on the complexities of making a tail call so I learn to appreciate it :)
@mrange,
The basic reason some things go slow is that
there are "easy" tailcalls and "difficult" ones.
Roughly speaking, an "easy" one has a key characteristic:
int f1(int a)
{
return f2(a); // easy
}
int f1(int a)
{
return f2(a,1,2,4,5,6,7); // difficult,
// depending on ABI details (how many parameters
// can be passed in registers).
}
Before working on this area, I would have deemed "difficult"
to be "impossible", but it is not. The desktop CLR has been doing it for a while, when F# emits .tail
.
There are also criteria, like passing the address of a local:
int f1(int a)
{
int b;
return f2(a, ref b); // difficult or impossible
}
And then, depending on ABI details, passing value types by value:
int f1(int a)
{
struct B b;
return f2(b); // depends on ABI
}
x86 and ARM pass structs by value "inline"; ARM64 and AMD64 pass the address
of a local. So for x86/ARM this is easy if b is smaller than or the same size as a (generalized
to the stack space consumed by f1's parameters).
As to how the difficult/slow ones work, see here, it is complicated:
https://github.com/dotnet/runtime/blob/master/docs/design/features/tailcalls-with-helpers.md
As to how the fast/easy ones work, it is really easy.
The function tears down its frame essentially, placing parameters
before or after the teardown, and jmps instead of calls to the target.
There are other cases, like cannot tailcall within try/finally or try/catch.
The calling function must remain on the stack.
There is essentially no downside to the easy/fast cases.
Some people complain about lost frames in call stacks in debugger.
I have seen code that is broken by that, because it does runtime stack walks, but it is bad and rare code imho.
See:
https://github.com/mono/mono/pull/16928
Code is using the name of its caller to form logging strings.
There is a public API that depends on walking to its caller in a static class constructor.
See: https://github.com/mono/mono/pull/9620/files#diff-9ce66b9dcdea1fbc95cc16fa3bca1954R53
and https://github.com/mono/mono/pull/9620/files#diff-e7526071878d708680e4a8c5fa5a46edR6232
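A minimal sketch of the fragile pattern those PRs describe (callerName is a hypothetical helper, not code from the linked PRs): reading the caller's name via a runtime stack walk. If the caller reached us through a tail call, its frame is gone and the walk reports the caller's caller instead.

```fsharp
open System.Diagnostics

// Skip our own frame (skipFrames = 1), then read what the runtime
// thinks is the caller. Tail calls make this unreliable.
let callerName () =
    match StackTrace(1, false).GetFrame(0) with
    | null -> "<unknown>"
    | f -> f.GetMethod().Name
```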
Ah, there already is a tail call diagnoser for BenchmarkDotNet, though I can't figure out how to get it to compile on this example.
cc @adamsitnik
In the full framework (not Core), the JIT could detect tail recursion in the IL and generate iteration instead of recursion even without the .tail
opcode existing in the IL, but only for x64 Release builds: https://github.com/timba/tailrecursiondotnet
Is this still the same behaviour for the Core JIT? And why only x64 Release?
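For reference, the shape of code that qualifies is a direct self-recursive tail call, which can be rewritten as a jump back to the top of the method with updated arguments (i.e. a loop); a hypothetical example (gcd is my own illustration, not from the linked repo):

```fsharp
// The recursive call is the last operation, so "call gcd" can become
// "jump to the start of gcd" with a and b updated, whether the rewrite
// is done by the F# compiler at the IL level or by the JIT.
let rec gcd (a : int64) (b : int64) : int64 =
    if b = 0L then a
    else gcd b (a % b)   // tail-recursive: candidate for call-to-loop
```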
And why only Release?
It is an optimization. The JIT does 99% of optimizations in Release only.
It makes debugging experience worse. The JIT avoids optimizations that make debugging experience worse in Debug.
And why only x64?
For compatibility reasons; the 32-bit JIT for .NET Framework is a very different codebase from the 64-bit JIT.
I can't figure out how to get it to compile on this example
I got it running and shared it here: https://github.com/adamsitnik/TryNewDisassembler/tree/fsharpTailCalls
git clone https://github.com/adamsitnik/TryNewDisassembler
cd TryNewDisassembler
git checkout fsharpTailCalls
dotnet run -c Release -f netcoreapp3.0 --filter '*' --runtimes net461 netcoreapp3.0
Just FYI: I tried the GitHub project above and it reproduces the performance numbers I have seen. I learnt a bit about BenchmarkDotNet as well from the code :)
One interesting thing to note, unrelated to the issue, is that the push stream is 20% slower on .NET Core than on .NET Framework.
There is essentially no downside to the easy/fast cases.
Some people complain about lost frames in call stacks in debugger.
I am personally totally OK with losing frames; that's kind of the point IMHO :).
Anyway, I am out of my league here, but should/can the jitter be more aggressive about inlining tail calls, given the cost of the complex ones?
@adamsitnik Thanks -- did not realize it required an extra package.
When I run this the results look odd... more like it is looking at inlining events than tail call events.
// * Diagnostic Output - TailCallDiagnoser *
--------------------
--------------------
Benchmarks.SimpleImperativeTest: DefaultJob [Count=100]
--------------------
--------------------
Benchmarks.SimpleLinqTest: DefaultJob [Count=100]
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.Linq.Enumerable.Select - generic class System.Collections.Generic.IEnumerable`1<!!1> (class System.Collections.Generic.IEnumerable`1<!!0>,class System.Func`2<!!0,!!1>)
Fail Reason: too many il bytes
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.Linq.Enumerable.Select - generic class System.Collections.Generic.IEnumerable`1<!!1> (class System.Collections.Generic.IEnumerable`1<!!0>,class System.Func`2<!!0,!!1>)
Fail Reason: too many il bytes
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.Linq.Enumerable.Sum - int64 (class System.Collections.Generic.IEnumerable`1<int64>)
Fail Reason: has exception handling
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.Linq.Enumerable.Range - class System.Collections.Generic.IEnumerable`1<int32> (int32,int32)
Fail Reason: unprofitable inline
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.MulticastDelegate.ThrowNullThisInDelegateToInstance - instance void ()
Fail Reason: does not return
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.MulticastDelegate.ThrowNullThisInDelegateToInstance - instance void ()
Fail Reason: does not return
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.Linq.Enumerable.Where - generic class System.Collections.Generic.IEnumerable`1<!!0> (class System.Collections.Generic.IEnumerable`1<!!0>,class System.Func`2<!!0,bool>)
Fail Reason: unprofitable inline
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64 ()
Callee: System.MulticastDelegate.ThrowNullThisInDelegateToInstance - instance void ()
Fail Reason: does not return
--------------------
When I run this the results look odd... more like it is looking at inlining events than tail call events.
It has surprised me as well, to the point that I double-checked the source code to make sure that we sign up for the right events. We do:
I am open to any suggestions that could improve the output.
When I run this the results look odd... more like it is looking at inlining events than tail call events.
I think I ran into this problem before, and I believe it was an EventPipe issue. The events were right back then if you used the in-process events (as PMI does).
I thought the EventPipe issue was fixed by @jorive months ago, however, so this may be something else. Unfortunately I cannot find the original issue...
should/can the jitter be more aggressive on inlining tail calls
I think we can be more aggressive, but we also need to be careful; see notes in #12304.
I am open to any suggestions that could improve the output.
Aside from fixing whatever bug is reporting wrong-looking data, it would be good to surface the tail call type and tail prefix for successful tail calls; the "interesting" case is the one where we have a successful call with an explicit tail
prefix and the call needed a helper.
Aside from fixing whatever bug is reporting wrong-looking data
It looks like this issue is fixed in a newer version of TraceEvent. I've sent a PR to BenchmarkDotNet to update the dependency: https://github.com/dotnet/BenchmarkDotNet/pull/1376 and also improve its performance: https://github.com/dotnet/BenchmarkDotNet/pull/1375 I've also updated the sample app: https://github.com/adamsitnik/TryNewDisassembler/commit/d94211bce264d888e11762800b0d60e04091ee55
> it would be good to surface the tail call type and tail prefix for successful tail calls
this is already possible; it's configurable via an optional attribute parameter:
[<TailCallDiagnoser(logFailuresOnly = false, filterByNamespace = false)>]
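A rough sketch of how the attribute attaches to a benchmark type (assuming the BenchmarkDotNet.Diagnostics.Windows package is referenced; the diagnoser is ETW-based, so this only works on Windows, and the namespaces below are to the best of my knowledge):

```fsharp
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Diagnostics.Windows.Configs
open BenchmarkDotNet.Running

// logFailuresOnly = false also logs successful tail calls;
// filterByNamespace = false reports events from all namespaces.
[<TailCallDiagnoser(logFailuresOnly = false, filterByNamespace = false)>]
type TailCallBenchmarks () =
    [<Params (10000, 1000, 100)>]
    member val public Count = 100 with get, set

    [<Benchmark>]
    member x.SimpleImperativeTest () =
        let mutable i = x.Count
        let mutable sum = 0L
        while i >= 0 do
            let v = int64 i
            i <- i - 1
            if (v &&& 1L) = 0L then sum <- sum + (v + 1L)
        sum

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<TailCallBenchmarks>() |> ignore
    0
```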
@AndyAyersMS in the meantime you can use the following app:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp2.1;netcoreapp2.2;netcoreapp3.0</TargetFrameworks>
  </PropertyGroup>
  <ItemGroup>
    <Compile Include="Program.fs" />
  </ItemGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Diagnostics.Tracing.TraceEvent" Version="2.0.49" />
  </ItemGroup>
</Project>
module TailCall =
    open System.Linq

    // let inline push r v = match r v with () -> ()
    // let inline push r v = r v

    // Minimalistic PushStream
    // A PushStream accepts a receiver function that will be called
    // with each value in the PushStream
    type 'T PushStream = ('T -> unit) -> unit

    module PushStream =
        let inline zero () = LanguagePrimitives.GenericZero
        let inline push r v = r v
        // Creates a PushStream with all integers from b to e (inclusive)
        let inline fromRange b e r = for i = b to e do push r i
        // Maps all values in ps using mapping function f
        let inline map f ps r = ps (fun v -> push r (f v))
        // Filters all values in ps using filter function f
        let inline filter f ps r = ps (fun v -> if f v then push r v)
        // Sums all values in ps
        let inline sum ps = let mutable s = zero () in ps (fun v -> s <- s + v); s

    module Tests =
        open System

        type Benchmarks () =
            member val public Count = 100 with get, set

            member x.SimpleImperativeTest () =
                let mutable i = x.Count
                let mutable sum = 0L
                while i >= 0 do
                    let v = int64 i
                    i <- i - 1
                    if (v &&& 1L) = 0L then
                        sum <- sum + (v + 1L)
                sum

            member x.SimpleLinqTest () =
                Enumerable.Range(0, x.Count)
                    .Select(int64)
                    .Where(fun v -> (v &&& 1L) = 0L)
                    .Select((+) 1L)
                    .Sum()

            member x.SimplePushStreamTest () =
                PushStream.fromRange 0 x.Count
                |> PushStream.map int64
                |> PushStream.filter (fun v -> (v &&& 1L) = 0L)
                |> PushStream.map ((+) 1L)
                |> PushStream.sum

            member x.StructLinqStreamTest () =
                Enumerable.Range(0, x.Count)
                    .Select(int64)
                    .Select(fun v -> struct ((v &&& 1L) = 0L, v))
                    .Select(fun struct (b, v) -> struct (b, v + 1L))
                    .Select(fun struct (b, v) -> if b then v else 0L)
                    .Sum()

            member x.StructPushStreamTest () =
                PushStream.fromRange 0 x.Count
                |> PushStream.map int64
                |> PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))
                |> PushStream.map (fun struct (b, v) -> struct (b, v + 1L))
                |> PushStream.map (fun struct (b, v) -> if b then v else 0L)
                |> PushStream.sum

        type ReflectionHelper () =
            member x.UseReflectionToInvokeMethods (methodName : string) =
                let typeInfo = typeof<ReflectionHelper>
                let benchmarkType = typeInfo.Assembly.GetType(typeInfo.FullName.Replace("ReflectionHelper", "Benchmarks"))
                // "method" is a reserved identifier in F#, so use methodInfo
                let methodInfo = benchmarkType.GetMethod(methodName)
                let instance = Activator.CreateInstance(benchmarkType)
                methodInfo.Invoke(instance, null)
open TailCall.Tests
open System
open System.Diagnostics
open System.Threading
open System.Threading.Tasks
open Microsoft.Diagnostics.Tracing.Session
open Microsoft.Diagnostics.Tracing.Parsers
open Microsoft.Diagnostics.Tracing.Parsers.Clr
open Microsoft.Diagnostics.Tracing

[<EntryPoint>]
let main argv =
    let currentProcessId = Process.GetCurrentProcess().Id
    use session = new TraceEventSession("tailCallsSession")
    session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Verbose, uint64 ClrTraceEventParser.Keywords.JitTracing, options = null) |> ignore
    let onMethodTailCallSucceeded =
        System.Action<MethodJitTailCallSucceededTraceData>(fun e ->
            if e.ProcessID = currentProcessId then
                printfn "OK, callee: %s compiled: %s TailPrefix: %b TailCallType: %A" e.CalleeName e.MethodBeingCompiledName e.TailPrefix e.TailCallType)
    let onMethodTailCallFailed =
        System.Action<MethodJitTailCallFailedTraceData>(fun e ->
            if e.ProcessID = currentProcessId then
                printfn "NOT, callee: %s compiled: %s FailReason: %A" e.CalleeName e.MethodBeingCompiledName e.FailReason)
    let onMethodTailCallFailedAnsi =
        System.Action<MethodJitTailCallFailedAnsiTraceData>(fun e ->
            if e.ProcessID = currentProcessId then
                printfn "NOTa, callee: %s compiled: %s FailReason: %A" e.CalleeName e.MethodBeingCompiledName e.FailReason)
    session.Source.Clr.add_MethodTailCallSucceeded onMethodTailCallSucceeded
    session.Source.Clr.add_MethodTailCallFailed onMethodTailCallFailed
    session.Source.Clr.add_MethodTailCallFailedAnsi onMethodTailCallFailedAnsi
    // we use the reflection helper to make sure that the "Benchmarks" type is not JITed before the tracing has started
    let sut = ReflectionHelper
    let result1 = sut().UseReflectionToInvokeMethods("SimpleImperativeTest")
    let result2 = sut().UseReflectionToInvokeMethods("SimpleLinqTest")
    let result3 = sut().UseReflectionToInvokeMethods("SimplePushStreamTest")
    let result4 = sut().UseReflectionToInvokeMethods("StructLinqStreamTest")
    let result5 = sut().UseReflectionToInvokeMethods("StructPushStreamTest")
    printfn "%A %A %A %A %A" result1 result2 result3 result4 result5
    let action = System.Action(fun _ -> session.Source.Process() |> ignore)
    let task = Task.Factory.StartNew(action, TaskCreationOptions.LongRunning)
    // give it some time & wait for delayed events
    let timeToSleep = TimeSpan.FromSeconds(10.0)
    Thread.Sleep(timeToSleep)
    0 // return an integer exit code
Thanks @adamsitnik -- works well. Had to add in the namespace part to make sense of things, though:
caller: [email protected]
callee: Microsoft.FSharp.Core.FSharpFunc`2[System.ValueTuple`2[System.Boolean,System.Int64],System.__Canon].Invoke
TailPrefix: true
TailCallType: HelperAssistedTailCall
At any rate, we should try running the code here with the tail call changes from #341. Will put it on my todo list but may not get to it for a while.
I ran the benchmark with the changes from #341. That change speeds up StructPushStreamTest 7-8 times. Here are the numbers (`before` is current, `after` is with #341):
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.1.20155.7
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.12005, CoreFX 5.0.20.12005), X64 RyuJIT DEBUG
Job-CBUONO : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
Job-KATQIF : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
| Method               | Toolchain | Count |           Mean |        Error |       StdDev |
|--------------------- |---------- |------ |---------------:|-------------:|-------------:|
| SimpleImperativeTest | after | 100 | 106.3 ns | 1.04 ns | 0.98 ns |
| SimpleImperativeTest | before | 100 | 108.6 ns | 1.21 ns | 1.13 ns |
| | | | | | |
| SimpleLinqTest | after | 100 | 1,207.8 ns | 16.77 ns | 15.68 ns |
| SimpleLinqTest | before | 100 | 1,279.2 ns | 21.44 ns | 19.01 ns |
| | | | | | |
| SimplePushStreamTest | after | 100 | 360.7 ns | 3.60 ns | 3.37 ns |
| SimplePushStreamTest | before | 100 | 382.6 ns | 7.44 ns | 6.96 ns |
| | | | | | |
| StructLinqStreamTest | after | 100 | 2,574.6 ns | 49.93 ns | 46.71 ns |
| StructLinqStreamTest | before | 100 | 2,689.9 ns | 48.13 ns | 42.66 ns |
| | | | | | |
| StructPushStreamTest | after | 100 | 7,054.0 ns | 139.15 ns | 175.98 ns |
| StructPushStreamTest | before | 100 | 53,018.4 ns | 1,107.14 ns | 2,838.03 ns |
| | | | | | |
| SimpleImperativeTest | after | 1000 | 575.7 ns | 11.92 ns | 16.72 ns |
| SimpleImperativeTest | before | 1000 | 962.8 ns | 19.18 ns | 44.82 ns |
| | | | | | |
| SimpleLinqTest | after | 1000 | 11,382.0 ns | 244.17 ns | 325.96 ns |
| SimpleLinqTest | before | 1000 | 11,818.7 ns | 231.53 ns | 216.57 ns |
| | | | | | |
| SimplePushStreamTest | after | 1000 | 3,215.6 ns | 63.85 ns | 87.40 ns |
| SimplePushStreamTest | before | 1000 | 3,343.1 ns | 65.32 ns | 80.21 ns |
| | | | | | |
| StructLinqStreamTest | after | 1000 | 25,345.5 ns | 380.99 ns | 356.38 ns |
| StructLinqStreamTest | before | 1000 | 25,673.0 ns | 502.46 ns | 670.78 ns |
| | | | | | |
| StructPushStreamTest | after | 1000 | 72,993.9 ns | 1,481.72 ns | 4,251.33 ns |
| StructPushStreamTest | before | 1000 | 517,403.6 ns | 10,017.01 ns | 11,133.88 ns |
| | | | | | |
| SimpleImperativeTest | after | 10000 | 5,648.8 ns | 94.17 ns | 88.09 ns |
| SimpleImperativeTest | before | 10000 | 10,365.2 ns | 159.36 ns | 149.06 ns |
| | | | | | |
| SimpleLinqTest | after | 10000 | 108,323.9 ns | 1,096.84 ns | 915.91 ns |
| SimpleLinqTest | before | 10000 | 109,242.2 ns | 2,153.00 ns | 3,087.77 ns |
| | | | | | |
| SimplePushStreamTest | after | 10000 | 32,878.2 ns | 632.49 ns | 621.19 ns |
| SimplePushStreamTest | before | 10000 | 33,403.5 ns | 660.98 ns | 926.60 ns |
| | | | | | |
| StructLinqStreamTest | after | 10000 | 234,969.1 ns | 1,626.21 ns | 1,521.15 ns |
| StructLinqStreamTest | before | 10000 | 242,740.5 ns | 2,549.54 ns | 2,260.10 ns |
| | | | | | |
| StructPushStreamTest | after | 10000 | 610,634.5 ns | 12,163.64 ns | 29,837.62 ns |
| StructPushStreamTest | before | 10000 | 4,946,534.6 ns | 96,406.58 ns | 94,684.16 ns |
I'm working with @jakobbotsch on getting #341 merged soon.
Based on my reading of this issue, #341 will improve the situation a great deal, especially for F#. I don't see any other reason to keep this issue open, so I'm closing it.