mod2pi is a useful function, being more accurate than mod(x, 2pi), but it's also much slower than mod for Float64 input (for BigFloat it's only very slightly slower). According to @profile, the bottleneck of mod2pi(::Float64) is ieee754_rem_pio2, in particular the allocation of the y array (line 740 of math.jl):
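function ieee754_rem_pio2(x::Float64)
    # this is just wrapping up
    # https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c
    y = [0.0,0.0]  # <- the allocation @profile points at
    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
    return (n,y)
end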
By looking at the function it wraps, https://github.com/JuliaLang/openspecfun/blob/39699a1c1824bf88410cabb8a7438af91ea98f4c/rem_pio2/e_rem_pio2.c, it seems to me that the initial value of the y array isn't relevant, so it should be safe to leave it uninitialized with Vector{Float64}(2). Here is a performance comparison between the current implementation and the one I propose:
julia> using BenchmarkTools
julia> M = collect(-10pi:1e-4:10pi);
julia> function ieee754_rem_pio2_faster(x::Float64)
y = Vector{Float64}(2)
n = ccall((:__ieee754_rem_pio2, Base.Math.openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
return (n,y)
end
ieee754_rem_pio2_faster (generic function with 1 method)
julia> @benchmark Base.Math.ieee754_rem_pio2.($M)
BenchmarkTools.Trial:
memory estimate: 81.49 MiB
allocs estimate: 1256640
--------------
minimum time: 37.773 ms (0.00% GC)
median time: 231.986 ms (86.90% GC)
mean time: 186.099 ms (81.66% GC)
maximum time: 282.167 ms (87.26% GC)
--------------
samples: 27
evals/sample: 1
julia> @benchmark ieee754_rem_pio2_faster.($M)
BenchmarkTools.Trial:
memory estimate: 81.49 MiB
allocs estimate: 1256640
--------------
minimum time: 34.056 ms (0.00% GC)
median time: 225.573 ms (88.34% GC)
mean time: 180.034 ms (83.64% GC)
maximum time: 260.522 ms (89.77% GC)
--------------
samples: 29
evals/sample: 1
julia> Base.Math.ieee754_rem_pio2.(M) == ieee754_rem_pio2_faster.(M)
true
There should be a speed-up of ~10%; not much, but better than nothing. In addition, the last test shows that the results are exactly equal, not just approximately.
I already have a patch ready to be submitted, but I was wondering whether something smarter can be done to avoid the allocation, like for sincos in PR #21589. However, writing LLVM code goes way beyond my capabilities.
As a last remark, given how the return value of ieee754_rem_pio2 is used in math.jl, I don't think that y needs to be an array at all (y[1] and y[2] are always used as two distinct elements, not as a single vector). Would it be possible to patch openspecfun to redefine ieee754_rem_pio2 so that it doesn't use an array? I don't know if that interface is used elsewhere, though.
You can at least do
diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..ab21d70acf 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,9 @@ function ieee754_rem_pio2(x::Float64)
# this is just wrapping up
# https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c
- y = [0.0,0.0]
- n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
- return (n,y)
+ y = Ref{NTuple{2,Float64}}()
+ n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Void}), x, y)
+ return (n,y[])
end
# multiples of pi/2, as double-double (ie with "tail")
Edit: inlining this function also seems to have a big effect.
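That is, something along these lines (a sketch combining @inline with the Ref version above):

@inline function ieee754_rem_pio2(x::Float64)
    y = Ref{NTuple{2,Float64}}()  # scratch space for the two result words
    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Void}), x, y)
    return (n, y[])
end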
And if you want to do the llvmcall trick it's
diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..78afe3e891 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,16 @@ function ieee754_rem_pio2(x::Float64)
# this is just wrapping up
# https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c
- y = [0.0,0.0]
- n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
- return (n,y)
+ Base.llvmcall("""
+ %f = bitcast i8* %1 to i32 (double, [2 x double]*)*
+ %py = alloca [2 x double]
+ %n = call i32 %f(double %0, [2 x double]* %py)
+ %y = load [2 x double], [2 x double]* %py
+ %res0 = insertvalue {i32, [2 x double]} undef, i32 %n, 0
+ %res1 = insertvalue {i32, [2 x double]} %res0, [2 x double] %y, 1
+ ret {i32, [2 x double]} %res1
+ """, Tuple{Cint,NTuple{2,Float64}}, Tuple{Float64,Ptr{Void}}, x,
+ cglobal((:__ieee754_rem_pio2, openspecfun)))
end
# multiples of pi/2, as double-double (ie with "tail")
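(The reason the llvmcall version can win: the alloca in the IR is a stack slot rather than a GC-tracked heap allocation, and since the llvmcall body is emitted directly into the caller, LLVM is free to optimize it away entirely.)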
Wow, with the llvmcall trick mod2pi would be even faster than mod(x, 2pi)! I think you should open a PR, you did all the work ;-)
Would it be possible to make the optimization without calling into openspecfun? The Base dependency on openspecfun is slated to be removed in the near future, but mod2pi would stay in Base. So if it's possible to speed this up with a pure-Julia implementation, that will make excising openspecfun that much easier.
@ararslan I looked into rewriting ieee754_rem_pio2 in Julia (that would allow not using a y array at all), but I'm not able to port it completely: I'm missing a few pieces, like GET_HIGH_WORD(hx,x) or INSERT_WORDS(z, ix - ((int32_t)(e0<<20)), low).
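For what it's worth, those fdlibm macros are just bit manipulations on the binary64 representation, so plausible Julia equivalents (names hypothetical) would be:

# GET_HIGH_WORD(hx, x): extract the upper 32 bits of the Float64 bit pattern
high_word(x::Float64) = (reinterpret(UInt64, x) >> 32) % UInt32

# INSERT_WORDS(z, hi, lo): assemble a Float64 from 32-bit high and low words
insert_words(hi::UInt32, lo::UInt32) = reinterpret(Float64, (UInt64(hi) << 32) | lo)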
I believe @simonbyrne started on an implementation a while back.
It would be great to have a pure Julia implementation, and it has been on my todo list for a long time. Hopefully this can be part of @pkofod's GSoC project.
Doing a range reduction modulo pi usually requires a couple of different strategies: typically Cody-Waite for smallish values, and Payne-Hanek for large ones. I have an (untested) implementation for Payne-Hanek here:
https://gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3
The books by Jean-Michel Muller are a good reference for this stuff.
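To illustrate the Cody-Waite idea for moderate arguments, a minimal sketch (the two constants are the standard fdlibm splitting of pi/2; a real implementation needs more terms, the double-double "tail", and careful handling near multiples of pi/2):

const pio2_1  = 1.57079632673412561417e+00  # first 33 bits of pi/2
const pio2_1t = 6.07710050650619224932e-11  # remainder of pi/2 beyond pio2_1

function cody_waite_sketch(x::Float64)
    n = round(x * (2 / pi))   # nearest multiple of pi/2
    r = x - n * pio2_1        # exact while n*pio2_1 fits in 53 bits
    r -= n * pio2_1t          # peel off the remaining bits of n*pi/2
    return Int(n), r
end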
It's great that someone is going to work on this within GSoC! I'm eager to see if a pure Julia implementation can outperform the C one :-)
Let's see... I'm just gonna hijack this thread.
I had a go over at https://github.com/pkofod/RemPiO2.jl
There are some contribution credit headers missing if this is to be "copied" over to Base, but the functionality should be there. Payne-Hanek returns a t::TwicePrecision{Float64} such that t.hi + t.lo == sum(y), where y is the [hi, lo] vector discussed above. The two hi's and two lo's aren't identical between the two implementations, but if I feed them to a sin or cos kernel based on openlibm, I get the same results. Comments are more than welcome.
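For instance, a consistency check along these lines (rem_pio2 here stands in for the package's entry point; the actual name may differ):

using RemPiO2
x = 2.0^30
n_new, t = rem_pio2(x)                    # hypothetical API: returns (n, TwicePrecision)
n_old, y = Base.Math.ieee754_rem_pio2(x)  # y is the [hi, lo] vector
t.hi + t.lo == sum(y)                     # the claimed equivalence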
For timing differences, have a look at https://gist.github.com/pkofod/81435213366a19fc5a0d6ce4f1c64c4d . I've since updated runtests.jl for larger numbers, and they seem to give the same ratio. For the medium case (Cody-Waite) the hi's and lo's are exactly the same, but for the large case it is as explained above.
Now, as you may notice, the timings seem quite different, so I'm sort of wondering if I "cheated" somehow, or failed to take something into account. More than happy to receive comments on this part.
Is it a problem to pack the hi and lo values in a TwicePrecision like I do, when it's such a light function call? I didn't think it would matter, but removing it and just returning (n, yhi, ylo) seems to get the new/old time ratio down to 0.15 (!). This seems like a very drastic difference: drastic compared to the openlibm version, but also almost a factor of 2 compared to the TwicePrecision version (the one that is online right now).
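For concreteness, the two return conventions being compared are roughly:

# packed: wrap the two components in a TwicePrecision on return
return_packed(n, yhi, ylo) = (n, Base.TwicePrecision(yhi, ylo))

# flat: return the components as a plain tuple
return_flat(n, yhi, ylo) = (n, yhi, ylo)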
((edit: I believe there might still be a bug/issue for the medium/large values in Payne-Hanek...))
Edit: Still working out the kinks in this reduction scheme. There are some small idiosyncrasies that give rounding problems for something like 10/10000 values (of course this has to be ironed out) when using the output hi and lo to calculate sin and cos, for example. However, it does seem as if there are some speedups to be found here. For values larger than 2.0^20, sin is something like 2-3 times faster. Let's see if those speedups survive the bughunting :)
Alright, bugs were not really bugs but floating point "fun". A PR should arrive later today (my time).
Timing these things can be tricky, but after speaking to @simonbyrne last night, I'm fairly sure that the following actually shows the difference between the version in Base and the version in the upcoming PR for small values. Still running the benchmarks on the larger cases. While there are only a finite number of Float64s, there sure are many! The picture is only based on 840 values of x.
Spikes are near multiples of pi/2 where additional operations are required for sufficient precision.
Note: the new PR does NOT allocate a new vector each time, so the problem of this issue should be fixed by this.
[figure: new/old timing ratio across values of x]
Looks great! Impressive work.