The log function can often be the dominant cost in certain algorithms, particularly those related to sampling and fitting statistical models, e.g. see this discussion.
I've implemented the Tang (1990) table-based algorithm here:
https://github.com/simonbyrne/libm.jl/blob/master/src/log_tang.jl
It seems to be about 10% faster than the openlibm function, but there is probably still room for improvement. Does anyone have any suggestions?
For reference:
Once we have native fused multiply-add operations (#8112), we can probably speed it up further (on processors that support them).
If you can give me a simple way of doing benchmarks, I can compare openlibm and system libm on OS X 10.6, to see whether the performance improvement was already present on that version. But it's not my machine, so I can't install development tools.
I also found the Apple log to be faster in tests I had done a while back when I was putting openlibm together.
Have you compared with Intel's log in MKL? They may also have fast vectorized versions. Perhaps @ArchRobison knows something about this.
@nalimilan I have a very, very rough test here:
https://github.com/simonbyrne/libm.jl/blob/master/test/perf/log.jl
But we probably should use something more rigorous (e.g. testing different ranges).
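For example, something along these lines (a rough sketch; BenchmarkTools and the particular ranges are just assumptions for illustration) would at least exercise a few different input ranges:
using BenchmarkTools
# time a full sweep of log over inputs drawn from several ranges, since the
# branch/table behaviour of an implementation can differ between them
for (lo, hi) in [(0.5, 2.0), (1e-10, 1e-5), (1e5, 1e10)]
    x = lo .+ (hi - lo) .* rand(10^4)
    t = @belapsed sum(log, $x)          # seconds per sweep of 10^4 values
    println("[", lo, ", ", hi, "): ", round(t * 1e6, digits = 2), " μs")
end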
@ViralBShah I don't have access to MKL, so have not compared it.
@simonbyrne It is installed on julia.mit.edu, if you have an account there. If you would like it, I can create one for you - just drop me an email. Pretty beefy machine for julia development - 80 cores / 1TB RAM.
Here's the timings using the above test script on julia.mit.edu:
Openlibm:
elapsed time: 0.366460477 seconds (13824 bytes allocated)
System:
elapsed time: 0.752183874 seconds (96 bytes allocated)
Intel math:
elapsed time: 0.597063228 seconds (96 bytes allocated)
Julia fdlibm:
elapsed time: 0.368376388 seconds (96 bytes allocated)
Julia tang:
elapsed time: 0.284002105 seconds (96 bytes allocated)
(here Intel math is the ICC libimf.so, not the MKL routines: I'm still figuring those out).
So that's a 25% boost over openlibm (much more than on my machine).
@ViralBShah I haven't had any luck figuring out how to call the MKL vector math routines from within julia: any tips?
Sorry, none of those are hooked up. It has been a while since I looked at the Intel manual. IMF has the libm stuff that you can get to with USE_INTEL_LIBM.
Have a look here: https://github.com/simonster/VML.jl
Perhaps worth also looking at Yeppp.
Thanks, I'll try out the package. I had a look at Yeppp's code: their log function is pretty crazy. As they outline in this talk, they intentionally avoid table lookups and divisions, which means they need a 20-term polynomial expansion. This is pretty much the opposite of Tang's approach, which uses the table as a way to get a shorter polynomial.
That implementation is crazy! I wonder if we can leverage Yeppp or integrate parts of it into openlibm. The build process is completely crazy too.
@simonbyrne On Mac OS X 10.6 with Julia 0.2.1, I get these results (with an old 2GHz Core 2 Duo). Unfortunately I could not test your implementation as cloning a package fails.
julia> @testsum "System:" syslog X
@testsum "Openlibm:" log X
Warning: Possible conflict in library symbol log
System:
elapsed time: 0.463031635 seconds (64 bytes allocated)
Openlibm:
elapsed time: 0.455063121 seconds (64 bytes allocated)
Is the warning expected or does it indicate that the same library was used for both calls? If not, it would seem to indicate that Apple's libm improved since 10.6, or that it got better than openlibm on recent CPUs.
@nalimilan Thanks for looking at it. I'm not sure what the warning is, but it could be getting mixed up between Base.log and libm.log (my package: I really should rename it...)?
Perhaps the reason they refuse to release the source is that they've licensed a proprietary library?
It turns out that julia.mit.edu doesn't support AVX. @simonster Any ideas on which library I should link to?
@nalimilan That warning is talking about a symbol being imported from a shared library; e.g. it's warning that a log function is already defined in the Julia process, and you're loading a library that redefines that function. I don't remember what we changed in the last year, but that warning _might_ mean that the same log is being called in both cases.
Actually, the error only appears after calling log once, which I think loads openlibm. But the timings are the same before and after that, so the benchmark is probably fine. (Libm.jl isn't involved since I could not even install it on 0.2.1.)
It looks as though Apple have changed their algorithm, as they get more accurate results than a standard implementation of Tang's algorithm would provide. On 10 million points:
| Algorithm | Max relative error (units in last place) |
| --- | --- |
| Openlibm | 0.863 |
| OS X 10.9 libm | 0.508 |
| Tang's algorithm | 0.555 (theoretical bound of 0.566) |
I have an application that does a lot of raw number crunching. Redefining the basic special functions log, exp, ... to use the system libm as per @simonbyrne's example immediately cut down the runtime by ~20%. So I would be very much interested in (1) either a faster implementation of these basic functions, or (2) a way to tell Julia to use the system libm automatically.
For (2), compile julia with USE_SYSTEM_LIBM=1
.
@nolta Ah, interesting! Thanks. I was using Homebrew, where such options cannot be passed directly. I will try that.
If we implement log in Base, will we eventually abandon openlibm in favor of a pure julia implementation? I'm ok with that.
That is right. We can replace openlibm with a pure julia implementation, which might be easier to maintain and optimize for vectorization, etc. Otherwise we just have to pile on more libraries.
Prompted by a different thread I began to port a SIMD math library to Julia. Currently, exp and log are available as a proof of concept. See https://github.com/eschnett/Vecmathlib.jl .
This looks really cool!
@ntessore if you're computing log on arrays, and you're on OS X, you might want to try out the Accelerate library that comes with the OS. It is blazingly fast compared with the standard log (from what I understand it uses SIMD operations and a bunch of extra threads), though it does appear to sacrifice some accuracy. I have started putting together a package here:
https://github.com/simonbyrne/Accelerate.jl
It lacks documentation, but if you just call Accelerate.log you should get what you want (if not, please file an issue). Once I have wrapped it up I'll put it on METADATA, but I'd be grateful for any feedback.
@simonbyrne Thanks for creating this, I will try and let you know if anything comes up.
Maybe we could put together a general Fast.jl package that offers fast but potentially less accurate math functions on all systems, choosing whatever options are best for a given platform. That way, code is less platform dependent, even though the performance might vary.
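A toy sketch of what that per-platform dispatch could look like (the Fast module name and the choice of fallbacks here are hypothetical, not an actual package):
module Fast
# pick the fastest known-good log for the current platform at load time
@static if Sys.isapple()
    # on macOS, call the system libm (Accelerate's array routines are another option)
    fastlog(x::Float64) = ccall((:log, :libm), Float64, (Float64,), x)
else
    # elsewhere, fall back to Julia's own scalar implementation
    fastlog(x::Float64) = Base.log(x)
end
end # module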
I tried looking at the Apple log function with a debugger, but they seem to have some basic reverse engineering protection in place (interestingly, this post is by one of their floating point gurus). However, I did take a look at a hex dump of the library, and they have a table of pairs of values:
[(log(F),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]
which suggests that they are using a table lookup approach, similar to (but different from) the Tang algorithm. I'll try playing around with it and see how I go.
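For illustration, here is a minimal sketch (my own reconstruction, not Apple's code) of how such a (log(F), 1/F) table can drive the reduction; the index formula and the short polynomial are only for exposition, and a real implementation needs compensated arithmetic (and proper handling of non-finite inputs) to reach sub-ulp accuracy:
const LOGTABLE = [(Float64(log(big(F))), 1/F) for F in 1.0:0x1p-8:2.0-0x1p-8]

# table-based log sketch, valid for positive finite x only
function log_table_sketch(x::Float64)
    k = exponent(x)                         # x = m * 2^k with m in [1, 2)
    m = significand(x)
    i = clamp(round(Int, (m - 1.0) * 256) + 1, 1, length(LOGTABLE))
    logF, invF = LOGTABLE[i]                # F is (nearly) the closest table point to m
    r = m * invF - 1.0                      # small remainder, roughly |r| <= 2^-9
    p = r - r^2/2 + r^3/3                   # truncated series for log(1 + r)
    return k * 0.6931471805599453 + logF + p
end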
I think we should be careful about this - since we may not be able to use the resulting code.
The Yeppp implementation talk discusses avoiding table lookups for SIMD performance. Unfortunately, there is no paper on the details of their implementation - only the talk. The source is available, and is all assembly.
Well, that's basically the extent of my reverse engineering skills, and I don't think numbers are copyrightable, so we should be okay. Their approach intuitively makes sense: they use a slightly finer grained table, but can avoid the division (which messes up pipelining), and look up both points as a pair, which can be done in one step by exploiting the SIMD registers.
I tried translating the Yeppp code to Julia, but didn't see any real performance advantages: granted, I was using Float64 (which is harder to vectorise), and don't have fma support on my machine, so it's tough to tell.
If you still have that code, I would love to try it out.
The Yeppp! approach with table-free implementations is specifically targeted at vector elementary functions, where only throughput matters. For scalar and short-SIMD versions the most important performance characteristic is latency, because even Haswell doesn't have enough entries in the re-order buffer to extract enough parallelism from sequential calls to saturate the execution units. I think that scalar/short-SIMD versions should use a variant of Tang's method.
If you want to port Yeppp! functions to Julia, I'd suggest looking at the code in this file, which should be easier to port.
This was my attempt at a translation:
https://github.com/simonbyrne/libm.jl/blob/master/src/log_yeppp.jl
This is about 10x slower than the built-in log. Just had to change @horner to @evalpoly.
Proper loop unrolling and SIMD is a must to get good performance from Yeppp! algorithms. Without loop unrolling the execution is severely latency-bound, resulting in low ILP.
@Maratyszcza I admit I just translated the code to check out the accuracy, I wasn't really trying to match performance.
I have to concede that I'm stumped as to how Apple implemented their algorithm. They only look up one value of log(F) (as opposed to the pair that Tang's algorithm uses), and the error from this alone should exceed what they get in practice. They must have some other tricks going on there. Clearly they employ some very good numerical programmers.
One value + Newton iteration, or something like that, maybe?
That was my guess, but to do a Newton step on y=log(x), you need to compute an exp(y), so I don't really see how that helps...
We could disassemble their code...
I'd guess they use a variant of Gal's accurate tables: the pivot points log(x) and 1/x are not in a grid; they are chosen in such a way that log(x) is accurate to more than double precision (i.e. several bits after its double-precision value are zeroes).
No, they're certainly evenly spaced, as they're exactly:
[(Float64(log(big(F))),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]
and the errors of some of those values are up to 0.48 ulps (e.g. at 1.765625).
It could also be that the tables are for another function, but they do also have log2 and log10 versions as well.
I also did some timing analysis, and they are a bit quicker on the range [0.74805, 1.49605], which suggests that's the domain of their range reduction.
I was trying to check if the polynomial evaluation in Yeppp benefits from SIMD and can give faster log computations on vectors. I extracted this from @simonbyrne's implementation of Yeppp's log. So far, the simd version is no faster than the regular one.
function loghorner!(b::Vector{Float64}, a::Vector{Float64})
n = length(a)
for i=1:n
@inbounds b[i] = loghorner(a[i])
end
end
function loghorner_simd!(b::Vector{Float64}, a::Vector{Float64})
n = length(a)
@simd for i=1:n
@inbounds b[i] = loghorner(a[i])
end
end
@inline function loghorner(x::Float64)
Base.Math.@horner(x,
-0x1.FFFFFFFFFFFF2p-2,
0x1.5555555555103p-2,
-0x1.00000000013C7p-2,
0x1.9999999A43E4Fp-3,
-0x1.55555554A6A2Bp-3,
0x1.249248DAE4B2Ap-3,
-0x1.FFFFFFBD8606Dp-4,
0x1.C71C90DB06248p-4,
-0x1.9999C5BE751E3p-4,
0x1.745980F3FB889p-4,
-0x1.554D5ACD502ABp-4,
0x1.3B4ED39194B87p-4,
-0x1.25480A82633AFp-4,
0x1.0F23916A44515p-4,
-0x1.EED2E2BB64B2Ep-5,
0x1.EA17E14773369p-5,
-0x1.1654764F478ECp-4,
0x1.0266CD08DB2F2p-4,
-0x1.CC4EC078138E3p-6)
end
Cc: @ArchRobison @aviks
Unrolling is necessary to extract sufficient parallelism in polynomial evaluation. Optimal unroll factor depends on the microarchitecture, e.g. 8 SIMD registers/iteration for Nehalem/Sandy Bridge/Ivy Bridge, 10 SIMD registers for Haswell, 5 AVX registers for Bulldozer/Piledriver/Streamroller. Maybe Julia needs a vector polynomial evaluation function?
Note: LLVM does not detect/unroll latency-bound loops automatically.
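For what it's worth, a hedged sketch of the kind of transformation meant here: a two-way split ("second-order") Horner evaluation of a fixed degree-5 polynomial, where the even and odd chains are independent and can overlap in the pipeline:
# evaluates c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4 + c5*x^5
@inline function horner_split(x, c0, c1, c2, c3, c4, c5)
    x2 = x * x
    peven = muladd(x2, muladd(x2, c4, c2), c0)   # c0 + c2*x^2 + c4*x^4
    podd  = muladd(x2, muladd(x2, c5, c3), c1)   # c1 + c3*x^2 + c5*x^4
    return muladd(x, podd, peven)                # combine: peven + x*podd
end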
@ViralBShah From the IR, it looks like LLVM already vectorizes (but doesn't partially unroll) loghorner! without @simd. The only difference is that the version with @simd elides a check for aliasing before the loop executes.
@simonbyrne I did notice the vectorized operations in the IR, but was hoping that @simd would do a few more things.
@Maratyszcza It is a good idea to have a vector polynomial evaluation, especially if we get all the unrolling and stuff implemented right.
Also, it would be nice to leverage the FMAs here.
@simonbyrne Could you update the openlibm log implementation or provide a julia implementation that is faster? I am looking for the fastest log I can get right now, and even a 10% improvement is worthwhile.
I'll try to get a PR together this week. It would be nice to have the FMA stuff sorted before we merge it.
I just benchmarked Yeppp, and it is about 8x faster for log and exp for vectors of size 10,000. For vectors of between 1000 and 100000 elements, one gets anywhere between 2x and 8x. At smaller vector lengths, sometimes julia is faster and sometimes Yeppp - too much measurement noise in my command line tests.
@Maratyszcza What are your long term plans for Yeppp? Do you plan to release updates for newer processors as they come out? Just curious - since the performance gains are worth having through a Yeppp package.
@ViralBShah We do plan to optimize Yeppp! for Skylake, but it is not a high-priority project.
Thanks. Good to know. I just cleaned up Yeppp.jl and made it a Julia package.
Will #27137 address this?
Not directly, but it will make it easier once we have a way to write SIMD kernels (ideally something like http://ispc.github.io/)
Julia is slow in the benchmark from No, Python is Not Too Slow for Computational Economics because of the log function. I profiled the code using Julia 1.0.2 and Profile.@profile; most of the time is spent in the log calculation that happens in the inner loop. The Julia code is in this file: https://github.com/jstac/julia_python_comparison/blob/master/RBC_Julia.jl
A SIMD-vectorizable log function can be found in SLEEF:
EXPORT CONST double xlog(double d) {
double x, x2, t, m;
int e;
int o = d < DBL_MIN;
if (o) d *= (double)(1LL << 32) * (double)(1LL << 32);
e = ilogb2k(d * (1.0/0.75));
m = ldexp3k(d, -e);
if (o) e -= 64;
x = (m-1) / (m+1);
x2 = x * x;
t = 0.153487338491425068243146;
t = mla(t, x2, 0.152519917006351951593857);
t = mla(t, x2, 0.181863266251982985677316);
t = mla(t, x2, 0.222221366518767365905163);
t = mla(t, x2, 0.285714294746548025383248);
t = mla(t, x2, 0.399999999950799600689777);
t = mla(t, x2, 0.6666666666667778740063);
t = mla(t, x2, 2);
x = x * t + 0.693147180559945286226764 * e;
if (xisinf(d)) x = INFINITY;
if (d < 0 || xisnan(d)) x = NAN;
if (d == 0) x = -INFINITY;
return x;
}
I should have picked log as the example, but I did try vectorizing exp and got similar performance (actually better for Float64) to the C version:
https://github.com/musm/SLEEF.jl/issues/14
Would be really neat if the compiler could autovectorize these functions in loops!
Chiming in here as I stumbled across the slow performance of log in Julia. numpy.log seems to be about 6x faster than Base.log broadcast to an array.
using BenchmarkTools
function test_log()
x = randn(10000) .+ 5
@benchmark log.($x)
end
@benchmark test_log()
yields:
BenchmarkTools.Trial:
memory estimate: 78.20 KiB
allocs estimate: 2
--------------
minimum time: 49.861 μs (0.00% GC)
median time: 80.124 μs (0.00% GC)
mean time: 97.785 μs (12.32% GC)
maximum time: 43.889 ms (99.65% GC)
--------------
samples: 10000
evals/sample: 1
As compared to:
x = np.random.randn(10000) + 5
%timeit np.log(x)
Which yields:
15 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The evaluation of log and exp is often a bottleneck in applications, so it's a concern that we don't perform as well as NumPy. Do you know if your NumPy is linked against MKL? It would be great if we could start utilizing vectorized math libraries automatically when broadcasting math functions such as exp and log.
Yes, my NumPy is linked against MKL -- I believe that's been the default for a while for the NumPy/SciPy toolchain distributed by Anaconda. Here's more info about my setups on Julia and Python.
OS: MacOS 10.14.4
julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
>>> import sys
>>> print(sys.version)
3.7.2 (default, Dec 29 2018, 00:00:04)
[Clang 4.0.1 (tags/RELEASE_401/final)]
>>> import numpy
>>> numpy.__version__
'1.15.4'
>>> numpy.show_config()
mkl_info:
libraries = ['mkl_rt', 'pthread']
...
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
...
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
...
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
...
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
...
Expanding on my comment here, there are a couple of things we would need to do to match (or make use of) MKL:
1. _Write faster scalar functions_. We use translated versions of openlibm (which were based on FreeBSD's libm, which in turn descends from fdlibm written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting `fma` instructions on newer architectures).
2. _Provide/build micro-arch specific system images_, since `fma` instructions are not available on all x86_64 processors. (Edit: we already do this.)
3. _Provide hooks to use vectorised kernels_, such as SVML, SLEEF, Yeppp or Apple's Accelerate library, for operations like broadcast and reduce. LLVM provides such hooks for its own intrinsics: we don't currently use these because they are hard-coded to call the system math library, not our own scalar functions. Apparently this is fixable, but the best option would be to have a general framework to do this for arbitrary functions, not just those blessed by LLVM (#15265, #21454).
4. _A framework to write our own vectorised kernels_. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks.
- Provide/build micro-arch specific system images, since fma instructions are not available on all x86_64 processors.
We have this! The default binaries use the sysimg multiversioning string generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1), which allows specialization up to the haswell ISA to be used on processors that support it. Looking at the wiki page for FMA instructions, it looks like that's all we need.
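A quick way to check what a given machine actually gets (the first line is the same FMA probe that appears later in this thread; it is true only if the muladd compiles to a fused instruction):
muladd(nextfloat(1.0), nextfloat(1.0), -nextfloat(1.0, 2)) != 0   # true iff fma was used
Sys.CPU_NAME                                                      # LLVM's name for the host CPU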
- _Provide hooks to use vectorised kernels_, such as SVML, SLEEF, Yeppp or Apple's Accelerate library, for operations like broadcast and reduce.
At least in this specific case (log.(x)), I did try both SLEEF and Yeppp, and the timings were actually worse than Base.log for my simple test above.
Since you're on a Mac, you could try AppleAccelerate.jl, but first you would have to update it to 1.0.
We really need to make it easy to get MKL and switch to using it.
Provide/build micro-arch specific system images, since fma instructions are not available on all x86_64 processors.
We have this! The default binaries use the sysimg multiversioning string generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1), which allows specialization up to the haswell ISA to be used on processors that support it. Looking at the wiki page for FMA instructions, it looks like that's all we need.
Ah, I missed that!
We really need to make it easy to get MKL and switch to using it.
While that might help BLAS/LAPACK operations, it won't do much for log, exp, etc. without point 3 above.
Sorry - I just use MKL as a stand-in for Intel accelerated libraries.
1. _Write faster scalar functions_. We use translated versions of openlibm (which were based on FreeBSD's libm, which in turn descend from fdlibm written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting `fma` instructions on newer architectures).
No? Aren't we using the Tang's method version now? https://github.com/JuliaLang/julia/pull/24750
A framework to write our own vectorised kernels. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks.
Have you seen Hydra.jl?
Also, @musm's SLEEF port is a good place to look for SIMD-able functions already written in Julia with a compatible license. Adding @inline to most of the functions is enough to get them to vectorize in for loops.
However, it's probably undesirable to inline them when they are used as scalar calls. Maybe instead of @inline, some changes to Julia's inlining heuristics could get it to make the correct decision?
Although the current versions are probably preferable when called on a single scalar. Maybe @fastmath could point to SLEEF functions marked with @inline?
For example, a copy and paste from here:
using SLEEF, BenchmarkTools
@inline function log_inline(d::T) where {T<:Union{Float32,Float64}}
o = d < floatmin(T)
o && (d *= T(Int64(1) << 32) * T(Int64(1) << 32))
e = SLEEF.ilogb2k(d * T(1.0/0.75))
m = SLEEF.ldexp3k(d, -e)
o && (e -= 64)
x = SLEEF.ddiv(SLEEF.dadd2(T(-1.0), m), SLEEF.dadd2(T(1.0), m))
x2 = x.hi*x.hi
t = SLEEF.log_kernel(x2)
s = SLEEF.dmul(SLEEF.MDLN2(T), T(e))
s = SLEEF.dadd(s, SLEEF.scale(x, T(2.0)))
s = SLEEF.dadd(s, x2*x.hi*t)
r = T(s)
isinf(d) && (r = T(Inf))
(d < 0 || isnan(d)) && (r = T(NaN))
d == 0 && (r = -T(Inf))
return r
end
@inline function log_fast_inline(d::T) where {T<:Union{Float32,Float64}}
o = d < floatmin(T)
o && (d *= T(Int64(1) << 32) * T(Int64(1) << 32))
e = SLEEF.ilogb2k(d * T(1.0/0.75))
m = SLEEF.ldexp3k(d, -e)
o && (e -= 64)
x = (m - 1) / (m + 1)
x2 = x * x
t = SLEEF.log_fast_kernel(x2)
x = x * t + T(SLEEF.MLN2) * e
isinf(d) && (x = T(Inf))
(d < 0 || isnan(d)) && (x = T(NaN))
d == 0 && (x = -T(Inf))
return x
end
function log_base!(y, x)
@inbounds for i in eachindex(x,y)
y[i] = log(x[i])
end
end
function log_sleef!(y, x)
@inbounds for i in eachindex(x,y)
y[i] = SLEEF.log(x[i])
end
end
function log_sleef_fast!(y, x)
@inbounds for i in eachindex(x,y)
y[i] = SLEEF.log_fast(x[i])
end
end
function log_sleef_inline!(y, x)
@inbounds for i in eachindex(x,y)
y[i] = log_inline(x[i])
end
end
function log_sleef_fast_inline!(y, x)
@inbounds for i in eachindex(x,y)
y[i] = log_fast_inline(x[i])
end
end
x = rand(200); y = similar(x);
@benchmark log_base!($y, $x)
@benchmark log_sleef!($y, $x)
@benchmark log_sleef_fast!($y, $x)
@benchmark log_sleef_inline!($y, $x)
@benchmark log_sleef_fast_inline!($y, $x)
Results on a computer with avx2 (Ryzen):
julia> @benchmark log_base!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.108 μs (0.00% GC)
median time: 1.146 μs (0.00% GC)
mean time: 1.148 μs (0.00% GC)
maximum time: 2.790 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark log_sleef!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.856 μs (0.00% GC)
median time: 2.911 μs (0.00% GC)
mean time: 2.939 μs (0.00% GC)
maximum time: 7.420 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark log_sleef_fast!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.694 μs (0.00% GC)
median time: 1.718 μs (0.00% GC)
mean time: 1.727 μs (0.00% GC)
maximum time: 9.869 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark log_sleef_inline!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.237 μs (0.00% GC)
median time: 1.262 μs (0.00% GC)
mean time: 1.263 μs (0.00% GC)
maximum time: 3.209 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark log_sleef_fast_inline!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 571.902 ns (0.00% GC)
median time: 586.141 ns (0.00% GC)
mean time: 591.278 ns (0.00% GC)
maximum time: 879.457 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 184
Results with avx512:
julia> @benchmark log_base!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.021 μs (0.00% GC)
median time: 1.028 μs (0.00% GC)
mean time: 1.030 μs (0.00% GC)
maximum time: 3.740 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark log_sleef!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.723 μs (0.00% GC)
median time: 2.731 μs (0.00% GC)
mean time: 2.736 μs (0.00% GC)
maximum time: 4.694 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark log_sleef_fast!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.317 μs (0.00% GC)
median time: 1.325 μs (0.00% GC)
mean time: 1.327 μs (0.00% GC)
maximum time: 3.138 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark log_sleef_inline!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 408.535 ns (0.00% GC)
median time: 416.595 ns (0.00% GC)
mean time: 415.228 ns (0.00% GC)
maximum time: 4.026 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 200
julia> @benchmark log_sleef_fast_inline!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 186.160 ns (0.00% GC)
median time: 186.397 ns (0.00% GC)
mean time: 189.494 ns (0.00% GC)
maximum time: 416.618 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 595
@code_native shows that the inlined loops are indeed vectorized.
@chriselrod -- Thanks for your SLEEF examples. I had tried SLEEF, but only SLEEF.log_fast, which as your example shows performs worse than Base.log. However, I tried your log_sleef_fast_inline and indeed this starts to approach NumPy w/MKL speeds for the example I used previously. Here's the benchmark I get on my system using x = randn(10000) .+ 5 as input:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 14.697 μs (0.00% GC)
median time: 15.074 μs (0.00% GC)
mean time: 17.046 μs (0.00% GC)
maximum time: 113.653 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
This is on par with the ~15 μs per call I was getting with NumPy + MKL.
If you want to assume the results are finite, and eliminate the checks:
@inline function log_fast_inline(d::T) where {T<:Union{Float32,Float64}}
o = d < floatmin(T)
o && (d *= T(Int64(1) << 32) * T(Int64(1) << 32))
e = SLEEF.ilogb2k(d * T(1.0/0.75))
m = SLEEF.ldexp3k(d, -e)
o && (e -= 64)
x = (m - 1) / (m + 1)
x2 = x * x
t = SLEEF.log_fast_kernel(x2)
x = x * t + T(SLEEF.MLN2) * e
# isinf(d) && (x = T(Inf))
# (d < 0 || isnan(d)) && (x = T(NaN))
# d == 0 && (x = -T(Inf))
return x
end
it should knock off another 10% or more from the time and beat NumPy + MKL.
Is there any enthusiasm among the core developers for @chriselrod's suggestion that "maybe @fastmath could point to SLEEF functions marked with @inline"?
It should be possible to do that from a package: i.e. SLEEF could override Base.FastMath.log_fast for those cases.
The @fastmath functionality itself needs a bit of a revamp (#26828) and should probably be moved to a package.
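A minimal sketch of that idea (deliberate type piracy, shown only to illustrate the mechanism; whether a package should really do this is the open question):
using SLEEF
# after this definition, @fastmath log(x) on a Float64 dispatches to the SLEEF kernel
Base.FastMath.log_fast(x::Float64) = SLEEF.log_fast(x)
# @fastmath log(2.0)   # now calls SLEEF.log_fast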
SLEEF prides itself on providing functions that are very accurate. If speed is important, then people often forego accuracy, stating e.g. that the last few digits might not be accurate. (If memory serves me well, OpenCL allows the last 4 bits to be wrong.) Comparing speed alone will not work; one will also have to compare accuracy.
Currently, Julia does not have a proper mechanism for people to choose what accuracy they want. Generally, using @fastmath means that people will be happy with less accuracy, but that isn't quantified anywhere. Using the OpenCL (or some other) accuracy requirements for @fastmath might make sense, and requiring results to be as accurate as possible by default would also be good.
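A rough sketch of how one could quantify that alongside the timings: measure the worst-case error in ulps of a candidate against a BigFloat reference (the sample of inputs and the candidates here are arbitrary choices):
function max_ulp_error(f, xs)
    worst = 0.0
    for x in xs
        ref = Float64(log(big(x)))               # reference rounded from BigFloat
        worst = max(worst, abs(f(x) - ref) / eps(ref))
    end
    return worst
end
xs = exp.(10 .* randn(10^5))                     # inputs spread over many magnitudes
max_ulp_error(log, xs)                           # compare e.g. Base.log vs SLEEF.log_fast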
@eschnett That's a very good point and, at the risk of a tangent, are there mechanisms that allow one to systematically trade off accuracy for performance?
@ViralBShah Yes, there are several.
- One is the argument reduction (e.g. reducing the argument modulo pi when calculating a sine). You can just opt to not do that. It's a priori unclear how much accuracy is lost, but the speedup should be measurable.
- Another is the polynomial approximation (e.g. the kernel polynomial for sin). You can do one fewer iteration, or leave out one or two terms in the series.
Hello,
Please consider making the Julia port based on the latest version. I believe that BSL is permissive enough. The latest version should be faster than version 2.
You can do one fewer iteration, or leave out one or two terms in the series.
In most of the cases, that will reduce the accuracy too much. It's not easy to reduce the accuracy a little to get a large speed-up. I think LTO would be an easy and certain way.
We really need to make it easy to get MKL and switch to using it.
While that might help BLAS/LAPACK operations, it won't do much for log, exp, etc. without point 3 above.
@simonbyrne, Intel has 3 main libraries:
In MKL there are many things which all have one thing in common - they work on large arrays. It has BLAS / LAPACK, Solvers, Random Number Generators, DFT (FFT), etc.
One of its features is Intel MKL VML. This provides many element-wise operations on arrays. So for that case indeed MKL is the answer (or could be the answer). This is exactly what's used by Numpy / Scipy.
Of course, behind the scenes, Intel VML is just looping over the array and using SVML kernels.
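For reference, calling VML directly is just a ccall to functions like vdLn; a hedged sketch (it assumes an LP64 libmkl_rt is resolvable by the loader, and the integer type would need to change for an ILP64 build):
const libmkl_rt = "libmkl_rt"    # assumption: found on the library search path

function vml_log!(y::Vector{Float64}, x::Vector{Float64})
    length(y) == length(x) || throw(DimensionMismatch("x and y must have equal length"))
    # C prototype: void vdLn(const MKL_INT n, const double a[], double r[])
    ccall((:vdLn, libmkl_rt), Cvoid, (Cint, Ptr{Float64}, Ptr{Float64}), length(x), x, y)
    return y
end
# usage: y = similar(x); vml_log!(y, x)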
Vectorize.jl is a nice way nowadays to deal with the vectorized cases - but it would still be nice to have a fast standalone log.
Julia is already shipping the Tang algorithm which appears to be faster than the system libm and openlibm (unless I am somehow messing up this benchmarking):
julia> @benchmark ccall((:log, :libopenlibm), Float64, (Float64, ), 1000.0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.011 ns (0.00% GC)
median time: 9.126 ns (0.00% GC)
mean time: 9.184 ns (0.00% GC)
maximum time: 35.417 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark ccall((:log, :libm), Float64, (Float64, ), 1000.0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.851 ns (0.00% GC)
median time: 4.889 ns (0.00% GC)
mean time: 4.934 ns (0.00% GC)
maximum time: 24.782 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark log(1000.0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.420 ns (0.00% GC)
median time: 1.434 ns (0.00% GC)
mean time: 1.903 ns (0.00% GC)
maximum time: 1.061 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
I think for the original purpose of this issue, it is resolved. The discussion around vectorized implementations and fastmath could be carried out elsewhere.
I'm pretty sure the julia version had the computation constant propagated out.
Yeah probably right. Here's a better example:
julia> a = abs.(randn(10^3));
julia> @benchmark log.(a)
BenchmarkTools.Trial:
memory estimate: 8.00 KiB
allocs estimate: 4
--------------
minimum time: 7.796 μs (0.00% GC)
median time: 8.399 μs (0.00% GC)
mean time: 9.844 μs (8.69% GC)
maximum time: 1.795 ms (99.32% GC)
--------------
samples: 10000
evals/sample: 4
julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 1 method)
julia> @benchmark f.(a)
BenchmarkTools.Trial:
memory estimate: 8.00 KiB
allocs estimate: 4
--------------
minimum time: 9.934 μs (0.00% GC)
median time: 10.366 μs (0.00% GC)
mean time: 10.828 μs (0.00% GC)
maximum time: 39.387 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> g(x::Float64) = ccall((:log, :libm), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> @benchmark g.(a)
BenchmarkTools.Trial:
memory estimate: 8.00 KiB
allocs estimate: 4
--------------
minimum time: 6.013 μs (0.00% GC)
median time: 6.494 μs (0.00% GC)
mean time: 7.372 μs (11.12% GC)
maximum time: 1.082 ms (99.04% GC)
--------------
samples: 10000
evals/sample: 6
We are in between the Apple libm and openlibm.
I believe you actually just want log($(1000.0)). a should also be $a, but it matters less in this particular case.
Using $a speeds up all 3 cases by 0.5 μs.
That seems to work, and is simpler than the Ref approach I've been using:
julia> @benchmark log($(Ref(1000.0))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.541 ns (0.00% GC)
median time: 4.582 ns (0.00% GC)
mean time: 4.592 ns (0.00% GC)
maximum time: 14.723 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.463 ns (0.00% GC)
median time: 4.488 ns (0.00% GC)
mean time: 4.498 ns (0.00% GC)
maximum time: 14.917 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark log(1000.0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.103 ns (0.00% GC)
median time: 1.105 ns (0.00% GC)
mean time: 1.111 ns (0.00% GC)
maximum time: 11.283 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
For comparison:
julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so"), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> g(100.0)
ERROR: could not load library "/usr/lib64/libm.so"
/usr/lib64/libm.so: invalid ELF header
Stacktrace:
[1] g(::Float64) at ./REPL[16]:1
[2] top-level scope at REPL[17]:1
julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> g(100.0)
4.605170185988092
julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 2 methods)
julia> @benchmark f($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.245 ns (0.00% GC)
median time: 6.261 ns (0.00% GC)
mean time: 6.273 ns (0.00% GC)
maximum time: 16.794 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark g($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.936 ns (0.00% GC)
median time: 3.946 ns (0.00% GC)
mean time: 3.958 ns (0.00% GC)
maximum time: 14.315 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Any idea why I may be getting an invalid ELF header error?
For what it's worth, on linux, log appears to be slower than log from glibc libm:
julia> @btime ccall((:log, :libopenlibm), Float64, (Float64,), $1000.0)
8.754 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime ccall((:log, "/usr/lib/libm.so.6"), Float64, (Float64,), $1000.0)
5.306 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime log(Ref($1000.0)[])
6.297 ns (0 allocations: 0 bytes)
6.907755278982137
EDIT: beat to the punch it looks like.
libm.so is usually a linker script. In fact, all .so files in LD_LIBRARY_PATH are mainly meant for compile time.
julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.463 ns (0.00% GC)
median time: 4.488 ns (0.00% GC)
mean time: 4.498 ns (0.00% GC)
maximum time: 14.917 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Huh. This doesn't work for avoiding constant propagation on my machine. Wonder what the difference is.
julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 0.017 ns (0.00% GC)
median time: 0.019 ns (0.00% GC)
mean time: 0.019 ns (0.00% GC)
maximum time: 0.033 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
On my computer, FMA_NATIVE should be true, using the test here:
but it is not getting used. I've built Julia from source on this computer:
julia> const FMA_NATIVE = muladd(nextfloat(1.0),nextfloat(1.0),-nextfloat(1.0,2)) != 0
true
julia> Base.Math.FMA_NATIVE
false
@non-Jedi I'm also on Linux, and found "/usr/lib(64)/libm.so.6" to be the fastest.
libmvec.so on Linux is also extremely fast, and provides SIMD implementations of a few functions. SLEEFPirates/LoopVectorization will use it when it can find it under "/usr/lib64/", "/usr/lib", or "/lib/x86_64-linux-gnu". Any list giving all the directories to search to cover most of the Linux distros?
I'm on the latest Julia master. What version are you on?
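On the directory question above, a rough sketch of probing those locations with Libdl (the path list is just the one mentioned above, not exhaustive):
using Libdl
dirs = ["/usr/lib64", "/usr/lib", "/lib/x86_64-linux-gnu"]
libmvec = Libdl.find_library(["libmvec.so.1", "libmvec.so"], dirs)
println(isempty(libmvec) ? "libmvec not found" : "found $libmvec")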
@ViralBShah I have the same problem. Here is an issue:
https://github.com/JuliaLang/julia/issues/33011
I'll try hard coding it to true, recompiling, and checking the difference that makes.
My tests suggest that hardcoding the FMA flag to true has very little effect. Looking into the code to see what values it is triggered for.
For log, the proc2 method that uses FMA is only invoked for a small range of values:
https://github.com/JuliaLang/julia/blob/17ad9223fa33d565a0cae9ce6023af14753f4eb1/base/special/log.jl#L258
Even in those cases, I can't see a noticeable difference.
@chriselrod It is probably easier to experiment with libm.jl for this purpose
Interesting. I'm not seeing much of a difference either.
I already rebuilt Julia, but Libm makes it easy to test both at the same time:
julia> Libm.is_fma_fast(Float64)
false
julia> @benchmark Base.log($(1.04))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.653 ns (0.00% GC)
median time: 3.678 ns (0.00% GC)
mean time: 3.698 ns (0.00% GC)
maximum time: 20.121 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark Libm.log($(1.04))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.656 ns (0.00% GC)
median time: 3.917 ns (0.00% GC)
mean time: 3.884 ns (0.00% GC)
maximum time: 13.617 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> Base.Math.FMA_NATIVE
true
It is faster than the system libm version for 1.04 on this computer (which again takes about 3.94ns).
Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl
julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)
julia> @btime l(Ref($1000.0)[])
4.303 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime log(Ref($1000.0)[])
5.790 ns (0 allocations: 0 bytes)
6.907755278982137
Reopening in order to consider whether we want to use the llvm version.
I get nearly identical performance between the llvm and Julia logs
julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)
julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> @btime l(Ref($1000.0)[])
4.583 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime log(Ref($1000.0)[])
4.571 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime g(Ref($1000.0)[])
3.938 ns (0 allocations: 0 bytes)
6.907755278982137
If they're both using the same algorithm, it's interesting that you see a difference.
Note that what I have linked is log1p, but it does illustrate doing a few things differently:
// Note that we use a lookup table of size 64 rather than 128,
// and compensate by having extra terms in the minimax polynomial
// for the kernel approximation.
Where does one find the glibc log implementation?
On someone's private github mirror, here I think:
EDIT: note that glibc is obviously licensed under the GPL, so any implementation that could be considered "derived" from glibc must also be so licensed.
Looks surprisingly like the musl log: https://git.musl-libc.org/cgit/musl/tree/src/math/log.c
But one is MIT licensed and the other is GPL. Hmm.
Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl
julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)
IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just get lowered to libm functions. The one you linked is the OpenCL C library, which shouldn't be what you are using directly....
IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just get lowered to libm functions. The one you linked is the OpenCL C library, which shouldn't be what you are using directly....
Ok. That's what I thought originally. However, the llvm log seems slightly faster than the system libm on my mac.
You can check code_native and/or in the debugger to make sure/compare.
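For example, in the REPL (where InteractiveUtils is already loaded):
l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
@code_native l(1000.0)   # shows whether the intrinsic became a libm call or inline code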
Also note that I'm not sure which log function has the best performance; I'm merely responding to obvious issues in my notification. A few other issues to consider:
Anyway, I have no idea what the best log implementation is, or any libm function for that matter... However, if you are interested in the algorithm, just blindly doing random testing may not give the most sensible result, and I've certainly seen cases where small effects matter at this level...
We clearly have the right algorithm and the original purpose of this issue is resolved. The julia log is overall reasonably competitive if not the fastest. There may be value in calling the system log in certain cases (mac and linux64 it appears), but we should perhaps open a new issue for that and it would need significant testing in cases where we use the system-provided versions.