Julia: A faster log function

Created on 1 Nov 2014 · 108Comments · Source: JuliaLang/julia

The log function can often be the dominant cost in certain algorithms, particularly those related to sampling and fitting statistical models, e.g. see this discussion.

I've implemented the Tang (1990) table-based algorithm here:
https://github.com/simonbyrne/libm.jl/blob/master/src/log_tang.jl

It seems to be about 10% faster than the openlibm function, but there is probably still room for improvement. Does anyone have any suggestions?

For reference:

the OS X (10.9) system libm is still about 20-30% faster. I'm not sure what algorithm they use as they no longer release their libm code, but the code from 10.7 seems to use the Tang algorithm. I'm not sure which of the following is actually used:
- http://www.opensource.apple.com/source/Libm/Libm-2026/Source/Intel/log_universal.h
- http://www.opensource.apple.com/source/Libm/Libm-2026/Source/Intel/xmm_log.c
the glibc libm is about 3x slower (they use code based on the IBM Accurate Mathematical Library, which is supposed to be correctly rounded)

Once we have native fused multiply-add operations (#8112), we can probably speed it up further (on processors that support them).

maths performance

Source

simonbyrne

Most helpful comment

Expanding on my comment here, there are a couple of things we would need to do to match (or make use of) MKL:

_Write faster scalar functions_. We use translated versions of openlibm (which were based on FreeBSD's libm, which in turn descend from fdlibm written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting fma instructions on newer architectures).
~~_Provide/build micro-arch specific system images_, since fma instructions are not available on all x86_64 processors.~~: we already do this.
_Provide hooks to use vectorised kernels_, such as SVML, SLEEF, Yeppp or Apple's Accelerate library, for operations like broadcast and reduce. LLVM provides such hooks for its own intrinsics: we don't currently use these because they are hard coded to call the system math library, not our own scalar functions. Apparently this is fixable, but the best option would be to have a general framework to do this for arbitrary functions, not just the those blessed by LLVM (#15265, #21454)
_A framework to write our own vectorised kernels_. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks.

simonbyrne on 11 Apr 2019

❤7

All 108 comments

If you can give me a simple way of doing benchmarks, I can compare openlibm and system libm on OS X 10.6, to see whether the performance improvement was already present on that version. But it's not my machine, so I can't install development tools.

nalimilan on 1 Nov 2014

I also found the Apple log to be faster in tests I had done a while back when I was putting openlibm together.

ViralBShah on 1 Nov 2014

Have you compared with Intel's log in MKL? They may also have fast vectorized versions. Perhaps @ArchRobison knows something about this.

ViralBShah on 1 Nov 2014

@nalimilan I have a very, very rough test here:
https://github.com/simonbyrne/libm.jl/blob/master/test/perf/log.jl
But we probably should use something more rigorous (e.g. testing different ranges).

@ViralBShah I don't have access to MKL, so have not compared it.

simonbyrne on 1 Nov 2014

@simonbyrne It is intalled on julia.mit.edu, if you have an account there. If you would like it, I can create one for you - just drop me an email. Pretty beefy machine for julia development - 80 cores / 1TB RAM.

ViralBShah on 1 Nov 2014

Here's the timings using the above test script on julia.mit.edu:

Openlibm:
elapsed time: 0.366460477 seconds (13824 bytes allocated)

System:
elapsed time: 0.752183874 seconds (96 bytes allocated)

Intel math:
elapsed time: 0.597063228 seconds (96 bytes allocated)

Julia fdlibm:
elapsed time: 0.368376388 seconds (96 bytes allocated)

Julia tang:
elapsed time: 0.284002105 seconds (96 bytes allocated)

(here Intel math is the ICC libimf.so, not the MKL routines: I'm still figuring those out).

So that's a 25% boost over openlibm (much more than on my machine).

simonbyrne on 1 Nov 2014

@ViralBShah I haven't had any luck figuring out how to call the MKL vector math routines from within julia: any tips?

simonbyrne on 1 Nov 2014

Sorry, none of those are not hooked up. Has been a while since I looked up the Intel manual. IMF has the libm stuff that you can get to with USE_INTEL_LIBM.

ViralBShah on 1 Nov 2014

Have a look here: https://github.com/simonster/VML.jl

andreasnoack on 1 Nov 2014

Perhaps worth also looking at Yeppp.

ViralBShah on 1 Nov 2014

Thanks, I'll try out the package. I had a look at Yeppp's code: their log function is pretty crazy. As they outline in this talk, they intentionally avoid table lookups and divisions, which means they need a 20-term polynomial expansion. This is pretty much the opposite of Tang's approach, which uses the table as a way to get a shorter polynomials.

simonbyrne on 1 Nov 2014

That implementation is crazy! I wonder if we can leverage Yeppp or integrate parts of it into openlibm. The build process is completely crazy too.

ViralBShah on 2 Nov 2014

@simonbyrne On Mac OS X 10.6 with Julia 0.2.1, I get these results (with an old 2GHz Core 2 Duo). Unfortunately I could not test your implementation as cloning a package fails.

julia> @testsum "System:" syslog X
       @testsum "Openlibm:" log X
Warning: Possible conflict in library symbol log

System:
elapsed time: 0.463031635 seconds (64 bytes allocated)

Openlibm:
elapsed time: 0.455063121 seconds (64 bytes allocated)

Is the warning expected or does it indicate that the same library was used for both calls? If not, it would seem to indicate that Apple's libm improved since 10.6, or that it got better than openlibm on recent CPUs.

nalimilan on 2 Nov 2014

@nalimilan Thanks for looking at it. I'm not sure what the warning is, but it could be getting mixed up between Base.log and libm.log (my package: I really should rename it...)?

Perhaps the reason they refuse to release the source is that they've licensed a proprietary library?

simonbyrne on 2 Nov 2014

It turns that julia.mit.edu doesn't support AVX. @simonster Any ideas on which library I should link to?

simonbyrne on 3 Nov 2014

@nalimilan That warning is talking about a symbol being imported from a shared library; e.g. it's warning that a log function is already defined in the Julia process, and you're loading a library that redefines that function. I don't remember what we changed in the last year, but that warning _might_ mean that the same log is being called in both cases.

staticfloat on 3 Nov 2014

Actually, the error only appears after calling log once, which I think loads openlibm. But the timings are the same before and after that, so the benchmark is probably fine. (Libm.jl isn't involved since I could not even install it on 0.2.1.)

nalimilan on 3 Nov 2014

It looks as though Apple have changed their algorithm, as they get more accurate results than a standard implementation of Tang's algorithm would provide. On 10 million points:

| Algorithm | Max relative error (units in last place) |
| --- | --- |
| Openlibm | 0.863 |
| OS X 10.9 libm | 0.508 |
| Tang's algorithm | 0.555 (theoretical bound of 0.566) |

simonbyrne on 4 Nov 2014

I have an application that does a lot of raw number crunching. Redefining the basic special functions log, exp, ... to use the system libm as per @simonbyrne's example immediately cut down the runtime by ~20%. So I would be very much interested in (1) either a faster implementation of these basic functions, or (2) a way to tell Julia to use the system libm automatically.

ntessore on 7 Nov 2014

For (2), compile julia with USE_SYSTEM_LIBM=1.

nolta on 7 Nov 2014

@nolta Ah, interesting! Thanks. I was using Homebrew, where such options cannot be passed directly. I will try that.

ntessore on 7 Nov 2014

If we implement log in Base, will we eventually abandon openlibm in favor of a pure julia implementation? I'm ok with that.

nolta on 7 Nov 2014

That is right. We can abandon openlibm with a pure julia implementation, which might be easier to maintain and optimize for vectorization, etc. Otherwise we just have to pile on more libraries.

ViralBShah on 7 Nov 2014

Prompted by a different thread I began to port a SIMD math library to Julia. Currently, exp and log are available as a proof of concept. See https://github.com/eschnett/Vecmathlib.jl .

eschnett on 8 Nov 2014

This looks really cool!

ViralBShah on 8 Nov 2014

@ntessore if you're computing log on arrays, and you're on OS X, you might want to try out the Accelerate library that comes with the OS. It is blazingly fast compared with the standard log (from what I understand it uses SIMD operations and a bunch of extra threads), though does appear to sacrifice some accuracy. I have started putting together a package here:

https://github.com/simonbyrne/Accelerate.jl

It lacks documentation, but if you just call Accelerate.log you should get what you want (if not, please file an issue). Once I have wrapped it up I'll put it on METADATA, but I'd be grateful for any feedback.

simonbyrne on 10 Nov 2014

@simonbyrne Thanks for creating this, I will try and let you know if anything comes up.

Maybe we could put together a general Fast.jl package that offers fast but potentially less accurate math functions on all systems, choosing whatever options are best for a given platform. That way, code is less platform dependent, even though the performance might vary.

ntessore on 10 Nov 2014

I tried looking at the Apple log function with a debugger, but they seem to have some basic reverse engineering protection in place (interestingly, this post is by one of their floating point gurus). However I did take a look at a hex dump of the library, and they have a table of pairs of values:

[(log(F),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]

which suggests that they are using a table lookup approach, similar (but different from) the Tang algorithm. I'll try playing around with it and see how I go.

simonbyrne on 6 Dec 2014

I think we should be careful about this - since we may not be able to use the resulting code.

ViralBShah on 6 Dec 2014

The Yeppp implementationt talks about avoiding table lookups for SIMD performance. Unfortunately, there is no paper on the details of their implementation - only the talk. The source is available, and is all assembly.

ViralBShah on 6 Dec 2014

Well, that's basically the extent of my reverse engineering skills, and I don't think numbers are copyrightable, so we should be okay. Their approach intuitively makes sense: they use a slightly finer grained table, but can avoid the division (which messes up pipelining), and look up both points as a pair, which can be done in one step by exploiting the SIMD registers.

I tried translating the Yeppp code to Julia, but didn't see any real performance advantages: granted, I was using Float64 (which are harder vectorise), and don't have fma support on my machine, so it's tough to tell.

simonbyrne on 6 Dec 2014

If you still have that code, I would love to try it out.

ViralBShah on 7 Dec 2014

Yeppp! approach with table-free implementations is specifically targeted for vector elementary functions, where only throughput matters. For scalar and short SIMD versions the most important performance characteristic is latency because even Haswell doesn't have enough entries in re-order buffer to extract enough parallelism from sequential calls to saturate execution units. I think that scalar/short SIMD versions should use a variant of Tang's method.
If you want to port Yeppp! functions to Julia, I'd suggest to look at code in this file which should be easier to port.

Maratyszcza on 7 Dec 2014

This was my attempt at a translation:
https://github.com/simonbyrne/libm.jl/blob/master/src/log_yeppp.jl

simonbyrne on 7 Dec 2014

This is about 10x slower than the built-in log. Just had to change @horner to @evalpoly.

ViralBShah on 8 Dec 2014

Proper loop unrolling and SIMD is a must to get good performance from Yeppp! algorithms. Without loop unrolling the execution is severely latency-bound, resulting in low ILP.

Maratyszcza on 8 Dec 2014

@Maratyszcza I admit I just translated the code to check out the accuracy, I wasn't really trying to match performance.

I have to concede that I'm stumped as to how Apple implemented their algorithm. They only lookup one value of log(F) (as opposed to the pair that Tang's algorithm uses), and the error from this alone should exceed what they get in practice. They must have some other tricks going on there. Clearly they employ some very good numerical programmers.

simonbyrne on 8 Dec 2014

One value + Newton iteration, or something like that, maybe?

StefanKarpinski on 8 Dec 2014

That was my guess, but to do a Newton step on y=log(x), you need to compute an exp(y), so I don't really see how that helps...

simonbyrne on 8 Dec 2014

We could disassemble their code...

StefanKarpinski on 8 Dec 2014

I'd guess they use a variant of Gal's accurate tables: the pivot points log(x) and 1/x are not in a grid, they are chosen in such way that log(x) is accurate to more then double precision (i.e. several bits after its double-precision value are zeroes)

Maratyszcza on 8 Dec 2014

No, they're certainly evenly spaced, as they're exactly:

[(Float64(log(big(F))),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]

and the error of some of those values are up to 0.48 ulps (e.g. at 1.765625)

It could also be that the tables are for another function, but they do also have log2 and log10 versions as well.

I also did some timing analysis, and they are a bit quicker on the range [0.74805, 1.49605], which suggests that's the domain of their range reduction.

simonbyrne on 8 Dec 2014

I was trying to check if the polynomial evaluation in Yeppp benefits from SIMD and can give faster log computations on vectors. I extracted this from @simonbyrne's implementation of Yeppp's log. So far, the simd version is no faster than the regular one.

function loghorner!(b::Vector{Float64}, a::Vector{Float64})
    n = length(a)
    for i=1:n
        @inbounds b[i] = loghorner(a[i])
    end
end

function loghorner_simd!(b::Vector{Float64}, a::Vector{Float64})
    n = length(a)
    @simd for i=1:n
        @inbounds b[i] = loghorner(a[i])
    end
end

@inline function loghorner(x::Float64)
    Base.Math.@horner(x,
            -0x1.FFFFFFFFFFFF2p-2,
            0x1.5555555555103p-2,
            -0x1.00000000013C7p-2,
            0x1.9999999A43E4Fp-3,
            -0x1.55555554A6A2Bp-3,
            0x1.249248DAE4B2Ap-3,
            -0x1.FFFFFFBD8606Dp-4,
            0x1.C71C90DB06248p-4,
            -0x1.9999C5BE751E3p-4,
            0x1.745980F3FB889p-4,
            -0x1.554D5ACD502ABp-4,
            0x1.3B4ED39194B87p-4,
            -0x1.25480A82633AFp-4,
            0x1.0F23916A44515p-4,
            -0x1.EED2E2BB64B2Ep-5,
            0x1.EA17E14773369p-5,
            -0x1.1654764F478ECp-4,
            0x1.0266CD08DB2F2p-4,
            -0x1.CC4EC078138E3p-6)
end

Cc: @ArchRobison @aviks

ViralBShah on 18 Dec 2014

Unrolling is necessary to extract sufficient parallelism in polynomial evaluation. Optimal unroll factor depends on the microarchitecture, e.g. 8 SIMD registers/iteration for Nehalem/Sandy Bridge/Ivy Bridge, 10 SIMD registers for Haswell, 5 AVX registers for Bulldozer/Piledriver/Streamroller. Maybe Julia needs a vector polynomial evaluation function?
Note: LLVM does not detect/unroll latency-bound loops automatically.

Maratyszcza on 18 Dec 2014

@ViralBShah From the IR, it looks like LLVM already vectorizes (but doesn't partially unroll) loghorner! without @simd. The only difference is that the version with @simd elides a check for aliasing before the loop executes.

simonster on 18 Dec 2014

@simonbyrne I did notice the vectorized operations in the IR, but was hoping that @simd would do a few more things.

@Maratyszcza It is a good idea to have a vector polynomial evaluation, especially if we get all the unrolling and stuff implemented right.

ViralBShah on 18 Dec 2014

Also, it would be nice to leverage the FMAs here.

ViralBShah on 18 Dec 2014

@simonbyrne Could you update the openlibm log implementation or provide a julia implementation that is faster? I am looking for the fastest log I can get right now, and even 10% improvement is worthwhile.

ViralBShah on 18 Jan 2015

I'll try to get a PR together this week. It would be nice to have the FMA stuff sorted before we merge it.

simonbyrne on 19 Jan 2015

I just benchmarked Yeppp, and it is about 8x faster for log and exp for vectors of size 10,000. Between 1000 to 100000 sized vectors, one gets anywhere between 2x to 8x. At smaller vector lengths, sometimes julia is faster and sometimes Yeppp - too much measurement noise in my command line tests.

ViralBShah on 6 Apr 2015

@Maratyszcza What are your long term plans for Yeppp? Do you plan to release updates for newer processors as they come out? Just curious - since the performance gains are worth having through a Yeppp package.

ViralBShah on 6 Apr 2015

@ViralBShah We do plan to optimize Yeppp! for Sky Lake, but it is not a high-priority project.

Maratyszcza on 9 Apr 2015

Thanks. Good to know. I just cleaned up Yeppp.jl and made it a Julia package.

https://github.com/JuliaLang/Yeppp.jl

ViralBShah on 9 Apr 2015

👍3

Will #27137 address this?

ViralBShah on 25 May 2018

Not directly, but it will make it easier once we have a way to write SIMD kernels (ideally something like http://ispc.github.io/)

simonbyrne on 25 May 2018

👍1

Julia is being slow in the benchmark from No, Python is Not Too Slow for Computational Economics because of the log function. I profiled the code using Julia 1.0.2 and Profile.@profile, most of the time is spent in the log calculation that happens in the inner loop. The Julia code is in this file: https://github.com/jstac/julia_python_comparison/blob/master/RBC_Julia.jl

diegozea on 17 Dec 2018

A SIMD-vectorizable log function can be found in SLEEF:

EXPORT CONST double xlog(double d) {
  double x, x2, t, m;
  int e;

  int o = d < DBL_MIN;
  if (o) d *= (double)(1LL << 32) * (double)(1LL << 32);

  e = ilogb2k(d * (1.0/0.75));
  m = ldexp3k(d, -e);

  if (o) e -= 64;

  x = (m-1) / (m+1);
  x2 = x * x;

  t = 0.153487338491425068243146;
  t = mla(t, x2, 0.152519917006351951593857);
  t = mla(t, x2, 0.181863266251982985677316);
  t = mla(t, x2, 0.222221366518767365905163);
  t = mla(t, x2, 0.285714294746548025383248);
  t = mla(t, x2, 0.399999999950799600689777);
  t = mla(t, x2, 0.6666666666667778740063);
  t = mla(t, x2, 2);

  x = x * t + 0.693147180559945286226764 * e;

  if (xisinf(d)) x = INFINITY;
  if (d < 0 || xisnan(d)) x = NAN;
  if (d == 0) x = -INFINITY;

  return x;
}

eschnett on 20 Dec 2018

❤2

or here, right?
https://github.com/musm/SLEEF.jl/blob/b089af504632f29b694c119f9d4fbbfe0441547b/src/log.jl#L111

pkofod on 21 Dec 2018

I should have picked log as the example, but I did try vectorizing exp and got similar performance (actually better for Float64) to the C version:
https://github.com/musm/SLEEF.jl/issues/14

Would be really neat if the compiler could autovectorize these functions in loops!

chriselrod on 21 Dec 2018

Chiming in here as I stumbled across the slow performance of log in Julia. numpy.log seems to be about 6x faster than Base.log broadcast to an array.

function test_log()
    x = randn(10000) .+ 5
    @benchmark log.($x)
end

@benchmark test_log() yields:

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     49.861 μs (0.00% GC)
  median time:      80.124 μs (0.00% GC)
  mean time:        97.785 μs (12.32% GC)
  maximum time:     43.889 ms (99.65% GC)
  --------------
  samples:          10000
  evals/sample:     1

As compared to:

x = np.random.randn(10000) + 5
%timeit np.log(x)

Which yields:

15 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

pmagwene on 11 Apr 2019

The evaluation of log and exp is often a bottleneck in applications so it's a concern that we don't perform as well as NumPy. Do you know if your NumPy linked against MKL? It would be great if would could start utilizing vectorized math libraries automatically when broadcasting math functions such as exp and log.

andreasnoack on 11 Apr 2019

Yes, my NumPy is linked against MKL -- I believe that's been the default for a while for the NumPy/SciPy toolchain distributed by Anaconda. Here's more info about my setups on Julia and Python.

OS: MacOS 10.14.4

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

>>> import sys
>>> print(sys.version)
3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
>>> import numpy
>>> numpy.__version__
'1.15.4'
>>> numpy.show_config()
mkl_info:
    libraries = ['mkl_rt', 'pthread']
...
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
...
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
...
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
...
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
...

pmagwene on 11 Apr 2019

Expanding on my comment here, there are a couple of things we would need to do to match (or make use of) MKL:

_Write faster scalar functions_. We use translated versions of openlibm (which were based on FreeBSD's libm, which in turn descend from fdlibm written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting fma instructions on newer architectures).
~~_Provide/build micro-arch specific system images_, since fma instructions are not available on all x86_64 processors.~~: we already do this.
_Provide hooks to use vectorised kernels_, such as SVML, SLEEF, Yeppp or Apple's Accelerate library, for operations like broadcast and reduce. LLVM provides such hooks for its own intrinsics: we don't currently use these because they are hard coded to call the system math library, not our own scalar functions. Apparently this is fixable, but the best option would be to have a general framework to do this for arbitrary functions, not just the those blessed by LLVM (#15265, #21454)
_A framework to write our own vectorised kernels_. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks.

simonbyrne on 11 Apr 2019

❤7

Provide/build micro-arch specific system images, since fma instructions are not available on all x86_64 processors.

We have this! The default binaries use the sysimg multiversioning string generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1), which allows specialization up to the haswell ISA to be used on processors that support it. Looking at the wiki page for FMA instructions, it looks like that's all we need.

staticfloat on 11 Apr 2019

👍4

_Provide hooks to use vectorised kernels_, such as SVML, SLEEF, Yeppp or Apple's Accelerate library, for operations like broadcast and reduce.

At least in this specific case log.(x) I did try both SLEEF and Yeppp, and the timings were actually worse than Base.log for my simple test above.

pmagwene on 11 Apr 2019

Since you're on a Mac, you could try AppleAccelerate.jl, but first you would have to update it to 1.0.

simonbyrne on 11 Apr 2019

We really need to make it easy to get MKL and switch to using it.

ViralBShah on 11 Apr 2019

👍5

Provide/build micro-arch specific system images, since fma instructions are not available on all x86_64 processors.

We have this! The default binaries use the sysimg multiversioning string generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1), which allows specialization up to the haswell ISA to be used on processors that support it. Looking at the wiki page for FMA instructions, it looks like that's all we need.

Ah, I missed that!

We really need to make it easy to get MKL and switch to using it.

While that might help BLAS/LAPACK operations, it won't do much for log, exp, etc without point 3 above.

simonbyrne on 11 Apr 2019

Sorry - I just use MKL as a stand-in for Intel accelerated libraries.

ViralBShah on 11 Apr 2019

1. _Write faster scalar functions_. We use translated versions of openlibm (which were based on FreeBSD's libm, which in turn descend from fdlibm written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting `fma` instructions on newer architectures).

No? Aren't we using the Tang's method version now? https://github.com/JuliaLang/julia/pull/24750

pkofod on 11 Apr 2019

A framework to write our own vectorised kernels. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks.

Have you seen Hydra.jl?

Also, @musm 's SLEEF port is a good place to look for SIMD-able functions already written in Julia with a compatible license. Adding @inline to most of the functions is enough to get them to vectorize in for loops.
However, it's probably undesirable to inline them when they are used as scalar calls. Maybe instead of @inline, some changes to Julia's inline heuristics could get it to make the correct decision?
Although the current versions are probably preferable when called on a singe scalar. Maybe @fastmath could point to SLEEF functions marked with @inline?

For example, a copy and paste from here:

using SLEEF, BenchmarkTools

@inline function log_inline(d::T) where {T<:Union{Float32,Float64}}
    o = d < floatmin(T)
    o && (d *= T(Int64(1) << 32) * T(Int64(1) << 32))

    e = SLEEF.ilogb2k(d * T(1.0/0.75))
    m = SLEEF.ldexp3k(d, -e)
    o && (e -= 64)

    x  = SLEEF.ddiv(SLEEF.dadd2(T(-1.0), m), SLEEF.dadd2(T(1.0), m))
    x2 = x.hi*x.hi

    t = SLEEF.log_kernel(x2)

    s = SLEEF.dmul(SLEEF.MDLN2(T), T(e))
    s = SLEEF.dadd(s, SLEEF.scale(x, T(2.0)))
    s = SLEEF.dadd(s, x2*x.hi*t)
    r = T(s)

    isinf(d) && (r = T(Inf))
    (d < 0 || isnan(d)) && (r = T(NaN))
    d == 0 && (r = -T(Inf))

    return r
end

@inline function log_fast_inline(d::T) where {T<:Union{Float32,Float64}}
    o = d < floatmin(T)
    o && (d *= T(Int64(1) << 32) * T(Int64(1) << 32))

    e = SLEEF.ilogb2k(d * T(1.0/0.75))
    m = SLEEF.ldexp3k(d, -e)
    o && (e -= 64)

    x  = (m - 1) / (m + 1)
    x2 = x * x

    t = SLEEF.log_fast_kernel(x2)

    x = x * t + T(SLEEF.MLN2) * e

    isinf(d) && (x = T(Inf))
    (d < 0 || isnan(d)) && (x = T(NaN))
    d == 0 && (x = -T(Inf))

    return x
end

function log_base!(y, x)
    @inbounds for i in eachindex(x,y)
        y[i] = log(x[i])
    end
end
function log_sleef!(y, x)
    @inbounds for i in eachindex(x,y)
        y[i] = SLEEF.log(x[i])
    end
end
function log_sleef_fast!(y, x)
    @inbounds for i in eachindex(x,y)
        y[i] = SLEEF.log_fast(x[i])
    end
end
function log_sleef_inline!(y, x)
    @inbounds for i in eachindex(x,y)
        y[i] = log_inline(x[i])
    end
end
function log_sleef_fast_inline!(y, x)
    @inbounds for i in eachindex(x,y)
        y[i] = log_fast_inline(x[i])
    end
end

x = rand(200); y = similar(x);
@benchmark log_base!($y, $x)
@benchmark log_sleef!($y, $x)
@benchmark log_sleef_fast!($y, $x)
@benchmark log_sleef_inline!($y, $x)
@benchmark log_sleef_fast_inline!($y, $x)

Results on a computer with avx2 (Ryzen):

julia> @benchmark log_base!($y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.108 μs (0.00% GC)
  median time:      1.146 μs (0.00% GC)
  mean time:        1.148 μs (0.00% GC)
  maximum time:     2.790 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark log_sleef!($y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.856 μs (0.00% GC)
  median time:      2.911 μs (0.00% GC)
  mean time:        2.939 μs (0.00% GC)
  maximum time:     7.420 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark log_sleef_fast!($y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.694 μs (0.00% GC)
  median time:      1.718 μs (0.00% GC)
  mean time:        1.727 μs (0.00% GC)
  maximum time:     9.869 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark log_sleef_inline!($y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.237 μs (0.00% GC)
  median time:      1.262 μs (0.00% GC)
  mean time:        1.263 μs (0.00% GC)
  maximum time:     3.209 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark log_sleef_fast_inline!($y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     571.902 ns (0.00% GC)
  median time:      586.141 ns (0.00% GC)
  mean time:        591.278 ns (0.00% GC)
  maximum time:     879.457 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     184

Results with avx512:

julia> @benchmark log_base!($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.021 μs (0.00% GC)
  median time:      1.028 μs (0.00% GC)
  mean time:        1.030 μs (0.00% GC)
  maximum time:     3.740 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark log_sleef!($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.723 μs (0.00% GC)
  median time:      2.731 μs (0.00% GC)
  mean time:        2.736 μs (0.00% GC)
  maximum time:     4.694 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark log_sleef_fast!($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.317 μs (0.00% GC)
  median time:      1.325 μs (0.00% GC)
  mean time:        1.327 μs (0.00% GC)
  maximum time:     3.138 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark log_sleef_inline!($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     408.535 ns (0.00% GC)
  median time:      416.595 ns (0.00% GC)
  mean time:        415.228 ns (0.00% GC)
  maximum time:     4.026 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     200

julia> @benchmark log_sleef_fast_inline!($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     186.160 ns (0.00% GC)
  median time:      186.397 ns (0.00% GC)
  mean time:        189.494 ns (0.00% GC)
  maximum time:     416.618 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     595

@code_native shows that the inlined loops are indeed vectorized.

chriselrod on 12 Apr 2019

@chriselrod -- Thanks for your SLEEF examples. I had tried SLEEF, but only SLEEF.log_fast which as your example shows performs worse than Base.log. However, I tried your log_sleef_fast_inline and indeed this starts to approach NumPy w/MKL speeds for the example I used previously. Here's the benchmark I get on my system using x = randn(10000) .+ 5 as input:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.697 μs (0.00% GC)
  median time:      15.074 μs (0.00% GC)
  mean time:        17.046 μs (0.00% GC)
  maximum time:     113.653 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

This is on par with the ~15 μs per call I was getting with NumPy + MKL.

pmagwene on 12 Apr 2019

If you want to assume the results are finite, and eliminate the checks:

@inline function log_fast_inline(d::T) where {T<:Union{Float32,Float64}}
    o = d < floatmin(T)
    o && (d *= T(Int64(1) << 32) * T(Int64(1) << 32))

    e = SLEEF.ilogb2k(d * T(1.0/0.75))
    m = SLEEF.ldexp3k(d, -e)
    o && (e -= 64)

    x  = (m - 1) / (m + 1)
    x2 = x * x

    t = SLEEF.log_fast_kernel(x2)

    x = x * t + T(SLEEF.MLN2) * e

    # isinf(d) && (x = T(Inf))
    # (d < 0 || isnan(d)) && (x = T(NaN))
    # d == 0 && (x = -T(Inf))

    return x
end

it should knock off another 10% or more from the time and beat NumPy + MKL.

chriselrod on 12 Apr 2019

Is there any enthusiasm among the core developers for @chriselrod's suggestion that "Maybe @fastmath could point to SLEEF functions marked with @inline?".

pmagwene on 12 Apr 2019

It should be possible to do that from a package: i.e. SLEEF could override Base.FastMath.log_fast for those cases.

@fastmath functionality itself needs a bit of a revamp (#26828) and should probably be moved to a package.

simonbyrne on 12 Apr 2019

SLEEF prides itself to provide functions that are very accurate. If speed is of importance, then people often forego accuracy, stating e.g. that the last few digits might not be accurate. (If memory serves me well, OpenCL allows the last 4 bits to be wrong.) Comparing speed alone will not work; one will also have to compare accuracy.

Currently, Julia does not have a proper mechanism for people to choose what accuracy they want. Generally, using @fastmath means that people will be happy with less accuracy, but that isn't quantified anywhere. Using the OpenCL (or some other) accuracy requirements for @fastmath might make sense, and requiring results to be as accurate as possible by default would also be good.

eschnett on 13 Apr 2019

👍2

@eschnett That's a very good point and with the risk of a tangent, are there mechanisms that allow one to systematically trade off accuracy for performance?

ViralBShah on 13 Apr 2019

@ViralBShah Yes, there are several.

SLEEF might be using higher than usual precision for parts of the calculation to ensure accuracy, i.e. it might use a double-double representation when converting from base-2 log to natural log (I don't know whether it actually does so, but it might), or when dividing by pi when calculating a sine. You can just opt to not do that. It's a priori unclear how much accuracy is lost, but the speedup should be measurable.
The actual calculation is often either iterative (e.g. Newton iterations for square root) or involves a series (e.g. Taylor series for sin). You can do one fewer iteration, or leave out one or two terms in the series.

eschnett on 13 Apr 2019

Hello,

Please consider making the Julia port based on the latest version. I believe that BSL is permissive enough. The latest version should be faster than version 2.

You can do one fewer iteration, or leave out one or two terms in the series.

In most of the cases, that will reduce the accuracy too much. It’s not easy to reduce the accuracy a little to get a large speed-up. I think LTO would be an easy and certain way.

shibatch on 13 Apr 2019

We really need to make it easy to get MKL and switch to using it.

While that might help BLAS/LAPACK operations, it won't do much for log, exp, etc without point 3 above.

@simonbyrne , Intel has 3 main libraries:

Intel MKL.
Intel IPP.
Intel SVML.

In MKL there are many things which all have in common - Work on large arrays.
It has BLAS / LAPACK, Solvers, Random Number Generators, DFT (FFT), etc...
One of its features is Intel MKL VML. This provides many element wise operations on arrays. So for that case indeed MKL is the answer (Or could be the answer). This is exactly what's used by Numpy / Scipy.

Of course, behind the scene, Intel VML is just looping over the array and using SVML Kernels.

RoyiAvital on 30 Sep 2019

Vectorize.jl is a nice way nowadays to deal with the vectorized cases - but it would still be nice to have a fast standalone log.

ViralBShah on 25 Feb 2020

👍1

Julia is already shipping the Tang algorithm which appears to be faster than the system libm and openlibm (unless I am somehow messing up this benchmarking):

julia> @benchmark ccall((:log, :libopenlibm), Float64, (Float64, ), 1000.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.011 ns (0.00% GC)
  median time:      9.126 ns (0.00% GC)
  mean time:        9.184 ns (0.00% GC)
  maximum time:     35.417 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark ccall((:log, :libm), Float64, (Float64, ), 1000.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.851 ns (0.00% GC)
  median time:      4.889 ns (0.00% GC)
  mean time:        4.934 ns (0.00% GC)
  maximum time:     24.782 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark log(1000.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.420 ns (0.00% GC)
  median time:      1.434 ns (0.00% GC)
  mean time:        1.903 ns (0.00% GC)
  maximum time:     1.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

ViralBShah on 8 Mar 2020

I think for the original purpose of this issue, it is resolved. The discussion around vectorized implementations and fastmath could be carried out elsewhere.

ViralBShah on 8 Mar 2020

I'm pretty sure the julia version had the computation constant propagated out.

yuyichao on 8 Mar 2020

Yeah probably right. Here's a better example:

julia> a = abs.(randn(10^3));

julia> @benchmark log.(a)
BenchmarkTools.Trial: 
  memory estimate:  8.00 KiB
  allocs estimate:  4
  --------------
  minimum time:     7.796 μs (0.00% GC)
  median time:      8.399 μs (0.00% GC)
  mean time:        9.844 μs (8.69% GC)
  maximum time:     1.795 ms (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 1 method)

julia> @benchmark f.(a)
BenchmarkTools.Trial: 
  memory estimate:  8.00 KiB
  allocs estimate:  4
  --------------
  minimum time:     9.934 μs (0.00% GC)
  median time:      10.366 μs (0.00% GC)
  mean time:        10.828 μs (0.00% GC)
  maximum time:     39.387 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> g(x::Float64) = ccall((:log, :libm), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> @benchmark g.(a) 
BenchmarkTools.Trial: 
  memory estimate:  8.00 KiB
  allocs estimate:  4
  --------------
  minimum time:     6.013 μs (0.00% GC)
  median time:      6.494 μs (0.00% GC)
  mean time:        7.372 μs (11.12% GC)
  maximum time:     1.082 ms (99.04% GC)
  --------------
  samples:          10000
  evals/sample:     6

We are in between the Apple libm and openlibm.

ViralBShah on 8 Mar 2020

I believe you actually just want log($(1000.0)). a should also be $a but it matters less in this particular case.

yuyichao on 8 Mar 2020

Using $a speeds up all the 3 cases by 0.5 μs.

ViralBShah on 8 Mar 2020

That seems to work, and is simpler than the Ref approach I've been using:

julia> @benchmark log($(Ref(1000.0))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.541 ns (0.00% GC)
  median time:      4.582 ns (0.00% GC)
  mean time:        4.592 ns (0.00% GC)
  maximum time:     14.723 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.463 ns (0.00% GC)
  median time:      4.488 ns (0.00% GC)
  mean time:        4.498 ns (0.00% GC)
  maximum time:     14.917 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark log(1000.0)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.103 ns (0.00% GC)
  median time:      1.105 ns (0.00% GC)
  mean time:        1.111 ns (0.00% GC)
  maximum time:     11.283 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

For comparison:

julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so"), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> g(100.0)
ERROR: could not load library "/usr/lib64/libm.so"
/usr/lib64/libm.so: invalid ELF header
Stacktrace:
 [1] g(::Float64) at ./REPL[16]:1
 [2] top-level scope at REPL[17]:1

julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> g(100.0)
4.605170185988092

julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 2 methods)

julia> @benchmark f($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.245 ns (0.00% GC)
  median time:      6.261 ns (0.00% GC)
  mean time:        6.273 ns (0.00% GC)
  maximum time:     16.794 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark g($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.936 ns (0.00% GC)
  median time:      3.946 ns (0.00% GC)
  mean time:        3.958 ns (0.00% GC)
  maximum time:     14.315 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Any idea why I may be getting an invalid elf header error?

chriselrod on 8 Mar 2020

For what it's worth, on linux, log appears to be slower than log from glibc libm:

julia> @btime ccall((:log, :libopenlibm), Float64, (Float64,), $1000.0)
  8.754 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime ccall((:log, "/usr/lib/libm.so.6"), Float64, (Float64,), $1000.0)
  5.306 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime log(Ref($1000.0)[])
  6.297 ns (0 allocations: 0 bytes)
6.907755278982137

EDIT: beat to the punch it looks like.

non-Jedi on 8 Mar 2020

libm.so is usually a linker script. In fact, all .so files in LD_LIBRARY_PATH are mainly meant for compile time.

yuyichao on 8 Mar 2020

julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.463 ns (0.00% GC)
  median time:      4.488 ns (0.00% GC)
  mean time:        4.498 ns (0.00% GC)
  maximum time:     14.917 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Huh. This doesn't work for avoiding constant propagation on my machine. Wonder what the difference is.

julia> @benchmark log($(1000.0))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.017 ns (0.00% GC)
  median time:      0.019 ns (0.00% GC)
  mean time:        0.019 ns (0.00% GC)
  maximum time:     0.033 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

non-Jedi on 8 Mar 2020

On my computer, FMA_NATIVE should be true, using the test here:

https://github.com/JuliaLang/julia/blob/17ad9223fa33d565a0cae9ce6023af14753f4eb1/base/special/log.jl#L144

but, it is not getting used. I've built Julia from source on this computer:

julia> const FMA_NATIVE = muladd(nextfloat(1.0),nextfloat(1.0),-nextfloat(1.0,2)) != 0
true

julia> Base.Math.FMA_NATIVE
false

ViralBShah on 8 Mar 2020

@non-Jedi I'm also on Linux, and found "/usr/lib(64)/libm.so.6" to be the fastest.
libmvec.so on Linux is also extremely fast, and provides SIMD implementations of a few functions. SLEEFPirates/LoopVectorization will use it when it can find it under "/usr/lib64/", "/usr/lib", or "/lib/x86_64-linux-gnu". Any list giving all the directories to search to cover most of the Linux distros?

I'm on the latest Julia master. What version are you on?

@ViralBShah I have the same problem. Here is a issue:
https://github.com/JuliaLang/julia/issues/33011

I'll try hard coding it to true, recompiling, and checking the difference that makes.

chriselrod on 8 Mar 2020

My tests suggest that the hardcoding the FMA to true has very little effect. Looking into the code to see what values it is triggered for.

ViralBShah on 8 Mar 2020

For log, the proc2 method that uses FMA is only invoked for a small range of values:
https://github.com/JuliaLang/julia/blob/17ad9223fa33d565a0cae9ce6023af14753f4eb1/base/special/log.jl#L258

Even in those cases, I can't see a noticeable difference.

ViralBShah on 8 Mar 2020

@chriselrod It is probably easier to experiment with libm.jl for this purpose

ViralBShah on 8 Mar 2020

Interesting. I'm not seeing much of a difference either.
I already rebuilt Julia, but Libm makes it easy to test both at the same time:

julia> Libm.is_fma_fast(Float64)
false

julia> @benchmark Base.log($(1.04))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.653 ns (0.00% GC)
  median time:      3.678 ns (0.00% GC)
  mean time:        3.698 ns (0.00% GC)
  maximum time:     20.121 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark Libm.log($(1.04))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.656 ns (0.00% GC)
  median time:      3.917 ns (0.00% GC)
  mean time:        3.884 ns (0.00% GC)
  maximum time:     13.617 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> Base.Math.FMA_NATIVE
true

It is faster than the system libm version for 1.04 on this computer (which again takes about 3.94ns).

chriselrod on 8 Mar 2020

Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl

julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)

julia> @btime l(Ref($1000.0)[])
  4.303 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime log(Ref($1000.0)[])
  5.790 ns (0 allocations: 0 bytes)
6.907755278982137

ViralBShah on 8 Mar 2020

Reopening in order to consider whether we want to use the llvm version.

ViralBShah on 8 Mar 2020

I get nearly identical performance between the llvm and Julia logs

julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)

julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> @btime l(Ref($1000.0)[])
  4.583 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime log(Ref($1000.0)[])
  4.571 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime g(Ref($1000.0)[])
  3.938 ns (0 allocations: 0 bytes)
6.907755278982137

If they're both using the same algorithm, it's interesting that you see a difference.

chriselrod on 8 Mar 2020

Note that what I have linked is log1p, but it does illustrate doing a few things differently:

https://github.com/llvm-mirror/libclc/blob/9aa6f350a6ce0f2cfc7e489495af8899ca74e079/generic/lib/math/log1p.cl#L105

    // Note that we use a lookup table of size 64 rather than 128,
    // and compensate by having extra terms in the minimax polynomial
    // for the kernel approximation.

ViralBShah on 8 Mar 2020

Where does one find the glibc log implementation?

ViralBShah on 8 Mar 2020

On someone's private github mirror, here I think:

https://github.com/bminor/glibc/blob/92b963699aae2da1e25f47edc7a0408bf3aee4d2/sysdeps/ieee754/dbl-64/e_log.c

EDIT: note that glibc is obvious licensed under the GPL, so any implementation that could be considered "derived" from glibc must also be so licensed.

non-Jedi on 8 Mar 2020

Looks surprisingly like the musl log: https://git.musl-libc.org/cgit/musl/tree/src/math/log.c

But one is MIT licensed and the other is GPL. Hmm.

ViralBShah on 8 Mar 2020

Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl
julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)

IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just got lowered to libm functions. The one you linked is the OpenCL C library, which shouldn' be what you are using directly....

yuyichao on 8 Mar 2020

👍1

IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just got lowered to libm functions. The one you linked is the OpenCL C library, which shouldn' be what you are using directly....

Ok. That's what I thought originally. However, the llvm log seems slightly faster than the system libm on my mac.

ViralBShah on 8 Mar 2020

You can check code_native and/or in the debugger to make sure/compare.

Also note that I'm not sure what log function has the best performance and I'm merely responding to obvious issues in my notification. A few other issues to consider,

Most FMA check (at least any that derived from the one in base) is unreliable.
Last time I check LLVM still doesn't clear floating point registers aggressive enough on x86. Calling into a different library compiled using different compilers/flags also affect this. Comparing different implementation could be unfair due to this. (A julia implementation, for example would be compiled with exactly the same flag as the benchmark caller, a C implementation likely won't, glibc libm has uarch dispatch, so does sleef, libmvec and in some sense julia, whereas openlibm doesn't)
And this is not taking into account the obvious compiler difference. I've seen the same math function compiled with gcc and clang to have very different performance (to the level of the difference you are seeing here.) I usually see GCC wins IIRC but Clang also comes ahead sometimes.

Anyway, I have no idea what's the best log implemenation, or any libm function for that matter..... However, if you are interested in algorithm, just blinding doing random testing may not give the most sensible result and I've certainly seen cases where the small effects matters at this level......

yuyichao on 8 Mar 2020

👍1

We clearly have the right algorithm and the original purpose of this issue is resolved. The julia log is overall reasonably competitive if not the fastest. There may be value in calling the system log in certain cases (mac and linux64 it appears), but we should perhaps open a new issue for that and it would need significant testing in cases where we use the system provided versions.

ViralBShah on 8 Mar 2020

Was this page helpful?

0 / 5 - 0 ratings