I got interested in Julia because it seems an outstanding new language. However, on the front page of your website you show a speed comparison of different languages, and at first glance I knew the timings for Mathematica were wrong.
I checked your perf.nb, and indeed, no one would ever implement simple functions like fib the way you have done. Let me give an example, using your code. You defined the Fibonacci sequence as
fib = Compile[{{n, _Integer}}, If[n < 2, n, fib[n - 1] + fib[n - 2]]]
For fib[30] this gives a timing of 0.5 seconds on my machine. No one would use recursive calls inside a compiled function, because they are not compiled down. This is common knowledge. What happens is that each time the inner fib is called, execution leaves the compiled code, asks the Kernel for the current definition of fib, and then jumps back into the compiled code. You could have placed some Pause calls with the same effect.
Let's consider this simple, high-level recursive definition that uses memoization to store intermediate results:
fib1[1] = 1;
fib1[2] = 1;
fib1[n_] := fib1[n] = fib1[n - 2] + fib1[n - 1];
AbsoluteTiming[fib1[30]]
No Compile is needed, and it is clear, functional-style code. On the first run it takes an almost unmeasurable 0.000305 seconds. Each further call returns instantaneously because the results are cached.
If you want to use Compile, then use it correctly. In this case, that means writing the algorithm iteratively:
fibC = Compile[{{n, _Integer, 0}},
  Module[{f1 = 1, f2 = 1, res = 1},
    If[n < 3, Return[res]];
    Do[
      res = f1 + f2;
      f1 = f2;
      f2 = res,
      {i, n - 2}
    ];
    res
  ]
]
AbsoluteTiming[fibC[30]]
This implementation needs only 0.00004 seconds, so the Mathematica implementation you use to compare against Julia is 12,500 times slower. I believe this does not give a correct picture of the situation.
Another example is your mandel test. Did you know that Mathematica can easily vectorize calculations? You don't have to change anything in the algorithm; you just have to use this feature.
mandel = Compile[{{zin, _Complex, 0}},
  Module[{z = zin, c = zin, maxiter = 80, n = 0},
    Do[
      If[Abs[z] > 2, maxiter = n - 1; Break[]];
      z = z^2 + c,
      {n, 1, maxiter}
    ];
    maxiter
  ],
  CompilationTarget -> "C",
  RuntimeAttributes -> {Listable},
  Parallelization -> True
];
Now you can apply mandel in the usual way, as you did, with
AbsoluteTiming[Table[mandel[r + i*I], {i, -1., 1., 0.01}, {r, -2.0, 0.5, 0.01}];]
which needs about 0.15 seconds here, or you can apply it in parallel by providing the whole complex matrix instead of each number separately:
AbsoluteTiming[mandel@Table[r + i*I, {i, -1., 1., 0.01}, {r, -2.0, 0.5, 0.01}];]
which needs only 0.09 seconds. Using Outer instead of Table to create the matrix makes it another 3x faster.
If you really want to make your comparison correct, and not misleading as it currently is, why don't you head over to Mathematica.SE and ask for simple reference implementations of your test methods?
The purpose of the benchmarks is clearly stated: it's to benchmark recursion, not to benchmark the fastest way to calculate fib. Of course no one would ever write code like this to implement these functions, but there are many other cases where you can't apply the trick that you use here. If there's another simple case that can't be reduced to a cached table (since we are benchmarking function-call performance, not caching performance), feel free to propose one instead.
And to show that we really are benchmarking the generic algorithm (i.e., recursion/function calls), you can easily implement a much faster iterative fib in Julia too:
julia> function fib(n)
           f1 = 1
           f2 = 1
           for i in 3:n
               f1, f2 = f2, f1 + f2
           end
           f2
       end
fib (generic function with 1 method)
julia> @btime fib(30)
10.782 ns (0 allocations: 0 bytes)
832040
And I don't think the ~4000x timing difference comes mainly from the hardware.
not to benchmark the fastest way to calculate fib
That is surely not clear, because the descriptive text below the table says that these benchmarks are meant to
test the performance of __specific algorithms__ implemented in each language.
The specific algorithm here is the calculation of Fibonacci numbers, but now you say that this wasn't the goal. The goal was rather to measure the speed of function calls, which is surely slower in Mathematica.
In pisum, the Julia implementation follows the C implementation exactly. If you had done the same in Mathematica, you would have seen that it is only slightly slower than Julia (17.95 ms compared to 18.51 ms on my machine):
pisum2 = Compile[{},
  Module[{sum = 0., n = 10000},
    Do[
      sum = 0.0;
      Do[sum += 1.0/(k*k), {k, n}],
      {500}
    ];
    sum
  ],
  CompilationTarget -> "C", RuntimeOptions -> "Speed"
]
I like that Julia doesn't drop performance if you implement pisum with sum(k -> 1.0/(k*k), 1:10000). Anyway, no harm was done and I thank you for your comment.
The specific algorithm is the calculation of the Fibonacci numbers
The term is perhaps ambiguous, but the intention here is that the implementation approaches should be the same. Otherwise, the fastest way to compute the results of all of the benchmarks is simply to hard-code the answers: the Fibonacci numbers are a known sequence.
That pisum change seems to be in the spirit of the intended benchmark, if you would like to submit a PR.
@halirutan – you're conflating problems with algorithms. The problem in this case is calculating Fibonacci numbers; the algorithm is using an (unmemoized) doubly recursive function. I agree with @vtjnash that pisum change seems in the spirit of the benchmarks. While the memoized doubly-recursive Mathematica algorithm for the fib benchmark is impressively elegant – memoization is very clean in Mathematica – it's not the same algorithm, so the change isn't kosher.
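To make the problem-versus-algorithm distinction concrete, here is a small sketch in Python (chosen for neutrality; the function names are illustrative, not from perf.nb). Both functions solve the same problem, but the call counts reveal that they are different algorithms: the unmemoized version makes exponentially many calls, which is exactly what the benchmark exercises, while the memoized version makes only linearly many.

```python
# Same problem (Fibonacci numbers), two different algorithms.
calls = 0

def fib_naive(n):
    # Unmemoized doubly recursive algorithm: what the benchmark measures.
    global calls
    calls += 1
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

memo = {}

def fib_memo(n):
    # Memoized variant, analogous to fib1[n_] := fib1[n] = ... above:
    # each value is computed once, so the call count is only linear.
    global calls
    calls += 1
    if n < 2:
        return n
    if n not in memo:
        memo[n] = fib_memo(n - 1) + fib_memo(n - 2)
    return memo[n]

calls = 0
print(fib_naive(20), calls)  # 6765, after 21891 calls
calls = 0
print(fib_memo(20), calls)   # 6765, after only 39 calls
```

Same answers, wildly different amounts of function-call work, which is the quantity the benchmark is designed to measure.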
@StefanKarpinski Yes, I understand all your points, and if I had seen your talk earlier, I wouldn't have elaborated on the Fibonacci recursion at all. Just give me the benefit of the doubt as someone who visited the Julia website for the first time, and the very first thing they see is a table where Mathematica, a computer algebra system, has almost the worst timing in calculating the Fibonacci numbers.
As you said, you probably "spend some more work" on your own implementations (although they look very clean and straightforward to me), and maybe there is room for improvements that are indeed kosher. Mathematica is pretty shitty with quicksort. Using the very same algorithm, but replacing the recursive calls with an explicit stack, only adds a few lines to the implementation, and on my machine I can cut the runtime in half:
mathematica,quicksortOld,11.343
mathematica,quicksort,4.324
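The "explicit stack" change is language-neutral, so here is a minimal sketch in Python of the idea (the Lomuto partitioning scheme is an illustrative choice, not necessarily what perf.wl uses): the algorithm is unchanged, but the two recursive calls become two pushes onto a manually managed stack of subranges.

```python
def quicksort_iterative(a):
    # In-place quicksort where recursion is replaced by an explicit
    # stack of (lo, hi) subranges. Same algorithm, no recursive calls.
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        # Lomuto partition around the last element.
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        # Instead of two recursive calls, push both subranges.
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return a

print(quicksort_iterative([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```

This trades call-stack frames for an explicit list, which is exactly the kind of change that helps in languages where function calls are expensive.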
As already mentioned, for the slow pi-sum it seems very advantageous to replace Sum with a summing loop, and I get the following improvement:
mathematica,pi_sumOld,35.948
mathematica,pi_sum,17.958
The mandel test has a comment that says "complex arithmetic and comprehensions". So if it is OK to use Mathematica's ability to apply a function automatically to a list of values, this might speed things up as well. Here, the improvement clearly depends on the system you are using, but I guess it doesn't hurt:
mathematica,mandelOld,1.716
mathematica,mandel,0.519
I ran all the code with Mathematica 11.1. If someone would like to test this, either download the file or run it directly:
Get["https://raw.githubusercontent.com/halirutan/julia/master/test/perf/micro/perf.wl"]
In the past we have merged several pull requests to improve benchmark implementations in other languages, so we are certainly open to that. However, updating the numbers is not so easy, since we need to have all the software running on the same machine, so changes might take a while to show up.
I agree some of your changes are probably good, but they're also examples of performance being fairly unpredictable.