Julia: performance of creating an array from a range

Created on 12 Nov 2013 · 24 comments · Source: JuliaLang/julia

I have tested the following three variations of an array-creation statement:

function func1(a, b, inc, iter)
    tic()
    for i=1:iter
        A = [a:inc:b]
    end
    toc()
end

function func2(a, b, inc, iter)
    A = Array(Float64, length(a:inc:b))
    tic()
    for i=1:iter
        A = [a:inc:b]
    end
    toc()
end

function func3(a, b, inc, iter)
    A = Array(Float64, length(a:inc:b))
    tic()
    for i=1:iter
        A[:] = a:inc:b
    end
    toc()
end

and their execution times with a=0, b=10000, inc=0.1, iter=1000

func1: 0.67 seconds
func2: 0.69 seconds
func3: 0.34 seconds

The equivalent Python/NumPy function

import numpy as np
from time import time

def func(a, b, inc, iter):
    t0 = time()
    for i in range(1000):
        A = np.arange(a, b, inc)
    print(time() - t0, " seconds elapsed")

takes about 0.1 seconds, and the equivalent MATLAB function

function func(a, b, inc, iter)
    tic()
    for i=1:iter
        A = [a:inc:b];
    end
    toc()

takes about 0.1 seconds.

[jiahao - edit for formatting. Please use triple backquotes for posting code, otherwise it's unreadable.

also x-ref julia-users]

performance

Most helpful comment

func1 and func2 need a trailing ; to do the same measurement ([a:inc:b; ])

All 24 comments

Thanks for filing the issue and trying out all these different variations!

Can you post the actual output of your test calls? As I mentioned on the mailing list, for me, func3's timings are equivalent to Python's.

(At one point, I thought that python was a lot faster, but that was because I had put the arguments in the wrong order.)

I get this on my laptop:

julia> [ f(0,10000,0.1,1000) for f=[func1,func2,func3] ]
3-element Array{Any,1}:
 0.689111
 0.671948
 0.309497

julia> versioninfo()
Julia Version 0.2.0-rc4
Commit a37b4d6 (2013-11-11 18:47 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.0.0)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

On the same machine, Python takes 0.091638 seconds, so there's definitely a discrepancy on some systems.

On the julia.mit.edu Linux machine (the official benchmark system), on the other hand, I get this:

>>> import time
>>> import numpy as np
>>> def func(a, b, inc, iter):
...   t0 = time.time()
...   for i in range(1000):
...     A = np.arange(a, b, inc)
...   print(time.time()-t0, " seconds elapsed")
...
>>> func(0,10000,0.1,1000)
(0.5556738376617432, ' seconds elapsed')

and Julia:

julia> [ f(0,10000,0.1,1000) for f=[func1,func2,func3] ]
3-element Array{Any,1}:
 1.06376
 1.06842
 0.564819

julia> versioninfo()
Julia Version 0.2.0-rc4+2
Commit 249fdca (2013-11-11 20:53 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

So this may be an OS X vs. Linux thing to some extent. @sungpily, @skariel – what operating systems are you on?

My results seem very consistent with those from Stefan. I am using a Windows 7 machine.

By the way, thanks for reformatting the original post. I could not figure out how.

IMHO, func3 looks "hackish" even though it performs better. I would be happy to see func1 and func2 work as fast as Python or MATLAB.

Yeah, the point is not that people should use that version. We're trying to figure out what versions are faster or slower and why so that they can all be faster.

This version is orders of magnitude faster than func3 above:

function func6(a, inc, b)
    A = Array(Float64, length(a:inc:b))
    for i=1:1000
        ix=1
        while a<b
            A[ix] = a
            ix+=1
            a+=inc
        end
    end
end

If you allocate on each iteration, it is still significantly faster than func3 above.

I also compared to C++; code is here. I'm on Win7, using Julia-RC3-amd64, Python 3.3, and the latest NumPy (all amd64).

A couple of interesting observations:

  • When not allocating and deallocating on each iteration (i.e., Julia's func3 above), Julia is as fast as C (when using the same algorithm)
  • When only allocating and deallocating, but not filling in the range, Julia is again as fast as C
  • When allocating/deallocating on each iteration, NumPy beats func6 above by a factor of ~2 (0.1 vs. 0.22 seconds)

So this seems like a GC issue to me: Julia ends up using colder memory, since the GC doesn't always have time to recycle recently used memory, so each allocation hands back fresh (cold) memory.

The only fix I think is appropriate is to optimize the range notation [a:inc:b] to use an algorithm like func6 above. The hot/cold memory question is really a non-issue, since if you need hot memory you simply don't deallocate it.

About using integer types and changing them to Float64: I deleted that post from the forums about a minute after writing it, since I timed the wrong function. It still got mailed out, though. Sorry for that!

No worries, and thanks for your tests!

Yes, thanks for the excellent analysis. Now we just need to make it do the fast thing :-)

@skariel, in func6 I think you need to reset the value of a on each iteration.
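For reference, a minimal corrected sketch (keeping the 0.2-era Array(Float64, n) constructor from the original; func6_fixed is a hypothetical name):

function func6_fixed(a, inc, b)
    A = Array(Float64, length(a:inc:b))
    for i = 1:1000
        x = a          # reset the start value on every pass; the original mutated a itself
        ix = 1
        while x < b
            A[ix] = x
            ix += 1
            x += inc
        end
    end
    return A
end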

Oh man.... This is embarrassing.... I'll just maintain a low profile for the next few months.... :(

@skariel, please don't! We need people who care about performance and have the ability to do the deep digging.

I remember that one of my earliest questions to the Julia community was "I see there's an isless() function, don't we need an isgreater() function?" Stefan very gently pointed out that you can implement isgreater(x,y) as isless(y,x). By that standard, you're doing just fine :-).

Hope to see you around a lot more!

I'll second that--I really appreciate that you're taking the time to explore this so deeply.

Kevin

Here's an update on this issue with a gist for all my code.
A few thoughts on the results:

  • vcat(::FloatRange) with no GC vs. devectorizing: results are about the same, with devec a little faster
  • vcat(::StepRange) with no GC vs. devectorizing: devec is significantly faster (fastest result overall)
  • vcat on a FloatRange or StepRange (i.e., func1) is the only method impacted by GC
  • StepRanges seem to get killed by the A[:] = range method; not sure why, but I'll look into it more
  • On the other hand, the best overall performance for FloatRanges comes from the A[:] = range technique

A few other meta-thoughts:

  • These results are certainly affected by the rangepocalypse, particularly the introduction of FloatRange and its more specialized construction behavior (both range types are illustrated after this list)
  • It's also important to remember that Matlab usually runs multi-threaded, and it seems even NumPy's range-to-array code is multithreaded (plus the fact that it's not actually Python but C code we're comparing against in that case)
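For anyone who hasn't followed the range changes, here's a quick look at the two range types in question (0.4-era REPL output; exact type printing varies by version):

julia> typeof(1:2:11)          # integer step range
StepRange{Int64,Int64}

julia> typeof(0.0:0.1:1.0)     # float range with compensated arithmetic
FloatRange{Float64}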

I think that we'll have to settle for matching single-threaded performance of other systems for now – so single-threaded C is the benchmark to compare against. Once we have threading in Julia, we can revisit this and make sure that we're as fast as other systems that use threads to speed this up.

Small update on this. It seems we are doing a bit better now. Here are the times I get for func2 and func3 on four Julia versions (func1 and func2 are identical):

# 0.3:
elapsed time: 1.223933704 seconds
elapsed time: 0.575497855 seconds

# 0.4:
elapsed time: 0.565238758 seconds
elapsed time: 1.041991829 seconds

# 0.5:
elapsed time: 0.484999175 seconds
elapsed time: 0.236237736 seconds

# 0.6:
elapsed time: 0.00059386 seconds
elapsed time: 0.328283315 seconds

func1 and func2 need a trailing ; to do the same measurement ([a:inc:b; ])

Yeah, we're still a factor of 4 slower than numpy with func3 (in-place) and a factor of 7 with func2 on my machine. It's worth noting, though, that our float ranges are doing a whole lot more work (by default) than Numpy's… and this is a deliberate decision and trade-off.
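That extra work buys correct rounding at the endpoints. A quick illustration of what the higher-precision arithmetic gets you:

julia> 0.1 + 0.1 + 0.1        # naive accumulation drifts
0.30000000000000004

julia> (0.1:0.1:0.3)[3]       # the range hits the rounded value exactly
0.3

julia> length(0.1:0.1:0.3)    # and computes the intended length
3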

If we use a lower precision range type, we're doing much better:

julia> begin
       function func4(a, b, inc, iter)
           A = Array{Float64}(length(a:inc:b))
           tic()
           for i=1:iter
               A = [StepRangeLen(a,inc,b*10+1);]
           end
           toc()
       end
       function func5(a, b, inc, iter)
           A = Array{Float64}(length(a:inc:b))
           tic()
           for i=1:iter
               A[:] = StepRangeLen(a,inc,b*10+1)
           end
           toc()
       end
       end

julia> func4(0,10000,.1,1000)
       func5(0,10000,.1,1000);
elapsed time: 0.523196442 seconds
elapsed time: 0.116625565 seconds

The Python definition on my machine takes 0.144 seconds.
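The difference between the two constructions is visible in the range's type parameters (0.6-era output; exact printing varies across versions):

julia> typeof(0.0:0.1:1.0)                 # colon lifts to twice-precision internally
StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}

julia> typeof(StepRangeLen(0.0, 0.1, 11))  # raw Float64 ref/step, length 11
StepRangeLen{Float64,Float64,Float64}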

I think things have improved since this issue was opened. Here are the results on my laptop:

function func1(a, b, inc, iter)
    for i=1:iter
        A = [a:inc:b]
    end
end

function func2(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A = [a:inc:b]
    end
end

function func3(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A[:] = a:inc:b
    end
end
function func4(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A = [StepRangeLen(a,inc,b*10+1);]
    end
end
function func5(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A[:] = StepRangeLen(a,inc,b*10+1)
    end
end

@time func1(0,10000,.1,1000)
@time func2(0,10000,.1,1000)
@time func3(0,10000,.1,1000)
@time func4(0,10000,.1,1000)
@time func5(0,10000,.1,1000)

println("")
@time func1(0,10000,.1,1000)
@time func2(0,10000,.1,1000)
@time func3(0,10000,.1,1000)
@time func4(0,10000,.1,1000)
@time func5(0,10000,.1,1000)

$ julia a.jl
  0.119092 seconds (397.67 k allocations: 21.202 MiB)
  0.009710 seconds (18.41 k allocations: 1.809 MiB)
  0.360205 seconds (90.05 k allocations: 5.279 MiB)
  0.236766 seconds (96.16 k allocations: 768.615 MiB, 8.72% gc time)
  0.136129 seconds (83.31 k allocations: 4.851 MiB)

  0.000187 seconds (1.00 k allocations: 125.156 KiB)
  0.000293 seconds (1.01 k allocations: 906.547 KiB)
  0.332319 seconds (6 allocations: 781.547 KiB)
  0.272638 seconds (2.01 k allocations: 763.840 MiB, 10.49% gc time)
  0.104920 seconds (6 allocations: 781.547 KiB)

What Julia means by [a:inc:b] has changed — that used to do concatenation but it now just puts the range itself inside a one-element array. You need to add a ; to make it do what it used to. And I think your laptop is faster than whatever was used in the original post — we've actually (intentionally) regressed a bit since 0.5.
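Concretely (1.0-era REPL; collect is the idiomatic spelling nowadays):

julia> r = 0.0:0.5:1.0;

julia> [r]          # a one-element vector holding the range itself
1-element Array{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},1}:
 0.0:0.5:1.0

julia> [r;]         # the trailing ; concatenates, materializing the values
3-element Array{Float64,1}:
 0.0
 0.5
 1.0

julia> collect(r)   # same result, arguably clearer
3-element Array{Float64,1}:
 0.0
 0.5
 1.0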

In my tests, Julia 1.3-dev is still at roughly the same speed that version 0.6 had. Here's a slight update to the tests to use BenchmarkTools — I think there's also some adversarial compiler shenanigans going on in both languages.

julia> using BenchmarkTools

julia> a, b, inc = 0,10000,.1;

julia> A = Array{Float64}(undef, length(a:inc:b));

julia> @btime [$a:$inc:$b;];
  260.461 μs (2 allocations: 781.39 KiB)

julia> @btime $A[:] = $a:$inc:$b;
  287.677 μs (0 allocations: 0 bytes)

julia> @btime $A .= $a:$inc:$b;
  253.028 μs (0 allocations: 0 bytes)

julia> @btime [StepRangeLen($a,$inc,$b*10+1);];
  84.170 μs (2 allocations: 781.39 KiB)

julia> @btime $A[:] = StepRangeLen($a,$inc,$b*10+1);
  112.257 μs (0 allocations: 0 bytes)

julia> @btime $A .= StepRangeLen($a,$inc,$b*10+1);
  26.295 μs (0 allocations: 0 bytes)

Compare this to numpy:

julia> using PyCall

julia> @pyimport numpy

julia> @btime pycall(numpy.arange, PyObject, $a, $b, $inc);
  86.354 μs (9 allocations: 192 bytes)

So we're around a factor of 3-4 away for the colon construction, but the StepRangeLen construction is about the same. On its face this makes sense as numpy is doing roughly the same calculation as the latter, whereas the former is using higher precision arithmetic. Despite the threading macros in the numpy source, it doesn't appear to be doing any multithreading.

Since we intentionally use higher precision for float ranges, maybe this should be closed?
