Track performance of the Laplace equation benchmark. Discussion here:
https://groups.google.com/d/topic/julia-dev/VDlr3_6Szos/discussion
The benchmark has been added to the perf2 suite.
I don't think the vectorized version will ever be as fast as the other one.
These performance issues are getting out of hand. They need to be analyzed and narrowed down.
Yes, I was just thinking the same. The vectorized version does far too much work, but there is certainly room for improvement.
Original multi-language performance comparison:
Octave (0): 14.9932 seconds
GNU R (0): 56.4161 seconds
Julia (0): 39.1679 seconds
         gcc        intel
1    :   13.4774    *         NumPy
2    :   *          1.6633    Cilk Plus
3, 4 :   2.4963     1.9179    Vectorized Fortran (pure)
Devectorized julia for the same number of iterations (2^15) now runs in 3.8 seconds for me.
Does numpy implement views and end up evaluating lazily in this case, and hence end up being much faster? Since Octave ends up taking about the same amount of time, I presume that we just have a lot of room for improvement in Julia.
Yes, numpy indexing with scalars and slices (ranges) always produces views.
Does Octave do the same? I didn't think so, yet it is matching numpy's performance. I'll write the benchmark in a few other languages for comparison.
I would imagine that it does copy on write, since it strives to emulate matlab. But since I'm not an octave user, I really don't know.
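For concreteness, here is a minimal sketch of the copy-versus-view distinction being debated, written in later Julia syntax (the @view macro did not exist at the time of this exchange):

A = collect(reshape(1:12, 3, 4))   # a plain dense Array
s = A[:, 2]          # indexing with a range produces a copy...
s[1] = 100           # ...so mutating s leaves A unchanged
v = @view A[:, 2]    # a SubArray view into A, no copy
v[1] = 100           # ...now A[1, 2] == 100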
I really think it's arguable that our slices should be views.
+1
Except for copy on write (which I'm definitely not suggesting),
I'm not aware of any other sane way to get performance with slices.
I could get on board with that. We either have to make all ops on SubArray fast, or build views into the Array type. This is actually a tough decision. Of course we want it to be possible to write something like SubArray and have it be fast.
An issue that immediately comes up is that linear indexing is no longer fast with views, though it is insanely fast for plain dense arrays. This makes me think we need a smarter index abstraction so there is a fast and generic way to iterate over more kinds of arrays. I'm willing to go to extreme lengths to get this. Syntax, changes to the compiler IR, everything is on the table. Rewriting array.jl is no big deal. We can do that in maybe a day.
An issue that immediately comes up is that linear indexing is no longer fast with views
In my currently-stalled rewrite of the Image library, I have a substitute brewing:
type Counter{N}
    max::NTuple{N, Int}
end
Counter(sz) = Counter{length(sz)}(to_tuple(sz))

function start{N}(c::Counter{N})
    state = ones(Int, N)
    state[1] = 0  # because of start/done/next sequence, start out one behind
    return state
end

function done(c::Counter, state)
    # we do the increment as part of "done" to make exit-testing more efficient
    state[1] += 1
    i = 1
    max = c.max
    while state[i] > max[i] && i < length(state)
        state[i] = 1
        i += 1
        state[i] += 1
    end
    state[end] > c.max[end]
end

function next(c::Counter, state)
    return state, state
end
This can be pretty fast, particularly if you handle the inner loop like this:
szv = [sz...]   # vector from tuple
szv[1] = 1
for c = Counter(szv)
    cprod = prod(c)
    for i = 1:sz[1]
        p[index] = i*cprod
        index += 1
    end
end
Then for most applications the performance is essentially indistinguishable from linear indexing. (However, there are traps: this is the usage that prompted me to report performance issue 73d8ca4e). Assuming it could be generalized to work on general ranges, and to return a linear index, would that help?
I think in my initial version of this I tried returning both the linear index and the coordinate representation, and also ran into performance problems due to tuple creation, but memory of exactly what was going on is now hazy.
Should have specified: in the inner loop I was demoing how you'd set each element to the product of its coordinates, i.e., p[i,j,k] = i*j*k. And you'd need to set index = 1 at the beginning.
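Pulling those corrections together, here is a minimal 3-D sketch of what the Counter-based loop computes, written in current syntax rather than the 0.x syntax above (fill_coord_products! is a hypothetical name used only for this illustration):

function fill_coord_products!(p::AbstractArray{<:Any,3})
    sz = size(p)
    index = 1                       # the `index = 1` initialization noted above
    for k = 1:sz[3], j = 1:sz[2]
        cprod = j * k               # product of the outer ("Counter") coordinates
        for i = 1:sz[1]             # hot inner loop over the first dimension
            p[index] = i * cprod    # equivalent to p[i,j,k] = i*j*k
            index += 1
        end
    end
    return p
end

fill_coord_products!(zeros(4, 3, 2))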
This seems to rely on making the first dimension a special case for performance, or at least assuming it is large enough for the inner loop to dominate. I think we need something more drastic. We may also need to depend on better compiler optimizations like allocating tuples on the stack or in registers.
Yes, that's basically true.
OK, might one be able to generate a user-friendly version of gen_array_index_map that would make it trivial to write linear-indexing code wherever it might appear? Or is that something that would have to go into the compiler?
This is turning into a full-blown discussion. Perhaps best to move it over to julia-dev to invite some more ideas?
I would rather invite anyone to have a discussion here, so that all the perf issues can be tracked here and are addressed.
On the pure scalar indexing bit, we are still a factor of 1.5x-2x slower than C. So long as we do not close that gap, whether through better code generation or other optimizations, vectorized Julia code will continue to be that much slower than languages where the vectorized operations are implemented in C.
There was a mailing list discussion a while back that found that bounds-checking accounts for some of this. Code with bounds-checks disabled (by commenting out two lines in codegen.cpp) ran in about 75% of the time of the code with bounds-checks, pretty constantly on three different machines (mine, Krys', and Bill's, if memory serves). Not a huge performance difference, but if you're thinking about factors of 2 that will be some chunk of it.
I think with bounds-checking removed the code generated in loops was about the same as what clang generates for C/C++. You just have to feed the JIT engine enough code to be able to optimize it.
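(For reference, in later Julia versions the same experiment no longer requires editing codegen.cpp; bounds checks can be elided locally with the @inbounds macro. A minimal sketch:)

function sum_kernel(a)
    s = zero(eltype(a))
    @inbounds for i in eachindex(a)   # skip bounds checks inside this loop
        s += a[i]
    end
    return s
end

sum_kernel(rand(1000))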
I really think it's arguable that our slices should be views.
I could get on board with that. We either have to make all ops on SubArray fast, or build views in to the Array type. This is actually a tough decision. Of course we want it to be possible to write something like SubArray and have it be fast.
Using views as the default behavior seems appealing for a few reasons:
An issue that immediately comes up is that linear indexing is no longer fast with views, though it is insanely fast for plain dense arrays. This makes me think we need a smarter index abstraction so there is a fast and generic way to iterate over more kinds of arrays. I'm willing to go to extreme lengths to get this. Syntax, changes to the compiler IR, everything is on the table. Rewriting array.jl is no big deal. We can do that in maybe a day.
Now that we have many more examples of types of arrays (sparse, distributed, etc.), we can maybe come up with something that unifies them better and has good performance.
I ran the Laplace benchmark on master:
Julia (vec) - 1874 ms
Matlab (vec) - 220 ms
Julia (devec) - 135 ms
C - 162 ms
It would be nice to figure out if this is a problem of too many temporaries, and to what extent we need to rethink our indexing by using views and such.
I am marking this as 0.2 just to figure out if there is something we can do to improve our performance for vectorized codes by the next release.
It's definitely a problem of too many temporaries, as a simple sprofile run will show: out of 1104 captures on perf2/laplace.jl; laplace_iter_vec; line: 25, there were 970 at array.jl; +; line: 875. That line allocates the output array:
F = Array(promote_type(S,T), promote_shape(size(A),size(B)))
(Out of those 970, 963 were in jl_gc_collect, so the other function calls on that line have negligible impact.)
However, I'd guess that this is not the kind of thing we're likely to solve in 2-3 weeks. I'd also tentatively suggest that Julia's inlining behavior may be a higher priority. The reason is simple: often a careful programmer can "write around" the library's current tendency to allocate temporaries. It can be inconvenient, but it's usually possible. Case in point: laplace_iter_devec is probably more readable than laplace_iter_vec. In contrast, the mere fact that map(abs, A) is much slower than abs(A)---and there's nothing that one not steeped in LLVM lore can do to change that---influences language design and usage patterns in deep ways.
But coming from me this is all hot air, because ATM I don't feel up to the challenge of tackling this myself.
I think that's a good point, @timholy. cc: @JeffBezanson, who is, of course, already following this thread.
I concur that it will not be solved in 2 weeks just by setting a milestone. But I have plans. I have been ruminating a lot about julia's theory of "what is a function", and I may be starting to get somewhere.
Sounds exciting! Here, have a nice cup of tea to help your ruminations.
Right, I didn't expect anything to be solved in 2 weeks, but even being able to give a rough plan on tackling this would be an important signal for everyone.
Also, as @JeffBezanson has often said, the best way to avoid GC is to not allocate memory. Perhaps we can revisit our indexing along these lines, as is discussed in this thread.
57e010e19dacd13499b243c4913bcc875f33af02 helps both laplace_vec and laplace_devec significantly (but does not fix the gap between them).
I just reran this benchmark on julia.mit.edu, and find that octave takes 0.8 seconds, whereas julia takes 2.2 seconds.
I added a subarray version here, and it is significantly slower. I guess we have to wait for the tuple overhaul. In any case, I was wondering if we can characterize why the vectorized version is so much slower (subarrays or otherwise) compared to the devectorized version. I was kind of hoping that @carnaval's GC work would help here, and I wonder if a better GC will help (perhaps tune some knobs in the new GC), or if the gains need to come from elsewhere.
julia> @time laplace_vec_sub();
elapsed time: 4.229344435 seconds (1028 MB allocated, 3.58% gc time in 47 pauses with 0 full sweep)
julia> @time laplace_vec();
elapsed time: 1.707854514 seconds (1715 MB allocated, 13.36% gc time in 78 pauses with 0 full sweep)
julia> @time laplace_devec();
elapsed time: 0.162232645 seconds (1 MB allocated)
I thought the new gc already helped here significantly. Though of course it does not close the whole gap. We should work on subarrays.
If you see my older comment and the new results, the difference went from 13x to 10x. The new GC helped significantly on bench_eu, but not much on laplace.
I tried a subarray version a while ago, and it was a little faster. Can you link to your code?
I have checked the subarray version into laplace.jl. With the tuple overhaul now in, there are significant gains, but still too much memory allocation.
julia> @time laplace_vec_sub();
elapsed time: 2.958093484 seconds (1027 MB allocated, 3.30% gc time in 47 pauses with 0 full sweep)
It was an indexing problem, not a subarray problem. Fixed in #10858. Will be improved further if #9080 can be fixed.
Almost all of the remaining memory allocation comes from A+B, c*A, etc.
I should add: technically, with the new GC the performance problems are presumably more an issue of cache efficiency than memory allocation per se. When you allocate a temporary C = A+B, you traverse A, B, and C; then in D = 2*C, you traverse C again. In contrast, the devectorized version traverses everything once.
Cache behavior is especially bad if you are allocating temps in a loop because you will get fresh (read: not cached) memory at each iteration up until collection, and start all over again.
With preallocation you get very good temporal coherency of course.
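A minimal sketch of the traversal argument above (illustrative only, not benchmark code): the vectorized form allocates two temporaries and walks the intermediate array twice, while the devectorized form touches each element once and allocates nothing.

function two_pass(A, B)
    C = A + B      # allocates a temporary; traverses A, B, and C
    D = 2 * C      # allocates another; traverses C again
    return D
end

function one_pass!(D, A, B)
    for i in eachindex(D, A, B)
        D[i] = 2 * (A[i] + B[i])   # single traversal, no temporaries
    end
    return D
end

A = rand(100, 100); B = rand(100, 100)
one_pass!(similar(A), A, B) == two_pass(A, B)   # same result, fewer allocations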
We may have a good solution for this kind of thing at the end of the semester if all goes well :-)
We may have a good solution for this kind of thing at the end of the semester if all goes well :-)
Sounds like something to look forward to!
I am really looking forward to it too!
Hopefully the performance of the vectorized version will be fixed by #17623, assuming we change the code to use .+ etcetera.
Syntactic fusion seems like a completely fair solution to this performance issue.
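As a small illustration of what that fusion buys (hypothetical data, current syntax): a dotted assignment lowers to a single fused broadcast! loop, so no intermediate arrays are allocated.

a, b, c = rand(100), rand(100), rand(100)
y = similar(a)
y .= (a .+ b) .* c        # one fused loop writing directly into y
# roughly equivalent to the explicit loop:
for i in eachindex(y)
    y[i] = (a[i] + b[i]) * c[i]
end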
Even with loop fusion, this at first looked like it was taking a lot more allocations than I would have expected; never mind, that was a typo in my code (I had accidentally used a global).
julia> function laplace!(u, Niter, dx2, dy2)
           Ny, Nx = size(u)
           c = 1 / (2*(dx2+dy2))
           u0 = @view u[2:Ny-1, 2:Nx-1]
           u1 = @view u[1:Ny-2, 2:Nx-1]
           u2 = @view u[3:Ny, 2:Nx-1]
           u3 = @view u[2:Ny-1, 1:Nx-2]
           u4 = @view u[2:Ny-1, 3:Nx]
           for i = 1:Niter
               u0 .= ((u1 .+ u2) .* dy2 .+ (u3 .+ u4) .* dx2) .* c
           end
           return u0
       end
laplace! (generic function with 1 method)
julia> u = zeros(150,150); u[1,:] = 1
1
julia> @time laplace!(u, 2^15, 0.01, 0.01);
7.999779 seconds (301.50 k allocations: 10.248 MB, 0.58% gc time)
julia> @time laplace!(u, 2^15, 0.01, 0.01);
7.747673 seconds (10 allocations: 496 bytes)
@yuyichao, @pabloferz, any ideas? Is this some problem with views?
In comparison, an unrolled version does slightly better, which is not too unexpected — there is some overhead (less than a factor of 2) to using all of those views and generic broadcasting iteration:
julia> function laplace_iter!(u, Niter, dx2, dy2)
           Ny, Nx = size(u)
           c = 1 / (2*(dx2+dy2))
           for iter = 1:Niter
               for i = 2:size(u,1)-1, j = 2:size(u,2)-1
                   u[i,j] = ((u[i-1,j]+u[i+1,j])*dx2 + (u[i,j-1]+u[i,j+1])*dy2)*c
               end
           end
           return u
       end
laplace_iter! (generic function with 1 method)
julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
4.281638 seconds (7.69 k allocations: 327.032 KB)
julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
4.221090 seconds (5 allocations: 176 bytes)
But I think the vectorized version is fast enough that we can now close this?