Average insertion time for inserting n elements:
Total insertion time for inserting n elements:
_Preliminarily_, this behavior seems to be true for both push-front and push-back from 0.5 onwards. Still running the profiles.
General array-insert is O(n); only push/pop (e.g. insert at any fixed offset from begin or end) are expected to be O(1).
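To make that distinction concrete, here's a minimal illustration (the function names are throwaway names for this example, and the exact timings are machine-dependent): inserting at the front shifts every existing element, while appending is amortized O(1).

```julia
# Minimal illustration of the complexity claim above. `fill_front!` and
# `fill_back!` are just names for this example.
function fill_front!(n)
    a = Int[]
    for i in 1:n
        insert!(a, 1, i)   # O(length(a)) per call: shifts the whole array
    end
    return a
end

function fill_back!(n)
    a = Int[]
    for i in 1:n
        push!(a, i)        # amortized O(1) per call
    end
    return a
end

fill_front!(10)            # warm up / compile
fill_back!(10)
@time fill_front!(10^5);   # quadratic in n: visibly slower
@time fill_back!(10^5);    # roughly linear in n
```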
This is normal `push!` though?
So... this is probably an orthogonal, unrelated problem, but @JeffBezanson and I have found that we're manually doing a `malloc`, `memcpy`, `free` every time an array is grown, rather than using `realloc`:
https://github.com/JuliaLang/julia/blob/a2149f890fb9ba8846e2731cd193774597bd4916/src/julia_internal.h#L840-L844
This seems to be because there _doesn't exist_ a reasonable method to `realloc` _aligned_ data, so we have to manually re-do the `_aligned_malloc`. However, this StackOverflow post suggests that we may be better off _attempting_ a `realloc`, then testing the resulting alignment, and then only re-doing the `_aligned_malloc` if it's needed (because the alignment is off):
https://stackoverflow.com/a/9078627/751061
That's especially true when the arrays get huge -- because in that case it's pretty hard to imagine realloc not producing an aligned memory region, so it's probably worth the _potential_ double allocation.
EDIT: This would make it slower, but should be an amortized constant increase, not an order of magnitude increase.
Sorry, the problem description wasn't clear. Yes, this is `push!` and `pushfront!`.
> So... this is probably an orthogonal, unrelated problem, but JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:
This is still an O(1) operation (and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia `Array` implementation).
> This is still an O(1) operation
Yeah, agreed. That's what I meant by "this is orthogonal". I shouldn't have qualified that with "probably". Sorry. :)
> and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation
Huh, yeah, that's interesting. I'm interested to hear more about that, but it makes sense.
I still think we should use the technique of calling `realloc`, checking the alignment, and only copying if necessary. `realloc` is essential at large array sizes, and at that point the result is also highly likely to be aligned.
(I'm happy to branch this conversation to another thread, though, if we want?)
So, to continue the original conversation here.
We discovered the actual problem. To summarize what we discussed offline at JuliaCon: we thought that, above a certain size, arrays stop growing geometrically and instead grow by only a single element on each `push!`, making each insertion O(N)!

However, upon more investigation with Valentin, we think that's not quite right. The behavior we noticed above should _actually_ be summarized as follows: once an array is larger than about 1% of physical RAM, it stops doubling and instead grows by a constant increment of roughly that cap on each reallocation.
And that actually seems somewhat reasonable! That change was introduced here:
https://github.com/JuliaLang/julia/commit/139a73a0e1fdd3a84ba81537ba53990566136b3c
Now, despite that being _significantly_ more reasonable than we originally thought, I still think it might be somewhat related to the performance described above. It _still_ means that above a certain size the arrays grow by a constant amount -- just with a much larger constant increment than growing by 1. So I _think_ that would still result in quadratic growth of the total cost to insert N elements?
So this still bears some thinking about: is that acceptable?
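To put rough numbers on that, here's a small, made-up model of the total number of elements copied while appending n elements one at a time. The `threshold` and `increment` below are stand-ins for the RAM-based cap, not Julia's actual constants:

```julia
# Toy model of total copy cost when appending n elements one at a time.
# `threshold` and `increment` are made-up stand-ins for the RAM-based cap;
# they are not Julia's real constants.
function copy_cost(n; threshold = 10^6, increment = 10^5)
    cap, len, copied = 4, 0, 0
    for _ in 1:n
        if len == cap
            copied += len                          # each grow copies the existing elements
            cap = cap < threshold ? 2cap : cap + increment
        end
        len += 1
    end
    return copied
end

copy_cost(10^6)   # below the threshold: total copies stay proportional to n
copy_cost(10^7)   # above it: each grow copies nearly the whole array, so the total grows roughly quadratically
```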
BUT THEN, something else changed that introduced a significant slowdown in array growth between julia 0.4 and 0.5. As far as I can tell, it's a constant slowdown of around 1.7x.
We see a pretty big slowdown between 0.4 and 0.5, and it's not related to the above change; the above change was introduced 5 years ago, in julia 0.3.
Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/
In particular, here's the change between 0.4 and 0.5 for `push!`. 0.4 looks deceptively flat, but it actually has the same shape, just constant-scaled down:
So all of them suffer from this same problem of greater-than-O(1) average append time, but after 0.5 that got slower still. So there are two unrelated "problems" here.
Sorry, and just to clarify, the v0.4 to v0.5 regression seems to be always around 2x:
```julia
julia0.7> v05[2] ./ v04[2]
6-element Array{Float64,1}:
 2.048994351707916
 1.7865544298045446
 1.1380501128347926
 0.9317471797454173
 0.9720312759644577
 3.3472547019000554
```
So, NEXT STEPS:

- We should have some kind of quick fix for this problem. I think the simplest thing would be to increase that 1% RAM cap to something like 5%. I _think_ who cares about the asymptotic complexity: it's fine if your growth becomes "quadratic" if there's only going to be at most 20 more growth operations. 100 is kind of a lot, but maybe 20 is okay. Or 10. We could do 10% maybe? idk.
- Find and fix the 2x array growth regression between julia 0.4 and julia 0.5.
- Investigate other cool growth strategies:
  - Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe _faster_ than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe! Something in the Sigmoid family maybe? Probably best to write down the properties we want it to have and then derive a function that does that. (A rough sketch of one possible shape is below.)
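As a strawman for that last bullet, here's one made-up shape for such a rule -- a simple linear taper rather than a true sigmoid, just to show the idea. `mem_cap` and all the constants here are arbitrary assumptions, not a real proposal:

```julia
# Hypothetical tapered growth rule, purely illustrative: grow by ~2x while the
# array is a tiny fraction of `mem_cap`, and let the factor shrink smoothly
# toward 1x as the array approaches the cap. All constants are made up.
function next_capacity(cur::Int, mem_cap::Int)
    frac = clamp(cur / mem_cap, 0.0, 1.0)    # how close this array is to the cap
    factor = 1.0 + (1.0 - frac)              # 2.0 when tiny, approaching 1.0 near the cap
    return min(ceil(Int, cur * factor), mem_cap)
end

next_capacity(1_000, 10^9)      # ≈ 2_000: still doubling
next_capacity(10^8, 10^9)       # ≈ 1.9e8: slightly gentler growth
next_capacity(9 * 10^8, 10^9)   # ≈ 9.9e8: only ~10% growth near the cap
```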
CC: @vchuravy
> what about 1.0? where is that relative to .5?
I didn't actually measure 1.0, because I assumed it was the same as 0.7. I can do that in a bit though. Assuming it _is_, 0.7 was the same as 0.5.
Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.
Leaving breadcrumbs to https://github.com/JuliaLang/julia/pull/25879, which might not be needed if we fix the performance of `push!`.
> Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.
Ah, sorry, I didn't include a screenshot, but here are the full results:
> Here are the plots comparing 0.4-0.7:
> https://plot.ly/~nhdaly/1/
> https://plot.ly/~nhdaly/3/
-- _EDIT: I forgot I ran the 0.4 and 0.5 tests on julia-debug, but 1.0 on julia. After switching them all to `julia`, the 0.5 and 1.0 profiles are almost identical._ --
Okay, profiling update. (Is there a better place to be posting these thoughts/updates? Maybe in a performance slack channel or something? let me know.)
So @vchuravy helped me set up a profiling script, which I then ran on my Mac under Xcode's Instruments:
```julia
# profilegrowth.jl (Note: this file is targeting julia 0.4 and 0.5)
function testfunc(capacity)
    a = zeros(capacity ÷ 2)
    ccall(:jl_array_sizehint, Void, (Array, Cint), a, capacity) # For the julia 1.0 test, I changed Void to Nothing
end

# Warm up / precompile
testfunc(10)

function test(iterations)
    for _ in 1:iterations
        testfunc(10^8)
    end
end
test(1)

# Actual test
test(10)
```
Some interesting early observations:

- The total time roughly doubles from v0.4.7 to 6.1 secs in v0.5.2 (or around 2x).
- ~It's not just the array growth (`jl_array_grow_end`) that got slower -- _everything_ seems to have gotten slower... Creating the original zeroed-out array takes about twice as long in 0.5, and allocating the array about 10x as long (but starting from a very small value, so probably insignificant).~
- `jl_array_grow_end` went from 254ms at 7.8% of the profile, to 3.29secs at 47.4% of the profile, which seems pretty bad.

julia v0.4:
julia v0.5:
EDIT: And julia 1.0's profile is almost identical to 0.5's. (What I thought was a speed-up was actually just due to using `julia` vs `julia-debug`.)
~And for comparison, julia 1.0's profile looks very similar to 0.5's, percentage-wise. Everything is faster, with a total time of 5.98secs instead of 8.2secs, but the percentages are similar, and `jl_array_grow_end` takes 2.87secs at 44% of the profile:~
(zero-filling in 1.0 is now _slightly_ faster than in 0.4 though, so that's cool!)
So then, digging into those profiles, from what I can tell, the extra time seems to come from the fact that 0.5 and 1.0 are actually calling `memcpy` or `memmove`, whereas 0.4 _isn't_!:
julia 0.4:
julia 0.5:
julia 1.0:
So actually, as best as I can tell, this might actually be a symptom of the problem Jeff originally identified above (https://github.com/JuliaLang/julia/issues/28588#issuecomment-412271105) and which I broke out into a separate issue here (https://github.com/JuliaLang/julia/issues/29526).
That is, 0.4 was avoiding the expensive operation of _actually copying_ the data, since most of the time it can get away with just growing the memory in-place!
Am I understanding that correctly? And if so, is that only true because this is a toy example that does nothing else besides allocate that array? In the real world, would we expect memory to be more fragmented than this, such that that optimization matters less? I think it would probably be good to profile a bunch of example normal workflows to compare the CPU spent in `jl_array_grow_end`.
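Here's a minimal sketch of one way to do that spot-check with Julia's built-in sampling profiler. The workload below is just a stand-in for a real program; `C = true` is what makes the C frames (like `jl_array_grow_end`) show up in the report:

```julia
# Minimal sketch: sample a workload and look for time spent under
# jl_array_grow_end. The workload here is a stand-in, not a "normal" program.
using Profile

function workload()
    a = Int[]
    for i in 1:10^7
        push!(a, i)
    end
    return a
end

workload()                # warm up / compile first
Profile.clear()
@profile workload()
Profile.print(C = true)   # include C frames so jl_array_grow_end is visible
```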
This also helps explain the ~2x slowdown: we're now touching all the data twice, once during allocation and once during the grow event.
Oops, I just realized I used julia-debug binaries for 0.4 and 0.5, but julia for 1.0. The results are still similar, but let me update the numbers above.
I watched a talk recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193); it compares `std::vector` and their in-house vector class. The discussion is about 4 minutes long.
I paraphrase:
- One of the things that happens with a vector: as you push back you have to grow and resize the array.
- `std::vector`: create a temporary, move the old to the new, resize... This is fine...
- We wrote our growth capacity like this: we used `realloc`...
- There are two artifacts: `realloc` doesn't have to allocate new memory if the memory allocation library says that you can just grow.
- Unable to get `std::vector` to ever match the performance.
@KristofferC Yeah, so the equivalent C++ (compiled with `-O3`) has a profile that looks more like the 0.5/1.0 versions:
It spends ~30% of its time zeroing the initial array, and an equal 30% of time copying that data to a new array.
Although C++ is hindered by spending an extra 30% of its time zeroing out the rest of the array after growing, whereas julia chooses to leave that data uninitialized.
It's hard to directly compare the total times, since the julia runs included parsing and compiling the code as well, and I'm not sure how much of that contributed to the final profiles. But still, it seems that the C++ is actually a good amount faster, despite doing extra work that 0.5 and 1.0 don't... It would be interesting to know why! :)
Here's the C++ file I used:

```cpp
// Compiled via: clang++ --std=c++11 -O3 profilegrowth.cc
#include <vector>

using namespace std;

void testfunc(int capacity) {
    auto a = vector<int>(capacity/2);
    a.resize(capacity);
}

void test(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        testfunc((int)1e8);
    }
}

int main() {
    // "Warm up" to maintain correlation with the julia script, which runs these to precompile
    testfunc(10);
    test(1);

    // Actual test
    test(10);
    return 0;
}
```
> watched a video recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193)
This is a fun talk, thanks for sharing! Agreed that it seems quite relevant, and that your summary is good.
Hi again!
So I've been (finally) doing some more investigation into this, and want to share my thoughts.
First, a recap:
There are two problems we've discovered:

1. Above a certain size (around 1% of physical RAM), we grow arrays by a constant amount rather than by a scale factor.
2. We don't use `realloc` for growing the array, causing a ~2x regression: we never `realloc` for _aligned data_, instead performing malloc, copy, free. This is because `realloc` doesn't guarantee alignment.

The solution to problem (2.) is probably to follow the suggested behavior in this stackoverflow post:
https://stackoverflow.com/a/9078627/751061
Namely, to _attempt_ `realloc`, check for alignment (which it probably will be), and then only do _malloc, copy, free_ if not aligned. However, the legalese is complicated enough to make me wary that this will be unsafe in some weird edge case... It seems probably worth doing, but it scares me a little.
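For concreteness, here's a rough sketch of that control flow. The real change would live in the C runtime next to the linked julia_internal.h code; this is just a Julia-flavored sketch using Libc and a `posix_memalign` ccall, and the alignment value and old-size bookkeeping are assumptions about the caller:

```julia
# Sketch only: try realloc first, keep the result if it happens to be aligned,
# and fall back to allocate/copy/free only when it isn't. Not Julia's actual
# implementation. posix_memalign is POSIX-only; Windows is ignored here.
function realloc_aligned_sketch(p::Ptr{Cvoid}, oldsz::Integer, newsz::Integer, align::Integer = 64)
    q = Libc.realloc(p, newsz)
    q == C_NULL && throw(OutOfMemoryError())
    UInt(q) % align == 0 && return q   # realloc result is already aligned: no extra copy

    # Alignment was lost: do the malloc/copy/free dance, but only in this (hopefully rare) case.
    ref = Ref{Ptr{Cvoid}}(C_NULL)
    rc = ccall(:posix_memalign, Cint, (Ptr{Ptr{Cvoid}}, Csize_t, Csize_t), ref, align, newsz)
    rc == 0 || throw(OutOfMemoryError())
    aligned = ref[]
    ccall(:memcpy, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), aligned, q, min(oldsz, newsz))
    Libc.free(q)
    return aligned
end
```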
Problem number (1.) however is straightforward. All the literature I have read on this topic says you should never grow with a constant growth amount. It should always be a scale-factor of the current array size. So we should fix that. The next remaining question is _what should the growth factor be_?
Picking a scaling factor

There is apparently a lot of interesting debate on this topic.
It's summarized well in the wikipedia article on dynamic arrays, which also includes a table of the growth factors in various languages:
https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor
Implementation | Growth factor (a)
-- | --
Java ArrayList[1] | 1.5 (3/2)
Python PyListObject[7] | ~1.125 (n + n >> 3)
Microsoft Visual C++ 2013[8] | 1.5 (3/2)
G++ 5.2.0[5] | 2
Clang 3.6[5] | 2
Facebook folly/FBVector[9] | 1.5 (3/2)
Introductory CS courses explain that you _double_ an array when it needs to grow. However, some sources such as Facebook's folly claim that 2x growth is provably the worst choice, "because it never allows the vector to reuse any of its previously-allocated memory":
https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md
I think that this stackoverflow answer explained the concept well:
https://stackoverflow.com/a/1100426/751061
From what I understand, conceptually any growth factor less than or equal to the _golden ratio_ sufficiently allows for reusing previous allocation slots, but people prefer 1.5 for other reasons.
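Here's a quick way to see that argument with made-up numbers. It assumes the allocator could coalesce the previously-freed, adjacent blocks, which is the idealized case the folly docs are arguing from:

```julia
# Toy check of the "memory reuse" argument: how many growth steps until the
# sum of previously-freed block sizes could hold the next allocation?
# Assumes freed blocks are adjacent and coalescible (the idealized case).
function first_reusable_growth(factor; start = 16, steps = 50)
    cap, freed = start, 0
    for i in 1:steps
        newcap = ceil(Int, cap * factor)
        freed >= newcap && return i    # the new block would fit in already-freed space
        freed += cap                   # the old block is freed after the copy
        cap = newcap
    end
    return nothing                     # never fits within `steps` growths
end

first_reusable_growth(1.5)   # => 5: reuse becomes possible after a handful of growths
first_reusable_growth(2.0)   # => nothing: the freed total always lags the next request
```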
Changing growth factor based on physical RAM

The approach julia took to limit allocation size as you approach the total system RAM is interesting. One thing we could maybe consider is _gradually slowing_ the growth factor as you get larger, rather than switching to a constant growth size like we do now. (For example, we'd discussed other interesting growth functions like _sqrt(x)_ in the past.)
But I think that this is probably a bad idea. Here are some reasons:
Next Steps

So given all of that, we have a few decisions to make. I'll list them and my proposals here:
How does all of that sound? :)
I've opened https://github.com/JuliaLang/julia/pull/32035, which implements the proposals above, which we can use as a strawperson to investigate the best fix.