Julia: Array `push!` seems to grow amortized O(N)...

Created on 11 Aug 2018 · 26 comments · Source: JuliaLang/julia

Average insertion time for inserting n elements:
[plot: julia front/back insertion time, average per element]
Total insertion time for inserting n elements:
[plot: julia front/back insertion time, total]

_Preliminarily_, this behavior seems to hold for both push-front and push-back from 0.5 onwards. Still running the profiles.
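(For reference, a minimal harness for measuring this looks something like the following -- a sketch only, not necessarily the exact script behind the plots above:)

# benchgrowth.jl -- illustrative sketch, not the actual benchmark script
function time_pushes(n)
    a = Int[]
    t = @elapsed for i in 1:n
        push!(a, i)        # pushfirst! can be timed the same way
    end
    return t
end

time_pushes(10)  # warm up / compile
for n in (10^5, 10^6, 10^7)   # the plots above go larger still
    t = time_pushes(n)
    println(n, "  total: ", t, " s   average: ", t / n, " s")
end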

All 26 comments

General array insert is O(n); only push/pop (i.e. insertion at a fixed offset from the beginning or end) are expected to be amortized O(1).

This is normal push! though?

So... this is probably an orthogonal, unrelated problem, but @JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:
https://github.com/JuliaLang/julia/blob/a2149f890fb9ba8846e2731cd193774597bd4916/src/julia_internal.h#L840-L844

This seems to be because there _doesn't exist_ a reasonable method to realloc _aligned_ data, so we have to manually re-do the _aligned_malloc. However, this StackOverflow post suggests that we may be better off _attempting_ a realloc, then testing the resulting alignment, and then only re-doing the _aligned_malloc if it's needed (because the alignment is off):
https://stackoverflow.com/a/9078627/751061

That's especially true when the arrays get huge -- because in that case it's pretty hard to imagine realloc not producing an aligned memory region, so it's probably worth the _potential_ double allocation.

EDIT: This would make it slower, but should be an amortized constant increase, not an order of magnitude increase.
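To make that concrete, here is a rough sketch of the attempt-realloc-then-check idea, written as plain Julia ccalls against libc rather than the runtime's actual C code (the 64-byte alignment constant, the use of posix_memalign, and the grow_buffer name are all just illustrative):

# Sketch only: try realloc first, and redo an aligned allocation + copy
# only if the result happens to come back misaligned.
const ALIGN = 64   # illustrative; not necessarily the runtime's alignment

function grow_buffer(p::Ptr{Cvoid}, oldsz::Integer, newsz::Integer)
    q = ccall(:realloc, Ptr{Cvoid}, (Ptr{Cvoid}, Csize_t), p, newsz)
    q == C_NULL && throw(OutOfMemoryError())
    UInt(q) % ALIGN == 0 && return q   # common case: already aligned, no extra copy
    # Rare case: realloc moved us to an unaligned address, so copy once more.
    r = Ref{Ptr{Cvoid}}(C_NULL)
    ret = ccall(:posix_memalign, Cint, (Ptr{Ptr{Cvoid}}, Csize_t, Csize_t), r, ALIGN, newsz)
    ret == 0 || throw(OutOfMemoryError())
    ccall(:memcpy, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), r[], q, oldsz)
    ccall(:free, Cvoid, (Ptr{Cvoid},), q)
    return r[]
end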

Sorry, the problem description wasn't clear. Yes, this is push! and pushfirst!.

So... this is probably an orthogonal, unrelated problem, but JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:

This is still an O(1) operation (and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation)

This is still an O(1) operation

Yeah, agreed. That's what I meant by saying this is orthogonal. I shouldn't have qualified that with "probably". Sorry. :)

and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation

Huh, yeah, that's interesting. I'm interested to hear more about that, but that makes sense.

I still think we should use the technique of calling realloc, checking the alignment, and only copying if necessary. realloc is essential at large array sizes, and at that point also highly likely to be aligned.

(I'm happy to branch this conversation to another thread, though, if we want?)

So, to continue the original conversation here.

We discovered the actual problem. To summarize what we discussed offline at JuliaCon:

  • We came to believe: Arrays grow following the standard "double when full" algorithm, _up until_ a single array uses >= 1% of the computer's RAM, at which point it switches to _only growing by exactly the space needed to append the elements!_ i.e. it grows by one for each insertion if using push!, making each insertion O(N)!

However, upon more investigation with Valentin, we think that's not quite right. The behavior we noticed above should _actually_ be summarized as follows:

  • if an array attempts to grow by more than 1% of memory, _cap its growth at 1% of RAM._

And that actually seems somewhat reasonable! That change was introduced here:
https://github.com/JuliaLang/julia/commit/139a73a0e1fdd3a84ba81537ba53990566136b3c

Now, despite that being _significantly_ more reasonable than we originally thought, I still think it might be related to the performance described above. It _still_ means that above a certain size, the arrays grow by a constant amount -- just a much larger constant than growing by 1. So I _think_ that would still result in quadratic growth of the total cost to insert N elements?

So this still bears some thinking about: is that acceptable?
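Rough numbers to gauge that (assuming a 16 GiB machine purely for illustration): once growth is capped at 1% of RAM, filling the rest of memory takes ~99 more growth events, each one copying the whole array:

ram  = 16 * 2^30                       # illustrative 16 GiB machine
step = ram ÷ 100                       # 1%-of-RAM growth increment
println(sum(step * k for k in 1:99))   # bytes copied growing from 1% to 100% of RAM (~0.85 TB)
println(2 * ram)                       # rough total copied if we kept doubling instead (~34 GB)

So even with the cap, the copying cost ends up well over an order of magnitude above what geometric growth would do.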

BUT THEN, something else changed that introduced a significant slowdown in array growth between julia 0.4 and 0.5. As far as I can tell, it's a constant-factor slowdown of around 1.7x.

We see a pretty big slowdown between 0.4 and 0.5, and it's not related to the above change; the above change was introduced 5 years ago, in julia 0.3.

Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/

In particular, here's the change between 0.4 and 0.5 for push!
[screenshot: push! timing, 0.4 vs 0.5]

0.4 looks deceptively flat, but it actually has the same shape, just scaled down by a constant factor:
[screenshot: push! timing, 0.4 only]

So all of them suffer from this same problem of greater-than-O(1) average append time, but after 0.5 it got slower still. So there are two unrelated "problems" here.

Sorry, and just to clarify, the v0.4 to v0.5 regression seems to be always around 2x:

julia0.7> v05[2] ./ v04[2]
>> 6-element Array{Float64,1}:
 2.048994351707916
 1.7865544298045446
 1.1380501128347926
 0.9317471797454173
 0.9720312759644577
 3.3472547019000554

So, NEXT STEPS:

  1. We should have some kind of quick fix for this problem. I think the simplest thing would be to increase that 1% RAM cap to something like 5%. I _think_ the asymptotic complexity matters less here: it's fine if growth becomes "quadratic" when there are only going to be at most 20 more growth operations. 100 is kind of a lot, but maybe 20 is okay, or 10. We could do 10% maybe? I don't know.

  2. Find and fix the 2x array growth regression between julia 0.4 and julia 0.5.

  3. Investigate other cool growth strategies:

    • we goofed around, talking about other growth functions we could use besides the current discontinuous function: "double until some cutoff, then always add a constant amount". Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe _faster_ than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe!

what about 1.0? where is that relative to .5?

  • Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe _faster_ than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe!

Something in the Sigmoid family maybe? Probably best to write down the properties we want it to have and then derive a function that does that.

CC: @vchuravy
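As a toy illustration of that idea (every constant here is made up, and this isn't a proposal): a growth rule that stays near 2x for small arrays and eases off as the array approaches some budget might look like:

# Toy sigmoid-shaped growth rule: ~2x for small arrays, tapering toward ~1.1x
# as the capacity approaches `budget` (e.g. some fraction of physical RAM).
function next_capacity(cur::Int, budget::Int)
    x = cur / budget                                   # 0.0 (tiny) .. 1.0 (at budget)
    factor = 1.1 + 0.9 / (1 + exp(10 * (x - 0.5)))     # smoothly goes ~2.0 -> ~1.1
    return max(cur + 1, ceil(Int, cur * factor))
end

# Count how many growth events it takes to reach a 16 GiB budget from 1 KiB.
let cap = 1024, budget = 16 * 2^30, n = 0
    while cap < budget
        cap = next_capacity(cap, budget)
        n += 1
    end
    println("reached budget after ", n, " growth events")
end

Though, as discussed further down, any taper like this risks pushing the total insertion cost super-linear.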

what about 1.0? where is that relative to .5?

I didn't actually measure 1.0, because I assumed it was the same as 0.7. I can do that in a bit though. Assuming it _is_, 0.7 was the same as 0.5.

Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.

Leaving breadcrumbs to https://github.com/JuliaLang/julia/pull/25879, which might not be needed if we fix the performance of push!.

Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.

Ah, sorry, I didn't include a screenshot, but here are the full results:

Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/

[screenshot: push! timing comparison, 0.4 through 1.0]

-- _EDIT: I forgot I ran the 0.4 and 0.5 tests on julia-debug, but 1.0 on julia. After switching them all to julia, the 0.5 and 1.0 profiles are almost identical._ --

Okay, profiling update. (Is there a better place to be posting these thoughts/updates? Maybe in a performance slack channel or something? Let me know.)

So @vchuravy helped me set up a profiling script, which I then profiled on my Mac using Xcode's Instruments:

# profilegrowth.jl  (Note: this file targets julia 0.4 and 0.5)
function testfunc(capacity)
  a = zeros(capacity÷2)
  ccall(:jl_array_sizehint, Void, (Array, Cint), a, capacity)  # For the julia 1.0 test, I changed Void to Nothing
end

testfunc(10)

function test(iterations)
  for _ in 1:iterations
    testfunc(10^8)
  end
end

test(1)
test(10)

Some interesting early observations:

  1. The total time is significantly different: 2.8 secs in v0.4.7 vs 6.1 secs in v0.5.2 (around 2x).

    • ~weirdly, actually, it seems like it's not just growing the array (jl_array_grow_end) that got slower -- _everything_ seems to have gotten slower... Creating the original zeroed-out array takes about twice as long in 0.5, and allocating the array \~10x (but starting from a very small value, so probably insignificant).~

  2. Anyway, the thing we're interested in is that jl_array_grow_end went from 254 ms (7.8% of the profile) to 3.29 secs (47.4% of the profile), which seems pretty bad.

julia v0.4:
[screenshot: Instruments profile, julia v0.4]

julia v0.5:
[screenshot: Instruments profile, julia v0.5]

EDIT: And julia 1.0's profile is almost identical to 0.5's. (What I thought was a speed-up was actually just due to using julia vs julia-debug.)
~And for comparison, julia 1.0's profile looks very similar to 0.5's, percentage-wise. Everything is faster, with a total time of 5.98secs instead of 8.2secs, but the percentages are similar, and jl_array_grow_end takes 2.87secs at 44% of the profile:~
[screenshot: Instruments profile, julia 1.0]

(zero-filling in 1.0 is now _slightly_ faster than in 0.4 though, so that's cool!)


So then, digging into those profiles, from what I can tell, the extra time seems to come from the fact that 0.5 and 1.0 are actually calling memcpy or memmove, whereas 0.4 _isn't_!:

julia 0.4:
[screenshot: grow-path call tree, julia 0.4]

julia 0.5:
[screenshot: grow-path call tree, julia 0.5]

julia 1.0:
[screenshot: grow-path call tree, julia 1.0]

So actually, as best as I can tell, this might actually be a symptom of the problem Jeff originally identified above (https://github.com/JuliaLang/julia/issues/28588#issuecomment-412271105) and which I broke out into a separate issue here (https://github.com/JuliaLang/julia/issues/29526).

That is, 0.4 was avoiding the expensive operation of _actually copying_ the data, since most of the time realloc can get away with just growing the allocation in place!


Am I understanding that correctly? And if so, is that only true because this is a toy example that does nothing besides allocate that array? In the real world, would we expect memory to be more fragmented, such that this optimization matters less? I think it would probably be good to profile a bunch of normal example workflows to compare the CPU time spent in jl_array_grow_end.

This would also explain the ~2x slowdown: we're now touching all the data twice, once during allocation and once during the grow event.

Oops, I just realized I used julia-debug binaries for 0.4 and 0.5, but julia for 1.0. The results are still similar, but let me update the numbers above.

I watched a talk recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193); it compares std::vector with their in-house vector class. The relevant discussion is about 4 minutes long.

I paraphrase:

  • One of the things that happens with a vector, as you push back you have to grow and resize the array.
  • std::vector: create temporary, move the old to the new, resize... This is fine...
  • We wrote our growth capacity like this: we used realloc...
  • There are two artifacts, realloc doesn't have to allocate new memory if the memory allocation library says that you can just grow
  • Unable to get std::vector to ever match the performance.

@KristofferC Yeah, so the equivalent C++ (compiled with -O3) has a profile that looks more like the 0.5/1.0 versions:
[screenshot: Instruments profile, C++ version]

It spends ~30% of its time zeroing the initial array, and an equal 30% of time copying that data to a new array.

Although C++ is hindered by spending an extra 30% of its time zeroing out the rest of the array after growing, whereas julia chooses to leave that data uninitialized.

It's hard to directly compare the total times, since the julia runs included parsing and compiling the code as well, and I'm not sure how much of that contributed to the final profiles, but still, it seems that the C++ is actually a good amount faster despite doing the extra work that 0.5 and 1.0 do... It would be interesting to know why! :)

Here's the c++ file I used:

// Compiled via: clang++ --std=c++11 -O3 profilegrowth.cc
#include <vector>
using namespace std;

void testfunc(int capacity) {
  auto a = vector<int>(capacity/2);
  a.resize(capacity);
}
void test(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    testfunc((int)1e8);
  }
}
int main() {
  // "Warm up" To maintain correlation with the julia script which runs these to precompile
  testfunc(10);
  test(1);
  // Actual test
  test(10);
  return 0;
}

I watched a talk recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193)

This is a fun talk, thanks for sharing! Agreed that it seems quite relevant, and that your summary is good.

Hi again!
So I've been (finally) doing some more investigation into this, and want to share my thoughts.

First, a recap:

There are two problems we've discovered:

  1. Array growth is capped at 1% physical RAM, causing quadratic total insertion time complexity

    • After the size of the array reaches 1% of physical memory, each subsequent growth event will grow the array by another 1% of RAM (i.e. a constant-size growth).

    • Having any constant-sized growth instead of a scaled growth factor leads to quadratic total insertion time instead of linear (i.e. amortized linear time per insertion instead of amortized _O(1)_ per insertion; see the simulation sketch after this list).

    • We've been doing this since v0.3 (in https://github.com/JuliaLang/julia/pull/6667)

  2. Julia v0.5 stopped using realloc for growing the array, causing ~2x regression

    • Julia v0.5 stopped using realloc for _aligned data_, instead performing malloc, copy, free. This is because realloc doesn't guarantee alignment.

    • Based on the performance profiles above, this seems to cause a consistent slowdown of around 2x, especially at large sizes. Presumably this is due to the extra time spent copying and freeing, rather than just a single allocation.
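As a quick sanity check of point 1, here's a back-of-the-envelope simulation (arbitrary constants) that counts how many element copies N appends trigger under a constant-increment policy versus a 1.5x policy:

# Back-of-the-envelope model: each growth event copies the current length.
function copies(N, grow)
    cap, len, total = 8, 0, 0
    for _ in 1:N
        if len == cap
            total += len               # cost of moving the existing elements
            cap = grow(cap)
        end
        len += 1
    end
    return total
end

N = 10^7
println("constant +16k increments: ", copies(N, c -> c + 16_384))       # grows ~quadratically with N
println("1.5x geometric growth:    ", copies(N, c -> ceil(Int, 1.5c)))  # stays ~proportional to N

The constant-increment total grows roughly quadratically with N, while the geometric total stays proportional to N.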

The solution to problem (2.) is probably to follow the suggested behavior in this stackoverflow post:
https://stackoverflow.com/a/9078627/751061
Namely, to _attempt_ realloc, check the alignment (which will probably be fine), and then only do _malloc, copy, free_ if it isn't aligned. However, the standard's legalese is complicated enough to make me wary that this could be unsafe in some weird edge case... It seems worth doing, but it scares me a little.

Problem (1.), however, is straightforward. All the literature I have read on this topic says you should never grow by a constant amount; growth should always be a scale factor of the current array size. So we should fix that. The remaining question is: _what should the growth factor be?_

Picking a scaling factor

There is apparently a lot of interesting debate on this topic.

It's summarized well in the Wikipedia article on dynamic arrays, which also includes a table of the growth factors in various languages:
https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor

Implementation | Growth factor (a)
-- | --
Java ArrayList[1] | 1.5 (3/2)
Python PyListObject[7] | ~1.125 (n + n >> 3)
Microsoft Visual C++ 2013[8] | 1.5 (3/2)
G++ 5.2.0[5] | 2
Clang 3.6[5] | 2
Facebook folly/FBVector[9] | 1.5 (3/2)

Introductory CS courses explain that you _double_ an array when it needs to grow. However, some sources such as Facebook's folly claim that 2x growth is provably the worst choice, "because it never allows the vector to reuse any of its previously-allocated memory":
https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md

I think that this stackoverflow answer explained the concept well:
https://stackoverflow.com/a/1100426/751061

From what I understand, conceptually any growth factor strictly below the _golden ratio_ (~1.618) eventually allows previously freed allocations to be reused, but people prefer 1.5 for other reasons.
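A tiny illustration of that reuse argument, under an idealized allocator where all previously freed blocks sit contiguously in front of the live one (the starting size and step count are arbitrary):

# Can allocation k+1 fit into the hole left by allocations 1..k-1
# (all already freed), while allocation k is still live for the copy?
function first_reuse(r; start = 64, steps = 50)
    sizes = [start * r^k for k in 0:steps]
    for k in 3:steps
        sum(sizes[1:k-1]) >= sizes[k+1] && return k   # reuse becomes possible here
    end
    return nothing   # the old holes never get big enough
end

println("growth 1.5: reuse possible at growth #", first_reuse(1.5))   # a small number
println("growth 2.0: reuse possible at growth #", first_reuse(2.0))   # nothing

With 1.5x the freed blocks quickly add up to more than the next request, so an allocator could hand that address space back; with 2x they never do, which is the folly argument quoted above.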

Changing growth factor based on physical RAM

The approach julia took of limiting the allocation size as you approach the total system RAM is interesting. One thing we could maybe consider is _gradually lowering_ the growth factor as the array gets larger, rather than switching to a constant growth size like we do now. (For example, we'd discussed other interesting growth functions like _sqrt(x)_ in the past.)

But I think that this is probably a bad idea. Here are some reasons:

  1. Scaling down the growth factor as the array size scales up is effectively close to having a constant-size growth increment. I'm not sure, but my gut tells me that doing this by any amount would make our total insertion time complexity super-linear.
  2. The current limit is based on a percentage of _physical RAM_, but these days, with virtual memory, it's often totally reasonable to keep an array that is larger than physical RAM!

    • Especially as more computers have SSDs, swapping to disk is significantly less expensive than in the past, and features like overcommit allow you to allocate more than physical RAM without slowing down or OOMing until you actually touch the memory.

  3. No one else seems to do this, which makes me think it's not a good idea.

Next Steps

So given all of that, we have a few decisions to make. I'll list them and my proposals here:

  • Should we remove the entire "over-allocation" protection mechanism, or change it to scale down gradually? [NHDaly: I propose we remove it.]
  • What should our scale factor be? [NHDaly: I propose 1.5, but Python's very low growth rate is interesting.]
  • Related: should we allow users to configure the growth strategy in any way? [NHDaly: could be useful, but lower priority.]

How does all of that sound? :)

I've opened https://github.com/JuliaLang/julia/pull/32035, which implements the proposals above; we can use it as a strawperson to investigate the best fix.
