Godot 3.1 b60939be88d192b63798aec6e9b031d570048b8b
Windows 10 64 bits
Today I got stuck in an optimization problem: I am generating a lot of 2D geometry to be drawn often with canvas_item_add_triangle_array. This forces to use Vectors, which I decided re-use each time the geometry gets generated. But it keeps lagging at 6 ms just doing that since I started using it (without even including the call to draw)...
Then, I realized something for good about Vector:
The two following codes take the same amount of time to complete.
for(int i = 0; i < 10000; ++i) {
Vector<uint8_t> test; // Never re-use it
test.resize(5000);
test.resize(2);
}
Vector<uint8_t> test; // Re-use it
for(int i = 0; i < 10000; ++i) {
// I expected this to be free past the first iteration... it isn't.
test.resize(5000);
test.resize(2);
}
So it appears there is no way to re-use a Vector and hope it will be cost-free over repeated uses, which prevents from making my code faster. Is that really it? Is it a bug?
To describe more my problem, I tried rewriting my code in a way it never clears vectors, and instead resizes only once to the correct size before calling canvas_item_add_triangle_array... and I gained 200% speed, yet it still has to resize at the end (I am lucky enough to draw geometry that often roughly has the same amounts of vertices).
So Vector does copy on write, it's easy to pass around I guess, and it resizes by power of two. But how should we deal with this case? Are we stuck allocating memory each time and half-reimplementing capacity on top of it if we want more speed? I mean, it does sound like it has capacity somehow, but it doesn't yields the results I thought.
Note: this would also concern Line2D since it uses the same method too.
A Vector is under the hood an array of T. What we currently clear() we just call resize(0) on the underlaying CowData (previously we did the same btw).
Some suggestions:
a) Add an int parameter defaulting to 0 to clear() which gives the desired capacity after the clear. Then only destroy all elements rather than resizing.
b) Add a new method which does the above.
c) Add a new member to vector that gives it a minimum capacity (perhaps on creation?) which then does the case of (a)
Thoughts anyone? @reduz perhaps?
Another option could be for Vector to know whether or not it is filled with POD types. We can do a template specialization for that case then and upon clear()/resize() etc just drop the elements on the floor since we won't have to call any dtors anyway.
We can probably achieve this with C++03 but it'd be much prettier with 11 :)
(It would still need some kind of notion of a 'capacity' though)
Another option is actually to just use __is_pod, this is a compiler intrinsic supported by clang, msvc, and gcc. Then have a specialization for that and don't do anything with capacity at all. This would still mean a resize operation but it would be much, much faster. It may really be all that's necessary.
@hpvb here are the times I get from the same benchmark:
Debug, without your PR:
Debug, with your PR:
Release-tools, without your PR:
Release-tools, with your PR:
Note: there seem to be a bit of random at stake, because some rare times, for the same test, the total durations I recorded were up to 3 times less or more, so I wrote down the most occurring times. I assume it's due to core switches or the way memory is rearranged.
Now, to better highlight my concern, I also tested my geometry generation code with an std::vector, in release_debug build:
(for a quick look of what I'm doing, it's a code minimap renderer, which is just a bunch of coloured lines, and I use vectors for batching: https://sebsauvage.net/paste/?c29ea94d9fa3949a#MrxGYB4a5NwuONNYAEBoxcmar9sZyamLZJHnCaJWKHg=)
It really looks like the bottleneck lies in clear() and resize(), because I even tried working on a raw pointer using ptrw() and that didn't improve it.
I got another insight today by switching my Minecrafty mesher from Vector to std::vector, in particular these ones.
This mesher may generate one or more meshes per frame, and I logged the one taking the most time, in microseconds:

On the left, Godot release build using Vector to build up mesh data before creating the Mesh.
On the right, Godot release build, using std::vector instead.
As you can see, just changing the container at least DOUBLED how fast the mesher can turn voxels into mesh data in a release build, which is directly noticeable in the game without even looking at numbers. That means I can modify more voxels at once with less visual lag. I think that's enough to seriously consider an improvement to this container.
This gain is caused by the way I used the vector: I keep them around for re-use, and push_back/resize/clear are frequently called. std::vector grows a buffer more and more as capacity and reallocs less and less often. Vector appears to alloc upfront by power of two as well, but keeps calling realloc on every call and gets rid of it if size is zero. Hence std::vector calls are much cheaper.
Another note:
This container change is inside my module, so I can do my own soup easily (I basically took it away from everywhere by now).
BUT, Vector is mandatory in VisualServer APIs, and there is no intention of changing that. This means any heavy mesh generation routine (like the one I do in my voxel module but also the one in Line2D and cie.) is slowed down by the copy/allocs/resizes required to get a Vector. It's not even possible to re-use it because of that whole issue. I would be happy to use a Vector all the way if it was as fast as std::vector.
The pull request that would originally close this wasn't merged, but since https://github.com/godotengine/godot/pull/38386 was, I'll close this.
I need to profile it again some day
Does this mean that the visual server can take LocalVectors now as parameters? I can see LocalVector::operator Vector<T>() const, but it appears to just create a new Vector<T> and then resize it, which would still be allocating the whole thing. I don't see any overload for canvas_item_add_triangle_array that takes LocalVector<*> directly. Am I understanding this correctly?
Most helpful comment
I got another insight today by switching my Minecrafty mesher from
Vectortostd::vector, in particular these ones.This mesher may generate one or more meshes per frame, and I logged the one taking the most time, in microseconds:
On the left, Godot release build using Vector to build up mesh data before creating the Mesh.
On the right, Godot release build, using std::vector instead.
As you can see, just changing the container at least DOUBLED how fast the mesher can turn voxels into mesh data in a release build, which is directly noticeable in the game without even looking at numbers. That means I can modify more voxels at once with less visual lag. I think that's enough to seriously consider an improvement to this container.
This gain is caused by the way I used the vector: I keep them around for re-use, and
push_back/resize/clearare frequently called.std::vectorgrows a buffer more and more as capacity andreallocsless and less often.Vectorappears to alloc upfront by power of two as well, but keeps callingreallocon every call and gets rid of it if size is zero. Hencestd::vectorcalls are much cheaper.Another note:
This container change is inside my module, so I can do my own soup easily (I basically took it away from everywhere by now).
BUT,
Vectoris mandatory inVisualServerAPIs, and there is no intention of changing that. This means any heavy mesh generation routine (like the one I do in my voxel module but also the one inLine2Dand cie.) is slowed down by the copy/allocs/resizes required to get a Vector. It's not even possible to re-use it because of that whole issue. I would be happy to use a Vector all the way if it was as fast asstd::vector.