Godot: Vector<T> does not yield performance results that capacity usually would

Created on 3 Jan 2019  路  8Comments  路  Source: godotengine/godot

Godot 3.1 b60939be88d192b63798aec6e9b031d570048b8b
Windows 10 64 bits

Today I got stuck in an optimization problem: I am generating a lot of 2D geometry to be drawn often with canvas_item_add_triangle_array. This forces to use Vectors, which I decided re-use each time the geometry gets generated. But it keeps lagging at 6 ms just doing that since I started using it (without even including the call to draw)...

Then, I realized something for good about Vector:

The two following codes take the same amount of time to complete.

    for(int i = 0; i < 10000; ++i) {
        Vector<uint8_t> test; // Never re-use it
        test.resize(5000);
        test.resize(2);
    }
    Vector<uint8_t> test; // Re-use it
    for(int i = 0; i < 10000; ++i) {
        // I expected this to be free past the first iteration... it isn't.
        test.resize(5000);
        test.resize(2);
    }

So it appears there is no way to re-use a Vector and hope it will be cost-free over repeated uses, which prevents from making my code faster. Is that really it? Is it a bug?
To describe more my problem, I tried rewriting my code in a way it never clears vectors, and instead resizes only once to the correct size before calling canvas_item_add_triangle_array... and I gained 200% speed, yet it still has to resize at the end (I am lucky enough to draw geometry that often roughly has the same amounts of vertices).

So Vector does copy on write, it's easy to pass around I guess, and it resizes by power of two. But how should we deal with this case? Are we stuck allocating memory each time and half-reimplementing capacity on top of it if we want more speed? I mean, it does sound like it has capacity somehow, but it doesn't yields the results I thought.

Note: this would also concern Line2D since it uses the same method too.

enhancement core

Most helpful comment

I got another insight today by switching my Minecrafty mesher from Vector to std::vector, in particular these ones.
This mesher may generate one or more meshes per frame, and I logged the one taking the most time, in microseconds:

image
On the left, Godot release build using Vector to build up mesh data before creating the Mesh.
On the right, Godot release build, using std::vector instead.

As you can see, just changing the container at least DOUBLED how fast the mesher can turn voxels into mesh data in a release build, which is directly noticeable in the game without even looking at numbers. That means I can modify more voxels at once with less visual lag. I think that's enough to seriously consider an improvement to this container.

This gain is caused by the way I used the vector: I keep them around for re-use, and push_back/resize/clear are frequently called. std::vector grows a buffer more and more as capacity and reallocs less and less often. Vector appears to alloc upfront by power of two as well, but keeps calling realloc on every call and gets rid of it if size is zero. Hence std::vector calls are much cheaper.

Another note:
This container change is inside my module, so I can do my own soup easily (I basically took it away from everywhere by now).
BUT, Vector is mandatory in VisualServer APIs, and there is no intention of changing that. This means any heavy mesh generation routine (like the one I do in my voxel module but also the one in Line2D and cie.) is slowed down by the copy/allocs/resizes required to get a Vector. It's not even possible to re-use it because of that whole issue. I would be happy to use a Vector all the way if it was as fast as std::vector.

All 8 comments

A Vector is under the hood an array of T. What we currently clear() we just call resize(0) on the underlaying CowData (previously we did the same btw).

Some suggestions:
a) Add an int parameter defaulting to 0 to clear() which gives the desired capacity after the clear. Then only destroy all elements rather than resizing.
b) Add a new method which does the above.
c) Add a new member to vector that gives it a minimum capacity (perhaps on creation?) which then does the case of (a)

Thoughts anyone? @reduz perhaps?

Another option could be for Vector to know whether or not it is filled with POD types. We can do a template specialization for that case then and upon clear()/resize() etc just drop the elements on the floor since we won't have to call any dtors anyway.

We can probably achieve this with C++03 but it'd be much prettier with 11 :)

(It would still need some kind of notion of a 'capacity' though)

Another option is actually to just use __is_pod, this is a compiler intrinsic supported by clang, msvc, and gcc. Then have a specialization for that and don't do anything with capacity at all. This would still mean a resize operation but it would be much, much faster. It may really be all that's necessary.

@hpvb here are the times I get from the same benchmark:

Debug, without your PR:

  • Re-using the vector: 515ms
  • Not re-using the vector: 539ms

Debug, with your PR:

  • Re-using the vector: 39ms
  • Not re-using the vector: 65ms

Release-tools, without your PR:

  • Re-using the vector: 225ms
  • Not re-using the vector: 336ms

Release-tools, with your PR:

  • Re-using the vector: 39ms
  • Not re-using the vector: 63ms

Note: there seem to be a bit of random at stake, because some rare times, for the same test, the total durations I recorded were up to 3 times less or more, so I wrote down the most occurring times. I assume it's due to core switches or the way memory is rearranged.

Now, to better highlight my concern, I also tested my geometry generation code with an std::vector, in release_debug build:

  • With Godot Vector: 3ms
  • With std::vector: 0.18ms

(for a quick look of what I'm doing, it's a code minimap renderer, which is just a bunch of coloured lines, and I use vectors for batching: https://sebsauvage.net/paste/?c29ea94d9fa3949a#MrxGYB4a5NwuONNYAEBoxcmar9sZyamLZJHnCaJWKHg=)

It really looks like the bottleneck lies in clear() and resize(), because I even tried working on a raw pointer using ptrw() and that didn't improve it.

I got another insight today by switching my Minecrafty mesher from Vector to std::vector, in particular these ones.
This mesher may generate one or more meshes per frame, and I logged the one taking the most time, in microseconds:

image
On the left, Godot release build using Vector to build up mesh data before creating the Mesh.
On the right, Godot release build, using std::vector instead.

As you can see, just changing the container at least DOUBLED how fast the mesher can turn voxels into mesh data in a release build, which is directly noticeable in the game without even looking at numbers. That means I can modify more voxels at once with less visual lag. I think that's enough to seriously consider an improvement to this container.

This gain is caused by the way I used the vector: I keep them around for re-use, and push_back/resize/clear are frequently called. std::vector grows a buffer more and more as capacity and reallocs less and less often. Vector appears to alloc upfront by power of two as well, but keeps calling realloc on every call and gets rid of it if size is zero. Hence std::vector calls are much cheaper.

Another note:
This container change is inside my module, so I can do my own soup easily (I basically took it away from everywhere by now).
BUT, Vector is mandatory in VisualServer APIs, and there is no intention of changing that. This means any heavy mesh generation routine (like the one I do in my voxel module but also the one in Line2D and cie.) is slowed down by the copy/allocs/resizes required to get a Vector. It's not even possible to re-use it because of that whole issue. I would be happy to use a Vector all the way if it was as fast as std::vector.

I need to profile it again some day

Does this mean that the visual server can take LocalVectors now as parameters? I can see LocalVector::operator Vector<T>() const, but it appears to just create a new Vector<T> and then resize it, which would still be allocating the whole thing. I don't see any overload for canvas_item_add_triangle_array that takes LocalVector<*> directly. Am I understanding this correctly?

Was this page helpful?
0 / 5 - 0 ratings