I've been doing experiments and little projects with Godot, and this is my latest one.
To showcase why C++11 is interesting, I've used it to parallelize the shadow system in Godot.
This work is separate from the faster-culling experiments I was doing; this one uses the normal octree.
Green blocks are light culling/preparation (octree, visual server scene); red blocks are light rendering (GLES3 rasterizer).
Multithread off: 4.15 milliseconds (including initial scene cull)
Multithread on: 3 milliseconds (including initial scene cull)
That's about 60-80% faster once the initial scene cull cost is removed.
Lines of code for the multithreaded part: not even 20.
Godot's shadow system works by grabbing the lights in the view frustum, deciding whether they cast shadows, and, if they do, culling the objects for each shadow and then rendering it.
In pseudocode it's something like this:
array objects_in_view = cullview();
for (object in objects_in_view) {
    if (IsLight(object) && CastShadow(object)) {
        Light light = CastLight(object);
        array affected_objects = cull_light(light);
        render_shadow(affected_objects);
    }
}
The two expensive functions are cull_light and render_shadow. The obvious fix is to have one thread culling lights and another thread rendering them, and that's exactly what I did.
The first roadblock to the parallelization is that Godot reuses the same statically allocated cull array every time it needs to cull a light. That global data completely rules out multithreading.
To fix that, I created a new cull function (a copy-paste of the original) that takes a C++11 lambda instead of outputting the data into an array. The function is used like this:
p_scenario->octree.cull_convex_lambda(planes, [&](Instance* instance) {
    // this code will run once per every instance that passes the cull test
});
This new cull function doesn't use any extra memory: instead of storing the objects that pass the test, it just calls the lambda on each one.
The old version looks like this:
// cull is a global array
culled = scenario->octree.cull_convex(p_convex, cull, 1024);
for (int i = 0; i < culled; i++) {
    // this code will run once per every instance that passes the cull test
}
Benefits of the new function:
Code change: a copy-paste of the old one plus changing a couple of lines of code.
Using this new lambda function, I no longer depend on global state to perform culling. The cull function is completely self-contained and can be called from as many threads as you want with no issues.
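To illustrate the idea, here is a standalone sketch (not the actual Godot octree code; the Instance struct and the visibility flag are placeholders standing in for the real octree and plane tests) of what a lambda-taking cull function boils down to:

#include <cstdio>
#include <vector>

// Standalone illustration, not the real octree: the point is that the cull
// takes a callable and invokes it once per surviving instance, instead of
// writing survivors into a shared, statically sized output array.
struct Instance {
    int id;
    bool visible; // stand-in for the real convex-planes test
};

template <class F>
void cull_convex_lambda(std::vector<Instance> &p_instances, F &&p_callback) {
    for (Instance &instance : p_instances) {
        if (instance.visible) // the real version runs the octree/plane tests here
            p_callback(&instance); // no global output array, so this is re-entrant
    }
}

int main() {
    std::vector<Instance> instances = { { 0, true }, { 1, false }, { 2, true } };
    cull_convex_lambda(instances, [&](Instance *instance) {
        // runs once per instance that passes the cull test
        std::printf("instance %d passed the cull\n", instance->id);
    });
}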
Now that I have the culling, I need a way to separate culling from rendering. The normal code first does a cull and then submits the result to the GLES3 rasterizer.
To separate that, I used the awesome Moodycamel Concurrent Queue: https://github.com/cameron314/concurrentqueue . It's the fastest multithreaded queue you can find, it's a single header of standard C++, and it even works on consoles.
Any other concurrent queue would do, but moodycamel's is a great one: it's so fast that it can outpace Godot's Vector while being thread-safe.
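For reference, basic usage of the queue looks roughly like this (the single header comes from the repository above; RenderWork is just a placeholder payload type for the example):

#include "concurrentqueue.h" // single header from the moodycamel repository

#include <cstdio>

struct RenderWork {
    int light_id; // placeholder payload
};

int main() {
    moodycamel::ConcurrentQueue<RenderWork> render_work;

    // Producer side (any thread): push a work item.
    render_work.enqueue(RenderWork{ 42 });

    // Consumer side (any other thread): try_dequeue() returns false when the
    // queue is currently empty, so it never blocks.
    RenderWork work;
    while (render_work.try_dequeue(work))
        std::printf("rendering shadow for light %d\n", work.light_id);
}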
Instead of calling the functions directly (for cull and render), I store their parameters in small structs and push those into queues. The cull work doesn't need parallelism, so it just goes into a normal std::vector; the render work goes into the concurrent queue.
The code for the rendering loop now looks somewhat like this:
queue cull_work;
queue render_work;

array objects_in_view = cullview();
for (object in objects_in_view) {
    if (IsLight(object) && CastShadow(object)) {
        Light light = CastLight(object);
        cull_work.push(cull_light_struct(light));
    }
}

for (work in cull_work) {
    CullResult result = cull_light(work.light);
    render_work.push(result);
}

for (work in render_work) {
    render_shadow(work.affected_objects);
}
It's not really more complicated; it just separates the stages of the work using the queues.
At this point, adding the parallelization is easy.
We execute the first cull work item, launch an async job that performs the rest of the culling, and start rendering on the main thread.
The code looks somewhat like this:
// cull the first light on the main thread so the renderer has something to start with
CullResult result1 = cull_light(cull_work[0].light);
render_work.push(result1);

// cull the remaining lights on a second thread
LAUNCH_ASYNC(
    for (work in cull_work) { // skipping the first item
        CullResult result = cull_light(work.light);
        render_work.push(result);
    }
)

// meanwhile, the main thread renders shadows as the culled data arrives
for (work in render_work) {
    render_shadow(work.affected_objects);
}
The reason I cull the first one is that rendering can't really start until there is work to do, so I perform the first work item myself, then start rendering on the main thread while the second thread culls the remaining lights and inserts the data into the queue. It would be possible to perform the light culling in a parallel for, but I found it unnecessary.
For the async launch, I use std::async. Godot has its own threads, but they are not well suited to this kind of work; I needed something lightweight that only lives for a short time, so std::async works best.
A std::async can be launched on a second thread and returns a future. You can wait on this future if you want, or execute something else and grab the value when it's done.
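In isolation the pattern is just this (plain standard library, nothing Godot-specific; the returned value stands in for the culling result):

#include <cstdio>
#include <future>

int main() {
    // std::launch::async forces the work onto another thread right away,
    // and we get back a future for its result.
    std::future<int> culled_count = std::async(std::launch::async, [] {
        return 42; // stand-in for the light culling work
    });

    // ... the calling thread is free to start rendering here ...

    // Only block when the value is actually needed.
    std::printf("culled %d objects\n", culled_count.get());
}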
The result is a textbook producer/consumer scenario, where there is a culler thread producing data for the render thread to consume.
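Put together, the shape of it is roughly this. This is only a sketch: the work struct, the light list, and the print calls stand in for the real cull_light/render_shadow work, and they are not the actual Godot code.

#include "concurrentqueue.h"

#include <chrono>
#include <cstdio>
#include <future>
#include <vector>

// Placeholder work item; the real one carries the culled casters for one light.
struct ShadowRenderWork {
    int light_id;
};

int main() {
    std::vector<int> lights_to_cull = { 1, 2, 3, 4 };
    moodycamel::ConcurrentQueue<ShadowRenderWork> render_work;

    // Producer: cull each light on a second thread and push the result.
    std::future<void> culler = std::async(std::launch::async, [&] {
        for (int light : lights_to_cull) {
            // ... cull_light(light) would happen here ...
            render_work.enqueue(ShadowRenderWork{ light });
        }
    });

    // Consumer (main thread): render whatever has been produced so far,
    // then drain the queue once the culler reports it is finished.
    ShadowRenderWork work;
    while (culler.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        if (render_work.try_dequeue(work))
            std::printf("rendering shadow for light %d\n", work.light_id); // render_shadow()
    }
    while (render_work.try_dequeue(work))
        std::printf("rendering shadow for light %d\n", work.light_id);
}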
The gains are as you can see above: a good percentage, but not a 2x speedup, because the rendering takes longer than the culling.
No changes other than what I've described were necessary for the multithreaded shadows, and they "just work" with no major issues.
Given Godot's current architecture it's not easy to do, but it would be cool to also start preparing the actual render data while all this shadow work is happening, for further gains. Such a system would require more advanced multithreading through a task system, though.
The numbers were recorded on a Ryzen CPU, on the Godot TPS example slightly modified to add a few extra shadow-casting lights.
This is great work! While none of this will be merged for the time being (given most rendering code will be rewritten after 3.1 is out), availability of the source code for your work and implementation feedback or help would be appreciated when this happens.
@reduz Sure! The code right now is merged with a ton of other experiments, so it's not isolated, but once 3.1 is out I'll build a version with minimal code to act as a reference.