Three.js: Explore the benefits of using Uniform Buffer Objects (UBO) for WebGL2 renderer

Created on 27 Mar 2018 · 4 Comments · Source: mrdoob/three.js

While working on the WebGL2 renderer, there are two features we could use to improve performance, since we're usually CPU-bound. One is Vertex Array Objects (VAO) and the other is Uniform Buffer Objects (UBO). While VAOs can also be used on WebGL1 via an extension, UBOs can't, so we could only implement them in the WebGL2 renderer.
As it will require a lot of changes, it would be nice to benchmark first how well it performs, and especially to identify which level of granularity we are aiming for when choosing the blocks, as that will affect both the code's complexity and its performance.

Enhancement WebGL2

fernandojsg

Most helpful comment

Hi, coming from this twitter thread: https://twitter.com/mozillareality/status/978696743276699648 :) My experience with new WebGL2 features is a bit of a mixed bag performance-wise, some are specific to WebGL2 vs GLES3, but many also apply to GLES3 or GL in general.

I've been doing smaller cross-platform things, with the HTML5 side covered through cross-compiled asm.js / wasm, and rendering backends WebGL/WebGL2, GLES2/GLES3, GL3.3, D3D11 and Metal, so I have a pretty good way to compare performance and design strategies across platforms and 3D APIs.

Oryol: http://floooh.github.io/oryol/
Oryol Extension Samples: http://floooh.github.io/oryol-samples/index.html
and that C++ renderer module generalized into a simple 3D-wrapper API extracted into a single C header: https://floooh.github.io/sokol-html5/

Most features that WebGL2 adds over WebGL (new texture types, new renderable pixel formats, etc...) work well.

Things that I hoped would help were disappointing though:

VAOs:
- the trouble with VAOs is that the vertex buffer binding is part of the VAO, which leads to a combinatorial explosion for some rendering scenarios: you need one VAO for each combination of vertex layout and buffer bindings. Especially with multiple input buffers, some of them with dynamic content, this is hard to manage (maybe better in a high-level API like three.js though)
- even with this out of the way I didn't see performance advantages on native GL platforms over defining the vertex layout with granular calls
- ...of course WebGL has a high call overhead, so doing a dozen glVertexAttribPointer() calls instead of binding a single VAO should be much worse... but because the combinatorial explosion made VAOs non-viable for me, I instead implemented caching for the vertex-buffer/layout binding, which basically tracks the vertex-attrib state and only calls through to WebGL when the state changes
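
The caching idea in the last bullet could be sketched roughly as follows. This is a minimal illustration under assumptions, not Oryol's or three.js's actual code: the names `AttribGL`, `AttribState` and `VertexAttribCache` are hypothetical, and `AttribGL` only lists the three WebGL methods the sketch calls so that it can be exercised without a real context.

```typescript
// Hypothetical minimal slice of the WebGL API used by this sketch.
interface AttribGL {
  bindBuffer(target: number, buffer: object | null): void;
  vertexAttribPointer(index: number, size: number, type: number,
                      normalized: boolean, stride: number, offset: number): void;
  enableVertexAttribArray(index: number): void;
}

// Everything that defines one vertex attribute slot, including its source buffer.
interface AttribState {
  buffer: object | null;
  size: number;
  type: number;
  normalized: boolean;
  stride: number;
  offset: number;
}

const ARRAY_BUFFER = 0x8892; // value of gl.ARRAY_BUFFER

class VertexAttribCache {
  private gl: AttribGL;
  private slots: (AttribState | undefined)[] = [];
  private boundBuffer: object | null | undefined = undefined;

  constructor(gl: AttribGL) {
    this.gl = gl;
  }

  // Applies the requested state for one attribute slot.
  // Returns true if any GL call was issued, false if the slot was already up to date.
  set(index: number, next: AttribState): boolean {
    const cur = this.slots[index];
    if (cur !== undefined &&
        cur.buffer === next.buffer && cur.size === next.size &&
        cur.type === next.type && cur.normalized === next.normalized &&
        cur.stride === next.stride && cur.offset === next.offset) {
      return false; // redundant request: skip the call into WebGL entirely
    }
    // Only rebind the buffer when it differs from the last one bound through this cache.
    if (this.boundBuffer !== next.buffer) {
      this.gl.bindBuffer(ARRAY_BUFFER, next.buffer);
      this.boundBuffer = next.buffer;
    }
    this.gl.vertexAttribPointer(index, next.size, next.type,
                                next.normalized, next.stride, next.offset);
    // Enable the slot the first time it is used.
    if (cur === undefined) {
      this.gl.enableVertexAttribArray(index);
    }
    this.slots[index] = { ...next };
    return true;
  }
}
```

A cache like this is only correct if every buffer bind and attribute setup goes through it; any direct call on the context would leave the tracked state stale.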
UBOs: I tried different strategies, but they were all disappointing:
- I think this is mainly because in my code it doesn't make sense to keep uniform data in static uniform buffers across frames, instead all uniform data is expected to change each frame, the main reason is (again) combinatorial explosion for all possible combinations of shaders vs materials vs dynamic uniform parameters
- I didn't see any performance or code-complexity advantages when using UBOs compared to traditional glUniform calls for the following two scenarios
  - my first naive implementation was one UBO per shader, updated before draw calls; this was slower than glUniform calls (other 3D APIs like D3D11 have a special fast path for this scenario)
  - my second attempt was the same strategy I use in Metal (which has the fastest uniform update code path): use a double-buffered pair of big uniform buffers which record all uniform updates for a single frame, record the GL commands in parallel (actually more sparse higher-level API calls), and record in the cmdlist where the uniform data starts in the global uniform buffer. This was faster than the naive one-UBO-per-shader approach, but still wasn't as fast as just naively calling glUniform (because I think GL drivers have this case optimized to death, by basically also doing just a memcpy + record offset in an internal cmdlist)
- Instead, what I did to get rid of the many granular glUniform calls was to pack the shader uniforms into one vec4 array per 'uniform block', so that one uniform block update is always a single glUniform4fv() call. This is the fastest way I found both on WebGL and desktop GL, and it has the advantage that it also works on WebGL1 (I actually don't do the packing myself, but use SPIRVCross to convert GLSL300 uniform blocks into GLSL100 vec4 arrays, and use reflection info from SPIRVCross to build C structures... all this isn't an option for three.js unfortunately I guess).
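
To make the vec4-array packing concrete, here is a hand-written sketch under assumptions. It does not reproduce what SPIRVCross generates: the `Vec4Block` class, the `UniformGL` interface and the layout rule (each field starts on a vec4 boundary and occupies ceil(floats / 4) slots) are all hypothetical and chosen for simplicity, which wastes space for scalars. On the GLSL side the block would be declared as something like `uniform vec4 vs_params[6];`, with a mat4 rebuilt from four consecutive entries.

```typescript
// Hypothetical minimal slice of the WebGL API used by this sketch.
interface UniformGL<L> {
  uniform4fv(location: L | null, data: Float32Array): void;
}

class Vec4Block {
  // CPU-side copy of the whole block, always a multiple of 4 floats long.
  readonly data: Float32Array;
  private fields = new Map<string, { start: number; count: number }>();

  // fields: [name, number of floats], e.g. ["mvp", 16] for a mat4.
  constructor(fields: [string, number][]) {
    let vec4s = 0;
    for (const [name, floats] of fields) {
      // Each field starts at a vec4 boundary (assumed layout, see lead-in).
      this.fields.set(name, { start: vec4s * 4, count: floats });
      vec4s += Math.ceil(floats / 4);
    }
    this.data = new Float32Array(vec4s * 4);
  }

  // Number of vec4 entries the shader-side array needs.
  get vec4Count(): number {
    return this.data.length / 4;
  }

  // Writes one field into the CPU-side array; no GL call happens here.
  set(name: string, values: ArrayLike<number>): void {
    const field = this.fields.get(name);
    if (field === undefined) throw new Error("unknown field: " + name);
    if (values.length !== field.count) throw new Error("wrong size for field: " + name);
    this.data.set(values, field.start);
  }

  // The whole block goes to the GPU in exactly one uniform call.
  upload<L>(gl: UniformGL<L>, location: L | null): void {
    gl.uniform4fv(location, this.data);
  }
}
```

With a block described as `[["mvp", 16], ["color", 4], ["time", 1]]`, three separate uniform calls per draw collapse into one `uniform4fv` call covering six vec4 entries.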

Here are a few blog posts with more detailed info on my WebGL2 experiments:

here's the "before": http://floooh.github.io/2016/10/06/oryol-webgl2.html
and here's the "after": http://floooh.github.io/2017/04/04/oryol-webgl2-merge.html

Cheers!
-Floh.

floooh on 28 Mar 2018

❤7

All 4 comments

Just my 2 cents. In a commercial engine I wrote, UBOs were fine, but that was compared against plain uniforms. I never tried to compare them with uniform arrays. Regarding VAOs: I had to port from GLES2 to ES3, and I can confirm I didn't notice any gain over direct vertex buffer binding on either iOS or Android. But I guess it can depend on the number of separate buffer objects you bind.

sasmaster on 28 Mar 2018

We're refactoring our material system by introducing a node-based material system. I think it would be good to start thinking about UBOs after we finish the refactoring.