At least on NVidia GPU's, it seems that calling glUseProgram(0) and then binding another program is slow. This makes Encoder::draw() slow, because gfx_backend_gl::CommandBuffer::bind_pipeline_state() is slow, because submitting Command::BindProgram(pso.program) is slow. Caching the current shader program and not bothering to issue a bind command when it doesn't change makes the performance example about 2x faster... WHEN you remove the self.reset_state() call from gfx_backend_gl::no_fence_submit().
I assume that not resetting the state on each submission call will break some things, so this isn't a general-purpose solution without some re-architecting to make the state reset unnecessary. But, if someone is interested in OpenGL performance, this is a pretty good place to start.
Characteristic timings from the performance example, on an NVidia 660 graphics card:
No changes:
total time: 51.97ms
create list: 6.94ms
submit: 44.18ms
gpu wait: 0.85ms
No redundant program binds:
total time: 24.20ms
create list: 7.22ms
submit: 16.28ms
gpu wait: 0.70ms
Only setting necessary values in gfx_backend_gl::CommandBuffer::bind_pipeline_state():
total time: 21.47ms
create list: 6.84ms
submit: 13.86ms
gpu wait: 0.77ms
Additionally, because of the multi-layered design of gfx, effort should be made to avoid as much work as possible at as high an abstraction level as possible. It's much faster to never submit a command than it is to submit a command, check it, and then decide it's redundant.
Thanks @icefoxen !
First of all, the general approach for gfx-rs state caching is to do it on a backend level, i.e. in the CommandBuffer implementation. Our backends already do some work:
if self.cache.rasterizer != pso.rasterizer {
self.cache.rasterizer = pso.rasterizer;
self.parser.parse(Command::SetRasterizer(pso.rasterizer));
}
Of course, it's not nearly complete, and we can be much more aggressive in optimizing this.
Caching things on the Encoder level may save us a few lines in the backends, however it is arguable: backends have more information and thus can do a better job at caching what they need. At the end of the day, the Encoder-level caching may become redundant and just add overhead, if we consider the backends doing the optimization ideally. Thus, I'd propose focusing on the backends directly.
makes the performance example about 2x faster... WHEN you remove the self.reset_state() call from gfx_backend_gl::no_fence_submit()
The idea behind calling reset_state is that the backend is fully accessible to 3rd parties between the command buffer submissions. However, this function used to be generic across backends, and now it's not. So a backend is responsible for tracking the state now, and the client doesn't care if it resets the state or not for as long as the result is predictable... So let's:
If anyone got concerns or ideas about this plan, let me know!
I made a hilarious and rather brute-force test of this, viewable in the latest three commits of this branch: https://github.com/icefoxen/gfx/tree/gl-command-filter . Except rustfmt reformatted pretty much everything, so you're probably best off starting to look from here: https://github.com/icefoxen/gfx/commit/67ee0eb358017655a3f0ea9db46b438f47a30bfa#diff-6f6419fffea5639b0a6421e949e6c773R183
It's far from perfect! There's a decent amount of stuff that it just doesn't try to handle. And I wasn't sure what the Cache structure was used for so I just wrote my own instead of trying to expand it.
But amusingly, it works pretty well. In the performance example, before doing this the timing for the gfx drawing was:
total time: 37.73ms
create list: 5.02ms
submit: 32.12ms
gpu wait: 0.60ms
and drawing a single triangle emitted the following commands:
1149 glBindFramebuffer(target = GL_DRAW_FRAMEBUFFER, framebuffer = 0)
1150 glViewport(x = 0, y = 0, width = 800, height = 600)
1151 glUseProgram(program = 3)
1152 glFrontFace(mode = GL_CCW)
1153 glDisable(cap = GL_CULL_FACE)
1154 glPolygonMode(face = GL_FRONT_AND_BACK, mode = GL_FILL)
1155 glDisable(cap = GL_POLYGON_OFFSET_FILL)
1156 glDisable(cap = GL_MULTISAMPLE)
1157 glDisable(cap = GL_DEPTH_TEST)
1158 glDisable(cap = GL_STENCIL_TEST)
1159 glDisablei(target = GL_BLEND, index = 0)
1160 glColorMaski(index = 0, r = GL_TRUE, g = GL_TRUE, b = GL_TRUE, a = GL_TRUE)
1161 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 1)
1162 glVertexAttribPointer(index = 0, size = 3, type = GL_FLOAT, normalized = GL_FALSE, stride = 12$
1163 glEnableVertexAttribArray(index = 0)
1164 glVertexAttribDivisor(index = 0, divisor = 0)
1165 glDisable(cap = GL_STENCIL_TEST)
1166 glBlendColor(red = 0, green = 0, blue = 0, alpha = 0)
1167 glDisable(cap = GL_SCISSOR_TEST)
1168 glUniformMatrix4fv(location = 0, count = 1, transpose = GL_FALSE, value = {-0.090533, 0, 0, 0,$
1169 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 3)
Afterwards, the timing is:
total time: 16.33ms
create list: 8.95ms
submit: 4.87ms
gpu wait: 2.50ms
And the GL commands emitted are the ideal minimum:
1156 glUniformMatrix4fv(location = 0, count = 1, transpose = GL_FALSE, value = {-0.090533, 0, 0, 0,$
1157 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 3)
However, the "create list" timing demonstrates why I feel more of the work should be handled at a higher level when possible: checking whether or not to execute each command has overhead. The way to reduce this overhead is to emit fewer commands in the first place, which can only be done at a higher level of abstraction. The sooner we decide not to do some work, the less overhead it involves.
Of course, I suppose there's no reason one couldn't have both, but ideally there would be as little overlap as possible. I just took this approach because it seemed easier and I was curious how well it worked. :-)
Thanks for your awesome tests @icefoxen !
And I wasn't sure what the Cache structure was used for so I just wrote my own instead of trying to expand it.
We should just expand the Cache instead.
The way to reduce this overhead is to emit fewer commands in the first place, which can only be done at a higher level of abstraction.
The Encoder knows the exact CommandBuffer implementation type, so issuing any commands to it can be completely inlined - there are no v-table jumps in there, it's not expensive to call.
At the same time, pushing the internal Command enums in the internal GL list is expensive, and that's what the backend-specific caching is supposed to limit. Each call to CommandBuffer function may result in one or more internal sub-calls, so the backend-specific cache can partially save this, while the higher level can only address the whole thing.
let's not reset the state for each submission!
I was a bit too fast with this conclusion, it is incorrect. The Encoder/CommandBuffer cache is working within an assumption that the initial state IS default. If the backend doesn't reset the state, then the command buffer cache will go out of sync...
This can be worked around by tracking if a particular state was set or not by the cache, but this would complicate the implementation :(
Okay, the time has come for me to actually do this for real in a form that a) is harmonious with everything you're already doing and b) doesn't break anything. I also kiiiinda want it to be possible to backport these changes to 0.14 since that's what I'm using for ggez.
Is that possible? If so, is there a particular commit I should start my own branch from?
@icefoxen if you start from the v0.14 branch, we'll surely be able to publish the updated GL backend with a patch version bump (or a minor one - since we have them decoupled anyway).
Once #1321 is done, a new option would be to do additional state caching at the CommandQueue, which actually submits the commands to the GL driver. We have more information at this endpoint compared to command buffers so we can optimize away many of the reset commands which are currently prepended.
Most helpful comment
I made a hilarious and rather brute-force test of this, viewable in the latest three commits of this branch: https://github.com/icefoxen/gfx/tree/gl-command-filter . Except rustfmt reformatted pretty much everything, so you're probably best off starting to look from here: https://github.com/icefoxen/gfx/commit/67ee0eb358017655a3f0ea9db46b438f47a30bfa#diff-6f6419fffea5639b0a6421e949e6c773R183
It's far from perfect! There's a decent amount of stuff that it just doesn't try to handle. And I wasn't sure what the
Cachestructure was used for so I just wrote my own instead of trying to expand it.But amusingly, it works pretty well. In the
performanceexample, before doing this the timing for the gfx drawing was:and drawing a single triangle emitted the following commands:
Afterwards, the timing is:
And the GL commands emitted are the ideal minimum:
However, the "create list" timing demonstrates why I feel more of the work should be handled at a higher level when possible: checking whether or not to execute each command has overhead. The way to reduce this overhead is to emit fewer commands in the first place, which can only be done at a higher level of abstraction. The sooner we decide not to do some work, the less overhead it involves.
Of course, I suppose there's no reason one couldn't have both, but ideally there would be as little overlap as possible. I just took this approach because it seemed easier and I was curious how well it worked. :-)