Godot: Testing results for new 2D batching in the GLES2 renderer

Created on 19 Apr 2020  路  33Comments  路  Source: godotengine/godot

This issue is meant for collecting user feedback about the new 2D batching in the GLES2 renderer implemented in #37349.

The new batching can be tested with Godot 3.2.2 beta 2:
https://godotengine.org/article/dev-snapshot-godot-3-2-2-beta-2

Batching is enabled by default both in games and in the editor. See below for details on project settings to configure it.
Note: It's only for the GLES2 renderer. Nothing changed for GLES3.

I'll let @lawnjelly edit this post to add some more details/testing instructions.


3.2.2 introduces 2D batching support to GLES2. This is quite a major modification to the renderer, so we are expecting a certain amount of issues, it would be helpful to keep them here together so we can work through them.

Note that batching is purely a performance optimization. If working correctly there should be no visual differences or differences in behaviour. It only works on rects to start with but we will be adding other primitives in future (lines, polys, etc).

There are 2 types of issue you are most likely to see:

1) Performance issues

We are particularly interested if you have a project that runs slower with batching than without. If we can isolate these cases and work out what is going on, we can fix them to prevent any slowdowns in the next version (and in some cases they can be accelerated).

If you get an increase in performance, let us know! :+1:

2) Visual regressions

If you encounter visual differences between your project with and without batching, that is something we need to fix. Please let us know full details.

A Quick Guide

The settings are in ProjectSettings->Rendering->Gles2->Batching

  • use_batching - turns on and off (you can easily compare frame rates with and without). Enabled by Default
  • max_join_item_commands - the default should work fine, but it is possible that different values may give you slightly better results in your game. A value of 0 turns joining items off, which can be used to help diagnose regressions (it will usually slightly decrease batching gains though).
  • colored_vertex_format_threshold - try varying this to see how it affects the performance in your game, the best value may be game / hardware dependent, we are not sure yet.
  • light_scissor_area_threshold - if you are using lights in your game, be sure to test varying this, even if you do not benefit from batching. This can significantly reduce GPU fillrate requirements, which can be a bottleneck in some games.
  • light_max_join_items _(new in 3.2.2. beta 2)_ - the multiple-pass lighting in GLES2 normally prevents a lot of batching, however when this parameter is set to non-zero, an overlap test is done which enables joining (and hence larger batches) in many circumstances. Larger values give larger batches, but there are diminishing returns as the overlap test is not free.
  • batch_buffer_size - usually the default will be fine but you can save memory on mobile by reducing this. It effectively limits the maximum size of batches.
  • item_reordering_lookahead _(new in 3.2.2. beta 2)_ - especially in the case of single rects (like sprites), if you have interleaved sprites using different textures this limits joining. However when this option is non-zero overlap tests are used to reorder items where possible so that similar items are next to each other and can be joined. There are diminishing returns as the overlap test is not free.
  • single_rect_fallback - when switched on, batching will use the legacy fast path to batches which only contain a single rect. This is faster, however there is risk of flickering on some hardware. For this reason it is discouraged unless crucial to performance in your game.

and in ProjectSettings->Rendering->Gles2->Debug:

  • flash_batching - if batching is on, this alternates between the legacy renderer and batching on alternate frames - this can help find visual differences, which you should report here so we can fix them.
  • diagnose_frame - every 10 seconds this will print out a list of batches being used in the frame. This can be useful to maximize batching and diagnose problems.
  • use_batching_in_editor - this determines whether batching is used in the Godot IDE. Enabled by Default

Tips

The order that objects are rendered generally follows the order in the scene tree. If you can group similar items together in the scene tree (similar material, texture etc), they are more likely to be batched together, and give you a speed increase.

Batching will be broken by (among other things):

  • Texture changes (not just changes in texture coordinates)
  • Material changes
  • Changes other than rect position, transform, UV coordinates and color

The best batching project settings will depend upon the game. Be sure to try varying every one, to see the effect on performance. A simple way to measure FPS is by setting the project setting debug/settings/stdout/print_fps (although you should disable this for exports, as printing anything can affect performance itself).

Diagnose Frame

This is a very useful option in order to optimize your game (make sure to switch it off when exporting though, as it costs performance). It prints out every 10 seconds a list of the batches used in the frame. This list may be quite long for complex games, you may need to increase the buffer size in project_settings/network/limits/debugger_stdout/max_chars_per_second to read it all.
It may look something like this:

canvas_begin FRAME 13891
items
    joined_item 2 refs
            batch D 0-0 
            batch R 0-18 [0 - 143] {255 255 255 255 } MULTI
    joined_item 1 refs
            batch D 0-0 
            batch R 0-57 [0 - 143] {255 255 255 255 } MULTI
    joined_item 1 refs
            batch D 0-0 
            batch R 0-34 [0 - 143] {255 255 255 255 } MULTI
    joined_item 1 refs
            batch D 0-1 r 
    joined_item 1 refs
            batch D 0-1 PA 
canvas_end

Each joined item will contain 1 or more item references (which correspond roughly to nodes). In general the more joined in each item the better, as batching can only take place within the joined_item.
The batches are of 2 types:

  • R - Rect batches - These can be accelerated, and if a batch of more than 1 occurs it is marked with MULTI. The more and bigger the MULTIs, the better your speedup.
  • D - Default batches - These are unhandled primitives that are not accelerated. Non-rect primitives and sometimes single rects. The letters give an indication of the types of primitives in the default batch (_l - line, PL - polyline, r - rect, n - ninepatch, PR - primitive, p - polygon, m - mesh, MM - multimesh, PA - particles_).

In both cases the batch type is followed by the number of commands in the batch, and with Rects also the batch texture in square brackets (the batch tex ID followed by the godot tex ID). Different textures cannot be batched together. The numbers in squiggly brackets are colors.

Template

As with most issues, it is useful if you provide us with the following info:

  • Godot version
  • OS including version, device, and GPU (if you know it)
  • Issue description
  • Steps to reproduce
  • Minimum reproduction project

In addition if you are dealing with a slowdown, it can be helpful to provide (or link to) your diagnose_frame log.

If you are reporting increase in performance, only minimal info is needed (is totally up to you). :+1:

enhancement rendering tracker

Most helpful comment

We have been live with a 3.2.2 build of our html5 game Tower Empire for 2 days now. 3500+ players have not reported any issues, and a few have even noticed that the game is running smoother.

No issues, but I thought you wanted the good news too :-)

All 33 comments

We have been live with a 3.2.2 build of our html5 game Tower Empire for 2 days now. 3500+ players have not reported any issues, and a few have even noticed that the game is running smoother.

No issues, but I thought you wanted the good news too :-)

We have been live with a 3.2.2 build of our html5 game Tower Empire for 2 days now. 3500+ players have not reported any issues, and a few have even noticed that the game is running smoother.

That's great! :grin: Generally the lack of news has been a good thing, because it suggests for most people it looks the same as before, which is the intention (except hopefully faster!).

Just to keep a record in here, there have been a couple of visual bugs in 3.2.2 beta:

1) There was a bug between z_indexes that could cause visual anomalies (#38013 and #38014). This can be temporarily fixed in project settings by setting max_join_items to 0. This has been fixed properly in the source and will be in the next beta.
2) There is a rare visual regression that I noticed in the IDE on the ruler, sometimes 1 of the lines doesn't show. This can be shown up by have flash_batching set and use_batching_in_editor.

No one else has noticed this so far, but I have spent over a day trying to track it down, because it is something that could exhibit in a game, and I'm particularly keen to iron out any visual regressions.

3) Possibly #38075. We aren't sure yet because it is not repeatable, it may not be caused by the batching but it is possible. There are no reported visual anomalies just a warning message about the shader, so we will fix this if we can get it to repeat.

Hey guys. Just dropped by to say say this gave a huge performance boost to my project, good job 馃憤
I haven't done a comparison though since i could not open the project with older version of Godot after opening it with this version so i'm afraid i cant give exact data. Keep it up!

Hey guys. Just dropped by to say say this gave a huge performance boost to my project, good job
I haven't done a comparison though since i could not open the project with older version of Godot after opening it with this version so i'm afraid i cant give exact data. Keep it up!

That's great! :+1: You can directly compare frame rates within this version, because in project_settings/rendering/gles2/ there is a setting use_batching you can use to toggle between batching and the legacy renderer.

As to old versions, I haven't had any trouble opening later projects in 3.2 stable, I'm not sure how far the backward compatibility is meant to go, officially.

Hey guys. Just dropped by to say say this gave a huge performance boost to my project, good job
I haven't done a comparison though since i could not open the project with older version of Godot after opening it with this version so i'm afraid i cant give exact data. Keep it up!

That's great! 馃憤 You can directly compare frame rates within this version, because in project_settings/rendering/gles2/ there is a setting use_batching you can use to toggle between batching and the legacy renderer.

As to old versions, I haven't had any trouble opening later projects in 3.2 stable, I'm not sure how far the backward compatibility is meant to go, officially.

Right, guess i was just being lazy! So in a direct comparison, same scene, its about 90-100 fps without batching and 140 fps with batching on. Before i say something stupid, batching is working for UI elements right (thats pretty much what i understand by Rects)? I'm asking because the project i'm working onto is an application and purely consists of UI elements. Changing the batching settings like Buffer size or Max join commands didn't seem to have a difference.

Also, the issue with older version of Godot is related to Mono:
The specified solution configuration "Tools|Any CPU" is invalid. Please specify a valid solution configuration using the Configuration and Platform properties (e.g. MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the default solution configuration.

So the project is working fine in 3.2.2 Beta 1 but in older versions i get this error.

Before i say something stupid, batching is working for UI elements right? Changing the batching settings like Buffer size or Max join commands didn't seem to have a difference.

At the moment it works for anything that is Rects, but not for instance Polys or Lines. So a lot of UI elements do benefit, but not all. As far as 2D games are concerned, in most cases nearly everything is drawn as Rects (sprites, tilemaps, text etc).

Changing the buffer size can be quite subtle, which is why it can be worth lowering it for mobile games (as you don't lose much performance, and it takes a little RAM).

The effect of max_join_item_commands varies according to the project. There will very often be a difference between 0 and 1 (because a value of 0 effectively turns off joining of items (think nodes in the scene tree)) but aside from being set to 0, it determines how likely it is to join larger items (by changing the lookahead). This often won't have much visible effect, but in some cases, through the frame diagnostics, you might see that some larger items are not getting joined, and you would like to test the effect of joining them, you can bump up the max_join_item_commands value until they join, and see the effect on frame rate.

Most people won't need to vary max_join_item_commands, but if you want to tune your game to get the absolute best performance, it can sometimes be very helpful, and it is always worth having a play with these values before releasing your game, to ensure you are not missing out on some extra performance.

I've _finally_ tracked down bug 2, the ruler bug, after 2 and a half days of testing. :partying_face:

I'll put in a PR soon, so it should be fixed in the next version. It's probably quite rare in the wild, but it's nice to have it fixed. Next I'll have a quick look at bug 3 to see whether it is likely to be to do with the batching.

Once that is done, I have some ideas for reordering the items / commands which should lead to better batching in a lot of games that don't currently batch well (and hence better performance).

Thanks to dynamictype for mentioning this to me on reddit, he has a visual regression which I'm hoping he will post on here, with 2 screenshots (not sure which is the intended version, I suspect the more lit version):

124CzLd
Zya1WsA

Without more info I suspect this may be due to this issue #38133 which is fixed in #38137. Indeed anyone having issues with lights this is a highly likely culprit. The bug appeared as a result of joining items across different z_indices which was introduced during the testing phase to accelerate more cases.

Hello, I'm dynamictype from reddit

I've traced down the issue to shaders on objects that read the SCREEN_TEXTURE. The grain filter you seen the images above is such a shader, and if I disable it, all goes back to normal, except in other parts of the game where I use other shaders that access the screen texture.

Attached is a simple reproduction project with 2 tile maps and 2 screen reading shaders. Toggle batching (or turn on batch flashing) to see the issue.

BrokenBatching.zip

Attached is a simple reproduction project with 2 tile maps and 2 screen reading shaders. Toggle batching (or turn on batch flashing) to see the issue.

BrokenBatching.zip

Ah fantastic! That should help me fix this. This is a rarely used path so I'm glad to have a chance to test it out. :+1:

Edit : Fixed. Was a one liner, see below. Should be in the next beta. :partying_face:

Thank you for the comprehensive instructions on the setup. This is a terrific optimization with an even greater potential so I am as delighted to read the previous euphoric reports as you are.

However, I'll make a contrasting feedback here :) With some suggestions which I hope will be appreciated, knowing this is still early stage.

Despite the caution that the optimization is game specific, we expect a neutral effect as the worst case. Sadly, I consistently got a slowdown for a game which I took the time to test on a desktop Windows 10, Android and iOS. The slowdown is of varying degree depending on the parameters.

The game is a 2D non-linear runner with very few repeating elements (except for particles), and the repeating elements have custom shaders reading VERTEX so I guessed initially that we always hit the non-optimized cases. To check it out, the diagnose frame feature was my friend. I really appreciated it since it is the only window of visibility I have on what's going on in practice. I had to read the source to make sense of the different numbers, but that was a joy, so I think am almost ready for a meaningful exchange on the topic :-)

To cut a long story short, my initial guess is confirmed. Of all the item joins, a few result in batches, and out of all the possible batches, only 2 exceed 3 rects. Here are 2 sample frames which illustrate the situation:
frame1.txt
frame2.txt

I have varied the batching parameters in the following ranges:

  • Max Join Item Commands: 16- 128
  • Colored Vertex Format Threshold : 0 - 1
  • Light Scissor Area Threshold: left at 1, did't touch it
  • Light Max Join Items: 0 - 64
  • Batch Buffer Size: 16k - 32k
  • Item Reordering Lookahead: 0 -32

I tested what I thought are meaningful combinations but got slowdowns of various degrees ranging from 2-3 fps down to 15 fps for motion critical sequences, compared to batching turned off. The tests on Android and iOS were done with release builds (non-debug) to max the speed, and showed comparable slowdowns. There is only one light, visible at times, but it is small and very localized in the scene.

Given these results, this raises a few points which deserve to be discussed or at least thought of.

  1. Should GLES2 batching be enabled or disabled by default. While I am fine with any choice provided this is well documented, the current state of affairs is such that enabling it by default may result in a slowdown. Which raises the second question why and whether we can alleviate that.

  2. It is understood that batching has a command pre-processing cost, look-ahead cost, overlap test cost & joining cost, the sum of which is expected to be exceeded by the benefit gained from batching and sending bursts to the GPU instead of individual commands. Yet, I expect to be able to configure the setup so that it is effectively disabled by "turning off" all parameters in order to prevent any speedup or slowdown. So the second question is this: what are the "off" values of all these parameters, when batching is still turned "on" to effectively reduce the batcher to a no op?

The answer isn't obvious so I walked with this in mind through the parameters again and set them to the following:

  • Max Join Item Commands: 16
  • Colored Vertex Format Threshold : 1
  • Light Scissor Area Threshold: 1
  • Light Max Join Items: 0
  • Batch Buffer Size: 16k
  • Item Reordering Lookahead: 0

The provided frames traces 1 and 2 are captured with these settings. I expected to minimize the cost of most batching checks while still benefiting from joining, but sadly this still resulted in a slowdown. The game is as it is, unhandled cases only, yet a noble goal of this effort would be to make the batcher neutral when the parameters are set to their "off" values. Meaning, get rid of the overhead and have an execution path which is a defacto "fallback" to the batcher turned off. Not sure this is possible but this is worth considering.

3) The speedup potential of the approach is undeniable. One of the other areas where it would really shine and will be much appreciated is CPUParticles. If that happens, CPUParticles won't be 2nd class citizens any longer compared to GLES3 which is often preferred due its support of GPU particles. GLES2 will thus become the renderer of choice for mobile, both Android ans iOS. Today iOS is better off with GLES3 while GLES3 for Android is a gamble.

Last but not least, I imagine the technique to be quite efficient for some scenes and inefficient for others. If we can't get rid of the constant overhead when batching is "on" then having a dynamic "on/off" API would be a must have in order to avoid the performance penalty for non-optimized scenes.

Finally a small dev wish: it would be nice to add to the Diagnose Frame output a summary/stats of the content per frame: total commands count, joined commands count, number of batches and similar. Currently it is kind of hard to follow the output which may span across several screens for a frame and one can't get the exact picture despite the reported number of Refs.. Counting the number of batches manually is a no go :)

Thanks again for this wonderful feature! I am sure plenty of games can benefit from it already.

@oeleo1 I'll try and answer in more detail later, but one point that may be particularly important in your case : we took the opportunity (I think introduced in beta 3) to default the single_rect_fallback to false, to discourage the use.

Single Rect Fallback (uniform draw method)

single_rect_fallback is the default uniform method of drawing rects in the legacy renderer, which is approx 2x as fast, however it seems to be the cause of flicker on some nvidia hardware. So given that we are hoping _for most people_ batching will give a gain of speed, we thought this might be an opportunity of taking the sting out of losing performance elsewhere.

The break-even point where a batch becomes cheaper than the single_rect_fallback method is around a batch size of 2. So most people who notice this decrease will be people who are getting very little batching.

So in order to get a level playing field here, you either need to set single_rect_fallback to true in the batched run, or set use_nvidia_workaround to true in the non-batched run. This will make it a bit easier to compare the two methods, as this effect will probably be much larger than any fixed batching costs.

Edit

Actually this brings up an important point that hadn't occurred to me earlier - if the flickering problem is only occurring on desktop, it might be an idea for us to hard code it to allow the fallback method on e.g. Android and iOS etc, which would allow the speedup at least on those platforms. Will consult with @clayjohn and @akien-mga . :+1:

Custom shaders

In your case it does sound like your custom shaders are preventing a lot of batching. I have plans to work around this, at least in the case of colors, but adding an additional vertex format containing 2 colors separately, the final_modulate and the vertex color. For transforms the situation is more tricky simply because the transform data is so large and putting it into the vertex format may only be useful in certain circumstances.

Realistically at this stage we are concentrating on batching the common cases (the low hanging fruit). It is possible we will get more working as time goes on. Anyway let us know how you get on with the level playing field! :grin:

Defaults

Should GLES2 batching be enabled or disabled by default. While I am fine with any choice provided this is well documented, the current state of affairs is such that enabling it by default may result in a slowdown. Which raises the second question why and whether we can alleviate that.

Initially we went with off by default, however (especially for test and beta builds) this changed to on by default, because a lot of testers won't touch settings, so we get 10x as much chance of finding regressions by defaulting to on.

Really the question probably has to be framed in terms of the single_rect_fallback option. If this is switched on, then there is unlikely to be a huge drop in performance compared to the legacy renderer, even worst case. There were some cases that we looked at with drop in performance, but as far as I remember they were pretty much down to either the use of, or not, of the single_rect_fallback.

Whether single_rect_fallback should default to on is controversial. It is a 'damned if you do, damned if you don't' scenario. People typically don't expect the possibility of a drop in performance moving to a new version, however in this situation the uniform draw method previously used was cheating, in that it doesn't work on all hardware. You end up with the situation of people developing games on a machine that it works fine, exporting it, then finding their players using nvidia hardware are getting graphic anomalies.

Personally I think on balance it probably is better to keep the new single_rect_fallback default as off, and then make it clear to people who are suffering that they can regain the speed lost by setting it to on (but that this comes at cost of compatibility). But this decision has just been consensus between me, clayjohn and akien. If a convincing argument could be made for the other we would swap it.

Minimum settings

It is understood that batching has a command pre-processing cost, look-ahead cost, overlap test cost & joining cost, the sum of which is expected to be exceeded by the benefit gained from batching and sending bursts to the GPU instead of individual commands. Yet, I expect to be able to configure the setup so that it is effectively disabled by "turning off" all parameters in order to prevent any speedup or slowdown. So the second question is this: what are the "off" values of all these parameters, when batching is still turned "on" to effectively reduce the batcher to a no op?

The closest to an 'off' setting would be:

max_join_item_comands 0
colored_vert_format_threshold 1.0
light_scissor_area 1.0
light_max_join_items 0
item_reordering_lookahead 0

However this is not off, it will still give you batching of tilemaps and text (that do not change color). Most of the options are to do with joining items (think nodes), rather than commands within items (think nodes that draw multiple things).

If you are using lights I would always suggest experimenting with light_scissor_area (especially try it at 0.0). This is because batching items makes it more difficult to cull them to the light area, and there is a danger of using unnecessary fill rate. Light scissoring will prevent this.

ECU

Last but not least, I imagine the technique to be quite efficient for some scenes and inefficient for others. If we can't get rid of the constant overhead when batching is "on" then having a dynamic "on/off" API would be a must have in order to avoid the performance penalty for non-optimized scenes.

Again, try with the single_rect_fallback. Aside from this I'm not sure we've seen good evidence of significant performance penalty - I am particularly interested in these worst cases though. I have toyed with the idea of an on the fly AI performance tuner (like a car ECU) however we have to weigh up the potential benefits against making the code more complex (it is already quite complex).

Diagnostics

Finally a small dev wish: it would be nice to add to the Diagnose Frame output a summary/stats of the content per frame: total commands count, joined commands count, number of batches and similar.

Yes totally agree, it's something I've noticed myself and intended to add, perhaps with optimization suggestions based on this data (similar to the ECU idea but kept to the diagnostics section of the codebase). :+1:

single_rect_fallback is the default uniform method of drawing rects in the legacy renderer, which is approx 2x as fast, however it seems to be the cause of flicker on some nvidia hardware. So given that we are hoping _for most people_ batching will give a gain of speed, we thought this might be an opportunity of taking the sting out of losing performance elsewhere.

I don't fully get that. Is the single_rect method used by the batcher when it is batching or only when it fails batching. If not used and the batcher fails batching, what method is it using? That's crystal clear to you, but quite shady for us common mortals so some additional effort should be made to explain that further in better terms.

Let's take it backwards, and explain that in terms of consequences. What's the catch when the option is on and vice versa, what's the catch when it is off? From what I understand, it better be on, unless one has a faulty nvidia hardware. But that's irrelevant for iOS, Android and my desktop where I do have an Nvidia graphics but never had any flickering. So subordinating that on/off choice to some faulty Nvidia hardware and imposing it on all systems seems an unreasonable thing to do, even if Nvidia was a Godot gold sponsor, which it isn't...

single_rect_fallback is the default uniform method of drawing rects in the legacy renderer, which is approx 2x as fast, however it seems to be the cause of flicker on some nvidia hardware. So given that we are hoping _for most people_ batching will give a gain of speed, we thought this might be an opportunity of taking the sting out of losing performance elsewhere.

I don't fully get that. Is the single_rect method used by the batcher when it is batching or only when it fails batching. If not used and the batcher fails batching, what method is it using? That's crystal clear to you, but quite shady for us common mortals so some additional effort should be made to explain that further in better terms.

You are right, and I'm intending do a document to help explain how to set up batching best for your project, as it is quite a complex beastie. Luckily the defaults should work to improve things for most people, they can just get slightly better gains sometimes by changing the parameters.

Batch sizes

Batching rects results in batches of size varying from 1 to lots. By default it draws them all using the same indexed primitive method, uploading a vertex buffer each time (this is not the most efficient, and earlier versions had less vertex buffer uploads, but it has proved far easier to get the whole system working, for now).

Uniform draw method

HOWEVER, the legacy renderer by default _didn't_ do this uploading of a vertex buffer, it used a 'trick' to send the rect position and size via uniforms instead of a vertex buffer. This has proved faster, approx 2x as fast, presumably because uniform changes are cheaper then vertex buffer uploads. The only snag is, that while this should in theory work all around, it seems to have exposed what our best guess is a driver bug on some nvidia hardware.

So the batching has a choice when it has a batch containing a single rect - should it render using the normal indexed primitive + VB method for batching, OR should it treat it as a special case and revert to rendering it via the old uniform method (the fallback method)?

It works for me...

Let's take it backwards, and explain that in terms of consequences. What's the catch when the option is on and vice versa, what's the catch when it is off? From what I understand, it better be on, unless one has a faulty nvidia hardware. But that's irrelevant for iOS, Android and my desktop where I do have an Nvidia graphics but never had any flickering. So subordinating that on/off choice to some faulty Nvidia handrware and imposing it on all systems seems an unreasonable things to do, even if Nvidia was a Godot gold sponsor, which it isn't...

The problem is, we have no good way of knowing whether the hardware is going to exhibit the flickering or not. As I noticed in my earlier comment, this might be possible to make this decision in advance for Android and iOS (I don't think we've had reports of flickering on them, but we'd need to check - some android devices use nvidia chipsets).

But outside of platforms that we decide at compile time don't cause a problem, we have no easy way of auto-selecting the best method. We could probably get a string from the OpenGL driver for the device and try and match it to a little database of known hardware causing problems / or known to be safe. But this isn't ideal - it could even vary by driver version, we just don't know.

The problem many seem to be missing as regards the nvidia_workaround (and thus single_rect_fallback) is that, just because it works for you on your development machine, doesn't mean it will work for players who download your game onto their hardware. The first you'll hear is a bunch of bad reviews...

Strategies for driver bugs

This is a common issue for driver bugs. There are many instances where developers have to make workarounds for the most buggy devices, and everyone has to suffer as a result. The proper way around this is to prevent it happening in the first place, by exhaustive conformance testing and certification for hardware and drivers. We are all hoping this situation will improve with Vulkan. Other than that the choices are rather limited and include:

  • Pretend the bug doesn't exist and just have it not work on those devices (not a good option if it is a large percentage of devices)
  • Workaround the bug and have working devices suffer the workaround version
  • Have some kind of hardware database
  • Have a switch at runtime in the game options

Switching at runtime is actually very viable (from my side), I'll have to consult with some of the others as to how this mechanism could work as I'm not familiar with these bits of godot (maybe the use could change project settings at runtime, and we could have a notification sent to the renderer).

Anyway, let us know how your performance comparisons are with the single_rect_fallback enabled comparing batching and non-batching. Also it is possible you will find light_scissoring may improve things with a value closer to 0.0, but it depends on your scene.

Excellent! I get it now. I think the terminology is unfortunate though so would recommend using more accessible words like for example fast_rect_draw instead of single_rect_fallback which may be unsafe, and safe_rect_draw for the nvidia case.

In my tests I had the single_rect_fallback set to off indeed. A quick test with switching it on melts down the slowdown to an unnoticeable amount, if any. Great tip, and I'd even say crucial for performance when batching is on. This is probably the single most important parameter in the list.

The problem many seem to be missing as regards the nvidia_workaround (and thus single_rect_fallback) is that, just because it works for you on your development machine, doesn't mean it will work for players who download your game onto their hardware. The first you'll hear is a bunch of bad reviews...

Hopefully I am not that naive (yet, but with age that may change :) ..I realize I didn't clarify that developing mainly on Windows to distribute on mobile is what we do. So using the fast_rect_draw set to on is a sensible thing to do for us if you don't improve on it in the meantime. Thanks!

Let me start off with saying that this is great! I just started using Godot a few weeks ago and I can already confidently say that it was the right choice for my project :)
Even more so with this change, since most of my levels will be built on layered tilemaps and should benefit a lot from batching!

Issue description
While trying out v3.2.2 beta3 I noticed on one of my Android devices that in some instances there are visible 1px gaps between tiles in my tilemaps, see the following screenshot. When disabling batching the gaps are gone.
batching_bug_minimal

  • The gaps disappear when switching of batching.
  • The gaps only appears for tiles that are created with an offset inside a texture, i.e. a 32x32px tile created from a 32x32px image will not have this problem, but a 32x32px tile created at row 256 in a 32x320px image will have the problem.
  • The gaps are only visible on my android tablet, not my PC, even if I set the window size to the same resolution as my tablet.
  • Filter is disabled for my tile texture, pixel snap is enabled in project settings.
  • Changing the project window size has an effect on the artifacts, e.g. 320x224 shows the problem while 320x400 does not.

Godot version: 3.2.2 beta 3 official
Device/OS: Galaxy Tab A (SM-T510), 1920x1200px resolution, Android 9, GPU: Mali-G71 MP2

Steps to reproduce:

  • Create a new project.
  • Set project resolution to 320x224, stretch mode to 2d, stretch aspect to keep, enable pixel snap
  • Add a new tilemap.
  • Add a new tileset to the tilemap.
  • Using the texture from the attached minimal example project, create a new 32x32 tile at position x:256 y:0
  • Draw a few tiles
  • Run the project on Galaxy Tab A

Minimal project to reproduce the problem:
batching_bug_minimal.zip

I'll be happy to run additional tests to narrow it down further :)

Am investigating this. It does sound like a precision issue somewhere, perhaps introduced by the 2d scaling. There are multiple possibilities. I'll try and get it running on my android device, but if it doesn't exhibit the problem on mine, it might be helpful for you to do some tests.

There seems to be 2 obvious possibilities (or both!):

  • The precision is affect the position of the vertices
  • The precision is affecting the UVs

You could find out whether it is the UVs (and not the positions) by filling the texture atlas with the same brown color (and using space for the tiles in between, rather than a tile with zero alpha). If the gaps are eliminated then it will suggest the UVs are the problem, and not position. If they are not eliminated it will suggest position is the problem.

If it is the position, it may be that pixel_snap is indirectly triggering it, so it would be nice to test with pixel_snap switched to off. Come to think of it, I'm not sure how exactly pixel_snap would be expected to work in any sort of useful way with 2d scaling.. :flushed:

As far as position is concerned one possibility might be the difference between the uniform buffer draw and the vertex buffer draw methods. It would be nice to test the legacy path with nvidia workaround set to true, however that is hard coded to false when GLES_OVER_GL is not set (which will happen for android export), so this wouldn't be easy to test like-for-like without a custom build of the template.

Edit

Actually re-reading your info, it already sounds likely to be a UV issue (I'm not 100% sure though). If making the tile texture a single 32x32 cures the issue, the positions used for drawing should be the same as when the problem exhibits.

I'll have a look at the shader and see if I can work out what might be happening. It is intriguing that there is any difference between what occurs with the 2 draw methods (maybe if it is precision in the UVs, this comes from the differences from the shader uniform method versus the normal drawing method).

It is also possible that this is causing not enough precision in the fragment shader for the UVs. You might find that if you make your tilemap texture a power of 2 it cures the problem (i.e. have a 256 height texture and 8 segments, or a 512 height texture with 16). Especially with filtering turned off, you need to be bang-on with the texel coordinates, taking into account that texels are read from texel centres, not texel corners (this could quite easily be a problem, I'll see if I can work out whether the tilemap is accounting for that).

I just had a 10 minute slot to try out a few things. Apologies if I've missed something in your reply, I'll read it again and more carefully later today :)

Filling the whole texture atlas with the same color resolves the gap problem, confirming that it is a UV issue?

Changing the size of the texture atlas image to a power of two (w=32px, h=512px) also solves the problem. And come to think of it, this is probably good practice anyway!

Filling the whole texture atlas with the same color resolves the gap problem, confirming that it is a UV issue?

Yes I now think this is likely. You could also try filling with another color, like blue, you might find that the 'gap' edge is now blue.

Changing the size of the texture atlas image to a power of two (w=32px, h=512px) also solves the problem. And come to think of it, this is probably good practice anyway!

Yes definitely. Especially mobile some texture calculations may even be done as fixed point rather than float (similar considerations are important for texture wrapping hence power of 2 textures etc).

It does seem likely that this issue is only indirectly related to batching, it seems more to be triggered by the differences between the regular drawing methods and the uniform method. As such it would probably occur in some other conditions aside from tilemaps.

There are also these lines in the shader which suggest this may have been a problem before:

#ifdef USE_PIXEL_SNAP
    outvec.xy = floor(outvec + 0.5).xy;
    // precision issue on some hardware creates artifacts within texture
    // offset uv by a small amount to avoid
    uv += 1e-5;
#endif

This bodge may not be the correct solution and may be causing these problems. It may be an attempt to work around the texel centric basis of texture filtering (based on the centres rather than the corners).

I'll need to look at the git history for when and why this was introduced, but I suspect it is to deal with texel filtering, and strictly speaking should be a half texel on each axis rather than an arbitrary value.

Also for tilemaps this range needs to be carefully done such that e.g. on a 32 pixel image pixel zero reads from texel zero and pixel 31 reads from texel 31 (NOT 32).

Is likely related to this issue #26467, and this PR #26567.

Some more input on this, not sure if it's helpful:

  • Filling the texture atlas with a blue color outside the brown tile indeed colors the 1px gap blue.
  • I can only make the gap appear at the bottom edge of my 32x32 tiles, i.e. the last row of the tile is sampled outside the texture. I tried to play around with different resolutions but could never get the top edge to be outside the tile texture.
  • When disabling pixel snap, the problem gets worse, meaning that more gaps appear.

That is useful, it confirms the theory and suggests the problem is occurring on one side of the range (I'm presuming the bottom it it draws with y at the top of the texture, haven't got time to check this at the mo).

It may be worth creating a fresh issue for this, for future reference, and given that it doesn't appear to be directly related to the batching (as it is likely to occur whenever the uniform method is not used, which may be phased out). Feel free to create a new issue and copy paste our posts, and we can go from there. :+1:

Also @clayjohn will probably want to come in on the discussion, we briefly mentioned it yesterday and he did the original PR with the UV tweak.

@lawnjelly yes the UV offset hack...

Honestly we never fully figured that one out. Introducing the small amount of error seemed to work, but it didn't really make sense why. I know we discussed it back then, but I can't remember exactly why we went with the hack.

Awesome, thanks for all the quick responses! I have created a new issue: #39096

Oh, and since reports on performance improvements were requested in the original post:

My original plan was to rely quite a bit on Light2D nodes for setting the mood in my project. I had abandoned that plan since I saw a huge hit on frame rate, even though I'm running at a low resolution (~320x240) and with just a couple of 64x64px lights with shadows disabled.

With batching I can now run 32 overlapping 128x128px lights at stable 60fps on my low-end android tablet. Without batching this number drops to around 9fps.
I'm very happy about this change, great work! :)

Batching seems to break some tiling textures if used in both sprites and texturerects
https://github.com/godotengine/godot/issues/39766

In the 3.2.2-stable tag, there's no GLOBAL_DEF for light_scissor_area_threshold.

I don't see it in the Project Settings either. Is this setting still relevant?

In the 3.2.2-stable tag, there's no GLOBAL_DEF for light_scissor_area_threshold.

I don't see it in the Project Settings either. Is this setting still relevant?

Yes sorry about that. We changed the location of the settings since some of the betas, to avoid duplication for GLES3 batching. They are now in rendering/batching. This one is now rendering/batching/lights/scissor_area_threshold.

The IDE still displays old settings that are stored in your godot project file even if they are no longer used. This can be confusing, I'll ask on IRC what the thoughts are on hiding settings that are no longer used.

If you want to remove any of these remaining settings, if you click the 'default' swirly arrow, then resave, afaik they will disappear next time you load the project.

@lawnjelly thanks for the clarification!

I have been working in my project using 3.2.1.stable.official for several months, since atlases didn't work properly in this version I decided to download 3.2.2.beta3.official and I kept working in this version. Holy moly, what an improvement in performance! my god!!!!, my platform 1280x720 game finally hits 60FPS in old low-end Android devices, I don't really know what you did but it's awesome! thank you very much! this "batching thingy" needs to be taken more into account, please rise its priority, it really really worth it! thanks again!

Thanks everyone for the testing reports. Now that 3.2.2 is released and the feature enabled by default, let's close this and have further bug reports as individual issues (if you want to report a non-bug scenario though, feel free to comment here :)).

Was this page helpful?
0 / 5 - 0 ratings