While @nkast has done some great work with his low level optimizations there is still more to be done.
For example all of these functions in SpriteBatch:
void Draw(Texture2D texture, Vector2 position, Color color);
void Draw(Texture2D texture, Rectangle destinationRectangle, Color color);
void Draw (Texture2D texture, Vector2 position, Rectangle? sourceRectangle, Color color);
void Draw (Texture2D texture, Rectangle destinationRectangle, Rectangle? sourceRectangle, Color color);
All redirect to the same DrawInternal method which does a common set of optimizations. We do this mainly to make the code easier to read and maintain as well as to void bugs. This however comes at the expensive of performance.
Take void Draw(Texture2D texture, Vector2 position, Color color) for example. This draw doesn't need rotation, doesn't need flipping of sprites, doesn't need UV calculations. If this function created its own SpriteBatchItem internally it could do so much more efficiently than DrawInternal does now.
Same goes for the other specializations of Draw and DrawFont in SpriteBatch. This matters when games are drawing 1000s for sprites per frame.
While this does mean a lot more duplicate code... it also means much more performance. We can deal with the issue of duplicate code via more unit tests. Performance here is more important.
I agree with all of the above. Actually I've done a few test recently to do exactly that. I pushed two more commits into #5267 the other day, the one called Refactor SpriteBatchItem.Set() is towards this goal. The performance improvements were negligible, the actual goal was to move calculations out of .Set().
While this does mean a lot more duplicate code... [...]. We can deal with the issue of duplicate code via more unit tests.
One pattern I notices was that we can redirect all Draw(Texture2D texture, Vector2 position, ...) into the similar Draw(Texture2D texture, Rectangle destinationRectangle, ...) without too much overhead.
Another thing that became obvious during this fun process :dizzy_face: was the need for a standardized benchmark that can run a number of tests and compare them against the baseline.
One pattern I notices was that we can redirect all
Draw(Texture2D texture, Vector2 position, ...)into the similarDraw(Texture2D texture, Rectangle destinationRectangle, ...)without too much overhead.
Well our goal would be zero extra overhead. I think sprite rendering performance is important enough that we would duplicate a function just to remove even a single instruction worth of unnecessary code.
I think we start by duplicating the code in DrawInternal to every single Draw in SpriteBatch. Then we optimize them for each particular version of Draw and its arguments. Then if we see two Draws that are 100% identical we can consider sharing that code.
When you render 1000s of sprites a frame to hit 60fps... just a little less code matters.
It is less apparent on a modern 4.0 GHz PC... but it certainly hurts lower end mobile devices and consoles.
a standardized benchmark that can run a number of tests and compare them against the baseline.
I assume you mean performance here. Yeah we do need some sort of basic performance benchmarks.
Well our goal would be zero extra overhead. I think sprite rendering performance is important enough that we would duplicate a function just to remove even a single instruction worth of unnecessary code.
Sure. Although during my tests I found myself keep copying & moving code and though it might be helpful to have this redirection, at least in the early stages of refactoring & refining.
I think we start by duplicating the code in DrawInternal to every single Draw in SpriteBatch. Then we optimize them for each particular version of Draw and its arguments. Then if we see two Draws that are 100% identical we can consider sharing that code.
For the sort version I implemented them from scratch, because I wanted to understand what exactly was going on. My thinking was that as we move to the more complex methods, the first code should be identical and just apply extra transformations. As I moved to the more complex methods I found that even the position methods need to keep a Rectangle to apply any transformations.
Bad news: I lost that branch because I never published it in github.
So, basically, I propose to implement the destinationRectangle methods first, starting from the short/simple ones. The position Methods will call Draw(texture, new Rectangle(x,y,texture.Width,texture.Height, ...);
Once we are happy with the results we expand and optimize the position methods.
I can send a PR within the next week for the simple Draw Methods for further discussion.
@tomspilman Also I'd like your opinion on something like this:
https://github.com/MonoGame/MonoGame/compare/develop...nkast:tncSpriteBatchDXOpt
I am using a VertexBuffer and copy the Vertices directly into the mapped memory. +12% perf.
I'm trying to combine the two in this PR but still needs work and I'm not so sure about the results. It's overly complicated (PlatformMethods calling back InternalMethods in the main .cs) and the GL version will certainly take a hit in performance.
https://github.com/MonoGame/MonoGame/compare/develop...nkast:tncSpriteBatchMappedMemory
When you render 1000s of sprites a frame to hit 60fps... just a little less code matters.
And of course this give you a little extra time in each frame to do a better AI.
and the GL version will certainly take a hit in performance
@nkast Do you mean if you make it use a VertexBuffer too? Why not make the changes to the DX implementation and keep things as is for GL? It's not like there's a lot of maintenance for SpriteBatch so having more code fragmentation for better performance seems like a nice tradeoff for something used so often. #speedoverbeauty
Great job on the optimizations you've done so far!
I am using a VertexBuffer and copy the Vertices directly into the mapped memory. +12% perf.
I have a branch with just that same optimization from 2 years ago... there is a complication with that.
Our OpenGL implementation of DrawUserPrimitive was actually faster than using a DynamicVertexBuffer when I tried this before. In particular on iOS.
So we would need to maintain two paths to do that which sounded unappealing at the time.
Now we could add some sort of SPRITEBATCH_USE_BUFFERS macro around this new path with the fallback to DrawUserPrimitive. Then on a platform-by-platform basis we define SPRITEBATCH_USE_BUFFERS if that is faster on this platform.
We can't improve one at the expense of another, especially when the other
is the API used on mobile devices that are in most need of the performance
improvements.
Ok, I remembered now that I wanted to convert .Set() to accept a ref rectangle destRect instead of float x, float y, float w,float h. So it made sense at the time to reroute all Draw(..., position, ...) methods through Draw(..., dest, ...), but the problem is that Position is float and Rectangle will convert it all to ints.
so scrap that.
Ok, I remembered now that I wanted to convert .Set()
Really we should inline all of that. I don't think there is any benefit to share any code in those Draw methods. Inline all of it, remove any extra work that isn't required by that draw.
I have a branch ready where I refactored the simple versions of Draw().
I would like to wait for #5267 and move some of it's optimizations on this new branch for a fair comparison.
https://github.com/MonoGame/MonoGame/compare/develop...nkast:tncSpriteBatchDraw
I would anyway suggest thinking also on SpriteBatch instanced rendering, I saw an example of instanced sprite batch rendering under DirectX11.
instanced rendering
I think that won't provide much benefit here because we're only drawing rectangles so the number of vertices is very low. Though we would transfer some minor computation to the GPU (rotation, scaling). This would be a nice experiment. Note that when a custom effect with a vertex shader is passed we can't do instanced rendering, so we would need to maintain two paths.
Since we are on the topic of amending SpriteBatch's API. What do you guys think of the following?
Support for passing in a model-to-world matrix by reference using the ref keyword when enqueuing a sprite. This matrix would be used to transform the 4 vertices of the quadrilateral representing a sprite from model space to world space. The reason I suggest this is because it is likely that the world transform of a sprite (position, rotation, scale, origin, etc) doesn't change every frame and thus doesn't need to be recalculated every frame.
Note that when a custom effect with a vertex shader is passed we can't do instanced rendering
wait why i was just trying to get that to work a bit ago ?
anyways.
I once petitioned to shawn hargraves this should be a internal thing for xna 5.0. Since SpriteBatch is not only aimed at new people but ease of use. Errr that i shouldn't need to create my own call, to spin a image in place simply as one would expect, like so. Or for that matter post code repeatedly as i got sick of explaining it.
public static void DrawToScreen(Texture2D t, Rectangle scrdraw_rect, Rectangle texture_rect, float rotation, float scale)
{
scrdraw_rect.Width = (int)(scrdraw_rect.Width * scale);
scrdraw_rect.Height = (int)(scrdraw_rect.Height * scale);
Vector2 trectoffset = new Vector2(texture_offset.X + texture_rect.Width * .5f, texture_offset.Y + texture_rect.Height * .5f);
BgEngine.Spritebatch.Draw(t, scrdraw_rect, texture_rect, SpriteColor, rotation, trectoffset , flip_effect, 0);
}
That said i think methods with new functionality should go in a separate class similar to sprite batch. Its pretty much no overhead to just do it yourself as is, but it would be a nice to have a little more basic commonly needed or useful functionality along side spritebatch. Naming it would probably be the hard part. SpriteBatchB ?
Even with the text i realized there is no need to alter the class to fix it up you can basically just implement your own pretty easily as a separate class or a new more advanced one right next to it with new functionality for monogame itself but a official one would be nice
...
Oh ya btw while were on the subject, before i forget, you know what else is annoying it irks me whenever i think about it.
Having to pass rectangles to spritebatch calls, that get converted into floats internally. When they actually started off in your code as floats. Because you thought, "now i wont have to convert them in all those vector2 spritebatch calls with implicit or explicit casts", but ya vector2, it don't work for rectangle calls, lol.
Think about how inefficient that is (Cast Cast Cast) ok maybe its not the worst compared to the instancing thing but it did bug me, half the spritebatch calls are vector2's the other half are rectangles.
if you keep your positions as int's and want to use a vector2 spritebatch call grrrrrrrr the other way keeping them as floats and want to use a rectangle overload grrrrrr.
_I think monogame needs a official floating rectangle class and spritebatch needs a couple more overloads just saying it would make sense_
I've recently noticed that
makes a matrix transformation for each vertex.
I would guess that in most cases this is not needed and should only occur when the user specifies a transformation.
It might be worth considering having a second, simpler vertex shader which just passes through the input.
@UncleThomy
That is both the user transformation and the screen projection.
https://github.com/MonoGame/MonoGame/blob/develop/MonoGame.Framework/Graphics/SpriteBatch.cs#L146-L148
We can break up Begin() method to one that avoid the multiplication when user does not provide a Matrix.
Can we really optimize the effect any more than than that? I was thinking that we could move more of the expensive operations (rotation & scale) into the effect, but is it possible without hurting the simple cases?
I've never seen the SpriteBatch vertex shader be a bottleneck in any game... it does very little work. That minuscule performance improvement is not worth the hassle of maintaining two versions of it.
The problems are all CPU side and mostly with our implementation of the API.
There's too many variables at play at this point... remember MonoGame is cross platform so moving stuff from CPU to GPU might not work well on all platforms.. and waht if the spritebatch is being used in a 3d game which is already very GPU heavy... making general purpose frameworks is fun ;-)
So other than optimisations that clearly remove some processing its very hard to do this without cross platform profiling. Heck even CPU optimisations at the cost of memory can affect platforms with lower RAM so you can't even call those easy.
What we really need is some decently complex games that run on several platforms that we could profile and see exactly where we need to do work.
EDIT: nvm, i assumed spritebatchitem to work a little different, i was wrong.
Well i been working on bypassing drawstring altogether on and off so i could wrap text without the overhead of calling measure string. So i have a prototype which makes it way easier to see what everything is doing from Drawstring right up to Drawinternal. Their is a lot of redundancy on that side as well.
Let me summarize the case of drawstring.
Drawstring calls DrawInto after it accesses Character Source :( were it copied in the text...
_(which btw is a completely redundant duncil structure which ill show later by removing it.)_
https://github.com/MonoGame/MonoGame/blob/develop/MonoGame.Framework/Graphics/SpriteBatch.cs#L146-L148
DrawInto in Spritefont does most of the work to decide were to place a glyph.
_(the exception is when spriteffects flips are used then MeasureString must be called)_
https://github.com/MonoGame/MonoGame/blob/develop/MonoGame.Framework/Graphics/SpriteFont.cs
Drawinto then calls back to DrawInternal back in SpriteBatch. were it started out.
_(however somethings drawinternal checks for are already resolved by that point)_
So its flopping around all over but it can be reduced by the following.
A fonts glyph information, a method that takes a stringbuilder instead, and a call to draw internal.
Right now its just ordered in a messy convoluted way.
This code below will do exactly what DrawString does by just copy pasting it next to a game1.
_Edit i made some edits to this to cut it down for clarity._
public class SpriteBatchFont
{
private Dictionary<char, SpriteFont.Glyph> _glyphs;
private SpriteFont.Glyph defaultGlyph;
private SpriteFont tsf;
private char defaultfontchar = ' ';
private static SpriteBatch spritebatch;
// this is mine its new for buffering words to wrapp text directly within drawstring
//private DrawableWordBuffer word = new DrawableWordBuffer();
public void SetSpriteFont(SpriteFont s)
{
tsf = s;
_glyphs = tsf.GetGlyphs();
defaultGlyph = new SpriteFont.Glyph();
if (tsf.DefaultCharacter.HasValue)
{
defaultfontchar = (char)(tsf.DefaultCharacter.Value);
defaultGlyph = _glyphs[defaultfontchar];
}
}
//
// .... overloads and wrapps removed for brevity
//
// you could just pass in the spritefont.
//
public void DrawString(SpriteBatch spriteBatch, StringBuilder text, Vector2 position, Color color,
float rotation, Vector2 origin, Vector2 scale, SpriteEffects effect, float depth)
{
var LineSpacing = (float)tsf.LineSpacing;
var Spacing = tsf.Spacing;
bool flip = false;
bool flippedVert = false;
bool flippedHorz = false;
Vector2 msize = Vector2.Zero;
Vector2 offset = Vector2.Zero;
Vector2 q = new Vector2((float)(Math.Sin(rotation)), (float)(Math.Cos(rotation)));
Rectangle dest = new Rectangle();
var currentGlyph = defaultGlyph;
var firstGlyphOfLine = true;
// for now ill set a flag if we cannot skip all the spriteeffect calculations
// flipping is a pretty big punishment to this method
if (effect != SpriteEffects.None)
{
flip = true;
// measure string needs fixed next or to be gotten rid off or around, its kinda bad.
// maybe a entirely different approach to measuring
// were we get a pre calculated measured drawable buffer back instead
msize = tsf.MeasureString(text);
flippedHorz = (effect & SpriteEffects.FlipHorizontally) == SpriteEffects.FlipHorizontally;
flippedVert = (effect & SpriteEffects.FlipVertically) == SpriteEffects.FlipVertically;
}
for (var i = 0; i < text.Length; i++)
{
// This line is were emojis as well aa penalty incurred should be considered.
var c = text[i];
if (c == '\r')
continue;
if (c == '\n')
{
offset.X = 0;
offset.Y += LineSpacing;
firstGlyphOfLine = true;
continue;
}
// needs to be touched up
if (_glyphs.ContainsKey(c))
currentGlyph = _glyphs[c];
else
if (!tsf.DefaultCharacter.HasValue)
// just draw a red square instead.
throw new ArgumentException("Text Contains a Unresolvable Character");
else
currentGlyph = defaultGlyph;
// Solves the problem- the first character on a line with a negative left side bearing.
if (firstGlyphOfLine)
{
offset.X = Math.Max(currentGlyph.LeftSideBearing, 0);
firstGlyphOfLine = false;
}
else
offset.X += Spacing + currentGlyph.LeftSideBearing;
// matrix calculations unrolled varys max 8 mults and up to 10 add subs.
var m = offset;
m.X += currentGlyph.Cropping.X;
m.Y += currentGlyph.Cropping.Y;
// more sprite effect penaltys
if (flip)
{
if (flippedHorz)
m.X = msize.X - (m.X + currentGlyph.BoundsInTexture.Width);
if (flippedVert)
m.Y = msize.Y - (m.Y + currentGlyph.BoundsInTexture.Height);
}
var r = m - origin;
var p = r;
if (rotation != 0)
{
p.X = r.X * q.Y - r.Y * q.X;
p.Y = r.X * q.X + r.Y * q.Y;
}
dest = new Rectangle(
(int)(p.X * scale.X + position.X),
(int)(p.Y * scale.Y + position.Y),
(int)(currentGlyph.BoundsInTexture.Width * scale.X),
(int)(currentGlyph.BoundsInTexture.Height * scale.Y)
);
// This following really should be a call to DrawInternal
// But i cant do it from outside spritebatch.
spriteBatch.Draw(tsf.Texture, dest, currentGlyph.BoundsInTexture, color, rotation, Vector2.Zero, effect, depth);
offset.X += currentGlyph.Width + currentGlyph.RightSideBearing;
}
}
}
I get quickly and easily distracted or it would probably be a bit better, Its not yet done. though it works, its not optimized at all. I was in no rush but... Its good for looking at what needs to happen vs what is going on in all these classes with drawstring fundamentally.
As you can see here i couldn't actually call DrawInternal from outside of spritebatch as a stand alone class or i would have. So i called draw instead which is a bit redunant and slows it down a little,
Technically if this were within spritebatch though, you could include what DrawInternal would do only for this method, as well further reducing redundancy since drawinto just calls drawInternal.
So there is tons of redundancy reduction and optimizing that can be done, before you even get beneath DrawInternal as well.
_I can see now that drawstring was purely a convenience method in xna, as drawstring is really just about how to draw these special sprites from a spritesheet_
DrawString word wrapping methods are omitted for clarity, which is what ive been working on.
Just a thought... instead of duplicating a lot of code here, perhaps just using the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute on relevant methods and doing some refactoring would be enough to get the compiler to do all this inlining for you? (Further reading: http://stackoverflow.com/a/8746128)
I think it's at least worth investigating. If this does the trick, then it would mean you get both maintainable and high performance code. If I get some time, I'll try messing around with this.
FYI - I've been messing a little with the aggressive inlining option and have had some success with it. Some conditions will cause inlining to fail though (e.g. too many parameters, modifying parameter values, etc) which can be tricky to diagnose.
BenchmarkDotNet has been very useful in regards to this though. Not only is it great for micro-benchmarks, it also has a JIT Inlining Events diagnoser which can tell you the reason why inlining has failed. It also supports running benchmarks under different environments so you can test how code performs using different JIT compilers and runtimes.
I did an experiment using the simplest Draw method (i.e. texture, position, color overload) as this would be the best case scenario for inlining. With inlining, it took approximately 60% less time than the original version!
For comparison, I did a manually inlined version (i.e. all code within 1 method) and a compiler inlined version (i.e. several small helper methods using aggressive inlining). The performance was practically identical, and in fact the compiler inlined version seemed to perform just slightly better.
I now definitely think the "aggressive inlining" option is worth pursuing as it means the code will be a lot easier to read and maintain while still performing well.
Thoughts?
P.S. Here's the output from BenchmarkDotNet:
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.2.9200.0
Processor=AMD Phenom(tm) II X4 955 Processor, ProcessorCount=4
Frequency=3138888 Hz, Resolution=318.5842 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1586.0
Job-FMXRTV : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1586.0
EvaluateOverhead=True MaxStdErrRelative=0.001 RemoveOutliers=True
Jit=LegacyJit Platform=X86 Runtime=Clr
Force=True
Method | Mean | StdDev | StdErr | Min | Q1 | Median | Q3 | Max | P95 | Op/s | Scaled | Scaled-StdDev
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Original | 69.6027 ns | 0.0420 ns | 0.0121 ns | 69.5304 ns | 69.5741 ns | 69.5985 ns | 69.6438 ns | 69.6587 ns | 69.6571 ns | 14367268.17 | 1.00 | 0.00
Optimised | 25.5017 ns | 0.0027 ns | 0.0008 ns | 25.4987 ns | 25.4994 ns | 25.5016 ns | 25.5032 ns | 25.5084 ns | 25.5060 ns | 39212995.66 | 0.37 | 0.00
ManualInline | 27.7248 ns | 0.1068 ns | 0.0296 ns | 27.5813 ns | 27.6443 ns | 27.7049 ns | 27.7772 ns | 27.9421 ns | 27.9224 ns | 36068824.84 | 0.40 | 0.00
And here's my work in progress branch with the changes I've made:
https://github.com/MichaelDePiazzi/MonoGame/compare/finaldays...MichaelDePiazzi:fdSpriteBatchOpt2?w=1
(NOTE: The "Optimised" method is named DrawOpt2 and the "ManualInline" method is named DrawOpt3)
I assume you are only testing on windows though we have no idea how well inlining works on other platforms/compilers.
Manual inlining works on all platforms, using the compiler doesn't necessarily do the same.
@theZMan - The above benchmarks were run using the LegacyJIT x86 compiler and the CLR runtime. BenchmarkDotNet does support other environments (e.g. RyuJIT, x64, Mono), so I know it at least covers desktop platforms. But I don't know about other platforms (e.g. mobile, console).
I do know manual inlining is a safer bet than asking the compiler to do it. But it will also result in the code being more complex and harder to maintain. So I was just trying to investigate a possible alternative solution before we embark on this. Using compiler inlining means we can write simpler and easier to maintain code. But the downside is that we'd have to benchmark on all platforms to ensure it's actually being inlined on each platform.
Anyways, just some food for thought fwiw. There are definitely pros and cons to both approaches. But I'd definitely understand if the consensus was to use manual inlining.
I do know manual inlining is a safer bet than asking the compiler to do it. But it will also result in the code being more complex and harder to maintain.
I 100% agree that manual inlining sucks and is harder than normal to maintain. I am the first one to suggest breaking up code for reusability and easy maintenance.
However the behavior of C# compilers/jits/aots is all over the map still. We don't know what we will get from inline attributes. Some platforms it might work and we get the expected performance and others might not inline at all... Something difficult to spot and maintain as we add platforms.
So to me it is less maintenance to manually inline in these performance critical areas. We can also lessen the costs by using unit tests to ensure we don't break behavior.
@tomspilman - That's totally reasonable. It was an interesting exercise anyway playing around with this inlining option as well as BenchmarkDotNet. I've already used it to optimise some other performance critical code of mine, so it has been beneficial for me at least. I only have to worry about Windows though... ;)
Are you going with this? My game would benefit from this optimization :)
Manual optional argument is inconvienent. I have to create new method instead of writing lots of duplication.
@MMKnight I don't understand what you're saying. Manual optional argument?
I have to create new method instead of writing lots of duplication.
Also don't know what you mean by this, byt sounds like a good thing?
Anyway, as a user of MonoGame, this issue doesn't affect you (noticeably), because it's not related to the API.
@Jjagg I have to use ton of optional arguments to make it obsoleted.
Most helpful comment
I have a branch with just that same optimization from 2 years ago... there is a complication with that.
Our OpenGL implementation of
DrawUserPrimitivewas actually faster than using aDynamicVertexBufferwhen I tried this before. In particular on iOS.So we would need to maintain two paths to do that which sounded unappealing at the time.
Now we could add some sort of
SPRITEBATCH_USE_BUFFERSmacro around this new path with the fallback toDrawUserPrimitive. Then on a platform-by-platform basis we defineSPRITEBATCH_USE_BUFFERSif that is faster on this platform.