Skiasharp: [BUG] SKSurface.ReadPixels is super slow

Created on 3 Jul 2020 · 25 Comments · Source: mono/SkiaSharp

Description

I'm using SkiaSharp with an OpenGL backend and drawing on an SKSurface, which is super fast and takes around 0.25ms.
To get the rendered pixel data I'm using SKSurface.ReadPixels to copy the data to a buffer, but these calls are super slow (~30ms).
Is there any chance of speeding this up, or of getting a direct pointer to the pixel data?

Code
var _info = new SKImageInfo(1920, 1080, SKColorType.Bgra8888);
var _glContext = GlContext.Create();
_glContext.MakeCurrent();
var _glInterface = GRGlInterface.Create();
var _context = GRContext.CreateGl(_glInterface);
var _buffer = Marshal.AllocHGlobal(_info.BytesSize);
_surface = SKSurface.Create(_context, true, _info);

// doing the actual drawing like _surface.Canvas.Draw...

_surface.ReadPixels(_info, _buffer, _info.RowBytes, 0, 0);

Basic Information

Version with issue: 2.80.0-preview.24
IDE: Rider 2020.1
Platform Target Frameworks: .NET Core 3.1
- macOS 10.15.5 (currently only tested here)

joa77

👍1

All 25 comments

I'm not an official Skia developer, but here's my take on it.

You are drawing using a GPU backend. That allows for very fast rendering.
However, when you read the pixels back to the CPU, all pixels must be transferred over the PCI bus from GPU memory to host CPU memory. This is called "readback", and it is always slow. Furthermore, it stalls both the GPU and the CPU while the transfer is in progress.
For high-performance readback, async DMA must be used, so that neither the GPU nor the CPU is blocked while the pixels are being transferred. This is still slow, but at least none of the machinery will stall, so you can overlap rendering and readback. I don't think SkiaSharp supports such an interface, though. (That said, Skia does have a protected asyncReadPixels and a public GrSurfaceContext::asyncRescaleAndReadPixels; the latter could be exposed by SkiaSharp 2.0 with some work.)

Ziriax on 3 Jul 2020

I understand that the transfer is the problem, but I don't understand why.
When I calculate the amount of data that has to be transferred from the GPU to RAM, it comes to about 500 MB/s, which to my knowledge is far less than what modern RAM or PCI buses can actually handle.

joa77 on 4 Jul 2020
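
As a rough check of the figures in the comment above, the arithmetic can be written out in a few lines of C#. This is only a sketch: the 60 frames-per-second target is an assumption (the thread does not say how the ~500 MB/s figure was derived), 1 MB is taken as 1,000,000 bytes, and the helper names FrameBytes and MegabytesPerSecond are hypothetical.

```csharp
using System;

// Bytes in one frame: width * height * bytes per pixel (Bgra8888 uses 4).
static long FrameBytes(int width, int height, int bytesPerPixel) =>
    (long)width * height * bytesPerPixel;

// Throughput in MB/s (1 MB = 1,000,000 bytes) when `bytes` are copied in `milliseconds`.
static double MegabytesPerSecond(long bytes, double milliseconds) =>
    bytes / 1_000_000.0 / (milliseconds / 1000.0);

long frame = FrameBytes(1920, 1080, 4);             // 8,294,400 bytes per frame
double needed = frame * 60 / 1_000_000.0;           // assumed 60 fps target: about 498 MB/s
double achieved = MegabytesPerSecond(frame, 30.0);  // ~30 ms per ReadPixels call: about 276 MB/s

Console.WriteLine($"Frame size: {frame} bytes");
Console.WriteLine($"Needed at 60 fps: {needed:0.0} MB/s");
Console.WriteLine($"Achieved at 30 ms per frame: {achieved:0.0} MB/s");
```

Under these assumptions the measured readback moves only about half of what a 60 fps stream would need, even though the required rate is small compared with the bus bandwidth, which suggests the time goes into synchronization or conversion rather than the raw copy.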

Do you have a project that you can share? I am interested in profiling this on my system.

Ziriax on 4 Jul 2020

I created a minimal console application example. To make it work, add the files from https://github.com/mono/SkiaSharp/tree/master/tests/Tests/GlContexts to the project.

var watch = new Stopwatch();
watch.Restart();

// See https://github.com/mono/SkiaSharp/tree/master/tests/Tests/GlContexts for context creation
var glContext = GlContext.Create();
glContext.MakeCurrent();

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for OpenGl context creation");
watch.Restart();

var glInterface = GRGlInterface.Create();
var context = GRContext.CreateGl(glInterface);

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for Skia context creation");
watch.Restart();

var info = new SKImageInfo(1920, 1080, SKColorType.Bgra8888);
var surface = SKSurface.Create(context, true, info);

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for Skia surface creation");
watch.Restart();

var buffer = Marshal.AllocHGlobal(info.BytesSize);

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for buffer allocation");

var totalDrawingTime = 0.0;
var totalCopyTime = 0.0;
var cycles = 10;

using (var paint = new SKPaint())
{
    paint.IsAntialias = true;
    paint.Color = SKColors.Red;
    paint.Style = SKPaintStyle.Stroke;
    paint.StrokeWidth = 4;

    for (var i = 0; i < cycles; i++)
    {
        Console.WriteLine($"## Cycle {i} ##");
        var rand = new Random(cycles);

        watch.Restart();

        // Drawing start
        surface.Canvas.Clear(SKColors.SkyBlue);

        for (var j = 0; j < 1000; j++)
        {
            surface.Canvas.DrawLine(rand.Next(0,info.Width), rand.Next(0,info.Height), rand.Next(0,info.Width), rand.Next(0,info.Height), paint);
        }
        // Drawing end

        totalDrawingTime += watch.Elapsed.TotalMilliseconds;
        Console.WriteLine($"  {watch.Elapsed.TotalMilliseconds:0.000}ms for drawing");
        watch.Restart();

        if (!surface.ReadPixels(info, buffer, info.RowBytes, 0, 0))
        {
            Console.WriteLine($"Failed to copy pixels from GPU");
        }

        totalCopyTime += watch.Elapsed.TotalMilliseconds;
        Console.WriteLine($"  {watch.Elapsed.TotalMilliseconds:0.000}ms for copying from GPU");

    }

}

watch.Restart();

surface.Dispose();
Marshal.FreeHGlobal(buffer);
context.Dispose();
glContext.Destroy();

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for freeing up");
watch.Restart();

Console.WriteLine($"Average of {totalDrawingTime / cycles:0.000}ms for drawing");
Console.WriteLine($"Average of  {totalCopyTime/ cycles:0.000}ms for copying from GPU");

On my Mac I get the following output (no debugger attached, Release configuration, built for x64):

46.406ms for OpenGl context creation
588.590ms for Skia context creation
1.772ms for Skia surface creation
0.192ms for buffer allocation
## Cycle 0 ##
  9.028ms for drawing
  71.424ms for copying from GPU
## Cycle 1 ##
  2.682ms for drawing
  55.342ms for copying from GPU
## Cycle 2 ##
  1.817ms for drawing
  33.128ms for copying from GPU
## Cycle 3 ##
  2.244ms for drawing
  65.454ms for copying from GPU
## Cycle 4 ##
  1.941ms for drawing
  34.987ms for copying from GPU
## Cycle 5 ##
  1.690ms for drawing
  35.023ms for copying from GPU
## Cycle 6 ##
  1.894ms for drawing
  47.286ms for copying from GPU
## Cycle 7 ##
  1.738ms for drawing
  34.455ms for copying from GPU
## Cycle 8 ##
  1.915ms for drawing
  32.325ms for copying from GPU
## Cycle 9 ##
  1.853ms for drawing
  44.672ms for copying from GPU
23.287ms for freeing up
Average of 2.676ms for drawing
Average of  45.408ms for copying from GPU

joa77 on 4 Jul 2020

Could you add surface.Canvas.Flush() and/or context.Flush() just after // Drawing end, and measure again?

I think Skia might batch many commands before sending them to the GPU. So the 44ms you are measuring might also include a subset of the 1000 lines you are drawing?

Ziriax on 4 Jul 2020

With surface.Canvas.Flush()

Average of 4.445ms for drawing
Average of  62.336ms for copying from GPU

With context.Flush()

Average of 3.796ms for drawing
Average of  57.863ms for copying from GPU

With surface.Canvas.Flush() and context.Flush()

Average of 3.619ms for drawing
Average of  58.371ms for copying from GPU

And without any change (it's a little bit slower now, so added for reference)

Average of 3.174ms for drawing
Average of  52.344ms for copying from GPU

So it looks like performance is better without any flush call (I ran the tests multiple times and the values seem to be stable).

joa77 on 4 Jul 2020

Okay, I'll try to reproduce your test case on Monday; I'm curious what is causing this, and what can be done about it.

Ziriax on 4 Jul 2020

👍1

@Ziriax did you have time to look into this?

joa77 on 13 Jul 2020

Sorry, I completely forgot about this. I'll try to find some time.

Ziriax on 13 Jul 2020

I got this running on my machine (Windows 10 x64, RTX 2070, AMD Ryzen 3950x). With your code I get the following numbers:

305,354ms for OpenGl context creation
35,677ms for Skia context creation
0,990ms for Skia surface creation
0,174ms for buffer allocation
## Cycle 0 ##
  5,529ms for drawing
  14,434ms for copying from GPU
## Cycle 1 ##
  1,463ms for drawing
  11,229ms for copying from GPU
## Cycle 2 ##
  1,475ms for drawing
  7,258ms for copying from GPU
## Cycle 3 ##
  1,442ms for drawing
  7,039ms for copying from GPU
## Cycle 4 ##
  1,464ms for drawing
  7,973ms for copying from GPU
## Cycle 5 ##
  1,422ms for drawing
  7,358ms for copying from GPU
## Cycle 6 ##
  1,410ms for drawing
  7,347ms for copying from GPU
## Cycle 7 ##
  1,443ms for drawing
  6,955ms for copying from GPU
## Cycle 8 ##
  1,441ms for drawing
  6,988ms for copying from GPU
## Cycle 9 ##
  1,452ms for drawing
  6,958ms for copying from GPU
6,365ms for freeing up
Average of 1,854ms for drawing
Average of  8,354ms for copying from GPU

Ziriax on 14 Jul 2020

I'll try again with a debug build of Skia, to see what is taking time.

Ziriax on 14 Jul 2020

Is that 8ms or 8000ms?

mattleibow on 14 Jul 2020

My locale settings seem to use Dutch, so the 8,354ms is actually 8.354ms :)

So 8ms

Ziriax on 14 Jul 2020
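
The confusion above comes from number formatting that follows the machine's locale. A minimal sketch of how a benchmark could print its timings independently of the locale is shown below; the FormatMs helper is hypothetical, and the hand-built NumberFormatInfo is only a stand-in for a Dutch-style locale so that the example does not depend on which cultures are installed.

```csharp
using System;
using System.Globalization;

// Format a duration with an explicit format provider instead of the current culture.
static string FormatMs(double ms, IFormatProvider provider) =>
    ms.ToString("0.000", provider) + "ms";

// Stand-in for a locale that uses a decimal comma (as Dutch does).
var decimalComma = new NumberFormatInfo
{
    NumberDecimalSeparator = ",",
    NumberGroupSeparator = "."
};

double copyMs = 8.354;

Console.WriteLine(FormatMs(copyMs, decimalComma));                 // prints "8,354ms"
Console.WriteLine(FormatMs(copyMs, CultureInfo.InvariantCulture)); // prints "8.354ms"
```

Passing CultureInfo.InvariantCulture when formatting makes logs from different machines directly comparable.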

As expected, glReadPixels is called, and that is a synchronous call, and stalls the pipeline.

However, just after that, the pixels are converted... It seems the orientation is different, and the pixels need to be flipped. I'm trying to figure out what happens.

Ziriax on 14 Jul 2020

@mattleibow I found a silly bug in SkiaSharp, in

SkiaSharp\binding\Binding\SKSurface.cs

        public static SKSurface Create (GRContext context, bool budgeted, SKImageInfo info, int sampleCount, GRSurfaceOrigin origin) =>
            Create (context, budgeted, info, sampleCount, GRSurfaceOrigin.BottomLeft, null, false);

As you can see, the origin parameter is not passed on to the Create function; GRSurfaceOrigin.BottomLeft is passed instead.

This causes the pixel flip.

I will patch this, and see what happens :)

Ziriax on 14 Jul 2020

😕1 👍1

@joa77 Could you change your code to

var surface = SKSurface.Create(context, true, info, 0, GRSurfaceOrigin.TopLeft, new SKSurfaceProperties(SKPixelGeometry.Unknown), false);

And profile again?

Ziriax on 14 Jul 2020

In my debug build, this reduces the readpixels from 8ms to 4ms...

Going to try a release build now... I'll be back in 2 hours, LOL

Ziriax on 14 Jul 2020

@mattleibow I see GRSurfaceOrigin.BottomLeft is always passed as the default for the origin. Is this for a reason? Because the native read-pixels code is:

    bool flip = srcProxy->origin() == kBottomLeft_GrSurfaceOrigin;

    auto supportedRead = caps->supportedReadPixelsColorType(
            this->colorInfo().colorType(), srcProxy->backendFormat(), dstInfo.colorType());

    bool makeTight = !caps->readPixelsRowBytesSupport() && tightRowBytes != rowBytes;

    bool convert = unpremul || premul || needColorConversion || flip || makeTight ||
                   (dstInfo.colorType() != supportedRead.fColorType);

So it seems to prefer a TopLeft origin, otherwise it will convert the pixels.

Ziriax on 14 Jul 2020

The origin was BL for historical reasons I think. Here is the origin in m60:
https://github.com/google/skia/blob/chrome/m60/include/core/SkSurface.h#L167-L168

I don't know about changing the default, but how does this affect different platforms? Do iOS/macOS, Linux and Windows do different things? It may have to do with the fact that it was originally meant to match the GL origin of bottom left: https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glReadPixels.xhtml

mattleibow on 14 Jul 2020

In a release build, when passing TopLeft as the surface origin, I get a similar speedup as in debug 🍦

So almost twice as fast on my machine! 🚀

Skia also has an asyncReadPixels, but this isn't exposed yet in SkiaSharp (I don't even think it is part of the public Skia API).

306,579ms for OpenGl context creation
8,506ms for Skia context creation
0,856ms for Skia surface creation
0,151ms for buffer allocation
## Cycle 0 ##
  10,819ms for drawing
  5,958ms for copying from GPU
## Cycle 1 ##
  1,597ms for drawing
  7,008ms for copying from GPU
## Cycle 2 ##
  1,549ms for drawing
  4,031ms for copying from GPU
## Cycle 3 ##
  1,546ms for drawing
  3,984ms for copying from GPU
## Cycle 4 ##
  1,555ms for drawing
  4,236ms for copying from GPU
## Cycle 5 ##
  1,565ms for drawing
  3,853ms for copying from GPU
## Cycle 6 ##
  1,539ms for drawing
  4,028ms for copying from GPU
## Cycle 7 ##
  1,574ms for drawing
  3,772ms for copying from GPU
## Cycle 8 ##
  1,504ms for drawing
  3,893ms for copying from GPU
## Cycle 9 ##
  1,590ms for drawing
  3,849ms for copying from GPU
6,173ms for freeing up
Average of 2,484ms for drawing
Average of  4,461ms for copying from GPU

Ziriax on 14 Jul 2020

@joa77 Could you change your code to

var surface = SKSurface.Create(context, true, info, 0, GRSurfaceOrigin.TopLeft, new SKSurfaceProperties(SKPixelGeometry.Unknown), false);

And profile again?

Do I need to get a new build of SkiaSharp to do this?

joa77 on 14 Jul 2020

Do I need to get a new build of SkiaSharp to do this?

No, this particular overload works fine.

Ziriax on 14 Jul 2020

It's around 4ms faster now (running on macOS, maybe this makes a difference depending on the OS)
Any chance that the asyncReadPixels method will get implemented in SkiaSharp?

joa77 on 14 Jul 2020

Okay, since the conversion is done on the CPU, this explains why we both gain about 4ms when the conversion is skipped.

In my case, the glReadPixels call takes 4ms. On your machine, it takes about 40ms. This is really odd. Are you sure the video card is in a high-bandwidth PCI slot? Because 40ms is way too high.

You might want to profile your OpenGL calls, just to make sure. I haven't done that for a while, but maybe tools like https://apitrace.github.io/ and https://renderdoc.org/ can help

Ziriax on 15 Jul 2020

In the old days, we used two PBO surfaces; glReadPixels then becomes async "for free", i.e. without callbacks:

http://www.songho.ca/opengl/gl_pbo.html

You might be able to use that directly, without calling Skia's read pixels, not sure

Ziriax on 16 Jul 2020
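
The two-PBO approach mentioned above comes down to ping-pong indexing: each frame starts a read into one buffer and maps the other one, which holds the previous frame. The sketch below models only that bookkeeping. It makes no OpenGL calls, and the function names are hypothetical. In a real implementation the "start read" step would be glReadPixels with a buffer bound to GL_PIXEL_PACK_BUFFER, and the "map" step would be glMapBuffer on the other buffer.

```csharp
using System;

// Index of the PBO that receives the asynchronous read for this frame.
static int WriteIndex(int frame) => frame % 2;

// Index of the PBO that is mapped on the CPU during this frame; it holds the previous frame.
static int ReadIndex(int frame) => (frame + 1) % 2;

// Frame whose pixels become available on the CPU during `frame`; -1 means none yet.
static int AvailableFrame(int frame) => frame - 1;

for (var frame = 0; frame < 4; frame++)
{
    Console.WriteLine($"frame {frame}: start read into PBO {WriteIndex(frame)}");
    if (AvailableFrame(frame) >= 0)
    {
        Console.WriteLine($"frame {frame}: map PBO {ReadIndex(frame)} (pixels of frame {AvailableFrame(frame)})");
    }
}
```

The price of this scheme is one frame of latency: the pixels read on the CPU always belong to the previous frame.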
