Skiasharp: [BUG] SKSurface.ReadPixels is super slow

Created on 3 Jul 2020 · 25 Comments · Source: mono/SkiaSharp

Description

I'm using SkiaSharp with an OpenGL backend and drawing on an SKSurface, which is super fast (around 0.25 ms).
To get the rendered pixel data I'm using SKSurface.ReadPixels to copy the data to a buffer, but these calls are super slow (~30 ms).
Is there any chance to speed this up, or to get a direct pointer to the pixel data?

Code
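var _info = new SKImageInfo(1920, 1080, SKColorType.Bgra8888);

// Create an OpenGL context and make it current before creating the Skia context
var _glContext = GlContext.Create();
_glContext.MakeCurrent();
var _glInterface = GRGlInterface.Create();
var _context = GRContext.CreateGl(_glInterface);

// Unmanaged destination buffer for the pixels, and a GPU-backed surface
var _buffer = Marshal.AllocHGlobal(_info.BytesSize);
var _surface = SKSurface.Create(_context, true, _info);

// doing the actual drawing like _surface.Canvas.Draw...

// Copy the rendered pixels from the GPU into the buffer (this is the slow call)
_surface.ReadPixels(_info, _buffer, _info.RowBytes, 0, 0);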

Basic Information

  • Version with issue: 2.80.0-preview.24
  • IDE: Rider 2020.1
  • Platform Target Frameworks: .NET Core 3.1

    • macOS 10.15.5 (currently only tested here)

All 25 comments

I'm not an official Skia developer, but here's my take on it.

  • You are drawing using a GPU backend. That allows for very fast rendering.

  • However, when you read the pixels back to the CPU, all pixels must be transferred over the PCI bus from GPU memory to host CPU memory. This is called "readback", and it is always slow. Furthermore, it stalls both the GPU and the CPU.

  • For high-performance readback, async DMA must be used, so that neither the GPU nor the CPU is blocked while the pixels are being transferred. This is still slow, but at least none of the machinery will stall, so you can overlap rendering and readback. But I don't think SkiaSharp supports such an interface. (That said, Skia does have a protected asyncReadPixels and a public GrSurfaceContext::asyncRescaleAndReadPixels; the latter could be exposed by SkiaSharp 2.0 with some work.)

I understand that the transfer is the problem, but I don't understand why.
When I calculate the amount of data that has to be transferred from the GPU to RAM, it's about 500 MB/s, which to my knowledge is far below what modern RAM or PCI buses can actually deliver.
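
A rough sketch of the arithmetic behind that estimate. It assumes 4 bytes per pixel for Bgra8888 and a tightly packed frame; the class and method names are made up for illustration, and the 60 frames per second figure in the note below is an assumption, since no frame rate is stated in the thread.

```csharp
using System;

static class ReadbackEstimate
{
    // Size in bytes of one tightly packed frame.
    public static long FrameBytes(int width, int height, int bytesPerPixel) =>
        (long)width * height * bytesPerPixel;

    // Effective throughput in MB/s (1 MB = 1,000,000 bytes)
    // when transferring `bytes` takes `milliseconds`.
    public static double MegabytesPerSecond(long bytes, double milliseconds) =>
        bytes / 1_000_000.0 / (milliseconds / 1000.0);
}
```

ReadbackEstimate.FrameBytes(1920, 1080, 4) gives 8,294,400 bytes per frame. At the roughly 30 ms per ReadPixels call reported above, that works out to about 276 MB/s; at a hypothetical 60 frames per second (16.67 ms per frame) it would be about 498 MB/s, which may be where the 500 MB/s figure comes from. Either way, raw bus bandwidth alone does not explain a 30 ms call.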

Do you have a project that you can share? I am interested in profiling this on my system.

I created a minimal console application example. To make it work include the files from https://github.com/mono/SkiaSharp/tree/master/tests/Tests/GlContexts to the project.

var watch = new Stopwatch();
watch.Restart();

// See https://github.com/mono/SkiaSharp/tree/master/tests/Tests/GlContexts for context creation
var glContext = GlContext.Create();
glContext.MakeCurrent();

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for OpenGl context creation");
watch.Restart();

var glInterface = GRGlInterface.Create();
var context = GRContext.CreateGl(glInterface);

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for Skia context creation");
watch.Restart();

var info = new SKImageInfo(1920, 1080, SKColorType.Bgra8888);
var surface = SKSurface.Create(context, true, info);

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for Skia surface creation");
watch.Restart();

var buffer = Marshal.AllocHGlobal(info.BytesSize);

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for buffer allocation");

var totalDrawingTime = 0.0;
var totalCopyTime = 0.0;
var cycles = 10;

using (var paint = new SKPaint())
{
    paint.IsAntialias = true;
    paint.Color = SKColors.Red;
    paint.Style = SKPaintStyle.Stroke;
    paint.StrokeWidth = 4;

    for (var i = 0; i < cycles; i++)
    {
        Console.WriteLine($"## Cycle {i} ##");
        var rand = new Random(cycles);

        watch.Restart();

        // Drawing start
        surface.Canvas.Clear(SKColors.SkyBlue);

        for (var j = 0; j < 1000; j++)
        {
            surface.Canvas.DrawLine(rand.Next(0,info.Width), rand.Next(0,info.Height), rand.Next(0,info.Width), rand.Next(0,info.Height), paint);
        }
        // Drawing end

        totalDrawingTime += watch.Elapsed.TotalMilliseconds;
        Console.WriteLine($"  {watch.Elapsed.TotalMilliseconds:0.000}ms for drawing");
        watch.Restart();

        if (!surface.ReadPixels(info, buffer, info.RowBytes, 0, 0))
        {
            Console.WriteLine($"Failed to copy pixels from GPU");
        }

        totalCopyTime += watch.Elapsed.TotalMilliseconds;
        Console.WriteLine($"  {watch.Elapsed.TotalMilliseconds:0.000}ms for copying from GPU");

    }

}

watch.Restart();

surface.Dispose();
Marshal.FreeHGlobal(buffer);
context.Dispose();
glContext.Destroy();

Console.WriteLine($"{watch.Elapsed.TotalMilliseconds:0.000}ms for freeing up");
watch.Restart();

Console.WriteLine($"Average of {totalDrawingTime / cycles:0.000}ms for drawing");
Console.WriteLine($"Average of  {totalCopyTime/ cycles:0.000}ms for copying from GPU");

On my Mac I get the following output (no debugger attached, Release configuration, built for x64):

46.406ms for OpenGl context creation
588.590ms for Skia context creation
1.772ms for Skia surface creation
0.192ms for buffer allocation
## Cycle 0 ##
  9.028ms for drawing
  71.424ms for copying from GPU
## Cycle 1 ##
  2.682ms for drawing
  55.342ms for copying from GPU
## Cycle 2 ##
  1.817ms for drawing
  33.128ms for copying from GPU
## Cycle 3 ##
  2.244ms for drawing
  65.454ms for copying from GPU
## Cycle 4 ##
  1.941ms for drawing
  34.987ms for copying from GPU
## Cycle 5 ##
  1.690ms for drawing
  35.023ms for copying from GPU
## Cycle 6 ##
  1.894ms for drawing
  47.286ms for copying from GPU
## Cycle 7 ##
  1.738ms for drawing
  34.455ms for copying from GPU
## Cycle 8 ##
  1.915ms for drawing
  32.325ms for copying from GPU
## Cycle 9 ##
  1.853ms for drawing
  44.672ms for copying from GPU
23.287ms for freeing up
Average of 2.676ms for drawing
Average of  45.408ms for copying from GPU

Could you add surface.Canvas.Flush() and/or context.Flush() just after // Drawing end, and measure again?

I think Skia might batch many commands before sending them to the GPU. So the 44ms you are measuring might also include a subset of the 1000 lines you are drawing?

With surface.Canvas.Flush()

Average of 4.445ms for drawing
Average of  62.336ms for copying from GPU

With context.Flush()

Average of 3.796ms for drawing
Average of  57.863ms for copying from GPU

With surface.Canvas.Flush() and context.Flush()

Average of 3.619ms for drawing
Average of  58.371ms for copying from GPU

And without any change (it's a little bit slower now, so added for reference)

Average of 3.174ms for drawing
Average of  52.344ms for copying from GPU

So it looks like without any flush function, the performance is better (I ran the tests multiple times and the values seem to be stable)

Okay, I'll try to reproduce your test case on Monday; I'm curious what is causing this, and what can be done about it.

@Ziriax did you have time to look into this?

Sorry, I completely forgot about this. I'll try to find some time.

I got this running on my machine (Windows 10 x64, RTX 2070, AMD Ryzen 3950x), I get the following numbers with your code:

305,354ms for OpenGl context creation
35,677ms for Skia context creation
0,990ms for Skia surface creation
0,174ms for buffer allocation
## Cycle 0 ##
  5,529ms for drawing
  14,434ms for copying from GPU
## Cycle 1 ##
  1,463ms for drawing
  11,229ms for copying from GPU
## Cycle 2 ##
  1,475ms for drawing
  7,258ms for copying from GPU
## Cycle 3 ##
  1,442ms for drawing
  7,039ms for copying from GPU
## Cycle 4 ##
  1,464ms for drawing
  7,973ms for copying from GPU
## Cycle 5 ##
  1,422ms for drawing
  7,358ms for copying from GPU
## Cycle 6 ##
  1,410ms for drawing
  7,347ms for copying from GPU
## Cycle 7 ##
  1,443ms for drawing
  6,955ms for copying from GPU
## Cycle 8 ##
  1,441ms for drawing
  6,988ms for copying from GPU
## Cycle 9 ##
  1,452ms for drawing
  6,958ms for copying from GPU
6,365ms for freeing up
Average of 1,854ms for drawing
Average of  8,354ms for copying from GPU

I'll try again with a debug build of Skia, to see what is taking time.

Is that 8ms or 8000ms?

My locale settings seem to use Dutch, so the 8,354ms is actually 8.354ms :)
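
The interpolated strings in the benchmark format numbers with the current culture, which is why the decimal separator differs between the two outputs. A minimal sketch of culture-independent formatting, assuming the same "0.000" format as the benchmark (the TimingFormat helper is a hypothetical name, not part of the original code):

```csharp
using System;
using System.Globalization;

static class TimingFormat
{
    // Formats a duration in milliseconds with three decimals,
    // using the given culture's decimal separator.
    public static string Format(double milliseconds, CultureInfo culture) =>
        milliseconds.ToString("0.000", culture) + "ms";
}
```

Passing CultureInfo.InvariantCulture always produces a period as the decimal separator, so results posted from different machines can be compared directly; passing a Dutch culture such as new CultureInfo("nl-NL") would normally produce the comma seen above.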

So 8ms

As expected, glReadPixels is called, and that is a synchronous call, and stalls the pipeline.

However, just after that, the pixels are converted... It seems the orientation is different, and the pixels need to be flipped. I'm trying to figure out what happens.

@mattleibow I found a silly bug in SkiaSharp, in

SkiaSharp\binding\Binding\SKSurface.cs

        public static SKSurface Create (GRContext context, bool budgeted, SKImageInfo info, int sampleCount, GRSurfaceOrigin origin) =>
            Create (context, budgeted, info, sampleCount, GRSurfaceOrigin.BottomLeft, null, false);

As you can see, the origin parameter is not passed on to the Create function; GRSurfaceOrigin.BottomLeft is passed instead.

This causes the pixel flip.

I will patch this, and see what happens :)

@joa77 Could you change your code to

var surface = SKSurface.Create(context, true, info, 0, GRSurfaceOrigin.TopLeft, new SKSurfaceProperties(SKPixelGeometry.Unknown), false);

And profile again?

In my debug build, this reduces the readpixels from 8ms to 4ms...

Going to try a release build now... I'll be back in 2 hours, LOL

@mattleibow I see GRSurfaceOrigin.BottomLeft is always passed as the default for the origin. Is this for a reason? Because the native read-pixels code is:

    bool flip = srcProxy->origin() == kBottomLeft_GrSurfaceOrigin;

    auto supportedRead = caps->supportedReadPixelsColorType(
            this->colorInfo().colorType(), srcProxy->backendFormat(), dstInfo.colorType());

    bool makeTight = !caps->readPixelsRowBytesSupport() && tightRowBytes != rowBytes;

    bool convert = unpremul || premul || needColorConversion || flip || makeTight ||
                   (dstInfo.colorType() != supportedRead.fColorType);

So it seems to prefer a TopLeft origin, otherwise it will convert the pixels.
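
To illustrate why the flip costs CPU time, here is a minimal sketch of a vertical flip over a tightly packed pixel buffer. This is not Skia's actual conversion code, only an example of the kind of extra pass over the whole frame that a BottomLeft origin implies; the RowFlip name is made up.

```csharp
using System;

static class RowFlip
{
    // Flips an image vertically in place by swapping rows from the
    // outside in. `rowBytes` is the size of one row in bytes.
    public static void FlipVertically(byte[] pixels, int rowBytes, int height)
    {
        var scratch = new byte[rowBytes];
        for (int top = 0, bottom = height - 1; top < bottom; top++, bottom--)
        {
            int topOffset = top * rowBytes;
            int bottomOffset = bottom * rowBytes;

            // Swap the two rows through a one-row scratch buffer.
            Buffer.BlockCopy(pixels, topOffset, scratch, 0, rowBytes);
            Buffer.BlockCopy(pixels, bottomOffset, pixels, topOffset, rowBytes);
            Buffer.BlockCopy(scratch, 0, pixels, bottomOffset, rowBytes);
        }
    }
}
```

For a 1920x1080 Bgra8888 frame this touches all of the roughly 8.3 MB of pixel data once more after the transfer, which is work a TopLeft surface avoids.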

The origin was BL for historical reasons I think. Here is the origin in m60:
https://github.com/google/skia/blob/chrome/m60/include/core/SkSurface.h#L167-L168

I don't know about changing the default, but how does this affect different platforms? Do iOS/macOS, Linux and Windows do different things? It may have to do with the fact that it was originally meant to match the GL origin of bottom left: https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glReadPixels.xhtml

In a release build, when passing TopLeft as the surface origin, I get a similar speedup as in debug 🍦

So almost twice as fast on my machine! 🚀

Skia also has an asyncReadPixels, but this isn't exposed yet in SkiaSharp (I don't even think it is part of the public Skia API).

306,579ms for OpenGl context creation
8,506ms for Skia context creation
0,856ms for Skia surface creation
0,151ms for buffer allocation
## Cycle 0 ##
  10,819ms for drawing
  5,958ms for copying from GPU
## Cycle 1 ##
  1,597ms for drawing
  7,008ms for copying from GPU
## Cycle 2 ##
  1,549ms for drawing
  4,031ms for copying from GPU
## Cycle 3 ##
  1,546ms for drawing
  3,984ms for copying from GPU
## Cycle 4 ##
  1,555ms for drawing
  4,236ms for copying from GPU
## Cycle 5 ##
  1,565ms for drawing
  3,853ms for copying from GPU
## Cycle 6 ##
  1,539ms for drawing
  4,028ms for copying from GPU
## Cycle 7 ##
  1,574ms for drawing
  3,772ms for copying from GPU
## Cycle 8 ##
  1,504ms for drawing
  3,893ms for copying from GPU
## Cycle 9 ##
  1,590ms for drawing
  3,849ms for copying from GPU
6,173ms for freeing up
Average of 2,484ms for drawing
Average of  4,461ms for copying from GPU

@joa77 Could you change your code to

var surface = SKSurface.Create(context, true, info, 0, GRSurfaceOrigin.TopLeft, new SKSurfaceProperties(SKPixelGeometry.Unknown), false);

And profile again?

Do I need to get a new build of SkiaSharp to do this?

No this particular overload works fine.

It's around 4ms faster now (running on macOS, maybe this makes a difference depending on the OS)
Any chance that the asyncReadPixels method will get implemented in SkiaSharp?

Okay, since the conversion is done on the CPU, this explains why we both get 4ms when the conversion isn't done.

In my case, the glReadPixels call takes 4ms. On your machine, it takes about 40ms. This is really odd. Are you sure the video card is in a high-bandwidth PCI slot? Because 40ms is way too high.

You might want to profile your OpenGL calls, just to make sure. I haven't done that for a while, but maybe tools like https://apitrace.github.io/ and https://renderdoc.org/ can help

In the old days, we used two PBO surfaces, then glReadPixels becomes async "for free", e.g. without callbacks:

http://www.songho.ca/opengl/gl_pbo.html

You might be able to use that directly, without calling Skia's read pixels, not sure
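
A minimal sketch of the two-PBO ping-pong idea, showing only the buffer scheduling. The startRead and consume callbacks are hypothetical placeholders: in a real program they would bind a GL_PIXEL_PACK_BUFFER and call glReadPixels with a null offset, and map the other buffer with glMapBuffer, through whatever OpenGL binding is in use. None of this is SkiaSharp API.

```csharp
using System;

// Alternates between two pixel buffer objects (PBOs): each frame starts a
// transfer into one PBO and consumes the PBO filled during the previous frame.
sealed class PingPongReadback
{
    private int _writeIndex;   // PBO (0 or 1) that receives the current frame
    private bool _primed;      // true once at least one transfer has been started

    public void Step(Action<int> startRead, Action<int> consume)
    {
        // Kick off the transfer for this frame; with a bound PBO the
        // read call returns without waiting for the copy to finish.
        startRead(_writeIndex);

        // The other PBO was filled one frame ago, so mapping it should
        // not stall. Skipped on the very first frame.
        if (_primed)
        {
            consume(1 - _writeIndex);
        }

        _writeIndex = 1 - _writeIndex;
        _primed = true;
    }
}
```

Calling Step once per rendered frame yields the pixels one frame late, which is the price for not blocking on the transfer.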
