Sqlclient: Reading binary data asynchronously is extremely slow

Created on 5 Jun 2020  路  16Comments  路  Source: dotnet/SqlClient

Reading a 10MB VARBINARY(MAX) value asynchronously takes around 5 seconds my machine, while doing the same synchronously takes around 20ms. Increasing the data size to 20MB increases the running time to around 52 seconds.

These numbers were with Microsoft.Data.SqlClient 1.1.3, but 2.0.0-preview4.20142.4 has the same issue (it actually seems slightly slower). Note that I'm not posting final BDN results because the benchmark ran extremely slowly.

Note that this looks very similar to #245, but that issue was fixed for 1.1.0. The difference may be that here we're writing binary data - or possibly the bigger size.


Benchmark code

```c#
[MemoryDiagnoser]
public class Benchmarks
{
const string ConnectionString = "Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0";

[GlobalSetup]
public void Setup()
{
    using var conn = new SqlConnection(ConnectionString);
    conn.Open();

    using var cmd = conn.CreateCommand();
    cmd.CommandText = @"

IF OBJECT_ID('dbo.data', 'U') IS NOT NULL
DROP TABLE data;
CREATE TABLE data (id INT, foo VARBINARY(MAX))
";
cmd.ExecuteNonQuery();

    cmd.CommandText = "INSERT INTO data (id, foo) VALUES (@id, @foo)";
    cmd.Parameters.AddWithValue("id", 1);
    cmd.Parameters.AddWithValue("foo", new byte[1024 * 1024 * 10]);
    cmd.ExecuteNonQuery();
}

[Benchmark]
public async ValueTask<int> Async()
{
    using var conn = new SqlConnection(ConnectionString);
    using var cmd = new SqlCommand("SELECT foo FROM data", conn);
    await conn.OpenAsync();

    return ((byte[])await cmd.ExecuteScalarAsync()).Length;
}

[Benchmark]
public async ValueTask<int> Sync()
{
    using var conn = new SqlConnection(ConnectionString);
    using var cmd = new SqlCommand("SELECT foo FROM data", conn);
    conn.Open();

    return ((byte[])cmd.ExecuteScalar()).Length;
}

static void Main(string[] args)
    => BenchmarkRunner.Run<Benchmarks>();

}
```

Originally raised in https://github.com/dotnet/efcore/issues/21147

Performance

Most helpful comment

With the fix in https://github.com/dotnet/SqlClient/pull/603 the numbers are better:

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------ |------------:|----------:|----------:|----------:|----------:|----------:|----------:|
| Async | 2,096.68 ms | 14.390 ms | 13.460 ms | 2000.0000 | 1000.0000 | 1000.0000 | 30.5 MB |
| StreamAsync | 25.42 ms | 0.499 ms | 0.534 ms | 2031.2500 | 1968.7500 | 1968.7500 | 32.36 MB |
| Sync | 16.67 ms | 0.041 ms | 0.036 ms | 406.2500 | 406.2500 | 406.2500 | 20 MB |

All 16 comments

Cannot see much relation to #245 really.

Here is the result of the test at NetCore & Netfx:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18362.657 (1903/May2019Update/19H1)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4121.0), X64 RyuJIT  [AttachedDebugger]
  Job-TUQHOY : .NET Framework 4.8 (4.8.4121.0), X64 RyuJIT

IterationCount=10  LaunchCount=1  WarmupCount=10  

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------------:|
| Async | 3,170.53 ms | 222.636 ms | 132.487 ms | 635000.0000 | 634000.0000 | 633000.0000 | 13155.71 MB |
| Sync | 37.88 ms | 2.139 ms | 1.415 ms | 333.3333 | 333.3333 | 333.3333 | 20 MB |

  • ManagedSNI = false
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18362.657 (1903/May2019Update/19H1)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.202
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  Job-LXIUHK : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT

IterationCount=10  LaunchCount=1  WarmupCount=10  

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |------------:|-----------:|-----------:|----------:|----------:|----------:|----------:|
| Async | 4,731.62 ms | 228.407 ms | 151.077 ms | 3000.0000 | 2000.0000 | 1000.0000 | 30.5 MB |
| Sync | 36.68 ms | 2.590 ms | 1.713 ms | 500.0000 | 500.0000 | 500.0000 | 20 MB |

  • ManagedSNI = true
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18362.657 (1903/May2019Update/19H1)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.202
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  Job-OTAQAT : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT

IterationCount=10  LaunchCount=1  WarmupCount=10  

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |------------:|-----------:|-----------:|----------:|----------:|----------:|----------:|
| Async | 4,054.54 ms | 474.155 ms | 282.162 ms | 3000.0000 | 2000.0000 | 1000.0000 | 30.5 MB |
| Sync | 34.23 ms | 1.365 ms | 0.903 ms | 357.1429 | 357.1429 | 357.1429 | 20 MB |

Working from memory here. Reading a MAX defined field will mean using chunked returns and the complete length of the field may not be known ahead of time so the optimization I put in place for reading won't help. If that's right then you'll see the same horrific escalating performance with increased data sizes as the reader continuously allocates and releases the temporary buffers, those GC numbers seem to support that.

It might be possible to do adapt the cached buffer we've got for multipacket known lengths to work like a list and auto expand as we go. What I can't do is prevent the repeated rescanning of the async packet queue.

I've had a look into the profiles and I was wrong. The optimization from https://github.com/dotnet/SqlClient/issues/245 is being used because it's reading as PLP data. The excessive memory usage is caused mostly by input packet buffers on the TDS side being lost into the snapshot packet chain. The time escalation is cause by the snapshot chain repeatedly being replayed.

The only currently available mitigations are to increase the packet size to as close to 16K as you can get or to use a stream reader mode which should work on a per-packet basis and not buffer the entire contents (I'm a bit hazy on that for async, it probably needs checking).

The long term fix would be to rewrite async reads to avoid replaying the entire packet chain on every receipt. This is akin to brain surgery. I've looked into this while reviewing other performance issues around async and it isn't something to attempt lightly and if I managed it solo and dropped a PR for it the team they may all justifiably quit in protest.

The

Oh, and this in System.Data.Common causes about 1/3 of the allocations and really isn't helpful.

https://github.com/dotnet/runtime/blob/7a57b927c3bdc31b91b7605f25f8598b8e6e58b8/src/libraries/System.Data.Common/src/System/Data/SQLTypes/SQLBinary.cs#L61-L74

We already use type workarounds with an overlay type to access the single value field directly to create SqlBinary instances so we could do the same to get it out but we'd then be handing the user a direct reference to a mutable array instead of a copy an I've no doubt that some idiot somewhere in the world is relying on the copy semantic to make their code work.

@Wraith2

Oh, and this in System.Data.Common causes about 1/3 of the allocations and really isn't helpful.

You're referring to the fact that the binary data is copied out into an array that's given to the user, right? If so, isn't that the same between sync and async?

In any case, that's indeed a lot of allocations, but the alternative would be to return memory that's owned by SqlClient, right? If I'm understanding things right, that would mean that if the user reads the next row, the reference they get would suddenly start pointing to a totally different value... I did think about this at some point as an advanced/high-perf/"unsafe" API in https://github.com/dotnet/runtime/issues/24899 - but that's tricky because of the lifetime issue, and also because the entire binary data isn't necessarily in memory if CommandBehavior.Sequential is used (as it should if values are big).

Anyway, all that stuff is less important than making async at least perform similar to sync...

You're referring to the fact that the binary data is copied out into an array that's given to the user, right? If so, isn't that the same between sync and async?

Yes, and yes.

In any case, that's indeed a lot of allocations, but the alternative would be to return memory that's owned by SqlClient, right?

Each row of data is stored in an array of SqlBuffer instances stored in the reader, The SqlBuffer type is a discriminated union and it'll hold a reference to an SqlBinary. When you move row or move field in sequential mode the buffer instance is dropped to the GC. The backing byte[] array inside SqlBinary is not re-used.

If we returned the array directly it would only have an effect on multiple reads of the same field in the same row and the only way to observe a change of behaviour would be to change the contents of the array and then re-request it from the reader. It's a pretty clear optimization that I'd like to be able to make but I'm pretty sure I can't because someone will have relied on it returning a new array each time.

As you suggest the ideal approach would be to use a ReadOnlyMemory then it'd all be safe but that'll require a runtime api change and for all providers to consume it.

also because the entire binary data isn't necessarily in memory if CommandBehavior.Sequential is used (as it should if values are big).

The entire array is skipped in Sequential mode if the field is not requested, as you say.
In standard mode it will be read into memory and kept until Read() is called to move to the next row if it or any field after it in the column order is read.

Anyway, all that stuff is less important than making async at least perform similar to sync...

I can sort of understand how it might be done but it's daunting and difficult.

As you suggest the ideal approach would be to use a ReadOnlyMemory then it'd all be safe but that'll require a runtime api change and for all providers to consume it.

Wouldn't it be possible for a provider to simply support GetFieldValue<ReadOnlyMemory<byte>>(), without introducing a new API?

I can sort of understand how it might be done but it's daunting and difficult.

I can certainly understand (and appreciate) that, and I'm definitely not suggesting anyone do anything specific (was just reporting the problem). Along with #547 and possibly other async-related issues, it seems that at least documenting these limitations could help users avoid pitfalls.

BTW do you think there are any workarounds to this, besides switching to sync?

Wouldn't it be possible for a provider to simply support GetFieldValue>(), without introducing a new API?

I hadn't considered that, nice idea.

Currently the only workaround I can think of is to use a stream read overload because I think those drive the parser per-packet directly so you don't have to replay the packet queue, that'll need verifying.

Currently the only workaround I can think of is to use a stream read overload because I think those drive the parser per-packet directly so you don't have to replay the packet queue, that'll need verifying.

That at least sounds like a good workaround if it works! Definitely worth benchmarking that.

Unless I've got something very wrong it won't work because there's a bug.

            using var conn = new SqlConnection(ConnectionString);
            using var cmd = new SqlCommand("SELECT foo FROM data", conn);
            await conn.OpenAsync();
            using var reader = await cmd.ExecuteReaderAsync(
                System.Data.CommandBehavior.SequentialAccess
            );
            await reader.ReadAsync();
            using var stream =  reader.GetStream(0);
            using var memory = new MemoryStream(16 * 1024);

            await stream.CopyToAsync(memory);
            return (int)memory.Length;

This should work but freezes in a task wait after a single read cycle, the second read never completed. IF you change it to standard mode it'll work but it does so by fetching the entire field and giving you a reader over the byte[]. So no workaround, this needs fixing if the team agree that it's a bug.

Thanks for testing this... Definitely looks like a bug to me.

IF you change it to standard mode it'll work but it does so by fetching the entire field and giving you a reader over the byte[]

Is this at least fast - in the sense that it resembles the sync perf rather than the async? If so it's at least a workaround, even if not ideal...

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------ |------------:|----------:|----------:|----------:|----------:|----------:|----------:|
| Async | 2,181.93 ms | 42.299 ms | 39.567 ms | 2000.0000 | 1000.0000 | 1000.0000 | 30.5 MB |
| StreamAsync | 2,162.94 ms | 30.055 ms | 28.114 ms | 2000.0000 | 1000.0000 | 1000.0000 | 40.52 MB |
| Sync | 18.40 ms | 0.359 ms | 0.414 ms | 406.2500 | 406.2500 | 406.2500 | 20 MB |

So no. The only way to get the speed benefit Is to put the reader in sequential mode.

https://github.com/dotnet/SqlClient/blob/f2fd632f522a7c0f96c8b5c6a75ea7f2d80c0dd5/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs#L1557

There's also an interesting comment elsewhere in the file that says streaming isn't supported when using encryption. I think that makes streaming unusable for a generic consumer like EFCore because you'll have to special case it per driver.

We need to fix the sequential stream issue and then we need to rewrite async reads.

Thanks for the details @Wraith2.

With the fix in https://github.com/dotnet/SqlClient/pull/603 the numbers are better:

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------ |------------:|----------:|----------:|----------:|----------:|----------:|----------:|
| Async | 2,096.68 ms | 14.390 ms | 13.460 ms | 2000.0000 | 1000.0000 | 1000.0000 | 30.5 MB |
| StreamAsync | 25.42 ms | 0.499 ms | 0.534 ms | 2031.2500 | 1968.7500 | 1968.7500 | 32.36 MB |
| Sync | 16.67 ms | 0.041 ms | 0.036 ms | 406.2500 | 406.2500 | 406.2500 | 20 MB |

That looks quite good, at the very least you can point people to a workaround.

Was this page helpful?
0 / 5 - 0 ratings