Stackexchange.redis: Very high CPU usage

Created on 14 Feb 2019 · 36Comments · Source: StackExchange/StackExchange.Redis

We've been doing some performance testing with the new version of our application however we've had some regressions that appear to be StackExchange.Redis related that I cannot pin down. It happens in both 2.0.519 and 2.0.513.

During a period of http activity (which interacts with redis quite a bit), the CPU spikes to 100% until the activity stops. Running a CPU profiler shows that nearly 99% of the CPU time is spent in StackExchange.Redis.PhysicalBridge::WriteMessageTakingWriteLock (any further down is clr.dll and [External Code]). I took a dump during the same period of 100% cpu activity and there are a very large amount of threads inside the same method.

This didn't happen previously using 2.0.513 so I imagine it is related to how we interact with the library but I can't find out why.

performance

Source

joshmackey

Most helpful comment

(mutters curses to the gods of assembly binding and transitive dependency
resolution)

On Thu, 21 Feb 2019, 15:51 tommymonk, notifications@github.com wrote:

I tried to install 545 in a virgin project to perform the same test but
get the exception "The assembly for System.Buffers could not be loaded".
I manually added System.Buffers 4.5.0, System.Numerics.Vectors 4.5.0 to
rectify the issue and recorded this result.

2.0.545 (42 seconds)

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-466051728,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsNVvY5FfFRzzRBG1ZsX4d-U8ot6fks5vPsCagaJpZM4a6nNr
.

mgravell on 21 Feb 2019

😄2

All 36 comments

Hey there; we're aware of a problem that can occur under heavy load when using a mixture of *Async and sync calls (reported via the SignalR folks); we are working on it - we even released 2.0.542 on myget (not nuget) which we thought should have fixed it, but that just shifted the problem. Working on this is my current daytime priority, and we hope to get a fix out real soon. We understand what is happening, and why; have detailed stack-dumps, etc - but the fix involves some nuanced tricky code that needs to be correct.

mgravell on 14 Feb 2019

clarification: the 542 build requires a mix of *Async and sync to replicate this; the previous versions might be able to cause the problem when using just one; both of them require a rare corner case of heavy concurrent load and a "flush" happening at an inconvenient moment

the actual problem here is a nasty combination of sync-over-async on the TPL combined with thread-pool starvation; you'd be surprised at some of the places that can hide a problematic sync-over-async - the one that makes 542 (myget) unreliable is actually in the sync (not async) SemaphoreSlim.Wait(timeout) method

mgravell on 14 Feb 2019

Do Scans count as Async or Sync?
I tried the 542 build and while it didn't cap at 100% it did reach 90% for a couple of second or so.

joshmackey on 14 Feb 2019

Define "scans"

mgravell on 14 Feb 2019

I mean the server.Keys(pattern: "*foo*") style calls, but never mind, it appears I've since removed all the scans. Still trying to narrow down why we're only hitting this now rather than in a previous version.

There doesn't appear to be any Sync calls in my tests.

joshmackey on 15 Feb 2019

k, sorry I just wanted to be 100% sure; note: you might want to increase the page size when using Keys (which prefers SCAN as the implementation, indeed); the library uses the default page size of the SCAN operation - i.e. 10, but frankly if you're using SCAN you probably want this to be way higher - as in, I'd probably use 1000, which is immediately a 100-times reduction in the number of round-trips.

But general advise (globally to redis, not specific to this library) is: don't use KEYS or SCAN on a real system in production

mgravell on 15 Feb 2019

Noted, but we're not using SCAN/KEY anymore there.
Any who, I found section of code causing the CPU spike in the following code:

```var transaction = db.CreateTransaction();
transaction.AddCondition(Condition.KeyExists(clientKey));
var getTask = transaction.HashGetAllAsync(clientKey);
var bindingsTask = DoGetBindings(transaction, clientId, domainKey);
var subTask = DoGetSubscriptions(transaction, clientId, domainKey);

(DoGetSubscriptions looks the same)

private async Task DoGetBindings(IDatabaseAsync db, string clientId, string domainKey)
{
var entries = await db.HashGetAllAsync(GetClientBindingsKey(domainKey, clientId));
return new Bindings(entries.Select(entry => new Binding(new Record(entry.Name, entry.Value))));
}
```

The spike happens when DoGetBindings/DoGetSubscriptions has the continuation. If I return HashGetAllAsync directly and then transform the result after the transaction.ExecuteAsync, all is fine.

This bit of code hasn't changed recently, so maybe there's a difference in the MSIL causing it? I don't know.

joshmackey on 18 Feb 2019

I'm tempted to say "see what happens when we deploy the semaphore fix",
which should be soon. This addresses a number of CPU-related issues.

On Mon, 18 Feb 2019, 19:08 Josh Mackey, notifications@github.com wrote:

Noted, but we're not using SCAN/KEY anymore there.
Any who, I found section of code causing the CPU spike in the following
code:
  transaction.AddCondition(Condition.KeyExists(clientKey));
  var getTask = transaction.HashGetAllAsync(clientKey);
  var bindingsTask = DoGetBindings(transaction, clientId, domainKey);
  var subTask = DoGetSubscriptions(transaction, clientId, domainKey);
(DoGetSubscriptions looks the same)

private async Task DoGetBindings(IDatabaseAsync db, string clientId, string domainKey)
{
var entries = await db.HashGetAllAsync(GetClientBindingsKey(domainKey, clientId));
return new Bindings(entries.Select(entry => new Binding(new Record(entry.Name, entry.Value))));
}

The spike happens when DoGetBindings/DoGetSubscriptions has the
continuation. If I return HashGetAllAsync directly and then transform the
result after the transaction.ExecuteAsync, all is fine.

This bit of code hasn't changed recently, so maybe there's a difference in
the MSIL causing it? I don't know.

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-464846283,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsGhPtckJzcgiatjw-g4dpU0r9dapks5vOvo_gaJpZM4a6nNr
.

mgravell on 18 Feb 2019

Do you have an ETA on it? Or perhaps a prerelease?

joshmackey on 18 Feb 2019

I expect to have a build on myget this morning; I'll probably be dogfooding it on Stack Overflow today.

mgravell on 19 Feb 2019

myget was angry with me; I've thrown it onto nuget instead (marked "unlisted"): https://www.nuget.org/packages/StackExchange.Redis/2.0.561-ge0760c7052

mgravell on 19 Feb 2019

and now: https://www.nuget.org/packages/StackExchange.Redis/2.0.544

mgravell on 19 Feb 2019

I'm not sure if the slowness I'm seeing is related to this issue, I can create a new issue if not, I have a simple test to reproduce the performance drop in 2.x.

Time taken

1.2.6 (77 seconds)
2.0.513 (155 seconds)
2.0.519 (156 seconds)
2.0.544 (594 seconds)

var client = ConnectionMultiplexer.Connect(ConfigurationOptions.Parse("localhost"));
client.BlockUntilConnected();
var db = client.GetDatabase(0);

var start = DateTime.Now;

var tasks = Enumerable.Range(0, 1000).Select(async i =>
{
    for (int t = 0; t < 1000; t++)
        await db.StringIncrementAsync(i.ToString(), 1);
});

await Task.WhenAll(tasks);

var duration = DateTime.Now.Subtract(start).TotalMilliseconds;

I see much higher CPU in addition to it taking longer.

tommymonk on 21 Feb 2019

seconds? or milliseconds? on a local against 519, I see this taking < 4 seconds (1.2.6 for me takes 3 seconds); 544 and 545 are taking a little longer, which is something I can look into, but nothing like the times you're talking about here

any chance you're talking about a cloud scenario here (remote servers, etc)?

(edit: I've used GetCounters() to confirm that 1000000 operations did complete asynchronously between the two timings)

mgravell on 21 Feb 2019

also: is the redis server a cluster in this case, perhaps?

mgravell on 21 Feb 2019

Although the code reported milliseconds I converted to seconds for the post.
This is all against a redis server running on the same machine (Windows 10) redis_version:3.2.100

tommymonk on 21 Feb 2019

I ran the same code against a remote redis 3.2.8 server running on Ubuntu going through a 10gig switch and got faster results but still a big difference

1.2.6 (10 seconds)
2.0.519 (19 seconds)
2.0.544 (46 seconds)

Probably worth noting that I'm running these testing in a console app on .net 4.7.2

tommymonk on 21 Feb 2019

k, that's really useful feedback; I'll dig more into this before we release anything; 545 is on nuget, but is "unlisted"

mgravell on 21 Feb 2019

I have a handle on the 544/545 delta, and a plan to fix that regression - and hopefully eat back some of the 1.2.6->2.* delta

mgravell on 21 Feb 2019

I tried to install 545 in a virgin project to perform the same test but get the exception "The assembly for System.Buffers could not be loaded".
I manually added System.Buffers 4.5.0, System.Numerics.Vectors 4.5.0 to rectify the issue and recorded this result.

2.0.545 (42 seconds)

tommymonk on 21 Feb 2019

(mutters curses to the gods of assembly binding and transitive dependency
resolution)

On Thu, 21 Feb 2019, 15:51 tommymonk, notifications@github.com wrote:

I tried to install 545 in a virgin project to perform the same test but
get the exception "The assembly for System.Buffers could not be loaded".
I manually added System.Buffers 4.5.0, System.Numerics.Vectors 4.5.0 to
rectify the issue and recorded this result.

2.0.545 (42 seconds)

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-466051728,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsNVvY5FfFRzzRBG1ZsX4d-U8ot6fks5vPsCagaJpZM4a6nNr
.

mgravell on 21 Feb 2019

😄2

I'll also ping in here note that we're still seeing CPU spikes, just in a different area once we expanded our tests. I haven't pinned them down yet though. I think it may be caused by an upgrade to asp.net core but I can't prove it yet. They didn't happen when we were still on 2.0.0.

joshmackey on 21 Feb 2019

@mgravell I checked out one of the working versions of our code base (that uses 1.2.6) and then only updated the StackExchange.Redis library to latest. The CPU started spiking on load. Seems like a regression.

joshmackey on 26 Feb 2019

Just wanted to share an updated test result of the little test mentioned earlier in this issue against 2.0.561;

2.0.561 (10 seconds)

This is obviously a marked improvement over the previous 2.x versions and is on-par with 1.2.6

Unfortunately the majority of attempts to run this code resulted in the following exception, with the occasional success. Previous versions e.g. 2.0.544 didn't have this issue.

Unhandled Exception: StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, timeout: 5000, phys: Idle, outbound: 0KiB, inbound: 0KiB, inst: 0, qu: 1, qs: 0, in: 0, serverEndpoint: Unspecified/redis01:6379, mgr: 9 of 10 available, clientName: TestHost01, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=1,Free=2046,Min=4,Max=2047), v: 2.0.561.32355 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

tommymonk on 12 Mar 2019

2.0.571 still spikes the CPU significantly more than 1.2.6 does.

joshmackey on 14 Mar 2019

@tommymonk I'm just guessing here but that may be because of this (fixed just today) issue https://github.com/mgravell/Pipelines.Sockets.Unofficial/issues/26

NinoFloris on 14 Mar 2019

No, 571 includes the update for that. So if the comparison is to 571, it
isn't related.

On Thu, 14 Mar 2019, 19:36 Nino Floris, notifications@github.com wrote:

@tommymonk https://github.com/tommymonk I'm just guessing here but that
may be because of this (fixed just today) issue
mgravell/Pipelines.Sockets.Unofficial#26
https://github.com/mgravell/Pipelines.Sockets.Unofficial/issues/26

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-473027446,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsG6aan2GJfLaSNF8edZZ5h-EumWHks5vWqTVgaJpZM4a6nNr
.

mgravell on 14 Mar 2019

@mgravell I meant the exceptions @tommymonk experienced, that was on 2.0.561

NinoFloris on 14 Mar 2019

@NinoFloris I just re-tested with 2.0.571 and experience the same timeouts.

tommymonk on 14 Mar 2019

To give more details, I have a test that throws 600 clients at our server. In 1.2.6, the join phase spikes to around 30% (571, spikes to 60%). For the leave phase, 1.2.6 spikes to 40-50%, whereas in 571, it spikes to 100%.

Each client has a dozen or so redis calls for each phase (I know its super inefficient, we're fixing that) so ~7200 redis calls within a second or so.

joshmackey on 14 Mar 2019

Interesting. By 600 clients: is that 600 multiplexers, or 600
threads/tasks/etc sharing a multiplexer (or some smaller number of
multiplexers)?

On Thu, 14 Mar 2019, 21:08 Josh Mackey, notifications@github.com wrote:

To give more details, I have a test that throws 600 clients at our server.
In 1.2.6, the join phase spikes to around 30% (571, spikes to 60%). For the
leave phase, 1.2.6 spikes to 40-50%, whereas in 571, it spikes to 100%.

Each client has a dozen or so redis calls for each phase (I know its super
inefficient, we're fixing that) so ~7200 redis calls within a second or so.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-473063530,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsBzNl3kFJTIWsXv62VhSoTPMB2MRks5vWrpNgaJpZM4a6nNr
.

mgravell on 14 Mar 2019

They all share the same multiplexer.

I've triple checked, and the entire pipeline is async as well.

joshmackey on 14 Mar 2019

We are aware of a problem here (relating to a race condition in the write-lock); we're dogfooding a fix locally, and hope to deploy an update in the next day or two.

mgravell on 2 Apr 2019

👀1

The combined improvements in 2.0.588 and 2.0.571 should, we believe, address a wide range of timeout and performance problems. I'm closing all timeout issues as "believed fixed in 2.0.588"; if you still see problems, please report them as new issues so we can start investigation with fresh eyes.

mgravell on 4 Apr 2019

I've just tested 2.0.588 several times and found it now runs as fast as 1.2.6 and without any timeouts :clap:

tommymonk on 4 Apr 2019

Great news, time to upgrade to 2.x 💪

NinoFloris on 4 Apr 2019

Was this page helpful?

0 / 5 - 0 ratings