We've been doing some performance testing with the new version of our application however we've had some regressions that appear to be StackExchange.Redis related that I cannot pin down. It happens in both 2.0.519 and 2.0.513.
During a period of http activity (which interacts with redis quite a bit), the CPU spikes to 100% until the activity stops. Running a CPU profiler shows that nearly 99% of the CPU time is spent in StackExchange.Redis.PhysicalBridge::WriteMessageTakingWriteLock (any further down is clr.dll and [External Code]). I took a dump during the same period of 100% cpu activity and there are a very large amount of threads inside the same method.
This didn't happen previously using 2.0.513 so I imagine it is related to how we interact with the library but I can't find out why.
Hey there; we're aware of a problem that can occur under heavy load when using a mixture of *Async and sync calls (reported via the SignalR folks); we are working on it - we even released 2.0.542 on myget (not nuget) which we thought should have fixed it, but that just shifted the problem. Working on this is my current daytime priority, and we hope to get a fix out real soon. We understand what is happening, and why; have detailed stack-dumps, etc - but the fix involves some nuanced tricky code that needs to be correct.
clarification: the 542 build requires a mix of *Async and sync to replicate this; the previous versions might be able to cause the problem when using just one; both of them require a rare corner case of heavy concurrent load and a "flush" happening at an inconvenient moment
the actual problem here is a nasty combination of sync-over-async on the TPL combined with thread-pool starvation; you'd be surprised at some of the places that can hide a problematic sync-over-async - the one that makes 542 (myget) unreliable is actually in the sync (not async) SemaphoreSlim.Wait(timeout) method
Do Scans count as Async or Sync?
I tried the 542 build and while it didn't cap at 100% it did reach 90% for a couple of second or so.
Define "scans"
I mean the server.Keys(pattern: "*foo*") style calls, but never mind, it appears I've since removed all the scans. Still trying to narrow down why we're only hitting this now rather than in a previous version.
There doesn't appear to be any Sync calls in my tests.
k, sorry I just wanted to be 100% sure; note: you might want to increase the page size when using Keys (which prefers SCAN as the implementation, indeed); the library uses the default page size of the SCAN operation - i.e. 10, but frankly if you're using SCAN you probably want this to be way higher - as in, I'd probably use 1000, which is immediately a 100-times reduction in the number of round-trips.
But general advise (globally to redis, not specific to this library) is: don't use KEYS or SCAN on a real system in production
Noted, but we're not using SCAN/KEY anymore there.
Any who, I found section of code causing the CPU spike in the following code:
```var transaction = db.CreateTransaction();
transaction.AddCondition(Condition.KeyExists(clientKey));
var getTask = transaction.HashGetAllAsync(clientKey);
var bindingsTask = DoGetBindings(transaction, clientId, domainKey);
var subTask = DoGetSubscriptions(transaction, clientId, domainKey);
(DoGetSubscriptions looks the same)
private async Task
{
var entries = await db.HashGetAllAsync(GetClientBindingsKey(domainKey, clientId));
return new Bindings(entries.Select(entry => new Binding(new Record(entry.Name, entry.Value))));
}
```
The spike happens when DoGetBindings/DoGetSubscriptions has the continuation. If I return HashGetAllAsync directly and then transform the result after the transaction.ExecuteAsync, all is fine.
This bit of code hasn't changed recently, so maybe there's a difference in the MSIL causing it? I don't know.
I'm tempted to say "see what happens when we deploy the semaphore fix",
which should be soon. This addresses a number of CPU-related issues.
On Mon, 18 Feb 2019, 19:08 Josh Mackey, notifications@github.com wrote:
Noted, but we're not using SCAN/KEY anymore there.
Any who, I found section of code causing the CPU spike in the following
code:transaction.AddCondition(Condition.KeyExists(clientKey)); var getTask = transaction.HashGetAllAsync(clientKey); var bindingsTask = DoGetBindings(transaction, clientId, domainKey); var subTask = DoGetSubscriptions(transaction, clientId, domainKey);(DoGetSubscriptions looks the same)
private async Task
DoGetBindings(IDatabaseAsync db, string clientId, string domainKey)
{
var entries = await db.HashGetAllAsync(GetClientBindingsKey(domainKey, clientId));
return new Bindings(entries.Select(entry => new Binding(new Record(entry.Name, entry.Value))));
}The spike happens when DoGetBindings/DoGetSubscriptions has the
continuation. If I return HashGetAllAsync directly and then transform the
result after the transaction.ExecuteAsync, all is fine.This bit of code hasn't changed recently, so maybe there's a difference in
the MSIL causing it? I don't know.—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-464846283,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsGhPtckJzcgiatjw-g4dpU0r9dapks5vOvo_gaJpZM4a6nNr
.
Do you have an ETA on it? Or perhaps a prerelease?
I expect to have a build on myget this morning; I'll probably be dogfooding it on Stack Overflow today.
myget was angry with me; I've thrown it onto nuget instead (marked "unlisted"): https://www.nuget.org/packages/StackExchange.Redis/2.0.561-ge0760c7052
I'm not sure if the slowness I'm seeing is related to this issue, I can create a new issue if not, I have a simple test to reproduce the performance drop in 2.x.
Time taken
var client = ConnectionMultiplexer.Connect(ConfigurationOptions.Parse("localhost"));
client.BlockUntilConnected();
var db = client.GetDatabase(0);
var start = DateTime.Now;
var tasks = Enumerable.Range(0, 1000).Select(async i =>
{
for (int t = 0; t < 1000; t++)
await db.StringIncrementAsync(i.ToString(), 1);
});
await Task.WhenAll(tasks);
var duration = DateTime.Now.Subtract(start).TotalMilliseconds;
I see much higher CPU in addition to it taking longer.
seconds? or milliseconds? on a local against 519, I see this taking < 4 seconds (1.2.6 for me takes 3 seconds); 544 and 545 are taking a little longer, which is something I can look into, but nothing like the times you're talking about here
any chance you're talking about a cloud scenario here (remote servers, etc)?
(edit: I've used GetCounters() to confirm that 1000000 operations did complete asynchronously between the two timings)

also: is the redis server a cluster in this case, perhaps?
Although the code reported milliseconds I converted to seconds for the post.
This is all against a redis server running on the same machine (Windows 10) redis_version:3.2.100
I ran the same code against a remote redis 3.2.8 server running on Ubuntu going through a 10gig switch and got faster results but still a big difference
Probably worth noting that I'm running these testing in a console app on .net 4.7.2
k, that's really useful feedback; I'll dig more into this before we release anything; 545 is on nuget, but is "unlisted"
I have a handle on the 544/545 delta, and a plan to fix that regression - and hopefully eat back some of the 1.2.6->2.* delta
I tried to install 545 in a virgin project to perform the same test but get the exception "The assembly for System.Buffers could not be loaded".
I manually added System.Buffers 4.5.0, System.Numerics.Vectors 4.5.0 to rectify the issue and recorded this result.
(mutters curses to the gods of assembly binding and transitive dependency
resolution)
On Thu, 21 Feb 2019, 15:51 tommymonk, notifications@github.com wrote:
I tried to install 545 in a virgin project to perform the same test but
get the exception "The assembly for System.Buffers could not be loaded".
I manually added System.Buffers 4.5.0, System.Numerics.Vectors 4.5.0 to
rectify the issue and recorded this result.
- 2.0.545 (42 seconds)
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-466051728,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsNVvY5FfFRzzRBG1ZsX4d-U8ot6fks5vPsCagaJpZM4a6nNr
.
I'll also ping in here note that we're still seeing CPU spikes, just in a different area once we expanded our tests. I haven't pinned them down yet though. I think it may be caused by an upgrade to asp.net core but I can't prove it yet. They didn't happen when we were still on 2.0.0.
@mgravell I checked out one of the working versions of our code base (that uses 1.2.6) and then only updated the StackExchange.Redis library to latest. The CPU started spiking on load. Seems like a regression.
Just wanted to share an updated test result of the little test mentioned earlier in this issue against 2.0.561;
This is obviously a marked improvement over the previous 2.x versions and is on-par with 1.2.6
Unfortunately the majority of attempts to run this code resulted in the following exception, with the occasional success. Previous versions e.g. 2.0.544 didn't have this issue.
Unhandled Exception: StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, timeout: 5000, phys: Idle, outbound: 0KiB, inbound: 0KiB, inst: 0, qu: 1, qs: 0, in: 0, serverEndpoint: Unspecified/redis01:6379, mgr: 9 of 10 available, clientName: TestHost01, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=1,Free=2046,Min=4,Max=2047), v: 2.0.561.32355 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
2.0.571 still spikes the CPU significantly more than 1.2.6 does.
@tommymonk I'm just guessing here but that may be because of this (fixed just today) issue https://github.com/mgravell/Pipelines.Sockets.Unofficial/issues/26
No, 571 includes the update for that. So if the comparison is to 571, it
isn't related.
On Thu, 14 Mar 2019, 19:36 Nino Floris, notifications@github.com wrote:
@tommymonk https://github.com/tommymonk I'm just guessing here but that
may be because of this (fixed just today) issue
mgravell/Pipelines.Sockets.Unofficial#26
https://github.com/mgravell/Pipelines.Sockets.Unofficial/issues/26—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-473027446,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsG6aan2GJfLaSNF8edZZ5h-EumWHks5vWqTVgaJpZM4a6nNr
.
@mgravell I meant the exceptions @tommymonk experienced, that was on 2.0.561
@NinoFloris I just re-tested with 2.0.571 and experience the same timeouts.
To give more details, I have a test that throws 600 clients at our server. In 1.2.6, the join phase spikes to around 30% (571, spikes to 60%). For the leave phase, 1.2.6 spikes to 40-50%, whereas in 571, it spikes to 100%.
Each client has a dozen or so redis calls for each phase (I know its super inefficient, we're fixing that) so ~7200 redis calls within a second or so.
Interesting. By 600 clients: is that 600 multiplexers, or 600
threads/tasks/etc sharing a multiplexer (or some smaller number of
multiplexers)?
On Thu, 14 Mar 2019, 21:08 Josh Mackey, notifications@github.com wrote:
To give more details, I have a test that throws 600 clients at our server.
In 1.2.6, the join phase spikes to around 30% (571, spikes to 60%). For the
leave phase, 1.2.6 spikes to 40-50%, whereas in 571, it spikes to 100%.Each client has a dozen or so redis calls for each phase (I know its super
inefficient, we're fixing that) so ~7200 redis calls within a second or so.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/1060#issuecomment-473063530,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABDsBzNl3kFJTIWsXv62VhSoTPMB2MRks5vWrpNgaJpZM4a6nNr
.
They all share the same multiplexer.
I've triple checked, and the entire pipeline is async as well.
We are aware of a problem here (relating to a race condition in the write-lock); we're dogfooding a fix locally, and hope to deploy an update in the next day or two.
The combined improvements in 2.0.588 and 2.0.571 should, we believe, address a wide range of timeout and performance problems. I'm closing all timeout issues as "believed fixed in 2.0.588"; if you still see problems, please report them as new issues so we can start investigation with fresh eyes.
I've just tested 2.0.588 several times and found it now runs as fast as 1.2.6 and without any timeouts :clap:
Great news, time to upgrade to 2.x 💪
Most helpful comment
(mutters curses to the gods of assembly binding and transitive dependency
resolution)
On Thu, 21 Feb 2019, 15:51 tommymonk, notifications@github.com wrote: