Stackexchange.redis: "Timeout performing EVAL" with high "in" - application too slow?

Created on 25 Jun 2017 · 27Comments · Source: StackExchange/StackExchange.Redis

We occasionally have a TimeoutException thrown when we receive bursts of traffic.

I have done some reading, and from how I understand it, "in" represents the number of bytes (22293 bytes in the example below) sent by the Redis server, received by the network layer of the client machine, but still waiting to be read by the client application. What this means is that the application is not processing quick enough, and not within the timeout configured on the Redis client.

By this logic, is it true to say that machine/network bandwidth would NOT be the limiting factor, because the data has in fact been received by the client? Furthermore, upgrading the Redis server would not help fix the problem either. I have also ruled out threading issues (see this post).

Would upgrading the client server (VM) be the optimal solution? Otherwise, increasing the specified timeouts.

System.TimeoutException: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 47, qu: 0, qs: 47, qc: 0, wr: 0, wq: 0, in: 22293, ar: 0, clientName: xyz, serverEndpoint: Unspecified/my-application.redis.cache.windows.net:6380, keyHashSlot: 14506, IOCP: (Busy=3,Free=997,Min=200,Max=1000), WORKER: (Busy=10,Free=32757,Min=200,Max=32767), Local-CPU: unavailable (Please take a look at this article for some common client-side issues that can cause timeouts: https://github.com/StackExchange/StackExchange.Redis/tree/master/Docs/Timeouts.md)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at CacheManager.Redis.RedisCacheHandle`1.Eval(ScriptType scriptType, RedisKey redisKey, RedisValue[] values, CommandFlags flags)
   at CacheManager.Redis.RedisCacheHandle`1.<>c__DisplayClass44_0.<GetCacheItemAndVersion>b__0()
   at CacheManager.Redis.RetryHelper.Retry[T](Func`1 retryme, Int32 timeOut, Int32 retries, ILogger logger)
   at CacheManager.Redis.RedisCacheHandle`1.Retry[T](Func`1 retryme)
   at CacheManager.Redis.RedisCacheHandle`1.GetCacheItemAndVersion(String key, String region, Int32& version)
   at CacheManager.Redis.RedisCacheHandle`1.GetCacheItemInternal(String key, String region)
   at CacheManager.Core.BaseCacheManager`1.GetCacheItemInternal(String key, String region)
   at CacheManager.Core.Internal.BaseCache`1.Get(String key, String region)
   at CacheManager.Core.Internal.BaseCache`1.Get[TOut](String key, String region)

connection timeout

Source

davenewza

👍1

Most helpful comment

Yes, still an issue unfortunately.

davenewza on 2 Aug 2017

👍2

All 27 comments

yes, this looks like a reader stall based on the non-zero "in" and zero
"ar". we have some plans to completely overhaul the network layer to add
stability. I need to move forward on this work :/

On 25 Jun 2017 1:34 p.m., "Dave" notifications@github.com wrote:

We occasionally have a TimeoutException thrown when we receive bursts of
traffic.

I have done some reading, and from how I understand it, "in" represents
the number of bytes (22293 bytes in the example below) sent by the Redis
server, received by the network layer of the client machine, but still
waiting to be read by the client application. What this means is that
the application is not processing quick enough, and not within the timeout
configured on the Redis client.

By this logic, is it true to say that machine/network bandwidth would NOT
be the limiting factor, because the data has in fact been received by the
client? Furthermore, upgrading the Redis server would not help fix the
problem either. I have also ruled out threading issues (see this post
https://gist.github.com/JonCole/e65411214030f0d823cb.

Would upgrading the client server (VM) be the optimal solution? Otherwise,
increasing the specified timeouts.

System.TimeoutException: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 47, qu: 0, qs: 47, qc: 0, wr: 0, wq: 0, in: 22293, ar: 0, clientName: xyz, serverEndpoint: Unspecified/my-application.redis.cache.windows.net:6380, keyHashSlot: 14506, IOCP: (Busy=3,Free=997,Min=200,Max=1000), WORKER: (Busy=10,Free=32757,Min=200,Max=32767), Local-CPU: unavailable (Please take a look at this article for some common client-side issues that can cause timeouts: https://github.com/StackExchange/StackExchange.Redis/tree/master/Docs/Timeouts.md)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImplT
at CacheManager.Redis.RedisCacheHandle1.Eval(ScriptType scriptType, RedisKey redisKey, RedisValue[] values, CommandFlags flags) at CacheManager.Redis.RedisCacheHandle1.<>c__DisplayClass44_0.b__0()
at CacheManager.Redis.RetryHelper.RetryT

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/StackExchange/StackExchange.Redis/issues/650, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AABDsK7OqIB0j4ZGaAC9LVpHcp9MDlg5ks5sHlO_gaJpZM4OEmIM
.

mgravell on 25 Jun 2017

In that case I will increase the timeout in order to handle this. Thanks

davenewza on 27 Jun 2017

I have often seen this issue as a result of CPU spikes to 100%. The "local-cpu: unavailable" indicates that you are running in an environment that is blocking access to system perf counters, so these can be difficult to track down. Capturing a process dump at the time of the exception may be the only way to verify...

JonCole on 27 Jun 2017

IMPORTANT UPDATE: it turns out that the build config was broken during the csproj=>json=>csproj gymnastics; a very important flag wasn't ... flagged. This has been fixed in 1.2.5 which will be available on NuGet as soon as it indexed. We think this relates to this bug, and I'm hopeful that 1.2.5 will improve or completely fix this.

mgravell on 24 Jul 2017

Fantastic. Will update pronto and report back in due course.

davenewza on 25 Jul 2017

This still seems to be an issue with the 1.2.5 update.

oofpez on 28 Jul 2017

Yes, still an issue unfortunately.

davenewza on 2 Aug 2017

👍2

@JonCole I've got a few memory snapshots with Redis timeouts, how to find the cause out of it?
Although we've followed the steps from Diagnosing Redis errors on the client side, we still get timeouts:

Memory usage is within box physical RAM
ThreadPool was strictly limited to scale from 'Min=40, Max=48' with 4 processors so that runtime does not create excessive threads, in other words - we have secured CPU slice for each thread
Redis server felt good

Despite an application is forced to take small chunks of work for processing (excessive requests will be in a queue) and Redis server has tons of free capacity, timeouts will still bubble =\

I was wondering what bits are to be inspected to figure out timeout causes?

mitikov on 6 Mar 2018

@mitikov The dump will capture information about CPU, callstacks, etc and you can try to see if there are clues about what could be causing the processing of responses to stall. For instance, the CPU information should be in the dump. If you are on Windows, you can use WinDBG (free debugger from Microsoft) to examine the dump. Make sure you have installed a recent version of that debugger as the newer versions are much better than the older versions.

Some interesting commands are (remove quotes):
"!threadpool" -> this will show CPU usage and thread pool stats at the time the dump was taken

"~" -> list all thread ids.

"~*k" -> This will print out the callstacks for all threads.

"~s10" -> switch the active thread for debugging to thread with id 10.

"!dso" -> dumps objects on the stack for the current thread. You should then be able to click on the address of the various objects to drill into those objects.

Questions to ask yourself:

Is The CPU running hot? high CPU is a very well known cause of timeouts.
Do you see any call stacks you don't expect? What are those stacks doing?
Is GC running currently? If you see all threads blocked on the GC, then that could explain the timeouts. If GC is running too often and taking a long time, then that will pause execution.

Anyway, I can't possibly give you a full tutorial on debuggers in this response, so I hope this gives you some ideas.

JonCole on 6 Mar 2018

I've seen a good dozen of memory snapshots with Redis timeouts; some of those do have high CPU reported, while others do not.

The technical story can be obtained from 118011717482866 Microsoft support ticket.

As for your questions:

Is The CPU running hot? high CPU is a very well known cause of timeouts.

I guess meant points are:

ThreadPool will not create more threads if CPU is high - nobody to pick the callback & parse the response from Redis. Note: skip balancing algorithms for now.
CPU is high = threads might not get CPU time slice

Do you mean those issues? Are there any more?

Do you see any call stacks you don't expect? What are those stacks doing?

A vast majority sits in a Pulse inside StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl

Is GC running currently? If you see all threads blocked on the GC, then that could explain the timeouts.

I expect GC to run from time to time, and noticeable delays while dealing with 2nd gen & compacting memory.
The support request states timeouts during GC are expected.
Taking into account GC is expected in (productions) loaded solutions, it sounds timeouts are expected in productions.
We are using Azure Web Apps -> limited RAM == faster GC (compared to monster boxes).

Anyway, I can't possibly give you a full tutorial on debuggers in this response, so I hope this gives you some ideas.

At this point in time, all we have is a set of memory snapshots without technical background on timeout - while monitored object did not pulse....

The thing that buzzes me most:

We have data - user memory snapshots with exceptions
There are some 'known' theoretical causes for them

Despite those points, we are not aware of steps that lead from snapshot to known cause with a certain degree of accuracy :(
I would much appreciate your advice on how would you investigate 'phantom' timeouts.

mitikov on 7 Mar 2018

Is anyone still replicating this once moving to the 1.2.6 (or higher) release?

NickCraver on 28 May 2018

Yes I've been running into this in production (with 1.2.6) on an almost daily basis now. If there's anything I can do to help track down the corner case, let me know. It's a big deal for us since we're trying to do distributed locking with Redis for the primary function that this system serves.

If there's any kind of known workaround so I can fix some production issues that would be great - someone in another issue mentioned some locking.. is there some class which isn't thread-safe ?

We have CPU, memory and all kinds of fairly detailed logging in place for this system, and can add more. Happy to try stuff out in our test environment if that helps. Would love to help wrap this up!

E.g.

logentry

kierenj on 2 Jun 2018

@kierenj The in being so large there means the incoming reader is locked up, as there are bytes in queue to read but they're not coming back (the 1394 pending). The fact that the manager is Inactive means the reader thread died. Do you have an earlier exception that shows why it died?

Related: it's very curious to me that EVAL seems to be such a large percentage of ones we're seeing - I'm going to take a hard look there.

NickCraver on 3 Jun 2018

I can't see a reason for it dying, but the first exception was:

@ 17:02:54.921 (in=340) - an RPOPLPUSH

https://kierenj.tinytake.com/media/797e1c?filename=1528009095570_03-06-2018-07-58-17.png&sub_type=thumbnail_preview&type=attachment&width=2000&height=1460

Followed by:

@ 17:02:54.967 (in=340) - a SET

https://kierenj.tinytake.com/media/797e36?filename=1528009179453_03-06-2018-07-59-42.png&sub_type=thumbnail_preview&type=attachment&width=1999&height=1485

It seems to then continue with very similar exceptions (lots and lots of them), with the in queue size increasing.

I think I may have seen some trace logging code in there. Is there some way I could grab that output? - maybe on exception I could capture the contents of some buffer of the last 1000 trace output lines and log that?

kierenj on 3 Jun 2018

It'd be quite noisy, but you can attach to the current library (and anything else tracing...) in a few ways: https://stackoverflow.com/questions/198322/redirect-trace-output-to-console

I'm taking a look to see how it managed to get into Inactive and not Faulted in this case.

NickCraver on 3 Jun 2018

I'm not getting much coming through the Trace listener at all for normal operation - is that correct?

There's a Conditional("VERBOSE") around the main Trace method - do I need to do a build with that on or is that the default?

(I guess - to debug this issue, would you prefer just normal trace output or verbose trace output? I've put together a custom trace listener which, in the event of an error being logged, dumps the last 1,000 trace messages - up to a 30 min limit - to the log too. Is that likely to be enough history, too?)

kierenj on 4 Jun 2018

@JonCole , @NickCraver , @kierenj - here is an explanation of the problem and solution that has helped us: https://kb.sitecore.net/articles/464570

mitikov on 4 Jun 2018

@mitikov, very nice article! Although, that ties in with thread pool exhaustion, whereas in my case the CPU of both the server and redis is very low (<10%), and as you can see from the stats there are only 1 or 2 worker threads, with a very large number of free threads. I believe the issue I have (and others) is that there's no active reader (ar=0) but is pending input traffic (in>0), i.e. some bug in the wider state machine

kierenj on 4 Jun 2018

@kierenj - can you collect a full exceptional memory snapshot WebApp guide / onPrem guide for the timeout exception and share it in any way?

I'll try to dig smth useful from there, how does that sound?

mitikov on 4 Jun 2018

I can't do that due to the confidential nature of the data in this case, but thank you for the offer of help. I'm keen to see what @NickCraver thinks is most useful to diagnose/debug (perhaps that trace output)

kierenj on 4 Jun 2018

Ok, I'm getting this super-super repeatably right now. It's an Azure app service, most basic level (poss. 1 core) - and the app just plain won't start today. Immediate errors with an in value >0 and ar=0.

What can I do to move this one on? Will any old trace output do, or do I want to define VERBOSE?

kierenj on 13 Jun 2018

I'll admit it, this is ruining me.

I defined VERBOSE on the 1.2.6 tag, but there were a few too many build errors to get my head around. Maybe a global.json-needed situation. So instead I've built it against the master branch. Only a few tweaks needed to get it to build (missing imports).

I can't just drop in a direct binary reference to my VERBOSE build because I'm using it the package indirectly through some packages I have published to myget. So I've created a forked version of the package and published that too (to a private myget).

Now - I can see tons of debug output. I can't dump all trace output, there would be too much. So I'd written a custom trace listener to collect output, then in the case of an error, it'll dump the last 10,000 lines of trace output.

However, on .NET Core, Debug.WriteLine (as used by VERBOSE build) doesn't go to trace listeners. According to hearsay, it doesn't go to any listener. Certainly while the XML documentation says that "System.Diagnostics.Debug.Listeners" exists in netstandard2.0, it actually doesn't.

Nevertheless, I changed up all the calls to be Trace.WriteLine, and got that working..

When I run my new build on the server.. things work fine, I don't get the timeout exception, I roll back, I get it again. Roll forward, it's gone.

Does that make sense.. could the version I grabbed from your master today have a fix for this? Or have I changed some other variable in this experiment?

kierenj on 14 Jun 2018

We're still seeing some in>0, ar=0 on the version I grabbed from master 2 weeks ago.. in slightly different situations.

I don't mean to be a needy GitHub issue fiend, but do let me know if I can assist in testing/diagnosing this. Is the new #868 PR intended to cover this kind of issue?

kierenj on 28 Jun 2018

@kierenj yes, that's the v2 branch - completely new network core based on the new "pipelines" pieces; network connection stability improvements is the primary driver; secondary drivers include lower allocations and simplified codebase (only one approach, rather than 2 - depending on the OS/TLS). As you can see, we're active on that right now. It passes most tests, if you want to try it :)

edit: oh, and better control over threading to avoid ThreadPool starvation

mgravell on 28 Jun 2018

👍1

That is truly exciting to hear, thanks very much! Having to kill production worker services fairly frequently at the moment to reset them from the ar=0 state. Think I'll get a package together and try on our test environment asap. :)

kierenj on 28 Jun 2018

Closing; please see #871

mgravell on 28 Jun 2018

@kierenj "ar" literally doesn't exist any more :)

mgravell on 29 Jun 2018

🎉1

Was this page helpful?

0 / 5 - 0 ratings