Sidekiq: Solution for "Error fetching message: Connection timed out"

Created on 22 Apr 2015  Â·  8Comments  Â·  Source: mperham/sidekiq

Hi there,

We've been plagued by those errors for a long time, the ones that start with:

Error fetching message: Connection timed out

And end with

Redis is online, 5.552555561065674 sec downtime

We've just discovered the cause and the fix and I wanted to share it. It seems to be caused by blocking operations in redis used by sidekiq, like this BRPOP or a BRPOPLPUSH used in sidekiq-pro when using reliable fetching (if I'm not mistaken).

If you pass a timeout (as sidekiq does), ruby's redis-rb client calls those operations with a timeout of its own added to the one you passed. So, if you call a brpop with a timeout of 1s, and redis-rb is configured with a timeout of, say, 5s, then ruby-land will raise a timeout error if redis takes longer than 6s to return.

It turns out blocking operations don't block for an exact time. It uses an algorithm that checks for things to be done (like unblocking clients) with certain frequency. This frequency is configurable, and in our case wasn't close enough to what we needed for the amount of connected clients (all blocking for new jobs to come).

Running a blocking operation with a timeout of 1 in the production server took more than 6 seconds for us, thus causing the timeout errors in ruby-land.

Increasing the frequency stopped raising these errors, which are actually harmless but rather noisy and made us scared there was some underlaying issue with our redis/sidekiq setup.

Do you think this would be a nice addition to the Problems and Troubleshooting wiki page?

Cheers!!

All 8 comments

How does redis-rb get that 5s timeout?

Frankly that redis-rb behavior looks like a terrible idea. If I pass in 1s to brpop, I expect it to timeout after 1s, not 2s. How does the current behavior make any sense?

Nevermind, I think I understand what's going on. What did you have hz set to?

We had hz set to 10 and a test BRPOPLPUSH with 1s timeout took on average 4-6s. We increased the hz to 50 and it dropped to about 1.2 - 1.8s.

We have a total of 24 processes running with 5 concurrency each, meaning a potential for ~120 connections to redis (I think?).

Just to make it extra clear, redis-rb's timeout seems like a sane one. All operations timeout after that timeout if redis doesn't respond earlier. It's just that for these blocking ones, it timeouts 5s (or whatever the setting is) after the "expected" blocking time is.

That's really bizarre. Are you sure you weren't seeing a rogue latency spike or CPU starvation? I'd think hz 10 means check for timeouts every 100ms so seeing a 5 sec delay is completely unintuitive.

I've tried to come up with a working example booting many connections to my local redis, but I've only managed to get those blocks for 1 second to go from 1.0X seconds to ~1.8s consistently. Nothing as bad as 5-6s. However, our production setup reports: connected_clients:2956, which considering all the sidekiqs and unicorns it doesn't sound completely inaccurate.

What we're 100% positive is that increasing the hz setting fixed this.

Update connection pool. This might be a symptom of a recent bug.

On Apr 24, 2015, at 09:46, Albert Llop [email protected] wrote:

I've tried to come up with a working example booting many connections to my local redis, but I've only managed to get those blocks for 1 second to go from 1.0X seconds to ~1.8s consistently. Nothing as bad as 5-6s. However, our production setup reports: connected_clients:2956, which considering all the sidekiqs and unicorns it doesn't sound completely inaccurate.

What we're 100% positive is that increasing the hz setting fixed this.

—
Reply to this email directly or view it on GitHub.

ActionMailer::DeliveryJob/execute
ActiveRecord::ConnectionTimeoutError and

ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper/perform
ActiveJob::DeserializationError
could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)

is their any solution for above errors in rails4

Was this page helpful?
0 / 5 - 0 ratings

Related issues

agrobbin picture agrobbin  Â·  4Comments

sandstrom picture sandstrom  Â·  3Comments

paul-ylz picture paul-ylz  Â·  4Comments

HenleyChiu picture HenleyChiu  Â·  4Comments

bartimaeus picture bartimaeus  Â·  3Comments