Hi - while I'm very much liking RQ overall, unfortunately it seems to become unreliable under some circumstances, specifically if I stress it by spawning a large number of workers (via a Condor cluster) with a queue of short-runtime jobs.
The symptom is that the queue is empty - q.is_empty()==True, and rqinfo and the dashboard agree. However, iterating the jobs for is_queued/started/finished/failed doesn't agree; for instance, some jobs still report is_queued==True. Since I'm testing num_queued + num_started == 0 to know when the whole grid job is complete, this is a problem.
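Roughly, the check that shows the disagreement looks like this (a minimal sketch; the default queue and the list of job ids recorded at enqueue time are placeholders for my actual setup):

```python
from redis import Redis
from rq import Queue
from rq.job import Job

conn = Redis()
q = Queue(connection=conn)          # default queue, as in my tests
job_ids = [...]                     # ids recorded when the jobs were enqueued

print(q.is_empty())                 # True - the queue reports empty

# ...yet some of those same jobs still claim to be queued:
stale = [jid for jid in job_ids
         if Job.fetch(jid, connection=conn).is_queued]
print(stale)                        # non-empty - inconsistent with the queue
```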
I'd guess this is the communication with the Redis server timing out and not being retried, thus leaving this inconsistency? We've had problems like this with Python's socket library - despite it claiming to not have a timeout, the underlying C socket API returns ETIMEDOUT and the library throws an exception rather than looping on this condition. Note that I've configured the Redis server with enough fds to honour its default of maxclients=10000.
Would very much like to use RQ over alternatives, but unreliability is a show-stopper. Any ideas what could be causing this behaviour, and can it be fixed?
Wow, I am considering python-RQ as the primary task queue for a high-load project and that's really a stopper :(
By the way, are there any notable examples of python-rq being used in production under heavy loads?
I switched to using mrq (https://github.com/pricingassistant/mrq ). Docs and API are less mature than RQ's and it needs a bit more setup (requires both mongodb and redis), however the developer is very responsive to enhancements and questions, the dashboard is way better, and so far it's proved reliable (the only issue I had was having to configure mongo to accept more simultaneous connections). My tests are still just at a prototype stage, but my impression so far is good.
My basic model is: user supplies a module which generates jobs (callable+params). My script enqueues the jobs via rq/mrq, then launches some number of workers via Condor. The workers are launched in 'burst mode' so when the queues are empty they exit. My script then polls for (num_queued+num_started==0), which means we're done. Jobs can either return a simple result, or write files which Condor can then transfer back to a subdir on the host which ran the script.
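The completion poll is roughly the following (a minimal sketch, assuming an RQ version with job registries; the queue name and connection details are placeholders):

```python
import time
from redis import Redis
from rq import Queue
from rq.registry import StartedJobRegistry

conn = Redis()
q = Queue('grid', connection=conn)
started = StartedJobRegistry('grid', connection=conn)

def grid_done():
    # Done when nothing is queued and nothing is currently executing.
    return q.count == 0 and started.count == 0

while not grid_done():
    time.sleep(1)
```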
I have a version in Celery also, but have only done simple tests, and it doesn't have any equivalent to burst mode, which would make shutting down the workers trickier.
@mark-99 thanks for that find! Btw, why does it need MongoDB?.. It does not sound like a high-load solution.
Celery... well, I was looking for something more lightweight and that's how I found python-rq :)
I believe it stores the queues in redis and the job metadata in mongodb. I'm not sure why it was done this way, but they are using it in a real-world production system so I guess there was a good reason; you could always ask...
Asked the developer about it - https://github.com/pricingassistant/mrq/issues/99
Well, I still think that using two storages (redis and mongo) is a bit of overkill. Hope @nvie will figure it out ;) We're here to help anyway, right, @mark-99?
@nvie ?
If Redis connection is unreliable, jobs could be lost when they are popped off the queue for execution.
The only real solution for this is if we use brpoplpush to atomically pop a job and move it to a temporary queue. However, this would limit workers to only be able to listen to one queue.
Perhaps we can use brpoplpush if worker only has one queue to listen to and use brpop and lpop for multiple queues. This wouldn't be too hard to do and I'll try to find the time to implement it if @nvie has no objections.
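Roughly what I mean is the standard reliable-pop pattern, sketched here with redis-py directly (the key names are illustrative, not RQ's actual key layout, and assume a redis-py 3.x `lrem` signature):

```python
import redis

r = redis.Redis()

QUEUE_KEY = 'rq:queue:default'           # illustrative key names only
PROCESSING_KEY = 'rq:processing:worker1'  # one per worker

def reliable_pop(timeout=5):
    # Atomically move the next job id from the queue onto a per-worker
    # "processing" list, so a crash between pop and execution leaves the
    # job recoverable instead of lost.
    return r.brpoplpush(QUEUE_KEY, PROCESSING_KEY, timeout=timeout)

def ack(job_id):
    # Remove the job from the processing list once it has completed.
    r.lrem(PROCESSING_KEY, 1, job_id)
```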
@selwin if it's really about redis, then the same would happen with celery when using redis as a broker
I don't believe Redis is intrinsically unreliable - I'm now using the same instance with mrq without any problems.
As per the original post, I'd strongly suspect socket timeouts are the underlying issue.
I am happy to retest any updates.
@mark-99 I'm sure redis IS reliable, but I cannot understand the situation. Why are there socket timeouts if the culprit is not Redis? And how can we avoid them?
@mark-99 btw, the problem with MRQ is that it does not seem production-ready :(
Again I'm guessing here, but presumably the worker process has to communicate with Redis to push some status about job start/completion. Perhaps if Redis is heavily loaded it does not respond within some timeout (or a limit on simultaneous connections, or some other limit, is being hit, although as I said I configured it for 10k clients). In that case it seems the update about the job status is simply lost, hence the inconsistency between the queue object and iterating over the job statuses. If so, the fix would be to better detect the failure to update the database, and retry.
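Something like the following is the kind of retry I mean (just a sketch; the retry policy and the `conn.hset(...)` call in the usage comment are purely illustrative):

```python
import time
import redis

def with_retries(fn, attempts=3, delay=0.5):
    # Retry a Redis operation a few times on connection/timeout errors
    # instead of silently losing the status update.
    for attempt in range(attempts):
        try:
            return fn()
        except (redis.ConnectionError, redis.TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))

# e.g. with_retries(lambda: conn.hset(job_key, 'status', 'finished'))
```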
MRQ seems good - I agree it's a bit rough in places, but the dashboard is way better and it does seem to work. The author has been responsive, and his company is using it in a commercial application. My use case is an internal grid computing facility so provided it works for the features we need, that's sufficient.
That said, I do like RQ also and have maintained that version in parallel.
@mark-99 Redis is a wonderful data store. However, like any other data store, it's possible to saturate a data store if you throw enough traffic at it (this is why companies like Twitter shard their Redis instances).
With the current design, it _is_ possible for RQ to lose jobs. RQ currently pops a job off a queue and puts the job into the `StartedJobRegistry` using two separate instructions. If the second step fails, the job will be "lost" (the job will not be present in any queue, `StartedJobRegistry`, `FailedQueue` or `FinishedJobRegistry`).
This is the part that I think is most crucial to address.
If connection failures happen in other places, you can probably work around them by writing a custom exception handler (`rqworker --exception-handler='my.custom.exception_handler'`) that handles Redis connection errors.
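For example, a handler along these lines (a sketch only; whether re-enqueueing via `Queue.enqueue_job` is the right recovery step depends on your jobs being safe to re-run):

```python
import redis
from rq import Queue

def exception_handler(job, exc_type, exc_value, traceback):
    # Only intercept Redis connectivity problems; let everything else
    # fall through to the default handler.
    if issubclass(exc_type, (redis.ConnectionError, redis.TimeoutError)):
        # Push the job back onto its original queue rather than losing it.
        Queue(job.origin, connection=job.connection).enqueue_job(job)
        return False  # stop the handler chain here
    return True       # continue to the next handler otherwise
```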
@mark-99 I agree with the overall usefulness of mrq, though the author himself says it's not production-ready yet. Right now I am looking for a lightweight yet reliable queue for my high-load project, so I thought maybe there is some way to patch up rq, which seems more production-ready to me.
@selwin thank you for the insight. Do you know if there are common practices for handling this type of issue? If I were to create a PR, how should I approach this problem?
Hi! I'm the primary developer of MRQ. We consider it production-ready now (we've been using it for more than a year on billions of jobs), though docs are indeed less mature than RQ.
Not losing any jobs is a strong guarantee that you can only reach with Redis if you have some way to store which jobs were started, and a way to requeue jobs that were interrupted for some reason. We wrote built-in jobs for all those cases:
http://mrq.readthedocs.org/en/latest/jobs-maintenance/
Our implementation of the "started" zset:
https://github.com/pricingassistant/mrq/blob/master/mrq/redishelpers.py#L47
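The idea, roughly (simplified key names and a redis-py 3.x `zadd` signature assumed; the linked file is the real implementation):

```python
import time
import redis

r = redis.Redis()
STARTED_KEY = 'jobs:started'   # simplified key name, not MRQ's actual key

def mark_started(job_id):
    # Score each started job by its start time so stale entries are easy to find.
    r.zadd(STARTED_KEY, {job_id: time.time()})

def mark_finished(job_id):
    r.zrem(STARTED_KEY, job_id)

def find_interrupted(max_runtime=3600):
    # Jobs still marked as started after max_runtime seconds are candidates
    # for requeueing by a maintenance job.
    cutoff = time.time() - max_runtime
    return r.zrangebyscore(STARTED_KEY, 0, cutoff)
```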
We added MongoDB as a data store primarily for the dashboard, but it also provides a dual storage for jobs that can be very useful if Redis is abruptly emptied/flushed/crashed (happened to us a couple times in production because we didn't monitor it enough, and we requeued everything from Mongo).
As I said before, we were strongly influenced by RQ which we still love, and would be happy to see some ideas flow the other way around!
Cheers,
@sylvinus that's good news, thanks! I'll consider using it. By the way, was it load-tested?
Talking about jobs, well, I don't know any apps where all jobs are idempotent.
For example, a job for sending push message or email cannot be idempotent.
@DataGreed it was load tested with 1000s of workers.
Jobs sending emails can be idempotent if you have a proper lock/semaphore inside on the thing you want to do only once. In this case repeated calls would wait or fail/abort.
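For example (a simplified sketch; `send` is a placeholder for whatever actually delivers the message, and here repeated runs simply abort rather than wait):

```python
import redis

r = redis.Redis()

def send_email_once(message_id, send):
    # SET NX gives a one-shot lock per message: the first worker to claim
    # the key sends the email; repeated/requeued runs become no-ops.
    claimed = r.set('email:sent:%s' % message_id, 1, nx=True, ex=86400)
    if not claimed:
        return  # already sent (or in flight); abort instead of double-sending
    send(message_id)
```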
@sylvinus then what is the point of queuing if I save the state of every email in the database? :)))) I can just run a cron worker that will cycle through records, select unsent ones and send them.
@sylvinus you see, if I need to send 1000 push notifications per second and save a lock for every one of them, then a queuing system that saves jobs seems like overhead, that's what I meant.
The question is what happens when, for instance, your task times out for some reason. Maybe the worker was killed abruptly and you're not sure if the push was sent or not. That's an issue for your app to solve, not your task queue.
@selwin Could you explain why the `brpoplpush` approach would only work when using one queue? With some refactoring this scenario should also be able to work for multiple queues, right?
@joostdevries no, because the Redis command itself only allows you to listen to a single queue :): http://redis.io/commands/rpoplpush
@selwin Ah right, I see. One other alternative could be to use a non-blocking version and poll every second, but that's probably not the nicest way to solve this.
No, polling Redis would be terribly inefficient and inelegant in my opinion.
@mark-99, what kinds of numbers are you talking about in your original ticket? Hundreds? Thousands? Tens of thousands?
This is a pretty serious concern for us, but our project might have a more modest definition of "heavy load".
@selwin what is the status here? Currently seeing some strange issues in production (using Sentinel) where suddenly RQ cannot find a job in Redis. It seems like the job has just vanished, as it has been written to Redis at least once.
I looked into this a few times but was unable to find anything wrong with the code, so I'll close this issue. @atainter fixed his issue by upgrading the Redis cluster version, so it's actually a Redis issue.
Thx for your super-quick response :)
So I added this on the other issue, but a Redis engine upgrade absolutely does not resolve the problem. I am not certain why it fixed things for @atainter, but it definitely did not resolve the bug for my application.