Been having an issue on and off where workers are disappearing, and I'm having trouble figuring out why. I'm currently running 200 workers with supervisord, in five batches; the batches serve a total of 25 queues. Worker numbers seem to fluctuate: sometimes there will suddenly be only 10 workers, sometimes 239, though the numbers vary. I'm trying to find a solid way to track down the problem, or at least deal with it, but so far no luck.
@RedYetiSoftworks I've seen similar behaviour on my setup. I'm using rq-dashboard to monitor the queues and randomly some workers "disappear". Then if I check one by one, the workers seem to be active and handling jobs.
@selwin do you have any idea of what can be going on?
@jlopex I think it may be caused by workers timing out on long queries. I went through yesterday and made sure all db touches were wrapped in try/excepts, and my numbers now seem stable. I suspect that RQ was timing out the workers, but supervisord wasn't restarting them because of the hanging processes. I'm going to try scaling my worker counts back up on Monday and see if they remain stable.
@jlopex from eyeballing the worker registration related code, I don't see any apparent bugs.
@RedYetiSoftworks in addition to what you did, could you please print out the result of connection.smembers('rq:workers') every time you check the number of available workers?
@selwin connection.smembers('rq:workers') throws a syntax error near token 'rq:workers'
@RedYetiSoftworks you may need to tweak the syntax a little bit. The connection here should be the Redis connection your queue uses.
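For example, a minimal sketch from a Python shell (assuming redis-py and a local Redis on the default port; use the same connection object your queues were created with):

    from redis import Redis

    # Assumption: Redis runs locally on the default port.
    connection = Redis(host='localhost', port=6379)

    # The members of rq:workers are the workers RQ currently considers registered.
    workers = connection.smembers('rq:workers')
    print(len(workers), workers)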
@selwin I've tried a few permutations, but still getting the syntax error. I can see the connections when I use netstat, if that helps. Also, if it matters, I'm running the redis server locally on the default port.
@selwin Using 'redis-cli smembers rq:workers' works, and lists all of the workers (160 at the moment, I can't seem to get the correct number of workers to spawn when I go much higher)
Can you get the number to increment if you manually start rqworker?
Can you also check the log files to see whether supervisor is correctly spinning up the right number of workers? (Or do any of them unexpectedly die or fail to start?)
@selwin - number does increment, and I am seeing some interesting things in the supervisor log when it comes to not enough workers spawning:
2015-11-16 21:04:46,674 INFO spawnerr: too many open files to spawn 'flight_workers_13'
2015-11-16 21:04:46,674 INFO gave up: flight_workers_13 entered FATAL state, too many start retries too quickly
Log doesn't go back to when they kept dying off.
Thanks for the help!
Figured out the issue causing an incorrect number of workers to spawn. Max open file limit was too low on the server. It's been raised, and now I can spawn as many workers as I want.
Is this a config change on the OS side? Could you please let us know what parameters to change in this thread so that other people can fix it if they see this problem?
For sure! Basically, the open file limit was set too low on the server.
Detailed fix:
The supervisord logs showed the following messages:
2015-11-16 21:04:46,674 INFO spawnerr: too many open files to spawn 'flight_workers_13'
2015-11-16 21:04:46,674 INFO gave up: flight_workers_13 entered FATAL state, too many start retries too quickly
The ulimits for open files were:
ulimit -Hn
4096
ulimit -Sn
1024
Added the following entries to /etc/security/limits.conf:
root soft nofile 65535
root hard nofile 65535
and ran the following command:
ulimit -n 65535
These steps seem to have solved the problem of spawning a larger number of workers.
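As a sanity check after raising the limit, the value a worker process actually inherits can be read from Python with the standard library's resource module (a quick sketch, not RQ-specific):

    import resource

    # (soft, hard) limits on open file descriptors for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print('soft nofile limit:', soft)
    print('hard nofile limit:', hard)

If supervisord is the parent of the workers, its minfds option in the [supervisord] section may also be worth checking, since the workers inherit their limits from it.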
Another issue we have is that we've recently begun running monit in addition to supervisord, and it was bringing supervisord back up before the deployment script wanted it up. Our deployment script now looks something like this:
kill monit -> kill supervisord -> send term signal to workers -> wait for workers to die off (can take up to the timeout for the worker itself) -> restart supervisord -> restart monit.
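For the "wait for workers to die off" step, this is roughly what we poll (a sketch, assuming RQ's Worker.all() and that the script can reach the same Redis the queues use):

    import time
    from redis import Redis
    from rq import Worker

    redis_conn = Redis()  # assumption: local Redis on the default port

    def wait_for_workers_to_exit(timeout=600, poll_interval=5):
        """Block until no workers remain registered in Redis, or until the timeout."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if not Worker.all(connection=redis_conn):
                return True
            time.sleep(poll_interval)
        return False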
It seemed to work well yesterday. We'll be testing it more extensively today, and I'll share any other problems we may run into here.
Worker counts still refuse to remain stable over ~200, especially at redeployment. I'll update this thread if we get anything figured out.
@selwin I did not have much time to look at what could be causing this. I only have 8 workers on 6 different queues (some of them are usually empty).
In my specific case the queues are not long; usually they are well below 100 items. Could it be related to the time needed to perform a job? Long jobs may need between 30 minutes and 4 hours to complete.
Another piece of information that might help: as far as I've noticed, workers start to disappear after the service has been running for a "long" time (weeks).
@jlopex - Are you using supervisord to monitor the workers? If so, try checking the supervisord logs; alongside the syslogs, they've been very helpful in my own troubleshooting.
@jlopex as long as you set the right timeout, it shouldn't be an issue. Right before a worker executes a queued job, it sends a heartbeat with the right timeout to prevent the worker key from expiring.
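For reference, the heartbeat described here boils down to refreshing a TTL on the worker's Redis key right before the job runs; very roughly (a simplified sketch, not RQ's actual code):

    from redis import Redis

    connection = Redis()  # assumption: the same Redis the queues use

    def heartbeat(worker_key, job_timeout, margin=60):
        """Refresh the worker key's TTL so it outlives the job about to be executed."""
        # If this EXPIRE never reaches Redis, the key eventually times out and the
        # worker looks dead even though the process is still running the job.
        connection.expire(worker_key, job_timeout + margin)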
Can you verify that the workers are still running (processing jobs) when you see the worker disappear from the dashboard?
I wonder if this could be related to an unreliable connection to the Redis server. Are you running the worker on the same box as Redis?
@RedYetiSoftworks Indeed, I use supervisord to monitor the workers. I haven't seen any trace that gives me a lead to understand exactly what's happening.
@selwin the workers are not on the same machine, and I verified that the workers are still handling jobs because the system works as expected.
The infrastructure is composed of 5 instances, 1 acting as entry-point/server while the other instances hold the workers. The whole infrastructure is deployed in AWS, so connectivity should be pretty stable, but the connectivity needs of the platform are sometimes very high (transfer peaks of 1 Gbps for a few minutes), so it's possible that under high load the connectivity may not be good enough for Redis.
Here's a little more information. This is the current set of workers that should be attached:
4 workers connected to high, normal, low queues
2 workers connected to audio queue
1 worker connected to upload queue
But if we check the rq-dashboard or rqinfo we only see:
ip-10-194-134-60.1660 idle: audio
ip-10-76-8-2.14913 idle: upload
ip-10-197-5-146.22644 busy: high, normal, low
ip-10-194-134-60.1651 busy: high, normal, low
ip-10-195-3-228.2383 busy: high, normal, low
ip-10-76-8-2.14914 busy: high, normal, low
6 workers, 6 queues
So I go to the machine ending in 228 (which I know should have a worker connected to the audio queue, but it doesn't show up):
user@ip-10-195-3-228:/opt/hh/production/current$ sudo supervisorctl status
audio_enc_worker:audio_enc_worker-00 RUNNING pid 2392, uptime 7 days, 1:05:03
video_enc_worker:video_enc_worker-00 RUNNING pid 2383, uptime 7 days, 1:05:03
And the worker is there, RUNNING.
Excerpt from audio_enc_worker:audio_enc_worker-00 logs:
...
2015-11-23 12:46:44 INFO [hasselhoff.media.tools.ffmpeg ffmpeg:454] Muxing 1 final outputs
...
12:49:55 Job OK
12:49:55 Result is kept for 500 seconds
12:49:55
12:49:55 *** Listening on audio...
So I can confirm that the worker is handling jobs but is not being shown in rq-dashboard or rqinfo.
Could it be that the worker is still subscribed to (and receiving elements from) the Redis queue, but somehow the rq framework has "expired" the rqworker?
I'm not sure how rq handles worker registration at the framework level.
@jlopex Thanks for the additional info! Very helpful!
@selwin My worker counts from rqinfo and from the running worker processes (ps aux | grep rqworker | wc) are not in sync, and they fluctuate. As I write this I have 134 workers showing in rq and 189 from the command. After typing that, I checked again: 131 workers in rq, and the command returns 191. The supervisord logs give no insight.
_EDIT_ After a reboot, rqinfo shows 214 workers, but a process search only shows 164.
Thanks.
Another update, which looks relevant to @jlopex's issue.
ps aux | grep rqworker shows 293 rq workers. rqinfo shows 74. Still not sure /why/, exactly, but I have queues with zero workers assigned to them in rqinfo that are still getting serviced. I suspect that workers are somehow de-registering from rq but continuing to service jobs. At the same time, new ones are being created (but not by supervisord). Still investigating; if I find anything more I'll post it here.
@jlopex @selwin I spent most of yesterday debugging this issue, and I believe I have found another part of the cause. If I am understanding the worker process properly, each time a worker grabs a job, it creates a child process to do the actual work, and this child is named the same as the worker itself. When checking processes to see if any new ones need to be spawned, supervisord counts these children as worker processes. When workers die/unregister/etc., they aren't respawned because supervisord doesn't realize they're gone. My temporary solution was to front-load all of my jobs, so that there is a period every ~10 minutes where there should be no jobs running, and supervisord now spawns new workers up to the number I want in that period. Still not sure why workers are dying, or why it seems that child processes sometimes take longer than they should to go away. Is there any way to change the naming convention for the child processes?
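To illustrate why a raw process count overstates things: the forked work horse shows up under the same name as its parent, so counting only rqworker processes whose parent is not itself an rqworker gives the real worker count (a sketch assuming psutil, which is not part of RQ):

    import psutil

    def count_top_level_rqworkers():
        """Count rqworker processes, excluding forked work horses (children of workers)."""
        workers = {}  # pid -> ppid for every process whose command line mentions rqworker
        for proc in psutil.process_iter():
            try:
                if 'rqworker' in ' '.join(proc.cmdline()):
                    workers[proc.pid] = proc.ppid()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        return sum(1 for pid, ppid in workers.items() if ppid not in workers)

    print(count_top_level_rqworkers())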
Hope this helps.
@jlopex I noticed that rqinfo and rq-dashboard are showing you different numbers of workers. Could you run this command on the Redis server and see how many workers it considers to still be alive? "redis-cli smembers rq:workers"
Perhaps the heartbeat is not received by Redis because of connectivity issues, since it's sent during one of your peak connectivity periods. I'll have to do more investigation here.
@RedYetiSoftworks Thanks for doing all the investigation. We could definitely change the names of the child processes if this confuses supervisord.
Does this happen with a small number of workers? I personally don't use supervisord, but it's quite popular and I haven't heard of any issues like this before.
@selwin It definitely confuses supervisord. I haven't checked it with a small number of workers; currently I'm running 150. Possibly @jlopex could investigate? Even a slight change to the child process naming should fix the supervisord issue. The fact that child processes don't always terminate when they time out (my timeout is currently set to 300 seconds) is an entirely different issue that I intend to look into more at some point.
@RedYetiSoftworks sorry for the late reply. Do you mind opening a PR to change the process name of the child process then?
@selwin @RedYetiSoftworks Tomorrow I'll try to spend some time investigating this issue.
@selwin I just had time to review this.
This is the output I get:
$ redis-cli smembers rq:workers
1) "rq:worker:ip-10-197-5-146.22644"
2) "rq:worker:ip-10-194-134-60.1651"
3) "rq:worker:ip-10-76-8-2.14914"
4) "rq:worker:ip-10-195-3-228.2383"
This matches the result from:
$ rqinfo
normal | 0
failed | 0
upload | 0
low |███████ 7
audio | 0
high |█ 1
6 queues, 8 jobs total
ip-10-197-5-146.22644 busy: high, normal, low
ip-10-194-134-60.1651 busy: high, normal, low
ip-10-195-3-228.2383 busy: high, normal, low
ip-10-76-8-2.14914 busy: high, normal, low
4 workers, 6 queues
Going a little bit deeper, I checked the output of redis-cli INFO and it returns something that might be related.
# Clients
connected_clients:10
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:3
There are three "blocked_clients", which might be the three "hidden" workers that are handling jobs but not showing up in rq.
From the redis-cli documentation I see:
blocked_clients: Number of clients pending on a blocking call (BLPOP, BRPOP, BRPOPLPUSH)
I also checked the 'birth' date of one of the "hidden" but active workers and of one of the visible ones; as you can see, the Redis hash of the hidden one has been deleted.
127.0.0.1:6379> hget 'rq:worker:ip-10-76-8-2.14914' 'birth'
"2015-11-16T10:58:01Z"
127.0.0.1:6379> hget 'rq:worker:ip-10-195-3-228.2392' 'birth'
(nil)
Update: going a little further, I requested all the keys of one of the "hidden" workers:
127.0.0.1:6379> hgetall rq:worker:ip-10-195-3-228.2392
1) "state"
2) "idle"
Compare that to one of the workers that shows up correctly:
127.0.0.1:6379> hgetall rq:worker:ip-10-76-8-2.14914
1) "birth"
2) "2015-11-16T10:58:01Z"
3) "queues"
4) "high,normal,low"
5) "state"
6) "busy"
7) "current_job"
8) "d9618712-df95-456c-ae05-6c449af3230d"
After reviewing the code path, I think I can guess where the issue is happening.
The worker is in charge of sending a heartbeat to Redis every N seconds (in fact it updates the expire value of its key in Redis), and I've seen that this timeout value depends on whether the worker is IDLE or BUSY. If this expire update (heartbeat message) doesn't reach Redis, the worker key expires and is removed from Redis.
I think the issue is happening while the failing worker is in the IDLE state (though it may happen in any state): for some reason the network is too busy and the new expire value doesn't arrive (it either arrives too late or never arrives). Checking the RQ code, it seems clear that missing a heartbeat can lead to this situation.
https://github.com/nvie/rq/blob/45addbcaf24fbfef37e07b7ff3f73a26fe09e9ca/rq/worker.py#L476
A way to mitigate this might be to increase the expire time to 2 or 3 times the expected heartbeat period, so we accept that 1 or 2 heartbeat messages might be "lost" or delayed before a worker is removed from Redis.
@selwin What do you think? I don't see any relation to the "blocked_clients" counter on redis-cli.
Meanwhile I made a quick hack to reduce the period at which the expire value is updated when a worker is in the IDLE state waiting for an upcoming job. This won't help if the same behaviour happens when the worker is in the BUSY state.
https://github.com/jlopex/rq/commit/d8d177c86a1eecf75759998f7a3688be74748c44
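Conceptually (this is an illustration of the idea, not the commit above), the proposal amounts to giving the key a TTL that spans several heartbeats instead of one:

    from redis import Redis

    connection = Redis()                # assumption: the worker's Redis connection
    worker_key = 'rq:worker:example'    # hypothetical worker key
    HEARTBEAT_INTERVAL = 60             # assumption: seconds between idle heartbeats

    # Allow a couple of heartbeats to be lost or delayed before the key expires,
    # instead of expiring after a single missed interval.
    connection.expire(worker_key, HEARTBEAT_INTERVAL * 3)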
@jlopex If Redis is not reliable, job execution and registries also won't work properly. I don't think adding complexity to our code to work around an unstable Redis server or connection is a good idea, so I'm -1 on making this change.
RQ relies on Redis for keeping worker state, so the library assumes the Redis server it talks to is reliable. The real fix is to tune your server so it responds to requests reliably.
+1 on renaming the child worker name though.
@jlopex thanks for investigating!
@jlopex I'm having the exact same issue as you, and only since I moved to a more distributed architecture where the workers are now running on (multiple) boxes separate from Redis.
I have very low numbers of jobs & workers (20 workers on two boxes serving two queues, rarely more than a job every few seconds, and typically short-running jobs that take less than a second).
I'm not convinced this is an issue with heartbeats though - the keys for the workers themselves don't expire, because I can see them in Redis. But they are getting removed from the rq:workers set, just like you saw in your setup.
/ # redis-cli smembers rq:workers
1) "rq:worker:f5974440cf99.14"
2) "rq:worker:f5974440cf99.12"
3) "rq:worker:cd5a6e019bea.146"
4) "rq:worker:cd5a6e019bea.147"
5) "rq:worker:cd5a6e019bea.144"
6) "rq:worker:cd5a6e019bea.145"
7) "rq:worker:f5974440cf99.15"
8) "rq:worker:f5974440cf99.11"
9) "rq:worker:cd5a6e019bea.148"
10) "rq:worker:f5974440cf99.13"
/ # redis-cli keys rq:worker:*
1) "rq:worker:cd5a6e019bea.146"
2) "rq:worker:b0baa834b911.12"
3) "rq:worker:0e4a774113c3.13"
4) "rq:worker:f5974440cf99.11"
5) "rq:worker:f5974440cf99.13"
6) "rq:worker:cd5a6e019bea.147"
7) "rq:worker:cd5a6e019bea.148"
8) "rq:worker:0e4a774113c3.15"
9) "rq:worker:b0baa834b911.14"
10) "rq:worker:0e4a774113c3.11"
11) "rq:worker:b0baa834b911.13"
12) "rq:worker:f5974440cf99.14"
13) "rq:worker:f5974440cf99.12"
14) "rq:worker:f5974440cf99.15"
15) "rq:worker:b0baa834b911.11"
16) "rq:worker:cd5a6e019bea.144"
17) "rq:worker:0e4a774113c3.14"
18) "rq:worker:cd5a6e019bea.145"
19) "rq:worker:0e4a774113c3.12"
20) "rq:worker:b0baa834b911.15"
You can see I have 20 workers, all of whose keys are present in Redis, and all of which are working (and sending heartbeats, as I can see in their logs). But rq:workers only has some of them listed.
So, the heartbeat is getting registered, otherwise the workers' keys would have expired and they would not be visible in Redis.
So something is removing (some of) those entries from rq:workers. I can only see two occurrences of srem in worker.py:
https://github.com/nvie/rq/blob/45addbcaf24fbfef37e07b7ff3f73a26fe09e9ca/rq/worker.py#L111:
@classmethod
def find_by_key(cls, worker_key, connection=None):
    ...
    if not connection.exists(worker_key):
        connection.srem(cls.redis_workers_keys, worker_key)
        return None
And also at https://github.com/nvie/rq/blob/45addbcaf24fbfef37e07b7ff3f73a26fe09e9ca/rq/worker.py#L249:
def register_death(self):
    """Registers its own death."""
    self.log.debug('Registering death')
    with self.connection._pipeline() as p:
        # We cannot use self.state = 'dead' here, because that would
        # rollback the pipeline
        p.srem(self.redis_workers_keys, self.key)
        p.hset(self.key, 'death', utcformat(utcnow()))
        p.expire(self.key, 60)
        p.execute()
I don't think it can be the second one, because that code block also expires the worker's own key, and that's not happening to me. But I don't understand how the first one could occur either.
I will investigate this more tomorrow.
@smaccona that's an awesome find. Please let me know the result of your investigation tomorrow.
@smaccona thanks for your investigation. Hopefully this is not related to the expire value in Redis. Would it work to add a trace on the srem calls and see if that's what is happening?
It's funny that the only field that is not deleted from Redis is the "state" key. I searched in rq for where this is updated, and the only place it happens is here:
https://github.com/nvie/rq/blob/master/rq/worker.py#L277
Maybe we could check the return value of the hset call on the "state" key. The Redis documentation for this command says it should return 1 if it's a new field or 0 if an existing field was updated.
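A sketch of that check (assuming redis-py; HSET returns 1 when the field is newly created and 0 when an existing field is overwritten):

    from redis import Redis

    connection = Redis()                # assumption: the queue's Redis connection
    worker_key = 'rq:worker:example'    # hypothetical worker key

    created = connection.hset(worker_key, 'state', 'busy')
    if created == 1:
        # The field did not exist before: the worker hash must have been deleted
        # (expired), and this write just recreated it with only the 'state' field.
        print('worker hash was missing and has been recreated:', worker_key)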
I'm not sure I can help debug this issue by modifying RQ, though. This is happening on production instances, and the systems guys won't be happy if they see me patching the library :-P
@selwin @jlopex - this issue has not recurred for me in the past 2 days. Some observations and things I changed, though I have no proof that any of these were the problem:
rq:workers disappeared). I did not try to reproduce this or investigate further. Whether or not this is a bug is a separate issue from this one. If the problem recurs, I will post again to this thread. In the meantime, @jlopex you may want to do some client counts in your setup to see if you have the same issue I described above.
Edit: a simple way to get client counts by IP address - run this from your Redis server:
redis-cli client list | cut -d' ' -f2 | cut -c 6- | cut -d':' -f1 | sort | uniq -c
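The same thing from Python, if that's easier to wire into monitoring (a sketch using redis-py's client_list(), which exposes the same addr field as CLIENT LIST):

    from collections import Counter
    from redis import Redis

    connection = Redis()  # assumption: run against the same Redis server

    # Each client's 'addr' looks like 'ip:port'; group by IP to see who holds connections.
    counts = Counter(client['addr'].split(':')[0] for client in connection.client_list())
    for ip, n in counts.most_common():
        print(n, ip)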
@selwin - Sorry for the vast delay, been swamped. I'll try to set up a PR today or tomorrow.
@smaccona - Any more details on the connection bug? We had something similar happen a few weeks ago. Additionally, I am starting to suspect that we have some other issues as well. Our db slowed down dramatically on Friday while we were expanding it in AWS, and RQ tanked. Nothing showed up on the failed queue, but workers were dying off dramatically when the child processes would hang.
When you migrated to AWS and the workers were dying, could you run the "info" command on Redis to see how many open connections you have?
@selwin I wish I had thought to do that on Friday. If I get the chance, I'll see if I can recreate the situation in our dev environment. We've been tracking the situations that cause workers to die off, and the common theme seems to be jobs timing out while being processed at the same time that jobs are timing out on the queue. No solid evidence to a root cause at the moment, but we do have a team member investigating it this week. If he finds anything, I'll make sure it makes its way here.
I'm having this problem. In particular, there is a temp table build in MySQL in the process that can take anywhere from seconds to over an hour. My workers wander off into a zombie state: not dead enough to be respawned, but not working either. Eventually the system grinds to a halt, even though there are plenty of idle workers.
@jbremson - Our solution was twofold. First, we adjusted the code to minimize the time taken for db interaction. Second, we tied every worker to a timeout script that forces the worker process to die when a timeout occurs. Long-running jobs were failing to time out properly, so we had to create a secondary process to force the timeouts.
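The general shape of that secondary process is roughly the following (a simplified sketch assuming psutil, not our exact script; 300 seconds matches the job timeout mentioned earlier):

    import time
    import psutil

    JOB_TIMEOUT = 300  # seconds; matches the RQ job timeout we use

    def kill_stuck_workhorses():
        """Force-kill forked work horses that have outlived the job timeout."""
        now = time.time()
        for proc in psutil.process_iter():
            try:
                if 'rqworker' not in ' '.join(proc.cmdline()):
                    continue
                parent = proc.parent()
                is_workhorse = parent is not None and 'rqworker' in ' '.join(parent.cmdline())
                if is_workhorse and now - proc.create_time() > JOB_TIMEOUT:
                    proc.kill()  # SIGKILL the stuck work horse; the parent worker can then continue
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue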
I may have a similar problem.
In step 1, I start 32 workers listening on queue2;
in step 2, I then start a worker listening on queue1;
in step 3, I finally start another process to push jobs into queue1, so that the worker listening on queue1 will push jobs to queue2.
When the order is step 1, step 2, step 3, rqinfo shows 32 workers, then 33 workers, then 0 workers;
when the order is step 3, step 2, step 1, rqinfo shows 0 workers, then 0 workers, then 32 workers.
What is the reason for the difference between the two cases?
My environment: Python 3.4 + rq 0.13.0 + supervisor.
Do you mind creating a few scripts to demonstrate this problem?
What I’m not entirely sure about are:
Step 3 is to start a monitor process, like:
def overseer_process():
    while True:
        master_queue = conn_master_queue()                 # connect to queue1
        master_job = master_queue.enqueue(master_process)
        while not master_job.is_finished:                  # wait for master_job to finish
            time.sleep(1)
        # wait until the worker queue is empty and all workers are idle
        while not (worker_queue_is_empty() and all_workers_idle()):
            time.sleep(1)
and master process:
def master_process():
    worker_queue = conn_worker_queue()
    for _ in range(10):
        worker_queue.enqueue(worker_process)
worker_process:
def worker_process():
    # do sth
    return
rqinfo shows the results after each step. After step 2 it shows 33 workers, but 0 workers after step 3.
In the overseer, I start tasks of the next stage based on worker status, so this really bothers me.