Bull: updateDelaySet may cause very high redis cpu usage

Created on 28 Oct 2020  路  18Comments  路  Source: OptimalBits/bull

We have a very large queue which has about several millions of delay jobs. And our queue has tens of workers(pods) in k8s cluster and even double or triple more pods when the business peak time. We have encountered an issue that bull can lead high redis cpu usage and the redis gets slow then stop prosessing jobs. It's a pain last for several months. Finally we find that the updateDelaySet get triggered so frequently on each worker and thus caused the high redis cpu usage and then consumer get so slow even stop processing waiting jobs and the waiting queue get stacked and finally the entire bull queue stop working.

I found that the updateDelaySet was expensive when the queue was huge so we should stop doing updateDelaySet when the waiting queue has too many jobs to process. We can add an option saymaxWaitingJobs which default set Number.MAX. The user can set this value according their needs.

Most helpful comment

yes, I think it is normal, my fix basically disables the updateDelayTimer as long as there are jobs in the queue to be processed, but sometimes when it gets rate limited it will kick again. It is very difficult to make it perfect just using LUA scripts. I am working on a redis module that solves all this problems perfectly but it will not work on AWS elasticache since it does not support modules.

All 18 comments

Before I dig more into this issue, could you tell me more about your queue setup? for instance why do you have so many delayed jobs, and what is the average delay for these jobs? are you using rate limiting?

also, are you using priorities or just normal jobs?

My queue config: new Queue('FOOD_DIGEST_JOB', { redis: { host, port }, limiter: { max: 1000, duration: 1000 } }). We have millions of users and they can feed fish and every fish has a delay job in our setting. we use normal jobs. Each of our jobs can delay tens of seconds to days.

ok, so you have both delayed jobs and rate limited to 1000 jobs per second right?

yes exactly

and where are you running your redis instances?

we run our redis instance on aws elastic cache redis.

did you consider upgrading the instance type to a faster one? I know in aws is wasteful since larger instances also has more cores but redis will only use one.

I assume you are using last bull version 3.18.1 right?

yes we tried upgrading our instance but nothing difference. The queue can run a period of time but this issue will occur occasionally. we use 3.18.

yes. Moving the delayed jobs back to wait is not very efficient, specially in this case when your queues are always full of jobs. I think I can implement a solution for this, but it may take a couple of days before it is ready (if everything goes well).

I implemented some improvements that should reduce the CPU usage for a case like this, I wonder if you have any chance to verify it before we make a new release? https://github.com/OptimalBits/bull/pull/1897

I tried #1897 with option new Queue('FOOD_DIGEST_JOB', { redis: { host, port }, limiter: { max: 1000, duration: 1000 } }) this morning and I found the CPU usage is lower but the consumimg rate get slower at the same time which result in the delayed jobs get stacked in delay queue and not consumed in time. The rate limiter setting is 1000/s, but the consume rate was just 140+/s. The adding rate was 300+/s (There are many jobs adding and consuming at the same time). BTW I found the redis slow log is clean now. The redis slow log increase quickly in the before version.

ok, I will try to mimic those numbers in a synthetic test.

I added a small fix to the above mentioned PR, would you mind to test if it works faster?

Thx for your working. I tested again and the consuming speed was very fast. And the rate is close to the rate limiter. Jobs moves to active and get consumed quickly but the redis cpu is much lower than before.

I found the Queue.prototype.updateDelayTimer get called very frequently and is it normal? I also inspect the redis slow log found logs as below:

127.0.0.1:6371> slowlog get 10
1) Slow log id: 667021
   Start at: 1604160921
   Running time(ms): 10435
   Command: evalsha 451d4221d6a7ff2251cd0864ab0a28bdba5ea099 10 bull:FOOD_DIGEST_JOB:active bull:FOOD_DIGEST_JOB:wait bull:FOOD_DIGEST_JOB:delayed bull:FOOD_DIGEST_JOB:paused bull:FOOD_DIGEST_JOB:completed bull:FOOD_DIGEST_JOB:failed bull:FOOD_DIGEST_JOB:priority bull:FOOD_DIGEST_JOB:5f9d708c0397ad2ac118dcc4 bull:FOOD_DIGEST_JOB:5f9d708c0397ad2ac118dcc4:logs bull:FOOD_DIGEST_JOB:limiter 5f9d708c0397ad2ac118dcc4 aa11c239-d230-4300-9722-283e3851da24
   Client IP and port: 10.5.229.39:59240
   Client name:
2) Slow log id: 667020
   Start at: 1604160921
   Running time(ms): 10390
   Command: LREM bull:FOOD_DIGEST_JOB:wait 0 5f9d708c0397ad2ac118dcc4
   Client IP and port: ?:0
   Client name:
3) Slow log id: 667019
   Start at: 1604160919
   Running time(ms): 10717
   Command: evalsha 451d4221d6a7ff2251cd0864ab0a28bdba5ea099 10 bull:FOOD_DIGEST_JOB:active bull:FOOD_DIGEST_JOB:wait bull:FOOD_DIGEST_JOB:delayed bull:FOOD_DIGEST_JOB:paused bull:FOOD_DIGEST_JOB:completed bull:FOOD_DIGEST_JOB:failed bull:FOOD_DIGEST_JOB:priority bull:FOOD_DIGEST_JOB:5f9d5471569a9534485395e3 bull:FOOD_DIGEST_JOB:5f9d5471569a9534485395e3:logs bull:FOOD_DIGEST_JOB:limiter 5f9d5471569a9534485395e3 9f802678-ca99-4050-a253-78338e6dc3be
   Client IP and port: 10.5.228.117:59884
   Client name:
4) Slow log id: 667018
   Start at: 1604160919
   Running time(ms): 10664
   Command: LREM bull:FOOD_DIGEST_JOB:wait 0 5f9d5471569a9534485395e3
   Client IP and port: ?:0
   Client name:

However the slow log length was just 4 and didn't increase any more.

yes, I think it is normal, my fix basically disables the updateDelayTimer as long as there are jobs in the queue to be processed, but sometimes when it gets rate limited it will kick again. It is very difficult to make it perfect just using LUA scripts. I am working on a redis module that solves all this problems perfectly but it will not work on AWS elasticache since it does not support modules.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

davedbase picture davedbase  路  3Comments

sibelius picture sibelius  路  3Comments

thelinuxlich picture thelinuxlich  路  3Comments

JSRossiter picture JSRossiter  路  3Comments

pigaov10 picture pigaov10  路  3Comments