Bull: Redis network bandwidth consumption

Created on 19 Feb 2019  路  59Comments  路  Source: OptimalBits/bull

Description

I have a cluster with 30 workers consuming jobs from redis using bull (around 0.5 tasks per second, and each task lasts for 1 minute). Each worker restarts every hour and is on one of 3 different nodes (which might be different from the one redis is hosted).

I want to debug or understand why the network consumption is around 25mb/s. Is this normal behavior from bull and how do I debug this problem?

Each job payload is a simple string.

Bull version

"bull": "^3.5.2"

Additional information

Each queue has around 10k of tasks waiting for completion and/or delayed. We can see the traffic is between redis and the workers although only a string is fetched

All 59 comments

It is ok to leave the issues even if you discovered that the bug was in your side, it can be useful to other users :).

@manast I thought I had solved the problem, but I guess not. Can you provide some guideline on how to debug this issue? I was thinking about booting up a wireshark micro service

ok, so lets see if I understand the workload: 30 workers, each one processing 0.5 jobs/second, which means 15 jobs/second in total, since thew payload is around 10k, the total amount of network bandwidth consumption should lie around 150Kb + some small queue overhead, right?

@manast yes you are about right. I have managed to run redis-cli monitor and it seems that there is a lot of talk between bull and redis regarding delayed tasks.

Is this the solution? drainDelay: 500?

can you try to upgrade to 3.7.0? there are some important improvements regarding delayed jobs, if you are using a lot of them it may help.

@manast I will. Right now it seems drainDelay: 500 solved the problem. It seems that there is some logic that is always checking redis for jobs switching context

EDIT: it hasn't. Trying "bull": "^3.7.0"

@manast I have tried "bull": "^3.7.0 and my redis is still getting flooded with messages like this

bandwidth_consumption

From what I understand this seems like routine everyone of my 30 workers is running checking the delayed queue, hence de 20mb/s network traffic

Or perhaps it can be the guardInterval watchdog which is set to 5 seconds, and it takes more than that to go through the delayed queue, i.e., the watchdog starts running multiples times exponentially till it hits a processing bottleneck. What do you think?

@abriosi The delay set is indeed updated at least every 5 seconds as the default setting in the guard interval. However, the amount of jobs that is done is limited to max 1000 jobs per call, which in my tests runs in a few milliseconds. Also, I cannot connect the dots between the update delay set being slow vs the amount of traffic you get between redis and your workers, also consider that a script like updateDelaySet works entirely in redis, it does not generate any traffic between servers (unless you have some replication enabled, but still it is minimal).

If you can provide with as much info as possible about this case, I would try to help you as much as I can.

another piece of info: the delay set can be updated sooner than the guard interval in the case that there really are delayed jobs that needs to be executed sooner than the guard interval. Still, since we move 1000 jos at every call, it should handle even quite large loads without many problems.

@manast Is there an easy way to check the traffic going between redis and the workers?

In some parts of my code I'm doing the following

var Queue = require('bull');

var videoQueue = new Queue('video transcoding', 'redis://127.0.0.1:6379');
var imageQueue = new Queue('image transcoding');

videoQueue.add({video: 'http://example.com/video1.mov'});

videoQueue.process(function(job, done){
  imageQueue.add({image: 'http://example.com/image1.tiff'});
  done();
});

imageQueue.process(function(job, done){
  done
});

Is this acceptable?

@abriosi not really, since imageQueue.add returns a promise you should return that, otherwise you will be doing much more jobs in parallel than what you would like, it will also mark jobs as completed although they may not already been completed. Rewrite as this:

videoQueue.process(function(job){
  return imageQueue.add({image: 'http://example.com/image1.tiff'});
});

I'm seeing this problem too. Very similar to @abriosi: high traffic between workers and redis and high CPU consumption. Using "redis-cli monitor" I also see a bunch of the same logs like in his screenshot.

The strange bit for us is that it can be OK for hours and then all of a sudden it starts doing this, going from <1MB/s to 40-50MB/s.

I've only recently upgraded from v3.4.3 (never had a problem in a year of usage) to v3.6.0 and most recently v.3.7.0 and we're experiencing this issue on both. I've just downgraded to see if it's related to the version change. -- Edit: it was definitely not the version

It's probably worth noting as well that we're using a fairly dated version of redis (3.2.9) and we use named jobs, delayed jobs and call getJobFromId() quite heavily.

@antonio0 it is quite strange because to get up to 50Mb/s just because of delay checks and such does not add up. I will love to get more info regarding the context, for instance how many workers do you use, what concurrent factor per worker and how many jobs per second during the highest peaks?

@antonio0 and every restart the consumption goes down, to then start rising again?
image

@abriosi how many delayed jobs do you usually have in average?

@manast the order of magnitude is 1k-10k of delayed jobs, worker count is 30.

@abriosi this is what it looks like for me, I restarted the workers at 21.45 and that's when it went back to near 0 (jobs are being running fine).

screenshot at feb 20 22-51-02

@manast I've also noticed this with the memory of redis (it eventually got restarted by kubernetes, it didn't go down by itself)

screenshot at feb 20 22-57-35

and while CPU is high this occurs a lot
Error: job stalled more than allowable limit at getJobFromId.then.job (/usr/src/app/node_modules/bull/lib/queue.js:917:13) at <anonymous> at process._tickDomainCallback (internal/process/next_tick.js:229:7)

@manast The above were running v3.7.0. I'm using 20-30 workers, running 8 Queues, most of them have concurrency set at 5-10. It's a mix of short and long running jobs. There was a cronjob that would schedule jobs in bulk (about 30k short running jobs that would end up scheduling stuff on other queues) and that was a cause of peak load but I've disabled that and I'm still experiencing the above

@antonio0 The long running jobs, are they run in a external process? in other case it is important to not block the event loop because that will lead to stalled jobs which is something you will avoid for the most part (stalled jobs should only happen if a worker has crashed for example).

@antonio0 regarding redis memory usage, that depends exclusively of the amount of jobs and data you are storing, so I do not think it is so much bull related, unless there was a bug where bull was writing data indiscriminately to redis, which I do no think is the case.

@manast I'm thinking the memory shooting up to 10GB thing was probably because I'd left redis-cli monitor open and the connection got lost while that was still running.. It only occurred twice and in both cases I was playing with it

When using redis-cli monitor and the traffic is high I notice the same commands being run over and over again on my queues that has delayed jobs (only ~3k jobs). This is a sample:

1550757663.896417 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896426 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896456 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896510 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896519 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896528 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896548 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896579 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896587 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896595 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896613 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896644 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896652 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896660 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896678 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896709 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896717 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896738 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896758 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896825 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896835 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896844 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896864 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896895 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896904 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896912 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.896930 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.896960 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.896969 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.896977 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897003 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897034 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897043 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897051 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897070 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897100 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897109 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897117 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897135 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897165 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897174 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897182 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897200 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897230 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897239 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897247 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897265 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897324 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897341 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897358 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897384 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897423 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897442 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897459 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897538 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633324" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897603 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266095104" "LIMIT" "0" "1000"
1550757663.897613 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897622 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897647 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897714 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.897740 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897757 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897812 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897879 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.897888 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.897897 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.897953 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.897986 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898003 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898012 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898031 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898063 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898071 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898080 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898096 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898127 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898136 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898144 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898161 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898192 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898200 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898209 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898226 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898256 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898265 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898273 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898290 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898321 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898329 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898337 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898354 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898385 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898393 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898402 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898418 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898449 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898457 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898465 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898482 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898517 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898526 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898534 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898591 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898623 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898632 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898641 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898659 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898690 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898698 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898707 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.898907 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.898954 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.898964 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.898973 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899003 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.899034 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.899043 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899051 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899070 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.899101 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.899109 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899118 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899135 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633325" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.899166 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266099200" "LIMIT" "0" "1000"
1550757663.899174 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899183 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899200 [4 127.0.0.1:59876] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757633326" "0cd3e42d-1067-4f24-b181-1d6fe132e756"
1550757663.899232 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903266103296" "LIMIT" "0" "1000"
1550757663.899240 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899249 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899282 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632630" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899315 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263252480" "LIMIT" "0" "1000"
1550757663.899324 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899332 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899352 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632630" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899383 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263252480" "LIMIT" "0" "1000"
1550757663.899391 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899400 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899418 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632630" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899448 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263252480" "LIMIT" "0" "1000"
1550757663.899457 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899465 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899483 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632630" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899529 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263252480" "LIMIT" "0" "1000"
1550757663.899538 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899547 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899569 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632630" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899600 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263252480" "LIMIT" "0" "1000"
1550757663.899608 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899617 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899635 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899666 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"
1550757663.899674 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899682 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899700 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899759 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"
1550757663.899795 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899812 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899850 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.899911 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"
1550757663.899929 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.899943 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.899981 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.900058 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"
1550757663.900076 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.900094 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.900129 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.900191 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"
1550757663.900207 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.900224 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.900258 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.900315 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"
1550757663.900334 [4 lua] "ZRANGE" "bull:MyQueue:delayed" "0" "0" "WITHSCORES"
1550757663.900349 [4 lua] "PUBLISH" "bull:MyQueue:delayed" "1550757635983.6501"
1550757663.900370 [4 127.0.0.1:59262] "evalsha" "8b912cdc5b4c20108ef73d952464fba3a7470d7b" "6" "bull:MyQueue:delayed" "bull:MyQueue:active" "bull:MyQueue:wait" "bull:MyQueue:priority" "bull:MyQueue:paused" "bull:MyQueue:meta-paused" "bull:MyQueue:" "1550757632631" "5165384e-fa4c-4e3c-8c8c-3bab22b45523"
1550757663.900409 [4 lua] "ZRANGEBYSCORE" "bull:MyQueue:delayed" "0" "6351903263256576" "LIMIT" "0" "1000"

When bandwidth is low (< 1MB/s) I still have about the same amount of jobs in the delayed queues but in the monitor I see also other normal operations and only casually ZRANGEBYSCORE stuff

@antonio0 I can confirm the exact same behaviour. I'd guess something is triggering delayed job checks on redis without the previous check routine finishes. I have a very incomplete knowledge about how the framework works, so that guess might be inappropriate

Another anomaly: right when there were a lot of "ZRANGEBYSCORE" "bull:MyQueue:delayed" operations I went in that queue through Arena and it's erroring half of the times with this error "https://github.com/bee-queue/arena/issues/148" .. the folks there are saying it's a job without ID ? Maybe it's related to this issue?

@manast do you know how I could find that job using the redis-cli?

@antonio0 you should be able to find the job by appending your queue prefix (bull:queueName) and the jobId itself, which is an integer unless you specified your own jobId.

ok, so it seems as if the script updateDelaySet is been called too often, I will make some experiments with high load of delayed jobs to see if I can trigger it. Currently the logic that determines when to run the script is quite crappy (we have much better on Bull 4.0 but that is still quite far away).

ok, so it seems as if the script updateDelaySet is been called too often, I will make some experiments with high load of delayed jobs to see if I can trigger it. Currently the logic that determines when to run the script is quite crappy (we have much better on Bull 4.0 but that is still quite far away).

Thanks a lot for your work on Bull!

One question: does the logic that determines when to run updateDelaySet depend on the contents of the Queue? If so, is it possible that a job with corrupted data mess up this logic?

@antonio0 basically the logic works like this: every time a job that is delayed is inserted in the queue a pubsub message is sent to all workers so that they can update their delay timers if necessary (i.e., if they need to update the delay set sooner due to a new delayed job that has a lower delay time than the existing ones). However, since pubsub is not 100% reliable (thats why in bull 4 we use redis 5 streams), we need to have a guardian that updates the set every x seconds. if you have many workers you can change this guardian to much larger time because the risk of loosing a pubsub decreases exponentially with the number of workers (if they are in different machines).

@manast this the flag you are talking about? guardInterval https://github.com/OptimalBits/bull/blame/develop/REFERENCE.md#L95

How large should this value be setup too. I tried 100 seconds and the problem still persists

@abriosi when you say problem, you mean that the updateDelaySet is called too often?
I wonder, are new jobs constantly arriving with very small delay values?

@abriosi also, are you using the rate limiter function?

@manast

@abriosi when you say problem, you mean that the updateDelaySet is called too often?

yes

I wonder, are new jobs constantly arriving with very small delay values?

There are new jobs constantly arriving with delay values of 30 minutes

@abriosi also, are you using the rate limiter function?

No. Do I need to? I have only 30 active jobs at any given time

image

no you do not need to use the rate limiter, but the rate limiter uses the delay functionality so thats why I asked.

@abriosi Btw. I will try to reproduce this issue using a similar workload as the one you showed in that screenshot.

@manast Thanks.

Thanks a lot for your work on Bull.

I have set guardInterval to 1000 seconds. I'm not absolutely sure this is related but I will report the results

since you have 30 workers, a delay of 1000 will imply a call to updateDelaySet at at least once every 33 seconds, more than enough.

@manast @antonio0
Setting guardInterval to 1000 seconds solved the problem.

image

Shall we add this to the documentation? I guess that people who were complaining about the rising CPU usage were also experiencing this

1224

Thanks @abriosi for the head up, I'll definitely try it now. -- Edit: not working for me
screenshot at feb 22 02-18-43

@antonio0 basically the logic works like this: every time a job that is delayed is inserted in the queue a pubsub message is sent to all workers so that they can update their delay timers if necessary (i.e., if they need to update the delay set sooner due to a new delayed job that has a lower delay time than the existing ones). However, since pubsub is not 100% reliable (thats why in bull 4 we use redis 5 streams), we need to have a guardian that updates the set every x seconds. if you have many workers you can change this guardian to much larger time because the risk of loosing a pubsub decreases exponentially with the number of workers (if they are in different machines).

Thank you @manast for explaining this to me. I'm still suspicious about the cause being some job causing issues because if I understand correctly for any new job it checks the timestamp / delay time
and triggers updateDelaySet if it's lower than the ones it has seen already. So is there a chance that if timestamp/delay aren't there (or have value 0 or null) it will always trigger updateDelaySet as 0 is lower than everything else?

I've found this weird job in my failed queue without timestamp / delay and a lot of other missing fields. Is this even an acceptable format? Could it be related?

127.0.0.1:6379[4]> HGETALL "bull:MyQueue:check_provider_ready_bc300960_user234157"
 1) "stacktrace"
 2) "[\"Error: Missing process handler for job type __default__\\n    at Queue.processJob (/usr/src/app/node_modules/bull/lib/queue.js:1054:7)\\n    at <anonymous>\\n    at process._tickDomainCallback (internal/process/next_tick.js:229:7)\"]"
 3) "attemptsMade"
 4) "1"
 5) "failedReason"
 6) "Missing process handler for job type __default__"
 7) "finishedOn"
 8) "1550794941227"
 9) "processedOn"
10) "1550794940537"

(when I schedule this type of jobs on this queue they always have a delay of 1 second, have name, title and other metadata... so I'm not sure how this happened in the first place)

@antonio0 that is a job that failed directly when it arrived to one of the workers and the worker did not have registered a processor "compatible" with the job. I.e.e, the name of the job did not match any of the processors defined in a given worker.
Having said that, I do not think that can trigger all the calls to updateDelaySet.

@antonio0 how many concurrent workers are you running and how many milliseconds have you chosen for guardInterval?

@antonio0 how many concurrent workers are you running and how many milliseconds have you chosen for guardInterval?

20 workers, guardInterval was set at 1000 seconds (so 1000 * 1000 in milliseconds value). I don't see any difference unfortunately.

I do keep seeing that whenever CPU utilisation is high and redis is being flooded with ZRANGEBYSCORE requests on "Queue X", if I go in that specific queue in Arena, see "active" jobs, the page gives an error because it can't parse one of the jobs (but doesn't tell me which one). Doesn't occur when the CPU utilisation is low and all other jobs in other queues can be rendered. It's either a strange coincidence (that I can replicate consistently though) or related to this

Hi @manast I was able to look into the error I was getting with Arena UI and it's due to one of the jobs being null.
By looking at their code, it seems that Arena just calls Bull's getActive() and one of the items in the response was null. Is this behaviour ok, and they should handle that, or should I file a separate issue?

EDIT: in the docs the return type of getActive is Promise<Array<Job>> so it should probably be handled within bull.

@antonio0 If getActive returns an array where one of the items is null then I would say that is a bug in Bull but would you be able to provide code that reproduces this, otherwise I think it is going to be really tricky to solve.

@manast not quite sure how I'd write a test that replicates this, but I'm quite sure it's returning null because between getRanges and getJobFromId the job has completed and therefore has been removed (I'm using removeOnCompleted: true)

  Queue.prototype.getJobs = function(types, start, end, asc) {
    return this.getRanges(types, start, end, asc).then(jobIds => {
      return Promise.all(jobIds.map(this.getJobFromId));
    });
  };

Having the same issue

We have been experiencing similar issues. Looking at the code, could the following be happening:

  • A lot of jobs comes in all with a delay set
  • The workers cannot quite keep up processing jobs
  • Some jobs that were delayed is now "overdue"

Whenever updating the delay set it gets back a timestamp in the past (or right now), which causes it to have an instant setTimeout, which then keeps spinning in circles?

@holm it can keep spinning in circles for the rate limiter, but otherwise it should be faster in version 3.7.0 since we can move up to 1000 delayed jobs at once instead of one by one.

When this was an issue for us last week, we hit 17,300 requests per minute of EVALSHA to Redis. This was running just 9 workers. During that time we were processing around 600 jobs per minutes, but the queue was large and had on average around 2,500 jobs waiting to be processed. Many of these had a delay configured on them. This doesn't feel right and I am pretty sure there was something that was spinning a bit out of control.

so it was just delayed jobs, not rate limited jobs?

Yes only delayed jobs with a delay of 30 seconds.

that is pretty strange because after the jobs are moved back to the wait list they will not enter the delay set again, so I do not understand how this can build up a problem :/

@holm have you tried this suggestion?

https://github.com/OptimalBits/bull/pull/1224

I have increased that value now, but it doesn't come close to explaining what we saw. With 9 workers and the default fo 5000 ms, you should not be seeing more than two checks per second. And we saw closer to 300 per second.

I have been working on a small repo, and will continue to try to narrow it down further, but wanted to share results so far.

What I am doing:
1: A single worker processing jobs with concurrency of 1. For every job it just sleeps for 1 second.
2: Have a publisher that publishes new jobs every 5 ms with a delay of 30 seconds

What I am seeing:

  • Most of the time things work just fine. There are no excess script calls and jobs are being processed.
  • However all of a sudden it changes, and there are endless calls to updateDelaySet. I hacked a few log lines in bull and put it here: https://gist.github.com/holm/fe8087c62a1dd015dffb3f883a0d84f1. It does log out lines when a job starts or ends, but there are thousands of updateDelaySet in between each job (once a second).
  • It seems as if it starts when the queue gets to around 5000 jobs, but that might be more related to the time it takes to fill up 5000 jobs and the delay I have in the test.

I did some more digging into this, and there is at least a race-condition on timer management that I think could explain why this is only an issue at certain times.

The PR is here: https://github.com/OptimalBits/bull/pull/1294

@holm maybe the race condition can be resolved without too much refactoring of the current code, so that we can have something that works, and refactor later on.

I tried that initially, but I found the code very hard to reason about. I think the changes I did are the minimal possible to have a solid solution. I appreciate you are rewriting this, but I expect that it will be some time before that is released and fully production ready.

3.8.0 should have resolved this issue please comment here if that is not the case.

Was this page helpful?
0 / 5 - 0 ratings