My worker queue normally oscillates between 0 and 4000 entries.
After pulling the latest changes it is exploding: within less than 40 minutes it has grown to 54,000 queued messages. Database structure update 1346 was applied successfully. According to the log, post-update 1345 is currently running.
These queued jobs seem to be like this one:
Delivery: wall-new: 65885532: 439833 2020-05-05 21:39:46 20
Looks like around 10 new entries per second are added to the queue, all of them Delivery: wall-new. It's like a loop, or as if old stuff is being re-added to the queue. I have not seen behavior like this in a very, very long time.
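A quick back-of-the-envelope check of the growth rate, using the numbers reported above (assuming roughly linear growth from the normal ~4,000 baseline, which may of course not hold):

```python
# Net queue growth implied by the numbers above (rough estimate;
# assumes roughly linear growth from the ~4,000 baseline)
baseline = 4_000        # normal upper bound of the queue
current = 54_000        # size reached after the update
minutes = 40
rate = (current - baseline) / (minutes * 60)
print(f"{rate:.1f} new entries per second")  # 20.8 new entries per second
```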
Other stuff I see: Thousands and thousands of these warnings/notices I reported here and here.
Those started before I pulled the latest PRs, but maybe there is a correlation.
Also around 4,000 DB errors because of duplicate entries:
DB Error {"code":1062,"error":"Duplicate entry '1952795' for key 'PRIMARY'","callstack":"Database::insert, DBA::insert, PostUpdate::update1345, PostUpdate::update, CronJobs::execute, Worker::execFunction, Worker::execute, Worker::processQueue","params":"INSERT INTO `post-delivery-data` (`uri-id`, `postopts`, `inform`, `queue_count`, `queue_done`, `activitypub`, `dfrn`, `diaspora`, `ostatus`, `legacy_dfrn`, `queue_failed`) VALUES (1952795, 'twitter', '', 0, 0, 0, 0, 0, 0, 0, 0)"} - {"file":"Database.php","line":736,"function":"e","uid":"7407c3","process_id":20492}
Version: 2020.06-dev, commit f7a45e41534c392b51fd07bd7cd4d3c969d75228
Installation: git
PHP: 7.3
Database: MariaDB 10.3
The duplicated entries for post-delivery-data aren't an issue - but they are annoying. I will have a look at the other stuff.
@annando Are these worker queue entries deliveries from my server to others, or are they deliveries to my server? I'm concerned that I'm flooding other servers with this stuff.
As for the workerqueue entries: I really don't know where these come from. First I will have a look at the other messages.
The queue seems to go down slowly now. Down to 40000 at the moment. :-)
PR for the warning, notice and DB Error is done: https://github.com/friendica/friendica/pull/8587
Thank you. Post-update 1345 is done. The queue continues to go down... (29,000).
The warnings and notices are fixed. The db errors are gone, too. So, I tend to close this issue.
But my queue is still very long. This morning it was over 100000. :-/ At the moment it's at around 80000 queued jobs.
Any idea how to debug this? I really don't know what's going on.
The queue is processed. I checked the jpm values in my log file.
index [INFO]: Load: 3.86/24 - processes: 7312/38/78639 - jpm: 560/711/884 (0:0, 20:140/77909, 30:12/404, 40:4/983, 50:0/6655) - maximum: 38/64 [] - {"file":"Worker.php","line":729,"function":"tooMuchWorkers","uid":"f280e1","process_id":1954}
The values are looking pretty neat. But look at the high priority jobs: 77909!
Weird...
MariaDB [friendica]> select count(*) from `queue`;
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (0.000 sec)
MariaDB [friendica]> select count(*) from `deliverq`;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.000 sec)
MariaDB [friendica]> select count(*) from `workerqueue`;
+----------+
| count(*) |
+----------+
| 214122 |
+----------+
1 row in set (0.042 sec)
MariaDB [friendica]> select count(*) from `workerqueue-view`;
+----------+
| count(*) |
+----------+
| 124 |
+----------+
1 row in set (0.000 sec)
Please query only for workerqueue entries where "done" equals "false".
MariaDB [friendica]> select count(*) from `workerqueue` where done = 0;
+----------+
| count(*) |
+----------+
| 83721 |
+----------+
1 row in set (0.018 sec)
I made a test post 10 minutes ago. It still shows 'notifier task pending' and is not delivered to 5 other fedi nodes I checked.
I do see a slightly higher number of pending stuff on my system as well. I guess I can investigate this.
Looks like a lot of notifier tasks are pending...
MariaDB [friendica]> select count(*) from `workerqueue` where `done` = 0 and `parameter` like '%wall-new%';
+----------+
| count(*) |
+----------+
| 102468 |
+----------+
1 row in set (0.108 sec)
MariaDB [friendica]> select count(*) from `workerqueue` where `done` = 0 and `parameter` like '%notifier%';
+----------+
| count(*) |
+----------+
| 970 |
+----------+
1 row in set (0.106 sec)
You could use this slightly advanced query:
select count(*), priority, substring(parameter, 3, position(',' in parameter) - 4) as command from workerqueue where not done group by priority, command;
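For illustration, here is what that substring/position expression extracts from a queue parameter. The parameter value below is an assumed example of the JSON-array format stored in the `parameter` column:

```python
# Mimics the SQL expression:
#   substring(parameter, 3, position(',' in parameter) - 4)
# on an assumed example value of the `parameter` column.
param = '["Delivery","wall-new",65885532,439833]'

start = 3                              # SQL strings are 1-indexed; skip the leading '["'
length = (param.index(',') + 1) - 4    # comma position minus the surrounding quote/bracket chars
command = param[start - 1 : start - 1 + length]
print(command)  # Delivery
```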
And yes, my system is experiencing this problem currently as well. I guess I have to have a look at it. Possibly there is some endless loop in it.
Advanced...
MariaDB [friendica]> select count(*), priority, substring(parameter, 3, position(',' in parameter) - 4) as command from workerqueue where not done group by priority, command;
+----------+----------+-------------------+
| count(*) | priority | command |
+----------+----------+-------------------+
| 1 | 20 | |
| 67 | 20 | CreateShadowEntry |
| 83961 | 20 | Delivery |
| 10 | 20 | ForkHook |
| 1 | 20 | MergeContact |
| 1248 | 20 | Notifier |
| 13 | 20 | PubSubPublish |
| 31 | 30 | APDelivery |
| 96 | 30 | Delivery |
| 61 | 40 | APDelivery |
| 426 | 40 | Delivery |
| 173 | 40 | UpdateContact |
| 4754 | 50 | APDelivery |
| 1808 | 50 | Delivery |
+----------+----------+-------------------+
14 rows in set (0.332 sec)
On a small develop node it works. Delivery is a little bit slow, but posts are delivered. On master it's lightning fast. Only the big node is problematic.
My queue is over 200,000.
I only had a quick glance at it, and I have the feeling that this is caused by a fix I made. For a long time now we have supported the "participation" functionality from Diaspora. This means that you can subscribe to posts even if you haven't subscribed to that person. Due to a small bug this most likely hadn't worked very well. That has now been fixed, and possibly the amount of changes is now really high.
Hmm. Ok. So, I assume that one day the queue will be back to normal. Right?
Btw. my test posts have not yet been delivered - after more than three hours. I'll keep watching this...
You can calm down the system by temporarily deactivating that functionality. Go to this line https://github.com/friendica/friendica/blob/develop/src/Protocol/Diaspora.php#L263
Change it this way:
public static function participantsForThread(array $parent, array $contacts)
{
    return $contacts; // temporary: skip the participants lookup entirely

    if (!in_array($parent['private'], [Item::PUBLIC, Item::UNLISTED])) {
        return $contacts;
    }
Maybe I try that later. Thank you!
As a first test I configured max. workers to 100. 😆
Me too - queue at 220,830.
Uhm. More workers don't help. They just seem to put even more items into the queue more quickly.
This way the queue is growing, not shrinking.
You haven't applied the proposed change?
No. Not yet.
257297
@copiis You don't need to report your numbers on a constant basis. The problem is known and I'm looking at it.
Hmm, seems like my fix hadn't really fixed the issue.
@annando Okay, I thought we were in a bit of a downfall mood now that the pandemic is waning.
I applied the code change from here https://github.com/friendica/friendica/issues/8586#issuecomment-624633734 now.
I guess that now I found the origin of the problem.
Just before my queue reaches half a million entries. :-)
During some test phases I had more than a million entries, so don't complain. :-)
I'm afraid the issue isn't fixed. The queue is still growing. +100000 in the last 2 hours. And delivery of posts/comments still doesn't work.
If you use the worker daemon, the changes aren't applied automatically; you may have to stop and restart it.
Anecdata: my queue was 4000 on my single node before the change, and is now at 300 a few hours after it.
My system also has absolutely normal values. The numbers will still go up when several deliveries are waiting (have a look for the worker task "notifier"), so with many pending deliveries it is expected that the number of tasks will still rise by some amount.
No, I don't use the daemon - just the plain old worker cron job. I have now reached 0.9 million queued jobs. That makes 500000 additional entries in the last 7 hours.
After pulling PR #8594 the queue went down from 400,000 to 300,000. Then it started to go up again. My test posts from yesterday are still showing the hourglass icon (notifier task pending). They have not been delivered yet - after 7 hours.
When I noticed that, I moved the old, not-yet-done "Delivery wall-new" jobs from priority 20 to priority 30 as a way to improve the delivery of new items. But the priority 20 delivery jobs are piling up so quickly that the system can't deliver them within a normal time frame.
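The reprioritization described above could be done with an UPDATE like the one sketched below. This uses an in-memory SQLite table that mimics the relevant columns of Friendica's `workerqueue` (the real table is MariaDB, and the exact `parameter` format is an assumption based on the log entries earlier in this thread):

```python
import sqlite3

# Sketch: demote pending "Delivery wall-new" jobs from priority 20 to 30.
# In-memory SQLite stand-in for Friendica's `workerqueue` (assumption:
# the real table is MariaDB; parameter values shown are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE workerqueue (id INTEGER, parameter TEXT, priority INTEGER, done INTEGER)")
con.executemany("INSERT INTO workerqueue VALUES (?, ?, ?, ?)", [
    (1, '["Delivery","wall-new",65885532,439833]', 20, 0),  # pending -> demote
    (2, '["Notifier","wall-new",439833]',          20, 0),  # different task -> keep
    (3, '["Delivery","wall-new",65885533,439834]', 20, 1),  # already done -> keep
])
con.execute("""
    UPDATE workerqueue SET priority = 30
    WHERE done = 0 AND priority = 20 AND parameter LIKE '["Delivery","wall-new"%'
""")
print(con.execute("SELECT id, priority FROM workerqueue ORDER BY id").fetchall())
# [(1, 30), (2, 20), (3, 20)]
```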
But the queue is processed as I can see in the log file:
2020-05-07T06:41:04Z index [INFO]: Load: 6.08/24 - processes: 7324/39/883316 - jpm: 1,380/1,384/1,122 (0:0, 20:104/644355, 30:60/239253, 40:8/633, 50:5/6399) - maximum: 38/90 [] - {"file":"Worker.php","line":729,"function":"tooMuchWorkers","uid":"2dc304","process_id":18869}
have a look for the worker task "notifier"
Ok. How do I find that task? When I query the worker queue I get 1147 results with "Notifier".
You have got 1,147 "Notifier" entries. This means that there are 1,147 pending posts. If every post had 100 receivers, this would result in 114,700 "Delivery" and "APDelivery" jobs.
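The fan-out arithmetic from the numbers above (the 100 receivers per post is only an assumed average, not a measured value):

```python
# Fan-out estimate: each pending "Notifier" task represents one post,
# and each post generates one delivery job per receiver.
pending_notifiers = 1147   # from the queue query above
receivers_per_post = 100   # assumption for illustration
print(pending_notifiers * receivers_per_post)  # 114700
```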
@annando Is your comment still valid?
You can calm down the system by temporarily deactivating that functionality. Go to this line https://github.com/friendica/friendica/blob/develop/src/Protocol/Diaspora.php#L263
Change it this way:
public static function participantsForThread(array $parent, array $contacts)
{
    return $contacts; // temporary: skip the participants lookup entirely

    if (!in_array($parent['private'], [Item::PUBLIC, Item::UNLISTED])) {
        return $contacts;
    }
Also, is it...
public static function participantsForThread(array $parent, ...
or...
public static function participantsForThread(array $item, ...
?
It seems to be $item in the code, but you suggested $parent in your comment.
Thank you.
Please stick to the latest code changes now and don't apply any other changes. In the evening I can tell you more data to look at.
Ok. I had to do something because - you know - there are some users on that node. Delivery was delayed by approx. 22 hours.
So I deleted all Delivery tasks from the worker queue - all of them (1.2 million). The queue started to grow again because I left the Notifiers in the queue.
Then I waited some time until the notifiers were all processed.
Now it looks like this and delivery seems to work again:
MariaDB [friendica]> select count(*), priority, substring(parameter, 3, position(',' in parameter) - 4) as command from workerqueue where not done group by priority, command;
+----------+----------+----------------------------------+
| count(*) | priority | command |
+----------+----------+----------------------------------+
| 5 | 20 | APDelivery |
| 1 | 20 | CreateShadowEntry |
| 1511 | 20 | Delivery |
| 1 | 20 | Notifier |
| 9 | 30 | addon\/twitter\/twitter_sync.php |
| 52 | 30 | APDelivery |
| 2 | 30 | DBClean |
| 32 | 30 | Delivery |
| 1 | 30 | ForkHook |
| 328 | 30 | OnePoll |
| 2 | 40 | ForkHook |
| 1 | 40 | PubSubPublish |
| 440 | 40 | UpdateContact |
| 3 | 40 | UpdateServerDirectory |
+----------+----------+----------------------------------+
14 rows in set (0.009 sec)
How did you delete the delivery tasks?
I would not recommend what I did. It was kind of a last resort to re-establish working delivery for me. You might lose deliveries, and other side effects are possible.
That was the disclaimer. Here is the SQL for dropping all Delivery tasks from the queue:
delete from workerqueue where not done and parameter like '["Delivery"%';
Deleting the deliveries is really bad :-) Moving them (and the notifiers) to priority 50 would have been the better idea. Of course this means that several posts won't have been delivered.
I tried to move them to prio 40, but I had the impression that they were re-queued with prio 20. After I moved them to prio 40 and 30, those deliveries went slowly down while the prio 20 deliveries went up again. There was no progress: the total number stayed the same (over 1 million) and the delivery of items was still delayed by approx. 22 hours - or didn't work at all.
The complete deletion was the only way to get a working node again.
So you also park your car by crashing it into a wall? 😁
When I've been trying to park a car for 48 hours, I will - at some point - take into consideration that the wall is a solution. :-)
Okay, I crashed mine too...