My worker queue normally oscillates between 0 and 4000 entries.
After pulling the latest changes it is exploding: within less than 40 minutes it has grown to 54,000 queued messages. Database structure update 1346 was applied successfully. According to the log, post-update 1345 is currently running.
These queued jobs seem to be like this one:
Delivery: wall-new: 65885532: 439833 2020-05-05 21:39:46 20
Looks like around 10 new entries per second are added to the queue, all of them Delivery: wall-new. It's like a loop, or as if old stuff is being re-added to the queue. I have not seen behavior like this in a very, very long time.
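A quick back-of-the-envelope check of the growth rate, using the numbers reported above (assuming roughly linear growth from the normal ~4,000 baseline, which may of course not hold):

```python
# Net queue growth implied by the numbers above (rough estimate;
# assumes roughly linear growth from the ~4,000 baseline)
baseline = 4_000        # normal upper bound of the queue
current = 54_000        # size reached after the update
minutes = 40
rate = (current - baseline) / (minutes * 60)
print(f"{rate:.1f} new entries per second")  # 20.8 new entries per second
```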
Other stuff I see: Thousands and thousands of these warnings/notices I reported here and here.
Those started before I pulled the latest PRs, but maybe there is a correlation.
Also around 4,000 DB errors because of duplicate entries:
DB Error {"code":1062,"error":"Duplicate entry '1952795' for key 'PRIMARY'","callstack":"Database::insert, DBA::insert, PostUpdate::update1345, PostUpdate::update, CronJobs::execute, Worker::execFunction, Worker::execute, Worker::processQueue","params":"INSERT INTO `post-delivery-data` (`uri-id`, `postopts`, `inform`, `queue_count`, `queue_done`, `activitypub`, `dfrn`, `diaspora`, `ostatus`, `legacy_dfrn`, `queue_failed`) VALUES (1952795, 'twitter', '', 0, 0, 0, 0, 0, 0, 0, 0)"} - {"file":"Database.php","line":736,"function":"e","uid":"7407c3","process_id":20492}
Version: 2020.06-dev, commit f7a45e41534c392b51fd07bd7cd4d3c969d75228
Installation: git
PHP: 7.3
Database: MariaDB 10.3
The duplicated entries for post-delivery-data aren't an issue - but they are annoying. I will have a look at the other stuff.
@annando Are these worker queue entries deliveries from my server to others, or are they deliveries to my server? I'm concerned that I'm flooding other servers with this stuff.
As for the workerqueue entries: I really don't know where these come from. First I will have a look at the other messages.
The queue seems to go down slowly now. Down to 40000 at the moment. :-)
PR for the warning, notice and DB Error is done: https://github.com/friendica/friendica/pull/8587
Thank you. Post-update 1345 is done. The queue continues to go down... (29,000).
The warnings and notices are fixed. The db errors are gone, too. So, I tend to close this issue.
But my queue is still very long. This morning it was over 100000. :-/ At the moment it's at around 80000 queued jobs.
Any idea how to debug this? I really don't know what's going on.
The queue is processed. I checked the jpm values in my log file.
index [INFO]: Load: 3.86/24 - processes: 7312/38/78639 - jpm: 560/711/884 (0:0, 20:140/77909, 30:12/404, 40:4/983, 50:0/6655) - maximum: 38/64 [] - {"file":"Worker.php","line":729,"function":"tooMuchWorkers","uid":"f280e1","process_id":1954}
The values are looking pretty neat. But look at the high priority jobs: 77909!
Weird...
MariaDB [friendica]> select count(*) from `queue`;
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (0.000 sec)
MariaDB [friendica]> select count(*) from `deliverq`;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.000 sec)
MariaDB [friendica]> select count(*) from `workerqueue`;
+----------+
| count(*) |
+----------+
| 214122 |
+----------+
1 row in set (0.042 sec)
MariaDB [friendica]> select count(*) from `workerqueue-view`;
+----------+
| count(*) |
+----------+
| 124 |
+----------+
1 row in set (0.000 sec)
Please query only for workerqueue entries where "done" equals "false".
MariaDB [friendica]> select count(*) from `workerqueue` where done = 0;
+----------+
| count(*) |
+----------+
| 83721 |
+----------+
1 row in set (0.018 sec)
I made a test post 10 minutes ago. It still shows 'notifier task pending' and is not delivered to 5 other fedi nodes I checked.
I do see a slightly higher number of pending stuff on my system as well. I guess I can investigate this.
Looks like a lot of notifier tasks are pending...
MariaDB [friendica]> select count(*) from `workerqueue` where `done` = 0 and `parameter` like '%wall-new%';
+----------+
| count(*) |
+----------+
| 102468 |
+----------+
1 row in set (0.108 sec)
MariaDB [friendica]> select count(*) from `workerqueue` where `done` = 0 and `parameter` like '%notifier%';
+----------+
| count(*) |
+----------+
| 970 |
+----------+
1 row in set (0.106 sec)
You could use this slightly advanced query:
select count(*), priority, substring(parameter, 3, position(',' in parameter) - 4) as command from workerqueue where not done group by priority, command;
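For illustration, here is what that substring/position expression extracts from a queue parameter. The parameter value below is an assumed example of the JSON-array format stored in the `parameter` column:

```python
# Mimics the SQL expression:
#   substring(parameter, 3, position(',' in parameter) - 4)
# on an assumed example value of the `parameter` column.
param = '["Delivery","wall-new",65885532,439833]'

start = 3                              # SQL strings are 1-indexed; skip the leading '["'
length = (param.index(',') + 1) - 4    # comma position minus the surrounding quote/bracket chars
command = param[start - 1 : start - 1 + length]
print(command)  # Delivery
```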
And yes, my system is experiencing this problem currently as well. I guess I have to have a look at it. Possibly there is some endless loop in it.
Advanced...
MariaDB [friendica]> select count(*), priority, substring(parameter, 3, position(',' in parameter) - 4) as command from workerqueue where not done group by priority, command;
+----------+----------+-------------------+
| count(*) | priority | command |
+----------+----------+-------------------+
| 1 | 20 | |
| 67 | 20 | CreateShadowEntry |
| 83961 | 20 | Delivery |
| 10 | 20 | ForkHook |
| 1 | 20 | MergeContact |
| 1248 | 20 | Notifier |
| 13 | 20 | PubSubPublish |
| 31 | 30 | APDelivery |
| 96 | 30 | Delivery |
| 61 | 40 | APDelivery |
| 426 | 40 | Delivery |
| 173 | 40 | UpdateContact |
| 4754 | 50 | APDelivery |
| 1808 | 50 | Delivery |
+----------+----------+-------------------+
14 rows in set (0.332 sec)
On a small develop node it works. Delivery is a little bit slow, but posts are delivered. On master it's lightning fast. Only the big node is problematic.
My queue is over 200,000.
I only had a quick glance at it, and I have the feeling that this is caused by a fix I made. For a long time now we have supported the "participation" functionality from Diaspora. This means that you can subscribe to posts even if you haven't subscribed to that person. Due to a small bug this most likely hadn't worked very well. That has now been fixed, and possibly the amount of changes is now really high.
Hmm. Ok. So, I assume that one day the queue will be back to normal. Right?
Btw. my test posts have not yet been delivered - after more than three hours. I'll keep watching this...
You can calm down the system by temporarily deactivating that functionality. Go to this line https://github.com/friendica/friendica/blob/develop/src/Protocol/Diaspora.php#L263
Change it this way:
public static function participantsForThread(array $parent, array $contacts)
{
    return $contacts; // temporary: skip the participants lookup entirely

    if (!in_array($parent['private'], [Item::PUBLIC, Item::UNLISTED])) {
        return $contacts;
    }
Maybe I try that later. Thank you!
As a first test I configured max. workers to 100. 😆
Me too - queue at 220,830.
Uhm. More workers don't help. They just seem to put even more items into the queue more quickly.
This way the queue is growing, not shrinking.
You haven't applied the proposed change?
No. Not yet.
257297
@copiis You don't need to report your numbers on a constant basis. The problem is known and I'm looking at it.
Hmm, seems like my fix hadn't really fixed the issue.
@annando Okay, I thought we were in a bit of a downfall mood now that the pandemic is waning.
I applied the code change from here https://github.com/friendica/friendica/issues/8586#issuecomment-624633734 now.
I guess that now I found the origin of the problem.
Just before my queue reaches half a million entries. :-)
During some test phases I had more than a million entries, so don't complain. :-)
I'm afraid the issue isn't fixed. The queue is still growing. +100000 in the last 2 hours. And delivery of posts/comments still doesn't work.
If you use the worker daemon, the changes aren't applied automatically; you may have to stop and restart it.
Anecdata: my queue was 4000 on my single node before the change, and is now at 300 a few hours after it.
My system also has absolutely normal values. The numbers will still go up when several deliveries are waiting (have a look for the worker task "notifier"), so with many pending deliveries it is expected that the number of tasks will still rise by some amount.
No, I don't use the daemon - just the plain old worker cron job. I have now reached 0.9 million queued jobs. That makes 500000 additional entries in the last 7 hours.
After pulling PR #8594 the queue went down from 400,000 to 300,000. Then it started to go up again. My test posts from yesterday are still showing the hourglass icon (notifier task pending). They have not been delivered yet - after 7 hours.
When I noticed that, I moved the old, not-yet-done "Delivery wall-new" jobs from priority 20 to priority 30 as a way to improve the delivery of new items. But the priority 20 delivery jobs are piling up so quickly that the system can't deliver them within a normal time frame.
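The reprioritization described above could be done with an UPDATE like the one sketched below. This uses an in-memory SQLite table that mimics the relevant columns of Friendica's `workerqueue` (the real table is MariaDB, and the exact `parameter` format is an assumption based on the log entries earlier in this thread):

```python
import sqlite3

# Sketch: demote pending "Delivery wall-new" jobs from priority 20 to 30.
# In-memory SQLite stand-in for Friendica's `workerqueue` (assumption:
# the real table is MariaDB; parameter values shown are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE workerqueue (id INTEGER, parameter TEXT, priority INTEGER, done INTEGER)")
con.executemany("INSERT INTO workerqueue VALUES (?, ?, ?, ?)", [
    (1, '["Delivery","wall-new",65885532,439833]', 20, 0),  # pending -> demote
    (2, '["Notifier","wall-new",439833]',          20, 0),  # different task -> keep
    (3, '["Delivery","wall-new",65885533,439834]', 20, 1),  # already done -> keep
])
con.execute("""
    UPDATE workerqueue SET priority = 30
    WHERE done = 0 AND priority = 20 AND parameter LIKE '["Delivery","wall-new"%'
""")
print(con.execute("SELECT id, priority FROM workerqueue ORDER BY id").fetchall())
# [(1, 30), (2, 20), (3, 20)]
```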
But the queue is processed as I can see in the log file:
2020-05-07T06:41:04Z index [INFO]: Load: 6.08/24 - processes: 7324/39/883316 - jpm: 1,380/1,384/1,122 (0:0, 20:104/644355, 30:60/239253, 40:8/633, 50:5/6399) - maximum: 38/90 [] - {"file":"Worker.php","line":729,"function":"tooMuchWorkers","uid":"2dc304","process_id":18869}
have a look for the worker task "notifier"
Ok. How do I find that task? When I query the worker queue I get 1147 results with "Notifier".
You have got 1,147 "Notifier" entries. This means that there are 1,147 pending posts. If every post had 100 receivers, this would result in 114,700 "Delivery" and "APDelivery" jobs.
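The fan-out arithmetic from the numbers above (the 100 receivers per post is only an assumed average, not a measured value):

```python
# Fan-out estimate: each pending "Notifier" task represents one post,
# and each post generates one delivery job per receiver.
pending_notifiers = 1147   # from the queue query above
receivers_per_post = 100   # assumption for illustration
print(pending_notifiers * receivers_per_post)  # 114700
```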
@annando Is your comment still valid?
You can calm down the system by temporarily deactivating that functionality. Go to this line https://github.com/friendica/friendica/blob/develop/src/Protocol/Diaspora.php#L263
Change it this way:
public static function participantsForThread(array $parent, array $contacts)
{
    return $contacts; // temporary: skip the participants lookup entirely

    if (!in_array($parent['private'], [Item::PUBLIC, Item::UNLISTED])) {
        return $contacts;
    }
Also, is it...
public static function participantsForThread(array $parent, ...
or...
public static function participantsForThread(array $item, ...
?
It seems to be $item in the code, but you suggested $parent in your comment.
Thank you.
Please stick to the latest code changes now and don't apply any other changes. In the evening I can tell you more data to look at.
Ok. I had to do something because - you know - there are some users on that node. Delivery was delayed by approx. 22 hours.
So I deleted all Delivery tasks from the worker queue - all of them (1.2 million). The queue started to grow again because I left the Notifiers in the queue.
Then I waited some time until the notifiers were all processed.
Now it looks like this and delivery seems to work again:
MariaDB [friendica]> select count(*), priority, substring(parameter, 3, position(',' in parameter) - 4) as command from workerqueue where not done group by priority, command;
+----------+----------+----------------------------------+
| count(*) | priority | command |
+----------+----------+----------------------------------+
| 5 | 20 | APDelivery |
| 1 | 20 | CreateShadowEntry |
| 1511 | 20 | Delivery |
| 1 | 20 | Notifier |
| 9 | 30 | addon\/twitter\/twitter_sync.php |
| 52 | 30 | APDelivery |
| 2 | 30 | DBClean |
| 32 | 30 | Delivery |
| 1 | 30 | ForkHook |
| 328 | 30 | OnePoll |
| 2 | 40 | ForkHook |
| 1 | 40 | PubSubPublish |
| 440 | 40 | UpdateContact |
| 3 | 40 | UpdateServerDirectory |
+----------+----------+----------------------------------+
14 rows in set (0.009 sec)
How did you delete the delivery tasks?
I would not recommend what I did. It was kind of a last resort to re-establish working delivery for me. You might lose deliveries, and other side effects are possible.
That was the disclaimer. Here is the SQL for dropping all Delivery tasks from the queue:
delete from workerqueue where not done and parameter like '["Delivery"%';
Deleting the deliveries is really bad :-) Moving them (and the notifiers) to priority 50 would have been the better idea. Of course this means that several posts won't have been delivered.
I tried to move them to prio 40, but I had the impression that they were re-queued with prio 20. After I moved them to prio 40 and 30, those deliveries went slowly down while the prio 20 deliveries went up again. There was no progress: the total number stayed the same (over 1 million) and the delivery of items was still delayed by approx. 22 hours - or didn't work at all.
The complete deletion was the only way to get a working node again.
So you also park your car by crashing it into a wall? 😁
When I've been trying to park a car for 48 hours, I will - at some point - take into consideration that the wall is a solution. :-)
Okay, I crashed mine too...