Cms: Deadlocks on queue progress updates

Created on 25 Nov 2020 · 7 comments · Source: craftcms/cms

Description

A couple of clients have recently experienced background queue tasks failing due to deadlocks during progress UPDATE queries.

2020-11-25 14:28:42 0x7fc52c112700
*** (1) TRANSACTION:
TRANSACTION 14652982, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1128, 4 row lock(s)
MySQL thread id 5733403, OS thread handle 140483295487744, query id 52791929 10-244-9-96.imminent-otter-production.project-imminent-otter 10.244.9.96 imminent_otter_production Searching rows for update
UPDATE `queue` SET `dateReserved`=NULL, `timeUpdated`=NULL, `progress`=0, `progressLabel`=NULL WHERE ((`channel`='queue') AND (`fail`=0)) AND (`timeUpdated` < 1606314522 - `ttr`)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 520 page no 3 n bits 96 index PRIMARY of table `imminent_otter_production`.`queue` trx id 14652982 lock_mode X locks rec but not gap waiting
Record lock, heap no 27 PHYSICAL RECORD: n_fields 18; compact format; info bits 0
 0: len 4; hex 8000a8e7; asc     ;;
 1: len 6; hex 000000df9635; asc      5;;
 2: len 7; hex 50000001653083; asc P   e0 ;;
 3: len 5; hex 7175657565; asc queue;;
 4: len 30; hex 4f3a33343a2263726166745c71756575655c6a6f62735c55706461746553; asc O:34:"craft\queue\jobs\UpdateS; (total 281 bytes);
 5: len 23; hex 5570646174696e672073656172636820696e6465786573; asc Updating search indexes;;
 6: len 4; hex dfbe6a11; asc   j ;;
 7: len 4; hex 8000012c; asc    ,;;
 8: len 4; hex 80000000; asc     ;;
 9: len 4; hex 00000800; asc     ;;
 10: len 5; hex 99a7f2e729; asc     );;
 11: len 4; hex dfbe6a1a; asc   j ;;
 12: len 2; hex 8064; asc  d;;
 13: SQL NULL;
 14: len 4; hex 80000001; asc     ;;
 15: len 1; hex 80; asc  ;;
 16: SQL NULL;
 17: SQL NULL;

*** (2) TRANSACTION:
TRANSACTION 14652981, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 1
MySQL thread id 5733402, OS thread handle 140484824606464, query id 52791924 10.244.11.96 imminent_otter_production Updating
UPDATE `queue` SET `progress`=100, `timeUpdated`=1606314522 WHERE `id`='43239'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 520 page no 3 n bits 96 index PRIMARY of table `imminent_otter_production`.`queue` trx id 14652981 lock_mode X locks rec but not gap
Record lock, heap no 27 PHYSICAL RECORD: n_fields 18; compact format; info bits 0
 0: len 4; hex 8000a8e7; asc     ;;
 1: len 6; hex 000000df9635; asc      5;;
 2: len 7; hex 50000001653083; asc P   e0 ;;
 3: len 5; hex 7175657565; asc queue;;
 4: len 30; hex 4f3a33343a2263726166745c71756575655c6a6f62735c55706461746553; asc O:34:"craft\queue\jobs\UpdateS; (total 281 bytes);
 5: len 23; hex 5570646174696e672073656172636820696e6465786573; asc Updating search indexes;;
 6: len 4; hex dfbe6a11; asc   j ;;
 7: len 4; hex 8000012c; asc    ,;;
 8: len 4; hex 80000000; asc     ;;
 9: len 4; hex 00000800; asc     ;;
 10: len 5; hex 99a7f2e729; asc     );;
 11: len 4; hex dfbe6a1a; asc   j ;;
 12: len 2; hex 8064; asc  d;;
 13: SQL NULL;
 14: len 4; hex 80000001; asc     ;;
 15: len 1; hex 80; asc  ;;
 16: SQL NULL;
 17: SQL NULL;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 520 page no 4 n bits 528 index queue_channel_fail_timeUpdated_timePushed_idx of table `imminent_otter_production`.`queue` trx id 14652981 lock_mode X locks rec but not gap waiting
Record lock, heap no 461 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
 0: len 5; hex 7175657565; asc queue;;
 1: len 1; hex 80; asc  ;;
 2: len 4; hex dfbe6a19; asc   j ;;
 3: len 4; hex dfbe6a11; asc   j ;;
 4: len 4; hex 8000a8e7; asc     ;;

*** WE ROLL BACK TRANSACTION (2)

It looks like any UPDATE that touches columns covered by the queue_channel_fail_timeUpdated_timePushed_idx index can deadlock if it runs while Queue::_moveExpired() is executing, presumably because the index is used by the WHERE clause of:

UPDATE `queue` SET `dateReserved`=NULL, `timeUpdated`=NULL, `progress`=0, `progressLabel`=NULL WHERE ((`channel`='queue') AND (`fail`=0)) AND (`timeUpdated` < 1606314522 - `ttr`)

and that same query also modifies the indexed timeUpdated column. The column is additionally modified by any incoming progress update, e.g.:

UPDATE `queue` SET `progress`=100, `timeUpdated`=1606314522 WHERE `id`='43239'

so the two transactions end up waiting on each other's locks and deadlock.

It might be worth adding some retry logic for the progress updates (which are the UPDATEs chosen as the deadlock victims), or breaking up the index in some way.
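A minimal sketch of the suggested retry logic, in Python for illustration only (Craft itself is PHP, and the names `with_deadlock_retry` and `DeadlockError` are made up for this example — the real error surfaces as SQLSTATE 40001 / MySQL error 1213):

```python
import time

class DeadlockError(Exception):
    """Stand-in for SQLSTATE[40001] 1213 'Deadlock found when trying to get lock'."""

def with_deadlock_retry(operation, max_attempts=3, delay=0.05):
    """Re-run the operation when the DB rolls it back as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff before retrying

# Simulated progress UPDATE that is picked as the deadlock victim once,
# then succeeds on the retry.
calls = {"n": 0}
def set_progress():
    calls["n"] += 1
    if calls["n"] == 1:
        raise DeadlockError("1213 Deadlock found when trying to get lock")
    return "ok"

result = with_deadlock_retry(set_progress)
```

Since the deadlock victim's transaction is rolled back cleanly, retrying the single-row progress UPDATE is safe: it is idempotent for a given progress value.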

Steps to reproduce

  1. Get Queue::_moveExpired() to execute at the same moment that queue progress updates are coming in.
  2. This is more likely to happen when more people are logged into the CMS or when the queue table has more rows.

Additional info

  • Craft version: Reported by clients on 3.4 and 3.5
  • PHP version: 7.4
  • Database driver & version: MariaDB 10.4
  • Plugins & versions:
Label: bug

All 7 comments

I've also been struggling with this on a 3.5 site that has a fairly active queue. That site is also running PHP 7.4 and MariaDB 10.4 (on Forge). We currently have 2 queue runners running queue/listen as Forge daemons, and I suspect that this increases the risk of the deadlocks occurring (initially we actually had 3 daemons, and the issue seemed to occur more often then – although I can't verify that). I've also had it happen locally a few times, though, where the queue is simply run over HTTP via runQueueAutomatically.

Here’s what I suspect is the order of operations that causes this:

  1. A queue job is in progress
  2. An ajax request comes in to one of the craft\queue\Queue methods that call _moveExpired().
  3. _moveExpired() executes its query, which creates a lock on the queue_channel_fail_timeUpdated_timePushed_idx index.
  4. While that query is executing, the active job calls craft\queue\Queue::setProgress(), which touches the same index, resulting in the deadlock.

I’ve just tweaked _moveExpired() so that it fetches the IDs of the expired jobs first, and then marks them as expired in a separate query, so that the update operation doesn’t use the queue_channel_fail_timeUpdated_timePushed_idx index. If I understand what’s going on correctly, that should fix the deadlocks.
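The two-step approach described above can be sketched like this (Python + SQLite for illustration only; the real change lives in craft\queue\Queue::_moveExpired(), and the schema here is a stripped-down stand-in for the actual queue table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    channel TEXT,
    fail INTEGER,
    timeUpdated INTEGER,
    ttr INTEGER,
    progress INTEGER)""")
conn.executemany(
    "INSERT INTO queue VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "queue", 0, 100, 300, 50),   # expired: timeUpdated < now - ttr
     (2, "queue", 0, 900, 300, 75)])  # still fresh
now = 1000

# Step 1: read the expired job IDs (this SELECT is the only statement
# that needs to traverse the channel/fail/timeUpdated composite index)...
expired_ids = [row[0] for row in conn.execute(
    "SELECT id FROM queue WHERE channel = ? AND fail = 0 AND timeUpdated < ? - ttr",
    ("queue", now))]

# Step 2: ...then release them by primary key only, so the UPDATE's WHERE
# clause never scans (and locks) rows through the composite index.
if expired_ids:
    placeholders = ",".join("?" * len(expired_ids))
    conn.execute(
        f"UPDATE queue SET timeUpdated = NULL, progress = 0 "
        f"WHERE id IN ({placeholders})",
        expired_ids)
```

Note that the UPDATE still maintains the secondary index entries (timeUpdated changes), but its WHERE clause now takes locks in primary-key order, the same order the progress updates use, which is what removes the circular wait.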

Craft 3.5.17 is out now with that change.

Sorry to report that I'm still seeing this error after updating to 3.5.17.1:

SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction The SQL being executed was: UPDATE `craft_queue` SET `progress`=99, `timeUpdated`=1610837126, `progressLabel`='100 of 100' WHERE `id`='35516'

Running MySQL 5.5.5 (MariaDB)

Drat, ok. Just pushed a change to the 3.6 branch, so now all UPDATE and DELETE DB queries are executed with a mutex lock. That should definitely solve this.
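The mutex approach can be illustrated like this (a Python threading sketch, not Craft's actual implementation — the idea is simply that queue writes are serialized behind a shared lock, so two conflicting statements can never hold row locks the other is waiting on):

```python
import sqlite3
import threading

write_mutex = threading.Lock()

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, progress INTEGER)")
conn.execute("INSERT INTO queue VALUES (1, 0)")
conn.commit()

def guarded_update(progress):
    # Only one writer at a time may touch the queue table, so concurrent
    # UPDATEs are fully serialized and a deadlock cycle cannot form.
    with write_mutex:
        conn.execute("UPDATE queue SET progress = ? WHERE id = 1", (progress,))
        conn.commit()

threads = [threading.Thread(target=guarded_update, args=(p,))
           for p in (25, 50, 100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off is throughput: serializing all UPDATE/DELETE queries on the queue table means progress updates queue behind one another rather than running concurrently, which is an acceptable cost for a low-contention table like this.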

Awesome, thanks @brandonkelly 🙌

Craft 3.6 has now been officially released ✨
