Magento2: Cron job not running after crashed once

Created on 29 May 2019 · 24Comments · Source: magento/magento2

Preconditions (*)

Magento EE 2.2.8
Crontab configured as per the documentation

Steps to reproduce (*)

An error makes the cron job indexer_update_all_views fail once (in my case, database unavailable)

Expected result (*)

Current run should be marked as failed in the table cron_schedule
Next run should run correctly and status be updated at the end in cron_schedule

Actual result (*)

Table cron_schedule is filled with pending jobs, no job for indexer_update_all_views is run (no output in var/log/cron.log, no status update in cron_schedule table.

Logs :
var/log/cron.log (last success + the error message)

[2019-05-29 11:25:10] report.INFO: Cron Job indexer_update_all_views is run [] []
[2019-05-29 11:28:03] report.ERROR: Cron Job indexer_update_all_views has an error: SQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use, query was: SELECT `mview_state`.* FROM `mview_state` WHERE (`mview_state`.`view_id`='catalog_product_flat'). Statistics: {"sum":0,"count":1,"realmem":0,"emalloc":0,"realmem_start":182714368,"emalloc_start":180313880} [] []

=> And then no more logs about indexer_update_all_views, even if other jobs from the index group run correctly and output success in var/log/cron.log
Database recovered a minute after and query was OK

Fixed in 2.4.x Format is valid P2 PR in progress done S1

Source

QuentinFarizon

👍2

All 24 comments

Hi @QuentinFarizonAfrimarket. Thank you for your report.
To help us process this issue please make sure that you provided the following information:

[ ] Summary of the issue
[ ] Information on your environment
[ ] Steps to reproduce
[ ] Expected and actual results

Please make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, please, add a comment to the issue:

@magento give me 2.3-develop instance - upcoming 2.3.x release

For more details, please, review the Magento Contributor Assistant documentation.

@QuentinFarizonAfrimarket do you confirm that you were able to reproduce the issue on vanilla Magento instance following steps to reproduce?

[ ] yes
[ ] no

Join Magento Community Engineering Slack and ask your questions in #github channel.

m2-assistant[bot] on 29 May 2019

I agree, this definitely needs fixing.

Just FYI: there has recently been a proposal to review the current cron implementation and improve its stability: https://github.com/magento/architecture/pull/171

hostep on 30 May 2019

Hello @hostep thank you !
I agree with the conclusion of magento/architecture#171 : cron management must be more resilient, fireproof, and protected against periodic or permanent failing of one of the job codes.

I think I found a reproducible scenario that caused issue on my system :

Indexing job (group "index") configured as "no separate proces"
Indexing runs out of memory, process is killed by the system (or process crashes violently for another reason)
As process didn't have the occasion to set the job as "error", it stays in running
No new indexing job runs (because one is running)
Running jobs are cleaned after max(successLifetime, erroLifetime) which by default is 3 days (!) for group index

Consequences
=> No error or few errors in cron.log or other logs
=> No reporting on the back-office or cli, apart from indexer status keeping piling up
=> Index job marked as "running" for 3 days after a single error

Workaround :
=> Increase memory limit
=> Set error lifetime to something between 1 and 6 hours (should be sufficient for indexing)

Ideas :
=> Store host+pid in database to regularly check for crashed processes (when you're on the correct host)

QuentinFarizon on 31 May 2019

👍1

@hostep Furthermore, I don't think the database lock (created in ProcessCronQueueObserver::lockGroup) is ever released if process crash without having a chance to unlock it.

So it may never run again, what do you think ?

QuentinFarizon on 31 May 2019

Running jobs are cleaned after max(successLifetime, erroLifetime) which by default is 3 days (!) for group index

This is not entirely true, since it doesn't clean up jobs with status running: https://github.com/magento/magento2/blob/48d8d4307fd4ef8f34b399b692664c46dc929ae0/app/code/Magento/Cron/Observer/ProcessCronQueueObserver.php#L496-L499

As far as I know, running jobs are just never cleaned up, even if they are no longer actually running. And that's the biggest problem here.

=> Set error lifetime to something between 1 and 6 hours (should be sufficient for indexing)

That's a good point, the lifetime was already lowered a lot in https://github.com/magento/magento2/commit/244a2e933d45f3388f28383323dca26fc73712c0, but for the indexing group, it probably makes sense to lower the default failure history even further.
Although, how many errors do you actually see with indexer jobs? Not that many I reckon? But it might still make sense to lower it, since that lifetime is also used to cleanup pending jobs.

Furthermore, I don't think the database lock (created in ProcessCronQueueObserver::lockGroup) is ever released if process crash without having a chance to unlock it.

I'm not exactly sure what happens with a lock acquired in a mysql server when the php code suddenly stops, I would think that the connection with mysql then closes and the lock is released automatically, but I'm not sure about this...

Btw: the locking is not used to manage the statuses of the jobs themselves as far as I know.
The reason why this locking was necessary (if I remember correctly) was because sometimes you could have situations where the same cron group was being executed multiple times in parallel, because the first run wasn't fully done in a single minute (Magento recommends crontab to be set so it runs every minute). And you don't want to have a new process trying to manage the statuses of jobs in the same cron group before the existing process has finished its work.

hostep on 31 May 2019

Thanks @hostep

Indeed you're right, running jobs are never cleaned. I think it would make sens to clear the after $historyFailure or max($historyFailure, $historySuccess)
There are so many reason a process can fail (out of memory, server reboot, ...), having your indexing stuck until you clear the line in the database is a very high price to pay.

I think this requires a quick fix in 2.2 and 2.3 release lines, until your proposition for hardening crons is implemented. What do you think ?

Indeed, from my observations the mysql lock _seems_ to be released, so it should'nt be an additional issue (needs confirmation)

QuentinFarizon on 3 Jun 2019

@magento give me 2.2.8 instance

QuentinFarizon on 3 Jun 2019

Hi @QuentinFarizonAfrimarket. Thank you for your request. I'm working on Magento 2.2.8-develop instance for you

magento-engcom-team on 3 Jun 2019

@magento give me 2.2.8 instance

QuentinFarizon on 3 Jun 2019

Hi @QuentinFarizonAfrimarket. Thank you for your request. I'm working on Magento 2.2.8 instance for you

magento-engcom-team on 3 Jun 2019

Hello @hostep

I have another issue : individual mview updates (like catalog_product_flat) are stuck in 'working' mode in the table mviews_state.
As they are not hold by any process, and are not cleaned (I believe) so they will be stuck forever.
So even with an indexer job running correctly, you can get individual views that are not updated following their changelogs.

I suspect that this modification : https://github.com/magento/magento2/commit/5927a75169aa5eb7a978847d22dbde5dbe3cc5a2 must also be applied to https://github.com/magento/magento2/blob/2.3-develop/lib/internal/Magento/Framework/Mview/View.php#L303

I can push a PR if you agree (for both issues)

QuentinFarizon on 3 Jun 2019

... running jobs are never cleaned. I think it would make sens to clear the after $historyFailure or max($historyFailure, $historySuccess)
There are so many reason a process can fail (out of memory, server reboot, ...), having your indexing stuck until you clear the line in the database is a very high price to pay.

Agreed. There should be some way for Magento to detect failed cron jobs and not assume they are still running. But I'm not sure if the existing lifetime fields are the right solution here. Maybe a new field should be added for a new lifetime called 'max duration of running job' or something like that?

I suspect that this modification : 5927a75 must also be applied to https://github.com/magento/magento2/blob/2.3-develop/lib/internal/Magento/Framework/Mview/View.php#L303

The change to catching a Throwable instead of an Exception might prevent more possible problematic situations, but will not entirely fix all the possible problems here. If php runs out of memory, it will just stop and will just never go into the catch I think. You can possibly catch such situations with register_shutdown_function, but then you need to make sure you reserved a bunch of memory beforehand which you then have to free up so you still have memory enough to fix the state in the database. This sound pretty complex and not a very elegant solution.
And what do you do when the electricity fails just when the state in the database is set to working but before any work actually begun?
Would using database transactions work here for example?

It would be nice to have some feedback from some experts in here. @dmanners: I spoke to you on Imagine 2018 about this particular problem (amongst some other problems), but this still hasn't been properly fixed, can you try to pull in some experts from Magento in here to see how this problem can be solved?
The quick summary: we need a way for Magento to not keep cronjobs in state running or materialized views in the state working forever and ever. There must be a way for Magento to detect this and work around this somehow.

hostep on 3 Jun 2019

Hi @m-a-x-i-m. Thank you for working on this issue.
In order to make sure that issue has enough information and ready for development, please read and check the following instruction: :point_down:

[ ] 1. Verify that issue has all the required information. (Preconditions, Steps to reproduce, Expected result, Actual result).
Details
If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please, edit issue description if needed, until label Issue: Format is valid appears.

[ ] 2. Verify that issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add Issue: Clear Description label to the issue by yourself.
[ ] 3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.
[ ] 4. Verify that the issue is reproducible on 2.3-develop branch
Details
- Add the comment @magento give me 2.3-develop instance to deploy test instance on Magento infrastructure.
- If the issue is reproducible on 2.3-develop branch, please, add the label Reproduced on 2.3.x.
- If the issue is not reproducible, add your comment that issue is not reproducible and close the issue and _stop verification process here_!
[ ] 5. Verify that the issue is reproducible on 2.2-develop branch.
Details
- Add the comment @magento give me 2.2-develop instance to deploy test instance on Magento infrastructure.
- If the issue is reproducible on 2.2-develop branch, please add the label Reproduced on 2.2.x

Join Magento Community Engineering Slack and ask your questions in #github channel.

m2-assistant[bot] on 4 Jun 2019

The change to catching a Throwable instead of an Exception might prevent more possible problematic situations, but will not entirely fix all the possible problems here.

Yes agreed, but it would be easy to push as a first fix. I thought about register_shutdown_function, but indeed it's probably complicated to make database queries in it.
Database transactions should work yes, specifying to lock table cron_schedule but not lock the indexed table.

QuentinFarizon on 4 Jun 2019

Yes agreed, but it would be easy to push as a first fix.

Cool, feel free to create a PR with this suggestion! It will most likely get approved.

hostep on 4 Jun 2019

Even with my patch (https://github.com/magento/magento2/pull/23125) I still have issues where no cron is currently running, it's not stuck in cron_schedule, but some mviews are stuck to 'working'

MariaDB [ivwooaagji7o4]> select * from cron_schedule where job_code  = 'indexer_update_all_views' order by executed_at desc limit 20;
+-------------+--------------------------+---------+----------+---------------------+---------------------+---------------------+---------------------+
| schedule_id | job_code                 | status  | messages | created_at          | scheduled_at        | executed_at         | finished_at         |
+-------------+--------------------------+---------+----------+---------------------+---------------------+---------------------+---------------------+
|    21779692 | indexer_update_all_views | success | NULL     | 2019-06-06 10:36:06 | 2019-06-06 10:39:00 | 2019-06-06 10:40:10 | 2019-06-06 10:40:21 |
|    21779689 | indexer_update_all_views | success | NULL     | 2019-06-06 10:36:06 | 2019-06-06 10:38:00 | 2019-06-06 10:39:06 | 2019-06-06 10:39:19 |
|    21779632 | indexer_update_all_views | success | NULL     | 2019-06-06 10:34:15 | 2019-06-06 10:37:00 | 2019-06-06 10:38:06 | 2019-06-06 10:38:19 |

MariaDB [ivwooaagji7o4]> select * from mview_state;
+----------+-----------------------------------+----------+---------+---------------------+------------+
| state_id | view_id                           | mode     | status  | updated             | version_id |
+----------+-----------------------------------+----------+---------+---------------------+------------+
|        1 | catalogsearch_fulltext            | enabled  | idle    | 2019-06-06 10:41:13 |   73367814 |
|        2 | catalog_category_product          | enabled  | idle    | 2019-06-06 10:41:11 |      69390 |
|        3 | catalog_product_category          | enabled  | working | 2019-06-06 08:52:35 |    8593557 |
|        4 | catalog_product_price             | enabled  | working | 2019-06-05 15:10:16 |    8879166 |
|        5 | catalog_product_attribute         | enabled  | idle    | 2019-06-06 10:41:11 |   30248796 |
|        6 | catalogrule_rule                  | enabled  | idle    | 2019-06-06 10:41:11 |          0 |
|        7 | catalogrule_product               | enabled  | idle    | 2019-06-06 10:41:11 |   44660022 |
|        8 | cataloginventory_stock            | enabled  | idle    | 2019-06-06 10:41:11 |   17497764 |
|        9 | targetrule_product_rule           | enabled  | working | 2019-06-05 15:10:19 |     920777 |
|       10 | targetrule_rule_product           | enabled  | idle    | 2019-06-06 10:41:13 |          0 |
|       11 | salesrule_rule                    | enabled  | idle    | 2019-06-06 10:41:11 |        528 |
|       12 | algolia_products                  | disabled | idle    | 2019-06-06 10:41:11 |   19351077 |
|       13 | algolia_categories                | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       14 | algolia_pages                     | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       15 | algolia_suggestions               | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       16 | algolia_additional_sections       | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       17 | algolia_queue_runner              | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       18 | catalog_category_flat             | enabled  | idle    | 2019-06-06 10:41:11 |      15264 |
|       23 | catalog_product_flat              | enabled  | working | 2019-06-05 15:10:14 |   21107781 |
|       27 | design_config_dummy               | enabled  | idle    | 2019-06-06 10:41:11 |          0 |
|       30 | customer_dummy                    | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       33 | algolia_delete_products           | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       36 | afrimarket_catalog_basket_product | enabled  | idle    | 2019-06-06 10:41:12 |    4258158 |
|       39 | afrimarket_catalog_ranking        | enabled  | idle    | 2019-06-06 10:41:19 |   24287076 |
|       42 | afrimarket_catalog_margin         | enabled  | idle    | 2019-06-06 10:41:19 |     138633 |
+----------+-----------------------------------+----------+---------+---------------------+------------+

I have no ideas how this can happen, I'm open to suggestions

QuentinFarizon on 6 Jun 2019

@QuentinFarizonAfrimarket Hi!
Did you try this fix?
Issue: https://github.com/magento/magento2/pull/23079/files
Fix: https://github.com/magento/magento2/issues/23077
We have the same issue with millions of lines in changelog tables especially catalog_product_flat_cl maybe it is the same for you.
It is stuck in \Magento\Framework\Mview\View::update processing millions of versions in chunck

mattheo-geoffray on 6 Jun 2019

👍1

Hello, yes I have applied the fix https://github.com/magento/magento2/issues/23077 and recreated triggers : https://github.com/magento/magento2/pull/23079#issuecomment-498863413

It certainly helps, but we sill have issues, also with catalog_product_flat (temporary index table seems to disappear during indexation). Not sure how it is related to mview getting stuck since it's supposed to be catched by this PR

QuentinFarizon on 6 Jun 2019

@hostep I've noticed in my own environment that any jobs that are running and never complete are marked as missed. A method that clears missed jobs after a determined lifetime would also solve this issue.

Ctucker9233 on 14 Jun 2019

@Ctucker9233: reading the code I'm not seeing how that can happen

A job is only set to missed when it tries to run a pending job, and if the scheduled time was missed by some offset: https://github.com/magento/magento2/blob/8e16abce80b13d4530cec4a1578850f7d153a39a/app/code/Magento/Cron/Observer/ProcessCronQueueObserver.php#L294-L297

This _runJob method only gets called from the processPendingJobs method. Which only processes jobs with state pending.

So you shouldn't see running jobs getting changed to missed somehow, unless I'm missing something or you are using some custom code which is reponsible for this?

And there is already some code to cleanup missed jobs as far as I can see:
https://github.com/magento/magento2/blob/8e16abce80b13d4530cec4a1578850f7d153a39a/app/code/Magento/Cron/Observer/ProcessCronQueueObserver.php#L497

hostep on 14 Jun 2019

@hostep I misscommunicated. I didn't mean that I am seeing running jobs switch their status to missed. What I meant was that in my environment at least a lot of missing jobs seem to accumulate and that can also cause the cron to crash or have a deadlock. The code that looks like it's meant to clean up missed jobs also seems to be not functioning in my case.

Ctucker9233 on 14 Jun 2019

👍1

Hi All,

Replace incredibly inefficient sql in constant with inner join with temporary memory table with index support, will stop this cron from taking up so much and crashing in the first place.

Use the attached file and override inefficient function:-

<?php
/**
 * Copyright © 2019 Ekky Software Pty Ltd All rights reserved.
 * See COPYING.txt for license details.
 */

declare(strict_types=1);

namespace Ekky\Catalog\Model\Indexer\Category\Product\Action;

use Magento\Catalog\Api\Data\ProductInterface;
use Magento\Framework\DB\Adapter\AdapterInterface;
use Magento\Framework\DB\Select;
use Magento\Framework\EntityManager\MetadataPool;
use Magento\Store\Model\Store;

/**
 * Class Full reindex action
 *
 * @SuppressWarnings(PHPMD.CouplingBetweenObjects)
 */
class Full 
extends \Magento\Catalog\Model\Indexer\Category\Product\Action\Full
{

    /**
     * Refresh entities index
     *
     * @return $this
     */
    public function execute(): \Magento\Catalog\Model\Indexer\Category\Product\Action\Full
    {
        return parent::execute();
    }

    /**
     * Reindex categories using given SQL select and condition.
     *
     * @param Select $basicSelect
     * @param string $whereCondition
     * @param Store $store
     * @return void
     */
    private function reindexCategoriesBySelect(Select $basicSelect, $whereCondition, $store): void
    {
        $this->tableMaintainer->createMainTmpTable((int)$store->getId());

        $entityMetadata = $this->metadataPool->getMetadata(ProductInterface::class);
        $columns = array_keys(
            $this->connection->describeTable($this->tableMaintainer->getMainTmpTable((int)$store->getId()))
        );

        $select = $this->connection->select();
        $select->distinct(true);
        $select->from(['e' => $entityMetadata->getEntityTable()], $entityMetadata->getIdentifierField());
        $temporaryIdIndex = $this->getTable('temp_id_idx_'.$store->getId());
        $this->connection->query("CREATE TEMPORARY TABLE ".$temporaryIdIndex."(".$entityMetadata->getIdentifierField()." INTEGER PRIMARY KEY) ENGINE=MEMORY " . $select->__toString());

        $resultSelect = clone $basicSelect;
        $resultSelect->joinInner(['temp_idx' => $temporaryIdIndex], 'temp_idx.'.$entityMetadata->getIdentifierField().' = '.str_replace("IN (?)","",$whereCondition),[]);
        $this->connection->delete($this->tableMaintainer->getMainTmpTable((int)$store->getId()));
        $this->connection->query(
            $this->connection->insertFromSelect(
                $resultSelect,
                $this->tableMaintainer->getMainTmpTable((int)$store->getId()),
                $columns,
                AdapterInterface::INSERT_ON_DUPLICATE
            )
        );
        $this->connection->dropTable($temporaryIdIndex);
        $this->publishData($store);
    }
}

4DallasC on 22 Nov 2019

The PR in #28007 should fix this. It clears up things stuck in running and mitigates deadlocks so the cleanups can finish and stop the cron_schedule table growing exponentially.

driskell on 24 Sep 2020

👍1

Hi @QuentinFarizon. Thank you for your report.
The issue has been fixed in magento/magento2#28007 by @driskell in 2.4-develop branch
Related commit(s):