Magento2: MessageQueue cron runner repeatedly launches duplicate consumers when ps command is provided by Busybox

Created on 8 Jul 2019 · 17 comments · Source: magento/magento2

Preconditions (*)

  1. Magento 2.3.2
  2. Alpine Linux (or any distro using busybox to provide ps command)
  3. ps command provided by busybox (no procps installed)
  4. no php-posix module installed.
  5. Cron jobs setup

Steps to reproduce (*)

  1. ps -a
  2. Wait 5 minutes
  3. ps -a

Expected result (*)

  1. Only one set of consumers should be running.

Actual result (*)

  1. Every minute a new set of consumers is launched.

\Magento\MessageQueue\Model\Cron\ConsumersRunner

uses one of two ways to determine whether the consumers are already running or not.

php posix_getpgid($pid)

or

exec ps -p $pid

When neither method works, the check for already-running consumers fails, and new consumers are launched even though consumers are already running.
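The check described above can be sketched roughly as follows. This is a simplified illustration, assuming the runner prefers posix_getpgid() and falls back to ps -p; it is not the actual ConsumersRunner code, and isProcessRunning() is a hypothetical name. It shows why a ps that rejects -p looks exactly like a dead process: both paths return false, which the caller treats as "launch a new consumer".

```php
<?php
// Hypothetical, simplified version of the two-method liveness check
// described in this issue (not Magento's actual implementation).
function isProcessRunning(int $pid): bool
{
    if (function_exists('posix_getpgid')) {
        // posix_getpgid() returns false when no process has this PID.
        return posix_getpgid($pid) !== false;
    }

    // Fallback: ps -p exits with 0 only if the PID was found. BusyBox ps
    // rejects -p and also exits with 1, so it reports "not running" too.
    exec('ps -p ' . escapeshellarg((string) $pid) . ' 2>/dev/null', $output, $code);
    return $code === 0;
}

// A PID far above Linux's pid_max can never exist.
var_dump(isProcessRunning(2147483646)); // bool(false)
```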

The machine eventually dies once all memory is exhausted.



All 17 comments

Hi @gwharton. Thank you for your report.
To help us process this issue please make sure that you provided the following information:

  • [x] Summary of the issue
  • [x] Information on your environment
  • [x] Steps to reproduce
  • [x] Expected and actual results

Please make sure that the issue is reproducible on a vanilla Magento instance by following the Steps to reproduce. To deploy a vanilla Magento instance on our environment, please add a comment to the issue:

@magento give me 2.3-develop instance - upcoming 2.3.x release

For more details, please, review the Magento Contributor Assistant documentation.

@gwharton do you confirm that you were able to reproduce the issue on a vanilla Magento instance by following the steps to reproduce?

  • [x] yes
  • [ ] no

Installing the php7-posix module on Alpine makes the POSIX functions available and resolves the issue.
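To confirm that the extension is actually available to the PHP binary that cron runs (the CLI can use a different php.ini from PHP-FPM), a quick check with standard PHP functions is enough; running php -m and looking for posix gives the same information. A minimal sketch:

```php
<?php
// Report whether the posix extension (and posix_getpgid()) is available
// to this PHP binary. Run it with the same php CLI that cron uses.
$posixLoaded = extension_loaded('posix');
$hasGetpgid = function_exists('posix_getpgid');

echo 'posix extension loaded: ' . ($posixLoaded ? 'yes' : 'no') . PHP_EOL;
echo 'posix_getpgid() available: ' . ($hasGetpgid ? 'yes' : 'no') . PHP_EOL;
```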

The documentation should be updated to state that either a ps command that supports "ps -p $pid" or PHP's posix extension is now required.

The module should also be smarter about what to do when both methods fail, instead of just relaunching consumers until the machine runs out of memory.

@gwharton I was able to resolve a similar issue by adding procps to my alpine dependencies.

RUN apk add --no-cache \
  gzip \
  freetype-dev \
  icu-dev \
  libjpeg-turbo-dev \
  libpng-dev \
  libxslt-dev \
  lsof \
  curl-dev \
  libsodium-dev \
  mysql-client \
  procps \
  zip

I'm seeing the same problem of consumers constantly being spun up under Magento 2.3.2 on Amazon Linux.

Oddly, this system has procps installed, so there must be some other reason Magento cannot find the processes. It is writing to the .pid file, so I do not think this is a permissions problem.

The version of ps needs to support the -p switch, or you need the PHP posix module enabled. As you say, it launches the consumers and successfully creates the pid file.

I wonder whether the exit code of ps differentiates between "unknown option -p" and "process not found". The former should result in an error in the Magento log; the latter should result in a new consumer launch. At the moment, both result in new consumers being launched.

<?php
exec(escapeshellcmd('ps -ohmygoshthiscantwork'), $output, $code);
echo "Testing command ps -ohmygoshthiscantwork - Return is : ";
$code = (int) $code;
switch ($code) {
    case 0:
        echo "0\n";
        break;
    case 1:
        echo "1\n";
        break;
    default:
        echo "other\n";
        break;
}

exec(escapeshellcmd('ps -p ' . "1"), $output, $code);
echo "Testing command ps -p 1 - Return is : ";
$code = (int) $code;
switch ($code) {
    case 0:
        echo "0\n";
        break;
    case 1:
        echo "1\n";
        break;
    default:
        echo "other\n";
        break;
}

exec(escapeshellcmd('ps -p ' . "9999"), $output, $code);
echo "Testing command ps -p 9999 - Return is : ";
$code = (int) $code;
switch ($code) {
    case 0:
        echo "0\n";
        break;
    case 1:
        echo "1\n";
        break;
    default:
        echo "other\n";
        break;
}

Ubuntu 19

www-data@dev:~/dev1$
Testing command ps -ohmygoshthiscantwork - Return is : 1
Testing command ps -p 1 - Return is : 0
Testing command ps -p 9999 - Return is : 1
www-data@dev:~/dev1$

Alpine with busybox only

www-data@dev:~/dev1$
Testing command ps -ohmygoshthiscantwork - Return is : 1
Testing command ps -p 1 - Return is : 1
Testing command ps -p 9999 - Return is : 1
www-data@dev:~/dev1$

Alpine with procps

www-data@dev:~/dev1$
Testing command ps -ohmygoshthiscantwork - Return is : 1
Testing command ps -p 1 - Return is : 0
Testing command ps -p 9999 - Return is : 1
www-data@dev:~/dev1$

The mechanism cannot tell the difference between ps being called with an unknown argument and ps -p being called on a process ID that doesn't exist: both return 1. On installations using only busybox it fails altogether, because busybox ps also returns 1 for ps -p 1, a process that always exists.
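Because the exit code alone cannot separate these cases, one possible workaround is to calibrate ps -p against a PID that is known to exist: the PHP process's own, from getmypid(). If ps -p fails even for that PID, the ps implementation does not support -p and its answers for other PIDs cannot be trusted. This is a sketch under that assumption; psSupportsPidOption() is a hypothetical name, not a Magento function.

```php
<?php
// Hypothetical capability probe: ps -p must succeed for the current
// process, which is guaranteed to be running while this code executes.
function psSupportsPidOption(): bool
{
    $pid = getmypid();
    if ($pid === false) {
        return false; // Cannot determine our own PID; treat ps -p as unusable.
    }
    exec('ps -p ' . escapeshellarg((string) $pid) . ' 2>/dev/null', $output, $code);
    return $code === 0;
}

echo psSupportsPidOption()
    ? "ps -p works; its exit code can be used to detect consumers\n"
    : "ps -p is unusable (e.g. busybox ps); install procps or the posix extension\n";
```

Going by the results above, this probe would return true on Ubuntu and on Alpine with procps, and false on Alpine with busybox only.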

We are now seeing this problem on both Commerce and Open Source installations of 2.3.2. The consumers keep spawning until the server runs out of memory. For now I'm going to attempt to write my own cron job to spawn them as needed. The Magento cron job clearly has trouble detecting whether consumers are running.

This is on AWS Linux, and ps has the -p option available. The fact that two entirely different projects, one on Open Source and one on Commerce, are having this issue suggests the cause is not our codebase but rather the combination of Magento 2.3.2 and AWS.

The consumers are not spinning up every minute, but multiple copies do appear over the course of days, which makes this difficult to diagnose.

@maderlock You either need the php_posix extension installed, or a version of the command line ps command that supports the "-p" command line argument. Either will do.

Hi @engcom-Delta. Thank you for working on this issue.
In order to make sure that issue has enough information and ready for development, please read and check the following instruction: :point_down:

  • [ ] 1. Verify that the issue has all the required information (Preconditions, Steps to reproduce, Expected result, Actual result).
    If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please edit the issue description if needed, until the label Issue: Format is valid appears.
  • [ ] 2. Verify that the issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add the Issue: Clear Description label to the issue yourself.

  • [ ] 3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.

  • [ ] 4. Verify that the issue is reproducible on the 2.3-develop branch

    - Add the comment @magento give me 2.3-develop instance to deploy a test instance on Magento infrastructure.
    - If the issue is reproducible on the 2.3-develop branch, please add the label Reproduced on 2.3.x.
    - If the issue is not reproducible, add a comment that the issue is not reproducible, close the issue and _stop verification process here_!

  • [ ] 5. Add the label Issue: Confirmed once verification is complete.

  • [ ] 6. Make sure that the automatic system confirms that the report has been added to the backlog.

Hi @gwharton. Unfortunately, I cannot reproduce the issue on 2.3-develop using the steps you described.
Testing scenario:

  • Disable the php posix module
    [screenshot]
  • Install Magento and generate some products
  • Run bin/magento cron:install
  • Start the message queue consumers
    [screenshot: #23625issueConsumerJobs]
  • Run ps -a
    [screenshot: #23625issue]
  • Wait 5 minutes
  • Run ps -a
    Result:
    ✔ Got the same process IDs as the first time
    [screenshot: #23625issueSecondRun]

Could you check whether my steps are correct?

@maderlock You either need the php_posix extension installed, or a version of the command line ps command that supports the "-p" command line argument. Either will do.

Unfortunately this is not the issue. The server in question has ps with -p option, as well as php_posix.

I suspect this is going to be architecture related, as it's fine in my local environment (OS X), but under AWS Linux we have this problem across all our sites, whether Commerce or Open Source.

@maderlock I believe there are two issues here.

  1. Consumers get spawned once per minute, and the machine quickly dies from lack of memory. This happens in the scenario where BOTH php_posix is not enabled and ps -p behaves incorrectly (as in the busybox version).
  2. Your issue, where at random times, additional consumers are launched incorrectly even though you have the necessary posix or ps versions installed and the pid is present and correct.

Perhaps a second issue should be raised making it very clear that the issues are similar and related, but have different causes.

@engcom-Delta You need php_posix disabled and also a version of the ps command that returns 1 if it doesn't support the -p command line parameter (as busybox's version does). Magento cannot differentiate between ps -p returning 1 because the ps command does not support the -p parameter, and ps -p returning 1 because ps works properly but the consumer is not running. Both result in a new consumer launch.

As I said, it only becomes a problem when BOTH posix is missing and a dodgy version of ps (busybox) is used.

Perhaps there is a way for Magento to detect whether ps supports the -p parameter and, if not, disable that method of checking for consumers. If neither the posix check nor the ps check is usable, it should alert the user that they need either posix or a different version of ps installed, instead of flailing wildly and taking the machine down by spawning consumers in rapid succession.
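The behaviour proposed above can be sketched as a three-way result: running, not running, or unknown, where "unknown" means neither detection method is usable. In that case the runner logs an error and skips the launch instead of spawning another consumer. This is a hypothetical design sketch: consumerState(), shouldLaunchConsumer() and the STATE_* constants are illustrative names, not Magento APIs, and the two detection results are passed in as nullable flags so the decision logic can be tested on its own.

```php
<?php
// Hypothetical three-way answer to "is this consumer already running?".
const STATE_RUNNING = 'running';
const STATE_NOT_RUNNING = 'not_running';
const STATE_UNKNOWN = 'unknown';

/**
 * @param bool|null $posixResult posix_getpgid() verdict, or null if the
 *                               posix extension is not available
 * @param bool|null $psResult    ps -p verdict, or null if ps -p is unusable
 *                               (for example, busybox ps)
 */
function consumerState(?bool $posixResult, ?bool $psResult): string
{
    $verdict = $posixResult ?? $psResult; // prefer posix, fall back to ps
    if ($verdict === null) {
        return STATE_UNKNOWN;
    }
    return $verdict ? STATE_RUNNING : STATE_NOT_RUNNING;
}

function shouldLaunchConsumer(string $state): bool
{
    if ($state === STATE_UNKNOWN) {
        // Alert instead of launching: without a working check, every cron
        // run would start another consumer until memory runs out.
        error_log('Cannot detect running consumers: install the PHP posix '
            . 'extension or a ps that supports -p (e.g. procps).');
        return false;
    }
    return $state === STATE_NOT_RUNNING;
}

// busybox-only Alpine without posix: neither method is usable.
var_dump(shouldLaunchConsumer(consumerState(null, null)));  // bool(false)
// procps available and the consumer was not found: launch a new one.
var_dump(shouldLaunchConsumer(consumerState(null, false))); // bool(true)
```

Note that only null (not false) falls through the ?? operator, so a definite "not running" from posix is never overridden by ps.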

At the very least, the technology stack documentation should mention that posix is recommended and that, without it, the version of ps must support the -p parameter, perhaps noting that busybox alone does not meet this requirement and that the procps package is required (in the case of Alpine) if the PHP posix extension is not available.

@gwharton or @maderlock: out of curiosity: did you already test this with Magento 2.3.3? There were some changes done to how Magento checks if a consumer process is already running. I'm not sure if it would solve your problem, but it might ...

The ps -p code has been removed in that same commit

Yes, I just had a quick look over the changes; they look great and should resolve this issue, so I'm going to close this now. @maderlock, if your issue still exists after the changes mentioned above, please work on reproducing it reliably in 2.3-develop and raise a new issue covering your findings with full steps to reproduce.

Interesting. I cannot upgrade to 2.3.3 at this point in the project, but I had come to the conclusion that the PID logic was incorrect: it was referring to the parent PID, not that of the spun-up consumers. Interesting to see that a different mechanism is being used now.
