Magento2: Magento 2 suddenly does not work with REDIS anymore after upgrade to (M2.1.6)

Created on 18 Apr 2017 · 25 comments · Source: magento/magento2


Preconditions


  1. M2.1.4 > M2.1.6 - NGINX - phpREDIS

Steps to reproduce

  1. Run php bin/magento setup:static-content:deploy
  2. Getting:
    [Exception] Notice: unserialize(): Error at offset 0 of 1 bytes in /var/www/domain.com/vendor/magento/module-config/App/Config/Type/System.php on line 214

Expected result

  1. Able to deploy without errors

Actual result

  1. The newly updated System.php file (M2.1.6) "vendor/magento/module-config/App/Config/Type/System.php" was changed and breaks the connection with Redis. It was, however, working fine with M2.1.4 and all earlier versions of System.php
  2. Run php bin/magento setup:static-content:deploy
  3. Now getting:
    [Exception] Notice: unserialize(): Error at offset 0 of 1 bytes in /var/www/domain.com/vendor/magento/module-config/App/Config/Type/System.php on line 214

Labels: Fixed in 2.1.x · Fixed in 2.2.x · Fixed in 2.3.x · Clear Description · Confirmed · Format is valid · Ready for Work · bug report


All 25 comments

Hi @commcad.

Have you followed these steps while upgrading?

Please pay attention to the 5th step: clear the var/ directories as well as run the FLUSHALL command in the Redis CLI.

I am using Redis version 3.0.6 and Magento 2.1.5. It is working smoothly for me. Could you share the server configuration?

We just got this issue on a clean install of Magento 2.1.6 installed via Composer.
We use CentOS 7, redis 3.2.8, php 7.0.18, nginx 1.12.0, mysql 5.6.36 and Magento 2.1.6.
Doing php bin/magento deploy:mode:set production, even after clearing all cache folders under the var folder, still gives this error.

We've set up countless M2 sites prior to this version and have never had any issues.

Also worth noting: it is not only this file; sometimes it reports the same error about this one:
Notice: unserialize(): Error at offset 0 of 1 bytes in /home/mediastrategi/public_html/vendor/magento/framework/Interception/PluginList/PluginList.php on line 272

It also only seems to happen after a setup:di:compile; if I run the command manually after it fails, I don't get this issue.

Now, because compile runs before static-content deploy and this sequence makes static-content deploy fail, I can't change from default mode to production mode.

Edit: I fixed the issue by first changing to developer mode then to production mode. Must be something with default mode -> production mode that doesn't work.

When you run FLUSHALL on Redis prior to deploying static content, all works fine, but it is a workaround, not a permanent solution.

I can confirm that I'm receiving the exact same error during the deployment of static content when upgrading to 2.1.5, even when following the documentation and flushing Redis.

Now we've upgraded a site from 2.1.4 to 2.1.6, cleaned and refreshed everything, as well as flushed the Redis cache, etc. We were in production mode already but things were kind of broken, so we switched to developer mode then back to production, and now it won't work since di:compile causes an issue when static-content:deploy runs right after, as explained in an earlier post here.

static-content:deploy always works on a second try after it fails. But that's not how deploy:mode:set works when changing to production; it has to work the first time around.

This specific site is a clean Magento installation with sample-data (demo site). Right now it is unusable.

This bug must have been introduced with 2.1.6 since we haven't seen it before this version.

Exactly the same problem. Introduced after 2.1.6 upgrade.
Getting

[Exception]                                                                                                                                                          
  Notice: unserialize(): Error at offset 0 of 1 bytes in /home/web/m2test/releases/20170505110400/vendor/magento/module-config/App/Config/Type/System.php on line 214

and sometimes [CredisException] read error on connection after that.

Unable to switch to production mode.
Tried to clean all possible caches; rarely it works, but most of the time it messes up our deployment process.

I can confirm that I get this issue too, and for our shop the best solution is to run the cache:flush command prior to the setup:static-content:deploy command. Like this:
bin/magento cache:flush && bin/magento setup:static-content:deploy

Having the same issue too, with RedHat 7, PHP 7, and Redis configured. Sometimes running cache:flush helps, and sometimes it won't work (possibly due to activity on the site) and takes another attempt. I have a replica environment without Redis configured and it works flawlessly. Seems to have been introduced since upgrading to 2.1.4, I think.

It's not an upgrade problem either. Every time we attempt to deploy any change and we run setup:di:compile and then static-content:deploy, it currently ALWAYS fails. We then have to manually re-deploy the static content after a few cache flushes.

In my case, when I debugged the System.php file, I found that it was breaking with Error at offset 0 of 1 bytes for a request for "default/web/secure/ssl_offloader", and when attempting to load the cache for "default/web" it was instead receiving "1" a few times before eventually receiving the full cache entry. So it seems the cache is being poisoned.

Oh wow.

Does static-content deployment now fork?

I'm seeing this. LOAD/LOAD FAILURE is from System.php. The rest is from Cm_Redis_Cache_Backend::load().

LOAD: [system_defaultweb]
CMCACHE LOAD: 123_SYSTEM_DEFAULTWEB
CMCACHE LOAD: 123_SYSTEM_CACHE_EXISTS
CMCACHE DECODE: 1
LOAD FAILURE: [1]
CMCACHE DECODE: a:8

As you can see, there must be two processes running, and it seems the first process sends an hGet for 123_SYSTEM_DEFAULTWEB, then the other process, which is in System.php, sends an hGet for 123_SYSTEM_CACHE_EXISTS. Unfortunately, due to a race condition here, because the connection is shared, the System.php process actually receives the response to the request for 123_SYSTEM_DEFAULTWEB.

This also explains the many variants and random error messages. Running it in a loop over and over and aborting after a second, I received various errors: from the unserialize notice in System.php, to Redis connection lost, to invalid data returned from Redis.

Seems this might be the cause... static-content:deploy could be forking and sharing the redis connection.
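
For anyone who wants to see this failure mode outside Magento, here is a minimal sketch (not Magento code) of how a phpredis connection shared across pcntl_fork() can deliver replies to the wrong process. It assumes the pcntl and redis PHP extensions and a local Redis server; the key names are made up to mirror the ones in the logs above.

<?php
// Parent opens ONE connection, then forks; both processes keep the same socket.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->set('system_defaultweb', serialize(['web' => ['secure' => []]]));
$redis->set('system_cache_exists', '1');

$pid = pcntl_fork();
if ($pid === -1) {
    exit(1);
}

for ($i = 0; $i < 1000; $i++) {
    if ($pid === 0) {
        // Child asks for the flag...
        $value = $redis->get('system_cache_exists');
    } else {
        // ...while the parent asks for the config blob on the SAME socket.
        // Whichever process reads first gets whichever reply arrives first,
        // so the parent can read the "1" meant for the child and then fail
        // with exactly the unserialize() notice reported above.
        $value = $redis->get('system_defaultweb');
        @unserialize($value);
    }
}

if ($pid === 0) {
    exit(0);
}
pcntl_waitpid($pid, $status);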

This also explains why cache:flush works: it forces both forks to populate from the database. It also explains why it doesn't work 100% of the time, since if one fork is too slow it could start loading from Redis again. Also, if there's activity on the site it could cause the cache to be populated again.

Workaround for now which is deterministically safe:

magento setup:static-content:deploy -j 1

The -j 1 sets parallel processing to just a single process, which effectively undoes the parallelism change that must have been introduced in Magento 2.1.4 or thereabouts.

I think this is a lesson to always call exec after fork, especially in PHP. In any large program it's widely considered dangerous not to. Hopefully straightforward to fix tho! 🙂

Hi, @driskell. Nice find, but... care to elaborate on that for mortals, too? ELI5 like. What exec? What fork? Where's System.php? How did you get to this?

LOAD: [system_defaultweb]
CMCACHE LOAD: 123_SYSTEM_DEFAULTWEB
CMCACHE LOAD: 123_SYSTEM_CACHE_EXISTS
CMCACHE DECODE: 1
LOAD FAILURE: [1]
CMCACHE DECODE: a:8

Thanks,
The Curious One

Apologies. I did rush through to find a solution and didn't spend the time to explain clearly. I'll try my best to detail everything I found.

I was randomly receiving the following error like others:

[Exception] Notice: unserialize(): Error at offset 0 of 1 bytes in /var/www/domain.com/vendor/magento/module-config/App/Config/Type/System.php on line 214

My investigation started by inserting old school I-Dont-Know-What-A-Debugger-Is style error_log lines into this file around the cache loads: LOAD: [cache-entry] and then checking if the result is unserialisable or not before Magento attempts to do it, logging either LOAD SUCCESSFUL or LOAD FAILURE: [the-value-that-we-couldn't-unserialize].
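
For reference, the logging was of this sort. This is a hypothetical sketch rather than the actual System.php code (the real method and property names in 2.1.6 differ); the helper below just shows the kind of check that was dropped in next to each cache load.

// Hypothetical helper: call it right after each cache load in System.php.
// $cacheKey and $cachedData stand in for whatever the surrounding code uses.
function logCacheLoad(string $cacheKey, $cachedData): void
{
    error_log('LOAD: [' . $cacheKey . ']');
    if ($cachedData === false) {
        return; // cache miss, nothing to unserialise
    }
    // Check whether unserialize() would succeed before Magento attempts it.
    if (@unserialize($cachedData) === false && $cachedData !== serialize(false)) {
        error_log('LOAD FAILURE: [' . $cachedData . ']');
    } else {
        error_log('LOAD SUCCESSFUL');
    }
}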

From this I could see that 1 was returned from a load for system_defaultweb and Magento was expecting serialised data. Further from these logs I could see there were many other system_defaultweb loads happening that worked fine.

From this I added further error_log calls to Cm_Cache_Backend_Redis in the load() and _decodeData() methods. They would log the first 3 characters of the decoded data.

The presence of multiple CMCACHE LOAD logs in a row without an immediately following CMCACHE DECODE (which the code dictates must happen) indicated an impossible code path explained only by parallelism. Investigating static-content:deploy, we can see that it does work in parallel. The Magento\Deploy\Console\Command\DeployStaticContentCommand::execute method uses a Model to manage deployment of each area, which can be found at Magento\Deploy\Model\DeployManager.

In Magento\Deploy\Model\DeployManager::deploy there is a check to see if parallelism is enabled and available (-j set to > 1 and the pcntl_fork function available). If so, it calls runInParallel. This function uses Magento\Deploy\Model\ProcessQueueManager to queue a list of tasks to be run in parallel, and this in turn creates processes up to the amount specified by -j (default 3) by creating Magento\Deploy\Model\ProcessManager objects.

Magento\Deploy\Model\ProcessManager has a fork method that calls pcntl_fork. After forking, nothing is adjusted, so all open descriptors, including MySQL connections, Redis connections, etc., are still held, and whilst held each child continues to run the requested processing using the same connections.

This triggers wild behaviours with a variety of errors, such as the unserialise problem, lost connections to Redis, corrupt data errors from Redis where data could not be uncompressed (if you have enabled compression), and various other random exceptions. This is caused by a race condition between reads on the Redis connection from the different parallel processes. They essentially battle with each other to receive data, and will at times receive data that is a response to a request a completely different process made. In our case of the exception above, in App/Config/Type/System.php, the process was expecting a response for system_defaultweb but actually received a response for system_cache_exists that was made by a different process, which of course returns a single integer 1, and not serialised data as is expected. In fact, the process that wanted system_cache_exists received the system_defaultweb response, but this did not cause an exception there, as it was simply evaluated for truthiness.

The best solution is to pcntl_exec to replace the current image for the child with a new PHP run starting from, say, magento static:deploy-content --child-process <info-it-can-run>, that then creates its own connections to MySQL / Redis as needed.
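
A hedged sketch of what that could look like, assuming the pcntl extension is loaded. The --child-process option and the payload passed to it are hypothetical, not an existing Magento flag; BP is Magento's base-path constant.

$workUnit = ['area' => 'frontend', 'locale' => 'en_US']; // hypothetical description of the child's work

$pid = pcntl_fork();
if ($pid === 0) {
    // Child: replace the forked image with a brand new PHP process, so it
    // bootstraps its own MySQL/Redis connections instead of reusing sockets
    // inherited from the parent.
    pcntl_exec(PHP_BINARY, [
        BP . '/bin/magento',
        'setup:static-content:deploy',
        '--child-process',        // hypothetical option carrying the work unit
        json_encode($workUnit),   // hypothetical payload describing what to deploy
    ]);
    exit(1); // only reached if pcntl_exec() fails
}
// Parent: wait for the child as the ProcessManager would.
pcntl_waitpid($pid, $status);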

Another method would be to ensure that these child processes never use Redis. There's also the possibility of attempting to open another Redis connection, which might get through this. In either case, though, it would be extremely important not to close the existing Redis connection, as this is most likely going to trigger a QUIT to be sent across the connection. This is why, when one child finishes before the others have, you sometimes see Connection to Redis lost: one child will send QUIT when PHP exits, and the other children (and the parent) would be left with an open Redis connection on which QUIT was already sent, which is an invalid state.
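
For the open-another-connection route, the important part is that the child opens a fresh connection and never closes the inherited one. A minimal sketch assuming the phpredis extension (nothing here is Magento code):

$pid = pcntl_fork();
if ($pid === 0) {
    // Do NOT call close()/quit() on the inherited connection here: that would
    // write QUIT onto the socket the parent and sibling children still share.
    $childRedis = new Redis();
    $childRedis->connect('127.0.0.1', 6379); // fresh socket, private to this child
    // ... perform this child's cache reads/writes against $childRedis only ...
    exit(0);
}
pcntl_waitpid($pid, $status);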

Overall, the workaround is to disable static-content:deploy parallelism if you are using Redis, by passing -j 1 in the arguments.

Hope this is enough detail.

Yes, that's enough detail. Lemme get my popcorn, though. And thank you very much for your time explaining.

This problem is still present in 2.1.7 EE. I opened a ticket with partner support.

I ran "redis-cli fushall" and after a few tries, was able to run "bin/magento setup:static-content:deploy en_AU en_US". Looks like redis is the culprit

@btwq did you try passing -j 1 as a parameter to the static deploy command? I found that the problem occurs when you use Redis with the default parallelism of -j 4. It's not so much a Redis issue as a Magento issue, where it inadvertently shares a Redis connection with multiple processes, triggering all kinds of wonderful race conditions. Flushall only seems to work occasionally, likely due to random luck in the races! -j 1 has worked every time for me across many, many deployments.

Partner support (EE version) is aware of this issue; they said they are working on it.
At the moment, the only stable solution is the one mentioned here.

@commcad Thanks for reporting this issue. This issue is already addressed in internal ticket MAGETWO-69847.

Facing this too

Is this really fixed in 2.1.x? Still seeing it in 2.1.9.

Still facing this issue upgrading from 2.1.5 to 2.2.2 (Magento Commerce); we're using Redis 2.8.24.

Still present in 2.2.5
