Magento2: Varnish 6 "Too many restarts"

Created on 29 Aug 2019 · 57 comments · Source: magento/magento2

Preconditions (*)

  • Magento 2.4-develop;
  • Production mode;
  • Sample Data;
  • PHP 7.3;
  • Varnish v. 6.2

We tried the Varnish 6 VCL from https://github.com/magento/magento2/commit/882379061809b806fc5580729a6c9c78d782f84d#diff-2f64f6171deecba61bea147539cf72ec

However, it results in "too many restarts" errors after a while.

Steps to reproduce (*)

  1. Configure Varnish and the web server (Apache 2 in my case) per devdocs;
  2. Configure Magento to use Varnish

Go to Admin->Stores->Configuration->System->Full Page Cache:

  • set Cache Application to Varnish Cache;
  • set TTL for public content to 120 sec;

  3. Export the Varnish 6 VCL, configure it, and restart the varnish service;
  4. Run _varnishlog_ in a console;
  5. Go to the Storefront and try browsing the site for longer than 10 min.

Using the VCL on a production site with Varnish 6.2, after a while certain objects get into a restart loop.
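For context, here is a minimal sketch of the vcl_hit logic that produces the loop. This is a paraphrase of what the thread describes, not the literal generated Magento template: once TTL and grace are both exhausted, the request is restarted, re-hashes to the same expired object, hits it again, and restarts again until max_restarts is exceeded.

```
# Sketch only -- paraphrased, not the exact Magento-generated VCL.
sub vcl_hit {
    if (obj.ttl >= 0s) {
        # Fresh hit
        return (deliver);
    }
    if (obj.ttl + obj.grace > 0s) {
        # Stale but still within grace
        return (deliver);
    }
    # Expired: replaces the return (miss) that Varnish 6.2 removed.
    # Without a matching change in vcl_recv, this loops forever.
    return (restart);
}
```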

Expected result (*)


We expect no 503 errors caused by restarts.

Actual result (*)


:heavy_multiplication_x: VCL keeps restarting forever, resulting in VCL_Error Too many restarts after a few tries;
503 response status and VCL error: Too many restarts

_Varnish logs_ (screenshots)

Labels: PageCache · Fixed in 2.4.x · Confirmed · Format is valid · Ready for Work · P2 · Reproduced on 2.4.x · S1 · Dev.Experience

All 57 comments

Hi @HOSTED-POWER. Thank you for your report.
To help us process this issue please make sure that you provided the following information:

  • [ ] Summary of the issue
  • [ ] Information on your environment
  • [ ] Steps to reproduce
  • [ ] Expected and actual results

Please make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, please, add a comment to the issue:

@magento give me 2.3-develop instance - upcoming 2.3.x release

For more details, please, review the Magento Contributor Assistant documentation.

@HOSTED-POWER do you confirm that you were able to reproduce the issue on vanilla Magento instance following steps to reproduce?

  • [ ] yes
  • [ ] no

PS: We read the information here: https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#whatsnew-upgrading-2019-03

and added in vcl_recv:

if (req.restarts > 0) { set req.hash_always_miss = true; }

This improved the situation, although we weren't sure it was resolved properly & 100% supported.

Update: We did further tests and it looks properly solved.

Hi @HOSTED-POWER,
Thank you for the report!

Could you add steps to reproduce to make sure that we'll be able to reproduce this issue?

It's really good that you already found a solution for your issue. Could you create a Pull Request with the suggested fix?

Hello

To reproduce, install any Magento site (we had 2.3.1 and some 2.x versions) and wait for it to happen:

You can see the log like this:

varnishlog -q 'RespStatus == 503' -g request

Probably after only a few minutes you will see the 503 on certain objects, and Varnish goes into the Guru Meditation error :)

We've seen it on all sites we tried it on, so it would be hard not to notice it.
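If you want to watch specifically for the restart loop rather than all 503s, a narrower VSL query should also work (same -q syntax as the command above; the record name and message are taken from the logs later in this thread):

```
# Show only request groups that hit the restart limit
varnishlog -g request -q 'VCL_Error ~ "Too many restarts"'
```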

To be on the safe side, we added this on top in vcl_recv:

sub vcl_recv {

    if (req.restarts > 0) {
        set req.hash_always_miss = true;
    }

    if (req.method == "PURGE") {
...

@Stepa4man: can you maybe also take a look at this to see if this proposed fix is ok?

HostedPower helped us solve this issue yesterday on one of our shops which is hosted with them, where we ran into unexplainable 503 Varnish errors. Their change seems to have fixed it 👍

@engcom-Alfa @engcom-Bravo @engcom-Charlie,
Could you verify this issue?

Hi @engcom-Delta. Thank you for working on this issue.
In order to make sure that issue has enough information and ready for development, please read and check the following instruction: :point_down:

  • [ ] 1. Verify that issue has all the required information. (Preconditions, Steps to reproduce, Expected result, Actual result).
    Details: If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please, edit issue description if needed, until label Issue: Format is valid appears.
  • [ ] 2. Verify that issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add Issue: Clear Description label to the issue by yourself.

  • [ ] 3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.

  • [ ] 4. Verify that the issue is reproducible on 2.3-develop branch

    Details:
    - Add the comment @magento give me 2.3-develop instance to deploy test instance on Magento infrastructure.
    - If the issue is reproducible on 2.3-develop branch, please, add the label Reproduced on 2.3.x.
    - If the issue is not reproducible, add your comment that issue is not reproducible, close the issue, and _stop verification process here_!

  • [ ] 5. Add label Issue: Confirmed once verification is complete.

  • [ ] 6. Make sure that automatic system confirms that report has been added to the backlog.

Hi @HOSTED-POWER, thank you for your report. I am not able to reproduce the issue by the steps you described on 2.3-develop.

If you'd like to update the issue, please reopen it.

Hello @engcom-Delta

Did you enable

varnishlog -q 'RespStatus == 503' -g request

and then crawl the whole site? Try again after 30 min and again after 2 hours; it should really start happening :/

We used Varnish 6.2 btw, not sure whether that matters.

PS: The vcl is for Varnish 6.2: https://github.com/magento/magento2/commit/882379061809b806fc5580729a6c9c78d782f84d#diff-2f64f6171deecba61bea147539cf72ec

So at least I would test with that and not 6.0.5, which is outdated for this test.

Furthermore, if you want to see it even faster, try enabling caching of static files:

I.e. change this part:

    # Static files caching
    if (req.url ~ "^/(pub/)?(media|static)/") {
        # Static files should not be cached by default
        # return (pass);

        # But if you use a few locales and don't use CDN you can enable caching static files by commenting previous line (#return (pass);) and uncommenting next 3 lines
        unset req.http.Https;
        unset req.http./* {{ ssl_offloaded_header }} */;
        unset req.http.Cookie;
    }

In any case you need to crawl the whole site, not just look at the homepage and assume it didn't occur :)

@HOSTED-POWER thanks for the reply. Rechecked on Varnish 6.2.2 and the issue is not reproducible.

With static caching enabled and crawling the whole site? I think the shop in your screenshot was empty (it's not happening on all objects on all pages).

Also, it takes some time; sometimes it works fine even for a few hours, but it happens eventually. Sometimes it's very fast too, but you need to let it run for a longer time.

We noticed that with static file caching enabled it occurred even faster, so it would be nice to check.

Last but not least, Varnish itself states in the documentation how you should replace the "miss": https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#whatsnew-upgrading-2019-03

return(miss) from vcl_hit{} is now removed. An option for implementing similar functionality is:

  • return (restart) from vcl_hit{};
  • in vcl_recv{}, for the restart (when req.restarts has increased), set req.hash_always_miss = true;.
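Putting the two halves of that upgrade note together, a minimal sketch of the intended end state (only the restart-related lines are shown; the rest of the generated Magento VCL is elided):

```
sub vcl_recv {
    if (req.restarts > 0) {
        # Without this, a restarted request re-hashes to the same
        # expired object, hits it again, and loops until max_restarts.
        set req.hash_always_miss = true;
    }
    # ... rest of the generated vcl_recv ...
}

sub vcl_hit {
    # ... deliver fresh and in-grace objects ...
    return (restart); # replacement for the removed return (miss)
}
```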

@HOSTED-POWER Still cannot reproduce the issue.
In default.vcl, static file caching is enabled, and the header with the MISS value is taken from the VCL file that was generated from the Magento admin:
https://devdocs.magento.com/guides/v2.3/config-guide/varnish/config-varnish-final.html

Veryyyyyy strange :)

We use nginx --> varnish --> nginx, but I doubt that's the reason.

We saw it on several sites for sure, at least 7 or 8 different ones (production websites, so not with the default theme etc.).

Sadly this problem also occurred on one of my main projects. I can confirm that these 503 errors are happening out of nowhere. In my case there were other problems (memory issues), so I thought the problem came from those. But no, those weren't related. The fix above seems to have solved my issue.

Hi,

I have a question regarding this. We were having this same issue. We added that code from @HOSTED-POWER, but we still saw an error; this time it had "out of workspace (bo)" in the log file, which led me to this: https://www.claudiokuenzler.com/blog/737/varnish-panic-crash-low-sess-workspace-backend-client-sizing

Now what I think is happening is that once we added that code to set req.hash_always_miss = true;, it allowed the error stack to finally finish with that error, where before it was just returning 503 early. OR maybe now that I am setting that, enough restarts happened to run out of workspace. Either that, or this was a totally unrelated error.

So my question is: after applying this fix, did anyone else get "workspace (bo)"? Also, FYI, you can log the 50X errors with this command:

varnishlog -a -A -w /var/log/varnish/varnish50x.log -q "RespStatus >= 500 or BerespStatus >= 500"

FYI, we also use nginx SSL --> varnish --> nginx, but the last nginx is a separate server, all on ports 80 and 443, with the A record pointed to the nginx SSL server.

We do not have 503 errors anymore, except these in admin: https://prnt.sc/qe81q9 which I am still debugging.

Anyway, is anyone having the "workspace (bo)" issue together with the restart issue fix?

Hello @weismannweb, I'm not sure if I understand it completely (lack of time atm); however, after using the updated VCL we had 0 critical errors. So I don't think we hit that error (if I understand correctly, that's a critical one too).

PS: I see we have this as a default in our optimized settings: "-p workspace_backend=320k"
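For reference, those workspace sizes are ordinary varnishd run-time parameters; a quick way to inspect them (320k is simply the value quoted above, not a universal recommendation):

```
# Inspect the current workspace sizes
varnishadm param.show workspace_backend
varnishadm param.show workspace_client

# They are raised at startup via -p flags on the varnishd command line,
# e.g. -p workspace_backend=320k -p workspace_client=256k
```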

Have the same issue as described. You don't even need to browse the website. The website has ~5 products and 7 CMS pages.
I simply opened the homepage once. Within an hour there was already an inline 503 error (returned by Varnish as part of the content) instead of the menu. (That was on Friday.) After the weekend the site was totally down and we got a 503 error from Varnish.
The issue reproduced on staging and local environments (but with the same Docker configuration).

Let me re-open this issue. It seems that the error only occurs after a while, but @engcom-Delta only took a few minutes to test, so that's not really representative.

@zhartaunik @weismannweb
Do you have clear steps to reproduce?

BIG DISCLAIMER: I am totally new to Varnish cache with Magento 2.3, which we used on this project for the first time as it was a large and heavily trafficked site, so what I write below is a total guess. Please bear that in mind.

I think it depends on which of our changes fixed it. I think the restarts code fixed it, but then I got the workspace error; I am not sure if they are related or separate. If the actual end cause is this https://www.claudiokuenzler.com/blog/737/varnish-panic-crash-low-sess-workspace-backend-client-sizing and the code "if (req.restarts > 0) { set req.hash_always_miss = true; }" fixes the restarts, then once the restarts no longer cause the 503 error, the workspace runs out from too many restarts. In that case I would look to this statement as to how to reproduce:

We found out we had to increase the default (16kb), especially since we're doing quite a bit of HTTP header copying and rewriting around. In fact, if you do that, each varnish thread uses a memory space at most sess_workspace bytes.

If you happen to need more space, maybe because clients are sending long HTTP header values, or because you are (like we do) writing lots of additional varnish-specific headers, then Varnish won't be able to allocate enough memory, and will just write the assert condition on syslog and drop the request.

Mentioned here http://www.streppone.it/cosimo/blog/2010/03/varnish-sess_workspace-and-why-it-is-important/

Which I think indicates you have to have a large number of headers and/or be manipulating them heavily. Also, my site has a lot of redirects happening, which might add to it.

Also, note we have a store with 5000 products and 350 categories, many extensions, and several layered navigation options on each category page.

Here is one of our varnish logs with 50x errors before we made the final fix.

https://www.dropbox.com/s/4a8mlaj03wjl6up/varnish50x.log-old2?dl=0

Here is a working vcl but it might be useful to see what we are doing with headers:
https://www.dropbox.com/s/ouq5tmhwdgwouew/default.vcl?dl=0

Here is our varnish settings for system d with it now working:
ExecStart=/usr/sbin/varnishd -a :80 -T 127.0.0.1:8080 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,4096m -p thread_pool_min=200 -p thread_pool_max=4000 -p thread_pool_add_delay=2 -p http_req_size=64000 -p http_resp_hdr_len=90536 -p http_resp_size=120536 -p cli_timeout=25 -p workspace_client=256k -p workspace_backend=256k

That is about all I can add. I have had a cron run varnishlog -a -A -w /var/log/varnish/varnish50x.log -q "RespStatus >= 500 or BerespStatus >= 500" 24x7 and we have yet to get a single 50x error with the restart code fix and the workspace fix.

Experienced the same issue. In my case it was the primary nav, loading via ESI, that was experiencing the "too many restarts" error loop and showing the default Varnish 503 error. It took a good amount of time before it started showing up.

I used the fix from https://github.com/magento/magento2/issues/24353#issuecomment-526098763 and so far that seems to have fixed the issue.

Testing setup idea

The vcl_hit{} restart happens when TTL and grace are expired. The default M2 TTL is 86400, so it seems unlikely that a fresh install is going to experience the issue relatively quickly. Perhaps change the M2 TTL to something like 2 minutes and grace to 10s, and try browsing the site for longer than 10 min. You could also try some cache refreshes/purges.
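If you'd rather script that TTL change than click through the admin UI, something like this should work; the config path below is the standard full-page-cache one, but verify it against your install:

```
# Lower the FPC public-content TTL to 120 seconds for testing
bin/magento config:set system/full_page_cache/ttl 120
bin/magento cache:flush
```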

M2 changes for Varnish 6 VCL

It looks like when generating the VCL for Varnish v6, M2 changed the vcl_hit{} "return(miss)" to "return(restart)" because it was removed in 6.2, but the generated VCL does not include the other half of the 6.2 upgrade with the change in vcl_recv{}: https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#vcl

At the very least it seems like the vcl_recv{} part needs to be included. I can't easily think of why it wouldn't always go into an infinite restart loop without that. I believe req.hash_always_miss is the desirable setting in this case to maintain backend request collapsing.

@engcom-Delta could you try to reproduce the issue using the additional info from @robolmos?

@HOSTED-POWER I have the same setup. I am not seeing 503 errors, but I am experiencing frequent invalidations. https://github.com/magento/magento2/issues/26341
TTL is set to 1 week.

Experienced the same issue. In my case it was the primary nav, loading via ESI, that was experiencing the "too many restarts" error loop and showing the default Varnish 503 error. It took a good amount of time before it started showing up.

I used the fix from #24353 (comment) and so far that seems to have fixed the issue.

Does not work; with the latest VCL on Magento v2.3.4, still the error:

-- VCL_call HIT
-- VCL_return restart
-- VCL_Error Too many restarts

@suwubee both VCLs are the same. There is no difference.

First the menu will display a 503; after a few hours, the whole website will 503.

-- ReqHeader grace: none
-- ReqHeader Accept-Encoding: gzip
-- VCL_call RECV
-- ReqUnset grace: none
-- ReqHeader grace: none
-- ReqURL /page_cache/block/esi/blocks/%5B%22catalog.topnav%22%5D/handles/WyJkZWZhdWx0IiwiY21zX2luZGV4X2luZGV4IiwiY21zX3BhZ2VfdmlldyJd/
-- ReqUnset Accept-Encoding: gzip
-- ReqHeader Accept-Encoding: gzip
-- VCL_return hash
-- VCL_call HASH
-- VCL_return lookup
-- Hit 491525 -3913.713074 259200.000000 0.000000
-- VCL_call HIT
-- VCL_return restart
-- VCL_Error Too many restarts
-- Timestamp Process: 1581421533.698961 0.000333 0.000060
-- RespHeader Date: Tue, 11 Feb 2020 11:45:33 GMT
-- RespHeader Server: Varnish
-- RespHeader X-Varnish: 2687046
-- RespProtocol HTTP/1.1
-- RespStatus 503
-- RespReason Service Unavailable
-- VCL_call SYNTH
-- RespHeader Content-Type: text/html; charset=utf-8
-- RespHeader Retry-After: 5
-- VCL_return deliver
-- RespHeader Content-Length: 281
-- Storage malloc Transient
-- Timestamp Resp: 1581421533.699061 0.000433 0.000100
-- ReqAcct 0 0 0 0 28

@ihor-sviziev Are we supposed to remove the reference to pub from lines 67, 102, and 206 of the VCL, like we remove it from line 13, if the Magento root directory is pub? Kindly advise.

@ihor-sviziev Are we supposed to remove the reference to pub from lines 67, 102, and 206 of the VCL, like we remove it from line 13, if the Magento root directory is pub? Kindly advise.

@monotheist That's the way I've been doing it since I couldn't find an option to remove the pub path. For single-server setups I've been excluding the health probe as well.

@robolmos Thanks for responding. Can you kindly confirm that lines 102 and 206 should look like the following? I have removed (pub/)?

if (req.url ~ "^/(media|static)/") {
if (resp.http.Cache-Control !~ "private" && req.url !~ "^/(media|static)/") {

@ihor-sviziev Are we supposed to remove the reference to pub from lines 67, 102, and 206 of the VCL, like we remove it from line 13, if the Magento root directory is pub? Kindly advise.

Sorry, could you give a link to a file in the specific commit that you're talking about? File content might be different across different commits/branches/releases.

@ihor-sviziev Are we supposed to remove the reference to pub from lines 67, 102, and 206 of the VCL, like we remove it from line 13, if the Magento root directory is pub? Kindly advise.

Sorry, could you give a link to a file in the specific commit that you're talking about? File content might be different across different commits/branches/releases.

Here it is. I think VCL hasn't changed since 2.3.3.
https://github.com/magento/magento2/commit/882379061809b806fc5580729a6c9c78d782f84d#diff-2f64f6171deecba61bea147539cf72ec

@robolmos Thanks for responding. Can you kindly confirm that lines 102 and 206 should look like the following? I have removed (pub/)?

if (req.url ~ "^/(media|static)/") {
if (resp.http.Cache-Control !~ "private" && req.url !~ "^/(media|static)/") {

Can someone please confirm the above is correct?

@monotheist In theory it should be correct, but I'm not sure whether it's related to the issue listed above.

Same issue. Since Magento 2 is using VCL 4.0, I changed my Docker image to million12/varnish, which is Varnish 4, for now.

@engcom-Alfa @engcom-Bravo @engcom-Foxtrot could you review this issue again? It looks really critical to me.

Confirming that we are having the same issue here too, on v2.3.3.

With varnish-6 using the unmodified VCL, the megamenu will disappear within a couple of hours, returning a 503 error in varnishlog, and then a few hours after that, the entire site will 503.

Downgrading to varnish-5 functions perfectly fine again, as does using varnish-6 and inserting that if (req.restarts > 0) { set req.hash_always_miss = true; } into the vcl_recv section.

Definitely looks like an issue in the default VCL.

Just experienced the same issue with Varnish 6. That condition does seem to fix the issue so far. I would suggest editing the varnish6.vcl to include that when generated. See patch code below.

Index: app/code/Magento/PageCache/etc/varnish6.vcl
===================================================================
--- app/code/Magento/PageCache/etc/varnish6.vcl (revision 7aa94564d85e408baea01abc5315a0441401c375)
+++ app/code/Magento/PageCache/etc/varnish6.vcl (date 1582819203827)
@@ -23,6 +23,10 @@
 }

 sub vcl_recv {
+    if (req.restarts > 0) {
+        set req.hash_always_miss = true;
+    }
+
     if (req.method == "PURGE") {
         if (client.ip !~ purge) {
             return (synth(405, "Method not allowed"));

Actually, this does not seem to be a stable fix. If you try refreshing the same page over and over, after a few requests it will fail and Varnish will start a new child.

Is there any update to this issue?

@sdzhepa could anyone from the QA team review this issue? In my view it is really important to fix.

I tested @drew7721's solution and it solves the issue for me.

Hi @engcom-Alfa. Thank you for working on this issue.
In order to make sure that issue has enough information and ready for development, please read and check the following instruction: :point_down:

  • [ ] 1. Verify that issue has all the required information. (Preconditions, Steps to reproduce, Expected result, Actual result).
    Details: If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please, edit issue description if needed, until label Issue: Format is valid appears.
  • [ ] 2. Verify that issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add Issue: Clear Description label to the issue by yourself.

  • [ ] 3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.

  • [ ] 4. Verify that the issue is reproducible on 2.4-develop branch

    Details:
    - Add the comment @magento give me 2.4-develop instance to deploy test instance on Magento infrastructure.
    - If the issue is reproducible on 2.4-develop branch, please, add the label Reproduced on 2.4.x.
    - If the issue is not reproducible, add your comment that issue is not reproducible, close the issue, and _stop verification process here_!

  • [ ] 5. Add label Issue: Confirmed once verification is complete.

  • [ ] 6. Make sure that automatic system confirms that report has been added to the backlog.

The issue is reproducible on fresh 2.4-develop.

_Preconditions:_

  • Magento 2.4-develop;
  • Production mode;
  • Sample Data;
  • PHP 7.3;
  • Varnish v. 6.2

_Manual testing scenario:_

  1. Configure Varnish and the web server (Apache 2 in my case) per devdocs;
  2. Configure Magento to use Varnish

Go to Admin->Stores->Configuration->System->Full Page Cache:

  • set Cache Application to Varnish Cache;
  • set TTL for public content to 120 sec;

  3. Export the Varnish 6 VCL, configure it, and restart the varnish service;
  4. Run _varnishlog_ in a console;
  5. Go to the Storefront and try browsing the site for longer than 10 min.

Actual Result:

:heavy_multiplication_x: 503 Response status and VCL error: Too many restarts

_Varnish logs_ (screenshots)

:white_check_mark: Confirmed by @engcom-Alfa
Thank you for verifying the issue. Based on the provided information, internal ticket MC-33804 was created

Issue Available: @engcom-Alfa, _You will be automatically unassigned. Contributors/Maintainers can claim this issue to continue. To reclaim and continue work, reassign the ticket to yourself._

Hi @HOSTED-POWER @drew7721,
This issue was finally confirmed! 🎉
Could you create a Pull Request that fixes it? It seems you already have a working solution.

Thank you!

I also checked the possible solution to the problem from @drew7721 in the comment above, and it looks like it solves the problem.

@ihor-sviziev it seems confirmation took quite a while; happy it's finally getting confirmed :)

I've created a PR fixing this issue: https://github.com/magento/magento2/pull/28137

Hey everyone, while the if (req.restarts > 0) { set req.hash_always_miss = true; } patch suggested above does solve that problem, it doesn't solve it when you have a distributed deployment with multiple FE instances:

The problem happens when you add/remove backends in varnish.vcl and reload the varnish service (emphasis on reload, not restart, to keep everything in cache and just reconfigure): backend fetches fail for a short interval (10-ish seconds for us), resulting in HTTP 503 for users that didn't hit the cache.

The fix for that is to set N to a number equal to max_restarts - 1:

    if (req.restarts > N) {
        set req.hash_always_miss = true;
    }

P.S. You can check the max_restarts value using the varnishadm command.
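For instance (param.show is a standard varnishadm subcommand; max_restarts defaults to 4 in recent Varnish versions):

```
# Show the current max_restarts parameter
varnishadm param.show max_restarts
```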

Oh, we've also added the following snippet to force retry up to max_restarts - 1 times (to give Varnish time to see the backend is unhealthy using the probe in the backend config):

```
# cache only successful responses and 404s
# (the <= 500 guard keeps 5xx out of this branch, otherwise the
# retry below would be unreachable)
if (beresp.status != 200 && beresp.status != 404 && beresp.status <= 500) {
    set beresp.ttl = 0s;
    set beresp.uncacheable = true;
    return (deliver);
} elsif (beresp.status > 500) { // THIS IS ADDED
    return (retry); // THIS IS ADDED
} elsif (beresp.http.Cache-Control ~ "private") {
    set beresp.uncacheable = true;
    set beresp.ttl = 86400s;
    return (deliver);
}
```

Edit:

Note that this kind of setup uses Varnish Transient storage (a short-lived cache), and if you don't set a memory limit for that storage, it will eat up your RAM and eventually crash the server (source: https://varnish-cache.org/docs/trunk/users-guide/storage-backends.html, search for "By default Varnish would use an unlimited malloc backend for this."), so make sure to edit your Varnish startup script and give the Transient storage a limit.

E.g.:

/usr/sbin/varnishd -a :6081 -f /etc/varnish/default.vcl -s Cache=malloc,2048m -s Transient=malloc,512m
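If Varnish runs under systemd, one way to apply those flags is a drop-in override rather than editing the packaged unit; this is a sketch, and the unit name and paths vary by distro:

```
# /etc/systemd/system/varnish.service.d/override.conf (hypothetical path)
[Service]
# Clear the packaged ExecStart, then set our own with named storages
ExecStart=
ExecStart=/usr/sbin/varnishd -a :6081 -f /etc/varnish/default.vcl -s Cache=malloc,2048m -s Transient=malloc,512m
```

Then run systemctl daemon-reload && systemctl restart varnish to pick it up.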

P.S. Thanks @robolmos for pointing out the return(retry) issue; I've updated the example above.

@lotar I'm not a Varnish expert, but you might want to look at other solutions like compiling the VCL, letting the backends register as healthy, then loading the VCL. Or update the backends config to be healthy initially.

Maybe it's OK to retry on a server-side error to help prevent the client getting transient backend errors. In vcl_backend_response() I believe it's technically return(retry), and it uses the max_retries value rather than max_restarts.

Hey @robolmos,

Thanks for the feedback.

@lotar I'm not a Varnish expert, but you might want to look at other solutions like compiling the VCL, letting the backends register as healthy, then loading the VCL. Or update the backends config to be healthy initially.

Nor am I, but compiling/reloading the VCL after a new backend is healthy is not an option for us given the rest of the setup.

Maybe it's OK to retry on a server-side error to help prevent the client getting transient backend errors. In vcl_backend_response() I believe it's technically return(retry), and it uses the max_retries value rather than max_restarts.

Fixed, ty ;)

@lotar should I update my PR https://github.com/magento/magento2/pull/28137 ?

@lotar should I update my PR #28137 ?

@ihor-sviziev honestly I think there's no need, since my update was specific to our infrastructure setup (auto-scaling group issue). While it does solve the problem for us, it won't necessarily be a 100% correct solution (or even needed) for different kinds of setups.

Also, rethinking the problem, it would be better to do
} elsif (beresp.status == 503) {
instead of
} elsif (beresp.status > 500) {
since it's here for a specific reason...

What I'd suggest, on the other hand, is to update the official documentation for the Varnish 6 setup regarding the Transient storage explained here.

The reason being that a default Varnish installation has no memory limit on that storage, and the fix from the PR actually uses this kind of storage.

It is an infrastructure concern as well (Varnish startup setup), but if not set correctly, in combination with this VCL (from the PR) it will eventually cause the Varnish service to eat up RAM and lead to the site going down.

To conclude, I'd say it's up to you ;)

Hi @HOSTED-POWER. Thank you for your report.
The issue has been fixed in magento/magento2#28137 by @ihor-sviziev in the 2.4-develop branch
Related commit(s):

The fix will be available with the upcoming 2.4.1 release.
