Laravel-websockets: Multiple App Servers & Load Balancer Help

Created on 10 Sep 2020 · 57 Comments · Source: beyondcode/laravel-websockets

Hello, I'm after clarification or pointers on how best to approach laravel-websockets with my setup. I run all my servers under Laravel Forge and use their load balancers, which are Nginx servers acting as reverse proxies.

My setup is 1 load balancer, 4 app servers (same codebase), a dedicated MySQL server and a dedicated Redis server. Previously I ran Laravel Echo Server on the same server as the Redis DB and the 4 app servers were able to communicate. However, my project now needs sockets on mobile, which led me here due to the Pusher SDK implementation.

I'm wondering how best to deploy laravel-websockets with my app. Should I run a dedicated server instance with only laravel-websockets and proxy all 4 servers there, so there's a single websockets server on a subdomain, or should each of the 4 app servers be running websockets:serve?

I do intend to use Redis once v2 is ready, however I am still wondering which of the 2 solutions would be best?

bug

All 57 comments

Running laravel-websockets on all 4 servers is broken in the current implementation and should not be used at all. You can use 2.x; make sure to also check the 2.x PR, which explains how to migrate after the rewrite (yes, it was rewritten from scratch).

Which branch do you recommend that I use? Using 2.0.0-beta.18, running sockets on each server works to a degree, but when broadcasting, it intermittently sends to one or another of the servers behind the load balancer. Basically it's not broadcasting to all as expected.

I tried to pull a few of the latest commits but they're causing errors when running.

production.ERROR: Class name must be a valid object or a string {"exception":"[object] (Error(code: 0): Class name must be a valid object or a string at /home/vagrant/projects/mysite/vendor/beyondcode/laravel-websockets/src/Console/Commands/StartServer.php:126)

I did notice your notes about writing from scratch, is this something that's going to happen now?

The 2.0.0-beta.20 version got improved for horizontal scalability, but you need Redis.

I will be using Redis so will check this out. I have it working as is but it's not exactly how I want it so hopefully the update solves the issue I have.

My dev environment has a load balancer, 2 app servers and a dedicated Redis server. I want to broadcast events across the app for many user actions, such as company settings so all users within that company instantly have the most up to date settings.

When I click update, the event is not broadcast across all servers in unison; it's only received by some and not others. If I click update multiple times, the broadcast is received after every second click, which appears to be the load balancer (round robin) selecting which server the connection goes to. I am using Redis and running websockets:serve on both servers and it still happens.

My workaround was to edit the load balancer's Nginx config to push all /app connections to a single server. With that it works fine and Redis is doing its thing.

    location /app {
        proxy_pass             http://main_server_ip:8444;
        proxy_set_header Host  $host;
        proxy_read_timeout     60;
        proxy_connect_timeout  60;
        proxy_redirect         off;

        # Allow the use of websockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_cache_bypass $http_upgrade;
    }

I will check out the latest beta build to see if this has changed anything.

The idea behind the Redis replication that this package provides is that when a message is broadcast to a channel on one instance, that instance automatically takes a copy of it and streams it to the other instances using Pub/Sub. I have attached a diagram.
[Image: Untitled Diagram (3)]
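
The replication idea described above can be sketched without a real Redis server. This is only a minimal simulation: `FakePubSub` and `WsNode` are hypothetical names standing in for Redis Pub/Sub and a websockets server instance, and none of it is the package's actual code. It shows a message broadcast on one node reaching a client connected to another node exactly once.

```javascript
// Hypothetical in-memory stand-in for Redis Pub/Sub (not the package's code).
class FakePubSub {
  constructor() {
    this.listeners = [];
  }
  subscribe(listener) {
    this.listeners.push(listener);
  }
  publish(message) {
    // Every subscribed node receives a copy, including the publisher.
    for (const listener of this.listeners) listener(message);
  }
}

// Hypothetical websockets server instance behind the load balancer.
class WsNode {
  constructor(name, bus) {
    this.name = name;
    this.bus = bus;
    this.clients = new Map(); // channel -> list of per-client inboxes
    bus.subscribe((message) => this.onReplicated(message));
  }
  connect(channel) {
    const inbox = [];
    if (!this.clients.has(channel)) this.clients.set(channel, []);
    this.clients.get(channel).push(inbox);
    return inbox;
  }
  deliverLocally(channel, payload) {
    for (const inbox of this.clients.get(channel) ?? []) inbox.push(payload);
  }
  broadcast(channel, payload) {
    // Deliver to local clients, then stream a copy to the other nodes.
    this.deliverLocally(channel, payload);
    this.bus.publish({ origin: this.name, channel, payload });
  }
  onReplicated({ origin, channel, payload }) {
    // Skip our own message so local clients are not notified twice.
    if (origin === this.name) return;
    this.deliverLocally(channel, payload);
  }
}

const bus = new FakePubSub();
const server1 = new WsNode("server-1", bus);
const server2 = new WsNode("server-2", bus);
const alice = server1.connect("team.1"); // landed on server 1
const bob = server2.connect("team.1");   // landed on server 2

server1.broadcast("team.1", "settings-updated");
console.log(alice, bob);
```

In this sketch the origin check is what prevents duplicates; whether the package deduplicates the same way is an assumption.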

That's exactly how I hope it will work but it wasn't doing this as of my testing of 2.0.0-beta.18 (I think) which required the Nginx config on the load balancer.

I will try 2.0.0-beta.20 later today to see if this now works.

I've just tested 2.0.0-beta.20 and it's the same issue behind a load balancer. Only some broadcasts are being received depending on which server the user has landed on.

I tried removing the Nginx config from the load balancer and adding the following to each app server instead, in the hope that it would resolve the issue, but it hasn't.

    location /app {
        proxy_pass             http://127.0.0.10:8444;
        proxy_set_header Host  $host;
        proxy_read_timeout     60;
        proxy_connect_timeout  60;
        proxy_redirect         off;

        # Allow the use of websockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_cache_bypass $http_upgrade;
    }

The only way I can get this working is to change only the load balancer's Nginx config to the block below, but that means all broadcasts are sent from a single server, although they are still received on all the others. I've tried numerous different approaches but it just isn't broadcasting in unison and only sends to one or the other server.

    location /app {
        proxy_pass             http://main_server_ip:8444;
        proxy_set_header Host  $host;
        proxy_read_timeout     60;
        proxy_connect_timeout  60;
        proxy_redirect         off;

        # Allow the use of websockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_cache_bypass $http_upgrade;
    }

This approach is still slightly better than having a dedicated sockets box, but it is not working as intended, unless I am approaching this wrong or have my configs mixed up somewhere.

Have you set the replication mode in websockets to redis via WEBSOCKETS_REPLICATION_MODE=redis?

Yes. Have you tried the setup in your diagram, or is that just how it should work in theory?

I also reverted all my server settings back to how they were before and it's back to being intermittent, both with users connecting and with broadcasts. I went back to my local env, so it's just a single server, and with 2.0.0-beta.20 both .joining() and .leaving() are not triggering properly for other connected users, so they no longer know who's online. I went back down to 2.0.0-beta.19 and the events work again. Perhaps this is having a knock-on effect on how the environment with the load balancer behaves.

The diagram is a brief outline of what actually happens. It was made side by side with the code. I am currently investigating how this can occur and what additional setup might cause the problem. The .joining() and .leaving() methods not working is caused by the multi-node setup not replicating properly across your nodes. I can also confirm from the tests that receiving via Pub/Sub takes place.

Is it possible that, for some reason, your Redis instance does not support Pub/Sub?

Isn't 2.0.0-beta.19 running Pub/Sub? That version works as expected with .joining() and .leaving() both locally and on a Forge deployed setup.

I went back up to 2.0.0-beta.20 on my local env again. .joining() and .leaving() only trigger maybe once or twice for a logged-in user when another user refreshes their page. After repeated refreshes, the events no longer trigger.

2.0.0-beta.19 I can refresh over and over, .joining() and .leaving() both trigger as expected for all users.

I just tried the 2.0.0-beta.21 from fresh and still got the issue with .joining() and .leaving() not triggering after the first refresh. I tried on my local env with both local and redis.

In the console debug output I can see the connections being made, but they are not triggering the Echo events. 2.0.0-beta.19 still seems to be working the best for me.

If it's not working with local, then it seems to be an issue on your side and I can't properly identify what could cause it. Do you have any demo app I can try it on? It doesn't have to disclose your actual project.

Make sure your config file matches the one that comes with -beta.21.

Why would it be working as expected for -beta.19 but broken for .20 and .21, both on a local env built on the latest Homestead release using Laravel 8.3, and on Forge using a brand new Ubuntu 20.04 (LTS) x64 server with PHP 7.4 and the latest Redis? Have there been any fundamental changes that require specific environment configs or software builds?

Unfortunately this implementation is for a project that I cannot disclose due to an NDA, which doesn't help. I'll see if I can strip it right back without disclosing the project; failing that, I'll go back to the original Echo Server setup.

I got round the load balancer issue by pushing all /app requests to a single server but still maintained round robin. However, the main issue I'm having with -beta 20,21 now is that Laravel Echo .joining() and .leaving() events are not being triggered at all after maybe the first couple firings. This is happening on Local env, single server Forge and multi-server Forge setup, all using Redis. Also fails to trigger using local.

I have tried a very simple Echo call to check and after maybe the second or third refresh, the events stop firing.

    window.Echo.join('team.' + this.team.id)
        .here((users) => {
            console.log('here', users);
        })
        .joining((user) => {
            console.log('joining', user);
        })
        .leaving((user) => {
            console.log('leaving', user);
        });

I did notice in beta 20 and 21 that the Redis sets for {appID}channels and sockets were disappearing randomly from the DB, but this doesn't happen with beta 19.

In -beta.20 I implemented a system that deletes stored info in Redis if the server process is shut down unexpectedly (no SIGTERM/SIGINT).

The idea behind it: if a server has local connections and suddenly drops, using Redis as the replication mode might leave stale data in Redis (such as user data from presence channels, or increments for the Redis collector, which can lead to wrong statistics). So if a connection hasn't sent a pong in the last 2 minutes, it is marked as inactive, in line with the Pusher docs. The server then mocks the connection details and unsubscribes it from everywhere, which deletes the obsolete data AND decrements the stats values within the collector: https://github.com/beyondcode/laravel-websockets/compare/2.0.0-beta.19...2.0.0-beta.20#diff-1a793d12ff9c279a1c1d734f6a00a54cR411-R427

The fact that this test passes makes me think something is misconfigured: https://github.com/beyondcode/laravel-websockets/blob/2.0.0-beta.21/tests/PresenceChannelTest.php#L64-L99

I think I'm going to set up a demo Laravel 8.x repo, configure it, and perhaps try it somewhere in your env to see further what's going on.

@rennokki Can we also address the issue with the load balancing? I deleted my previous comment since it was a mistake on my part to think I had it fixed. Why did @AugmentBLU have to set up forwarding for the /app path in nginx? What was your setup in your testing, did you have to do the same? The comment I'm referring to is: https://github.com/beyondcode/laravel-websockets/issues/517#issuecomment-692698236

@SelimSalihovic Having a demo application should be enough to detect what's going on.

@rennokki let me know if you need help setting up the demo repo/app. I'm open to help.

The {appID}channels and sockets sets are being deleted almost instantly after creation from a page load/refresh. Is that expected behaviour?

I had a look at my Redis config and the tcp-keepalive is set to 300 seconds, should I be checking any other settings?

I have detected issues with the presence channels that were more or less linked to bad filtering of users. Also, connections do randomly disappear from the sorted set, and I'm investigating that right now, since it is the main reason the events are not broadcast at all with Redis replication. I can clearly see that the local replication works fine.

Alright, glad you managed to find something that might point towards the issues. Hopefully fixes can be found.

To sum it up, the issues were caused by the fact that I mistakenly used zrange() instead of zrangebyscore(), causing Redis to return all connections instead of only those that had gone more than 2 minutes without a single pong and were due for cleanup. This has to do with the Redis replication, and I really hope that -beta.22 is going to fix the issues you had. I have tested it locally and I will update the package name and push the demo project so y'all can play with it using the local connection (although the Redis connection will hopefully work too).
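
The difference between the two lookups can be illustrated with a small in-memory model. This is a sketch under assumptions: the `Map` below stands in for a Redis sorted set whose scores are last-pong timestamps in seconds, and `rangeAll` and `rangeByScore` are hypothetical helpers that mimic `ZRANGE key 0 -1` and `ZRANGEBYSCORE key min max`. The package's real key names and score format may differ.

```javascript
// Hypothetical in-memory model of a Redis sorted set: member -> score.
// Assumption: the score is the time of the last pong, in seconds.
const lastPong = new Map([
  ["socket-a", 1000], // 200 s before "now": stale
  ["socket-b", 1150], // 50 s before "now": still alive
  ["socket-c", 1075], // 125 s before "now": stale
]);

// Like ZRANGE key 0 -1: every member, ordered by score, no filtering.
function rangeAll(set) {
  return [...set.entries()]
    .sort((a, b) => a[1] - b[1])
    .map(([member]) => member);
}

// Like ZRANGEBYSCORE key min max: only members with min <= score <= max.
function rangeByScore(set, min, max) {
  return [...set.entries()]
    .filter(([, score]) => score >= min && score <= max)
    .sort((a, b) => a[1] - b[1])
    .map(([member]) => member);
}

const now = 1200;
const staleBefore = now - 120; // no pong for 2 minutes

// The unfiltered range marks every connection for cleanup.
const wronglyRemoved = rangeAll(lastPong);
// The score-filtered range only returns the stale connections.
const correctlyRemoved = rangeByScore(lastPong, -Infinity, staleBefore);

console.log(wronglyRemoved);   // [ 'socket-a', 'socket-c', 'socket-b' ]
console.log(correctlyRemoved); // [ 'socket-a', 'socket-c' ]
```

With the unfiltered range, the live connection `socket-b` is cleaned up along with the stale ones, which matches the symptom of sets disappearing shortly after a page load.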

Nice one, I have it working on my local env using redis - I haven't tried using the local provider as I will not be needing this.

Next I'll try on my multi-server Forge setup with the load balancer. I will see if I still need the Nginx config to route /app requests to the same server as well.

I don't suppose you have tried a multi server setup with a load balancer have you?

I did but only using AWS's ALB within Elastic Beanstalk. It helps me to open up the port and route the 6001 SSL to 6001 Non-SSL internally without too much hassle.

I'll give it a shot in the next couple of hours. I'm using Laravel Forge and their Nginx-based load balancers. The reason is that I'm using Cloudflare DNS, and trying to get sub-subdomains working with SSL is painful. I'm also using DigitalOcean. I did try their own load balancers, but they are so bad and many people have issues with them, so I stick with Forge's.

I open port 8443, as this is one of the ports Cloudflare supports sockets on. It hits the load balancer and I route any /app connections to http://private_network_ip:8444. I am running the websockets on port 8444 internally; the load balancer has the SSL only, and the app servers are on a private network so they only use http (basically SSL termination at the LB). Everything works this way, however broadcasts sent from user A seem to be received quite randomly by other users, and even by user A in another tab, as that tab may have hit another server due to round robin.

Anyway, I'll update once I get round to pushing and testing on this setup.

The setup seems fine: you keep the LB and internal ports the same, with SSL termination at the LB and non-SSL for the internal backend. Round robin should not affect the way the message gets sent. As long as a server is up, it connects to Redis and listens for Pub/Sub messages if it has users in a specific channel and app ID.

I have just set up the demo project with 2 servers (ec2 instances) with an AWS application LB in front of them. I am still seeing the same thing happen, so only some of the messages get sent through. This is what I noticed:

  • When I open the website on one device and it connects to the socket on 6001 on server 1, and on the other device I open the page and get connected to server 2 on 6001, the list of users online on the frontend only shows 1 user (the currently logged-in user). Any messages sent will not appear on the other user's screen.
  • However, when they are connected to the same server's socket connection, I see 2 users on the users list (on both screens) and the messages sent appear on both users' screens.

Maybe I can share the demo URL with you but I have to wait for a response from my supervisor. Maybe that could help.

@SelimSalihovic That's what I was seeing when I was testing, hence this post. The only way round it that I found was to run websockets:serve --port=8444 on all app servers and then configure my LB's Nginx to push all /app requests to a single server. Each server still needs websockets running, otherwise it will error with "Pusher server couldn't be reached" or something similar.

@AugmentBLU I understand that, but I got the feeling that the latest beta (22, or any subsequent updates) would also have removed the need for such a setup i.e. for the LB to push all /app requests to the main server. I hope I'm not misunderstanding anything.

@rennokki I've sent an email with our testing URL to your private e-mail address if you choose to test it on our URL to see the issue.

@SelimSalihovic Yeah it would be nice if it worked that way, hopefully it will but at least there is a partial workaround to at least get things working with Redis and multiple servers. Here's hoping!

Now I see that the NGINX configuration posted here is hardcoded with a single server IP that gets all the requests. However, this is not a good configuration or approach, and it puts too much pressure on the one server that has to handle all the incoming requests to that /app route.

The proper solution is to let all the servers use the same NGINX configuration that proxies the /app route to the WebSockets server so that the Load Balancer evenly distributes all the incoming connections and requests across servers while the Redis Pub/Sub does its thing. I doubt that this has already been tried and I was talking into the void here.

In this case, you might not need a /app path at all. If you can open one port supported by CF (for example 2096 or 8443), you can just use the IP and port, and the LB will make sure the request gets to a server. I am currently working on putting this approach together to showcase the idea.

I have tried setting my LB up as standard, with no /app request config. Each app server's config has the /app request, which is pushed to its own WS server with proxy_pass http://127.0.0.1:8444; but we hit the issue that broadcasts are not sent to all users on all servers. Instead it's intermittent and only received by those on the same server the broadcast is emitted from, the same issue that @SelimSalihovic has witnessed.

That's why the LB was configured to route all requests to a single server, but like you said will cause pressure on that single server - however it's the only way around this issue so far.

I've still to test the latest beta version on this setup though, probably won't be till later today now as I need to sort something else.

In your examples, are you using the same Redis instance?

Yes, I have a dedicated server for Redis that each app server connects to. I have Redis running for all session/cache/queue work for Laravel and it works great, so it is set up correctly. It also worked flawlessly with Echo Server when I was using that. I will test in a couple of hours and let you know.

I've just tested and still got issues with the broadcasting side.

Both app servers' Nginx configs include:

    location /app {
        proxy_pass             http://127.0.0.1:8444;
        proxy_set_header Host  $host;
        proxy_read_timeout     60;
        proxy_connect_timeout  60;
        proxy_redirect         off;

        # Allow the use of websockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_cache_bypass $http_upgrade;
    }

Load balancer is standard Forge Nginx load balancer, set to round robin.

When 2 different users are logged in using 2 browsers, things are working as expected. One refreshes the page, the .leaving() and .joining() events trigger as expected on the other's browser, however when a button is clicked that triggers a broadcast event, it's still not consistently being sent to all users, no matter which server they are currently connected to.

With multiple clicks of the button, triggering the broadcast event multiple times, it eventually gets through, but sometimes it's duplicated and sometimes it doesn't appear.

I tested using IP Hash, which gives sticky sessions, and it sends the broadcast as expected to all those connected to the same server. However, I then used Tor Browser to connect from a different IP and managed to land on a different server from the other connected users. In that case the broadcast is duplicated, so notifications are received twice every time.

I went back to routing all /app requests at the LB level and removed the /app request config from each app server, and it works again, but of course a single server is back to dealing with all broadcasts. I really don't see any other way to fix this. The general here, joining and leaving events are working but Laravel broadcasts are not.

@AugmentBLU @SelimSalihovic So I found out that the double messages happen when the messages are sent out on a node that has channels. Basically, the message didn't get streamed further because it wasn't streamed at all: if there were users locally, the node would send it to them and stop there, without streaming it further. The latest 2.x-dev should do the job; I'll tag a release around noon after I come back from the gym. Here's the commit that should solve everything: https://github.com/beyondcode/laravel-websockets/pull/447/commits/7519da4a08f062e9983a4e3e1f698b8e4e8ca83d

I just tested 7519da4a08f062e9983a4e3e1f698b8e4e8ca83d on my Forge setup and seems to be working pretty well. I have both app servers running /app requests to their own local hosted websocket server and the broadcasts are being made across all servers, without duplication. I have some further testing to do but seems good so far, great work.

I have been able to replicate the same success but I have some remarks. I also ran it with routing /app requests to their own local hosted websocket server.

  • For this to work I also had to enable session stickiness in the round robin setup (I had previously disabled it). I wonder if I even need to route /app requests to their own server if session stickiness is enabled.
  • I also noticed that the users list is always empty for me, on both browsers. It used to be populated on beta22.

Either way, the broadcasts work without flaws. Thanks for the hard work so far.

@SelimSalihovic Session stickiness is a must for WebSockets when scaling horizontally. It makes sure the client sticks to the same node when going back and forth, so it keeps using the same active connection instead of closing it and opening a new one on another node behind the LB.
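
To illustrate why stickiness matters, here is a minimal sketch that contrasts round robin with an IP-hash style selection. The server names, the client address and the simple string hash are assumptions made for illustration; this is not Nginx's actual ip_hash algorithm.

```javascript
const servers = ["app-1", "app-2"]; // hypothetical upstream names

// Round robin: each new request goes to the next server in turn.
function makeRoundRobin(pool) {
  let next = 0;
  return () => pool[next++ % pool.length];
}

// IP hash: the same client address always maps to the same server.
// A simple illustrative string hash, kept within unsigned 32 bits.
function pickByIpHash(pool, ip) {
  let hash = 0;
  for (const ch of ip) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return pool[hash % pool.length];
}

const roundRobin = makeRoundRobin(servers);
const rrPicks = [roundRobin(), roundRobin(), roundRobin()];
const stickyPicks = [1, 2, 3].map(() => pickByIpHash(servers, "203.0.113.7"));

console.log(rrPicks);     // alternates: app-1, app-2, app-1
console.log(stickyPicks); // the same server three times
```

With round robin, consecutive requests from one client can land on different nodes, whereas the hashed choice keeps that client on a single node for as long as the pool is unchanged.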

I have noticed that the users list acts funny, so I'm going to investigate this.

Released -beta.23 that contains the docs on changes & fixes the nasty bugs. Looking into the users list issue rn.

Just FYI, when I upgrade to -beta.24 no messages get sent through. On -beta.23 this works as expected.

@SelimSalihovic Presence channels?

I'm not sure I follow but hopefully this answers it. The list of users online is empty but I see the connections being opened in the log files.

You said the messages don't get sent through, meaning it doesn't broadcast them?

I use Laravel Echo with Vue.js to create a list of users online within a presence channel. Everything seems to work with -beta.24, but I have only checked locally, on a single server using Redis. Broadcasts are working as well; I will check the multi-server setup later.

I did have to clear my Redis DB after update but that's expected.

@rennokki they don't seem to be broadcasted, no. I was testing your demo application with a 2 server setup with a LB in front. I didn't clear the redis db after updating. Would you say that's mandatory?

@AugmentBLU It's not mandatory, as the obsolete connection data gets flushed automatically if the connections do not pong within 120 seconds (according to the Pusher docs), which they obviously won't do if you close and re-open the server.

@SelimSalihovic The app has not been updated with the latest version, but I'm going to try it myself within the app. I have pushed the latest version to production and the broadcasting works as it should, even with dozens of concurrent users (like 141 or so).

@rennokki I did manually update the demo app on both servers to use beta 24 and then tested it. Again, as soon as I downgrade to beta 23 on both servers the broadcasts start working.

I just did another test and it appears to be working on beta 24 for both the demo app and our application. I'm not sure what made it not work 2 days ago, but I just wanted to let you know it does. The users online list also works for me now.

I've noticed that users still appear in Redis days after last being connected, meaning they show up as online. Shouldn't this be self-flushing after X amount of time?

@AugmentBLU @SelimSalihovic @rennokki I'm quite curious how this shakes out because it would be a game-changer to run load-balanced websocket servers and not have to even worry about laravel echo infrastructure. Any more updates from the stakeholders involved, production stories, etc.?

nvm, I see this is a focal point of the refactor. Thanks for all the hard work :smiley:
