Rocket.Chat stops working with 1000 active users

Created on 28 Jun 2018 · 74 comments · Source: RocketChat/Rocket.Chat

Description:

For us, Rocket.Chat does not work with more than 1000 active users. Rebooting a server, restarting Apache or restarting Rocket.Chat after an update causes all clients to face serious issues connecting to the chat.

Steps to reproduce:

  1. Set up a chat with 1000 simultaneous active users
  2. Restart all instances at once.

Expected behavior:

Clients can reconnect to the chat.

Actual behavior:

While reconnecting, the server sends an enormous number of the following messages over the websocket:

{"msg":"added","collection":"users","id":"$userId","fields":{"name":"$displayName","status":"online","username":"$username","utcOffset":2}}

```json
{"msg":"changed","collection":"users","id":"$userId","fields":{"status":"online"}}

```json
{"msg":"removed","collection":"users","id":"$userId"}

This continues until the server closes the websocket. I assume this is due to the lack of ping/pong messages during this time. The client instantly requests a new websocket, starting the whole thing over and over again.

The only effective way to get the cluster up and working again is to force-logout all users by deleting their loginTokens from mongodb directly.
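
For reference, a minimal sketch of that kind of force-logout in the mongo shell, assuming Meteor's default services.resume.loginTokens field (the exact command we used may have differed):

```js
// Hedged sketch: clear all resume tokens so every client has to log in again.
// Assumes Meteor's default accounts schema (services.resume.loginTokens).
db.users.updateMany(
  {},
  { $unset: { "services.resume.loginTokens": "" } }
);
```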

Server Setup Information:

  • Version of Rocket.Chat Server: 0.65.2
  • Operating System: Debian 8.11
  • Deployment Method: tar with pm2
  • Number of Running Instances: 8 virtual machines with 3 instances each (24 instances)
  • DB Replicaset Oplog: On
  • NodeJS Version: 8.9.4
  • MongoDB Version: 3.4.9

Additional context

The high number of instances we operate is a direct result of this issue. When we first ran into it with about 700 users, we assumed we might need to scale the cluster accordingly, but we are not willing to add another server to the cluster for every 40 new users. We planned to support around 8000 users, approximately half of them active.

For now, we do not allow mobile clients. We would really love to do so, but with the current state of the cluster this won't happen soon.

Labels: performance

Most helpful comment

@sampaiodiego Thanks to COVID-19 we had a lot of trouble scaling beyond 1650 active users. Thanks to v3.0.4 we are now at a maximum of 2250 users per day. Thank you for further improving on this issue.

All 74 comments

Do you have Apache2 or Nginx for the frontend?
Maybe you've reached some limitations (MaxClients?) for the frontend.

What about system usage (RAM, CPU, Network, FS) for the machines of the cluster?

Cheers

We are using Apache as reverse proxy. The servers have 16 GB RAM available and only 1.5 GB used per instance. CPU usage is going up to its limit during the reconnects.

[Three monitoring screenshots from 2018-06-28 showing system usage during the reconnect storm]

Sounds like you reached the maximum mongodb connections (1024 by default on linux as far as I know).

@AmShaegar13 You could try using nginx instead of Apache, and as suggested check your mongo settings?

hi @kaiiiiiiiii,
I am the admin of @AmShaegar13's Rocket.Chat-Setup, he kindly asked me to post this here:

001-rs:PRIMARY> db.serverStatus().connections
{ "current" : 182, "available" : 51018, "totalCreated" : 3234457 }

root@rocketchatdb:~# lsof -i | grep mongodb | wc -l
186

So this shouldn't be a thing…

Best,
qchn

Did you check the Apache2 log for MaxClients reached?

Yes, @magicbelette, thanks for the hint. We configured MaxClients to 1500 per node and we're far from reaching that.

@qchn Thanks. ;)

@magicbelette Yup, no errors regarding MaxClients. At most, rare proxy connection timeouts (about once an hour).

@vynmera _Trying_ is nothing I can easily do. This requires another downtime for our users. Additionally, I don't really suspect Apache to be the problem here. Node is causing the CPU load and HTTP is doing fine. I can load all scripts and assets just fine. Just the websocket never finishes receiving those collection update messages.

Sounds like a job for exponential back-off on the client side, after, say, 2-3 failed websocket reconnects.
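
As an illustration only, a minimal client-side sketch of that idea (names and thresholds are hypothetical; this is not Rocket.Chat's actual reconnect code):

```js
// Reconnect with exponential back-off after a few quick failures.
function connectWithBackoff(url, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => { attempt = 0; };  // reset the counter once connected
  ws.onclose = () => {
    // First few retries are fast, then back off exponentially, capped at 60 s.
    const delay = attempt < 3 ? 1000 : Math.min(60000, 1000 * 2 ** attempt);
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };
  return ws;
}
```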

Hello, you can try HAProxy and use forever-service to run the nodes: haproxy -> n nodes -> 1 mongodb server.

@dmoldovan123 As already mentioned in my reply to vynmera, I can't just _try_ various things. I have to maintain a stable service for 1000 active users. So if you could give me a hint why haproxy with forever-service would be better than apache with pm2, I would be really grateful. This would give me something to justify breaking the service (again) on purpose.

The thing is, I do not see a different proxy or service manager reducing the status-changed messages over the websocket.

Don't think that's the best idea ever, but you can easily test without PM2, directly with systemd. I don't really know about PM2, but the fact is that you add another layer and potentially a bottleneck.

Another thing from my experience: be careful with the Apache2 config... My instance was incredibly slow (3 seconds to load each avatar). My Apache2 used mpm_prefork with a dumb copy/pasted setting (MaxRequestsPerChild 1). The servers were consuming a lot of resources forking new processes, with a bad user experience, but there was no visible system load. Took me a couple of days to figure it out :/

I am using pm2 in fork mode so no extra layer should be present. 3 instances of Rocket.Chat are running. Each with its own port.
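
For illustration, a pm2 ecosystem file for this kind of fork-mode setup might look roughly like the sketch below (script path, ports, instance IP and Mongo URLs are placeholders, not our actual configuration):

```js
// ecosystem.config.js — sketch of 3 fork-mode instances, each on its own port.
module.exports = {
  apps: [3000, 3001, 3002].map((port) => ({
    name: `rocketchat-${port}`,
    script: '/opt/rocket.chat/main.js',   // placeholder path to the bundle
    exec_mode: 'fork',
    env: {
      PORT: String(port),
      ROOT_URL: 'https://chat.example.com',
      INSTANCE_IP: '10.0.0.11',            // this VM's address
      MONGO_URL: 'mongodb://db1,db2,db3/rocketchat?replicaSet=001-rs',
      MONGO_OPLOG_URL: 'mongodb://db1,db2,db3/local?replicaSet=001-rs',
    },
  })),
};
```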

Cluster mode did not work for some reason.

@dmoldovan123 This is what we already do. As you can see in the issue description, I am running 8 servers with 3 instances each to utilize CPU cores behind a reverse proxy. I don't see how another proxy would impact CPU load of node processes. We are using mongodb with replicaSet and instances can communicate with each other because I set INSTANCE_IP.

I am pretty sure this issue is related to this one in the user-presence library Rocket.Chat uses as well.

@rodrigok @sampaiodiego Can one of you confirm this?

Thanks for all of your suggestions. We could now prove that the UserPresenceMonitor was responsible for the denial of service we faced.

We disabled it on all but two separate instances and can restart the cluster now without causing tons of status updates.

We did so by patching the source and setting the USER_PRESENCE_MONITOR environment variable:

--- rocket.chat/programs/server/app/app.js  2018-07-04 18:07:36.917547890 +0200
+++ app.js  2018-07-04 18:10:12.273401726 +0200
@@ -7753,7 +7753,10 @@

   InstanceStatus.registerInstance('rocket.chat', instance);
   UserPresence.start();
-  return UserPresenceMonitor.start();
+
+  if (process.env['USER_PRESENCE_MONITOR']) {
+    return UserPresenceMonitor.start();
+  }
 });
 /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

I would still like to have an official fix for this rather than patching the source with every update.

@magicbelette We still use the _slow_ database engine, but we do not observe high CPU load or memory usage on the database servers yet.

I am also surprised that user presence hasn't caused more issues for more people. I raised the issue a while back.

Another question is whether pm2 is handling sessions properly. Meteor uses sticky sessions and if not handled properly your servers may be doing a lot of extra work constantly logging in new users.

I'd look to add Kadira (Meteor APM) to check app performance. NodeChef offers a solution for 50 dollars per month (as does Meteor Galaxy hosting, but then you have to use their hosting service, which is pricey).

I'm not sure, but I have the feeling that this patch causes users to appear offline to some others. As a consequence, users receive email notifications even if they are online.

@AmShaegar13 did you notice it or am I totally wrong ?

I think you are not. Usually the status appears to be correct, but I already had complaints about unnecessary emails. So yes, there is still something wrong with it and I am still hoping this will be fixed. But for now, we can at least use the chat again.

Currently, 1177 users at its peak.

The same appears to happen the other way round. No email notification although the user is offline in all clients.

This is becoming a major annoyance. More and more users complain about broken notifications. This issue is a major drawback for acceptance in our company.

How many instances are you distributing the load across?

We are running 6 servers with 3 instances each without UserPresenceMonitor (see the patch above) behind a load balancer. Additionally, we run 2 servers with 1 instance each with UserPresenceMonitor not balanced, so no users can reach them. Those two servers are dedicated to running the UserPresenceMonitor.

This setup keeps the cluster at least stable but causes the aforementioned problems with notifications.

Just wanted to follow up here. We are working through another case like this. So this is definitely on our radar.

We were having pretty much identical symptoms. CPU pegged at 100%. Packet storm.

We implemented @AmShaegar13’s July 5 patch and (combined with splitting the servers into two Auto-Scaling Groups with different environment variables set) it solved that problem. We then noticed that we were experiencing some of the side-effects mentioned above, including users being marked as Away even when actively using the app. User activity would mark the user as Online for a split-second but then the user would return to Away.

I was concerned that this fix completely broke the User Presence system, but the almost-immediately-Away problem turned out to be something much simpler. Not a runtime failure, but just a configuration bug. As discussed in Issue https://github.com/RocketChat/Rocket.Chat/issues/11309#issuecomment-430816373, releases 0.64.0 and 0.66.0 changed the semantics of the "Idle Time Limit" user config setting, changing the units of the idle timeout from milliseconds to seconds. I don't know if the migrations were broken, or ran twice, or something else, but the end result is that the 300-second idle timeout somehow became 0.3 seconds!

Point being, be aware that there are multiple issues to manage here.

We see similar issues with CPU related to user status when doing blue/green deploys. It seems to be in part related to the activeUsers publication.
https://github.com/RocketChat/Rocket.Chat/blob/0.70.4/server/publications/activeUsers.js

When a server goes offline, all the client connections for that server are removed from the DB by other online servers or by the next server to come online.
https://github.com/Konecty/meteor-user-presence/blob/master/server/server.js#L82

When the clients reconnect to the new servers, they create new client connections.

Both of these trigger the user records to be updated, status offline then online. Since the activeUsers publication notifies each client about changes to active users, that could be the number of active users x2 records sent to each client to process. This causes the clients to fall behind in processing user statuses. It also seems to have a snowball effect, because each client will try to report the user's status multiple times as it struggles to sync user statuses. You can see the flood of user status updates using Chrome dev tools by monitoring the WebSocket frames when restarting the server.
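
To make the fan-out concrete, here is a rough paraphrase of what such an active-users publication does conceptually (simplified; not the exact Rocket.Chat source linked above):

```js
// Every logged-in client subscribes to all non-offline users, so a single
// status flip is pushed to every connected client.
Meteor.publish('activeUsers', function () {
  if (!this.userId) {
    return this.ready();
  }
  return Users.find(
    { status: { $exists: true, $ne: 'offline' } },
    { fields: { name: 1, username: 1, status: 1, utcOffset: 1 } }
  );
});
```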

Question for @geekgonecrazy:
If this is on the radar as you say, would it be a bad thing to submit the 4-line workaround of AmShaegar13 mentioned earlier as a pull request? It could be an option for user groups not using the UserPresence monitor in the meantime.

They did: https://github.com/RocketChat/Rocket.Chat/pull/12353
However, they (correctly) changed the semantics of the environment variable from opt-in to opt-out.

--Noach


Just chiming in, the symptoms described here are very similar to what I've seen running an instance with ~800 active users. Any restart during the busy period would cause a massive increase in network traffic and unresponsive instances. I was never able to diagnose the issue down to a root cause, so glad to see this progress!

Still experiencing problems when lots of clients need to reconnect at once. Just had another downtime of half an hour because of this. Even with presence monitor limited to two of 20 instances. Around 1200 users online.

Looks like we had some network issues which caused a lot of clients to reconnect. The cluster however did not recover on its own. We had to stop it completely.

@AmShaegar13, we've been pretty stable since fixing some bad configuration settings a few weeks ago, all having to do with Apple Push Notifications:

/admin/Push
Enable: True
Enable Gateway: True
Gateway: https://gateway.rocket.chat

Running v0.69.2. 1800 users. 10 EC2 instances, 2 of which run UPM. 3 RocketChat daemons per instance.

Our log files are still filled with errors:

Error sending push to gateway (2 try) -> { Error: failed [400] {"code":108,"error":"bad request sent to apn","requestId":"[random UUID]","status":400}

and APNs are unreliable (obviously), but we're stable.

@AmShaegar13 Not an expert in Rocket.Chat, but I have some experience with broken servers under load and am interested in this problem as we evaluate Rocket.Chat. Have you run https://github.com/netdata/netdata and looked for bottlenecks? Common issues I found were normally related to various OS limits being set wrong (file handles and sockets in particular), too-long timeouts, disk contention (though unlikely to be an issue on AWS), memory, and process locks. Netdata is particularly good at giving an overview of which thing is breaking stuff.

We use our own tooling for that displayed on grafana dashboards which gives you a pretty good overview as well. I will see if I can post a screenshot of today's outage later. Also, please refer to the graph I posted back in June.

I will have a look at netdata and see if that will yield any further data. However, I don't know when this will happen again, so I won't be posting any new data soon.

By the way, we are using HAProxy after some internal infrastructure upgrade. This, apparently, did not change anything. I think we reached another maximum-users mark our setup can support, but I don't know how to fix that.

In my opinion, the problem is the expensive reconnect mechanism that multiplies when lots of clients reconnect at once. The server sends a list of all active users to the connecting client which takes a while and may be blocking. But I'm only guessing here.

In my opinion, the problem is the expensive reconnect mechanism that multiplies when lots of clients reconnect at once.

We worked around that problem by stopping the server and letting it sit for several minutes. After the clients fail to connect, they wait for a progressively longer retry delay. When we finally bring the servers back up, the clients return gradually. Ugly, but it was usually successful.

Adding explicit back-pressure to enter that back-off mode early during startup of the server could at least reduce the severity of the problem, and is a relatively easy fix (time-limited or load-induced addition of a header, and code to act on it with existing back-off in the clients).
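
A hypothetical sketch of that back-pressure idea (all names, paths and thresholds are invented; nothing like this exists in Rocket.Chat today):

```js
// During a warm-up window after boot, answer new websocket handshake requests
// with 503 + Retry-After so clients keep backing off instead of stampeding in.
const http = require('http');

const bootTime = Date.now();
const WARMUP_MS = 2 * 60 * 1000; // invented threshold

const server = http.createServer((req, res) => {
  const warmingUp = Date.now() - bootTime < WARMUP_MS;
  if (warmingUp && req.url.startsWith('/sockjs')) {
    res.writeHead(503, { 'Retry-After': '30' });
    return res.end();
  }
  // ...otherwise hand the request to the real application...
  res.writeHead(200);
  res.end('ok');
});

server.listen(3000);
```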

We limited the rate of newly created sessions by HAProxy now. This might prevent flooding the servers with reconnects next time.

Here's our monitoring for the recent incident:
[Monitoring screenshot, 27 February 2019]

Wohohou, I'm just in the middle of a Rocket.Chat deployment in our organisation and this does not look good.
To start, I plan for about 200-700 active users, and later, if everything works well, there are potentially 2000-3000 users waiting. Nothing fancy like mobile notifications, but now it looks like Rocket.Chat has problems handling even a few hundred users. This makes me nervous...
Also, why are you, AmShaegar13, running so many servers with a few instances each instead of running one or two servers with 10+ Rocket.Chat instances?

We only use VMs with 4 cores and doubled the number of servers when we faced these issues for the first time. This obviously did not help. We use 3 instances per server as recommended in Running Multiple Instances Per Host To Improve Performance.

I have everything in one 8-core VM with 16 GB RAM, including mongodb. The point for me is to have all components together, to get rid of any unnecessary Rocket.Chat component communication over the network. Currently 4 instances are running with empty channels and 1 connected admin. So far no problems with load :)

Looks like our issue has been fixed by #14488, thanks a lot!

Looks like our issue has been fixed by #14488, thanks a lot!

@AmShaegar13 Sir, could you tell me how many users a single instance supports now? Can one single instance support more than 5k people (and what is the server's physical config)?

Our current setup is 8 servers with one instance each. Downscaling from 3 instances each helped us support more users, apparently. These NxN connections between each and every Rocket.Chat instance seem to limit scalability.

Currently, we have 2100 active users. However, Rocket.Chat is far from stable. About once a day, the CPU load of single instances rises to 100%, slowing the instance down. This increases the response time; it rises to more than 45 s in extreme cases.

If we are fast enough to remove that instance from the load balancer, it eventually recovers. Otherwise, other instances will follow until the whole cluster is unusable and needs to be completely stopped and restarted.

From my point of view, Rocket.Chat will, in its current state, not be able to handle 5000 active users.

@AmShaegar13 Did you try the newer version (2.4.11)? And what is each server's physical config?

We are currently running 2.4.8. Can't tell you much about the physical hardware as the servers are virtual machines in our internal VM cluster. 4 cores, 16 GB RAM. That's all I have at the moment. Also, we stopped using pm2 and use systemd services now. However, this should not have any impact.

I suspect the reason is the mongodb server.

No. mongodb is pretty stable and can even handle high load. Also, our mongodbs are on extra hosts.

You can try nginx with HTTP/2. Could you tell me: does pm2 work?

No, I cannot try anything. I have 2000 users working from home because of COVID-19 who rely on a stable service.

Does pm2 cluster mode work, or Docker Swarm? I think we can try HTTP/2 or HTTP/3.

Gateway

What's the Gateway's function? @nmagedman @AmShaegar13

The gateway is used for push notifications from your Rocket.Chat instance to Android/iOS.

pm2 in cluster mode does not work. Every Rocket.Chat instance needs its own dedicated port. Docker should be fine as long as you pass the instance IP as the INSTANCE_IP environment variable to Rocket.Chat.

@AmShaegar13, thank you sir. Does it work fine with multiple instances? Or are there some bugs?

Multiple instances work if you correctly set the INSTANCE_IP environment variable.

@AmShaegar13 Thanks. Somebody said their multiple instances resolved the performance problem, but that there's a bug with notifications. Does this still exist?

As I said. The issue has been fixed.

@AmShaegar13 Thanks very much. Jesus bless you.

Sir, when I use a remote MONGO_URL (with a password), I get this error:

MongoNetworkError: failed to connect to server [127.0.0.1:27017] on first connect [MongoNetworkError: connect ECONNREFUSED 127.0.0.1:27017]

Why 127.0.0.1?

version: "3.3"
services:
rocketchat-00:
image: rocket.chat:2.4.11
restart: always
environment:
PORT: "3000"
INSTANCE_IP: "172.17.117.62"
ROOT_URL: 'http://172.17.117.62:7777'
MONGO_URL: 'mongodb://admin:[email protected]:27017/rocketchat?replicaSet=rocketchatReplicaSet&authSource=admin'
MONGO_OPLOG_URL: 'mongodb://admin:[email protected]:27017/local?replicaSet=rocketchatReplicaSet&authSource=admin'
ports:
- "3000:7777"

Thanks @AmShaegar13 for all your support.. nice to see your numbers and what you've accomplished.. if you get a chance, try upgrading to 3.0.x, as we removed a lot of Meteor stuff so we can be more scalable.

@564064202 if you have a different issue, please open a new issue then

@sampaiodiego Thanks to COVID-19 we had a lot of trouble scaling beyond 1650 active users. Thanks to v3.0.4 we are now at a maximum of 2250 users per day. Thank you for further improving on this issue.

7K users on 55 nodes, but still on Rocket.Chat 2.x.

My problem upgrading to Rocket.Chat 3.x is the number of nodes. It seems that MongoDB doesn't support the broadcasting of a large number of new instances at the same time when restarting node.js.

Can I restart only a few nodes at a time? That implies some Rocket.Chat instances running version 3.x and others running 2.x.

Cheers

Can I restart only a few nodes at a time? That implies some Rocket.Chat instances running version 3.x and others running 2.x.

Yes you can, @magicbelette.. it is not always recommended, as at some point version schemas might be incompatible, but you usually can do that. We actually use a rolling-upgrade strategy on k8s that does exactly the same. 😉

In our case, after migrating to v3.0.4 (today) from v2.2.1 (yesterday), every node process uses 100% of the CPU :/

We keep the config DISABLE_PRESENCE_MONITOR=true on 53/55 instances.

[Screenshot, 26 March 2020]

@AmShaegar13 Sir, could you give me a hand?
https://github.com/RocketChat/Rocket.Chat/issues/17020

docker-compose with multiple instances at the same time hits a bug: 'MongoError: ns not found', 'Errors like this can cause oplog processing errors.'

@sampaiodiego Sir, could you give me a hand?

https://github.com/RocketChat/Rocket.Chat/issues/17020

docker-compose with multiple instances at the same time hits a bug: 'MongoError: ns not found', 'Errors like this can cause oplog processing errors.'

We are currently running 2.4.8. Can't tell you much about the physical hardware as the servers are virtual machines in our internal VM cluster. 4 cores, 16 GB RAM. That's all I have at the moment. Also, we stopped using pm2 and use systemd services now. However, this should not have any impact.
@AmShaegar13
Dear Sir, I am so sorry to bother you, but I don't know who can tell me the truth about how to support more than 5000 people online. You have more than two thousand people online. I want to know: if I want 5,000 online users, how many servers do I need (or how much server configuration is required)? Could you tell me, or give me some suggestions? Please help me, for God's sake. Thank you very much.

@564064202 For 5k online users 8 instances should be enough, but to help you correctly we would need to understand all the aspects of your installation, usage, etc. If you have any kind of support contract we can do it quicker; without one you may need to wait for answers here when we have time, or for help from the community.

Some basic advice:

  • For scale we recommend some docker installation (k8s, openshift, or, at least, docker-compose).
  • SSD for the database and use replicas.
  • Using the latest version you can disable some features in the Troubleshooting section of the admin area; test them and check how they affect performance.

@rodrigok @AmShaegar13
We found that the 3.* versions are often unstable, so we dare not use them. What hardware configuration do you need for the 8 running instances? We also plan to use SSDs to run MongoDB. All platforms are running in the cloud.

Our current setup is 8 servers with one instance each. Downscaling from 3 instances each helped us support more users, apparently. These NxN connections between each and every Rocket.Chat instance seem to limit scalability.

Hi, @AmShaegar13
We are having very similar issues, and #14488 in my opinion didn't help a lot.
Now we are on 3.1.1 and experiencing sporadic high resource load where presence statuses and the notifications about them come into play.
We still do not see a way to reboot one single instance - it will flood all other instances and we will need to restart all instances with warm-ups like @nmagedman described here.
And I'd like to ask you @AmShaegar13 and @nmagedman about this:

They did. #12353 However, they (correctly) changed the semantics of the environment variable from opt-in to opt-out.

Is it true that in #12353 the whole logic is reversed, compared to that patch?
I mean, if I set DISABLE_PRESENCE_MONITOR=yes in the docker-compose file, what happens? Will that instance start with UserPresenceMonitor.start(); or without it?

Hi @ankar84,
as the name of the environment variable DISABLE_PRESENCE_MONITOR indicates, if you set it to true or yes, the presence monitor is disabled; otherwise it is enabled. So it works like opt-out. The presence monitor is always on, except when you set DISABLE_PRESENCE_MONITOR=yes.
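
In other words, the check is roughly the following (a sketch of the opt-out semantics, not the literal code from #12353):

```js
// The monitor starts unless DISABLE_PRESENCE_MONITOR is explicitly "true"/"yes".
const disabled = ['true', 'yes'].includes(
  String(process.env.DISABLE_PRESENCE_MONITOR).toLowerCase()
);
if (!disabled) {
  UserPresenceMonitor.start();
}
```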

However, we are not using it anymore. We are on v3.0.4 at the moment which works without any problems.

as the name of the environment variable DISABLE_PRESENCE_MONITOR indicates, if you set it to true or yes, the presence monitor is disabled; otherwise it is enabled. So it works like opt-out. The presence monitor is always on, except when you set DISABLE_PRESENCE_MONITOR=yes.

I get it! Your patch was: if USER_PRESENCE_MONITOR is set, then start; but Diego implemented it the opposite way: if DISABLE_PRESENCE_MONITOR is not true or yes, then start it.

However, we are not using it anymore. We are on v3.0.4 at the moment which works without any problems.

Now we are on 3.1.1 and sometimes we experience performance issues, so I configured only 1 of our 20 instances to have the user presence monitor started now. I do not see any problems with how the presence status system works now. And thanks for your answer!

Hi, we're trying to support 4k active users with RocketChat, but we are unable to go above 1k for now.

We are using RocketChat v3.6.3
10 instances (2CPU & 2GB RAM each) on AWS Fargate
A 3-node MongoDB v4.2 cluster (8 vCPU & 32 GB RAM & 16000 max connections) on Atlas; we use retryWrites=true&w=majority&poolSize=75 in the connection string.

We are using selenium with headless chrome on the cloud to perform the load test.
All users are connected to the same public channel, and wait a random amount of time before sending a text message.
We tried with:
up to 10 min: const time = Math.floor(Math.random() * 10 * 60) * 1000;
and up to 30 min: const time = Math.floor(Math.random() * 30 * 60) * 1000;
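
For context, a minimal sketch of what one such simulated user could look like with the selenium-webdriver Node bindings (the URL, CSS selector and login handling are placeholders, not the actual test script):

```js
// One simulated user: open the channel, wait a random time, send one message.
const { Builder, By, Key } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

(async () => {
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(new chrome.Options().addArguments('--headless'))
    .build();
  try {
    await driver.get('https://chat.example.com/channel/general'); // placeholder URL
    const time = Math.floor(Math.random() * 10 * 60) * 1000;      // up to 10 min
    await driver.sleep(time);
    const input = await driver.findElement(By.css('.rc-message-box__textarea')); // placeholder selector
    await input.sendKeys('load test message', Key.ENTER);
  } finally {
    await driver.quit();
  }
})();
```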

Yesterday we tried with 2370 users and the chat was unusable. I could not send messages (they stay grey and no sendMessage REST request is sent); if I reload the page I can access the channel, but the messages loader stays there forever.

The problem is that our monitoring does not show any big CPU load, the app instances are at ~50% CPU max and the DB is at ~40% CPU, so we're at a loss here.

We first discovered that having the setting Unread_Count set to all_messages is a big no for large channels; it was generating a lot of oplog updates on the subscription collection and was slowing down the app.

We also have a lot of this in our instance logs:
Mongodb Exception in setInterval callback: SwitchedToQuery TIMEOUT QUERY OPERATION

We would appreciate any additional hints from the experts in this thread.
@rodrigok @AmShaegar13 @magicbelette @ankar84

@ramrami to start supporting more users you will need to disable notifications and presence.
