Rocket.chat: Node slamming CPU since update to 3.7

Created on 30 Sep 2020 · 45 Comments · Source: RocketChat/Rocket.Chat

Since updating to 3.7 from the previous version, CPU utilisation has jumped from less than 5% to a constant, very high average.

Description:

Node is running away with the CPU, and the MongoDB Docker container's usage is also high.

Screenshot 2020-09-30 at 14 36 09

Screenshot 2020-09-30 at 14 38 04

Screenshot 2020-09-30 at 14 38 15

Steps to reproduce:

  1. Restart / start Rocket.Chat

Expected behavior:

normal behaviour / ~3% CPU

Actual behavior:

high CPU

Server Setup Information:

  • Version of Rocket.Chat Server:
  • Operating System:
  • Deployment Method: latest docker
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

N/A

Additional context

Relevant logs:


Most helpful comment

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.

All 45 comments

@HiveMindNet Which version of RC were you upgrading from? Also, which version of MongoDB before and after?

@HiveMindNet Which version of RC were you upgrading from? Also, which version of MongoDB _before_ and _after_?

Thanks for asking 👍

I came from 3.6.3 (I'm always up to date) and I didn't upgrade the Mongo Docker container (which is and was DB version v4.0.20) - could that be the issue?

Both run in individual Docker containers, if that matters.

Thanks for any guidance :)

I updated from 3.6.3 too, a manual installation. After upgrading to 3.7.0, I ran into the same problem. My Node is 12.14.0, npm is 6.13.4, and my Mongo is 4.0.2.
Thanks for any suggestions.

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

I think the answer to all your questions is no: I haven't seen this before, and there are only 4 users on my server. With previous upgrades the CPU usage would drop to about 1%-3% within 2 minutes, but this time it stays at almost 100% all the time.
Thanks.

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

One more question: can I downgrade to 3.6.3 by following the update steps in doc.rocket.chat? Would there be any problem with that?
Thanks

I have just upgraded from 3.6.2 and I can confirm this issue. I will wait until morning; if it does not resolve itself, I will do a restore.

node -v: v12.14.0

2020-09-30_19-59-39
2020-09-30_20-00-47
2020-09-30_20-02-27

I have the same problem after upgrading from 3.6.3. Before the update the load was 3-4%
node: 12.14.0
mongoDB: 4.0.20
OS: Ubuntu 20.04.1 LTS

image

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries which I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

Thanks everyone, all data points help.

Is anyone not running this on a time-shared cloud VPS (i.e. on your own dedicated machines)?

Thanks everyone, all data points help.

Is anyone not running this on a time-shared cloud VPS (i.e. on your own dedicated machines)?

We're on our own AWS EC2 (m5.xlarge), just running RC within docker for ease. Nothing we can't look at or adjust if needed on the main infrastructure. Let me know if you want me to look at anything. We're running sematext for monitoring everything across containers and on the main EC2 too, including logs and events.

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries which I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

I don't receive any errors; there are no errors, warnings, or anything else in my log, just like before, so I think it's not a Jitsi problem.
Thanks.

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

@HiveMindNet is Jitsi something you could turn the integration off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is the same too.

But anything we use to narrow in on the cause would be awesome.

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

Hi, @Sing-Li
Yes, I have upgraded one of my test environments from 3.6.3 to 3.7.0 and everything looks fine.
I have only a few users in that test deployment (but @xzyyyy has trouble with just 4 users here).

I have 3 servers with CentOS 7 and 4 RC Docker containers each - 12 instances.
Each server has one MongoDB container, version 4.0.20.
The storage engine is wiredTiger because of the use of change streams https://github.com/RocketChat/Rocket.Chat/pull/18892
Here are htop screenshots from all 3 servers:
image
image
image

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.
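
(For Docker deployments like the reporter's, a minimal sketch of how the variable could be passed on the command line; the container name, image tag, port and Mongo URLs below are placeholders, not values from this thread:)

    # Illustrative only - adjust names, tag and URLs to your own setup
    docker stop rocketchat && docker rm rocketchat
    docker run -d --name rocketchat \
      -e USE_NATIVE_OPLOG=true \
      -e MONGO_URL=mongodb://mongo:27017/rocketchat \
      -e MONGO_OPLOG_URL=mongodb://mongo:27017/local \
      -p 3000:3000 \
      rocket.chat:3.7.0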

@HiveMindNet is Jitsi something you could turn the integration off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is the same too.

But anything we use to narrow in on the cause would be awesome.

I have just tried turning off Jitsi integration and stopping Jitsi servers and it made no difference. Worth a try though - so that's ruled out.

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.

Jackpot! You can see the drop on the graph below when I restart the Docker container: the CPU drops back to normal.

Is it safe for me to leave USE_NATIVE_OPLOG=true in a production environment while you find a fix?

Screenshot 2020-10-01 at 08 03 23

Screenshot 2020-10-01 at 08 03 35

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.

I can confirm that USE_NATIVE_OPLOG=true fixed the high CPU issue.

Same thing on my instance; adding "USE_NATIVE_OPLOG=true" to the systemd service file fixed the issue.

@mranderson56 Where should I write this?

@wolfcreative In the service file (/lib/systemd/system/rocketchat.service), in the Environment parameter.

In your systemd file (on Debian): /etc/systemd/system/multi-user.target.wants/rocketchat.service
You can add this in "Environment".
Then: systemctl daemon-reload
And restart Rocket.Chat.
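
(For reference, a rough sketch of what the relevant part of such a unit file might look like; the other values on the Environment line are placeholders and will differ per installation:)

    # /lib/systemd/system/rocketchat.service (excerpt, illustrative only)
    [Service]
    # Append USE_NATIVE_OPLOG=true to the existing Environment= line
    Environment=ROOT_URL=https://chat.example.com PORT=3000 USE_NATIVE_OPLOG=true

    # Then reload systemd and restart the service:
    #   sudo systemctl daemon-reload
    #   sudo systemctl restart rocketchat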

Well, the CPU load dropped to 0.7-1.5%
Thank you all for your help!

Can any of the main devs please let me know, is it safe for me to leave USE_NATIVE_OPLOG=true in a production environment while you find a fix? @geekgonecrazy

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in a production environment; it will use the code used in all previous versions.

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in a production environment; it will use the code used in all previous versions.

Thank you :)

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

@rodrigok Someone in the forums posted some info: https://forums.rocket.chat/t/rocketchat-3-7-0-high-cpu-usage/8715

I solved it by adding this to my docker-compose file:

    volumes:
      - ./uploads:/app/uploads
    environment:
      - USE_NATIVE_OPLOG=true
      - PORT=3000
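
(Note: the new environment variable only takes effect once the container is recreated; assuming the service is named rocketchat, running docker-compose up -d rocketchat should recreate it with the updated environment.)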

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? :thinking:

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? :thinking:

Sorry, I think it is something unrelated to this issue, but I don't want to leave you hanging ;)
In my case there was an index missing on the collection rocketchat.rocketchat_uploads.chunks. The missing index (a compound index on the fields files_id and n) causes all requests for blobs to do a full collection scan. I'm not sure how this index got lost, but we did a database recovery right before the update to 3.7.0, so something might have gone wrong there.
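
(In case anyone wants to check for the same thing, a rough mongo shell sketch, assuming the default database name rocketchat; the compound index below is the standard GridFS index on a chunks collection:)

    // Illustrative only - run in the mongo shell against your Rocket.Chat database
    use rocketchat

    // List existing indexes on the uploads chunks collection
    db.rocketchat_uploads.chunks.getIndexes()

    // Recreate the standard GridFS compound index if it is missing
    db.rocketchat_uploads.chunks.createIndex({ files_id: 1, n: 1 }, { unique: true })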

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

Sorry @rodrigok, we're still early in adoption, so although we have over 1000 users, only around 20-30 are concurrently active at any one time.

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

We have the same problem with our cluster. The workaround with USE_NATIVE_OPLOG=true also worked for us.

Will this fix get merged into the 3.7.1 release?

Unlike the majority of users here, I want to test change streams in a test environment.
RC server version is 3.7.0
Storage engine is wiredTiger
MongoDB version is 4.0.19
There is no oplog information on instance start.

➔ System ➔ startup
➔ +---------------------------------------------+
➔ |                SERVER RUNNING               |
➔ +---------------------------------------------+
➔ |                                             |
➔ |  Rocket.Chat Version: 3.7.0                 |
➔ |       NodeJS Version: 12.18.4 - x64         |
➔ |      MongoDB Version: 4.0.19                |
➔ |       MongoDB Engine: wiredTiger            |
➔ |             Platform: linux                 |
➔ |         Process Port: 3000                  |
➔ |             Site URL: https://rc.test.ru  |
➔ |     ReplicaSet OpLog: Enabled               |
➔ |          Commit Hash: 2c82cd9cee            |
➔ |        Commit Branch: HEAD                  |
➔ |                                             |
➔ +---------------------------------------------+

How can I force-enable change streams in my deployment?

Thank you!
USE_NATIVE_OPLOG=true fixed the problem!

Folks, we released version 3.7.1 with a fix for this situation. Can you all please try this new version without the USE_NATIVE_OPLOG=true variable?

Thanks

We have no issues with 3.7.1, MongoDB 3.6 running WiredTiger and no USE_NATIVE_OPLOG variable present. No experience with 3.7.0, so I cannot confirm whether the issue existed for us on that.

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good. However we do not have many users online on weekends so we'll have to wait for monday for a definitive answer.

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good.

wiredTiger

Does this only fix the "mmapv1 database engine or when no admin access was granted to the database user" case?
We use version 3.3.3, and it sometimes has a high CPU load. Does this also work for wiredTiger? Thank you.
