Rocket.chat: Node slamming CPU since update to 3.7

Created on 30 Sep 2020 · 45 Comments · Source: RocketChat/Rocket.Chat

Since updating to 3.7 from the previous version, CPU utilisation has jumped from less than 5% to a constant, very high average.

Description:

Node is running away with the CPU, and the MongoDB Docker container's usage is also high.

Screenshot 2020-09-30 at 14 36 09

Screenshot 2020-09-30 at 14 38 04

Screenshot 2020-09-30 at 14 38 15

Steps to reproduce:

  1. Restart / start Rocket.Chat

Expected behavior:

normal behaviour / ~3% CPU

Actual behavior:

high CPU

Server Setup Information:

  • Version of Rocket.Chat Server:
  • Operating System:
  • Deployment Method: latest docker
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

N/A

Additional context

Relevant logs:


Most helpful comment

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.

All 45 comments

@HiveMindNet Which version of RC were you upgrading from? Also, which version of MongoDB before and after?

@HiveMindNet Which version of RC were you upgrading from? Also, which version of MongoDB _before_ and _after_?

Thanks for asking 👍

I came from 3.6.3 (I'm always up to date) and I didn't upgrade the Mongo Docker container (which is and was DB version v4.0.20) - could that be the issue?

Both run in individual Docker containers, if that matters.

Thanks for any guidance :)

I updated from 3.6.3 too, a manual installation. After upgrading to 3.7.0, I ran into the same problem. My Node is 12.14.0, npm is 6.13.4, and my Mongo is 4.0.2.
Thanks for any suggestions.

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

I think the answer to all your questions is no: I haven't seen this before, and there are only 4 users on my server. With previous upgrades the CPU usage would drop to about 1%-3% within 2 minutes, but this time it stays at almost 100% all the time.
Thanks.

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

One more question: can I downgrade to 3.6.3 by following the update steps in doc.rocket.chat? Would there be any problem with that?
Thanks

I have just upgraded from 3.6.2 and I can confirm this issue. I will wait until morning; if it does not resolve itself, I will do a restore.

node -v: v12.14.0

2020-09-30_19-59-39
2020-09-30_20-00-47
2020-09-30_20-02-27

I have the same problem after upgrading from 3.6.3. Before the update the load was 3-4%
node: 12.14.0
mongoDB: 4.0.20
OS: Ubuntu 20.04.1 LTS

image

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries which I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

Thanks everyone, all data points help.

Is anyone not running this on a time-shared cloud VPS (i.e. on your own dedicated machines)?

Thanks everyone, all data points help.

Is anyone not running this on a time-shared cloud VPS (i.e. on your own dedicated machines)?

We're on our own AWS EC2 (m5.xlarge), just running RC within docker for ease. Nothing we can't look at or adjust if needed on the main infrastructure. Let me know if you want me to look at anything. We're running sematext for monitoring everything across containers and on the main EC2 too, including logs and events.

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries which I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

I don't receive any errors; there are no errors, warnings, or anything else in my log, just like before, so I think it's not a Jitsi problem.
Thanks.

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

@HiveMindNet is Jitsi something you could turn the integration off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is the same too.

But anything we use to narrow in on the cause would be awesome.

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

Hi, @Sing-Li
Yes, I have upgraded one of my test environments from 3.6.3 to 3.7.0 and everything looks fine.
I have only a few users in that test deployment (but @xzyyyy has trouble with just 4 users here).

I have 3 servers with CentOS 7 and 4 RC Docker containers each - 12 instances.
Each server has one MongoDB container, version 4.0.20.
The storage engine is wiredTiger because of the use of change streams https://github.com/RocketChat/Rocket.Chat/pull/18892
Here are htop screenshots from all 3 servers:
image
image
image

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.
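
(For Docker deployments like the reporter's, a minimal sketch of how the variable could be passed on the command line; the container name, image tag, port and Mongo URLs below are placeholders, not values from this thread:)

    # Illustrative only - adjust names, tag and URLs to your own setup
    docker stop rocketchat && docker rm rocketchat
    docker run -d --name rocketchat \
      -e USE_NATIVE_OPLOG=true \
      -e MONGO_URL=mongodb://mongo:27017/rocketchat \
      -e MONGO_OPLOG_URL=mongodb://mongo:27017/local \
      -p 3000:3000 \
      rocket.chat:3.7.0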

@HiveMindNet is Jitsi something you could turn the integration off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is the same too.

But anything we use to narrow in on the cause would be awesome.

I have just tried turning off Jitsi integration and stopping Jitsi servers and it made no difference. Worth a try though - so that's ruled out.

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.

Jackpot! You can see the drop on the graph below when I restart the Docker container: the CPU drops back to normal.

Is it safe for me to leave USE_NATIVE_OPLOG=true in a production environment while you find a fix?

Screenshot 2020-10-01 at 08 03 23

Screenshot 2020-10-01 at 08 03 35

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding USE_NATIVE_OPLOG=true? I’m curious whether the issue is in the new change stream processing, as this is a new bit of processing.

I can confirm that USE_NATIVE_OPLOG=true fixed the high CPU issue.

Same thing on my instance; adding "USE_NATIVE_OPLOG=true" to the systemd service file fixed the issue.

@mranderson56 Where should I write this?

@wolfcreative In the service file (/lib/systemd/system/rocketchat.service), in the Environment parameter.

In your systemd file (on Debian): /etc/systemd/system/multi-user.target.wants/rocketchat.service
You can add this in "Environment".
Then: systemctl daemon-reload
And restart Rocket.Chat.
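
(For reference, a rough sketch of what the relevant part of such a unit file might look like; the other values on the Environment line are placeholders and will differ per installation:)

    # /lib/systemd/system/rocketchat.service (excerpt, illustrative only)
    [Service]
    # Append USE_NATIVE_OPLOG=true to the existing Environment= line
    Environment=ROOT_URL=https://chat.example.com PORT=3000 USE_NATIVE_OPLOG=true

    # Then reload systemd and restart the service:
    #   sudo systemctl daemon-reload
    #   sudo systemctl restart rocketchat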

Well, the CPU load dropped to 0.7-1.5%
Thank you all for your help!

Can any of the main devs please let me know, is it safe for me to leave USE_NATIVE_OPLOG=true in a production environment while you find a fix? @geekgonecrazy

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in a production environment; it will use the code used in all previous versions.

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in a production environment; it will use the code used in all previous versions.

Thank you :)

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

@rodrigok Someone in the forums posted some info: https://forums.rocket.chat/t/rocketchat-3-7-0-high-cpu-usage/8715

I solved it by adding this to my docker-compose file:

    volumes:
      - ./uploads:/app/uploads
    environment:
      - USE_NATIVE_OPLOG=true
      - PORT=3000
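
(Note: the new environment variable only takes effect once the container is recreated; assuming the service is named rocketchat, running docker-compose up -d rocketchat should recreate it with the updated environment.)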

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? :thinking:

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher than before the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? :thinking:

Sorry, I think it is something unrelated to this issue, but I don't want to leave you hanging ;)
In my case there was an index missing on the collection rocketchat.rocketchat_uploads.chunks. The missing index (a compound index on the fields files_id and n) causes all requests for blobs to do a full collection scan. I'm not sure how this index got lost, but we did a database recovery right before the update to 3.7.0, so something might have gone wrong there.
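
(In case anyone wants to check for the same thing, a rough mongo shell sketch, assuming the default database name rocketchat; the compound index below is the standard GridFS index on a chunks collection:)

    // Illustrative only - run in the mongo shell against your Rocket.Chat database
    use rocketchat

    // List existing indexes on the uploads chunks collection
    db.rocketchat_uploads.chunks.getIndexes()

    // Recreate the standard GridFS compound index if it is missing
    db.rocketchat_uploads.chunks.createIndex({ files_id: 1, n: 1 }, { unique: true })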

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

Sorry @rodrigok, we're still early in adoption, so although we have over 1000 users, only around 20-30 are concurrently active at any one time.

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

We have the same problem with our cluster. The workaround with USE_NATIVE_OPLOG=true also worked for us.

Will this fix get merged into the 3.7.1 release?

Unlike the majority of users here, I want to test change streams in a test environment.
RC server version is 3.7.0
Storage engine is wiredTiger
MongoDB version is 4.0.19
There is no oplog information on instance start.

➔ System ➔ startup
➔ +---------------------------------------------+
➔ |                SERVER RUNNING               |
➔ +---------------------------------------------+
➔ |                                             |
➔ |  Rocket.Chat Version: 3.7.0                 |
➔ |       NodeJS Version: 12.18.4 - x64         |
➔ |      MongoDB Version: 4.0.19                |
➔ |       MongoDB Engine: wiredTiger            |
➔ |             Platform: linux                 |
➔ |         Process Port: 3000                  |
➔ |             Site URL: https://rc.test.ru  |
➔ |     ReplicaSet OpLog: Enabled               |
➔ |          Commit Hash: 2c82cd9cee            |
➔ |        Commit Branch: HEAD                  |
➔ |                                             |
➔ +---------------------------------------------+

How can I force-enable change streams in my deployment?

Thank you!
USE_NATIVE_OPLOG=true fixed the problem!

Folks, we released version 3.7.1 with a fix for this situation. Can you all please try this new version without the USE_NATIVE_OPLOG=true variable?

Thanks

We have no issues with 3.7.1, MongoDB 3.6 running WiredTiger and no USE_NATIVE_OPLOG variable present. No experience with 3.7.0, so I cannot confirm whether the issue existed for us on that.

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good. However we do not have many users online on weekends so we'll have to wait for monday for a definitive answer.

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good.

wiredTiger

Does this only fix the "mmapv1 database engine or when no admin access was granted to the database user" case?
We use version 3.3.3, and it sometimes has a high CPU load. Does this also work for wiredTiger? Thank you.
