Rocket.Chat: Possible memory leak when clearing OEmbed cache

Created on 4 Apr 2017 · 55 comments · Source: RocketChat/Rocket.Chat

Rocket.Chat Version: 0.53.0
Running Instances: 4
DB Replicaset OpLog: Yes 7 days
Node Version: v4.5.0

Over the course of many months I have had several problems with Rocket.Chat running out of RAM and swap on several of my VMs. Each VM has 2.5 GB of RAM and 2 GB of swap. The increase in RAM usage is gradual, but it keeps rising until the server dies.

[Screenshot: memory usage graph]

All 55 comments

@oliver-bowman how quickly does this seem to happen? And do you notice any pattern? Like more users connected and it starts using more memory? Any pattern you can identify would greatly help us track down this potential problem.

This happens around once a month, every month. The number of connected users stays around the same, plus or minus a few. I will continue to gather more information when I can.

@oliver-bowman please do! Anything that could help us reproduce or at least identify where the leak might be would be very useful.

@oliver-bowman I think that 2 GB of swap with 2.5 GB of RAM is too much. Best practice is to use 50% of RAM for swap.

@Rohlik it's not best practice, but in this case, other than reads/writes being slower whenever swap is used, it shouldn't be affecting his issue.

I think I understood correctly that each of these rocket.chat instances is running on its own VM?

That was not the original resource allocation for the VMs; I had to increase it because of this particular issue, to try to reduce downtime. Yes, each instance is on its own VM.

We ran into the same issue; our instance crashes every day. We have ~50 online users around the clock. That should be fine.

Rocket.Chat Version: 0.55.1
Running Instances: 1
DB Replicaset OpLog: Disabled
Node Version: v4.5.0

The latest crash report:

Apr 24 11:31:47 chat node[30215]: FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Apr 24 11:31:47 chat node[30215]: <--- Last few GCs --->
Apr 24 11:31:47 chat node[30215]: 86452953 ms: Mark-sweep 1317.2 (1457.7) -> 1317.4 (1457.7) MB, 404.6 / 0 ms [allocation failure] [scavenge might not succeed].
Apr 24 11:31:47 chat node[30215]: 86453388 ms: Mark-sweep 1317.4 (1457.7) -> 1317.4 (1457.7) MB, 435.1 / 0 ms [allocation failure] [scavenge might not succeed].
Apr 24 11:31:47 chat node[30215]: 86453815 ms: Mark-sweep 1317.4 (1457.7) -> 1315.4 (1457.7) MB, 427.2 / 0 ms [last resort gc].
Apr 24 11:31:47 chat node[30215]: 86454241 ms: Mark-sweep 1315.4 (1457.7) -> 1316.8 (1457.7) MB, 425.9 / 0 ms [last resort gc].
Apr 24 11:31:47 chat node[30215]: <--- JS stacktrace --->
Apr 24 11:31:47 chat node[30215]: ==== JS stack trace =========================================
Apr 24 11:31:47 chat node[30215]: Security context: 0x20cc65ab4629 <JS Object>
Apr 24 11:31:47 chat node[30215]:     1: slowToString [buffer.js:405] [pc=0x321c91a06695] (this=0xf4ac73e9b41 <an Uint8Array with map 0x18d5f1ac8c09>,encoding=0x20cc65abd761 <String[4]: utf8>,start=2322,end=236296)
Apr 24 11:31:47 chat node[30215]:     2: toString [buffer.js:~432] [pc=0x321c91ca8fe5] (this=0xf4ac73e9b41 <an Uint8Array with map 0x18d5f1ac8c09>)
Apr 24 11:31:47 chat node[30215]:     3: arguments adaptor frame: 3->0
Apr 24 11:31:47 chat node[30215]:     4: /* anonymous */(aka /* anonymous */) [/opt/Rocket.Chat-0.5...
Apr 24 11:31:48 chat systemd[1]: rocketchat.service: Main process exited, code=killed, status=6/ABRT
Apr 24 11:31:48 chat systemd[1]: rocketchat.service: Unit entered failed state.
Apr 24 11:31:48 chat systemd[1]: rocketchat.service: Failed with result 'signal'.
Apr 24 11:35:31 chat systemd[1]: Stopped Rocket Chat server.

Our custom systemd unit does a great job of restarting the crashed instance, but this always results in unexpected downtime.

I've seen that Node has a default memory limit of 1.76 GB on a 64-bit machine. Is that maybe too little for our instance? Or is it really a memory leak? I'm really not sure.

The memory usage of the system itself seems okay; the host has 4 GB of RAM.

[Screenshot 2017-04-24 11:53:32]

[Screenshot 2017-04-24 11:50:11]

@BirknerAlex ah! You are spot on! It does appear Node.js limits itself to 1.76 GB of memory. Adding the flag --max_old_space_size=4096 should allow it to use all available memory.

So switch the command to something like: node --max_old_space_size=4096 main.js, of course swapping the number for however many megabytes of memory you wish to allow.

@oliver-bowman this might also solve the problem for you.

This likely needs to be added to our documentation. Let me know if this does the trick for you guys.

@geekgonecrazy So do you think it's legitimate that our instances are consuming more than 1.76 GB of memory? If yes, sure, I will adjust our memory limit 👍 If it's really a memory leak, we will just increase the time until the instance crashes again.

//EDIT: Okay, I will test this parameter change and let you guys know if it fixes the issue for me.

@BirknerAlex it's definitely possible. How many users / connected users do you have?

Between 30-50 users at the same time each day. I've changed the setting on our production instance and restarted it. I will monitor the process for the next few days and let you know if I see further crashes.

Yes, please do and let us know. We've been monitoring memory usage on our demo server and haven't noticed a memory leak. It seems the garbage collector is doing a pretty good job of sweeping up and keeping memory usage from getting out of control. But the more eyes on memory usage, the better.

Okay! Now it's getting weird. We had a new out-of-memory crash last evening.

There was a huge memory usage peak before the crash. So it seems giving the Rocket.Chat process a higher memory limit only means more memory gets consumed before it crashes. Is there something like a scheduled task in Rocket.Chat? Log or history cleanup, ...? Our database contains 459,662 messages (MongoDB size ~10 GB); I don't know if that is much.

Before upgrading to 0.55.1 we were running some 0.4X version, so we upgraded and jumped over many versions. This issue didn't exist on 0.4X, so maybe it's some new feature?

[Screenshot 2017-04-26 11:52:27]
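(As an aside on the scheduled-task question above: assuming the standard synced-cron history collection referenced later in this thread, you can list the scheduled jobs Rocket.Chat has been running with a quick mongo-shell query. Collection name is taken from the queries further down, not verified against every version.)

// List the distinct scheduled jobs recorded in the cron history collection
db.rocketchat_cron_history.distinct("name")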

@BirknerAlex do you have a more exact version number that you jumped from?

Good news is I think we're seeing this also. Database size shouldn't really have an impact.

Okay, I checked and the previous version was much older: we were running 0.30 before, then upgraded to 0.55 and then to 0.55.1.

But it's great if you're on a good path to finding this memory (leak?). 👍

Agreed it is good news that progress is being made

@geekgonecrazy how can I adjust the max-old-space-size parameter if I am running Rocket.Chat in a snap container?

@teamfighter once I get #6844 done this will be a possibility in the snap

I don't know if this helps, but it seems that it still happens almost every 24 hours.

[Screenshot: memory usage graph]

@BirknerAlex that looks like a fantastic clue. So for sure related to something happening at least once a day. From the look of it, pretty much the exact same time of day each time, at least from the 4th to the 8th.

If you can get a more precise time from your graph :crossed_fingers: then you could query mongo with something like:

db.getCollection('rocketchat_cron_history').find({"startedAt": {$gte: new ISODate("2017-05-08 14:00:00"), $lt: new ISODate("2017-05-08 15:00:00")}})

Swapping in your own time ranges, of course. This would help determine exactly what was running during that time and help us locate the issue.

Oh I think we hit the jackpot.

Database timestamps are in UTC. Our CMK graph timestamps are in UTC+2 because we're located in Germany. So the crash started yesterday at ~9:55 UTC:

[Screenshot 2017-05-10 10:43:54]

And look what we've got there. I've queried all cron jobs from the last two days, and there are two which fit perfectly in time and have no finishedAt timestamp -> "Cleanup OEmbed cache".

root@chat:~# mongo
MongoDB shell version: 2.6.10
connecting to: test
> use rocketchat
switched to db rocketchat
> db.getCollection('rocketchat_cron_history').find({"startedAt": {$gte: new ISODate("2017-05-09 11:40:00"), $lt: new ISODate("2017-05-09 12:20:00")}})
{ "_id" : "erNzWNZKzqjrz5rmL", "intendedAt" : ISODate("2017-05-09T12:07:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T12:07:00.398Z"), "finishedAt" : ISODate("2017-05-09T12:07:00.933Z"), "result" : null }
> db.getCollection('rocketchat_cron_history').find({"startedAt": {$gte: new ISODate("2017-05-07 00:00:00"), $lt: new ISODate("2017-05-09 12:20:00")}})
{ "_id" : "pPnHK7Mrkgvx8yBqt", "intendedAt" : ISODate("2017-05-08T09:40:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-08T09:40:00.166Z") }
{ "_id" : "j7RY6TTzx4C6uqLfv", "intendedAt" : ISODate("2017-05-08T09:40:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T09:40:00.168Z"), "finishedAt" : ISODate("2017-05-08T09:40:00.977Z"), "result" : null }
{ "_id" : "jZWGKkc33TYKh4Xxb", "intendedAt" : ISODate("2017-05-08T10:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T10:55:00.545Z"), "finishedAt" : ISODate("2017-05-08T10:55:01.131Z"), "result" : null }
{ "_id" : "zSmdChsTCTXgcFPyQ", "intendedAt" : ISODate("2017-05-08T11:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T11:55:00.143Z"), "finishedAt" : ISODate("2017-05-08T11:55:00.637Z"), "result" : null }
{ "_id" : "4GrbBNdjPnkAjBtCN", "intendedAt" : ISODate("2017-05-08T12:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T12:55:00.660Z"), "finishedAt" : ISODate("2017-05-08T12:55:01.159Z"), "result" : null }
{ "_id" : "9XCugYPbT2Qz4vskL", "intendedAt" : ISODate("2017-05-08T13:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T13:55:00.171Z"), "finishedAt" : ISODate("2017-05-08T13:55:00.695Z"), "result" : null }
{ "_id" : "NScFoij46x9TpP4ZH", "intendedAt" : ISODate("2017-05-08T14:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T14:55:00.722Z"), "finishedAt" : ISODate("2017-05-08T14:55:01.253Z"), "result" : null }
{ "_id" : "cmx3Csw6EBouEcx9P", "intendedAt" : ISODate("2017-05-08T15:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T15:55:00.267Z"), "finishedAt" : ISODate("2017-05-08T15:55:01.069Z"), "result" : null }
{ "_id" : "kkWqsr7w2MvKBsZDf", "intendedAt" : ISODate("2017-05-08T16:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T16:55:00.081Z"), "finishedAt" : ISODate("2017-05-08T16:55:00.633Z"), "result" : null }
{ "_id" : "zdAxvxF4PkvvsEbaJ", "intendedAt" : ISODate("2017-05-08T17:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T17:55:00.654Z"), "finishedAt" : ISODate("2017-05-08T17:55:01.144Z"), "result" : null }
{ "_id" : "qSAmR6D7XeT5DstxE", "intendedAt" : ISODate("2017-05-08T18:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T18:55:00.160Z"), "finishedAt" : ISODate("2017-05-08T18:55:00.657Z"), "result" : null }
{ "_id" : "Wh4ZjGyENXpBY6siv", "intendedAt" : ISODate("2017-05-08T19:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T19:55:00.682Z"), "finishedAt" : ISODate("2017-05-08T19:55:01.183Z"), "result" : null }
{ "_id" : "CMT6z6ZgqfNkJccLM", "intendedAt" : ISODate("2017-05-08T20:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T20:55:00.198Z"), "finishedAt" : ISODate("2017-05-08T20:55:00.764Z"), "result" : null }
{ "_id" : "swjpJiFFN55CiKbnb", "intendedAt" : ISODate("2017-05-08T21:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T21:55:00.788Z"), "finishedAt" : ISODate("2017-05-08T21:55:01.301Z"), "result" : null }
{ "_id" : "ch6NQyCifxuYcvymn", "intendedAt" : ISODate("2017-05-08T22:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T22:55:00.315Z"), "finishedAt" : ISODate("2017-05-08T22:55:00.957Z"), "result" : null }
{ "_id" : "KGfnpFGrBJaaDJeYp", "intendedAt" : ISODate("2017-05-08T23:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-08T23:55:01.077Z"), "finishedAt" : ISODate("2017-05-08T23:55:01.582Z"), "result" : null }
{ "_id" : "7JMPcNL8qCf5PPFF9", "intendedAt" : ISODate("2017-05-09T00:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T00:55:00.593Z"), "finishedAt" : ISODate("2017-05-09T00:55:01.211Z"), "result" : null }
{ "_id" : "HhribBKbi7z3GD76c", "intendedAt" : ISODate("2017-05-09T01:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T01:55:00.226Z"), "finishedAt" : ISODate("2017-05-09T01:55:00.723Z"), "result" : null }
{ "_id" : "auQ8kJhdwGajy9h2M", "intendedAt" : ISODate("2017-05-09T02:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T02:55:00.744Z"), "finishedAt" : ISODate("2017-05-09T02:55:01.288Z"), "result" : null }
{ "_id" : "cniP2gcFrjbcN3hxS", "intendedAt" : ISODate("2017-05-09T03:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T03:55:00.306Z"), "finishedAt" : ISODate("2017-05-09T03:55:00.812Z"), "result" : null }
{ "_id" : "HK4CcH7gBdFPKCcGY", "intendedAt" : ISODate("2017-05-09T04:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T04:55:00.836Z"), "finishedAt" : ISODate("2017-05-09T04:55:01.355Z"), "result" : null }
{ "_id" : "EmgZ3EW7ZcFkLf6rf", "intendedAt" : ISODate("2017-05-09T05:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T05:55:00.364Z"), "finishedAt" : ISODate("2017-05-09T05:55:00.898Z"), "result" : null }
{ "_id" : "FCSGsrx3a4AyNfmAA", "intendedAt" : ISODate("2017-05-09T06:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T06:55:00.919Z"), "finishedAt" : ISODate("2017-05-09T06:55:01.417Z"), "result" : null }
{ "_id" : "uzXXKYrmEimmFJyce", "intendedAt" : ISODate("2017-05-09T07:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T07:55:00.426Z"), "finishedAt" : ISODate("2017-05-09T07:55:00.982Z"), "result" : null }
{ "_id" : "X4b4vs34w9LFvE7zx", "intendedAt" : ISODate("2017-05-09T08:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T08:55:01.004Z"), "finishedAt" : ISODate("2017-05-09T08:55:01.552Z"), "result" : null }
{ "_id" : "KBTktofkjmJhvW6gp", "intendedAt" : ISODate("2017-05-09T09:55:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-09T09:55:00.003Z") }
{ "_id" : "SG5TS2nTqbDaRJtWf", "intendedAt" : ISODate("2017-05-09T09:55:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T09:55:00.653Z"), "finishedAt" : ISODate("2017-05-09T09:55:01.483Z"), "result" : null }
{ "_id" : "rpbN2o6ByXZrZbiqo", "intendedAt" : ISODate("2017-05-09T11:07:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T11:07:00.892Z"), "finishedAt" : ISODate("2017-05-09T11:07:01.388Z"), "result" : null }
{ "_id" : "erNzWNZKzqjrz5rmL", "intendedAt" : ISODate("2017-05-09T12:07:00Z"), "name" : "Generate and save statistics", "startedAt" : ISODate("2017-05-09T12:07:00.398Z"), "finishedAt" : ISODate("2017-05-09T12:07:00.933Z"), "result" : null }

//EDIT: After hitting the "Clear cache now" button in the Admin Panel, it takes just 2-3 minutes to crash the instance. You can see the memory usage growing. (We still have the 4 GB Node.js memory limit set.) "Embed cache expiration days" is set to 30 days.

I have no idea if this is really the reason, but we were running a version (0.30) before that didn't have this feature, which means nothing was ever cleaned up before. Maybe clearing everything except the last 30 days is too much?

db.rocketchat_oembed_cache.count()
24000

I just caught up on the messages in this issue. I ran into something similar because I noticed my instances were having to restart when listing the pm2 instances:

App name      | id | mode | pid   | status | restart | uptime | cpu | mem      | watching
--------------|----|------|-------|--------|---------|--------|-----|----------|---------
rocket.chat.0 | 0  | fork | 7781  | online | 5       | 16h    | 0%  | 371.8 MB | disabled
rocket.chat.1 | 1  | fork | 32543 | online | 0       | 5D     | 0%  | 405.1 MB | disabled
rocket.chat.2 | 2  | fork | 7376  | online | 4       | 16h    | 0%  | 382.5 MB | disabled
rocket.chat.3 | 3  | fork | 6970  | online | 1       | 16h    | 0%  | 373.4 MB | disabled
2017-05-04 08:44:24 814: OEmbed duplicated record http://www.refreshyourcache.com/en/home/
2017-05-04 08:44:24 828: OEmbed duplicated record http://www.refreshyourcache.com/en/home/
2017-05-04 08:44:24 883: OEmbed duplicated record http://www.refreshyourcache.com/en/home/
2017-05-04 08:44:24 895: OEmbed duplicated record http://www.refreshyourcache.com/en/home/
2017-05-05 17:46:32 681: FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

Our server has 8 GB of memory. What would be an optimal value for --max_old_space_size?

And is there a solution to this? The difference seems to be that I am getting those "OEmbed duplicated record" messages.

@BirknerAlex Perfect! That does look like we've got a cause. Do you know if that many links are being posted between cleanup runs that it would need to clean up that much each time?

@richardwlu what version are you running? I want to say we fixed the duplicate issue a while back, but it's possible the duplicates never got cleaned up. I believe the easiest resolution to this issue is to just query the oembed collection directly and prune those records out manually.

@geekgonecrazy I am running on 0.55.0.

Can you provide an example of what you mean by pruning the records manually? What is OEmbed, and do you know why there would be a duplicated record?

@geekgonecrazy Nope, I don't know what people are doing in direct messages or private channels :( But in the public channels I see around ~50 links posted each day.

But maybe the reason is that the cache is too big because it has never been cleaned up before?

@richardwlu so the oembed cache is stored in the collection called rocketchat_oembed_cache, and the key is the URL. So you could run a query like db.getCollection('rocketchat_oembed_cache').find({_id: "https://google.com"}) to find those duplicates.

I honestly don't think the duplicates are what's causing yours to crash. I think it's like @BirknerAlex said: it's crashing before it finishes, so the cache never actually gets cleared out.

So if you want to take care of this manually:

db.getCollection('rocketchat_oembed_cache').drop()
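(If you would rather prune old entries than drop the whole collection, here is a minimal mongo-shell sketch. It assumes the cache documents carry an updatedAt timestamp, which you can confirm first with findOne(); adjust the field name if your documents differ.)

// Remove cached OEmbed entries older than `days` days.
// The updatedAt field name is an assumption - check a sample document first.
var days = 30;
var cutoff = new Date(Date.now() - days * 24 * 60 * 60 * 1000);
db.rocketchat_oembed_cache.remove({ updatedAt: { $lt: cutoff } });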

@geekgonecrazy If I drop the collection, it's safe to say RC will remake it then? Would I need to restart?

Is there a way to prevent the RC cron job that clears the cache from running, so we can at least avoid this crash?

@richardwlu yes it will rebuild. As far as disabling it.. I don't think we have a way of turning it off right now.

This is related a bit to: #6894

@rodrigok we've worked through and determined this huge spike in memory usage seems to come from oembed cron job cleaning up. Thoughts?

Just FYI:

RC: v0.55.0
node: v4.5
mongodb: v3.2.9

db.serverStatus().storageEngine
{
    "name" : "wiredTiger",
    "supportsCommittedReads" : true,
    "persistent" : true
}
db.rocketchat_oembed_cache.count()
58257

Noticed the same thing as above, in that there is no finishedAt timestamp.

001-rs:PRIMARY> db.rocketchat_cron_history.find({"name":"Cleanup OEmbed cache"},{})
{ "_id" : "dARtv2idzgTegQzTp", "intendedAt" : ISODate("2017-05-08T23:46:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-08T23:46:00Z") }
{ "_id" : "ZZC3iqfzQX72MK52g", "intendedAt" : ISODate("2017-05-08T23:48:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-08T23:48:00.002Z") }
{ "_id" : "fyQhmDxi9jTXaxsdk", "intendedAt" : ISODate("2017-05-09T23:46:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-09T23:46:00.002Z") }
{ "_id" : "ck8FAvMLNNx6bfx8Z", "intendedAt" : ISODate("2017-05-09T23:47:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-09T23:47:00.001Z") }
{ "_id" : "6udxBivFkzF7zyib5", "intendedAt" : ISODate("2017-05-09T23:48:00Z"), "name" : "Cleanup OEmbed cache", "startedAt" : ISODate("2017-05-09T23:48:00Z") }

@richardwlu yup, looks like the same thing in your case. It just never finished the operation.

@rodrigok
Doesn't look like the query should be super expensive on memory...
https://github.com/RocketChat/Rocket.Chat/blob/develop/server/methods/OEmbedCacheCleanup.js#L12
which then calls ->
https://github.com/RocketChat/Rocket.Chat/blob/develop/packages/rocketchat-oembed/server/models/OEmbedCache.js#L28

Shouldn't be memory intensive.
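(One way to sanity-check that from the mongo shell is to run the equivalent find with explain() and see whether the date filter triggers a full collection scan. The updatedAt field name is my assumption based on the linked model code; verify it against a sample document.)

// Rough equivalent of the cleanup's filter; explain() shows whether it
// scans the whole collection or can use an index.
var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
db.rocketchat_oembed_cache.find({ updatedAt: { $lte: cutoff } }).explain()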

I have this same issue every evening, during which our 1GB server is so strained that nginx gives up and we experience some downtime. It seems to be getting more severe each night.

Is it safe to assume that, until the issue is resolved, we just need to truncate/drop the rocketchat_oembed_cache collection periodically? I.e., by running db.getCollection('rocketchat_oembed_cache').drop()?

@alexweissman I think it's for sure safe to drop that collection.

@alexweissman Did you get any side effects after dropping that collection?

@BirknerAlex I have dropped the collection while the app was running with no side effects. The collection will automatically be made again on the next hyperlink with no issues.

@BirknerAlex I have not noticed any problems either since dropping the collection. Everyone, feel free to stop by my chat server (https://chat.userfrosting.com) if you want to talk about this more!

For sure a band-aid. But glad to hear that it's at least relieving your issues for now.

Yup, thanks to @geekgonecrazy and the other maintainers for creating such a great product! If there is any other information I can provide to help solve this problem more permanently, please let me know.

@geekgonecrazy

It's also working for me now, after deleting the oembed cache. 👍

I also dropped it manually - I have a channel with over 30k messages that are just links (a GitHub bot reporting about new things).

The workaround is working here too; finally no more daily crashes. :)

Uptime: 4 days 🎉 🍾

This appears to possibly be an underlying issue with Meteor: https://github.com/meteor/meteor/issues/8405

@hans-strudle interesting find. Although above it looks like one instance had only 58k records and this still happened.

Maybe we need to chunk up the delete?

@rodrigok @sampaiodiego what do you guys think?

Interesting.. but it looks like MongoDB doesn't have a way to limit the delete command.. :thinking:
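(Chunking can still be approximated client-side from the mongo shell by deleting batches of _ids - a sketch only, again assuming an updatedAt field on the cache documents.)

// Delete expired cache entries in batches of 500 so each remove stays small.
var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
var ids;
do {
  ids = db.rocketchat_oembed_cache
    .find({ updatedAt: { $lt: cutoff } }, { _id: 1 })
    .limit(500)
    .toArray()
    .map(function (doc) { return doc._id; });
  if (ids.length) {
    db.rocketchat_oembed_cache.remove({ _id: { $in: ids } });
  }
} while (ids.length);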

Closing, as the issue inside of Meteor has been closed due to something being fixed in Meteor 1.6: https://github.com/meteor/meteor/issues/8405#issuecomment-340795341

Feel free to comment and we'll open this back up if people are still having issues on the latest versions (as of the date of this comment).

This is still an issue in 0.62.2: just change Admin > Message > Embed Cache Expiration Days and watch memory spike until it crashes. Running in a Docker container, if that matters.

How many days did you change it to? How many oembed records? We definitely need more details.

I think it was 5 days? I'm not sure it matters how many days I changed it to; just changing the value causes the cache-clearing functionality to run, which causes the memory leak, which causes a crash.

As for number of oembed messages, no idea. Where does it say? We have 120,000 messages total.

I think you'd have to hop into the database to get the oembed count.

db.rocketchat_oembed_cache.find().length()
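(Side note: count() asks the server for the total instead of pulling the documents into the shell, so it's a cheaper way to get the same number.)

db.rocketchat_oembed_cache.count()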

I assume CPU peaks along side the memory usage?

How hard is mongo getting pegged at this time?

Is it eating through all system memory before crashing? Or how much memory does Rocket.Chat itself use when it crashes? Does mongo memory usage increase too?

Also, if you feel up to it, it would be great to see if there are any long-running mongo queries.
Running tail -f /var/log/mongodb/mongod.log prior to triggering the oembed cleanup should give a clue. If new entries pop out, those should be related.

Sorry for all the questions. I just really need to get as complete a picture as possible of what's going on, to narrow in on what we need to optimize.

My knowledge of JS and Mongo is virtually zero.

Memory goes up and up until RC crashes and/or something kills it. If I recall correctly, CPU is also maxed out. I'm not sure what mongo was doing at the time.

Right, so if both Rocket.Chat and mongo are on the same host, there can be two culprits: either we run too heavy a query on mongo and mongo eats all the memory while it does its thing, or we do something on the Rocket.Chat side that leaks memory. So it would just be good to know what the questions above tell us :slightly_smiling_face:

It's definitely Rocket.Chat leaking memory, as it was running in its own container. Mongo was in a different one.

Containers alone don't mean anything in this situation. Containers are just process isolation. Unless the container is limiting memory, for this question the effect would be the same as if you had installed it manually.

If you run something like top when it happens you should be able to see the processes and their memory usage.

Rancher has a pretty graph of memory usage going through the roof.

This has hit us too, and only now have we found that this is the reason.
Our cluster has been choking and getting auto-rebooted every morning for months, right after the "Cleanup OEmbed cache" task started.
Using the commands pasted here, we found we had 84K cached entries. Also, all the RC instances ran the job, which makes no sense to me at all, as it's supposed to clean a DB collection.
FYI, after the cleanup there was no lockup, but going by the cron history, the task took 16-25 seconds per 100 cache entries for every RC instance!!!
As our forced reboot starts killing things after 30 seconds of high activity, we're changing the expiration value from 30 days to 2 days, and will change it to 1 day if that is still too much.
Also, as this value is hidden (it has no UI), here's the command I used:
db.rocketchat_settings.updateOne({'_id': 'API_EmbedCacheExpirationDays'}, {$set: {'packageValue': 2, value: 2}})
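(To verify the change took, you can read the setting back in the same shell; collection and _id are the same ones used in the command above.)

db.rocketchat_settings.findOne({ _id: 'API_EmbedCacheExpirationDays' })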

I made a mistake reading the startedAt/finishedAt values - the difference was in milliseconds. Still, 30 days may just be too long to keep this data.

Hello,
I dropped that cache collection and now embed previews are not working for me at all. Could you help me out with this somehow? I started debug logging but I see nothing suspicious.

Thanks a lot!

