Rocket.chat: Issues starting after 3.5.1 upgrade

Created on 6 Aug 2020 · 15 comments · Source: RocketChat/Rocket.Chat

Description:

We successfully upgraded to 3.5.1, but when we later tried to restart Rocket.Chat it hung at SAUMonitor ➔ info [aggregate] - Start Cron. and progressed no further.

Since this was in Kubernetes, the liveness probe eventually killed the pod; each restart got stuck at the same point until, after 7 attempts, it finally progressed and started correctly. It's possible it would have succeeded with a longer liveness wait, but there's no way to verify that now, as the instance is currently fine.

See additional context for some background to why we restarted.

Steps to reproduce:

Not able to reproduce now, but it occurred on startup of a new Kubernetes pod.

Expected behavior:

Rocket.Chat starts up in a timely fashion.

Actual behavior:

The Rocket.Chat container starts, reaches a certain point in the logs (see logs below), and hangs until it is eventually killed by Kubernetes due to failing liveness/readiness probes.

Server Setup Information:

  • Version of Rocket.Chat Server: 3.5.1
  • Operating System: Official Docker image 3.5.1
  • Deployment Method: Docker in Kubernetes
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

  • Desktop App or Browser Version:
  • Operating System:

Additional context

One point to note: the reason we restarted was to enable push notifications, and I'm not sure whether there is any interaction here. We received the notice to enable them, actioned it accordingly, and then encountered this.

These are the liveness/readiness checks in Kubernetes, which have served us well for a few years:

        livenessProbe:
          httpGet:
            path: /home
            port: 3000
          initialDelaySeconds: 90
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /home
            port: 3000
          initialDelaySeconds: 30
          timeoutSeconds: 10
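
If a longer wait really was all it needed, the cleanest way to give it that headroom on newer Kubernetes versions would probably be a startupProbe, so the liveness probe only takes over once the app has come up. A sketch, reusing the same path and port (the 10 × 30s allowance is just an example value, not something we have tested):

        startupProbe:
          httpGet:
            path: /home
            port: 3000
          failureThreshold: 10
          periodSeconds: 30

With that in place, the liveness and readiness probes don't start counting until the startup probe has succeeded.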

Relevant logs:

kubectl -n xxxx logs rocketchat-7b4dc69ddd-4k77b
rocketchat:lib ➔ oauth_updated Accounts_OAuth_Facebook
...
...
rocketchat:lib ➔ oauth_updated Accounts_OAuth_Apple
Federation ➔ Setup.info Federation is disabled
Setting default file store to FileSystem
CAS ➔ info Disabling CAS login service
Search Logger ➔ info create search provider defaultProvider
Search Logger ➔ info create search provider chatpalProvider
{"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1596709208408},"level":"info"}
ufs: temp directory created at "/tmp/ufs"
SAUMonitor ➔ info [aggregate] - Start Cron.

Most helpful comment

A proper fix will land in release 3.5.2. A Docker image will be available in a few minutes, and a snap release will be available in a few hours.

All 15 comments

This is also happening on docker develop-80ba8cf. I don't have the extra debugging options @cjpluss2k18 has, but I can see that it is hanging at:

ufs: temp directory created at "/tmp/ufs"

What can we do to work around this problem?

Yep - looks like the snap package updated to 3.5.1, and after a restart rocketchat-server gets stuck at that same point. It isn't opening the listening TCP socket, so my reverse proxy gateway is responding with a 502 Bad Gateway error.

Edit: Reverting the snap to 3.4.2 doesn't seem to help either. The server starts and then exits immediately without any indication of an error.

Further edit: stracing the node process shows it is definitely talking to mongo at least - it looks vaguely like it is looping over rocket_notification_queue - but nothing else is going on.
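
For anyone who wants to poke at it the same way (the PID placeholder below is whatever your rocketchat node process is), attaching with

strace -f -p <node-pid> -e trace=network

will limit the output to socket-related syscalls, which makes it fairly easy to see when the only thing happening is the mongo chatter.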

I have exactly the same issue after upgrading to 3.5.1. The first restart of the docker containers after upgrading worked, but on the second restart it gets stuck at ufs: temp directory created at "/tmp/ufs".

Same as @cjpluss2k18: I restarted a second time to enable push notifications, but my Rocket.Chat instance still hasn't started after 7 minutes (... or half an hour, for that matter) :/

EDIT: Now it progresses to this point and gets stuck here:

rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | Setting default file store to GridFS
rocketchat_1          | {"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1597326036694},"level":"info"}
rocketchat_1          | ufs: temp directory created at "/tmp/ufs"
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | Setting default file store to GridFS
rocketchat_1          | {"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1597326196454},"level":"info"}

@sampaiodiego could we please get this triaged? Seems like a breaking issue :)

how many times does this have to happen

same here

Well, this is quite nasty, since the system worked for a while after upgrading. Now I'm forced to revert to an older backup from 2 days ago... which means data loss for everyone.. \o/

Same problem here since our snap installation automatically upgraded to 3.5.1 during the night. I've tried to revert to a previous snap but the same issue persists. Still stuck at the SAUMonitor line.
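
(In case anyone else wants to try the same: reverting a snap to its previously installed revision is just sudo snap revert rocketchat-server - though, as above, the hang is still there afterwards.)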

For those curious - I popped into the community Rocket.Chat server snap channel, and based on the conversation going on there it appears that this problem is caused by the Rocket.Chat server attempting to reach out to cloud.rocket.chat, which currently appears to be non-responsive in some fashion?
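
If you want to check from your own box whether that endpoint is answering at all - just probing the hostname here, I don't know the exact API paths the server calls - something like

curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 https://cloud.rocket.chat/

should either print a status code within a few seconds or give up after 10, which is basically the same wait the server is hitting, only with a cap on it.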

Adding a firewall rule to block the outbound traffic - iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT - appears to allow my instance to start normally, but presumably things like push notifications won't work.

Obviously, I don't recommend using that as anything other than a work-around.
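
Once a fixed release is out (or the cloud service is responding again), the rule can be dropped with the matching delete, iptables -D OUTPUT -d 51.81.11.138/24 -j REJECT, so the server can reach the cloud services normally again.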

Adding a firewall rule to block the outbound traffic - iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT - appears to allow my instance to start normally, but presumably things like push notifications won't work.

This work-around does work!
Will continue to follow this thread for an actual solution.

A proper fix will land in release 3.5.2. A Docker image will be available in a few minutes, and a snap release will be available in a few hours.

I wondered why my server started right up on my test instance, which is not joined to the Rocket.Chat cloud or connected to its services. Makes perfect sense. @sampaiodiego, I noticed the PR you merged only defers the startup checks; it doesn't seem to address the issue of trying to access cloud.rocket.chat when it is unavailable. Is this something that needs to be handled as well?

My Rocket.Chat server (running on a CentOS Linux release 7.8.2003 (Core) virtual machine with the latest updates) just crashed today. I applied all OS updates and upgraded MongoDB, NodeJS and Rocket.Chat to the latest versions, but without success.
I ran
iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT
systemctl restart rocketchat
and the server is now online.

@bkraul it's not a case of the service being unavailable, but of the service not replying and keeping the connection open, which stalls the startup. We didn't realize those requests were making the startup wait, so the first solution was to move them to async.

This PR changes the HTTP calls to always have a forced timeout of 20s, preventing those connections from hanging forever.

We are fixing our cloud services to enforce a timeout as well, to prevent this from happening again on old versions of Rocket.Chat or anywhere else using the cloud APIs.

Unfortunately, this was the first time we had this kind of problem; it seems to be related to our services and their database connection making the requests wait forever. We are implementing all the necessary changes to prevent this from happening again.

Our cloud services have been back to their normal response times for a few minutes now, so if you restart your servers they should be back to working again.

Thanks

@rodrigok Thanks for the update. Sounds painful. :(
