Your Rocket.Chat version: 0.37.1
After upgrading from 0.35.0 to 0.37.1 I need to constantly refresh the client in order to receive the latest messages. The socket seems to be okay and pushes/receives ping/pong requests, but not the messages.

Any ideas what's wrong? Is it possible to somehow downgrade to 0.35.0, because just changing the version back doesn't help?
Occasionally after a refresh I will receive messages for some time, but then it will stop and again a refresh will be required.
I tried setting the log level to the highest, then compared what happens when a message is sent and received against when it is sent but not received on the other side. The logs are identical in both cases and look like this:
Meteor ➔ method stream-notify-room -> userId: B2fHx9bd369Xe4usE , arguments: { '0': 'B2fHx9bd369Xe4usEJbSwAEWLmW4ZQ8gg7/typing',
'1': 'Saulius.Nevys',
'2': true }
Meteor ➔ method stream-notify-room -> userId: B2fHx9bd369Xe4usE , arguments: { '0': 'B2fHx9bd369Xe4usEJbSwAEWLmW4ZQ8gg7/typing',
'1': 'Saulius.Nevys',
'2': false }
Meteor ➔ method sendMessage -> userId: B2fHx9bd369Xe4usE , arguments: { '0':
{ _id: '8MmBFYhvz6tTK8xXn',
rid: 'B2fHx9bd369Xe4usEJbSwAEWLmW4ZQ8gg7',
msg: 'a\n' } }
Meteor ➔ method canAccessRoom -> userId: B2fHx9bd369Xe4usE , arguments: { '0': 'B2fHx9bd369Xe4usEJbSwAEWLmW4ZQ8gg7',
'1': 'B2fHx9bd369Xe4usE' }
We're seeing this as well, we're sticking with our current version until this is resolved.
@lochiiconnectivity
I was told that the issue was caused by renaming the channel "#general". Once we renamed it back to #general, everything started working correctly. I'm not sure, though; maybe it was a coincidence.
We haven't renamed it at all, I'm afraid.
We have the same issue here with version 0.39 when running Rocket.Chat behind a load balancer; with only one instance everything works fine.
We are currently trying to fix it by configuring our load balancer to support websockets correctly.
We are also in a load-balanced environment. What do you mean by fixing your load balancer? What do you think is wrong with it? Is it currently working? Did something change in 0.39.0 that makes you feel you now have to do this?
We thought it had something to do with our load balancer configuration (a sticky-session or websocket issue), but it appeared with version 0.38 or 0.37, so it must be something in those releases combined with the fact that we are load balancing Rocket.Chat, because everything works fine without it.
We just went into production with nginx as the load balancer, fronting 12 instances of RC version 0.42.0 on 3 different servers. My users started contacting me about this specific issue. I can restart nginx and it's fine for a period of time. Since the last comment was 28 days ago, did someone find a way to resolve this? I've since changed my nginx config to point at one server with 4 RC instances, but there seem to be issues there too, so I've now gone down to a single RC instance. In nginx I had both least_conn and ip_hash; I removed least_conn and kept ip_hash. Does anyone have some standards in this area, or advice?
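For reference, a minimal sketch of the kind of websocket-aware, sticky-session nginx configuration being discussed. The upstream addresses, ports, and timeout are illustrative assumptions, not a confirmed fix for this issue:

```nginx
# Sketch only: addresses, ports and timeout are illustrative.
upstream rocketchat {
    ip_hash;                      # sticky sessions keyed on client IP
    server 10.0.0.5:3000;
    server 10.0.0.5:3001;
    server 10.0.0.6:3000;
}

server {
    listen 80;
    server_name chat.example.com;

    location / {
        proxy_pass http://rocketchat;
        proxy_http_version 1.1;                      # required for websockets
        proxy_set_header Upgrade $http_upgrade;      # pass the upgrade handshake
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 86400;                    # avoid idle websocket drops
    }
}
```

ip_hash keeps each client pinned to one backend, which matters for DDP sessions; the long read timeout prevents nginx from closing websockets that sit idle between messages.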
Hey, we're running a similar setup to yours, @drickerusa, and ran into this issue when upgrading from 0.37.1 to 0.39.0. As a result we rolled back and have been stuck on 0.37.1 with no upgrade path for the last couple of months. Keen for an update from the team on this; even a suggestion of where to look to troubleshoot or fix the problem would be a start.
@Jamie-Owen I pared down to one server running 5 instances and used this in my rocketchat.conf file created by the forever-service: env INSTANCE_IP=. This variable is also referenced in "Multiple instances of nodejs fails to propagate messages #4019". I put it on all of my machines, but I'm not sure it really matters, because the issue still manifested itself on multiple machines with multiple instances. I need to do some testing in my staging environment, but first I need to set up another machine for that.
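For context, a hedged sketch of where that variable sits in a forever-service generated upstart conf. All values here are illustrative assumptions; the key point is that INSTANCE_IP must be an address the other machines can actually reach, never 127.0.0.1:

```
# /etc/init/rocketchat.conf (generated by forever-service) - sketch only
env INSTANCE_IP=10.0.0.5    # reachable IP of this machine, not 127.0.0.1
env PORT=3000
env MONGO_URL=mongodb://localhost:27017/rocketchat
env ROOT_URL=http://chat.example.com
```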
I'm also running on multiple machines with INSTANCE_IP set, and I have also had reports of messages not propagating, but now it happens silently without a StreamBroadcast error being reported. Version 0.43.
I was seeing this same issue when we were on heroku. I believe it had something to do with the websocket connections idling out. See https://github.com/RocketChat/Rocket.Chat/issues/4589
I've since moved to Digital Ocean and installed with snap. Everything is working again and feels a lot tighter.
Just to clarify, this issue is related to deployments running multiple instances of RC behind a load balancer. We have done some work on this to facilitate the streaming across the multiple servers. We will give you more info shortly.
Thanks for the update @engelgabriel I'm currently using one server with 5 instances and this is performing well. Once the fix is in I'll add one of the other servers with more RC instances.
Thanks for the update @engelgabriel. Also thanks @drickerusa, I think we'll move to a similar set up as per your suggestion until 0.46.0 so we can try out some of the recent features.
https://github.com/RocketChat/Rocket.Chat.Docs/issues/87
Check that out. Maybe it can help.
Any update on this issue? I am running into the same problem when using nginx to load balance multiple RC docker containers.
We're seeing this impact notifications as well in our instance. Some users are not getting notifications while others do.
@tasmar @matt-jarrett have any of you checked the guide I posted above?
@TheReal1604 I have not had time to try it. I will work on it next week and post my results once I have run some tests.
Is this still an issue in our latest release? Keep in mind you should configure INSTANCE_IP as described here https://github.com/RocketChat/Rocket.Chat.Docs/issues/87 and machines should have access to each other.
After investigating this issue on the new version (0.46.0), we saw that messages sent from a client on one host are not available to clients connected to the same host but to a different instance of Rocket.Chat.
We did a little 'ping-pong' test using a ddp-client library connecting to every instance. The test by itself does not detect any lost messages (everything appears to work fine on the DDP side). Unfortunately, when you watch the test progress on a specific instance, you can't see any of the messages sent by the test client connected to your host.
Because the issue appeared in the 0.39 release, we looked at what changed ( https://github.com/RocketChat/Rocket.Chat/compare/0.37.1...0.39.0#diff-46440261ce942a3b5a369e56c2b0a0daR39 ) and this part looked wrong to us:
https://github.com/RocketChat/Rocket.Chat/blob/0.39.0/server/stream/streamBroadcast.coffee#L40
We are running Rocket.Chat on Docker, and localhost is not the Docker host, so a container cannot connect to other containers running on the same host. The assumption that instance.extraInformation.host == process.env.INSTANCE_IP should equate to 'localhost' breaks this setup.
@juanwolf Each Docker instance has its own IP, right? So that code will never be executed.
It exists just to replace the IP with localhost when the local and remote instances share the same IP.
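The substitution in question can be paraphrased as the following sketch (plain JavaScript with illustrative names, not the actual CoffeeScript source in streamBroadcast.coffee):

```javascript
// Sketch of the host-selection logic being discussed; names are illustrative.
function resolveInstanceUrl(instanceHost, instancePort, localIp) {
  // If a remote instance reports the same IP as this process, the code
  // assumes co-location and rewrites the target to localhost. Inside a
  // Docker container that assumption breaks: "localhost" is the container
  // itself, not the shared Docker host where the sibling instance lives.
  if (instanceHost === localIp) {
    return `localhost:${instancePort}`;
  }
  return `${instanceHost}:${instancePort}`;
}

console.log(resolveInstanceUrl('10.0.0.5', 3001, '10.0.0.5')); // localhost:3001
console.log(resolveInstanceUrl('10.0.0.6', 3002, '10.0.0.5')); // 10.0.0.6:3002
```

With one container per host the branch never fires (every instance has a distinct IP), which is why single-container-per-host deployments don't see the symptom.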
@juanwolf can you help us test it? Maybe give us access to your testing environment so we can take a look and try to understand it better? We really want to solve this issue.
@engelgabriel @rodrigok just fyi, after following the steps in my guide all is working fine in my 3 instances deployment. (3 separate instances with 1 docker container each for mongo and RC)
@rodrigok @engelgabriel We are running multiple Docker hosts, in turn, each host runs multiple Rocket.Chat containers, we pass the Docker Host IP in as the INSTANCE_IP environment variable and we set PORT too.
As we are running multiple containers on one host, the code to set instance = "localhost:#{port}" does run and this is the reason containers on the same Docker host can't talk to each other, but are able to talk fine with containers on another host. This is also the reason why @TheReal1604 doesn't experience these symptoms with a container per host (we are scaling both vertically and horizontally).
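To make that failing scenario concrete, here is a hedged illustration (image tag, ports, and host IP are all assumptions) of two containers on one Docker host that both register the Docker host's IP:

```shell
# Both containers report INSTANCE_IP=10.0.0.5 (the Docker host; illustrative).
docker run -d --name rc1 -p 3000:3000 \
  -e PORT=3000 -e INSTANCE_IP=10.0.0.5 \
  rocketchat/rocket.chat:0.42.0

docker run -d --name rc2 -p 3001:3001 \
  -e PORT=3001 -e INSTANCE_IP=10.0.0.5 \
  rocketchat/rocket.chat:0.42.0

# Because both instances share INSTANCE_IP, each rewrites the other's address
# to localhost:<port>; inside a container, localhost is the container itself,
# so the stream broadcast connection between rc1 and rc2 never succeeds.
```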
I removed the logic that I suspected was causing the problem in streamBroadcast.coffee, then built my own image, and communication seems fine between all instances, although I don't want to be so naive as to assume this will work for all deployment methods.
If you guys want to know more about this I'm happy to join you in your dev channel on the demo server.
@Jamie-Owen Oh, ok, you are using the host IP for your containers; you are not using a private network.
I was talking to @engelgabriel and we hadn't considered this scenario. I will add a check for whether the process is running in a Docker container to prevent this conversion to localhost.
@rodrigok Indeed, I was exploring the private network option as being a potential cause of the problem until I worked out how to reproduce the issue and looked around in the code.
Thanks for looking at this, and the quick PR 👍
@Jamie-Owen sounds interesting. Maybe Docker overlay networks would be a good fit for your environment.