Moleculer: optimize HEARTBEAT performance

Created on 30 Oct 2018  ·  21Comments  ·  Source: moleculerjs/moleculer

Is your feature request related to a problem? Please describe.
When have lots of nodes, PING/PONG/HEARTBEAT take too much CPU usage. As i understand it is N*(N-1) cost(maybe i'm wrong about the count).
We upgrade using moleculer on our production machines today, about 200+ node process(we have lots of nodes for Websocket clients connection). Find that gnatsd use 70%~90%CPU from Linux top command. I guess its about 8000 r/s just for PING/PONG/HEARTBEAT. After we change pingInterval from 5s to 500s, the gnatsd CPU usage reduce to 1%~5%

Describe the solution you'd like
Some solution resolve N*(N-1) PING/PONG problem.

Solution Ⅰ: Can disable PING/PONG/HEARTBEAT, use request timeout to detect node disconnecting. If one node broker.call timeout then choose another service node to finish request, but mark that timeout one as MayFailure, if timeout 2~3 times then mark as Failure. Wait the Failure restart and broadcast info then mark as Active.

Solution Ⅱ: Change N*(N-1) to N, some of transportor already has buildin SUB/PUB mechanism, like NATS. Just SUB the special broadcast event moleculer_keep_alive, each node just PUB this event on timer interval. It will somehow reduce the traffic.

Transporter Need reproduce Bug

Most helpful comment

Hello, I was wondering if node discovery could be done via websockets, full duplex, .. does that makes sense in any way? I don't know exactly all the aspects but heartbeat looks like long polling .. ?

Yes, its possible. The Gossip protocol work on TCP so... its possible to run over websockets.

Actually.... I did this here on my work. I'm just waiting my boss to authorize sharing the code (because i did this using my company's time). I did this last year, and forget xD. I will ask again and share the code.

All 21 comments

Nice write-up 👍

I suspect that this issue probably exists for HEARTBEAT packets as well (to some extent).

The PING/PONG is not sent automatically by the broker. It is a "user" command. Only send if you call the broker.ping function. _(And LatencyStrategy uses it to detect latencies between nodes)

The HEARTBEAT uses a simple topic what you described in solution 2. All nodes subscribe to only the MOL.HEARTBEAT topic and publish the own heartbeat to it, as well. The NATS will send the message to all subscribers.

Could you get a statistics from NATS by topics?

Seems the MOL.HEARTBEAT have strange behavior. I started 30 test nodes and empty APP layer call. It work well, the nats-top show like this:

NATS server version 1.3.0 (uptime: 1m0s)                                                                                                                
Server:
  Load: CPU:  0.0%  Memory: 13.7M  Slow Consumers: 0
  In:   Msgs: 647  Bytes: 586.9K  Msgs/Sec: 0.0  Bytes/Sec: 0       
  Out:  Msgs: 11.0K  Bytes: 2.3M  Msgs/Sec: 0.0  Bytes/Sec: 0         

Connections Polled: 30
  HOST                  CID    SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST ACTIVITY
  192.168.1.69:48412    14     20      0           378         22          80.7K       20.4K       node     1.0.1    59s      2018-10-31 13:38:42.041809301 +0
  192.168.1.69:48623    26     20      0           366         13          80.1K       2.6K        node     1.0.1    58s      2018-10-31 13:38:42.041809301 +0

But If i keep the 30 nodes running, and ONLY restart gnatsd few times. each time gnatsd restart, the traffic and CPU increase:

NATS server version 1.3.0 (uptime: 1m0s)                                                                                                                
Server:
  Load: CPU:  0.0%  Memory: 13.7M  Slow Consumers: 0
  In:   Msgs: 1.2K  Bytes: 1.1M  Msgs/Sec: 0.0  Bytes/Sec: 0        
  Out:  Msgs: 20.9K  Bytes: 2.2M  Msgs/Sec: 0.0  Bytes/Sec: 0         

Connections Polled: 30
  HOST                  CID    SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST ACTIVITY
  192.168.1.69:34444    26     20      0           704         23          95.4K       1.3K        node     1.0.1    58s      2018-10-31 13:39:52.050453283 +0
  192.168.1.69:34454    30     20      0           705         25          93.5K       3.3K        node     1.0.1    58s      2018-10-31 13:39:52.050453283 +0


NATS server version 1.3.0 (uptime: 1m0s)                                                                                                                
Server:
  Load: CPU:  1.0%  Memory: 13.7M  Slow Consumers: 0
  In:   Msgs: 1.6K  Bytes: 1.1M  Msgs/Sec: 36.0  Bytes/Sec: 2.0K    
  Out:  Msgs: 31.8K  Bytes: 2.8M  Msgs/Sec: 1078.7  Bytes/Sec: 60.8K  

Connections Polled: 30
  HOST                  CID    SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST ACTIVITY
  192.168.1.69:49802    8      20      0           1.1K        60          117.8K      49.5K       node     1.0.1    1m0s     2018-10-31 13:41:12.052607595 +0
  192.168.1.69:49823    28     20      0           1.1K        41          116.8K      11.9K       node     1.0.1    1m0s     2018-10-31 13:41:12.052607595 +0


NATS server version 1.3.0 (uptime: 1m0s)                                                                                                                
Server:
  Load: CPU:  1.0%  Memory: 13.7M  Slow Consumers: 0
  In:   Msgs: 1.9K  Bytes: 1.1M  Msgs/Sec: 6.0  Bytes/Sec: 347      
  Out:  Msgs: 41.5K  Bytes: 3.3M  Msgs/Sec: 179.8  Bytes/Sec: 10.2K   

Connections Polled: 30
  HOST                  CID    SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST ACTIVITY
  192.168.1.69:36978    29     20      0           1.4K        47          135.2K      2.7K        node     1.0.1    59s      2018-10-31 13:42:27.057375856 +0
  192.168.1.69:36976    27     20      0           1.4K        53          133.6K      14.5K       node     1.0.1    59s      2018-10-31 13:42:27.057375856 +0

NATS server version 1.3.0 (uptime: 1m0s)                                                                                                                
Server:
  Load: CPU:  2.0%  Memory: 13.7M  Slow Consumers: 0
  In:   Msgs: 2.3K  Bytes: 1.2M  Msgs/Sec: 29.9  Bytes/Sec: 1.7K    
  Out:  Msgs: 52.8K  Bytes: 4.1M  Msgs/Sec: 898.3  Bytes/Sec: 51.5K   

Connections Polled: 30
  HOST                  CID    SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST ACTIVITY
  192.168.1.69:52421    17     20      0           1.8K        81          158.0K      44.9K       node     1.0.1    59s      2018-10-31 13:43:44.262573118 +0
  192.168.1.69:52422    19     20      0           1.8K        80          158.0K      42.9K       node     1.0.1    59s      2018-10-31 13:43:44.262573118 +0

I write some code just connect NATS directly without moleculer. The traffic is empty:

NATS server version 1.3.0 (uptime: 1m9s)                                                                                                                
Server:
  Load: CPU:  0.0%  Memory: 11.7M  Slow Consumers: 0
  In:   Msgs: 0  Bytes: 0  Msgs/Sec: 0.0  Bytes/Sec: 0
  Out:  Msgs: 0  Bytes: 0  Msgs/Sec: 0.0  Bytes/Sec: 0

Connections Polled: 30
  HOST                  CID    SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST ACTIVITY
  192.168.1.69:52746    4      0       0           0           0           0           0           node     1.0.1    1m8s     2018-10-31 14:08:22.576751492 +0
  192.168.1.69:52754    12     0       0           0           0           0           0           node     1.0.1    1m8s     2018-10-31 14:08:22.578244986 +0
  192.168.1.69:52761    18     0       0           0           0           0           0           node     1.0.1    1m8s     2018-10-31 14:08:22.579326183 +0
  192.168.1.69:52768    24     0       0           0           0           0           0           node     1.0.1    1m8s     2018-10-31 14:08:22.580617047 +0
  192.168.1.69:52745    3      0       0           0           0           0           0           node     1.0.1    1m8s     2018-10-31 14:08:22.576606612 +0

The above statics somehow can explain my yesterday issue. I did restart gnatsd like 2~3 times. But i have 200+ nodes, then the traffic and CPU increase quickly. Then i change the heartbeatInterval from 5s to 500s, and restart all 200+ nodes, the traffic and CPU go peace.

I will try to reproduce and investigate...

I could reproduce. I started 100 nodes with

$ NODE_CLUSTER_WORKERS=100 TRANSPORTER=NATS node examples/loadtest/clients.js

It generates 1M msgs in 4 minutes.
But all subscriptions are correct. If heartbeat time is 5 sec, it means they generate 100 msgs in 5 sec => 20 msgs / sec. So 20 msgs /sec on the incoming side of NATS and NATS generates 20 * 100 msgs /sec outgoing messages. So it really generates traffic.

But currently, I have no idea how we can reduce this messages. Only with heartbeat timer increasing.

But that can't explain why gnatsd restart lead traffic increase. traffic should keep the same when gnatsd restart.

For reduce the messages, can we use solution Ⅰ ?

Solution 1 is same what circuit breaker function does. So if you increase the ping and enable circuit breaker you get the same result.

@icebob have you try to keep 100 nodes but only restart gnatsd few times ? Did the traffic increase?

I've tried, but I didn't experience the same issue. The CPU usage between 1%-10% after some restarting too.

nats-top can show IN/OUT for msg, each time gnatsd restart, it increased. But it should not.

~Can you try running an older version of gnatsd? Like versions before they have the improved DAG?~

Yep the explanation below makes more sense

I think the problem is that when you start nodes, it starts the heartbeat timer gradually. But if the transporters disconnect and reconnect heartbeat timers on all nodes starts at the same moments (when the connection is established). So all nodes will send heartbeats at the same time. And it can generate big traffic. 100 HB broadcast message generates 100k messages on NATS.

It makes sense, as on disconnection the timers are cleared and on connection timers are re-established.

Hello, I was wondering if node discovery could be done via websockets, full duplex, .. does that makes sense in any way? I don't know exactly all the aspects but heartbeat looks like long polling .. ?

Hello, I was wondering if node discovery could be done via websockets, full duplex, .. does that makes sense in any way? I don't know exactly all the aspects but heartbeat looks like long polling .. ?

Yes, its possible. The Gossip protocol work on TCP so... its possible to run over websockets.

Actually.... I did this here on my work. I'm just waiting my boss to authorize sharing the code (because i did this using my company's time). I did this last year, and forget xD. I will ask again and share the code.

Hello, we are considering to use moleculer in production but this issue is still the center of the discussion. To my understanding molculer uses heartbeats to track disconnected nodes and as @roytan883 mentions this scales N * N-1 which is a mayor problem.

We're using moleculer together with kubernetes and i'm wondering if there's a way to turn off moleculer's heartbeat and use the discovery and load balancing that kubernetes provides instead.

@Wallacy any update ?

I think the problem is that when you start nodes, it starts the heartbeat timer gradually. But if the transporters disconnect and reconnect heartbeat timers on all nodes starts at the same moments (when the connection is established). So all nodes will send heartbeats at the same time. And it can generate big traffic. 100 HB broadcast message generates 100k messages on NATS.

add random factor to send change state heartbeat

We found a problem in the heartbeat logic of Moleculer which has been fixed in v0.14.6 version. In brief, when the transporter client disconnected (unexpectedly), it doesn't trigger the $transporter.disconnected event, so the heartbeat timers weren't destroyed but when the connection was established again, the $transporter.connected event was triggered & timers were created again.
In addition, we found that the nats client (probably some others as well) is queuing the packets while the connection is broken and after it connects, sends all collected messages all at once, which can flood the NATS server and Moleculer nodes.

Both issues are fixed.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

darky picture darky  ·  5Comments

nurdism picture nurdism  ·  3Comments

maitrucquynhq111 picture maitrucquynhq111  ·  3Comments

SushKenyNeosoft picture SushKenyNeosoft  ·  3Comments

Kamil93 picture Kamil93  ·  3Comments