Is your feature request related to a problem? Please describe.
When have lots of nodes, PING/PONG/HEARTBEAT take too much CPU usage. As i understand it is N*(N-1) cost(maybe i'm wrong about the count).
We upgrade using moleculer on our production machines today, about 200+ node process(we have lots of nodes for Websocket clients connection). Find that gnatsd use 70%~90%CPU from Linux top command. I guess its about 8000 r/s just for PING/PONG/HEARTBEAT. After we change pingInterval from 5s to 500s, the gnatsd CPU usage reduce to 1%~5%
Describe the solution you'd like
Some solution resolve N*(N-1) PING/PONG problem.
Solution Ⅰ: Can disable PING/PONG/HEARTBEAT, use request timeout to detect node disconnecting. If one node broker.call timeout then choose another service node to finish request, but mark that timeout one as MayFailure, if timeout 2~3 times then mark as Failure. Wait the Failure restart and broadcast info then mark as Active.
Solution Ⅱ: Change N*(N-1) to N, some of transportor already has buildin SUB/PUB mechanism, like NATS. Just SUB the special broadcast event moleculer_keep_alive, each node just PUB this event on timer interval. It will somehow reduce the traffic.
Nice write-up 👍
I suspect that this issue probably exists for HEARTBEAT packets as well (to some extent).
The PING/PONG is not sent automatically by the broker. It is a "user" command. Only send if you call the broker.ping function. _(And LatencyStrategy uses it to detect latencies between nodes)
The HEARTBEAT uses a simple topic what you described in solution 2. All nodes subscribe to only the MOL.HEARTBEAT topic and publish the own heartbeat to it, as well. The NATS will send the message to all subscribers.
Could you get a statistics from NATS by topics?
Seems the MOL.HEARTBEAT have strange behavior. I started 30 test nodes and empty APP layer call. It work well, the nats-top show like this:
NATS server version 1.3.0 (uptime: 1m0s)
Server:
Load: CPU: 0.0% Memory: 13.7M Slow Consumers: 0
In: Msgs: 647 Bytes: 586.9K Msgs/Sec: 0.0 Bytes/Sec: 0
Out: Msgs: 11.0K Bytes: 2.3M Msgs/Sec: 0.0 Bytes/Sec: 0
Connections Polled: 30
HOST CID SUBS PENDING MSGS_TO MSGS_FROM BYTES_TO BYTES_FROM LANG VERSION UPTIME LAST ACTIVITY
192.168.1.69:48412 14 20 0 378 22 80.7K 20.4K node 1.0.1 59s 2018-10-31 13:38:42.041809301 +0
192.168.1.69:48623 26 20 0 366 13 80.1K 2.6K node 1.0.1 58s 2018-10-31 13:38:42.041809301 +0
But If i keep the 30 nodes running, and ONLY restart gnatsd few times. each time gnatsd restart, the traffic and CPU increase:
NATS server version 1.3.0 (uptime: 1m0s)
Server:
Load: CPU: 0.0% Memory: 13.7M Slow Consumers: 0
In: Msgs: 1.2K Bytes: 1.1M Msgs/Sec: 0.0 Bytes/Sec: 0
Out: Msgs: 20.9K Bytes: 2.2M Msgs/Sec: 0.0 Bytes/Sec: 0
Connections Polled: 30
HOST CID SUBS PENDING MSGS_TO MSGS_FROM BYTES_TO BYTES_FROM LANG VERSION UPTIME LAST ACTIVITY
192.168.1.69:34444 26 20 0 704 23 95.4K 1.3K node 1.0.1 58s 2018-10-31 13:39:52.050453283 +0
192.168.1.69:34454 30 20 0 705 25 93.5K 3.3K node 1.0.1 58s 2018-10-31 13:39:52.050453283 +0
NATS server version 1.3.0 (uptime: 1m0s)
Server:
Load: CPU: 1.0% Memory: 13.7M Slow Consumers: 0
In: Msgs: 1.6K Bytes: 1.1M Msgs/Sec: 36.0 Bytes/Sec: 2.0K
Out: Msgs: 31.8K Bytes: 2.8M Msgs/Sec: 1078.7 Bytes/Sec: 60.8K
Connections Polled: 30
HOST CID SUBS PENDING MSGS_TO MSGS_FROM BYTES_TO BYTES_FROM LANG VERSION UPTIME LAST ACTIVITY
192.168.1.69:49802 8 20 0 1.1K 60 117.8K 49.5K node 1.0.1 1m0s 2018-10-31 13:41:12.052607595 +0
192.168.1.69:49823 28 20 0 1.1K 41 116.8K 11.9K node 1.0.1 1m0s 2018-10-31 13:41:12.052607595 +0
NATS server version 1.3.0 (uptime: 1m0s)
Server:
Load: CPU: 1.0% Memory: 13.7M Slow Consumers: 0
In: Msgs: 1.9K Bytes: 1.1M Msgs/Sec: 6.0 Bytes/Sec: 347
Out: Msgs: 41.5K Bytes: 3.3M Msgs/Sec: 179.8 Bytes/Sec: 10.2K
Connections Polled: 30
HOST CID SUBS PENDING MSGS_TO MSGS_FROM BYTES_TO BYTES_FROM LANG VERSION UPTIME LAST ACTIVITY
192.168.1.69:36978 29 20 0 1.4K 47 135.2K 2.7K node 1.0.1 59s 2018-10-31 13:42:27.057375856 +0
192.168.1.69:36976 27 20 0 1.4K 53 133.6K 14.5K node 1.0.1 59s 2018-10-31 13:42:27.057375856 +0
NATS server version 1.3.0 (uptime: 1m0s)
Server:
Load: CPU: 2.0% Memory: 13.7M Slow Consumers: 0
In: Msgs: 2.3K Bytes: 1.2M Msgs/Sec: 29.9 Bytes/Sec: 1.7K
Out: Msgs: 52.8K Bytes: 4.1M Msgs/Sec: 898.3 Bytes/Sec: 51.5K
Connections Polled: 30
HOST CID SUBS PENDING MSGS_TO MSGS_FROM BYTES_TO BYTES_FROM LANG VERSION UPTIME LAST ACTIVITY
192.168.1.69:52421 17 20 0 1.8K 81 158.0K 44.9K node 1.0.1 59s 2018-10-31 13:43:44.262573118 +0
192.168.1.69:52422 19 20 0 1.8K 80 158.0K 42.9K node 1.0.1 59s 2018-10-31 13:43:44.262573118 +0
I write some code just connect NATS directly without moleculer. The traffic is empty:
NATS server version 1.3.0 (uptime: 1m9s)
Server:
Load: CPU: 0.0% Memory: 11.7M Slow Consumers: 0
In: Msgs: 0 Bytes: 0 Msgs/Sec: 0.0 Bytes/Sec: 0
Out: Msgs: 0 Bytes: 0 Msgs/Sec: 0.0 Bytes/Sec: 0
Connections Polled: 30
HOST CID SUBS PENDING MSGS_TO MSGS_FROM BYTES_TO BYTES_FROM LANG VERSION UPTIME LAST ACTIVITY
192.168.1.69:52746 4 0 0 0 0 0 0 node 1.0.1 1m8s 2018-10-31 14:08:22.576751492 +0
192.168.1.69:52754 12 0 0 0 0 0 0 node 1.0.1 1m8s 2018-10-31 14:08:22.578244986 +0
192.168.1.69:52761 18 0 0 0 0 0 0 node 1.0.1 1m8s 2018-10-31 14:08:22.579326183 +0
192.168.1.69:52768 24 0 0 0 0 0 0 node 1.0.1 1m8s 2018-10-31 14:08:22.580617047 +0
192.168.1.69:52745 3 0 0 0 0 0 0 node 1.0.1 1m8s 2018-10-31 14:08:22.576606612 +0
The above statics somehow can explain my yesterday issue. I did restart gnatsd like 2~3 times. But i have 200+ nodes, then the traffic and CPU increase quickly. Then i change the heartbeatInterval from 5s to 500s, and restart all 200+ nodes, the traffic and CPU go peace.
I will try to reproduce and investigate...
I could reproduce. I started 100 nodes with
$ NODE_CLUSTER_WORKERS=100 TRANSPORTER=NATS node examples/loadtest/clients.js
It generates 1M msgs in 4 minutes.
But all subscriptions are correct. If heartbeat time is 5 sec, it means they generate 100 msgs in 5 sec => 20 msgs / sec. So 20 msgs /sec on the incoming side of NATS and NATS generates 20 * 100 msgs /sec outgoing messages. So it really generates traffic.
But currently, I have no idea how we can reduce this messages. Only with heartbeat timer increasing.
But that can't explain why gnatsd restart lead traffic increase. traffic should keep the same when gnatsd restart.
For reduce the messages, can we use solution Ⅰ ?
Solution 1 is same what circuit breaker function does. So if you increase the ping and enable circuit breaker you get the same result.
@icebob have you try to keep 100 nodes but only restart gnatsd few times ? Did the traffic increase?
I've tried, but I didn't experience the same issue. The CPU usage between 1%-10% after some restarting too.
nats-top can show IN/OUT for msg, each time gnatsd restart, it increased. But it should not.
~Can you try running an older version of gnatsd? Like versions before they have the improved DAG?~
Yep the explanation below makes more sense
I think the problem is that when you start nodes, it starts the heartbeat timer gradually. But if the transporters disconnect and reconnect heartbeat timers on all nodes starts at the same moments (when the connection is established). So all nodes will send heartbeats at the same time. And it can generate big traffic. 100 HB broadcast message generates 100k messages on NATS.
It makes sense, as on disconnection the timers are cleared and on connection timers are re-established.
Hello, I was wondering if node discovery could be done via websockets, full duplex, .. does that makes sense in any way? I don't know exactly all the aspects but heartbeat looks like long polling .. ?
Hello, I was wondering if node discovery could be done via websockets, full duplex, .. does that makes sense in any way? I don't know exactly all the aspects but heartbeat looks like long polling .. ?
Yes, its possible. The Gossip protocol work on TCP so... its possible to run over websockets.
Actually.... I did this here on my work. I'm just waiting my boss to authorize sharing the code (because i did this using my company's time). I did this last year, and forget xD. I will ask again and share the code.
Hello, we are considering to use moleculer in production but this issue is still the center of the discussion. To my understanding molculer uses heartbeats to track disconnected nodes and as @roytan883 mentions this scales N * N-1 which is a mayor problem.
We're using moleculer together with kubernetes and i'm wondering if there's a way to turn off moleculer's heartbeat and use the discovery and load balancing that kubernetes provides instead.
@Wallacy any update ?
I think the problem is that when you start nodes, it starts the heartbeat timer gradually. But if the transporters disconnect and reconnect heartbeat timers on all nodes starts at the same moments (when the connection is established). So all nodes will send heartbeats at the same time. And it can generate big traffic. 100 HB broadcast message generates 100k messages on NATS.
add random factor to send change state heartbeat
We found a problem in the heartbeat logic of Moleculer which has been fixed in v0.14.6 version. In brief, when the transporter client disconnected (unexpectedly), it doesn't trigger the $transporter.disconnected event, so the heartbeat timers weren't destroyed but when the connection was established again, the $transporter.connected event was triggered & timers were created again.
In addition, we found that the nats client (probably some others as well) is queuing the packets while the connection is broken and after it connects, sends all collected messages all at once, which can flood the NATS server and Moleculer nodes.
Both issues are fixed.
Most helpful comment
Yes, its possible. The Gossip protocol work on TCP so... its possible to run over websockets.
Actually.... I did this here on my work. I'm just waiting my boss to authorize sharing the code (because i did this using my company's time). I did this last year, and forget xD. I will ask again and share the code.