Moleculer: metrics make request 10x times slower

Created on 3 Jul 2018  Â·  39Comments  Â·  Source: moleculerjs/moleculer

The issue scenario is :
apiGateway.rest -> auth.authorize->demo.welcome->demo.hello
every step is very simple, should only consider network cost.

If i enable metrics: true in each services process, use about 120ms total. Buf if i disable metrics, it only use about 10ms.

metrics internal use event broadcast ? Then is should not impact normal performance so badly.

Core Need reproduce

Most helpful comment

After long time digging and testing, i found the reason. I don't know its NATS or Nodejs bug. The Nodejs said noDelay defaults to true, but its not.

socket.setNoDelay([noDelay])
Disables the Nagle algorithm. By default TCP connections use the Nagle algorithm, they buffer data before sending it off. Setting true for noDelay will immediately fire off data each time socket.write() is called. noDelay defaults to true.

Just modify this file at: node_modules/nats/lib/nats.js

// Create the stream
    this.stream = net.createConnection(this.url.port, this.url.hostname);
    this.stream.setNoDelay(true)  // need call it, somehow default 'true' is not work

After manually add setNoDelay(true), it works fine on ubuntu:

call demo1.hello1 at:  1530870151475
call demo2.hello2 at:  1530870151477
call demo3.hello3 at:  1530870151478
call demo1.hello1 use time =  8
call demo1.hello1WithEvent startTime =  1530870151480
call demo1.hello1WithEvent at:  1530870151481
listen e1 at:  1530870151482
call demo2.hello2WithEvent at:  1530870151483
listen e2 at:  1530870151484
call demo3.hello3WithEvent at:  1530870151484
listen e3 at:  1530870151485
call demo1.hello1WithEvent endTime =  1530870151486
call demo1.hello1WithEvent use time =  6

test environment:

moleculer: 0.12.8
nodejs: 8.11.1
node-nats: 1.0.0
gnatsd: 1.2.0
system: ubuntu 16.04 server

All 39 comments

Each step duration looks good, only a few ms. But why the final apiGateway.rest count duration a long time:

INFO -> metrics traceStart:  { id: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 1,
  startTime: 1530602761235,
  remoteCall: false,
  action: { name: 'apiGateway.rest' },
  nodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '6ba3c7bc-a194-43c2-8f34-34380de6d60c',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761271,
  remoteCall: true,
  meta: {},
  action: { name: 'auth.authorize' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-auth-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceFinish:  { id: '6ba3c7bc-a194-43c2-8f34-34380de6d60c',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761271,
  endTime: 1530602761273.025,
  duration: 2.024912,
  remoteCall: true,
  fromCache: false,
  meta: {},
  action: { name: 'auth.authorize' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-auth-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761314,
  remoteCall: true,
  meta: {},
  action: { name: 'demo.welcome' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-demo-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '47682a4f-6c45-4c98-9130-9372dfecb748',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 3,
  startTime: 1530602761314,
  remoteCall: false,
  meta: {},
  action: { name: 'demo.hello' },
  parent: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  nodeID: 'rts-demo-dev-88-0' }
INFO -> metrics traceFinish:  { id: '47682a4f-6c45-4c98-9130-9372dfecb748',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 3,
  startTime: 1530602761314,
  endTime: 1530602761315.0862,
  duration: 1.086225,
  remoteCall: false,
  fromCache: false,
  meta: {},
  action: { name: 'demo.hello' },
  parent: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  nodeID: 'rts-demo-dev-88-0' }
INFO -> metrics traceFinish:  { id: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761314,
  endTime: 1530602761315.9263,
  duration: 1.92624,
  remoteCall: true,
  fromCache: false,
  meta: {},
  action: { name: 'demo.welcome' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-demo-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceFinish:  { id: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 1,
  startTime: 1530602761235,
  endTime: 1530602761354.5188,
  duration: 119.518744,
  remoteCall: false,
  fromCache: false,
  action: { name: 'apiGateway.rest' },
  nodeID: 'rts-apiGateway-dev-88-0' }

Which versions?

Btw, please use issue templates when you open a new issue.

I've just measured this demo with jMeter:
With metrics: false: ~10k req/sec
With metrics: true: ~8k req/sec

I think i found the problem. if i start the process which listen those "metrics" event(metrics.trace.span.start/finish), then whole system slow to "120ms" each request(the event handler is very simple, i even try empty function). if i stop the listen process, the whole system back to normal "10ms" each request.

And the listener is on the same node, or in a remote node?

a remote node, every service in this issue are running on a separate node process

Ok, I will investigate it.

by the way, transporter is "nats"

What I measured:
API gateway with metrics without listener: 8k req/sec
API gateway with metrics with a remote listener: 3k req/sec

_Btw, it was just a simple "test.hello" action._

it was just a simple "test.hello" action. a remote listener: 3k req/sec

yes, that's the problem. my scenario has apiGateway.rest -> auth.authorize->demo.welcome->demo.hello, more remote actions, much slower.

Maybe you should test its latency?

If you can make a repro repo.

you just make request like this:
client -> apiGateway(test.hello) -> test.hello -> ctx.call test1.hello1 -> ctx.call test2.hello2
place test/test1/test2 in separate node, and another node listen metrics

yes, I can create a repo. But do you need that ? you already measured that remote node listen metrics lead performance badly slower.

It's reasonable that metric function needs extra performance which increases the response time. Unfortunately, in your case, it's a huge difference because every action call generates 2 extra events what need to be transferred to another node which causes overhead. The bad news that I can't fix it currently because it is not a bug it's a design issue.

In the version 0.14, I will rewrite the whole metrics & tracing logic to be more efficient and pluggable.
Meanwhile, as a workaround, you can disable this built-in metric function and write a middleware which tracing action calls and send them to a central server (bypassing the moleculer message bus)

Maybe its not about the metric, i dig the source code, the metrics just call broker.emit. Then i make a test for event emit. Not listen (metrics.trace.span.start/finish).

In listener node:

      events: {
        // "metrics.trace.span.start": this.traceStart,
        // "metrics.trace.span.finish": this.traceFinish,
        "event.test.t1": this.t1,
        "event.test.t2": this.t2,
      },

async t1(info) {
    this.logger.info(`${this.name} t1: `, util.inspect(info));
  }
  async t2(info) {
    this.logger.info(`${this.name} t2: `, util.inspect(info));
  }

in auth:

async authorize(ctx) {
    this.broker.emit("event.test.t1", {
      a: "abc",
      b: 123,
      c: true
    })
    console.log("call authorize ctx.params = ", ctx.params)
    if (ctx.params.action.indexOf('$node') >= 0) { //disable call $node.*
      return false
    }
    return true
  }

in demo:

  async welcome(ctx) {
    this.broker.emit("event.test.t2", {
      a: "abc",
      b: 123,
      c: true
    })
    // await ctx.call('demo.hello')
    return this._sayWelcome(ctx.params.name);
  }

the result is:
1, if not send event.test.t1 and event.test.t2, use 8.396ms
2, if send event.test.t1, use 51.806ms
2, if send event.test.t1 and event.test.t2, use 81.931ms

simple event take 30~40ms, doesn't make sence, event should not much slower than action

I guess the problem is not about how to generate metrics.
It is about how to send event without impact action performance.

Hmm, it's weird. Please create a dev example file with 3 brokers & services & NATS like here and I will try to investigate it.

OK, i will create a dev example file

OK, finished
the result is:

call demo1.hello1 at:  1530627615760
call demo2.hello2 at:  1530627615761
call demo3.hello3 at:  1530627615762
call demo1.hello1 use time =  6
call demo1.hello1WithEvent at:  1530627615765
listen t1 at:  1530627615765
call demo2.hello2WithEvent at:  1530627615804
listen t2 at:  1530627615805
call demo3.hello3WithEvent at:  1530627615844
listen t3 at:  1530627615844
call demo1.hello1WithEvent use time =  120

test code is:

const { ServiceBroker } = require("moleculer");

let natsTransporter = {
  type: "NATS",
  options: {
    url: "nats://127.0.0.1:4222",
  }
}

// Create broker Client
const brokerClient = new ServiceBroker({
  namespace: "streaming",
  nodeID: "client",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #1
const broker1 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo1",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #2
const broker2 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo2",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #3
const broker3 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo3",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create brokerListen
const brokerListen = new ServiceBroker({
  namespace: "streaming",
  nodeID: "listen",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

broker1.createService({
  name: "demo1",
  actions: {
    hello1(ctx) {
      console.log("call demo1.hello1 at: ", Date.now())
      return ctx.call('demo2.hello2')
    },
    hello1WithEvent(ctx) {
      broker1.emit('t1', { a: 'abc', b: 123, c: true })
      console.log("call demo1.hello1WithEvent at: ", Date.now())
      return ctx.call('demo2.hello2WithEvent')
    }
  }
});

broker2.createService({
  name: "demo2",
  actions: {
    hello2(ctx) {
      console.log("call demo2.hello2 at: ", Date.now())
      return ctx.call('demo3.hello3')
    },
    hello2WithEvent(ctx) {
      broker2.emit('t2', { a: 'abc', b: 123, c: true })
      console.log("call demo2.hello2WithEvent at: ", Date.now())
      return ctx.call('demo3.hello3WithEvent')
    }
  }
});

broker3.createService({
  name: "demo3",
  actions: {
    hello3(ctx) {
      console.log("call demo3.hello3 at: ", Date.now())
      return "hello from demo3"
    },
    hello3WithEvent(ctx) {
      broker3.emit('t3', { a: 'abc', b: 123, c: true })
      console.log("call demo3.hello3WithEvent at: ", Date.now())
      return "hello from hello3WithEvent"
    }
  }
});

brokerListen.createService({
  name: "listen",
  events: {
    t1: (ctx) => {
      console.log("listen t1 at: ", Date.now())
    },
    t2: (ctx) => {
      console.log("listen t2 at: ", Date.now())
    },
    t3: (ctx) => {
      console.log("listen t3 at: ", Date.now())
    },
  }
});

brokerClient.Promise.all([brokerClient.start(), broker1.start(), broker2.start(), broker3.start(), brokerListen.start()])
  .delay(2000)
  .then(async () => {

    let startTime = Date.now()
    await brokerClient.call('demo1.hello1')
    let endTime = Date.now()
    console.log("call demo1.hello1 use time = ", endTime - startTime)

    startTime = Date.now()
    await brokerClient.call('demo1.hello1WithEvent')
    endTime = Date.now()
    console.log("call demo1.hello1WithEvent use time = ", endTime - startTime)

    // startTime = Date.now()
    // await brokerClient.call('demo1.hello1WithEvent')
    // endTime = Date.now()
    // console.log("call demo1.hello1WithEvent again use time = ", endTime - startTime)

  });

Thanks, I will investigate it tonight.

It's weirder. It's the result on my PC:

call demo1.hello1 at:  1530633313582
call demo2.hello2 at:  1530633313585
call demo3.hello3 at:  1530633313587
call demo1.hello1 use time =  17
call demo1.hello1WithEvent at:  1530633313596
listen t1 at:  1530633313599
call demo2.hello2WithEvent at:  1530633313601
listen t2 at:  1530633313603
call demo3.hello3WithEvent at:  1530633313604
listen t3 at:  1530633313606
call demo1.hello1WithEvent use time =  15

Node version, OS, architecture...etc?

I could reproduce it on Ubuntu

call demo1.hello1 at:  1530644013498
call demo2.hello2 at:  1530644013501
call demo3.hello3 at:  1530644013503
call demo1.hello1 use time =  17
call demo1.hello1WithEvent at:  1530644013513
listen t1 at:  1530644013514
call demo2.hello2WithEvent at:  1530644013560
listen t2 at:  1530644013561
call demo3.hello3WithEvent at:  1530644013604
listen t3 at:  1530644013604
call demo1.hello1WithEvent use time =  139

It's very mystic!
With NATS: ~8 rps _(~125ms)_
With Redis: ~500 rps _(~2ms)_
With TCP: ~1700 rps _(~0.5ms)_
linux-transporter-emits

And it's only on Linux. On Windows it's ~1500-2000 rps with all transporters.
Do you have any idea? Maybe NATS client problem because it can send messages parallel?

Dev test file: https://github.com/moleculerjs/moleculer/blob/master/dev/emit-latency.js

have you try different nats client library version?

I will, but currently I can't.

I tried with 0.8, 0.7...same result.

Same thing here.

  • On Windows the perf is Ok.
  • On Linux tried with different versions of nats (1, 0.8, 0.7) and moleculer (0.12.8, 0.12.6, 0.12.1) and the result is the similar to this
call demo1.hello1 at:  1530693198252
call demo2.hello2 at:  1530693198252
call demo3.hello3 at:  1530693198253
call demo1.hello1 use time =  8
call demo1.hello1WithEvent at:  1530693198255
listen t1 at:  1530693198255
call demo2.hello2WithEvent at:  1530693198303
listen t2 at:  1530693198303
call demo3.hello3WithEvent at:  1530693198348
listen t3 at:  1530693198349
call demo1.hello1WithEvent use time =  138

Info about my system:

Kubuntu 18.04 64bit
Kernel 4.15.0-24
CPU: Intel i7 6700
RAM 16GB

NodeJS: 8.11.3
Nats Server: 1.1.0

Thanks @AndreMaz

After long time digging and testing, i found the reason. I don't know its NATS or Nodejs bug. The Nodejs said noDelay defaults to true, but its not.

socket.setNoDelay([noDelay])
Disables the Nagle algorithm. By default TCP connections use the Nagle algorithm, they buffer data before sending it off. Setting true for noDelay will immediately fire off data each time socket.write() is called. noDelay defaults to true.

Just modify this file at: node_modules/nats/lib/nats.js

// Create the stream
    this.stream = net.createConnection(this.url.port, this.url.hostname);
    this.stream.setNoDelay(true)  // need call it, somehow default 'true' is not work

After manually add setNoDelay(true), it works fine on ubuntu:

call demo1.hello1 at:  1530870151475
call demo2.hello2 at:  1530870151477
call demo3.hello3 at:  1530870151478
call demo1.hello1 use time =  8
call demo1.hello1WithEvent startTime =  1530870151480
call demo1.hello1WithEvent at:  1530870151481
listen e1 at:  1530870151482
call demo2.hello2WithEvent at:  1530870151483
listen e2 at:  1530870151484
call demo3.hello3WithEvent at:  1530870151484
listen e3 at:  1530870151485
call demo1.hello1WithEvent endTime =  1530870151486
call demo1.hello1WithEvent use time =  6

test environment:

moleculer: 0.12.8
nodejs: 8.11.1
node-nats: 1.0.0
gnatsd: 1.2.0
system: ubuntu 16.04 server

Wow, nice catch! It can explain why has the difference between OSs. So on Windows, the noDelay can be true, and on Linux it's not.

And it's the reason why was not problem with Redis transporter:
https://github.com/luin/ioredis/blob/622975d9d454eca6a9c495d35f5638d68e2662f2/lib/redis.js#L311

Btw, I can add a workaround, that I set the setNoDelay always in the transporter code when the client connected.

@icebob Maybe you should also listen connect and reconnect events from NATS. Because if reconnect, the down layer stream(socket) object will be a new one.

    client.on("connect", () => {
        this.client = client;
        if (client.stream)
            client.stream.setNoDelay(true);

    client.on("reconnect", () => {
        this.client = client;
        if (client.stream)
            client.stream.setNoDelay(true);

Yes, good idea!

npm already has a fix for this, you may want to just update and remove the patching.

@aricart Thank you for fixing this

@roytan883 yw!!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

icebob picture icebob  Â·  3Comments

thatisuday picture thatisuday  Â·  3Comments

molobala picture molobala  Â·  3Comments

slinkardbrandon picture slinkardbrandon  Â·  4Comments

DeividasJackus picture DeividasJackus  Â·  4Comments