Egg: app_worker restart

Created on 29 Aug 2018  ·  14 Comments  ·  Source: eggjs/egg

  • Node Version: 10.9.0
  • Egg Version: 2.10.0
  • Plugin Name:
  • Plugin Version:
  • Platform: docker FROM node:10.9.0-alpine
  • Mini Showcase Repository:

Background

1) We deploy inside Docker and do not start in daemon mode.

2) Each time before an app_worker restarts, it has consumed fewer than 1000 Kafka messages and then run a time-consuming operation (roughly 10-20 minutes).

Questions

1) When you hit an app_worker restart error like this in egg.js, how do you analyze the cause?

2) Besides reading the logs, are there other ways to analyze it?

8/29/2018 3:51:28 AM2018-08-29 03:51:28,312 INFO 89 end processTopLineRoute
8/29/2018 3:51:32 AM[2018-08-29 03:51:32.127] [cfork:master:32] worker:54 disconnect (exitedAfterDisconnect: false, state: disconnected, isDead: false, worker.disableRefork: false)
8/29/2018 3:51:32 AM[2018-08-29 03:51:32.137] [cfork:master:32] new worker:19312 fork (state: none)
8/29/2018 3:51:32 AM[2018-08-29 03:51:32.138] [cfork:master:32] worker:54 exit (code: null, exitedAfterDisconnect: false, state: dead, isDead: true, isExpected: true, worker.disableRefork: false)
8/29/2018 3:51:32 AM2018-08-29 03:51:32,137 INFO 32 [master] app_worker#3:54 disconnect, suicide: false, state: disconnected, current workers: ["1","2","3","4","5","6","7","8","9"]
8/29/2018 3:51:32 AM2018-08-29 03:51:32,137 INFO 32 [master] app_worker#9:19312 start, state: none, current workers: ["1","2","3","4","5","6","7","8","9"]
8/29/2018 3:51:32 AM2018-08-29 03:51:32,145 ERROR 32 nodejs.AppWorkerDiedError: [master] app_worker#3:54 died (code: null, signal: SIGKILL, suicide: false, state: dead), current workers: ["1","2","4","5","6","7","8","9"]
8/29/2018 3:51:32 AM    at Master.onAppExit (/usr/src/app/node_modules/egg-cluster/lib/master.js:388:21)
8/29/2018 3:51:32 AM    at Master.emit (events.js:182:13)
8/29/2018 3:51:32 AM    at Messenger.sendToMaster (/usr/src/app/node_modules/egg-cluster/lib/utils/messenger.js:122:17)
8/29/2018 3:51:32 AM    at Messenger.send (/usr/src/app/node_modules/egg-cluster/lib/utils/messenger.js:87:12)
8/29/2018 3:51:32 AM    at EventEmitter.cluster.on (/usr/src/app/node_modules/egg-cluster/lib/master.js:265:22)
8/29/2018 3:51:32 AM    at EventEmitter.emit (events.js:187:15)
8/29/2018 3:51:32 AM    at ChildProcess.worker.process.once (internal/cluster/master.js:194:13)
8/29/2018 3:51:32 AM    at Object.onceWrapper (events.js:273:13)
8/29/2018 3:51:32 AM    at ChildProcess.emit (events.js:182:13)
8/29/2018 3:51:32 AM    at Process.ChildProcess._handle.onexit (internal/child_process.js:239:12)
8/29/2018 3:51:32 AMname: "AppWorkerDiedError"
8/29/2018 3:51:32 AMpid: 32
8/29/2018 3:51:32 AMhostname: e700df096dd3
8/29/2018 3:51:32 AM
8/29/2018 3:51:33 AM2018-08-29 03:51:33,022 INFO 19312 Plugin development is disabled by env unmatched, require env(local) but got env is test
8/29/2018 3:51:33 AM2018-08-29 03:51:33,053 INFO 19312 [egg:core] App root: /usr/src/app
8/29/2018 3:51:33 AM2018-08-29 03:51:33,053 INFO 19312 [egg:core] All *.log files save on "/usr/src/app/logs/tv-node"

Most helpful comment

Found the reason for the worker process restarts: Docker was killing them. Ops had set the container's memory limit to 1 GB, while Egg starts 8 worker processes by default, so once a single process grew to somewhere between 100 and 200 MB it was very easy for the Docker container to issue a kill -9. Mystery finally solved.
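
One way to act on that finding is to cap the number of app workers so their combined memory stays under the container limit. Below is a minimal sketch of a custom start script using egg.startCluster with an explicit workers count; the file name dispatch.js, the port, and the value 2 are assumptions, and passing a --workers option to egg-scripts start is the usual alternative if you keep npm start.

dispatch.js

// Hypothetical custom start script (not from this thread).
// Caps the number of app workers so the default of one per CPU core
// does not blow past a 1 GB container memory limit.
const egg = require('egg');

egg.startCluster({
  baseDir: __dirname,
  port: 7001,   // assumed port
  workers: 2,   // assumed worker count; default is os.cpus().length
});

Run it directly with node dispatch.js instead of egg-scripts; if you stay with egg-scripts, the --workers flag serves the same purpose.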

All 14 comments

nodejs.AppWorkerDiedError: [master] app_worker#3:54 died

check common-error.log

Check whether there is anything unusual in stderr, or the process may simply have been OOM-killed.

The most frustrating part is that there is no exception: none of the other log files contain anything relevant, and the only error log is the one above.
How can I check whether the process was OOM-killed?

Integrate alinode.

Actually there is no problem when I start locally with npm run dev. My guess is that starting with npm start inside Docker spawns many worker processes; if several of them all consume Kafka and run the time-consuming operation (about 10-20 minutes), an OOM becomes likely.
My current idea is to let only one worker process consume Kafka. Could you check whether my implementation below is OK, or is there a better way?
agent.js

module.exports = agent => {
  // Once all workers are ready, tell one randomly chosen worker to start consuming Kafka.
  agent.messenger.on('egg-ready', () => {
    agent.messenger.sendRandom('start_kafka_action', true);
  });

  // When the chosen worker is shutting down, hand the consumer to another random worker.
  agent.messenger.on('reselect_kafka_action', () => {
    agent.messenger.sendRandom('start_kafka_action', true);
  });
};

app.js

module.exports = app => {
  // Only the worker selected by the agent starts the Kafka consumer.
  app.messenger.on('start_kafka_action', data => {
    if (data === true) {
      app.logger.info(`start kafka in pid: ${process.pid}`);
      loadRdkafka(app); // the app's own Kafka consumer bootstrap

      // On graceful shutdown, ask the agent to reselect a worker for the consumer.
      process.on('SIGTERM', () => {
        app.messenger.sendToAgent('reselect_kafka_action', true);
        app.logger.info(`close kafka in pid: ${process.pid}`);
      });
    }
  });
};

Don't dodge the problem: use alinode to analyze what is actually causing the memory leak. Multiple workers only speed the process up.
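
As a lightweight stopgap before alinode is wired up (a sketch, not something proposed in this thread), each worker can periodically log process.memoryUsage() so RSS growth toward the container limit shows up in the app logs; the 30-second interval and message format below are arbitrary assumptions.

// e.g. in app.js (illustrative only)
module.exports = app => {
  // Log RSS and heap usage every 30 seconds per worker.
  const timer = setInterval(() => {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    const mb = n => Math.round(n / 1024 / 1024);
    app.logger.info(
      `[mem] pid=${process.pid} rss=${mb(rss)}MB heap=${mb(heapUsed)}/${mb(heapTotal)}MB`
    );
  }, 30 * 1000);
  timer.unref(); // do not keep the process alive just for this timer

  app.beforeClose(() => clearInterval(timer));
};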

Hey, we need more information to follow up on this issue. We recommend using egg-init --type=simple to provide a mini project and uploading it to your GitHub.

For now it will be closed due to lack of activity; feel free to reopen this issue if more information is provided.

An update: after integrating alinode and running a load test, a worker process restarted but no coredump file was generated, and in alinode both memory and CPU looked normal and flat. I then asked the alinode folks (hitting countless pitfalls along the way, including the same image in Docker on different host machines giving different results for cat /proc/sys/kernel/core_pattern). Although the problem was not solved in the end, we reached these conclusions:
1) A process restart does not necessarily produce a coredump; a crash or an OOM would. So judging from the alinode monitoring, the worker restarts were not caused by a crash or OOM.
2) The log shows isExpected: true, which looks as if the process itself emitted a disconnect event that triggered the restart.
Based on the conclusions above, I did a global search for the disconnect keyword (could a C++ module be the cause, e.g. rdkafka or sharp?), but I am not familiar with C++ code, so I did not find anything useful.
If there is any further progress, I will update this issue.

@leegang2017 How about monkey patching process.kill in app.js to log it and capture the error stack, to see which module is manually calling kill?
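
A minimal sketch of that monkey patch, assuming it sits at the top of app.js; the variable name and log wording are made up, only the wrapping of process.kill follows the suggestion above.

// app.js
const originalKill = process.kill.bind(process);

// Wrap process.kill so any manual kill call logs its arguments and call site.
process.kill = (pid, signal) => {
  // new Error().stack captures the stack of whichever module invoked kill.
  console.error(
    `[debug] process.kill(${pid}, ${signal}) called\n${new Error().stack}`
  );
  return originalKill(pid, signal);
};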

Found the reason for the worker process restarts: Docker was killing them. Ops had set the container's memory limit to 1 GB, while Egg starts 8 worker processes by default, so once a single process grew to somewhere between 100 and 200 MB it was very easy for the Docker container to issue a kill -9. Mystery finally solved.

Found the reason for the worker process restarts: Docker was killing them. Ops had set the container's memory limit to 1 GB, while Egg starts 8 worker processes by default, so once a single process grew to somewhere between 100 and 200 MB it was very easy for the Docker container to issue a kill -9. Mystery finally solved.

We previously had the agent receive the messages and then dispatch them to the worker processes, but the agent came under heavy pressure and performance suffered noticeably.

Same situation on my side: the allocated memory was too low, so the workers kept restarting.

If the Node process is killed by the container (not due to a termination signal sent during a release), egg-scripts does not provide any keep-alive logic to pull it back up, so wouldn't the application just stay down? How should this case be handled?

If the Node process is killed by the container (not due to a termination signal sent during a release), egg-scripts does not provide any keep-alive logic to pull it back up, so wouldn't the application just stay down? How should this case be handled?

The situation you describe shouldn't happen.

  1. Inside a container, egg-scripts should run in non-daemon mode, so if the master is killed, egg-scripts exits with it, and from that point the container side has to handle the restart.
  2. If what gets killed is the agent, a worker, or the like, the master will take care of it.