2018-08-29 03:51:28,312 INFO 89 end processTopLineRoute
[2018-08-29 03:51:32.127] [cfork:master:32] worker:54 disconnect (exitedAfterDisconnect: false, state: disconnected, isDead: false, worker.disableRefork: false)
[2018-08-29 03:51:32.137] [cfork:master:32] new worker:19312 fork (state: none)
[2018-08-29 03:51:32.138] [cfork:master:32] worker:54 exit (code: null, exitedAfterDisconnect: false, state: dead, isDead: true, isExpected: true, worker.disableRefork: false)
2018-08-29 03:51:32,137 INFO 32 [master] app_worker#3:54 disconnect, suicide: false, state: disconnected, current workers: ["1","2","3","4","5","6","7","8","9"]
2018-08-29 03:51:32,137 INFO 32 [master] app_worker#9:19312 start, state: none, current workers: ["1","2","3","4","5","6","7","8","9"]
2018-08-29 03:51:32,145 ERROR 32 nodejs.AppWorkerDiedError: [master] app_worker#3:54 died (code: null, signal: SIGKILL, suicide: false, state: dead), current workers: ["1","2","4","5","6","7","8","9"]
    at Master.onAppExit (/usr/src/app/node_modules/egg-cluster/lib/master.js:388:21)
    at Master.emit (events.js:182:13)
    at Messenger.sendToMaster (/usr/src/app/node_modules/egg-cluster/lib/utils/messenger.js:122:17)
    at Messenger.send (/usr/src/app/node_modules/egg-cluster/lib/utils/messenger.js:87:12)
    at EventEmitter.cluster.on (/usr/src/app/node_modules/egg-cluster/lib/master.js:265:22)
    at EventEmitter.emit (events.js:187:15)
    at ChildProcess.worker.process.once (internal/cluster/master.js:194:13)
    at Object.onceWrapper (events.js:273:13)
    at ChildProcess.emit (events.js:182:13)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:239:12)
name: "AppWorkerDiedError"
pid: 32
hostname: e700df096dd3

2018-08-29 03:51:33,022 INFO 19312 Plugin development is disabled by env unmatched, require env(local) but got env is test
2018-08-29 03:51:33,053 INFO 19312 [egg:core] App root: /usr/src/app
2018-08-29 03:51:33,053 INFO 19312 [egg:core] All *.log files save on "/usr/src/app/logs/tv-node"
nodejs.AppWorkerDiedError: [master] app_worker#3:54 died
check common-error.log
Check whether there are any exceptions in stderr, or the process may simply have been OOM-killed.
The most frustrating part is that there is no exception: none of the other log files contain any related messages, and the only error log is the one above.
How can I check whether the process was OOM-killed?
Hook up alinode.
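(For reference, a quick way to look for OOM kills without any extra tooling is to scan the kernel ring buffer for OOM-killer messages; a minimal sketch, assuming dmesg is readable inside the container and with a file name of my own choosing:)

// check-oom.js: scan the kernel ring buffer for OOM-killer messages.
// dmesg may be restricted inside a container; if so, run this on the host instead.
const { execSync } = require('child_process');

try {
  const out = execSync('dmesg', { encoding: 'utf8' });
  const hits = out.split('\n').filter(line => /out of memory|oom-kill|killed process/i.test(line));
  console.log(hits.length ? hits.join('\n') : 'no OOM-killer entries found');
} catch (err) {
  console.error('could not read dmesg:', err.message);
}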
Actually, when I start locally with run dev there is no problem. My guess is that starting with npm start inside Docker spins up many worker processes; if several of them all consume Kafka and run the time-consuming job (roughly 10-20 minutes), that could lead to an OOM.
My current idea is to let only one worker process consume Kafka. Could you check whether my implementation below is reasonable, or is there a better way?
agent.js

// Runs in the single agent process: pick one random worker to own the Kafka
// consumer, and pick a new one whenever the current owner asks to hand it off.
module.exports = agent => {
  agent.messenger.on('egg-ready', () => {
    agent.messenger.sendRandom('start_kafka_action', true);
  });
  agent.messenger.on('reselect_kafka_action', () => {
    agent.messenger.sendRandom('start_kafka_action', true);
  });
};
app.js

// Runs in every worker, but only the worker picked by the agent starts the
// Kafka consumer. When that worker is told to shut down, it asks the agent
// to hand the consumer to another worker.
module.exports = app => {
  app.messenger.on('start_kafka_action', data => {
    if (data !== true) return;
    app.logger.info(`start kafka in pid: ${process.pid}`);
    loadRdkafka(app); // existing helper in this project that wires up node-rdkafka
    process.on('SIGTERM', () => {
      app.messenger.sendToAgent('reselect_kafka_action', true);
      app.logger.info(`close kafka in pid: ${process.pid}`);
    });
  });
};
Don't dodge the problem; use alinode to find out what is actually causing the memory leak. Multiple workers only speed up the process.
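(If alinode is not an option, a plain heap-snapshot diff can also locate a leak; a minimal sketch using the third-party heapdump module, which is my suggestion rather than something recommended in this thread:)

// Requiring heapdump installs a SIGUSR2 handler, so `kill -USR2 <worker pid>`
// writes a .heapsnapshot file into the working directory; two snapshots taken
// a few minutes apart can be diffed in Chrome DevTools to see what keeps growing.
const heapdump = require('heapdump');

// A snapshot can also be written programmatically, e.g. after a Kafka batch:
heapdump.writeSnapshot(`/tmp/heap-${process.pid}-${Date.now()}.heapsnapshot`, err => {
  if (err) console.error('heapdump failed:', err);
});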
Hey, we need more information to follow up on this issue. We recommend using egg-init --type=simple to create a minimal reproduction project and upload it to your GitHub.
For now this will be closed due to lack of activity; feel free to reopen the issue once more information is provided.
An update: after hooking up alinode and running a load test, the worker process still restarted but no coredump file was generated, and the alinode dashboards showed memory and CPU both looking normal and flat. I then asked the alinode folks (stepping into countless pits along the way, including that the same Docker image on different host machines gives different results for cat /proc/sys/kernel/core_pattern). The problem wasn't solved in the end, but we did reach two conclusions:
1) A process restart does not necessarily produce a coredump; a crash or an OOM does. So, judging from the alinode monitoring, the worker restarts were not caused by a crash or an OOM.
2) The log shows isExpected: true, which looks as if the process itself emitted a disconnect event that triggered the restart.
Based on the conclusions above, I did a global search for the keyword disconnect (could a C++ module be the cause, e.g. rdkafka or sharp?), but I'm not familiar with C++ code, so I didn't find anything useful.
If there is any further progress I'll update this issue.
@leegang2017 How about monkey-patching process.kill in app.js to log each call and capture an error stack? That would show which module is manually calling kill.
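(A minimal sketch of what that monkey patch could look like inside app.js; the wrapper and the [kill-trace] log prefix are my own naming, not an egg API:)

module.exports = app => {
  // Wrap process.kill so every call is logged together with the caller's
  // stack trace, revealing which module sent the signal.
  const originalKill = process.kill.bind(process);
  process.kill = (pid, signal) => {
    app.logger.warn(`[kill-trace] kill(${pid}, ${signal}) called from:\n${new Error().stack}`);
    return originalKill(pid, signal);
  };
};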
Found the reason for the worker restarts: Docker was killing them. Ops had set the container's memory limit to 1 GB, and egg starts 8 worker processes by default, so once a single process grew to somewhere between 100 and 200 MB it was very easy for the Docker container to send a kill -9. Mystery finally solved.
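(A minimal sketch of one way to size the worker count from the container's memory limit; the file name count-workers.js, the ~200 MB per-worker estimate, and the cgroup paths are assumptions, not part of this project:)

// count-workers.js: suggest a worker count so that N workers at roughly
// 200 MB each stay below the container's memory limit, leaving headroom
// for the master and agent processes.
const fs = require('fs');
const os = require('os');

const PER_WORKER_BYTES = 200 * 1024 * 1024; // rough per-worker footprint observed above

let limit = Infinity;
try {
  // cgroup v1 path; cgroup v2 exposes /sys/fs/cgroup/memory.max instead.
  limit = Number(fs.readFileSync('/sys/fs/cgroup/memory/memory.limit_in_bytes', 'utf8'));
} catch (err) {
  // not running under a cgroup v1 memory limit; fall back to the CPU count
}

const byMemory = Math.max(1, Math.floor(limit / PER_WORKER_BYTES) - 2);
console.log(Math.min(os.cpus().length, byMemory));

The printed number could then be passed to the start command, e.g. egg-scripts start --workers=$(node count-workers.js), assuming the --workers option is available in your egg-scripts version.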
We previously had the agent receive the messages and then dispatch them to the worker processes, but the agent came under heavy load and performance suffered noticeably.
I'm seeing the same situation here: the allocated memory is too low, so the workers keep restarting.
If the Node process is killed by the container (rather than by a termination signal sent during deployment), egg-scripts provides no keep-alive logic to bring it back up, so wouldn't the application just stay down? How is this case handled?
The situation you describe shouldn't occur.