v0.8.4
OS: CentOS7.2
Kernel: 3.10.0.862.9.1.el7.x86_64
Hardware: Dell R730xd 2*E5-2650V4(2.2G,12C)
Kernel crash followed by an OS reboot.
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000010"


Sounds like a kernel bug to me. The 3.10 kernel is very, very old; move to 3.16 / 3.18, or even a 4.4 / 4.9 / 4.14 LTS kernel.
@jippi
Off topic:
Stock CentOS won't move the kernel ahead.
The user will have to explicitly install kernel-lt or kernel-ml.
@dudd
When does the crash occur? Steps to reproduce the issue?
The CentOS 3.10 kernel is not a vanilla 3.10. You get bug fixes backported by Red Hat, but you don't get most of the new features. That's how Red Hat manages the kernel. That said, I've been having many problems with the 862 series (from refusal to boot on some machines to random crashes), so I'm sticking with 3.10.0-693.21.1.el7.x86_64.
The kdumps I've seen mostly ended up implicating the nomad executor, because there are many instances of it running, but I've seen other crash points too, so I don't think this is a Nomad issue. However, when the kernel crashes, the Nomad agent can end up in a corrupted state and crash itself.
Nomad won't be able to trigger a kernel panic. Either it's a kernel bug or a Go bug, and I'm willing to put money on a kernel bug.
@jippi I have no doubt about that either. Here's a backtrace from one kdump that involved nomad:
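For anyone who wants to check which Red Hat build a host is on (862 vs. 693), here is a minimal Python sketch. It assumes the release string has the usual `3.10.0-<build>.<...>.el7.<arch>` shape; it also accepts the dotted spelling used in the original report. The names `rhel_build` and `is_862_series` are made up for illustration.

```python
import platform
import re

# Assumption: RHEL/CentOS releases look like "3.10.0-862.11.6.el7.x86_64".
# A "." is also accepted after the upstream version, because the report
# in this thread writes "3.10.0.862.9.1.el7.x86_64".
RELEASE_RE = re.compile(r"^(\d+\.\d+\.\d+)[-.](\d+)((?:\.\d+)*)\.el(\d+)")


def rhel_build(release):
    """Return the Red Hat build number (e.g. 862), or None if not parseable."""
    match = RELEASE_RE.match(release)
    if match is None:
        return None
    return int(match.group(2))


def is_862_series(release):
    """True if the release string belongs to the 3.10.0-862.x series."""
    return rhel_build(release) == 862


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    current = platform.release()
    print(current, "-> build", rhel_build(current))
```

Run on a node, it prints the running release and its build number; a result of `None` just means the string did not match the assumed pattern.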
[723807.210656] Modules linked in: rbd libceph dns_resolver xt_nat veth 8021q garp mrp fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio bridge stp llc ext4 mbcache jbd2 sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev mei_me pcspkr mei sg ioatdma lpc_ich shpchp i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ast drm_kms_helper syscopyarea sysfillrect sysimgblt
[723807.289647] fb_sys_fops ttm ahci ixgbe drm libahci igb crct10dif_pclmul libata crct10dif_common crc32c_intel mdio ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
[723807.308020] CPU: 20 PID: 1098295 Comm: nomad Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
[723807.318365] Hardware name: Supermicro SYS-6028TP-HTR/X10DRT-P, BIOS 2.0b 03/30/2017
[723807.327378] task: ffff94cf3610bf40 ti: ffff94c54f6f8000 task.ti: ffff94c54f6f8000
[723807.336120] RIP: 0010:[<ffffffff930d6fce>] [<ffffffff930d6fce>] effective_load.isra.41+0x4e/0x90
[723807.346218] RSP: 0018:ffff94c54f6fbcb8 EFLAGS: 00010002
[723807.352888] RAX: 0000000000000400 RBX: ffff94d129191800 RCX: 0000000000000400
[723807.361321] RDX: 0000000000000014 RSI: ffff94c1a8ffa800 RDI: 0000000000000001
[723807.369705] RBP: ffff94c54f6fbcb8 R08: 0000000000000400 R09: ffff94c587b95800
[723807.378069] R10: 0000000000000002 R11: 0000000000000400 R12: 0000000000000014
[723807.386382] R13: 0000000000018b40 R14: 000000000000002a R15: 0000000000018b40
[723807.394744] FS: 00007fad327fc700(0000) GS:ffff94e07ef80000(0000) knlGS:0000000000000000
[723807.403955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[723807.410861] CR2: 0000000000000079 CR3: 000000032ca30000 CR4: 00000000003607e0
[723807.419245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[723807.427456] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[723807.435783] Call Trace:
[723807.439360] [<ffffffff930d7882>] select_task_rq_fair+0x482/0x700
[723807.446566] [<ffffffff930d1ca7>] try_to_wake_up+0xd7/0x350
[723807.453292] [<ffffffff9319709b>] ? unlock_page+0x2b/0x30
[723807.459783] [<ffffffff930d1f9b>] wake_up_q+0x5b/0x80
[723807.465939] [<ffffffff931066eb>] futex_wake+0x16b/0x180
[723807.472365] [<ffffffff9310921a>] do_futex+0x12a/0x5a0
[723807.478577] [<ffffffff9335fad8>] ? lockref_put_or_lock+0x48/0x80
[723807.485710] [<ffffffff93241a84>] ? mntput+0x24/0x40
[723807.491672] [<ffffffff932214d6>] ? __fput+0x186/0x260
[723807.497791] [<ffffffff93109710>] SyS_futex+0x80/0x180
[723807.503898] [<ffffffff9372579b>] system_call_fastpath+0x22/0x27
[723807.510910] Code: eb 29 0f 1f 00 48 85 c9 7e 4b 49 0f af 41 50 31 d2 48 f7 f1 48 83 f8 02 49 0f 42 c2 48 2b 07 48 8b 7f 68 48 85 ff 74 35 45 31 c0 <48> 8b 77 78 4c 8b 8e c0 00 00 00 48 8b 16 49 8b 89 80 02 00 00
[723807.532880] RIP [<ffffffff930d6fce>] effective_load.isra.41+0x4e/0x90
[723807.540390] RSP <ffff94c54f6fbcb8>
[723807.544896] CR2: 0000000000000079
And here is a crash of the same kernel on a different host, which ended up crashing consul. The backtrace is completely different:
[638656.024183] CPU: 11 PID: 0 Comm: swapper/11 Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
[638656.033369] Hardware name: Supermicro SYS-6028TP-HTR/X10DRT-P, BIOS 2.0b 03/30/2017
[638656.041164] task: ffff8d52e95b3f40 ti: ffff8d52e95c4000 task.ti: ffff8d52e95c4000
[638656.048780] RIP: 0010:[<ffffffffb88d8777>] [<ffffffffb88d8777>] update_blocked_averages+0x87/0x700
[638656.057998] RSP: 0018:ffff8d61bf8c3de0 EFLAGS: 00010006
[638656.063431] RAX: 000000000000000b RBX: ffff8d61bf8d8b40 RCX: ffffffffb98c8040
[638656.070696] RDX: 3426b13e5b934a59 RSI: 00000000000001d5 RDI: 000000000000022c
[638656.077959] RBP: ffff8d61bf8c3e48 R08: ffff8d61bf8d8bc0 R09: 0000000000000000
[638656.085221] R10: ffffffffffffffff R11: 000000000000b59c R12: ffff8d5e69f3c800
[638656.092483] R13: 0000000000000057 R14: ffff8d61bf8d8b40 R15: ffff8d61bf8d93b0
[638656.099750] FS: 0000000000000000(0000) GS:ffff8d61bf8c0000(0000) knlGS:0000000000000000
[638656.107973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[638656.113840] CR2: 00007f849c572480 CR3: 0000001566b86000 CR4: 00000000003607e0
[638656.121106] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[638656.128368] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[638656.135630] Call Trace:
[638656.138187] <IRQ>
[638656.140233] [<ffffffffb88dffcd>] rebalance_domains+0x4d/0x2b0
[638656.146389] [<ffffffffb88e0352>] run_rebalance_domains+0x122/0x1e0
[638656.152789] [<ffffffffb889dba5>] __do_softirq+0xf5/0x280
[638656.158322] [<ffffffffb8f28cec>] call_softirq+0x1c/0x30
[638656.163765] [<ffffffffb882e625>] do_softirq+0x65/0xa0
[638656.169032] [<ffffffffb889df25>] irq_exit+0x105/0x110
[638656.174303] [<ffffffffb8f2a088>] smp_apic_timer_interrupt+0x48/0x60
[638656.180793] [<ffffffffb8f267b2>] apic_timer_interrupt+0x162/0x170
[638656.187101] <EOI>
[638656.189150] [<ffffffffb8d6e5d7>] ? cpuidle_enter_state+0x57/0xd0
[638656.195567] [<ffffffffb8d6e72e>] cpuidle_idle_call+0xde/0x230
[638656.201531] [<ffffffffb88366ce>] arch_cpu_idle+0xe/0xb0
[638656.206969] [<ffffffffb88f5dba>] cpu_startup_entry+0x14a/0x1e0
[638656.213022] [<ffffffffb8857187>] start_secondary+0x1f7/0x270
[638656.218895] [<ffffffffb88000d5>] start_cpu+0x5/0x14
[638656.223983] Code: 48 89 45 c8 48 8b 45 c8 48 39 c7 4c 8d a0 50 ff ff ff 0f 84 ab 01 00 00 0f 1f 40 00 49 8b 94 24 c0 00 00 00 49 63 86 30 09 00 00 <48> 8b 4a 40 48 8b 52 48 48 8b 1c c1 4c 8b 2c c2 0f 1f 44 00 00
[638656.244524] RIP [<ffffffffb88d8777>] update_blocked_averages+0x87/0x700
[638656.252934] RSP <ffff8d61bf8c3de0>
The primary thing they have in common is the kernel version.
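To compare kdumps from several hosts, the fields that matter in traces like these (command, kernel release, faulting function) can be pulled out mechanically. A rough Python sketch follows; `summarize_oops` and both regexes are my own assumptions, based only on the log lines in this thread, and are not a general oops parser.

```python
import re

# Matches e.g.
#   CPU: 20 PID: 1098295 Comm: nomad Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
CPU_RE = re.compile(
    r"CPU: (\d+) PID: (\d+) Comm: (\S+) .*?(\d+\.\d+\.\d+-\S+) #"
)
# Matches both "RIP: 0010:[<addr>] [<addr>] func+0x.." and "RIP [<addr>] func+0x..".
RIP_RE = re.compile(r"RIP:? .*?\]\s+([\w.]+)\+0x")


def summarize_oops(log_text):
    """Extract cpu, pid, comm, kernel release and faulting symbol from an oops."""
    summary = {"cpu": None, "pid": None, "comm": None, "kernel": None, "rip": None}
    cpu = CPU_RE.search(log_text)
    if cpu:
        summary["cpu"] = int(cpu.group(1))
        summary["pid"] = int(cpu.group(2))
        summary["comm"] = cpu.group(3)
        summary["kernel"] = cpu.group(4)
    rip = RIP_RE.search(log_text)
    if rip:
        summary["rip"] = rip.group(1)
    return summary


if __name__ == "__main__":
    sample = (
        "[638656.024183] CPU: 11 PID: 0 Comm: swapper/11 Kdump: loaded "
        "Not tainted 3.10.0-862.11.6.el7.x86_64 #1\n"
        "[638656.244524] RIP [<ffffffffb88d8777>] update_blocked_averages+0x87/0x700\n"
    )
    print(summarize_oops(sample))
```

Grouping the resulting dictionaries by `kernel` and `rip` makes it easier to see whether different hosts fail in the same function or only share the kernel release.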
@shantanugadgil
Sorry for the late reply. The crash is random, and I can't give steps to reproduce it because the workload is not under my control.
I'll keep watching and will post updates here as I learn more.
Nomad version
v0.8.7
Operating system and Environment details
OS: CentOS7.2
Kernel: 3.10.0.862.9.1.el7.x86_64
Hardware: Dell R730xd 2*E5-2650V4(2.2G,12C)
Issue
kernel crash and os reboot
I'm hitting the same issue. There are thousands of nomad alloc processes. Gradually, all the memory on the machine is used up, and then it reboots. See:
[root@node-1 127.0.0.1-2019-03-06-18:30:20]# crash /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux vmcore
crash 7.2.3-8.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [716MB]: patching 82671 gdb minimal_symbol values
KERNEL: /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Wed Mar 6 18:30:11 2019
UPTIME: 22:24:18
LOAD AVERAGE: 2.02, 2.40, 2.63
TASKS: 9091
NODENAME: node-3
RELEASE: 3.10.0-862.14.4.el7.x86_64
VERSION: #1 SMP Wed Sep 26 15:12:11 UTC 2018
MACHINE: x86_64 (2294 Mhz)
MEMORY: 127.6 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000130"
PID: 1473170
COMMAND: "nomad"
TASK: ffff98cde8630fd0 [THREAD_INFO: ffff98b5ad780000]
CPU: 38
STATE: TASK_RUNNING (PANIC)
crash> ps nomad
PID PPID CPU TASK ST %MEM VSZ RSS COMM
12792 1 6 ffff98cdda218fd0 IN 0.1 5663472 203556 nomad
13062 1 0 ffff98bdc8b80fd0 IN 0.1 5663472 203556 nomad
13064 1 18 ffff98bdc8b80000 IN 0.1 5663472 203556 nomad
13065 1 15 ffff98bdc8b84f10 IN 0.1 5663472 203556 nomad
13066 1 37 ffff98bdd2bb1fa0 IN 0.1 5663472 203556 nomad
13166 1 10 ffff98bdc5aceeb0 IN 0.1 5663472 203556 nomad
13167 1 0 ffff98be12ff0000 IN 0.1 5663472 203556 nomad
13185 1 20 ffff98be7d052f70 IN 0.1 5663472 203556 nomad
13203 1 17 ffff98bdd3328000 IN 0.1 5663472 203556 nomad
13204 1 25 ffff98bdc5acdee0 IN 0.1 5663472 203556 nomad
13205 1 7 ffff98bdc42e8fd0 IN 0.1 5663472 203556 nomad
13206 1 2 ffff98ce55ffcf10 IN 0.1 5663472 203556 nomad
13207 1 11 ffff98ce003bbf40 IN 0.1 5663472 203556 nomad
13208 1 5 ffff98ce5628cf10 IN 0.1 5663472 203556 nomad
13209 1 6 ffff98bdc5ac9fa0 IN 0.1 5663472 203556 nomad
13210 1 47 ffff98ce2e64eeb0 IN 0.1 5663472 203556 nomad
13211 1 30 ffff98bdd332bf40 IN 0.1 5663472 203556 nomad
13270 1 7 ffff98ce55ffbf40 IN 0.1 5663472 203556 nomad
13271 1 5 ffff98ce2e64af70 IN 0.1 5663472 203556 nomad
13272 1 5 ffff98bdfceebf40 IN 0.1 5663472 203556 nomad
13273 1 21 ffff98be76384f10 IN 0.1 5663472 203556 nomad
13274 1 33 ffff98bdc1678000 IN 0.1 5663472 203556 nomad
13276 1 6 ffff98ce33f88000 IN 0.1 5663472 203556 nomad
13279 1 32 ffff98ce003b9fa0 IN 0.1 5663472 203556 nomad
13295 1 27 ffff98cde1e7bf40 IN 0.1 5663472 203556 nomad
13296 1 29 ffff98be59b1cf10 IN 0.1 5663472 203556 nomad
13297 1 28 ffff98bdc63beeb0 IN 0.1 5663472 203556 nomad
13298 1 5 ffff98bdc7bf5ee0 IN 0.1 5663472 203556 nomad
13308 1 11 ffff98ce33f88fd0 IN 0.1 5663472 203556 nomad
13310 1 9 ffff98ce25bd5ee0 IN 0.1 5663472 203556 nomad
13316 1 35 ffff98bdd2bb4f10 IN 0.1 5663472 203556 nomad
13317 1 35 ffff98be7d050fd0 IN 0.1 5663472 203556 nomad
13318 1 1 ffff98be7d056eb0 IN 0.1 5663472 203556 nomad
13326 1 7 ffff98bdc5fb2f70 IN 0.1 5663472 203556 nomad
13327 1 1 ffff98cde767eeb0 IN 0.1 5663472 203556 nomad
13558 1 7 ffff98ce5cf8cf10 IN 0.1 5663472 203556 nomad
13560 1 30 ffff98bddab46eb0 IN 0.1 5663472 203556 nomad
13575 1 19 ffff98cdda21bf40 IN 0.1 5663472 203556 nomad
13577 1 5 ffff98bdc0b42f70 IN 0.1 5663472 203556 nomad
13594 1 32 ffff98ce25bd0000 IN 0.1 5663472 203556 nomad
13595 1 1 ffff98bdc0b43f40 IN 0.1 5663472 203556 nomad
13620 1 19 ffff98cdda21cf10 IN 0.1 5663472 203556 nomad
13622 1 27 ffff98ce33f8dee0 IN 0.1 5663472 203556 nomad
13682 1 4 ffff98cdf6fb5ee0 IN 0.1 5663472 203556 nomad
13762 1 2 ffff98ce5628af70 IN 0.1 5663472 203556 nomad
13763 1 18 ffff98ce3fa04f10 IN 0.1 5663472 203556 nomad
13764 1 5 ffff98be7f2abf40 IN 0.1 5663472 203556 nomad
13765 1 4 ffff98ce55ff8fd0 IN 0.1 5663472 203556 nomad
13766 1 11 ffff98ce1d336eb0 IN 0.1 5663472 203556 nomad
13767 1 4 ffff98ce55ffeeb0 IN 0.1 5663472 203556 nomad
13768 1 46 ffff98bdc0b45ee0 IN 0.1 5663472 203556 nomad
13769 1 0 ffff98bdc6228fd0 IN 0.1 5663472 203556 nomad
13770 1 11 ffff98ce5cf8eeb0 IN 0.1 5663472 203556 nomad
13771 1 11 ffff98be7e75cf10 IN 0.1 5663472 203556 nomad
13772 1 4 ffff98bdc1311fa0 IN 0.1 5663472 203556 nomad
13773 1 18 ffff98ce55ffaf70 IN 0.1 5663472 203556 nomad
13774 1 27 ffff98ce4c2d2f70 IN 0.1 5663472 203556 nomad
13776 1 35 ffff98ce1d330fd0 IN 0.1 5663472 203556 nomad
13777 1 9 ffff98bdc87aaf70 IN 0.1 5663472 203556 nomad
15415 1 8 ffff98ce7954af70 IN 0.1 5663472 203556 nomad
15480 1 24 ffff98bdc167dee0 IN 0.1 5663472 203556 nomad
15492 1 9 ffff98bdbe704f10 IN 0.1 5663472 203556 nomad
15520 1 43 ffff98bdbf6e3f40 IN 0.1 5663472 203556 nomad
15521 1 35 ffff98bdd3753f40 IN 0.1 5663472 203556 nomad
15523 1 6 ffff98cdd7e48000 IN 0.1 5663472 203556 nomad
15524 1 3 ffff98cdef3b8000 IN 0.1 5663472 203556 nomad
17327 13771 18 ffff98ce51e23f40 IN 0.0 2565468 37996 nomad
17328 13771 15 ffff98bdd1762f70 IN 0.0 2565468 37996 nomad
17329 13771 19 ffff98bdd1761fa0 IN 0.0 2565468 37996 nomad
17330 13771 5 ffff98be792feeb0 IN 0.0 2565468 37996 nomad
17331 13771 0 ffff98cdc3ad6eb0 IN 0.0 2565468 37996 nomad
17332 13771 40 ffff98be63f33f40 IN 0.0 2565468 37996 nomad
17333 13771 43 ffff98bdc0e0af70 IN 0.0 2565468 37996 nomad
17335 13771 29 ffff98cdd7f41fa0 IN 0.0 2565468 37996 nomad
and so on..
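To see which parent the tasks hang off, the `ps` rows above can be tallied by PPID. A minimal Python sketch, assuming the nine-column layout shown in the header line (PID PPID CPU TASK ST %MEM VSZ RSS COMM); `count_by_ppid` is a hypothetical helper, not part of `crash`.

```python
from collections import Counter


def count_by_ppid(ps_output, comm="nomad"):
    """Count rows of crash `ps` output per parent PID for one command name."""
    counts = Counter()
    for line in ps_output.splitlines():
        # Drop a leading ">" marker, if any, before splitting into columns.
        fields = line.lstrip("> \t").split()
        # Skip the header, prompts and anything that is not a 9-column task row.
        if len(fields) != 9 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        if fields[8] == comm:
            counts[int(fields[1])] += 1
    return counts


if __name__ == "__main__":
    sample = """\
PID PPID CPU TASK ST %MEM VSZ RSS COMM
12792 1 6 ffff98cdda218fd0 IN 0.1 5663472 203556 nomad
13771 1 11 ffff98be7e75cf10 IN 0.1 5663472 203556 nomad
17327 13771 18 ffff98ce51e23f40 IN 0.0 2565468 37996 nomad
17328 13771 15 ffff98bdd1762f70 IN 0.0 2565468 37996 nomad
"""
    for ppid, n in count_by_ppid(sample).most_common():
        print(ppid, n)
```

Rows with identical VSZ and RSS may be threads of one process rather than separate processes, so the per-PPID totals are an upper bound on the process count, not an exact figure.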
BTW, nomad 0.5.2 with kernel 3.10.0.862.9.1.el7.x86_64 runs perfectly.
And nomad 0.8.7 with kernel 3.10.0-693.37.4.el7.x86_64 runs perfectly, too.
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1: