Ejabberd: Segmentation fault on Ejabberd 16.8

Created on 13 Dec 2016 · 19 Comments · Source: processone/ejabberd

Hi,

I'm having constant crashes (every 3-6 hours on every node) in ejabberd. I'm currently using ejabberd 16.8 but the issue happened with older versions as well. Architecture-wise, I have four ejabberd nodes in a cluster. I'm also using redis as the session manager.

Fortunately, the crashes leave a core dump which I have analysed without luck.

Here is the output, step by step (I've highlighted each command, followed by its output):

gdb xavier/rel/xavier/erts-8.1.1/bin/beam.smp -core crashes/core.5_scheduler.14 -d otp_src_19.1/erts/emulator

Program terminated with signal SIGSEGV, Segmentation fault.
#0  do_minor (p=0x7f88582a04e0, live_hf_end=<optimized out>, mature=<optimized out>, mature_size=672, new_sz=610, objv=<optimized out>, nobj=1) at beam/erl_gc.c:1401
1401            val = *ptr;

(gdb) print ptr

$1 = <optimized out>

(gdb) print gval

$3 = 140221396262226

(gdb) source otp_src_19.1/erts/etc/unix/etp-commands.in

%---------------------------------------------------------------------------
% Use etp-help for a command overview and general help.
%
% To use the Erlang support module, the environment variable ROOTDIR
% must be set to the toplevel installation directory of Erlang/OTP,
% so the etp-commands file becomes:
%     $ROOTDIR/erts/etc/unix/etp-commands
% Also, erl and erlc must be in the path.
%---------------------------------------------------------------------------
etp-set-max-depth 20
etp-set-max-string-length 100
--------------- System Information ---------------
OTP release: 19
ERTS version: 8.1.1
Compile date: Tue Nov 15 00:12:21 2016
Arch: x86_64-unknown-linux-gnu
Endianness: Little
Word size: 64-bit
HiPE support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported and used
Debug compiled: no
Lock checking: no
Lock counting: no
Node name: [email protected]
Number of schedulers: 16
Number of async-threads: 100
--------------------------------------------------

(gdb) etp-process-info p

  Pid: <0.21391.423>
  State: on-heap-msgq | garbage-collecting | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
  Current function: stringprep:resourceprep/1
  CP: #Cp<ejabberd_hooks:safe_apply/3+0xc0>
  I: #Cp<gen:do_call/4+0x198>
  Heap size: 610
  Old-heap size: 1598
  Mbuf size: 0
  Msgq len: 1 (inner=1, outer=0)
  Parent: <0.1696.0>
  Pointer: (Process *) 0x7f88582a04e0

(gdb) etp-stacktrace p

% Stacktrace (77): #Cp<ejabberd_hooks:safe_apply/3+0xc0>.
#Cp<ejabberd_hooks:run_fold1/4+0x1430>.
#Cp<ejabberd_router:do_route/3+0x4f0>.
#Cp<0x64e0ebf8>.
#Cp<ejabberd_router_multicast:'-do_route_normal/3-lc$^0/1-0-'/3+0x90>.
#Cp<0x59edb4f8>.
#Cp<ejabberd_c2s:terminate/3+0x10c0>.
#Cp<p1_fsm:terminate/8+0x130>.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Cp<terminate process normally>.

(gdb) etp-stackdump p

% Stackdump (77): #Cp<ejabberd_hooks:safe_apply/3+0xc0>.
#Cp<ejabberd_hooks:run_fold1/4+0x1430>.
#Catch<2533>.
#Cp<ejabberd_router:do_route/3+0x4f0>.
[].
[].
[].
[].
[].
[].
[].
filter_packet.
{{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>},{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>},{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}}.
[].
#Cp<0x64e0ebf8>.
[].
[].
[].
[].
[].
[].
[].
[].
#Cp<ejabberd_router_multicast:'-do_route_normal/3-lc$^0/1-0-'/3+0x90>.
[].
[].
[].
[].
[].
{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>}.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>}.
#Catch<2410>.
#Cp<0x59edb4f8>.
{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>}.
[].
#Cp<ejabberd_c2s:terminate/3+0x10c0>.
[].
[].
[].
[].
[].
{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}.
[{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>}].
#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>}.
#Catch<2431>.
#Cp<p1_fsm:terminate/8+0x130>.
[].
[].
[].
[].
[].
[].
[].
[].
[].
inactive.
{{1481,631321,949637},<0.21391.423>}.
#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>.
#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>.
#HeapBinary<0x8,0x33333231>.
0.
{state,{socket_state,gen_tcp,#Port<0.1144832>,<0.21390.423>},ejabberd_socket,#Ref<0.0.1048580.225209>,false,#HeapBinary<0x13,0x39393237,0x39383731,0x5c313931>,undefined,c2s,c2s_shaper,false,false,false,false,[verify_none,compression_none],true,{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>},#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,{{1481,631321,949637},<0.21391.423>},...}.
ejabberd_socket.
{socket_state,gen_tcp,#Port<0.1144832>,<0.21390.423>}.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Catch<1665>.
[].
{state,{socket_state,gen_tcp,#Port<0.1144832>,<0.21390.423>},ejabberd_socket,#Ref<0.0.1048580.225209>,false,#HeapBinary<0x13,0x39393237,0x39383731,0x5c313931>,undefined,c2s,c2s_shaper,false,false,false,false,[verify_none,compression_none],true,{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>},#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,{{1481,631321,949637},<0.21391.423>},...}.
session_established.
ejabberd_c2s.
{'$gen_event',{xmlstreamerror,#HeapBinary<0x15,0x204c4d58,0x6920617a,0x6962206f>}}.
<0.21391.423>.
normal.
#Cp<terminate process normally>.
#Catch<162>.

(gdb) etpf-stackdump p

% Stackdump (77): #Cp<ejabberd_hooks:safe_apply/3+0xc0>.
#Cp<ejabberd_hooks:run_fold1/4+0x1430>.
#Catch<2533>.
#Cp<ejabberd_router:do_route/3+0x4f0>.
[].
[].
[].
[].
[].
[].
[].
filter_packet.
<etpf-boxed 0xd7cb6222>.
[].
#Cp<0x64e0ebf8>.
[].
[].
[].
[].
[].
[].
[].
[].
#Cp<ejabberd_router_multicast:'-do_route_normal/3-lc$^0/1-0-'/3+0x90>.
[].
[].
[].
[].
[].
<etpf-boxed 0x5b6c8aa2>.
<etpf-boxed 0xd7cb6242>.
<etpf-boxed 0xd7188fea>.
#Catch<2410>.
#Cp<0x59edb4f8>.
<etpf-boxed 0x5b6c8aa2>.
<etpf-boxed 0xd7188fea>.
[].
#Cp<ejabberd_c2s:terminate/3+0x10c0>.
[].
[].
[].
[].
[].
<etpf-boxed 0x5b6c8aa2>.
<etpf-cons 0xd7cb6281>.
<etpf-boxed 0xd71891fa>.
<etpf-boxed 0xd7188fea>.
#Catch<2431>.
#Cp<p1_fsm:terminate/8+0x130>.
[].
[].
[].
[].
[].
[].
[].
[].
[].
inactive.
<etpf-boxed 0xd71892d2>.
<etpf-boxed 0xd7189222>.
<etpf-boxed 0xd71891fa>.
<etpf-boxed 0xd71891e2>.
0.
<etpf-boxed 0xd718968a>.
ejabberd_socket.
<etpf-boxed 0xd718925a>.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Catch<1665>.
[].
<etpf-boxed 0xd718968a>.
session_established.
ejabberd_c2s.
<etpf-boxed 0xd71899ea>.
<0.21391.423>.
normal.
#Cp<terminate process normally>.
#Catch<162>.

TL;DR: I assume the problem is in ejabberd_router_multicast (which may have to do with inter-node communication), but I don't know how to proceed. The stack entry right after ejabberd_c2s,
{'$gen_event',{xmlstreamerror,#HeapBinary<0x15,0x204c4d58,0x6920617a,0x6962206f>}}.
seems important.
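
The #HeapBinary terms in the stack dump can be partly decoded by hand. The sketch below assumes (inferred from the output above, not checked against the etp-commands source) that the first field is the byte size and that each following value is only the low 32 bits of a little-endian 64-bit heap word, so four bytes out of every eight are hidden. The helper names are hypothetical, and the candidate string is a guess that merely fits the visible bytes.

```python
import re
import struct


def decode_heap_binary(text):
    """Split an etp '#HeapBinary<size,word,...>' rendering into (size, fragments)."""
    match = re.fullmatch(r"#HeapBinary<([^>]*)>", text.strip())
    if match is None:
        raise ValueError("not a #HeapBinary<...> term")
    fields = [int(field, 16) for field in match.group(1).split(",")]
    size, words = fields[0], fields[1:]
    fragments = []
    for index, word in enumerate(words):
        # Assumption: the printed value is the low half of 64-bit word `index`,
        # i.e. bytes 8*index .. 8*index+3 of the binary.
        chunk = struct.pack("<I", word)
        remaining = max(0, size - 8 * index)
        fragments.append(chunk[:min(4, remaining)])
    return size, fragments


def render(size, fragments):
    """Render visible bytes as text and hidden bytes as '?'."""
    out = []
    for index, fragment in enumerate(fragments):
        out.append(fragment.decode("latin-1"))
        hidden = min(4, max(0, size - (8 * index + 4)))
        out.append("?" * hidden)
    return "".join(out)


def consistent_with(masked, candidate):
    """True if candidate matches masked at every position that is not '?'."""
    return len(masked) == len(candidate) and all(
        m == "?" or m == c for m, c in zip(masked, candidate)
    )


error_term = "#HeapBinary<0x15,0x204c4d58,0x6920617a,0x6962206f>"
masked = render(*decode_heap_binary(error_term))
print(masked)                                            # XML ????za i????o bi?
print(consistent_with(masked, "XML stanza is too big"))  # True
```

Under these assumptions the xmlstreamerror payload is 21 bytes long and its visible bytes fit "XML stanza is too big". The same decoding applied to the xmlel terms gives "pres" for the 8-byte element name and "type" / "unav????ble" for the attribute, which would fit an unavailable presence being routed while the c2s process terminates.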

Thanks in advance!

Bug

All 19 comments

Do you remember since what version you started to get crashes?

What version of expat is ejabberd compiled against?

We are using expat 2.1.0 (2.1.0-6+deb8u3, to be exact).

dpkg -s expat

Package: expat
Status: install ok installed
Priority: optional
Section: text
Installed-Size: 42
Maintainer: Laszlo Boszormenyi (GCS) <[email protected]>
Architecture: amd64
Version: 2.1.0-6+deb8u3
Depends: libc6 (>= 2.14), libexpat1 (>= 2.0.1)
Description: XML parsing C library - example application
 This package contains xmlwf, an example application of expat, the C
 library for parsing XML.  The arguments to xmlwf are one or more
 files which are each to be checked for XML well-formedness.
Homepage: http://expat.sourceforge.net

dpkg -s libexpat1

Package: libexpat1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 347
Maintainer: Laszlo Boszormenyi (GCS) <[email protected]>
Architecture: amd64
Multi-Arch: same
Source: expat
Version: 2.1.0-6+deb8u3
Depends: libc6 (>= 2.14)
Pre-Depends: multiarch-support
Conflicts: wink (<= 1.5.1060-4)
Description: XML parsing C library - runtime library
 This package contains the runtime, shared library of expat, the C
 library for parsing XML. Expat is a stream-oriented parser in
 which an application registers handlers for things the parser
 might find in the XML document (like start tags).
Homepage: http://expat.sourceforge.net

I have just found something. If I run

dpkg -s lib64expat1

dpkg-query: package 'lib64expat1' is not installed and no information is available
Use dpkg --info (= dpkg-deb --info) to examine archive files,
and dpkg --contents (= dpkg-deb --contents) to list their contents.

Can the problem be related to the fact we are not installing the amd64 version of expat? (We are using a 64 bit system)

That libexpat1 package is already 64 bit version

My bad, it says Architecture: amd64 right in the middle :(

@santiagopoli Do you remember since what version you started to get crashes?

No, but I got this same crash with ejabberd 16.6. It didn't happen to us with ejabberd 15 (but we were having a lot of other crashes at that time, due to our own code, so it may have happened then as well).

BTW, we're using Ejabberd as an Elixir dependency.

Could this be related to https://bugs.erlang.org/browse/ERL-304 ?

Yes, this is probably the same bug.

santiagopoli, did you resolve this crash? We recently ran into a very similar crash as well.

Guys, we're reviewing our C code, please be patient (we have plenty of core dumps like this one).

No, we haven't solved it yet, but we think it only happens within a cluster. We tried to reproduce the bug with a single, larger node and didn't see the error (but that may be pure luck). Having 4 interconnected nodes produces this crash 2+ times a day. Note that we often have 100k users connected at the same time.

Have you figured out the root cause? We've temporarily resolved this issue by rolling back from fast_xml to p1_xml. Hopefully this information is useful to others.

@shanjianping yes, that's important info, thanks

I'm experiencing the same symptoms as described in this issue. If you don't mind me asking:

The root cause is now covered by the test here: https://github.com/processone/fast_xml/commit/6cfd311c6fa6ed94de7b9bdb255751c26e66094b
Without the patch you would get a segfault at line 406. Simply put, if you send more data after the server generates an xml-too-big error, it segfaults, because the server attempts to reuse freed structures.
This also happens in a non-cluster environment and is easily reproducible: revert the patch and run the test.
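
The failure mode described above (more input arriving after the parser has already released its state on a too-big error) can be modelled with a toy example. This is only an illustrative sketch in Python, not fast_xml's C code: `ToyStreamParser`, its fields, and the "parser already failed" message are hypothetical, and setting the buffer to `None` merely stands in for freeing native structures.

```python
class ToyStreamParser:
    """Toy model of a size-limited stream parser (hypothetical, not fast_xml)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = bytearray()  # stands in for natively allocated parser state
        self.failed = False

    def feed(self, data):
        if self.failed:
            # Guard: the state was released by an earlier error, so refuse the
            # input instead of touching the released state again.
            return ("error", "parser already failed")
        if len(self.buffer) + len(data) > self.max_size:
            self.buffer = None  # models freeing the native structures
            self.failed = True
            return ("error", "XML stanza is too big")
        self.buffer.extend(data)
        return ("ok", len(self.buffer))


parser = ToyStreamParser(max_size=8)
print(parser.feed(b"abcd"))    # ('ok', 4)
print(parser.feed(b"efghij"))  # ('error', 'XML stanza is too big')
print(parser.feed(b"more"))    # ('error', 'parser already failed')
```

Without the `failed` check, the third call would evaluate `len(None)` and raise a `TypeError`, which is the Python-level analogue of touching freed memory. In native code the same mistake is undefined behaviour and can corrupt the process heap, so a crash may only surface later, for example during garbage collection.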

The following trick fixed the problem on my fresh ejabberd install on Ubuntu 16.04.03:

https://askubuntu.com/questions/865578/is-there-a-way-to-override-a-hat-child-profile-in-an-apparmor-local-override-fil

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
