Ejabberd: Possibly a bug in mod_stream_mgmt

Created on 3 May 2019  路  8Comments  路  Source: processone/ejabberd

Hi guys,
I recently discovered a very weird behaviour in our Ejabberd nodes. Some of the c2s processes are hanging on the bounce_message_queue . On running erlang:get_process_info(PID) I found the following problematic info

[{current_function,{erlang,bif_return_trap,1}},                                                                                                                                                         
 {initial_call,{proc_lib,init_p,5}},                                                                                                                                                                    
 {status,running},                                                                                                                                                                                      
 {message_queue_len,82969},                                                                                                                                                                             
 {messages,[{#Ref<0.0.899153930.63895>,ok},                                                                                                                                                             
            {'$gen_cast',{close,ping_timeout}},                                                                                                                                                         
            {system,{<0.4663.172>,#Ref<0.0.111935508.211751>},get_state},                                                                                                                               
            {route,{message ...}}   

The stack trace shows

ejabberd_c2s:route/2 ->  src/ejabberd_c2s.erl:143
lists:foreach/2 ->  lists.erl:1337
ejabberd_sm:route/1 ->  src/ejabberd_sm.erl:138
ejabberd_local:route/1 ->  src/ejabberd_local.erl:116
ejabberd_router:do_route/1 ->  src/ejabberd_router.erl:348
ejabberd_router:route/1 ->  src/ejabberd_router.erl:87
ejabberd_c2s:bounce_message_queue/0 ->  src/ejabberd_c2s.erl:853

In this scenario, the bounce message function in mod_stream_mgmt specifically
https://github.com/processone/ejabberd/blob/6b0f7f2a246c1bbbd69ef4bb0f23a7d494e0d8c4/src/mod_stream_mgmt.erl#L739
is leading to a while true kind of situation if the message box is very large, since the ejabberd_c2s terminated function is called after mod_stream_mgmt function if mod_stream_mgmt is enabled.

I propose changing it to something like

-spec bounce_message_queue() -> ok.
bounce_message_queue() ->
    receive {route, Pkt} ->
        case xmpp:get_meta(Pkt, mgmt_is_bounced, false) of 
            false ->
                xmpp:put_meta(NewPkt, mgmt_is_bounced, true)
                ejabberd_router:route(Pkt),
                bounce_message_queue();
            _ -> 
                ok
        end
    after 0 ->
        ok
    end.
Needs more info

Most helpful comment

All 8 comments

@zinid I can send a PR for this if you would like.

It's not yet clear to me how mod_stream_mgmt:bounce_message_queue/0 would trigger a "while true kind of situation". The function is only called on resumption, and in that case, execution of the c2s_terminated hook is stopped, so ejabberd_c2s:bounce_message_queue/0 (which shows up in your stack trace) is _not_ called afterwards. Maybe the new c2s process that was created during resumption dies very quickly, and therefore re-bounces the bounced messages?

@weiss What info can I provide which would be helpful in debugging this ? Do you need more stack trace ?

Basically anything that leads you to the assumption there's actually a bug in ejabberd. From what you've shown, I don't see anything going wrong. If you can reproduce the issue, enabling [debug] output (ejabberdctl set-loglevel 5) before reproducing might give some hints as to what's going on. But I'm also fine with a description of how some code path does the wrong thing.

So following is my hypothesis (I can't provide logs because the issue can only be produced in production), ejabberd_c2s c2s_terminated hook will be called after mod_stream_mgmt c2s_terminated hook. In case of a large amount of messages in the message box the receive block will keep on looping in itself since the process is still alive and receiving messages (since the time to send all the message is huge, the process starts receiving the older messages as well). this line can only be called once the process exits the receive loop but since the process has plenty of messages in its mailbox it never does.

this line can only be called once the process exits the receive loop but since the process has plenty of messages in its mailbox it never does.

This assumption is wrong, I think. mod_stream_mgmt:bounce_message_queue/0 is only called on resumption of a c2s session from a new connection. The c2s process that handles the new connection updates ejabberd_sm's routing table before stopping the old c2s process. Only after that, the old c2s process runs mod_stream_mgmt:bounce_message_queue/0. Therefore, the bounced messages are guaranteed to be routed to the _new_ c2s process at that point.

However, the stack trace you've shown is actually a different case where no resumption is involved, the c2s session is terminated for some reason and ejabberd_c2s:bounce_message_queue/0 is called. Just as during resumption, the c2s process is already removed from the routing table at that point. So I don't see messages looping in this case either.

I'd suggest closing this issue for the moment. So far I fail to see a bug, and I don't see how the suggested fix would actually change ejabberd's behavior. If someone comes up with new details, we can reopen it.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

pacija picture pacija  路  4Comments

jjdp picture jjdp  路  3Comments

Vshnv picture Vshnv  路  4Comments

cromain picture cromain  路  3Comments

lucastimotiofirmino picture lucastimotiofirmino  路  3Comments