Haproxy: Crash in h2_send in HAProxy 2.2.4

Created on 22 Oct 2020  路  6Comments  路  Source: haproxy/haproxy

Detailed description of the problem

We've seen occasional crashes of HAProxy when processing HTTP/2 requests from clients. The steps needed to reproduce the crash are unknown, but we have a core file from the most recent crash on a production system.

Expected behavior

Return cleanly from h2_send

Steps to reproduce the behavior

Unknown

Do you have any idea what may have caused this?

It's unclear. The line where the apparent segfault occurs, src/mux_h2.c:3454, looks like this:

                while (((h2c->flags & (H2_CF_MUX_MFULL|H2_CF_MUX_MALLOC)) == 0) && !done)
                        done = h2_process_mux(h2c);

In the core file we collected, h2c is:

(gdb) p h2c
$8 = (struct h2c *) 0x7fc5ec583f60
(gdb) p *h2c
$9 = {
  conn = 0x7fc6043f18b0, 
  st0 = H2_CS_ERROR, 
  errcode = H2_ERR_NO_ERROR, 
  flags = 73730, 
  streams_limit = 100, 
  max_id = 6445, 
  rcvd_c = 1386, 
  rcvd_s = 66, 
  ddht = 0x7fc5901e7ee0, 
  dbuf = {
    size = 0, 
    area = 0x0, 
    data = 0, 
    head = 0
  }, 
  dsi = 6439, 
  dfl = 0, 
  dft = 0 '\000', 
  dff = 1 '\001', 
  dpl = 0 '\000', 
  last_sid = 6445, 
  mbuf = {{
      size = 32, 
      area = 0x2 <error: Cannot access memory at address 0x2>, 
      data = 9, 
      head = 9
    }, {
      size = 0, 
      area = 0x0, 
      data = 0, 
      head = 0
    } <repeats 31 times>}, 
  msi = -1, 
  mfl = -1831598677, 
  mft = 45 '-', 
  mff = 76 'L', 
  miw = 6291456, 
  mws = 15633930, 
  mfs = 16384, 
  timeout = 10000, 
  shut_timeout = 10000, 
  nb_streams = 1, 
  nb_cs = 1, 
  nb_reserved = 0, 
  stream_cnt = 3223, 
  proxy = 0x5604b4ebb910, 
  task = 0x0, 
  streams_by_id = {
    b = {0x7fc5c06b3f90, 0x0}
  }, 
  send_list = {
    n = 0x7fc5c06b3ff8, 
    p = 0x7fc5c06b3ff8
  }, 
  fctl_list = {
    n = 0x7fc5ec584418, 
    p = 0x7fc5ec584418
  }, 
  blocked_list = {
    n = 0x7fc5ec584428, 
    p = 0x7fc5ec584428
  }, 
  buf_wait = {
    target = 0xfd8130000182a301, 
    wakeup_cb = 0x1131d5503060c30, 
    list = {
      next = 0x7fc5ec584448, 
      prev = 0x7fc5ec584448
    }
  }, 
  wait_event = {
    tasklet = 0x7fc5ec884540, 
    events = 2
  }
}

Do you have an idea how to solve the issue?

What is your configuration?

Clients speak HTTP/1.1 and HTTP/2 to HAProxy. HAProxy speaks only HTTP/1.1 to the backend servers.

Output of haproxy -vv and uname -a

HA-Proxy version 2.2.4-de45672 2020/09/30 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.4.html
Running on: Linux 4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_OPENSSL=1 USE_LUA=1 USE_TFO=1

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=64).
Built with OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
Running on OpenSSL version : OpenSSL 1.1.1  11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with network namespace support.
Built without compression support (neither USE_ZLIB nor USE_SLZ are set).
Compression algorithms supported : identity("identity")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built without PCRE or PCRE2 support (using libc's regex instead)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 7.5.0
Built with the Prometheus exporter as a service

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
              h2 : mode=HTTP       side=FE|BE     mux=H2
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services :
    prometheus-exporter

Available filters :
    [SPOE] spoe
    [COMP] compression
    [TRACE] trace
    [CACHE] cache
    [FCGI] fcgi-app

Linux [HOSTNAME_DELETED] 4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

If HAProxy crashed: Last outputs and backtraces

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fc6875fa8b1 in __GI_abort () at abort.c:79
#2  0x00005604b45a6ef8 in ha_panic () at src/debug.c:271
#3  0x00005604b460e778 in wdt_handler (sig=14, si=<optimized out>, arg=<optimized out>)
    at src/wdt.c:119
#4  <signal handler called>
#5  h2_send (h2c=h2c@entry=0x7fc5ec583f60) at src/mux_h2.c:3454
#6  0x00005604b44eac95 in h2_process (h2c=h2c@entry=0x7fc5ec583f60) at src/mux_h2.c:3677
#7  0x00005604b44edb88 in h2_io_cb (t=<optimized out>, ctx=0x7fc5ec583f60, status=<optimized out>)
    at src/mux_h2.c:3558
#8  0x00005604b45f520c in run_tasks_from_lists (budgets=budgets@entry=0x7fc6742bf35c)
    at src/task.c:448
#9  0x00005604b45f5a9d in process_runnable_tasks () at src/task.c:672
#10 0x00005604b45b0857 in run_poll_loop () at src/haproxy.c:2915
#11 0x00005604b45b0c09 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3080
#12 0x00007fc6884d76db in start_thread (arg=0x7fc6742e2700) at pthread_create.c:463
#13 0x00007fc6876dba3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Additional information (if helpful)

fixed bug

All 6 comments

Its interesting. It seems to be similar to #875. The dump of your h2c is really useful. I think I understand how a loop in h2_send() is possible.

The h2c is in state H2_CS_ERROR with the error code H2_ERR_NO_ERROR. It only happens when the task timeout is expired. In this case, the task callback function is called. It is confirmed because h2c->task is NULL. If the mbuf ring is full (H2_CF_MUX_MFULL flag is set), no GOAWAY frame is sent and H2_CF_GOAWAY_FAILED is set. At this stage, I guess there is at least one stream attached to the connection and we return without releasing the h2c.

When the h2c is woken up to send remaining data, h2_process() is called. There are 2 h2_send() in this function. If the first one is unable to send any data, H2_CF_MUX_MFULL is not removed. But before calling the second one, at the end of h2_process(), because H2_CF_GOAWAY_FAILED flag is set, the mbuf ring is released, without removing H2_CF_MUX_MFULL. And now, when the second h2_send() is called, we fall in a loop because the mbuf ring is empty while the H2_CF_MUX_MFULL is set. Thus done and sent variables remain to 0.

@wtarreau, I don't know if this scenario is clear, it is not easy to describe. But, in theory, it is possible to have a loop in h2_send(). Now, I don't really know how to fix it. I guess h2_release_mbuf() should at least remove H2_CF_MUX_MFULL flags. I don't know if the second h2_send() is useful, except if a GOAWAY frame is emitted.

Just a note. In h2_timeout_task(), I guess we should also remove H2_CF_MUX_MFULL and H2_CF_DEM_MROOM if something is sent.

I pushed a fix for this issue. I will backport it soon. Thanks !

backported to 2.2 now. If it is possible, it could be good to validate the fix.

Thanks @capflam , we will try and get this out this week for testing.

As a follow up: we waited for the 2.2.5 release, which included this Issue Fix. We have been running it since the release and have not yet experienced the failure. Thanks for quick fix and release.

Cool. thanks for your feedback.

Was this page helpful?
0 / 5 - 0 ratings