We've seen occasional crashes of HAProxy when processing HTTP/2 requests from clients. The steps needed to reproduce the crash are unknown, but we have a core file from the most recent crash on a production system.
Return cleanly from h2_send
Unknown
It's unclear. The line where the apparent segfault occurs, src/mux_h2.c:3454, looks like this:
while (((h2c->flags & (H2_CF_MUX_MFULL|H2_CF_MUX_MALLOC)) == 0) && !done)
done = h2_process_mux(h2c);
In the core file we collected, h2c is:
(gdb) p h2c
$8 = (struct h2c *) 0x7fc5ec583f60
(gdb) p *h2c
$9 = {
conn = 0x7fc6043f18b0,
st0 = H2_CS_ERROR,
errcode = H2_ERR_NO_ERROR,
flags = 73730,
streams_limit = 100,
max_id = 6445,
rcvd_c = 1386,
rcvd_s = 66,
ddht = 0x7fc5901e7ee0,
dbuf = {
size = 0,
area = 0x0,
data = 0,
head = 0
},
dsi = 6439,
dfl = 0,
dft = 0 '\000',
dff = 1 '\001',
dpl = 0 '\000',
last_sid = 6445,
mbuf = {{
size = 32,
area = 0x2 <error: Cannot access memory at address 0x2>,
data = 9,
head = 9
}, {
size = 0,
area = 0x0,
data = 0,
head = 0
} <repeats 31 times>},
msi = -1,
mfl = -1831598677,
mft = 45 '-',
mff = 76 'L',
miw = 6291456,
mws = 15633930,
mfs = 16384,
timeout = 10000,
shut_timeout = 10000,
nb_streams = 1,
nb_cs = 1,
nb_reserved = 0,
stream_cnt = 3223,
proxy = 0x5604b4ebb910,
task = 0x0,
streams_by_id = {
b = {0x7fc5c06b3f90, 0x0}
},
send_list = {
n = 0x7fc5c06b3ff8,
p = 0x7fc5c06b3ff8
},
fctl_list = {
n = 0x7fc5ec584418,
p = 0x7fc5ec584418
},
blocked_list = {
n = 0x7fc5ec584428,
p = 0x7fc5ec584428
},
buf_wait = {
target = 0xfd8130000182a301,
wakeup_cb = 0x1131d5503060c30,
list = {
next = 0x7fc5ec584448,
prev = 0x7fc5ec584448
}
},
wait_event = {
tasklet = 0x7fc5ec884540,
events = 2
}
}
Clients speak HTTP/1.1 and HTTP/2 to HAProxy. HAProxy speaks only HTTP/1.1 to the backend servers.
haproxy -vv and uname -aHA-Proxy version 2.2.4-de45672 2020/09/30 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.4.html
Running on: Linux 4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64
Build options :
TARGET = linux-glibc
CPU = generic
CC = gcc
CFLAGS = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
OPTIONS = USE_OPENSSL=1 USE_LUA=1 USE_TFO=1
Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_THREADS=64, default=64).
Built with OpenSSL version : OpenSSL 1.1.1g 21 Apr 2020
Running on OpenSSL version : OpenSSL 1.1.1 11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with network namespace support.
Built without compression support (neither USE_ZLIB nor USE_SLZ are set).
Compression algorithms supported : identity("identity")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built without PCRE or PCRE2 support (using libc's regex instead)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 7.5.0
Built with the Prometheus exporter as a service
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
h2 : mode=HTTP side=FE|BE mux=H2
<default> : mode=TCP side=FE|BE mux=PASS
Available services :
prometheus-exporter
Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
[CACHE] cache
[FCGI] fcgi-app
Linux [HOSTNAME_DELETED] 4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fc6875fa8b1 in __GI_abort () at abort.c:79
#2 0x00005604b45a6ef8 in ha_panic () at src/debug.c:271
#3 0x00005604b460e778 in wdt_handler (sig=14, si=<optimized out>, arg=<optimized out>)
at src/wdt.c:119
#4 <signal handler called>
#5 h2_send (h2c=h2c@entry=0x7fc5ec583f60) at src/mux_h2.c:3454
#6 0x00005604b44eac95 in h2_process (h2c=h2c@entry=0x7fc5ec583f60) at src/mux_h2.c:3677
#7 0x00005604b44edb88 in h2_io_cb (t=<optimized out>, ctx=0x7fc5ec583f60, status=<optimized out>)
at src/mux_h2.c:3558
#8 0x00005604b45f520c in run_tasks_from_lists (budgets=budgets@entry=0x7fc6742bf35c)
at src/task.c:448
#9 0x00005604b45f5a9d in process_runnable_tasks () at src/task.c:672
#10 0x00005604b45b0857 in run_poll_loop () at src/haproxy.c:2915
#11 0x00005604b45b0c09 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3080
#12 0x00007fc6884d76db in start_thread (arg=0x7fc6742e2700) at pthread_create.c:463
#13 0x00007fc6876dba3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Its interesting. It seems to be similar to #875. The dump of your h2c is really useful. I think I understand how a loop in h2_send() is possible.
The h2c is in state H2_CS_ERROR with the error code H2_ERR_NO_ERROR. It only happens when the task timeout is expired. In this case, the task callback function is called. It is confirmed because h2c->task is NULL. If the mbuf ring is full (H2_CF_MUX_MFULL flag is set), no GOAWAY frame is sent and H2_CF_GOAWAY_FAILED is set. At this stage, I guess there is at least one stream attached to the connection and we return without releasing the h2c.
When the h2c is woken up to send remaining data, h2_process() is called. There are 2 h2_send() in this function. If the first one is unable to send any data, H2_CF_MUX_MFULL is not removed. But before calling the second one, at the end of h2_process(), because H2_CF_GOAWAY_FAILED flag is set, the mbuf ring is released, without removing H2_CF_MUX_MFULL. And now, when the second h2_send() is called, we fall in a loop because the mbuf ring is empty while the H2_CF_MUX_MFULL is set. Thus done and sent variables remain to 0.
@wtarreau, I don't know if this scenario is clear, it is not easy to describe. But, in theory, it is possible to have a loop in h2_send(). Now, I don't really know how to fix it. I guess h2_release_mbuf() should at least remove H2_CF_MUX_MFULL flags. I don't know if the second h2_send() is useful, except if a GOAWAY frame is emitted.
Just a note. In h2_timeout_task(), I guess we should also remove H2_CF_MUX_MFULL and H2_CF_DEM_MROOM if something is sent.
I pushed a fix for this issue. I will backport it soon. Thanks !
backported to 2.2 now. If it is possible, it could be good to validate the fix.
Thanks @capflam , we will try and get this out this week for testing.
As a follow up: we waited for the 2.2.5 release, which included this Issue Fix. We have been running it since the release and have not yet experienced the failure. Thanks for quick fix and release.
Cool. thanks for your feedback.