haproxy 2.2: h2 + large file uploads fail

Created on 2 Aug 2020 · 5Comments · Source: haproxy/haproxy

We deployed 2.2.1 last week but had to revert to 2.0.16 because large file uploads broke. I've created a complete Docker Compose test app here. This includes a simple Node app and stripped down haproxy config. I can recreate this issue 100% of the time using this setup.

Our production setup is Apache / PHP.

Output of `haproxy -vv` and `uname -a`

This happens with any version of 2.2 including the official Docker images (using haproxy:2.2-alpine in my test app).

What's the configuration?

A stripped down config can be found here.

Steps to reproduce the behavior

Put haproxy 2.2 in front of a webserver configured to accept file uploads. Use h2.
Upload a larger file. I've been testing with files 100MB+.
Wait until it nearly completes uploading. The upload will fail. Always seems to happen right at the end.
Here is the log sequence from my test app complete with comments:

# From the app. It looks like the file upload completes successfully as far as the application is concerned. The md5 hashes are always the same.
web_1    | upload finished, post data: {
web_1    |   testUpload: {
web_1    |     name: '951172_Show-1-Bad-Man-Incorporate.mp3',
web_1    |     data: <Buffer >,
web_1    |     size: 127412766,
web_1    |     encoding: '7bit',
web_1    |     tempFilePath: '/tmp/tmp-6-1596328303983',
web_1    |     truncated: false,
web_1    |     mimetype: 'audio/mpeg',
web_1    |     md5: '37d6046ac89f95899c4fd60e13d057fa',
web_1    |     mv: [Function: mv]
web_1    |   }
web_1    | }

# Yet the request fails with termination flags SD:
proxy_1  | 192.168.111.48:63220 [02/Aug/2020:00:31:43.952] fe-main~ be-upload-test/web 29/0/0/-1/56534 -1 0 - - SD-- 1/1/0/0/0 0/0 {dev01:8100||https://dev01:8100/} "POST https://dev01:8100/upload HTTP/2.0"

# The app thinks the upload succeeded.
web_1    | ::ffff:192.168.48.3 - - [02/Aug/2020:00:32:40 +0000] "POST /upload HTTP/1.1" 302 46 "https://dev01:8100/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"

The browser doesn't get notified that the upload succeeded. It will keep trying the upload several times until it errors out completely.

Actual behavior

The file upload fails.

Expected behavior

The file upload should succeed.

Do you have any idea what may have caused this?

Upgrading to 2.2.1. This appears to only be present in 2.2.x (I've tested 2.2.1 and 2.2.2). I can't recreate this issue in 2.1.x or 2.0.x.

Do you have an idea how to solve the issue?

~It looks like adding tune.h2.initial-window-size 1048576 to the global section is a workaround. I've tested it with 800MB+ files and I can't get the issue to occur when using this setting (we really needed this anyway!). I haven't tested this in production yet but may do so soon.~

I tested 2.2.2 configured with tune.h2.initial-window-size 1048576 on a fairly loaded production server. The issue is still present even with this tuned setting. Looks like the only solution is to downgrade.

medium fixed bug

Source

brenc

All 5 comments

I'm unable to reproduce the issue with a similar config and apache as server (but not php). I tested with a file of 1g. But the upload itself seems to be ok in your test. But there is a problem with the server response. The status code is reported as -1 with SD as termination flags. If it is 100% reproducible on a test environment, it could be a good idea to enable H1 traces. To do so, you need to configure a stats socket in your global section (in TCP because you are in a container).

To enable H1 traces and wait for events:

socat tcp:A.B.C.D:{PORT}t readline
trace h1 sink buf0; trace h1 level developer; trace h1 verbosity complete; trace h1 start now; show events buf0 -w

With the traces, it will be easier to understand what happens.

capflam on 4 Aug 2020

That's very odd considering I can reproduce the issue nearly 100% of the time. Are you doing this on a local network connection? That might be related (specifically latency). If so, you could try throttling your network connection (e.g. in Chrome devtools, network tab).

I'm testing this remotely with two different systems that are at about a 35ms ping time away. I'm able to reproduce this with the full Apache/PHP site and with my minimal Node app. The staging site is over public IP and my dev is over VPN.

Anyway, attaching the trace. I started the trace then uploaded my usual 100MBish test file. I let it upload as usual until the end when it gets reset and the browser starts the upload over. I stopped the upload and the trace then. The most relevant bits are probably at the end of the file.

h1_trace.log.gz

brenc on 4 Aug 2020

Thanks. Reading your trace, I found a way to reproduce the bug. The upload must be relatively slow to exceed the server timeout. In this case, it is indeed pretty easy to reproduce the issue. I will push a fix. The problem is that the H1 connection timeout (in the H1 multiplexer) is not refreshed and may expire during the transfer.

capflam on 5 Aug 2020

👍1

The fix was pushed and backported as far as 2.0.

capflam on 5 Aug 2020

👍1

oh interesting, it looks like we are also affected by this issue, which could explain some users complains. Thanks for the report and complete commit message!