We deployed 2.2.1 last week but had to revert to 2.0.16 because large file uploads broke. I've created a complete Docker Compose test app here. It includes a simple Node app and a stripped-down haproxy config. I can reproduce this issue 100% of the time using this setup.
Our production setup is Apache / PHP.
haproxy -vv and uname -a
This happens with any version of 2.2, including the official Docker images (using haproxy:2.2-alpine in my test app).
A stripped down config can be found here.
# From the app. It looks like the file upload completes successfully as far as the application is concerned. The md5 hashes are always the same.
web_1 | upload finished, post data: {
web_1 | testUpload: {
web_1 | name: '951172_Show-1-Bad-Man-Incorporate.mp3',
web_1 | data: <Buffer >,
web_1 | size: 127412766,
web_1 | encoding: '7bit',
web_1 | tempFilePath: '/tmp/tmp-6-1596328303983',
web_1 | truncated: false,
web_1 | mimetype: 'audio/mpeg',
web_1 | md5: '37d6046ac89f95899c4fd60e13d057fa',
web_1 | mv: [Function: mv]
web_1 | }
web_1 | }
# Yet the request fails with termination flags SD:
proxy_1 | 192.168.111.48:63220 [02/Aug/2020:00:31:43.952] fe-main~ be-upload-test/web 29/0/0/-1/56534 -1 0 - - SD-- 1/1/0/0/0 0/0 {dev01:8100||https://dev01:8100/} "POST https://dev01:8100/upload HTTP/2.0"
# The app thinks the upload succeeded.
web_1 | ::ffff:192.168.48.3 - - [02/Aug/2020:00:32:40 +0000] "POST /upload HTTP/1.1" 302 46 "https://dev01:8100/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
The browser doesn't get notified that the upload succeeded. It will keep trying the upload several times until it errors out completely.
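For anyone reading the proxy log line above, here is a rough sketch of how the timers and termination flags can be pulled apart. It assumes the frontend uses HAProxy's default HTTP log format, where the five timers are TR/Tw/Tc/Tr/Ta; those field names come from the HAProxy documentation, not from this report, and `parse_haproxy_line` is a made-up helper name.

```python
import re

# The proxy log line from this report, minus the "proxy_1 | " compose prefix.
LOG_LINE = (
    "192.168.111.48:63220 [02/Aug/2020:00:31:43.952] fe-main~ "
    "be-upload-test/web 29/0/0/-1/56534 -1 0 - - SD-- 1/1/0/0/0 0/0 "
    '{dev01:8100||https://dev01:8100/} "POST https://dev01:8100/upload HTTP/2.0"'
)

# Assumption: default HTTP log format, i.e. timers, status, bytes,
# two captured-cookie fields, then the four termination-state characters.
PATTERN = re.compile(
    r"(?P<timers>(?:-?\d+/){4}[+-]?\d+) "
    r"(?P<status>-?\d+) "
    r"(?P<bytes>\+?\d+) "
    r"\S+ \S+ "
    r"(?P<flags>\S{4}) "
)


def parse_haproxy_line(line):
    """Extract timers (ms), status, byte count and termination flags."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like the default HTTP log format")
    names = ("TR", "Tw", "Tc", "Tr", "Ta")
    fields = dict(zip(names, (int(v) for v in match["timers"].split("/"))))
    fields["status"] = int(match["status"])
    fields["bytes"] = int(match["bytes"])
    fields["flags"] = match["flags"]
    return fields


result = parse_haproxy_line(LOG_LINE)
print(result)
```

In this line Tr is -1 (no complete response was received from the server) while Ta is 56534 ms, and the first two flag characters, S and D, indicate a server-side abort during the data phase.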
The file upload fails.
The file upload should succeed.
Upgrading to 2.2.1. This appears to only be present in 2.2.x (I've tested 2.2.1 and 2.2.2). I can't recreate this issue in 2.1.x or 2.0.x.
~It looks like adding tune.h2.initial-window-size 1048576 to the global section is a workaround. I've tested it with 800MB+ files and I can't get the issue to occur when using this setting (we really needed this anyway!). I haven't tested this in production yet but may do so soon.~
I tested 2.2.2 configured with tune.h2.initial-window-size 1048576 on a fairly loaded production server. The issue is still present even with this tuned setting. Looks like the only solution is to downgrade.
I'm unable to reproduce the issue with a similar config and Apache as the server (but without PHP). I tested with a 1 GB file. The upload itself seems to be OK in your test, but there is a problem with the server response: the status code is reported as -1, with SD as termination flags. If it is 100% reproducible in a test environment, it would be a good idea to enable H1 traces. To do so, you need to configure a stats socket in your global section (over TCP, because you are in a container).
To enable H1 traces and wait for events:
socat tcp:A.B.C.D:{PORT} readline
trace h1 sink buf0; trace h1 level developer; trace h1 verbosity complete; trace h1 start now; show events buf0 -w
With the traces, it will be easier to understand what happens.
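If socat is not available inside the container, the same commands can probably be sent with a few lines of Python. This is only a sketch under assumptions: it presumes the stats socket is bound to TCP with sufficient privileges (for example something like `stats socket ipv4@0.0.0.0:9999 level admin`, where the port is arbitrary), and `build_cli_line`, `stream_trace` and the output file name are hypothetical names chosen for illustration.

```python
import socket

# The CLI commands quoted above, in the same order.
TRACE_COMMANDS = [
    "trace h1 sink buf0",
    "trace h1 level developer",
    "trace h1 verbosity complete",
    "trace h1 start now",
    "show events buf0 -w",
]


def build_cli_line(commands):
    """Join the commands with ';' so they are sent as a single CLI line."""
    return "; ".join(commands) + "\n"


def stream_trace(host, port, out_path):
    """Send the trace commands and append whatever the socket returns to a file.

    'show events buf0 -w' keeps waiting for new events, so this loop runs
    until the connection closes or the script is interrupted (Ctrl-C).
    """
    with socket.create_connection((host, port)) as sock, open(out_path, "wb") as out:
        sock.sendall(build_cli_line(TRACE_COMMANDS).encode("ascii"))
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            out.write(chunk)
            out.flush()


# Example call (host and port are placeholders, as in the socat command):
# stream_trace("A.B.C.D", 9999, "h1-trace.log")
```

Start the script, run the upload until it fails, then interrupt the script and attach the resulting file.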
That's very odd considering I can reproduce the issue nearly 100% of the time. Are you doing this on a local network connection? That might be related (specifically latency). If so, you could try throttling your network connection (e.g. in Chrome devtools, network tab).
I'm testing this remotely from two different systems that are each about 35 ms of ping time away. I'm able to reproduce this with the full Apache/PHP site and with my minimal Node app. The staging site is reached over a public IP and my dev setup over VPN.
Anyway, attaching the trace. I started the trace then uploaded my usual 100MBish test file. I let it upload as usual until the end when it gets reset and the browser starts the upload over. I stopped the upload and the trace then. The most relevant bits are probably at the end of the file.
Thanks. Reading your trace, I found a way to reproduce the bug. The upload must be relatively slow to exceed the server timeout. In this case, it is indeed pretty easy to reproduce the issue. I will push a fix. The problem is that the H1 connection timeout (in the H1 multiplexer) is not refreshed and may expire during the transfer.
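To make the failure mode concrete, here is a toy model of a timer that is armed once versus one that is refreshed on every forwarded chunk. It is not HAProxy's code: `transfer_survives`, the one-chunk-per-second pacing and the 30-second timeout are all assumptions picked for illustration (the real `timeout server` value is not given in this thread).

```python
def transfer_survives(chunk_times_ms, timeout_ms, refresh_on_activity):
    """Return True if every chunk is forwarded before the timer expires.

    chunk_times_ms: timestamps (ms since the start of the transfer) at which
    a chunk of the upload is forwarded to the server.
    """
    expires_at = timeout_ms  # timer armed once when the transfer starts
    for now in chunk_times_ms:
        if now >= expires_at:
            return False  # timer fired in the middle of the transfer
        if refresh_on_activity:
            expires_at = now + timeout_ms  # activity pushes the deadline back
    return True


# A slow upload: one chunk per second for 56 seconds (roughly the duration
# seen in the log), with an assumed 30 s timeout.
chunks = list(range(1000, 57000, 1000))

print(transfer_survives(chunks, 30000, refresh_on_activity=False))  # False
print(transfer_survives(chunks, 30000, refresh_on_activity=True))   # True
```

A fast upload finishes before the deadline either way, which would be consistent with the issue not showing up on a local network connection.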
The fix was pushed and backported as far as 2.0.
Oh, interesting. It looks like we are also affected by this issue, which could explain some user complaints. Thanks for the report and the complete commit message!