The bit of code below is slower with threads than without, when built using ORC.
The issue is also present when NOT using a {.threadvar.}, although it is less pronounced. In that case the handler body is simply:
resp htmlResponse("<h2>Welcome to Prologue</h2>")
Non-threadvar, single-thread: 28970 req/s
Non-threadvar, multi-thread: 16319 req/s (1.78x slower)
threadvar, single-thread: 22371 req/s
threadvar, multi-thread: 6406 req/s (3.49x slower)
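For clarity, the non-threadvar handler behind the first two rows presumably looked like this (a sketch based only on the resp line quoted above, mirroring the proc signature in the full example below):

proc index(ctx: Context) {.gcsafe, async.} =
  resp htmlResponse("<h2>Welcome to Prologue</h2>")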
import prologue
import nwt

let settings = newSettings(
  address="0.0.0.0",
  debug=false,
  port=Port(8000),
  staticDirs=["/static"],
)
var app = newApp(settings=settings)

## Set up threadvar for `templates` for per-thread initialization.
var templates {.threadvar.}: Nwt

proc initTemplates() =
  if templates.isNil:
    templates = newNwt("templates/*.html")

proc index(ctx: Context) {.gcsafe, async.} =
  initTemplates()
  resp templates.renderTemplate("index.html")

app.addRoute("/", index)
app.run()
Testing with wrk -t2 -c100 -d10s http://localhost:8000
nim c -r -d:danger -d:release --opt:speed --gc:orc main
Running 10s test @ http://localhost:8000
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.93ms 3.56ms 85.98ms 82.41%
Req/Sec 11.27k 0.93k 13.46k 76.50%
224424 requests in 10.03s, 104.45MB read
Requests/sec: 22371.13
Transfer/sec: 10.41MB
nim c -r -d:danger -d:release --opt:speed --gc:orc --threads:on main
Running 10s test @ http://localhost:8000
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 17.24ms 12.43ms 110.90ms 87.82%
Req/Sec 3.23k 413.09 4.39k 81.50%
64325 requests in 10.04s, 29.94MB read
Requests/sec: 6406.63
Transfer/sec: 2.98MB
Expected? Probably around twice the single-threaded throughput, as evidenced by `markAndSweep` throughput.
:shrug:
@disruptek made me do it. :smile:
This was started in another thread => https://github.com/planety/prologue/issues/89
$ nim --version
Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2020-10-19
Copyright (c) 2006-2020 by Andreas Rumpf
git hash: 833035bacd91c2cf16ea006b64bf996e87eac75d
active boot switches: -d:release
You need to use -d:useMalloc for multi-threaded performance until we have patched our default allocator.
Also, I cannot reproduce this as I don't know what `import prologue` and `import nwt` refer to. Btw, ORC focuses on memory consumption and latency...
Thank you for your reply.
-d:useMalloc crashes the program after several thousand requests with a double free.
To get prologue (the web framework) and nwt (Nim Web Templates):
$ nimble install prologue
$ nimble install nwt
Supporting files for the example:
templates/base.html:
<!DOCTYPE html>
<html>
  <head>
    <title>
      {% block title %}
      {% endblock %}
    </title>
  </head>
  <body>
    {% block body %}
    {% endblock %}
  </body>
</html>

templates/index.html:
{% extends "base.html" %}
{% block title %}
Index
{% endblock %}
{% block body %}
Some contents.
{% endblock %}
After working with it more, the example can be even further minimized.
import prologue

let settings = newSettings(
  address="0.0.0.0",
  debug=false,
  port=Port(8000),
  staticDirs=["/static"],
)
var app = newApp(settings=settings)

proc index(ctx: Context) {.async.} =
  resp htmlResponse "Hello world!"

app.addRoute("/", index)
app.run()
nim c -r -d:release -d:danger --opt:speed --gc:orc tprologue
$ wrk -t2 -c100 -d10s http://localhost:8000
Running 10s test @ http://localhost:8000
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.04ms 3.28ms 19.28ms 80.53%
Req/Sec 14.88k 1.05k 16.89k 72.50%
296088 requests in 10.03s, 41.51MB read
Requests/sec: 29509.59
Transfer/sec: 4.14MB
nim c -r -d:release -d:danger --opt:speed --gc:orc --threads:on tprologue
$ wrk -t2 -c100 -d10s http://localhost:8000
Running 10s test @ http://localhost:8000
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 9.46ms 12.16ms 76.16ms 87.07%
Req/Sec 8.64k 1.02k 10.64k 77.50%
171965 requests in 10.02s, 24.11MB read
Requests/sec: 17161.03
Transfer/sec: 2.41MB
Reconfirmed: Adding -d:useMalloc yields the following error after a little over 100 requests.
double free or corruption (fasttop)
SIGABRT: Abnormal termination.
~But adding --panics:on makes it keep running. Not sure if this is a good switch to use in this instance.~
EDIT: Sometimes --panics:on runs smoothly; other times it crashes just like -d:useMalloc without --panics:on.
nim c -r -d:release -d:danger --opt:speed --gc:orc --threads:on -d:useMalloc --panics:on tprologue
wrk -t2 -c100 -d10s http://localhost:8000
Running 10s test @ http://localhost:8000
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.14ms 7.95ms 81.73ms 85.91%
Req/Sec 22.83k 1.61k 27.46k 75.50%
454507 requests in 10.03s, 63.72MB read
Requests/sec: 45329.73
Transfer/sec: 6.35MB
A reduced example program doesn't contain `import prologue`, with its unknown complexities. I know this can be hard to obtain, but it would help us the most.
You can probably get the same from httpbeast; prologue uses a forked version of it AFAIK. The httpbeast repo has a benchmark file you can compile.
OK. Fair enough. :smile:
@dom96 thanks for the tip.
Using benchmark.nim as written at https://github.com/dom96/httpbeast/blob/master/tests/benchmark.nim
TL;DR: httpbeast doesn't exhibit the issue, although I'd have thought that 3 more cores would increase throughput by more than they did. That's probably due to the overhead of running wrk on the same machine, though.
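For context, the linked benchmark.nim is essentially the canonical httpbeast plaintext handler. The sketch below is from memory, so treat the linked file as authoritative:

import options, asyncdispatch
import httpbeast

proc onRequest(req: Request): Future[void] =
  # Serve the plaintext endpoint; everything else gets a 404.
  if req.httpMethod == some(HttpGet):
    case req.path.get()
    of "/plaintext":
      const data = "Hello, World!"
      req.send(data)
    else:
      req.send(Http404)

run(onRequest)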
$ nim --version
Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2020-10-22
Copyright (c) 2006-2020 by Andreas Rumpf
git hash: e09c259bf3dc8beb2c4d5abf9334c787d18f99f5
active boot switches: -d:release
$ gcc --version
gcc (GCC) 10.2.0
Single Threaded => nim c -r -d:release -d:danger --opt:speed --gc:orc -o:singleThreaded test
wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.39ms 1.19ms 37.37ms 96.07%
Req/Sec 37.11k 4.82k 45.57k 75.00%
739376 requests in 10.03s, 94.49MB read
Requests/sec: 73721.09
Transfer/sec: 9.42MB
Multi Threaded => nim c -r -d:release -d:danger --opt:speed --gc:orc --threads:on -o:multiThreaded test
wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.36ms 2.04ms 30.27ms 88.46%
Req/Sec 54.65k 5.36k 66.44k 74.00%
1088172 requests in 10.03s, 139.06MB read
Requests/sec: 108449.87
Transfer/sec: 13.86MB
I even checked --cc:clang and it gave similar results (single = 78k / multi = 107k).
$ clang --version
clang version 10.0.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ ld.lld --version
LLD 10.0.1 (compatible with GNU linkers)
Interesting, then I guess there must be something in prologue that's making this happen. Maybe you can try httpx (the httpbeast fork it uses) and/or Jester to see if they also exhibit this.
Same test code as https://github.com/nim-lang/Nim/issues/15661#issuecomment-714683388, simply with `import httpbeast` changed to `import httpx`. But here's the interesting thing: adding {.async.} to the proc causes a slowdown.
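The exact httpx code isn't shown in the thread; the sketch below assumes it is the benchmark handler above with `import httpx` swapped in and the {.async.} pragma added (httpx mirrors httpbeast's API):

import options, asyncdispatch
import httpx

# Identical to the httpbeast sketch above except for the import and the
# {.async.} pragma, which makes the compiler allocate a Future[void] per
# request.
proc onRequest(req: Request) {.async.} =
  if req.httpMethod == some(HttpGet):
    case req.path.get()
    of "/plaintext":
      const data = "Hello, World!"
      req.send(data)
    else:
      req.send(Http404)

run(onRequest)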
orc, no {.async.}:
$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.58ms 2.39ms 28.57ms 86.03%
Req/Sec 64.21k 4.36k 69.39k 93.50%
1277948 requests in 10.03s, 163.31MB read
Requests/sec: 127367.41
Transfer/sec: 16.28MB
orc, with {.async.} added to proc line (15% slower):
$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.23ms 1.77ms 23.71ms 89.49%
Req/Sec 54.75k 4.27k 61.33k 83.50%
1091057 requests in 10.03s, 139.43MB read
Requests/sec: 108726.59
Transfer/sec: 13.89MB
orc, no {.async.}:
$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.49ms 2.30ms 26.47ms 87.35%
Req/Sec 62.59k 4.22k 68.49k 86.50%
1245450 requests in 10.03s, 159.16MB read
Requests/sec: 124198.29
Transfer/sec: 15.87MB
orc, with {.async.} (15% slower):
$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.17ms 1.59ms 19.21ms 89.69%
Req/Sec 53.21k 4.37k 61.63k 77.50%
1059267 requests in 10.03s, 135.37MB read
Requests/sec: 105629.46
Transfer/sec: 13.50MB
The disparity for markAndSweep isn't as large, but it still takes a slight dip with {.async.} added. Just splitting hairs at these results, really.
httpbeast without {.async.} => 125k
httpbeast with {.async.} => 122k (2.4% slower)
httpx without {.async.} => 126k
httpx with {.async.} => 121k (4% slower)
@dom96, what do you think about my latest?
Huh? Why aren't you testing with --gc:orc? That's what this issue is about. The {.async.} slowdown isn't surprising: you're just allocating a future for each request when it's there, which unsurprisingly leads to a slowdown.
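As a minimal illustration of that point (a sketch, assuming an httpbeast-style server that accepts both handler forms): the plain proc can answer a request while leaving its Future[void] result nil, so nothing extra is allocated per request, whereas the {.async.} version materializes a Future object on every call.

import asyncdispatch, httpbeast

# Plain handler: `result` stays nil, so no Future is allocated per request.
proc onRequestPlain(req: Request): Future[void] =
  req.send("Hello, World!")

# Async handler: the async transform creates and completes a Future[void]
# on every invocation, which is the extra per-request cost.
proc onRequestAsync(req: Request) {.async.} =
  req.send("Hello, World!")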
I am testing with --gc:orc. I only posted the markAndSweep info for comparison.
I figured there would be a slight slowdown from the overhead of the async pragma, BUT not the 15% slowdown seen when adding async under ORC.
Okay, I think I get it; you weren't clear that your benchmarks were running with --gc:orc. So if I'm reading correctly, httpbeast without {.async.} with --gc:orc gets 127k req/s and httpbeast without {.async.} with --gc:markAndSweep gets 125k req/s, right?
My bad. Every benchmark is done with orc unless stated otherwise. I'll edit it to be more clear.
httpbeast, orc, no async = 127k
httpbeast, orc, async = 108k
Sorry, but I care about the crashes, not about these fictional benchmark results that don't measure memory consumption.