Nim: --gc:orc; multi-threaded async slower than single-threaded.

Created on 20 Oct 2020 · 15 comments · Source: nim-lang/Nim

The bit of code below is slower with threads than without, when built using orc.

The issue is also present when NOT using a {.threadvar.}, although it is less pronounced. In the non-threadvar case the handler simply does resp htmlResponse("<h2>Welcome to Prologue</h2>"):
Non-threadvar single-thread: 28970 req/s
Non-threadvar multi-thread: 16319 req/s (1.7x slower)

threadvar single-thread: 22371 req/s
threadvar multi-thread: 6406 req/s (3.49x slower)
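
For reference, a minimal sketch of the non-threadvar handler (the same shape as the minimized example further down in the thread):

proc index(ctx: Context) {.async.} =
  # No per-thread template state; just return a static response.
  resp htmlResponse("<h2>Welcome to Prologue</h2>")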

Example

import prologue
import nwt

let settings = newSettings(
  address="0.0.0.0",
  debug=false,
  port=Port(8000),
  staticDirs=["/static"],
)

var app = newApp(settings=settings)

## Set up a threadvar for `templates` so it is initialized once per thread.
var templates {.threadvar.}: Nwt

proc initTemplates() =
  if templates.isNil:
    templates = newNwt("templates/*.html")

proc index(ctx: Context) {.gcsafe, async.} =
  initTemplates()
  resp templates.renderTemplate("index.html")

app.addRoute("/", index)

app.run()

Current Output

Testing with wrk -t2 -c100 -d10s http://localhost:8000

Using --gc:orc (multi-threaded has lower throughput)

nim c -r -d:danger -d:release --opt:speed --gc:orc main

Running 10s test @ http://localhost:8000
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.93ms    3.56ms  85.98ms   82.41%
    Req/Sec    11.27k     0.93k   13.46k    76.50%
  224424 requests in 10.03s, 104.45MB read
Requests/sec:  22371.13
Transfer/sec:     10.41MB

nim c -r -d:danger -d:release --opt:speed --gc:orc --threads:on main

Running 10s test @ http://localhost:8000
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    17.24ms   12.43ms 110.90ms   87.82%
    Req/Sec     3.23k   413.09     4.39k    81.50%
  64325 requests in 10.04s, 29.94MB read
Requests/sec:   6406.63
Transfer/sec:      2.98MB

Expected Output

Unclear, but probably around twice the single-threaded throughput, going by the `markAndSweep` results.

Possible Solution

:shrug:

Additional Information

@disruptek made me do it. :smile:
This was started in another thread => https://github.com/planety/prologue/issues/89

$ nim --version
Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2020-10-19
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: 833035bacd91c2cf16ea006b64bf996e87eac75d
active boot switches: -d:release

All 15 comments

You need to use -d:useMalloc for multi-threaded performance until we have patched our default allocator.

Also, I cannot reproduce this as I don't know what

import prologue
import nwt

means. Btw ORC focuses on memory consumption and latency...

Thank you for your reply.

-d:useMalloc crashes the program after several thousand requests with a double free.

To get prologue (the web framework) and nwt (Nim Web Templates):

$ nimble install prologue
$ nimble install nwt

Supporting files for example:

/templates/base.html

<!DOCTYPE html>
<html>
  <head>
    <title>
      {% block title %}
      {% endblock %}
    </title>
  </head>
  <body>
    {% block body %}
    {% endblock %}
  </body>
</html>

/templates/index.html

{% extends "base.html" %}

{% block title %}
Index
{% endblock %}

{% block body %}
Some contents.
{% endblock %}

After working with it more, the example can be minimized even further.

import prologue

let settings = newSettings(
  address="0.0.0.0",
  debug=false,
  port=Port(8000),
  staticDirs=["/static"],
)

var app = newApp(settings=settings)

proc index(ctx: Context) {.async.} =
  resp htmlResponse "Hello world!"

app.addRoute("/", index)

app.run()

nim c -r -d:release -d:danger --opt:speed --gc:orc tprologue

$ wrk -t2 -c100 -d10s http://localhost:8000
Running 10s test @ http://localhost:8000
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.04ms    3.28ms  19.28ms   80.53%
    Req/Sec    14.88k     1.05k   16.89k    72.50%
  296088 requests in 10.03s, 41.51MB read
Requests/sec:  29509.59
Transfer/sec:      4.14MB

nim c -r -d:release -d:danger --opt:speed --gc:orc --threads:on tprologue

$ wrk -t2 -c100 -d10s http://localhost:8000                                
Running 10s test @ http://localhost:8000
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.46ms   12.16ms  76.16ms   87.07%
    Req/Sec     8.64k     1.02k   10.64k    77.50%
  171965 requests in 10.02s, 24.11MB read
Requests/sec:  17161.03
Transfer/sec:      2.41MB

Reconfirmed: Adding -d:useMalloc yields the following error after a little over 100 requests.

double free or corruption (fasttop)
SIGABRT: Abnormal termination.

~But adding --panics:on keeps it running. Not sure if this is a good switch to use in this instance.~
EDIT: Sometimes --panics:on runs smoothly; other times it crashes just like -d:useMalloc without --panics:on.

nim c -r -d:release -d:danger --opt:speed --gc:orc --threads:on -d:useMalloc --panics:on tprologue

wrk -t2 -c100 -d10s http://localhost:8000                                                         
Running 10s test @ http://localhost:8000
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.14ms    7.95ms  81.73ms   85.91%
    Req/Sec    22.83k     1.61k   27.46k    75.50%
  454507 requests in 10.03s, 63.72MB read
Requests/sec:  45329.73
Transfer/sec:      6.35MB

A reduced example program doesn't contain import prologue with its unknown complexities. I know this can be hard to obtain, but it would help us the most.

You can probably get the same from httpbeast; prologue uses a forked version of it, AFAIK. The httpbeast repo has a benchmark file you can compile.

OK. Fair enough. :smile:

@dom96 thanks for the tip.

Using benchmark.nim as written at https://github.com/dom96/httpbeast/blob/master/tests/benchmark.nim
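
For anyone reading along, that file is roughly the following shape (paraphrased from memory of httpbeast's tests/benchmark.nim; check the repo for the exact code):

import options, asyncdispatch, httpcore
import httpbeast

proc onRequest(req: Request): Future[void] =
  # Plaintext endpoint only; the real benchmark file may cover more routes.
  if req.httpMethod == some(HttpGet):
    case req.path.get()
    of "/plaintext":
      req.send("Hello, World!")
    else:
      req.send(Http404)

run(onRequest)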

TL;DR: httpbeast doesn't exhibit the issue, although I'd have thought that 3 more cores would increase throughput by more than they did. That's probably due to the overhead of running wrk on the same machine, though.

$ nim --version
Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2020-10-22
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: e09c259bf3dc8beb2c4d5abf9334c787d18f99f5
active boot switches: -d:release
$ gcc --version
gcc (GCC) 10.2.0

Single Threaded => nim c -r -d:release -d:danger --opt:speed --gc:orc -o:singleThreaded test

wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.39ms    1.19ms  37.37ms   96.07%
    Req/Sec    37.11k     4.82k   45.57k    75.00%
  739376 requests in 10.03s, 94.49MB read
Requests/sec:  73721.09
Transfer/sec:      9.42MB

Multi Threaded => nim c -r -d:release -d:danger --opt:speed --gc:orc --threads:on -o:multiThreaded test

wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.36ms    2.04ms  30.27ms   88.46%
    Req/Sec    54.65k     5.36k   66.44k    74.00%
  1088172 requests in 10.03s, 139.06MB read
Requests/sec: 108449.87
Transfer/sec:     13.86MB

I even checked --cc:clang and it gave similar results (single = 78k / multi = 107k).

$ clang --version
clang version 10.0.1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ ld.lld --version
LLD 10.0.1 (compatible with GNU linkers)

Interesting, then I guess there must be something in prologue that's making this happen. Maybe you can try httpx (the httpbeast fork it uses) and/or Jester to see if they also exhibit this.
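
A minimal Jester version of the same hello-world handler, in case someone wants to run that comparison (a sketch, not code from this thread):

import jester

routes:
  get "/":
    resp "Hello world!"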

Same test code as https://github.com/nim-lang/Nim/issues/15661#issuecomment-714683388

I simply changed import httpbeast to import httpx, but here's the interesting thing: adding {.async.} to the proc causes a slowdown.
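
Concretely, the two handler shapes being compared differ only in the pragma on the handler proc (a paraphrase of the benchmark handler, not verbatim code):

# Without {.async.}: the proc is declared to return Future[void] but never
# awaits, so no Future needs to be allocated per request.
proc onRequest(req: Request): Future[void] =
  req.send("Hello, World!")

# With {.async.}: the async macro wraps the body, so every request allocates
# a Future that the GC then has to manage.
proc onRequest(req: Request) {.async.} =
  req.send("Hello, World!")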

httpbeast, orc, no {.async.}

wrk -t2 -c100 -d10s http://localhost:8080/plaintext                                     
Running 10s test @ http://localhost:8080/plaintext
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.58ms    2.39ms  28.57ms   86.03%
    Req/Sec    64.21k     4.36k   69.39k    93.50%
  1277948 requests in 10.03s, 163.31MB read
Requests/sec: 127367.41
Transfer/sec:     16.28MB

httpbeast, orc, with {.async.} added to the proc line (15% slower)

$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.23ms    1.77ms  23.71ms   89.49%
    Req/Sec    54.75k     4.27k   61.33k    83.50%
  1091057 requests in 10.03s, 139.43MB read
Requests/sec: 108726.59
Transfer/sec:     13.89MB

httpx, orc, no {.async.}

$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.49ms    2.30ms  26.47ms   87.35%
    Req/Sec    62.59k     4.22k   68.49k    86.50%
  1245450 requests in 10.03s, 159.16MB read
Requests/sec: 124198.29
Transfer/sec:     15.87MB

httpx, orc, with {.async.} (15% slower)

$ wrk -t2 -c100 -d10s http://localhost:8080/plaintext
Running 10s test @ http://localhost:8080/plaintext
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.17ms    1.59ms  19.21ms   89.69%
    Req/Sec    53.21k     4.37k   61.63k    77.50%
  1059267 requests in 10.03s, 135.37MB read
Requests/sec: 105629.46
Transfer/sec:     13.50MB

For comparison, the disparity with markAndSweep isn't as large, but it still takes a slight dip with {.async.} added. Just splitting hairs at these results, really.
httpbeast without {.async.} => 125k
httpbeast with {.async.} => 122k (2.4% slower)

httpx without {.async.} => 126k
httpx with {.async.} => 121k (4% slower)

@dom96, what do you think about my latest?

Huh? Why aren't you testing with --gc:orc? That's what this issue is about. The {.async.} slowdown isn't surprising: when it's there you're allocating a Future for each request, which unsurprisingly leads to a slowdown.
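
To illustrate that Future allocation: after macro expansion the {.async.} version behaves roughly like the simplified sketch below (not the actual expansion), so each request pays for an extra allocation that the plain version never makes:

proc onRequest(req: Request): Future[void] =
  var retFuture = newFuture[void]("onRequest")  # allocated on every request
  req.send("Hello, World!")
  retFuture.complete()
  return retFuture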

I am testing with --gc:orc. I only posted the markAndSweep numbers for comparison.

I figured there would be a slight slowdown from the overhead of the async pragma, but not the 15% slowdown seen when adding async with orc.

Okay, I think I get it; you weren't clear that your benchmarks were running with --gc:orc. So if I'm reading correctly, httpbeast without {.async.} with --gc:orc gets 127k req/s and httpbeast without {.async.} with --gc:markAndSweep gets 125k req/s, right?

My bad. Every benchmark is done with orc unless stated otherwise. I'll edit it to be more clear.

httpbeast, orc, no async = 127k
httpbeast, orc, async = 108k

Sorry, but I care about the crashes, not about these fictional benchmark results that don't measure memory consumption.
