Rocket: Clarify performance claims in README

Created on 18 Jan 2017 · 6 Comments · Source: SergioBenitez/Rocket

Hello! I've been working with @seanmonstar to help polish off the Tokio branch of hyper (now on master). One of the worries he mentioned was performance, which we in Tokio (and @seanmonstar in Hyper) of course take quite seriously! The README prominently claims that Rocket beats the pants off async Hyper, a result I've attempted to reproduce locally so I can debug any performance problems that might arise.

Locally here's what I've set up.

For Hyper I'm on hyperium/hyper@eb64fec24c545bec16a43364ebfce4c631f58933 (Cargo.lock), the tip of https://github.com/hyperium/hyper/pull/1013 (that PR has nothing to do with perf though, just what I had). The program is a simple hello world, modified slightly from examples/hello.rs in Hyper itself.
For Rocket I'm on b164da1a01b323035bc88e02bfd464ff4b6b11e6 (Cargo.lock) and am using a slightly tweaked version of the hello world example.

I ran the Hyper example as:

$ rustc -V
rustc 1.14.0 (e8a012324 2016-12-16)
$ cargo run --release --example hello

and I ran the Rocket example as:

$ rustc +nightly -V
rustc 1.16.0-nightly (4ce7accaa 2017-01-17)
$ cargo +nightly run --release

I tested both servers with the wrk command as below. Note that -t and -c are quite low but this should be relatively representative.

wrk -t 10 -d 10s -c 20 --latency http://127.0.0.1:3000/

The numbers that I get for Hyper are:

Running 10s test @ http://127.0.0.1:3000/
  10 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   228.86us  124.13us   1.27ms   57.90%
    Req/Sec     8.76k     0.93k    9.53k    80.00%
  Latency Distribution
     50%  230.00us
     75%  339.00us
     90%  400.00us
     99%  439.00us
  880581 requests in 10.10s, 121.77MB read
Requests/sec:  87186.94
Transfer/sec:     12.06MB

The numbers I get for Rocket are:

Running 10s test @ http://127.0.0.1:3001/
  10 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.52us    3.68us   1.30ms   95.08%
    Req/Sec    66.79k     6.79k   70.33k    93.07%
  Latency Distribution
     50%   14.00us
     75%   14.00us
     90%   15.00us
     99%   23.00us
  671464 requests in 10.10s, 93.49MB read
Requests/sec:  66502.15
Transfer/sec:      9.26MB

These numbers are basically what I expected, with both being roughly on par but Hyper being faster (what with being async and not being a framework and such). The latency numbers here are unusual, with Hyper reporting a 99th-percentile latency of around 439us and Rocket reporting 23us. My assumption is that this is a bug in wrk. With only one worker, although wrk opens 20 connections to Rocket, only one of them makes any progress and all the others are blocked (i.e. nothing happens on them). The wrk program probably isn't prepared to handle this and prints odd results.

If I pass -t 1 -c 1 to wrk (only one thread with one connection) then Rocket still comes in at 23us and Hyper comes in at 31us, assuaging any fear that Rocket has ~20x lower latency than Hyper.

So overall my conclusion is that I am unable to reproduce the claims in the README about performance. Maybe Hyper has changed in the meantime? (I'm not sure). Can you provide more details about the exact setup locally, commands run, etc? I'm not too worried about hardware, just the relative claims of libraries on the same hardware.

A broader conclusion IMO is that it's somewhat unproductive to claim "Rocket currently performs significantly better than the latest version of asynchronous Hyper". We're all working towards the same goal of a fast ergonomic web server in Rust, and prominently claiming that a library is slow (especially when you're built on it) seems like an inferior alternative to opening a bug report and working through the underlying issues (if any). I know we in Tokio and @seanmonstar are always eager to squelch any performance bugs!

question request

alexcrichton

❤7

Most helpful comment

Hi there! Just sliding through to clear up any small confusion that might have occurred.

The Hyper you benchmarked against, 0.11.0-a.0, on the other hand, makes this possibility entirely unclear. None of the examples use more than one thread, and none of the core code alludes to how to do this easily.

The 2 examples in the repo were simplified, but it is entirely possible to run a Tokio event loop (and thus a hyper server) on any number of threads.

you've artificially restricted Rocket to a single thread in your benchmarks. I'm baffled that you would make this change and claim that the example was only slightly tweaked

As you also noticed, the "Hello World" example in the hyper repo only runs in 1 thread. I imagine the modification to the example was just to make them both use the same number of cores. It's not really artificially handicapping Rocket on purpose, just comparing both examples running in 1 thread, and letting wrk run on the other CPUs.

As shown in the README, I used -t 1 -c 18

I think this may have been part of the desire to try a bunch of parameters. 1 thread and 18 connections is quite small.

The issues in Hyper are cross-cutting; unless a deep analysis is performed, only a blanket "improve performance" issue would suffice.

I completely agree! I hate the blocking-IO branch at this point.

I've opened two pull requests against Hyper, and both have been largely ignored for weeks or months with little or no feedback from the maintainer.

The first one, I did take a while to get around to merging. I didn't understand the goal of it, as Rocket didn't publicly exist (so I couldn't see what performance improvement you were trying to make), and I've been focusing all my free time on working on non-blocking IO and improvements.

The second was about upgrading OpenSSL. It was not the first issue or PR hyper had received asking for that, and the earlier ones already contained a lot of discussion. I was busy balancing things and trying to figure out how to remove hyper from the openssl version conflict problem entirely, so I didn't repeat myself in that specific PR. It seems that, in the end, that part all worked out.

If you have other thoughts, let's discuss them in an issue! Or multiple! There exist a few more general issues for performance, such as using a memory pool, writev, fewer copies, disabling Nagle, and others.

I don't believe there is any ill intent here on any "side". We all want to build web things using Rust. We all want others to be able to as well. We want to be fast, we want to be good. We also want to be the best programming community out there. So, let's do that!

seanmonstar on 18 Jan 2017

👍6

All 6 comments

Hey, thanks for opening up this issue to discuss this!

There are numerous issues with the way you've benchmarked the two libraries and with the assumptions you've made. I address each of them below.

First, you've benchmarked against an entirely different Hyper than I have. As stated in the README, I benchmarked against Hyper v0.10.0-a.0 (1/12/2016). The version in the Cargo.toml file at the commit you've indicated you benchmarked against (https://github.com/hyperium/hyper/commit/eb64fec24c545bec16a43364ebfce4c631f58933) is 0.11.0-a.0. As I'm sure you're aware from your contributions, these two versions are entirely different. The former, 0.10.0-a.0, is based on rotor, while the latter, 0.11.0-a.0, is based on tokio and futures. The rotor-based version was tracked via the master branch until recently, evidently, when the tokio branch, which no longer exists, was made master.

You might wonder why I didn't benchmark against the tokio branch instead. First, it's only fair that I benchmark both libraries against master. Second, and most importantly, the tokio branch was considerably slower than the master branch in my benchmarks. And there's a good reason for this: the tokio branch uses a single thread to handle incoming connections.
This brings me to my second point: Rocket uses n threads to handle connections by default, where n is the number of logical cores. The version of Hyper I benchmarked against, 0.10.0-a.0, is able to do the same, and I benchmarked it as such. The Hyper you benchmarked against, 0.11.0-a.0, on the other hand, makes this possibility entirely unclear. None of the examples use more than one thread, and none of the core code alludes to how to do this easily. It wouldn't be fair to benchmark one library that uses n threads against another that uses only one.
And this is my third point: you've artificially restricted Rocket to a single thread in your benchmarks; you've decreased the number of handling threads n-fold. As you know, the number of threads listening and responding to incoming connections is _incredibly important_ to performance.
This brings me to my fourth and fifth points: you use different versions of rustc in your benchmarks, and the response sent by the Hyper program in your benchmark is _shorter_. The first is obvious, and so I won't discuss it further. But the latter is subtle. The issue is that the Server header in Hyper's response is Server: hyper, while the Server header in Rocket's response is Server: rocket. rocket is one character longer than hyper. This is largely insignificant as one character is only ~0.5% of the total response, but it's important to measure the exact same thing. In my benchmarks, I change Hyper's Server value to hyper* to account for the difference.
Sixth, you used different wrk parameters. As shown in the README, I used -t 1 -c 18. Though you ran the benchmark with a variety of parameters, you did not use identical parameters to those in the README.
As a side note, you say "if I pass -t 1 -c 1 to wrk (only one thread with one connection) then Rocket still comes in at 23us and Hyper comes in at 31us," and that this shows Hyper is not too off-par with Rocket. This is still 35% worse, unfortunately, even though this is the optimal case for Hyper-tokio, or any poll-based server. This is optimal because under these conditions, Hyper has no connections to handle, and no connections to multiplex: when Hyper polls, it is only waiting on the singular "accept connection" event.

I hope that I've been clear in my points above. Please let me know if making these changes gives you results that are on par with mine!

On your broader conclusions: Rocket _does_ perform significantly better than every version of Hyper I've benchmarked against, including the tokio branch. I don't understand why you would label sharing these benchmarks as unproductive: what is unproductive about sharing benchmark results? And, what is _especially_ pertinent or unproductive about benchmarking Rocket against a library Rocket itself depends on? Rocket eschews Hyper as soon as the connection arrives; Hyper's role in Rocket is to listen for and accept connections, parse the incoming HTTP message, transfer decode, and do the inverse on the responding side. Rocket uses Hyper in this way for many reasons, including an API that requires a significant number of allocations when none are necessary.

You state that a better alternative to sharing my benchmark results is to open a bug report and work through the underlying issues. First, it's not clear to me what such a bug report would look like. The issues in Hyper are cross-cutting; unless a deep analysis is performed, only a blanket "improve performance" issue would suffice. Second, I've opened two pull requests against Hyper, and both were largely ignored for weeks or months with little or no feedback from the maintainer. These experiences do not make "working through the underlying issues" seem particularly inviting.

I'm _incredibly_ excited about the present and impending work on tokio and futures; I believe Rocket will benefit from this work greatly. I also look forward to any performance improvements that Hyper might benefit from as well as any knowledge Rocket can glean from Hyper's experiences with asynchronous I/O in Rust. I have previously conversed at length with @seanmonstar and others about the state of Rust and the web, and I'm looking forward to continuing these conversations and broadening their scope as the Rust web and async I/O ecosystem matures.

Edit: Made my points _bullety_ and more concise.

SergioBenitez on 18 Jan 2017

👍5

@seanmonstar The point is that @alexcrichton's main concern was that he could not reproduce my benchmarks, but the reason he couldn't do so is because he benchmarked an entirely different thing in an entirely different way. It doesn't matter whether Hyper-tokio can or can't use multiple threads, it matters that both libraries are run in their optimal configuration, or at the very least, their default configuration. It _is_ _relatively_ artificially handicapping Rocket because he is comparing against a Rocket that was _not_ "handicapped" in this way.

While 1 thread and 18 connections may seem like small numbers, and they are, you can't do too much better when running the benchmarks on a single machine. I chose these numbers because my machine has 12 logical cores, all of which Rocket/Hyper-rotor will try to use, and so it doesn't make sense to have the benchmarking tool try to use some of those threads. I chose 18 connections because that's what gave me the best numbers across the board.
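
To make the threading model under discussion concrete, here is a minimal sketch of a blocking server that runs one accept loop per logical core, using only the Rust standard library. It is not Rocket's or Hyper's actual implementation: `handle` and `spawn_workers` are hypothetical names, and the fixed response and `Connection: close` behaviour are simplifying assumptions.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Hypothetical handler: read (part of) one request and send a fixed reply.
fn handle(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    // A real server would parse the request; this sketch only drains it.
    let _ = stream.read(&mut buf)?;
    let body = "Hello, world!";
    let response = format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    );
    stream.write_all(response.as_bytes())
}

/// Spawn `workers` threads that each block in `accept` on the same listener.
fn spawn_workers(listener: &TcpListener, workers: usize) -> std::io::Result<()> {
    for _ in 0..workers {
        let listener = listener.try_clone()?;
        thread::spawn(move || {
            for conn in listener.incoming() {
                // While a worker is inside `handle`, it accepts nothing else.
                if let Ok(stream) = conn {
                    let _ = handle(stream);
                }
            }
        });
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // n = number of logical cores, falling back to 1 if it cannot be queried.
    let n = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let listener = TcpListener::bind("127.0.0.1:0")?; // port 0: any free port
    let addr = listener.local_addr()?;
    spawn_workers(&listener, n)?;

    // Issue a single request against the sketch server.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")?;
    let mut reply = String::new();
    client.read_to_string(&mut reply)?;
    println!("{n} workers on {addr}; status line: {:?}", reply.lines().next());
    Ok(())
}
```

In a design like this, each worker serves one connection at a time. If the connections were kept alive rather than closed, at most n of them could make progress at once, which is one way to read the idle-connection behaviour described elsewhere in this thread.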

I don't think there's any ill intent here either. I'm all for being as _fast_ as possible and for working together to get there. I absolutely adore the Rust community. This is one of the main reasons why I've chosen to stick around since first learning about Rust 0.4.

P.S: I'd love to update the benchmarks with numbers from Hyper-tokio. Can you post code to run the Hello, world! example with n threads?

SergioBenitez on 18 Jan 2017

@SergioBenitez er I'm sorry if you're taking this the wrong way! I opened this issue to clarify why the numbers I was seeing were different, not to attack Rocket in any way. My main question was:

Can you provide more details about the exact setup locally, commands run, etc?

Which I think still hasn't been answered? Could you detail information such as:

What revisions were the repositories at?
What lock files were in play for versions of dependencies?
What rustc versions were in use?
Precisely what code was run?
What command was used to benchmark?

This would all be quite helpful in reproducing numbers locally! Consistent environments just mean that we can have good comparisons of numbers and help debug issues more quickly. I may have questions as to why something was benchmarked in a particular fashion or why perhaps one configuration was left out. I can't really have a question, though, if I don't have enough information to formulate it.

To answer some of your points, though:

First, you've benchmarked against an entirely different Hyper than I have

Indeed! Again sorry if you saw this as an attack or aggressive, that's not at all what I intended! I originally wanted to submit an update to the README through a PR but I found myself unable to do so because I couldn't reproduce the results, nor was I confident that I even reproduced the environment.

This brings me to my second point: Rocket uses n threads to handle connections by default, where n is the number of logical cores.

Indeed! I personally considered it unfair to benchmark a multithreaded implementation with a non-multithreaded implementation. Seems opinions on this differ though :)

And this is my third point: you've artificially restricted Rocket to a single thread in your benchmarks.

Yep! That's all got to do with the previous point though, I felt I was leveling the playing field, but sounds like we differ in opinion!

you use different versions of rustc in your benchmarks

Apologies for the inconsistency, I just wanted to be thorough in describing how I benchmarked. Changing the rustc version makes no difference to the benchmark results, however.

and the response sent by the Hyper program in your benchmark is shorter

Er, well ok. Changing them to be the same makes no difference though. I feel like we can all agree though that if the length of your "Server" header drastically affects your benchmark then you probably need a new benchmark?
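
For a rough sense of scale on the header-length point, the average response size can be estimated from the wrk totals in the opening post. The sketch below assumes wrk's "MB" means 1024 × 1024 bytes; `bytes_per_response` is a hypothetical helper name.

```rust
/// Average bytes per response from wrk's "MB read" total and request count.
/// Assumption: wrk reports binary megabytes (1024 * 1024 bytes).
fn bytes_per_response(mb_read: f64, requests: u64) -> f64 {
    mb_read * 1024.0 * 1024.0 / requests as f64
}

fn main() {
    // Totals taken from the two wrk runs in the opening post.
    let hyper = bytes_per_response(121.77, 880_581);
    let rocket = bytes_per_response(93.49, 671_464);
    println!("hyper:  {hyper:.1} bytes/response");
    println!("rocket: {rocket:.1} bytes/response");
    // Share of a Rocket response taken up by a single extra byte.
    println!("one byte is {:.2}% of a rocket response", 100.0 / rocket);
}
```

Under that assumption the two runs work out to roughly 145 and 146 bytes per response, consistent with a one-character difference in the Server header. A single byte is then well under 1% of the response.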

you used different wrk parameters

Ok! I didn't know what you were using though? (I'm not super familiar with wrk, so being exhaustive would be nice)

On your broader conclusions: Rocket does perform significantly better than every version of Hyper I've benchmarked against, including the tokio branch. I don't understand why you would label sharing these benchmarks as unproductive: what is unproductive about sharing benchmark results?

I personally felt that the benchmark numbers were hostile towards the rest of the Rust ecosystem, e.g. Iron and Hyper. Hyper specifically was called out as being "significantly slower" than Rocket. To me this is unproductive because it appears like a default stance of attempting to get ahead of other libraries as opposed to collaboration and/or bug filing.

I was also under the impression that it's common knowledge that (a) performant I/O programs use async I/O, especially servers and (b) microbenchmarks like hello world are not good methods to gauge a framework. Seeing numbers claiming that synchronous I/O beats out the asynchronous ecosystem in performance is misleading and using them to claim superiority over alternatives seems slightly disingenuous.

To help put this in perspective, let's try this again:

Hyper at a custom revision. I've done this for now to add the bare-bones support for multithreading. Hyper of course will likely have a more official API for this, but it gets the point across.
Rocket at b164da1a01b323035bc88e02bfd464ff4b6b11e6.
Hyper program and lock file
Rocket program and lock file
rustc 1.16.0-nightly (4ce7accaa 2017-01-17)
Ubuntu 8-core machine

| wrk command | Hyper (requests/sec) | Rocket (requests/sec) |
|------------------------|-----------|-----------|
| wrk -t 10 -d 10s -c 20 | 225759.00 | 215172.26 |
| wrk -t 1 -d 10s -c 18 | 205628.07 | 212075.66 |

As I mentioned above I'm leaving off latency numbers. I don't think they're fair here as Rocket's only actually serving 8 (the number of cores I have) of the connections issued. The other extra connections are idle and make no progress, which I think skews wrk's output of latency.

From this I wouldn't conclude that anything is "significantly" better than the other. Rather, this microbenchmark seems to show that Rocket as a framework has very little overhead (similar to other libraries like Hyper).
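
The size of the gap in the table above can be checked with a few lines of arithmetic. This is only a sketch over the two rows already quoted; `relative_diff_pct` is a hypothetical helper name.

```rust
/// Relative difference of `a` versus `b`, as a percentage of `b`.
fn relative_diff_pct(a: f64, b: f64) -> f64 {
    (a - b) / b * 100.0
}

fn main() {
    // (wrk command, Hyper requests/sec, Rocket requests/sec) from the table.
    let rows = [
        ("wrk -t 10 -d 10s -c 20", 225_759.00, 215_172.26),
        ("wrk -t 1 -d 10s -c 18", 205_628.07, 212_075.66),
    ];
    for (cmd, hyper, rocket) in rows {
        // Positive means Hyper served more requests per second than Rocket.
        println!("{cmd}: Hyper vs Rocket {:+.1}%", relative_diff_pct(hyper, rocket));
    }
}
```

Both rows come out within about 5% of each other, with the sign flipping between the two wrk configurations.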

First, it's not clear to me what such a bug report would look like. The issues in Hyper are cross-cutting; unless a deep analysis is performed, only a blanket "improve performance" issue would suffice.

I am not personally intimately familiar with Hyper's internals, but the numbers and benchmarking I've done do not seem to agree with this claim that there are "cross-cutting" issues. I would expect bug reports to at least start out with "this benchmark seems slow" and then it can be refined further.

Second, I've opened two pull requests against Hyper, and both have been largely ignored for weeks or months with little or no feedback from the maintainer.

I'm sorry these haven't been moving along, but I'm sure @seanmonstar will get around to them in time once other changes in Hyper settle down.

alexcrichton on 18 Jan 2017

👍1

@SergioBenitez er I'm sorry if you're taking this the wrong way! I opened this issue to clarify why the numbers I was seeing were different, not to attack Rocket in any way.

Indeed, it seems that I've interpreted your issue in the wrong way. My apologies.

What revisions were the repositories at?

Rocket was at https://github.com/SergioBenitez/Rocket/commit/ddda8fe79b5056afb7c13f176efb176a104d0ed6 and Hyper was at https://github.com/hyperium/hyper/commit/bdc19d52bf5ec2e63b785de31bfe0ad3ba4d2550.

What lock files were in play for versions of dependencies?

The latest at that time. This shouldn't result in variance.

What rustc versions were in use?

nightly-2017-01-08

Precisely what code was run?

The Rocket hello_world example was run unmodified with ROCKET_ENV=prod cargo run --release. The Hyper code is here.

What command was used to benchmark?

wrk -t 1 -c 18 http://localhost

Indeed! I personally considered it unfair to benchmark a multithreaded implementation with a non-multithreaded implementation. Seems opinions on this differ though :)

I didn't benchmark multithreaded Rocket against non-multithreaded Hyper. In fact, as I stated, I purposefully didn't include benchmarks against Hyper-tokio _because_ there was no multithreaded version. As you can see from your own benchmarks, the difference between multithread and non-multithreaded is drastic.

Ok! I didn't know what you were using though? (I'm not super familiar with wrk, so being exhaustive would be nice)

wrk outputs the parameters in the first few lines:

Running 10s test @ http://localhost:3000
  1 threads and 18 connections

I hoped this would be sufficient to indicate the parameters, but I can see how someone unfamiliar with wrk might miss that.

I personally felt that the benchmark numbers were hostile towards the rest of the Rust ecosystem, e.g. Iron and Hyper. Hyper specifically was called out as being "significantly slower" than Rocket. To me this is unproductive because it appears like a default stance of attempting to get ahead of other libraries as opposed to collaboration and/or bug filing.

This certainly wasn't my intention! :( The point of the benchmarks is simply to show that Rocket has no performance overhead over using the bare HTTP server. In fact, it might even have an advantage! This is the only reason the benchmarks exist in the README at this point.

I was also under the impression that it's common knowledge that (a) performant I/O programs use async I/O, especially servers [...] Seeing numbers claiming that synchronous I/O beats out the asynchronous ecosystem in performance is misleading and using them to claim superiority over alternatives seems slightly disingenuous.

Unfortunately this is an extremely common misconception. Asynchronous I/O doesn't necessarily make your program performant, nor does a program using synchronous I/O necessarily suffer from bad performance. There are reasons to use either. When writing an HTTP server on today's operating systems, using asynchronous I/O is definitely the right way to go. That being said, if you put Rocket behind a reverse proxy like NGINX, you won't see any of the drawbacks of synchronous I/O, including the latency issues you mention. For this and other reasons, this is the only recommended way to deploy Rocket at the moment.

Again, my goal isn't to "claim superiority" in any way.

(b) microbenchmarks like hello world are not good methods to gauge a framework.

Agreed. It's simply a baseline.

To help put this in perspective, let's try this again: [...]

I've run your multithreaded code, but it performs _much_ worse (32k req/s) than the code on the master branch with a single thread (55k req/s). I'd bet this is an artifact of the differences between SO_REUSEPORT on OS X and Ubuntu. Also, I would try with -c 10 on your machine.

SergioBenitez on 18 Jan 2017

👍2

That'd do it, thanks for the info! That's all the info I needed, so I'm going to close this.

alexcrichton on 19 Jan 2017
