Flask: Python Flask Gevent stack - Simple “Hello World” app shows as inefficient when benchmarked

Created on 30 May 2014 · 14 Comments · Source: pallets/flask

I have the following simple "Hello World" app:

from gevent import monkey
monkey.patch_all()
from flask import Flask
from gevent import wsgi

app = Flask(__name__)

@app.route('/')
def index():
  return 'Hello World'

server = wsgi.WSGIServer(('127.0.0.1', 5000), app)
server.serve_forever()

As you can see it's pretty straightforward.

The problem is that, despite its simplicity, it's pretty slow, as the following benchmark (run with ApacheBench) shows:

ab -k -n 1000 -c 100 http://127.0.0.1:5000/

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        
Server Hostname:        127.0.0.1
Server Port:            5000

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      100
Time taken for tests:   1.515 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    0
Total transferred:      146000 bytes
HTML transferred:       11000 bytes
Requests per second:    660.22 [#/sec] (mean)
Time per request:       151.465 [ms] (mean)
Time per request:       1.515 [ms] (mean, across all concurrent requests)
Transfer rate:          94.13 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       3
Processing:     1  145  33.5    149     191
Waiting:        1  144  33.5    148     191
Total:          4  145  33.0    149     191

Percentage of the requests served within a certain time (ms)
  50%    149
  66%    157
  75%    165
  80%    173
  90%    183
  95%    185
  98%    187
  99%    188
 100%    191 (longest request)

Increasing the number of connections and/or the concurrency doesn't bring better results; in fact, it makes them worse.

What I'm most concerned about is the fact that I can't go over 700 Requests per second and a Transfer rate of 98 Kbytes/sec.

Also, the individual time per request seems too high.
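For reference, the three summary figures that ab prints are simple functions of the request count, the concurrency level, and the elapsed time, so the 151 ms mean largely reflects 100 requests being in flight at once rather than a fixed per-request cost. A minimal sketch of that arithmetic follows; the function name `ab_metrics` is made up for illustration, and small differences from ab's own output come from the elapsed time being rounded to 1.515 s in the report.

```python
def ab_metrics(total_requests, concurrency, elapsed_seconds):
    """Recompute ab's summary figures from its raw inputs."""
    requests_per_second = total_requests / elapsed_seconds
    # Mean time per request as seen by one concurrent client slot (ms).
    time_per_request_ms = concurrency * elapsed_seconds * 1000 / total_requests
    # Mean time per request across all concurrent slots (ms).
    time_across_all_ms = elapsed_seconds * 1000 / total_requests
    return requests_per_second, time_per_request_ms, time_across_all_ms


if __name__ == "__main__":
    # Inputs taken from the Flask run above: -n 1000 -c 100, 1.515 s.
    rps, per_request, across_all = ab_metrics(1000, 100, 1.515)
    print(f"{rps:.2f} req/s, {per_request:.1f} ms, {across_all:.3f} ms")
```

Read this way, throughput (requests per second) is the figure to compare between runs; the per-request mean scales with whatever `-c` value was chosen.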

I got curious about what Python and Gevent are doing in the background, or rather what the OS is doing, so I used strace to look for possible system-side issues. Here's the result:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.46    0.000284           0      1386           close
 24.25    0.000122           0      1016           write
 10.74    0.000054           0      1000           send
  4.17    0.000021           0      3652      3271 open
  2.19    0.000011           0       641           read
  2.19    0.000011           0      6006           fcntl64
  0.00    0.000000           0         1           waitpid
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           time
  0.00    0.000000           0        12        12 access
  0.00    0.000000           0        32           brk
  0.00    0.000000           0         5         1 ioctl
  0.00    0.000000           0      5006           gettimeofday
  0.00    0.000000           0         4         2 readlink
  0.00    0.000000           0       191           munmap
  0.00    0.000000           0         1         1 statfs
  0.00    0.000000           0         1         1 sigreturn
  0.00    0.000000           0         2           clone
  0.00    0.000000           0         2           uname
  0.00    0.000000           0        21           mprotect
  0.00    0.000000           0        69        65 _llseek
  0.00    0.000000           0        71           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3           getcwd
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0       243           mmap2
  0.00    0.000000           0      1838       748 stat64
  0.00    0.000000           0        74           lstat64
  0.00    0.000000           0       630           fstat64
  0.00    0.000000           0         1           getuid32
  0.00    0.000000           0         1           getgid32
  0.00    0.000000           0         1           geteuid32
  0.00    0.000000           0         1           getegid32
  0.00    0.000000           0         4           getdents64
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_thread_area
  0.00    0.000000           0         2           epoll_ctl
  0.00    0.000000           0        12         1 epoll_wait
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        26           clock_gettime
  0.00    0.000000           0         2           openat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         1           pipe2
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0      1000           accept
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0      2000      1000 recv
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.000503                 24977      5103 total

As you can see, there are 5103 errors, the worst offender being the open syscall, which I suspect is caused by files not being found (ENOENT). To my surprise, epoll didn't look like a troublemaker, even though I had heard many horror stories about it.
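One way to put those error counts in perspective is to rank the rows of the summary by failed calls and compare that with the time column. The sketch below parses a few rows copied from the table above; `parse_strace_summary` is a hypothetical helper, not part of strace, and it assumes the `strace -c` column layout shown here. Note that the whole table adds up to about 0.0005 s of system call time against a 1.5 s benchmark run, so the failed open calls (quite possibly Python probing its import paths at startup, though that is a guess) are unlikely to be the bottleneck.

```python
SAMPLE = """\
 56.46    0.000284           0      1386           close
 24.25    0.000122           0      1016           write
  4.17    0.000021           0      3652      3271 open
  0.00    0.000000           0      1838       748 stat64
  0.00    0.000000           0      2000      1000 recv
"""


def parse_strace_summary(text):
    """Parse `strace -c` rows into dicts; the errors column may be blank."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields (no errors) or 6 (with errors).
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            seconds = float(fields[1])
            calls = int(fields[3])
            errors = int(fields[4]) if len(fields) == 6 else 0
        except ValueError:
            continue  # ruler or header line
        rows.append({"syscall": fields[-1], "seconds": seconds,
                     "calls": calls, "errors": errors})
    return rows


if __name__ == "__main__":
    ranked = sorted(parse_strace_summary(SAMPLE),
                    key=lambda row: row["errors"], reverse=True)
    for row in ranked:
        print(row["syscall"], row["errors"], row["seconds"])
```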

I would like to post the full strace, which details every single call, but it is far too large.

A final note: I also set the following system parameters (to the maximum allowed values), hoping it would change the situation, but it didn't:

echo "32768   61000" > /proc/sys/net/ipv4/ip_local_port_range
sysctl -w fs.file-max=128000
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.core.somaxconn=61000
sysctl -w net.ipv4.tcp_max_syn_backlog=2500
sysctl -w net.core.netdev_max_backlog=2500
ulimit -n 1024
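The `ulimit -n` value matters because every open connection uses a file descriptor in the server process. As a rough sketch (assuming a Unix-like system; `raise_nofile_limit` is an illustrative name and 4096 is an arbitrary target), the limit that the Python process actually sees can be checked, and raised up to the hard limit, with the standard `resource` module:

```python
import resource


def raise_nofile_limit(target):
    """Try to raise the soft RLIMIT_NOFILE to `target`.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never ask for more than the hard limit allows.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # the OS refused the change; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit before:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
    print("soft limit after: ", raise_nofile_limit(4096))
```

Calling this before the server starts listening only affects the current process; the shell's own `ulimit -n` setting is unchanged.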

My question is: given that the sample I'm using can't be changed much to fix these issues, where should I look to correct them?

For comparison, I wrote the following "Hello World" script with Wheezy.web & Gevent and got ~2000 requests per second:

from gevent import monkey
monkey.patch_all()
from gevent import pywsgi
from wheezy.http import HTTPResponse
from wheezy.http import WSGIApplication
from wheezy.routing import url
from wheezy.web.handlers import BaseHandler
from wheezy.web.middleware import bootstrap_defaults
from wheezy.web.middleware import path_routing_middleware_factory

def helloworld(request):
    response = HTTPResponse()
    response.write('hello world')
    return response


routes = [
    url('hello', helloworld, name='helloworld')
]


options = {}
main = WSGIApplication(
    middleware=[
        bootstrap_defaults(url_mapping=routes),
        path_routing_middleware_factory
    ],
    options=options
)


server = pywsgi.WSGIServer(('127.0.0.1', 5000), main, backlog=128000)
server.serve_forever()

And the benchmark results:

ab -k -n 1000 -c 1000 http://127.0.0.1:5000/hello

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        
Server Hostname:        127.0.0.1
Server Port:            5000

Document Path:          /front
Document Length:        11 bytes

Concurrency Level:      1000
Time taken for tests:   0.484 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    0
Total transferred:      170000 bytes
HTML transferred:       11000 bytes
Requests per second:    2067.15 [#/sec] (mean)
Time per request:       483.758 [ms] (mean)
Time per request:       0.484 [ms] (mean, across all concurrent requests)
Transfer rate:          343.18 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    8  10.9      0      28
Processing:     2   78  39.7     56     263
Waiting:        2   78  39.7     56     263
Total:         18   86  42.6     66     263

Percentage of the requests served within a certain time (ms)
  50%     66
  66%     83
  75%    129
  80%    131
  90%    152
  95%    160
  98%    178
  99%    182
 100%    263 (longest request)

I find Wheezy.web's speed great, but I'd still like to use Flask, as it's far simpler and less time-consuming to work with.

ghost

❤2 👍1

Most helpful comment

First of all, 10.000 requests is a fairly small number; you want to increase that to about 100.000 or even 1.000.000.

Nevertheless, if I replicate your benchmark exactly on my machine (Mid 2011 MacBook Air 1.8 GHz i7), I get more than twice the performance.

Switching to PyPy for faster interpretation, using gunicorn with eventlet (no gevent with PyPy, at least not yet), using 6 worker processes (which seems to produce optimal results), and raising the number of requests to 1.000.000, I get a throughput of 780 Kb/s and 4600 req/s.

Further looking at the benchmark method used I can't help but feel that 100 concurrent requests are also fairly low. In fact there are people reconfiguring kernels and developing async systems to achieve more than 10k concurrent requests. Simply setting the file descriptor limit to ulimit -n 10000 allowed me to increase the number of concurrent requests to 350 - by far not as high as I hoped but with more effort one could probably make more requests work - which allowed for a small but decent increase to about 5200 req/s and 900 Kb/s.

This is far faster than what you have achieved for both Flask and Wheezy, even accounting for my apparently faster hardware.

The problem here is not that Flask is slow; you simply haven't configured your web server correctly. You could probably improve performance further still by using Varnish, for example. My machine is not exactly server material, and given that hardware costs much less than developer time, getting a nice server would be an easy way to increase performance significantly as well.
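As a rough illustration of the setup described above, a gunicorn configuration file along these lines would start six eventlet workers. This is a sketch, not the exact configuration behind the numbers quoted: the file name `gunicorn.conf.py`, the bind address, and the `worker_connections` value are assumptions, and it presumes the Flask object is importable as `app:app` with the `serve_forever()` call removed from the module.

```python
# gunicorn.conf.py (hypothetical file name); start with:
#   gunicorn -c gunicorn.conf.py app:app
# where "app:app" assumes the Flask object `app` lives in app.py.

bind = "127.0.0.1:5000"     # same address as the benchmarked server
workers = 6                 # six worker processes, as in the comment above
worker_class = "eventlet"   # requires the eventlet package to be installed
worker_connections = 1000   # assumed cap on simultaneous clients per worker
accesslog = None            # keep access logging off while benchmarking
```

The best worker count depends on the machine; six is simply the value reported in the comment, so it is worth re-measuring on other hardware.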

DasIch on 30 May 2014

👍3 ❤2

All 14 comments

It would be interesting to see the strace of the Wheezy.web one.

danielchatfield on 30 May 2014

And now measure Django please, and tell them that it is too slow. I am sure
they'll tell you that Django is slower than Flask or Wheezy because it simply
does more.


untitaker on 30 May 2014

I agree with @untitaker that striving for anything close to wheezy.web performance is not realistic: wheezy was designed explicitly for speed and high concurrency, and thus lacks the flexibility of Flask and doesn't do nearly as much for you.

In fact, if concurrent performance is that important, then Go would probably be a better choice than Python.

danielchatfield on 30 May 2014

@danielchatfield Here is the Wheezy.web strace (weird that it took more time in the background):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.70    0.000876           0      2919       771 stat64
 22.58    0.000554           0      2000           send
  7.29    0.000179           0      4564      2280 recv
  6.48    0.000159           0      3721      3277 open
  6.07    0.000149           0     12858           fcntl64
  5.70    0.000140           0      2207        65 accept
  5.66    0.000139           0      2590           close
  5.18    0.000127           0     10153           gettimeofday
  2.08    0.000051           0       729           fstat64
  1.39    0.000034           0       676           read
  1.30    0.000032          32         1           waitpid
  0.57    0.000014           0       292           mmap2
  0.00    0.000000           0      2002           write
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         4           time
  0.00    0.000000           0        13        13 access
  0.00    0.000000           0        39           brk
  0.00    0.000000           0         5         1 ioctl
  0.00    0.000000           0         4         2 readlink
  0.00    0.000000           0       225           munmap
  0.00    0.000000           0         1         1 statfs
  0.00    0.000000           0         2           clone
  0.00    0.000000           0         2           uname
  0.00    0.000000           0        26           mprotect
  0.00    0.000000           0        69        65 _llseek
  0.00    0.000000           0        70           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0         1           getuid32
  0.00    0.000000           0         1           getgid32
  0.00    0.000000           0         1           geteuid32
  0.00    0.000000           0         1           getegid32
  0.00    0.000000           0         4           getdents64
  0.00    0.000000           0         4         1 futex
  0.00    0.000000           0         1           set_thread_area
  0.00    0.000000           0       282           epoll_ctl
  0.00    0.000000           0        89           epoll_wait
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0       181           clock_gettime
  0.00    0.000000           0         2           openat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         1           pipe2
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.002454                 45758      6476 total

The fact is that I'm trying to squeeze the most out of Flask because, first and foremost, I like its simplicity and speed of development.

This wasn't an attempt to bash Flask and/or Python, really. I hope that, given this benchmark and its strace, someone can help me find the "culprit" behind Flask's lower concurrency and fix it.

I wouldn't go with Go (sorry for the confusion I introduced) because Python is cleaner and easier to work with.

ghost on 30 May 2014

❤3

Flask:

fcntl(15, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000371>
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK)   = 0 <0.001111>
gettimeofday({1401456796, 717391}, NULL) = 0 <0.000255>
recvfrom(15, "GET / HTTP/1.0\r\nConnection: Keep"..., 8192, 0, NULL, NULL) = 106 <0.000385>
gettimeofday({1401456796, 719872}, NULL) = 0 <0.000293>
gettimeofday({1401456796, 721357}, NULL) = 0 <0.000367>
sendto(15, "HTTP/1.1 200 OK\r\nContent-Type: t"..., 146, 0, NULL, 0) = 146 <0.000408>
gettimeofday({1401456796, 722988}, NULL) = 0 <0.000089>
gettimeofday({1401456796, 723339}, NULL) = 0 <0.000088>
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=331, ...}) = 0 <0.000437>
write(2, "127.0.0.1 - - [2014-05-30 22:33:"..., 70127.0.0.1 - - [2014-05-30 22:33:16] "GET / HTTP/1.0" 200 146 0.003116
) = 70 <0.000489>
recvfrom(15, 0x2703cc4, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000395>
close(15)                               = 0 <0.000517>

Wheezy.web:

fcntl(15, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000335>
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK)   = 0 <0.000378>
gettimeofday({1401456852, 680479}, NULL) = 0 <0.001002>
recvfrom(15, "GET /hello HTTP/1.0\r\nConnection:"..., 8192, 0, NULL, NULL) = 111 <0.000553>
gettimeofday({1401456852, 684721}, NULL) = 0 <0.000306>
gettimeofday({1401456852, 685890}, NULL) = 0 <0.000578>
sendto(15, "HTTP/1.1 200 OK\r\nContent-Type: t"..., 170, 0, NULL, 0) = 170 <0.000739>
gettimeofday({1401456852, 688582}, NULL) = 0 <0.001020>
gettimeofday({1401456852, 690220}, NULL) = 0 <0.000405>
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=331, ...}) = 0 <0.000339>
write(2, "127.0.0.1 - - [2014-05-30 22:34:"..., 75127.0.0.1 - - [2014-05-30 22:34:12] "GET /hello HTTP/1.0" 200 170 0.003861
) = 75 <0.000424>
recvfrom(15, 0x23b9f54, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000475>
close(15)                               = 0 <0.000638>

methane on 30 May 2014

I don't think it can be fixed, because improving Flask's performance to be comparable with Wheezy would ultimately mean removing functionality.

untitaker on 30 May 2014

Of course that doesn't mean that no improvements could be made, but I am sure there is no low-hanging fruit.

untitaker on 30 May 2014

😄2

Another data point:

wheezy.web:

./wrk -d 10 -c 100 http://127.0.0.1:5000/hello
Running 10s test @ http://127.0.0.1:5000/hello
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.67s     1.37s    4.20s    87.65%
    Req/Sec     1.02k     0.94k    2.56k    35.43%
  19061 requests in 10.00s, 2.74MB read
  Socket errors: connect 0, read 0, write 0, timeout 64
Requests/sec:   1906.22
Transfer/sec:    281.09KB

# time of server
real    0m15.023s
user    0m5.431s
sys     0m3.403s

Flask:

$ ./wrk -d 10 -c 100 http://127.0.0.1:5000/
Running 10s test @ http://127.0.0.1:5000/
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.53s     2.33s    7.43s    88.61%
    Req/Sec   718.82    691.44     1.67k    29.73%
  13451 requests in 10.00s, 1.63MB read
  Socket errors: connect 0, read 0, write 0, timeout 160
Requests/sec:   1344.95
Transfer/sec:    166.81KB

real    0m14.723s
user    0m6.828s
sys     0m2.467s

methane on 30 May 2014

I agree with @untitaker

methane on 30 May 2014

@methane Looking at the strace makes it look like there's not much difference between the two. Now that the community has confirmed its performance, what do you suggest I do to handle 1500-2000 requests per second without modifying Flask? Switch from CPython to PyPy? Spread Python processes over many servers and CPUs?

ghost on 30 May 2014

@DasIch It looks like your machine has more throughput than mine. May I ask how you run Gunicorn and PyPy?

I wanted to try PyPy too, and got faster results like yours.

For the test I used Monocle + Tornado (and PyPy, of course) with 1000 concurrent connections, 1000 times.

I got ~6000 req/s with it. I got much worse results with Wheezy.web this time.

I know that Gevent isn't supposed to work with PyPy yet, but I wanted to give it a try and make it work anyway. As you might guess, I got it working without too much effort. I doubt that it works 100%, but it is a starting point nonetheless.

So, I got the Gevent + Flask snippet to work with PyPy and it wasn't bad (~4000-5000 req/s when fully "warmed up"). It still performed worse than Monocle + Tornado. But if the choice is between the simplicity of Flask and the performance of Monocle + Tornado, you can live with the performance of Flask + Gevent, as there's not much difference and you get to develop faster.

I want to share how I got Gevent and PyPy working, so we can fix the remaining issues.

First make sure that you have all the required libraries in your system:

$ apt-get install libssl-dev libev-dev libffi-dev ncurses-dev

Install the cffi module:

$ pypy -m pip install cffi

Install a version of Gevent which has been modified to run on PyPy:

$ git clone https://github.com/schmir/gevent.git
$ cd gevent
$ git checkout pypy-hacks
$ pypy setup.py install

I also patched the gevent.core cffi module to fix the "erroneous" byte declaration that stopped the installation process. You may want to apply it:

$ git clone https://github.com/yakamooz/pypycore.git
$ cd pypycore
$ CFLAGS=-O2 pypy -m pip install -e .

There is a patched socket.py in the "pypycore" folder you cloned from GitHub. Replace the one in /usr/lib/pypy/lib-python/2.7 with it (make a backup first for safety).

Before doing anything with PyPy and Gevent make sure Gevent uses the right gevent.core in the following way:

$ export GEVENT_LOOP=pypycore.loop

Now you can use Gevent and PyPy together!

I'd be glad if you posted your performance with it, to see whether you get more throughput than the ~4000-5000 req/s I had.

ghost on 31 May 2014

UPDATE:
It looks like Syncless (https://code.google.com/p/syncless/) with Flask on pure Python is 5-6x faster than Gevent, and 1-2x faster than Tornado + Monocle on PyPy and Gunicorn + Eventlet on PyPy: I got ~7000 req/s with it.

I'm going to patch it to work with PyPy and see how much I get.

ghost on 1 Jun 2014

@osmantekin I know this might be outdated, but I can't manage to get ~7000 req/s. What was your setup?

Currently I get ~2300 req/s with Gunicorn + Gevent + CPython and ~1600 req/s with Gunicorn + Gevent + PyPy on my 8-core PC.
Thanks.