I have the following simple "Hello World" app:
from gevent import monkey
monkey.patch_all()
from flask import Flask
from gevent import wsgi
app = Flask(__name__)
@app.route('/')
def index():
    return 'Hello World'
server = wsgi.WSGIServer(('127.0.0.1', 5000), app)
server.serve_forever()
As you can see it's pretty straightforward.
The problem is that, despite its simplicity, it's pretty slow/inefficient, as the following benchmark (made with Apache Benchmark) shows:
ab -k -n 1000 -c 100 http://127.0.0.1:5000/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 5000
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 1.515 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 146000 bytes
HTML transferred: 11000 bytes
Requests per second: 660.22 [#/sec] (mean)
Time per request: 151.465 [ms] (mean)
Time per request: 1.515 [ms] (mean, across all concurrent requests)
Transfer rate: 94.13 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.6 0 3
Processing: 1 145 33.5 149 191
Waiting: 1 144 33.5 148 191
Total: 4 145 33.0 149 191
Percentage of the requests served within a certain time (ms)
50% 149
66% 157
75% 165
80% 173
90% 183
95% 185
98% 187
99% 188
100% 191 (longest request)
Increasing the number of connections and/or the concurrency doesn't bring better results; in fact, it makes them worse.
What I'm most concerned about is the fact that I can't go over 700 Requests per second and a Transfer rate of 98 Kbytes/sec.
Also, the individual Time per request seems to be too much.
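For reference, the two "Time per request" lines in the ab report are derived from the same three inputs, so the ~151 ms figure largely reflects 100 requests being in flight at once rather than one slow request. A minimal sketch of that arithmetic, using only the numbers printed above (the variable names are illustrative, not ab's own):

```python
# Figures copied from the ab report above.
total_requests = 1000
concurrency = 100
elapsed_s = 1.515  # "Time taken for tests", as rounded by ab

# "Requests per second": completed requests divided by wall-clock time.
requests_per_second = total_requests / elapsed_s

# "Time per request (mean, across all concurrent requests)":
# wall-clock time divided by the number of completed requests.
per_request_across_all_ms = elapsed_s / total_requests * 1000

# "Time per request (mean)": the figure above multiplied by the
# concurrency level, i.e. how long one request waits on average
# while the other requests are being served alongside it.
per_request_mean_ms = concurrency * per_request_across_all_ms

print("requests/s          : %.2f" % requests_per_second)
print("ms/request (mean)   : %.3f" % per_request_mean_ms)
print("ms/request (across) : %.3f" % per_request_across_all_ms)
```

The results differ slightly from the report (660.22 req/s, 151.465 ms), presumably because ab works from the unrounded elapsed time.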
I got curious about what Python and Gevent are doing in the background, or rather what the OS is doing, so I used strace to look for possible system-side issues. Here's the result:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
56.46 0.000284 0 1386 close
24.25 0.000122 0 1016 write
10.74 0.000054 0 1000 send
4.17 0.000021 0 3652 3271 open
2.19 0.000011 0 641 read
2.19 0.000011 0 6006 fcntl64
0.00 0.000000 0 1 waitpid
0.00 0.000000 0 1 execve
0.00 0.000000 0 3 time
0.00 0.000000 0 12 12 access
0.00 0.000000 0 32 brk
0.00 0.000000 0 5 1 ioctl
0.00 0.000000 0 5006 gettimeofday
0.00 0.000000 0 4 2 readlink
0.00 0.000000 0 191 munmap
0.00 0.000000 0 1 1 statfs
0.00 0.000000 0 1 1 sigreturn
0.00 0.000000 0 2 clone
0.00 0.000000 0 2 uname
0.00 0.000000 0 21 mprotect
0.00 0.000000 0 69 65 _llseek
0.00 0.000000 0 71 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 3 getcwd
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 243 mmap2
0.00 0.000000 0 1838 748 stat64
0.00 0.000000 0 74 lstat64
0.00 0.000000 0 630 fstat64
0.00 0.000000 0 1 getuid32
0.00 0.000000 0 1 getgid32
0.00 0.000000 0 1 geteuid32
0.00 0.000000 0 1 getegid32
0.00 0.000000 0 4 getdents64
0.00 0.000000 0 3 1 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 2 epoll_ctl
0.00 0.000000 0 12 1 epoll_wait
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 26 clock_gettime
0.00 0.000000 0 2 openat
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 1 pipe2
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 1000 accept
0.00 0.000000 0 1 getsockname
0.00 0.000000 0 2000 1000 recv
0.00 0.000000 0 1 setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.000503 24977 5103 total
As you can see, there are 5103 errors, the worst offender being the open syscall, which I suspect has to do with files not being found (ENOENT). To my surprise, epoll didn't look like a _troublemaker_, even though I had heard many horror stories about it.
I would like to post the full strace, which goes into the detail of every single call, but it is far too large.
A final note: I also set the following system parameters (which are the maximum allowed values) hoping it would change the situation, but it didn't:
echo "32768 61000" > /proc/sys/net/ipv4/ip_local_port_range
sysctl -w fs.file-max=128000
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.core.somaxconn=61000
sysctl -w net.ipv4.tcp_max_syn_backlog=2500
sysctl -w net.core.netdev_max_backlog=2500
ulimit -n 1024
My question is: given that there is little in this sample that can be changed to fix these issues, where should I look to correct them?
For a comparison I made the following "Hello World" script with Wheezy.web & Gevent and I got ~2000 Requests per second:
from gevent import monkey
monkey.patch_all()
from gevent import pywsgi
from wheezy.http import HTTPResponse
from wheezy.http import WSGIApplication
from wheezy.routing import url
from wheezy.web.handlers import BaseHandler
from wheezy.web.middleware import bootstrap_defaults
from wheezy.web.middleware import path_routing_middleware_factory
def helloworld(request):
    response = HTTPResponse()
    response.write('hello world')
    return response
routes = [
    url('hello', helloworld, name='helloworld')
]
options = {}
main = WSGIApplication(
    middleware=[
        bootstrap_defaults(url_mapping=routes),
        path_routing_middleware_factory
    ],
    options=options
)
server = pywsgi.WSGIServer(('127.0.0.1', 5000), main, backlog=128000)
server.serve_forever()
And the benchmark results:
ab -k -n 1000 -c 1000 http://127.0.0.1:5000/hello
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 5000
Document Path: /front
Document Length: 11 bytes
Concurrency Level: 1000
Time taken for tests: 0.484 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 170000 bytes
HTML transferred: 11000 bytes
Requests per second: 2067.15 [#/sec] (mean)
Time per request: 483.758 [ms] (mean)
Time per request: 0.484 [ms] (mean, across all concurrent requests)
Transfer rate: 343.18 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 8 10.9 0 28
Processing: 2 78 39.7 56 263
Waiting: 2 78 39.7 56 263
Total: 18 86 42.6 66 263
Percentage of the requests served within a certain time (ms)
50% 66
66% 83
75% 129
80% 131
90% 152
95% 160
98% 178
99% 182
100% 263 (longest request)
I find Wheezy.web's speed great, but I'd still like to use Flask as it's far simpler and less time consuming to work with.
It would be interesting to see the strace of the Wheezy.web one.
And now measure Django please, and tell them that it is too slow. I am sure
they'll tell you that Django is slower than Flask or Wheezy because it simply
does more.
I agree with @untitaker that striving for anything close to wheezy.web performance is not realistic. Wheezy was designed explicitly for speed and high concurrency, and thus lacks the flexibility of Flask and doesn't do nearly as much for you.
In fact, if concurrent performance is that important, then Go would probably be a better choice than Python.
@danielchatfield Here is the Wheezy.web strace (weird that it took more time in the background):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
35.70 0.000876 0 2919 771 stat64
22.58 0.000554 0 2000 send
7.29 0.000179 0 4564 2280 recv
6.48 0.000159 0 3721 3277 open
6.07 0.000149 0 12858 fcntl64
5.70 0.000140 0 2207 65 accept
5.66 0.000139 0 2590 close
5.18 0.000127 0 10153 gettimeofday
2.08 0.000051 0 729 fstat64
1.39 0.000034 0 676 read
1.30 0.000032 32 1 waitpid
0.57 0.000014 0 292 mmap2
0.00 0.000000 0 2002 write
0.00 0.000000 0 1 execve
0.00 0.000000 0 4 time
0.00 0.000000 0 13 13 access
0.00 0.000000 0 39 brk
0.00 0.000000 0 5 1 ioctl
0.00 0.000000 0 4 2 readlink
0.00 0.000000 0 225 munmap
0.00 0.000000 0 1 1 statfs
0.00 0.000000 0 2 clone
0.00 0.000000 0 2 uname
0.00 0.000000 0 26 mprotect
0.00 0.000000 0 69 65 _llseek
0.00 0.000000 0 70 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 1 getcwd
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 5 lstat64
0.00 0.000000 0 1 getuid32
0.00 0.000000 0 1 getgid32
0.00 0.000000 0 1 geteuid32
0.00 0.000000 0 1 getegid32
0.00 0.000000 0 4 getdents64
0.00 0.000000 0 4 1 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 282 epoll_ctl
0.00 0.000000 0 89 epoll_wait
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 181 clock_gettime
0.00 0.000000 0 2 openat
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 1 pipe2
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 1 getsockname
0.00 0.000000 0 1 setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.002454 45758 6476 total
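To put the two strace summaries side by side, the totals can be normalised per request. This is a rough sketch using only the counts printed in the two tables. The request count for the Wheezy.web run is not stated in the thread, so 2000 is an assumption inferred from its 2000 send calls.

```python
# Totals copied from the two "strace -c" summaries in this thread.
# "requests" for Wheezy.web is an ASSUMPTION inferred from the number
# of send calls (2000); the Flask run shows 1000 send/accept calls.
flask = {"requests": 1000, "calls": 24977, "errors": 5103, "failed_open": 3271}
wheezy = {"requests": 2000, "calls": 45758, "errors": 6476, "failed_open": 3277}


def calls_per_request(run):
    """Average number of syscalls per served request, startup included."""
    return run["calls"] / float(run["requests"])


for name, run in (("Flask", flask), ("Wheezy.web", wheezy)):
    print("%-10s %.1f syscalls/request, %d failed open() calls"
          % (name, calls_per_request(run), run["failed_open"]))

# Difference in failed open() calls between the two runs.
failed_open_delta = abs(flask["failed_open"] - wheezy["failed_open"])
print("failed open() difference: %d" % failed_open_delta)
```

Under that assumption both servers issue a similar number of syscalls per request, and the failed open calls are almost identical in both runs even though the request counts differ. That suggests those failures do not grow with the number of requests (for example, they could come from module lookups at startup), though the summaries alone cannot confirm it.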
The fact is that I'm trying to squeeze the most out of Flask because, above all, I like its simplicity and speed of development.
This wasn't an attempt to bash Flask and/or Python, really. I hope that, given this benchmark and its strace, someone can help me find the "culprit" behind Flask's lower concurrency and fix it.
I wouldn't go with Go (sorry for the confusion I introduced) because Python is cleaner and easier to work with.
Flask:
fcntl(15, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000371>
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.001111>
gettimeofday({1401456796, 717391}, NULL) = 0 <0.000255>
recvfrom(15, "GET / HTTP/1.0\r\nConnection: Keep"..., 8192, 0, NULL, NULL) = 106 <0.000385>
gettimeofday({1401456796, 719872}, NULL) = 0 <0.000293>
gettimeofday({1401456796, 721357}, NULL) = 0 <0.000367>
sendto(15, "HTTP/1.1 200 OK\r\nContent-Type: t"..., 146, 0, NULL, 0) = 146 <0.000408>
gettimeofday({1401456796, 722988}, NULL) = 0 <0.000089>
gettimeofday({1401456796, 723339}, NULL) = 0 <0.000088>
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=331, ...}) = 0 <0.000437>
write(2, "127.0.0.1 - - [2014-05-30 22:33:"..., 70127.0.0.1 - - [2014-05-30 22:33:16] "GET / HTTP/1.0" 200 146 0.003116
) = 70 <0.000489>
recvfrom(15, 0x2703cc4, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000395>
close(15) = 0 <0.000517>
Wheezy.web:
fcntl(15, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000335>
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000378>
gettimeofday({1401456852, 680479}, NULL) = 0 <0.001002>
recvfrom(15, "GET /hello HTTP/1.0\r\nConnection:"..., 8192, 0, NULL, NULL) = 111 <0.000553>
gettimeofday({1401456852, 684721}, NULL) = 0 <0.000306>
gettimeofday({1401456852, 685890}, NULL) = 0 <0.000578>
sendto(15, "HTTP/1.1 200 OK\r\nContent-Type: t"..., 170, 0, NULL, 0) = 170 <0.000739>
gettimeofday({1401456852, 688582}, NULL) = 0 <0.001020>
gettimeofday({1401456852, 690220}, NULL) = 0 <0.000405>
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=331, ...}) = 0 <0.000339>
write(2, "127.0.0.1 - - [2014-05-30 22:34:"..., 75127.0.0.1 - - [2014-05-30 22:34:12] "GET /hello HTTP/1.0" 200 170 0.003861
) = 75 <0.000424>
recvfrom(15, 0x23b9f54, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000475>
close(15) = 0 <0.000638>
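Both traces end each request with a write(2, ...) call, which is the access log line going to stderr. One thing that could be tried is switching that log off when building the server. The sketch below is an untested suggestion, not something measured in this thread: it uses a plain WSGI callable as a stand-in for the Flask app, and `make_server` is a hypothetical helper name. Passing `log=None` to gevent's `pywsgi.WSGIServer` is documented as disabling request logging.

```python
def app(environ, start_response):
    """Plain WSGI hello-world, standing in for the Flask app (assumption)."""
    body = b"Hello World"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def make_server(application, host="127.0.0.1", port=5000):
    """Build a gevent WSGI server with the per-request access log disabled."""
    # Imported here so the module can be loaded without gevent installed.
    from gevent import pywsgi
    # log=None turns off the access log line written for every request.
    return pywsgi.WSGIServer((host, port), application, log=None)


# To serve (blocks forever):
#     make_server(app).serve_forever()
```

Whether this makes a visible difference in the ab numbers would need to be measured; it only removes one syscall and some string formatting per request.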
I don't think it can be fixed, because improving Flask's performance to be comparable with Wheezy's would ultimately mean removing functionality.
Of course that doesn't mean that no improvements could be made, but I am sure there is no low-hanging fruit.
Another data point:
wheezy.web:
./wrk -d 10 -c 100 http://127.0.0.1:5000/hello
Running 10s test @ http://127.0.0.1:5000/hello
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.67s 1.37s 4.20s 87.65%
Req/Sec 1.02k 0.94k 2.56k 35.43%
19061 requests in 10.00s, 2.74MB read
Socket errors: connect 0, read 0, write 0, timeout 64
Requests/sec: 1906.22
Transfer/sec: 281.09KB
# time of server
real 0m15.023s
user 0m5.431s
sys 0m3.403s
Flask:
$ ./wrk -d 10 -c 100 http://127.0.0.1:5000/
Running 10s test @ http://127.0.0.1:5000/
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.53s 2.33s 7.43s 88.61%
Req/Sec 718.82 691.44 1.67k 29.73%
13451 requests in 10.00s, 1.63MB read
Socket errors: connect 0, read 0, write 0, timeout 160
Requests/sec: 1344.95
Transfer/sec: 166.81KB
real 0m14.723s
user 0m6.828s
sys 0m2.467s
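A quick way to compare the two sets of measurements in this thread is to look at the throughput ratios. The sketch below only uses the requests-per-second figures already reported; note that the two ab runs used different concurrency levels (-c 100 for Flask, -c 1000 for Wheezy.web), while both wrk runs used -c 100.

```python
# Requests per second as reported earlier in this thread.
ab_flask, ab_wheezy = 660.22, 2067.15      # ab: -c 100 vs -c 1000
wrk_flask, wrk_wheezy = 1344.95, 1906.22   # wrk: -c 100 for both

# How many times faster Wheezy.web was than Flask in each tool.
ab_ratio = ab_wheezy / ab_flask
wrk_ratio = wrk_wheezy / wrk_flask

print("ab : Wheezy.web is %.2fx Flask" % ab_ratio)
print("wrk: Wheezy.web is %.2fx Flask" % wrk_ratio)
```

With the same concurrency on both sides (the wrk runs), the gap shrinks from roughly 3x to roughly 1.4x, so part of the difference seen with ab may come from the benchmark settings rather than from the frameworks.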
I agree with @untitaker
@methane Looking at the straces, it looks like there isn't much difference between the two. Now that the community has confirmed Flask's performance, what do you suggest I do to handle 1500-2000 requests per second without modifying Flask? Switch from CPython to PyPy? Spread Python processes over many servers and CPUs?
First of all 10.000 requests is a fairly small number, you want to increase that to about 100.000 or even 1.000.000.
Nevertheless if I replicate your benchmark exactly on my machine (Mid 2011 MacBook Air 1.8 GHz i7) I get more than twice the performance.
Switching to PyPy for faster interpretation, using gunicorn with eventlet (no gevent with PyPy, yet at least), using 6 worker processes which seem to produce optimal results and adjusting the number of requests to 1.000.000 I get a throughput of 780 Kb/s and 4600 req/s.
Further looking at the benchmark method used I can't help but feel that 100 concurrent requests are also fairly low. In fact there are people reconfiguring kernels and developing async systems to achieve more than 10k concurrent requests. Simply setting the file descriptor limit to ulimit -n 10000 allowed me to increase the number of concurrent requests to 350 - by far not as high as I hoped but with more effort one could probably make more requests work - which allowed for a small but decent increase to about 5200 req/s and 900 Kb/s.
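The file descriptor limit mentioned above can also be inspected and raised from inside the Python process, using the standard `resource` module (Unix only). This is a minimal sketch; `raise_nofile_limit` is a hypothetical helper name, and the soft limit can never be pushed past the hard limit without extra privileges.

```python
import resource


def raise_nofile_limit(target):
    """Try to raise the soft RLIMIT_NOFILE to `target`; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never ask for more than the hard limit allows.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the soft limit is unlimited or already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # The OS refused the new value; keep the existing limit.
        return soft
    return target


print("soft limit on open files:", raise_nofile_limit(10000))
```

Calling this before the server starts accepting connections has a similar effect to running `ulimit -n 10000` in the shell that launches it, as far as the limits permit.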
This is far faster than what you have achieved for both Flask and Wheezy, even accounting for my apparently faster hardware.
The problem here is not that Flask is slow; you simply haven't configured your web server correctly. You could probably improve performance further still by using Varnish, for example. My machine is not exactly server material, and given that hardware costs much less than developer time, getting a nice server would be an easy way to increase performance significantly as well.
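The exact invocation used for these numbers is not given in the thread. As a rough sketch of the setup described (gunicorn, eventlet workers, 6 worker processes), a gunicorn configuration file could look like the following. Gunicorn config files are plain Python; the bind address is taken from the original example and `worker_connections` is an illustrative value, not one reported here.

```python
# gunicorn.conf.py -- a sketch, not the configuration actually used above.

# Same address as the original gevent example.
bind = "127.0.0.1:5000"

# Six worker processes, as described in the comment above.
workers = 6

# Use eventlet-based async workers (requires eventlet to be installed).
worker_class = "eventlet"

# Maximum simultaneous clients per worker (illustrative value).
worker_connections = 1000
```

It would then be started with something like `gunicorn -c gunicorn.conf.py hello:app`, where `hello` is a hypothetical module containing the Flask `app` object.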
@DasIch It looks like your machine has more throughput than mine. May I ask you how you run Gunicorn and PyPy?
I wanted to try PyPy too and got faster results like yours.
For the test I used Monocle + Tornado (and PyPy of course) and 1000 concurrent connections x1000 times.
I got ~6000 req/s with it. I got way worse results with Wheezy.web this time.
I know that Gevent isn't supposed to work with PyPy yet, but I wanted to give it a try and make it work anyway. As you can guess, I got it working without too much effort. I'm very doubtful that it works 100%, but it is a starting point nonetheless.
So, I got the Gevent + Flask snippet to work with PyPy and it wasn't bad (~4-5000 req/s when fully "warmed"). It still performed worse than Monocle + Tornado. But if you have to trade the simplicity of Flask for the performance of Monocle + Tornado, you can live with the performance of Flask + Gevent, as there isn't much difference and you get to develop faster.
I want to share with you how I got Gevent and PyPy working, so we may fix remaining issues.
First make sure that you have all the required libraries in your system:
$ apt-get install libssl-dev libev-dev libffi-dev ncurses-dev
Install the cffi module:
$ pypy -m pip install cffi
Install a version of Gevent which has been modified to run on PyPy:
$ git clone https://github.com/schmir/gevent.git
$ cd gevent
$ git checkout pypy-hacks
$ pypy setup.py install
I also patched the gevent.core cffi module to fix the "erroneous" byte declaration that stopped the installation process. You may want to apply it:
$ git clone https://github.com/yakamooz/pypycore.git
$ cd pypycore
$ CFLAGS=-O2 pypy -m pip install -e .
There is a socket.py that I patched in the "pypycore" folder you cloned from GitHub. Replace the one in /usr/lib/pypy/lib-python/2.7 with it (make a backup first, for safety).
Before doing anything with PyPy and Gevent make sure Gevent uses the right gevent.core in the following way:
$ export GEVENT_LOOP=pypycore.loop
Now you can use Gevent and PyPy together!
I'd be glad if you posted your performance with it and see if you get more throughput than the ~4-5000 req/s I had.
UPDATE:
It looks like Syncless (https://code.google.com/p/syncless/) is 5-6x faster than Gevent, and 1-2x faster than Tornado+Monocle on PyPy and Gunicorn+Eventlet on PyPy: I got ~7000 req/s with Syncless and Flask on pure Python.
I'm going to patch it to work with PyPy and see how much I get.
@osmantekin I know this might be outdated, but I can't manage to get ~7000 req/s. What was your setup?
Currently I get ~2300 req/s with Gunicorn + Gevent + CPython and ~1600 req/s Gunicorn + Gevent + PyPy on my 8 core PC.
Thanks.