I have a very simple rackup.
# config.ru
run ->(env) { [200, {"Content-Type" => "text/html"}, ["Hello World!"]] }
This app runs slower on TruffleRuby than MRI.
Benchmarking
On MRI
$ ab -n 1000 -c 10 http://127.0.0.1:9292/
...
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 9292
Document Path: /
Document Length: 12 bytes
Concurrency Level: 10
**Time taken for tests: 0.206 seconds**
Complete requests: 1000
Failed requests: 0
Total transferred: 76000 bytes
HTML transferred: 12000 bytes
Requests per second: 4857.62 [#/sec] (mean)
Time per request: 2.059 [ms] (mean)
Time per request: 0.206 [ms] (mean, across all concurrent requests)
Transfer rate: 360.53 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 0
Processing: 0 2 0.6 2 5
Waiting: 0 1 0.5 1 4
Total: 1 2 0.6 2 5
Percentage of the requests served within a certain time (ms)
50% 2
66% 2
75% 2
80% 2
90% 3
95% 3
98% 4
99% 4
100% 5 (longest request)
On TruffleRuby
$ ab -n 1000 -c 10 http://127.0.0.1:9292/
...
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 9292
Document Path: /
Document Length: 12 bytes
Concurrency Level: 10
**Time taken for tests: 0.718 seconds**
Complete requests: 1000
Failed requests: 0
Total transferred: 76000 bytes
HTML transferred: 12000 bytes
Requests per second: 1393.08 [#/sec] (mean)
Time per request: 7.178 [ms] (mean)
Time per request: 0.718 [ms] (mean, across all concurrent requests)
Transfer rate: 103.39 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 1 7 14.9 5 154
Waiting: 1 7 14.9 5 154
Total: 1 7 14.9 5 154
Percentage of the requests served within a certain time (ms)
50% 5
66% 6
75% 7
80% 7
90% 10
95% 11
98% 13
99% 152
100% 154 (longest request)
Initially, the test run was taking ~ 1.5 seconds on TruffleRuby. But, after executing the run 4-5 times, the time came down to ~0.7 seconds.
It looks like the issue is not related to rack. I wrote a pure ruby web server and the issue still appears.
require 'socket'
server = TCPServer.new 9292
while session = server.accept
start_time = Time.now
request = session.gets
session.print "HTTP/1.1 200\r\n"
session.print "Content-Type: text/html\r\n"
session.print "\r\n"
session.print "Hello world!"
end_time = Time.now
puts "#{end_time - start_time}"
session.close
end
I'm seeing some slightly different results to you.
I'm comparing against MRI 2.4 running your rack test server like this
ruby server-test.rb 2>/dev/null
where 'sever-test.rb' contains
require 'rack'
Rack::Handler::WEBrick.run(
->(env) { [200, {"Content-Type" => "text/html"}, ["Hello World!"]] },
:DoNotReverseLookup => true )
This way I can easily insert any extra commands to the run time if we need them later. I set the :DoNotReverseLookup option as reverse lookups can be expensive and I know it has changed between ruby versions so I wanted to be certain we'd be comparing like for like.
The result I get from MRI is
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software: WEBrick/1.3.1
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /
Document Length: 12 bytes
Concurrency Level: 10
Time taken for tests: 1.006 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 180000 bytes
HTML transferred: 12000 bytes
Requests per second: 994.20 [#/sec] (mean)
Time per request: 10.058 [ms] (mean)
Time per request: 1.006 [ms] (mean, across all concurrent requests)
Transfer rate: 174.76 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 3 10 21.4 7 276
Waiting: 1 7 11.2 6 219
Total: 3 10 21.4 7 276
Percentage of the requests served within a certain time (ms)
50% 7
66% 8
75% 8
80% 8
90% 10
95% 11
98% 20
99% 148
100% 276 (longest request)
testing with the GraalVM 1.0 RC5 release I see
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software: WEBrick/1.3.1
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /
Document Length: 12 bytes
Concurrency Level: 10
Time taken for tests: 0.555 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 180000 bytes
HTML transferred: 12000 bytes
Requests per second: 1802.85 [#/sec] (mean)
Time per request: 5.547 [ms] (mean)
Time per request: 0.555 [ms] (mean, across all concurrent requests)
Transfer rate: 316.91 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 1 5 7.4 4 64
Waiting: 1 5 7.4 4 64
Total: 1 5 7.4 4 64
Percentage of the requests served within a certain time (ms)
50% 4
66% 4
75% 5
80% 5
90% 6
95% 7
98% 33
99% 63
100% 64 (longest request)
by about the sixth run, using the default --native option and roughly the same with the --jvm option.
This is going to be sensitive to a number of factors, so it would be good to get a little more info on the machine you're running on. I'm testing on an 2.4GHz i5-6300U CPU 16GB laptop running Oracle Linux 7.5.
FWIW on this sort of benchmark our biggest cost is the creation of an OS thread for each request by WEBrick. I was able to more than halve the average request time and significantly reduce the variance by patching it to use a work queue and a set of worker threads like this
module WEBrick
class GenericServer
def start(&block)
raise ServerError, "already started." if @status != :Stop
server_type = @config[:ServerType] || SimpleServer
server_type.start{
@logger.info \
"#{self.class}#start: pid=#{$$} port=#{@config[:Port]}"
call_callback(:StartCallback)
thgroup = ThreadGroup.new
@status = :Running
begin
work_q = start_thread_pool(thgroup)
while @status == :Running
begin
if svrs = IO.select(@listeners, nil, nil, 2.0)
svrs[0].each{|svr|
@tokens.pop # blocks while no token is there.
if sock = accept_client(svr)
sock.do_not_reverse_lookup = config[:DoNotReverseLookup]
work_q.push([sock, block])
else
@tokens.push(nil)
end
}
end
rescue Errno::EBADF, IOError => ex
# if the listening socket was closed in GenericServer#shutdown,
# IO::select raise it.
rescue StandardError => ex
msg = "#{ex.class}: #{ex.message}\n\t#{ex.backtrace[0]}"
@logger.error msg
rescue Exception => ex
@logger.fatal ex
raise
end
end
ensure
work_q.close
@status = :Shutdown
@logger.info "going to shutdown ..."
thgroup.list.each{|th| th.join if th[:WEBrickThread] }
call_callback(:StopCallback)
@logger.info "#{self.class}#start done."
@status = :Stop
end
}
end
def start_thread(sock, &block)
Thread.start{
process_connection(sock, &block)
}
end
def start_thread_pool(group, threads = 4)
work_q = Queue.new
threads.times do
th = Thread.start {
while w = work_q.pop()
sock = w[0]
block = w[1]
process_connection(sock, &block)
end
}
th[:WEBrickThread] = true
group.add(th)
end
work_q
end
def process_connection(sock, &block)
begin
Thread.current[:WEBrickSocket] = sock
begin
addr = sock.peeraddr
@logger.debug "accept: #{addr[3]}:#{addr[1]}"
rescue SocketError
@logger.debug "accept: <address unknown>"
raise
end
call_callback(:AcceptCallback, sock)
block ? block.call(sock) : run(sock)
rescue Errno::ENOTCONN
@logger.debug "Errno::ENOTCONN raised"
rescue ServerError => ex
msg = "#{ex.class}: #{ex.message}\n\t#{ex.backtrace[0]}"
@logger.error msg
rescue Exception => ex
@logger.error ex
ensure
@tokens.push(nil)
Thread.current[:WEBrickSocket] = nil
if addr
@logger.debug "close: #{addr[3]}:#{addr[1]}"
else
@logger.debug "close: <address unknown>"
end
sock.close unless sock.closed?
end
end
end
end
I do get much closer results to you if I use Puma as my server. I'll need to do some profiling to work out exactly what takes the time.
Just a quick update on progress I've made on this issue.
One of the big problems we had was that Puma's thread pool implementation was not running well on TruffleRuby. I wrote a small benchmark to test this and have been optimising ConditionVariable and Mutex operations. We're now much closer to MRI on this benchmark, and I hope to make some more progress on Monday.
The other area which is slower than I'd like is running the C extension to parse http requests. We can improve things a bit by removing the mutex round calls to the extension, but I think there are some other changes we can make which will help as well.
Any updates on basic Rack/Sinatra-based performance?
I've repeated the test discussed in #1406, which yielded the following results:
(truffleruby 19.2.0)> wrk -c 10 -t 5 -d 60 http://localhost:9292/
Running 1m test @ http://localhost:9292/
5 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 89.85ms 224.30ms 1.45s 90.44%
Req/Sec 675.82 437.38 1.46k 48.39%
149705 requests in 1.00m, 26.41MB read
Socket errors: connect 0, read 0, write 0, timeout 10
Requests/sec: 2492.10
Transfer/sec: 450.24KB
(Ruby 2.6.2)> wrk -c 10 -t 5 -d 60 http://localhost:9292/
Running 1m test @ http://localhost:9292/
5 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.57ms 1.75ms 60.73ms 91.53%
Req/Sec 827.08 160.90 1.07k 67.07%
247059 requests in 1.00m, 43.59MB read
Requests/sec: 4114.73
Transfer/sec: 743.39KB
This was after letting the simple app "warm up" with thousands of requests. To me, this use case (smaller, service-based web apps) is far more important than the Rails case. With all of the performance promised by (and benchmarks for) TruffleRuby, what makes this use case perform poorly?
(cc/ @aardvark179 )
Related question (cc/ @eregon): Is this slowness due to Sinatra specifically, TruffleRuby's HTTP handling in general, or something else? Just trying to determine if I can get decent performance out of TruffleRuby if the interface to the app is HTTP.
We had somewhat put this on hold while we improved general compatibility of C extensions, and then worked on other performance challenges round that change. It is probably now time to investigate Puma performance again with these changes in place.
I'll do some investigation, but my initial guess would be that our main performance blocker will by in Puma's C extensions, and how well Sulong can JIT that, but I will investigate. It is important we get generated C parsers (such as Puma's HTTP one) fast as they are used in several important gems.
Most helpful comment
Just a quick update on progress I've made on this issue.
One of the big problems we had was that Puma's thread pool implementation was not running well on TruffleRuby. I wrote a small benchmark to test this and have been optimising
ConditionVariableandMutexoperations. We're now much closer to MRI on this benchmark, and I hope to make some more progress on Monday.The other area which is slower than I'd like is running the C extension to parse http requests. We can improve things a bit by removing the mutex round calls to the extension, but I think there are some other changes we can make which will help as well.