Truffleruby: 'Hello World' Rack app is slow on TruffleRuby

Created on 8 Aug 2018  路  7Comments  路  Source: oracle/truffleruby

I have a very simple rackup.

# config.ru
run ->(env) { [200, {"Content-Type" => "text/html"}, ["Hello World!"]] }

This app runs slower on TruffleRuby than MRI.

Benchmarking

On MRI

$ ab -n 1000 -c 10 http://127.0.0.1:9292/
...

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        127.0.0.1
Server Port:            9292

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      10
**Time taken for tests:   0.206 seconds**
Complete requests:      1000
Failed requests:        0
Total transferred:      76000 bytes
HTML transferred:       12000 bytes
Requests per second:    4857.62 [#/sec] (mean)
Time per request:       2.059 [ms] (mean)
Time per request:       0.206 [ms] (mean, across all concurrent requests)
Transfer rate:          360.53 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:     0    2   0.6      2       5
Waiting:        0    1   0.5      1       4
Total:          1    2   0.6      2       5

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      3
  95%      3
  98%      4
  99%      4
 100%      5 (longest request)

On TruffleRuby

$ ab -n 1000 -c 10 http://127.0.0.1:9292/
...
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        127.0.0.1
Server Port:            9292

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      10
**Time taken for tests:   0.718 seconds**
Complete requests:      1000
Failed requests:        0
Total transferred:      76000 bytes
HTML transferred:       12000 bytes
Requests per second:    1393.08 [#/sec] (mean)
Time per request:       7.178 [ms] (mean)
Time per request:       0.718 [ms] (mean, across all concurrent requests)
Transfer rate:          103.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     1    7  14.9      5     154
Waiting:        1    7  14.9      5     154
Total:          1    7  14.9      5     154

Percentage of the requests served within a certain time (ms)
  50%      5
  66%      6
  75%      7
  80%      7
  90%     10
  95%     11
  98%     13
  99%    152
 100%    154 (longest request)

Initially, the test run was taking ~ 1.5 seconds on TruffleRuby. But, after executing the run 4-5 times, the time came down to ~0.7 seconds.

performance

Most helpful comment

Just a quick update on progress I've made on this issue.

One of the big problems we had was that Puma's thread pool implementation was not running well on TruffleRuby. I wrote a small benchmark to test this and have been optimising ConditionVariable and Mutex operations. We're now much closer to MRI on this benchmark, and I hope to make some more progress on Monday.

The other area which is slower than I'd like is running the C extension to parse http requests. We can improve things a bit by removing the mutex round calls to the extension, but I think there are some other changes we can make which will help as well.

All 7 comments

It looks like the issue is not related to rack. I wrote a pure ruby web server and the issue still appears.

require 'socket'

server = TCPServer.new 9292

while session = server.accept
  start_time = Time.now
  request = session.gets

  session.print "HTTP/1.1 200\r\n"
  session.print "Content-Type: text/html\r\n"
  session.print "\r\n"
  session.print "Hello world!"

  end_time = Time.now
  puts "#{end_time - start_time}"

  session.close
end

I'm seeing some slightly different results to you.
I'm comparing against MRI 2.4 running your rack test server like this

ruby server-test.rb 2>/dev/null

where 'sever-test.rb' contains

require 'rack'
Rack::Handler::WEBrick.run(
  ->(env) { [200, {"Content-Type" => "text/html"}, ["Hello World!"]] },
  :DoNotReverseLookup => true )

This way I can easily insert any extra commands to the run time if we need them later. I set the :DoNotReverseLookup option as reverse lookups can be expensive and I know it has changed between ruby versions so I wanted to be certain we'd be comparing like for like.

The result I get from MRI is

This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        WEBrick/1.3.1
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      10
Time taken for tests:   1.006 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      180000 bytes
HTML transferred:       12000 bytes
Requests per second:    994.20 [#/sec] (mean)
Time per request:       10.058 [ms] (mean)
Time per request:       1.006 [ms] (mean, across all concurrent requests)
Transfer rate:          174.76 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     3   10  21.4      7     276
Waiting:        1    7  11.2      6     219
Total:          3   10  21.4      7     276

Percentage of the requests served within a certain time (ms)
  50%      7
  66%      8
  75%      8
  80%      8
  90%     10
  95%     11
  98%     20
  99%    148
 100%    276 (longest request)

testing with the GraalVM 1.0 RC5 release I see

This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        WEBrick/1.3.1
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      10
Time taken for tests:   0.555 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      180000 bytes
HTML transferred:       12000 bytes
Requests per second:    1802.85 [#/sec] (mean)
Time per request:       5.547 [ms] (mean)
Time per request:       0.555 [ms] (mean, across all concurrent requests)
Transfer rate:          316.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     1    5   7.4      4      64
Waiting:        1    5   7.4      4      64
Total:          1    5   7.4      4      64

Percentage of the requests served within a certain time (ms)
  50%      4
  66%      4
  75%      5
  80%      5
  90%      6
  95%      7
  98%     33
  99%     63
 100%     64 (longest request)

by about the sixth run, using the default --native option and roughly the same with the --jvm option.

This is going to be sensitive to a number of factors, so it would be good to get a little more info on the machine you're running on. I'm testing on an 2.4GHz i5-6300U CPU 16GB laptop running Oracle Linux 7.5.

FWIW on this sort of benchmark our biggest cost is the creation of an OS thread for each request by WEBrick. I was able to more than halve the average request time and significantly reduce the variance by patching it to use a work queue and a set of worker threads like this

module WEBrick
  class GenericServer
    def start(&block)
      raise ServerError, "already started." if @status != :Stop
      server_type = @config[:ServerType] || SimpleServer

      server_type.start{
        @logger.info \
          "#{self.class}#start: pid=#{$$} port=#{@config[:Port]}"
        call_callback(:StartCallback)

        thgroup = ThreadGroup.new
        @status = :Running
        begin
          work_q = start_thread_pool(thgroup)
          while @status == :Running
            begin
              if svrs = IO.select(@listeners, nil, nil, 2.0)
                svrs[0].each{|svr|
                  @tokens.pop          # blocks while no token is there.
                  if sock = accept_client(svr)
                    sock.do_not_reverse_lookup = config[:DoNotReverseLookup]
                    work_q.push([sock, block])
                  else
                    @tokens.push(nil)
                  end
                }
              end
            rescue Errno::EBADF, IOError => ex
              # if the listening socket was closed in GenericServer#shutdown,
              # IO::select raise it.
            rescue StandardError => ex
              msg = "#{ex.class}: #{ex.message}\n\t#{ex.backtrace[0]}"
              @logger.error msg
            rescue Exception => ex
              @logger.fatal ex
              raise
            end
          end

        ensure
          work_q.close
          @status = :Shutdown
          @logger.info "going to shutdown ..."
          thgroup.list.each{|th| th.join if th[:WEBrickThread] }
          call_callback(:StopCallback)
          @logger.info "#{self.class}#start done."
          @status = :Stop
        end
      }
    end

    def start_thread(sock, &block)
      Thread.start{
        process_connection(sock, &block)
      }
    end

    def start_thread_pool(group, threads = 4)
      work_q = Queue.new
      threads.times do
        th = Thread.start {
          while w = work_q.pop()
            sock = w[0]
            block = w[1]
            process_connection(sock, &block)
          end
        }
        th[:WEBrickThread] = true
        group.add(th)
      end
      work_q
    end

    def process_connection(sock, &block)
      begin
        Thread.current[:WEBrickSocket] = sock
        begin
          addr = sock.peeraddr
          @logger.debug "accept: #{addr[3]}:#{addr[1]}"
        rescue SocketError
          @logger.debug "accept: <address unknown>"
          raise
        end
        call_callback(:AcceptCallback, sock)
        block ? block.call(sock) : run(sock)
      rescue Errno::ENOTCONN
        @logger.debug "Errno::ENOTCONN raised"
      rescue ServerError => ex
        msg = "#{ex.class}: #{ex.message}\n\t#{ex.backtrace[0]}"
        @logger.error msg
      rescue Exception => ex
        @logger.error ex
      ensure
        @tokens.push(nil)
        Thread.current[:WEBrickSocket] = nil
        if addr
          @logger.debug "close: #{addr[3]}:#{addr[1]}"
        else
          @logger.debug "close: <address unknown>"
        end
        sock.close unless sock.closed?
      end
    end
  end
end

I do get much closer results to you if I use Puma as my server. I'll need to do some profiling to work out exactly what takes the time.

Just a quick update on progress I've made on this issue.

One of the big problems we had was that Puma's thread pool implementation was not running well on TruffleRuby. I wrote a small benchmark to test this and have been optimising ConditionVariable and Mutex operations. We're now much closer to MRI on this benchmark, and I hope to make some more progress on Monday.

The other area which is slower than I'd like is running the C extension to parse http requests. We can improve things a bit by removing the mutex round calls to the extension, but I think there are some other changes we can make which will help as well.

Any updates on basic Rack/Sinatra-based performance?

I've repeated the test discussed in #1406, which yielded the following results:

(truffleruby 19.2.0)> wrk -c 10 -t 5 -d 60 http://localhost:9292/
Running 1m test @ http://localhost:9292/
  5 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    89.85ms  224.30ms   1.45s    90.44%
    Req/Sec   675.82    437.38     1.46k    48.39%
  149705 requests in 1.00m, 26.41MB read
  Socket errors: connect 0, read 0, write 0, timeout 10
Requests/sec:   2492.10
Transfer/sec:    450.24KB
(Ruby 2.6.2)> wrk -c 10 -t 5 -d 60 http://localhost:9292/
Running 1m test @ http://localhost:9292/
  5 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.57ms    1.75ms  60.73ms   91.53%
    Req/Sec   827.08    160.90     1.07k    67.07%
  247059 requests in 1.00m, 43.59MB read
Requests/sec:   4114.73
Transfer/sec:    743.39KB

This was after letting the simple app "warm up" with thousands of requests. To me, this use case (smaller, service-based web apps) is far more important than the Rails case. With all of the performance promised by (and benchmarks for) TruffleRuby, what makes this use case perform poorly?

(cc/ @aardvark179 )

Related question (cc/ @eregon): Is this slowness due to Sinatra specifically, TruffleRuby's HTTP handling in general, or something else? Just trying to determine if I can get decent performance out of TruffleRuby if the interface to the app is HTTP.

We had somewhat put this on hold while we improved general compatibility of C extensions, and then worked on other performance challenges round that change. It is probably now time to investigate Puma performance again with these changes in place.

I'll do some investigation, but my initial guess would be that our main performance blocker will by in Puma's C extensions, and how well Sulong can JIT that, but I will investigate. It is important we get generated C parsers (such as Puma's HTTP one) fast as they are used in several important gems.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

deepj picture deepj  路  16Comments

deepj picture deepj  路  20Comments

andrewsheelan picture andrewsheelan  路  22Comments

hsbt picture hsbt  路  28Comments

deepj picture deepj  路  41Comments