Beanstalkd: putdelete benchmarks are 20 times slower on Linux

Created on 30 Jun 2019 · 8 Comments · Source: beanstalkd/beanstalkd

Linux (https://travis-ci.org/beanstalkd/beanstalkd/jobs/552358996):

ctbench_heap_insert        5000000           786 ns/op
ctbench_heap_remove        1000000          1347 ns/op
ctbench_make_job          10000000           249 ns/op
ctbench_put_delete_1k          100      39572775 ns/op       0.03 MB/s
ctbench_put_delete_8byte       100      39696999 ns/op       0.00 MB/s
ctbench_put_delete_8k          100      39573764 ns/op       0.26 MB/s

OS X (https://travis-ci.org/beanstalkd/beanstalkd/jobs/552358997):

ctbench_heap_insert        5000000           705 ns/op
ctbench_heap_remove        1000000          1102 ns/op
ctbench_make_job           5000000           384 ns/op
ctbench_put_delete_1k         5000        709160 ns/op       1.68 MB/s
ctbench_put_delete_8byte      5000        494429 ns/op       0.02 MB/s
ctbench_put_delete_8k         5000        550173 ns/op      19.74 MB/s
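
For reference, the gap between the two runs can be worked out directly from the ns/op columns. The short sketch below only redoes that arithmetic on the numbers quoted above; the dictionary and function names are made up for illustration. Two things stand out: the Linux put/delete time is almost the same (about 39.6 ms) for every payload size, and the ratio comes out well above 20.

```python
# ns/op figures copied from the two Travis runs quoted above.
linux_ns = {"1k": 39572775, "8byte": 39696999, "8k": 39573764}
osx_ns = {"1k": 709160, "8byte": 494429, "8k": 550173}


def slowdown(linux, osx):
    """Return how many times slower the Linux run is, per payload size."""
    return {size: linux[size] / osx[size] for size in linux}


ratios = slowdown(linux_ns, osx_ns)
for size, ratio in sorted(ratios.items()):
    print(f"put_delete_{size}: {ratio:.1f}x slower on Linux")
```

The ratios land roughly between 56x and 80x, depending on the payload size.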

I wonder if this is because clang optimizes at a higher level by default, or because of some problem with the generated code on Linux.

I also think the benchmark should run for more iterations.

/cc @sergeyklay

NeedsFix

All 8 comments

That is interesting, but Travis is still only a third-party service. I'd try a real benchmark outside of Travis, if one is available.

For me it would be more interesting to see the benchmark built with the -O2 flag for both compilers.

@potatosalad tested it using VMware Fusion and SmartOS version joyent_20190718T005708Z. The results are:

# make clean bench
SNIP
cc -Wall -Werror -Wformat=2 -g   -c -o sunos.o sunos.c
SNIP
.........................................................................
PASS
ctbench_heap_insert  3000000          1109 ns/op
ctbench_heap_remove  2000000          1002 ns/op
ctbench_make_job     5000000          1202 ns/op
ctbench_put_delete_1k      10000        208513 ns/op       5.10 MB/s
ctbench_put_delete_64k      3000        612422 ns/op     181.42 MB/s
ctbench_put_delete_8       10000        190305 ns/op       0.07 MB/s
ctbench_put_delete_8k      10000        249182 ns/op      39.98 MB/s

https://github.com/beanstalkd/beanstalkd/pull/202#issuecomment-513383738

Running this benchmark on a production Linux server:

ctbench_heap_insert  1000000          1087 ns/op
ctbench_heap_remove  1000000          1641 ns/op
ctbench_make_job     5000000           242 ns/op
ctbench_put_delete_1k        100      39608911 ns/op       0.03 MB/s
ctbench_put_delete_64k       100      39628394 ns/op       2.12 MB/s
ctbench_put_delete_8         100      39606559 ns/op       0.00 MB/s
ctbench_put_delete_8k        100      39606404 ns/op       0.26 MB/s

At the same time, running a simple Python client for N iterations:

#!/usr/bin/env python

import time
from sse import mq


if __name__ == '__main__':
    c = mq.get(use='default', port=11000)
    mq.empty(c)

    N = 100000
    size = 65535
    job = 'a' * size

    t = time.time()
    for i in range(N):
        jid = c.put(job)
        c.delete(jid)

    print float(size*N) / 1024 / 1024 / (time.time()-t), "Mb/sec"

Produces this on the same Linux production box:

484.369860938 Mb/sec

And on a local macOS box:

357.875060921 Mb/sec

This means that something is wrong with the benchmark itself and not with the server.
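
To make that comparison concrete, the client throughput can be converted into time per put+delete pair and set against the ctbench_put_delete_64k figure from the same Linux box. This is only a back-of-the-envelope sketch using the numbers quoted above; the helper name is made up, and it assumes the client's 65535-byte job is close enough to the benchmark's 64k payload to compare.

```python
MIB = 1024 * 1024  # the client script divides by 1024 twice, so its unit is MiB


def seconds_per_pair(mib_per_sec, payload_bytes):
    """Convert client throughput into seconds per put+delete pair."""
    pairs_per_sec = mib_per_sec * MIB / payload_bytes
    return 1.0 / pairs_per_sec


payload = 65535  # job size used by the client script
client_linux = seconds_per_pair(484.369860938, payload)
client_macos = seconds_per_pair(357.875060921, payload)
bench_linux = 39628394e-9  # ctbench_put_delete_64k, ns/op converted to seconds

print(f"client on Linux: {client_linux * 1e6:.0f} us per pair")
print(f"client on macOS: {client_macos * 1e6:.0f} us per pair")
print(f"ctbench on Linux: {bench_linux * 1e6:.0f} us per pair")
print(f"ctbench is about {bench_linux / client_linux:.0f}x slower than the client")
```

The same server handles a pair in well under a millisecond when driven by the client, but takes about 39.6 ms per pair under the benchmark harness.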

The reason for this slowness is the readline function:
https://github.com/beanstalkd/beanstalkd/blob/master/testserv.c#L204-L255

Specifically, it is the way the output is read char by char using the select system call. It seems to be slower on Linux.

@kr

> Specifically, it is the way the output is read char by char using the select system call. It seems to be slower on Linux.

Ah, I guess we should put a buffer in front of that fd.
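
The idea of putting a buffer in front of the fd can be sketched as follows. The real readline lives in testserv.c and is written in C; this is only a Python illustration of the two reading strategies, not the project's code or its eventual fix. The names readline_per_byte and BufferedLineReader are hypothetical, and the reply strings merely stand in for server responses.

```python
import select
import socket


def readline_per_byte(sock, timeout=1.0):
    """Read one line with a select() wait and a 1-byte recv() per character."""
    line = bytearray()
    while not line.endswith(b"\r\n"):
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            raise TimeoutError("no data before timeout")
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        line += chunk
    return bytes(line)


class BufferedLineReader:
    """Keep a buffer in front of the socket; hit the kernel only when it runs dry."""

    def __init__(self, sock, timeout=1.0, chunk_size=4096):
        self.sock = sock
        self.timeout = timeout
        self.chunk_size = chunk_size
        self.buf = bytearray()
        self.rounds = 0  # number of select()+recv() rounds, for illustration

    def readline(self):
        while True:
            end = self.buf.find(b"\r\n")
            if end != -1:
                # A full line is already buffered: return it without a syscall.
                line = bytes(self.buf[:end + 2])
                del self.buf[:end + 2]
                return line
            ready, _, _ = select.select([self.sock], [], [], self.timeout)
            if not ready:
                raise TimeoutError("no data before timeout")
            chunk = self.sock.recv(self.chunk_size)
            if not chunk:
                raise ConnectionError("peer closed the connection")
            self.rounds += 1
            self.buf += chunk


# Small demo over a local socket pair instead of a real server.
writer, reader_sock = socket.socketpair()
writer.sendall(b"INSERTED 1\r\nDELETED\r\n")

slow_line = readline_per_byte(reader_sock)  # one select()+recv() per byte
reader = BufferedLineReader(reader_sock)
fast_line = reader.readline()               # reads the rest in bulk

print(slow_line, fast_line, reader.rounds)
writer.close()
reader_sock.close()
```

The per-byte version pays for two system calls for every character of a reply, while the buffered version pays for them once per chunk and serves later lines from memory.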

Great job @ysmolsky
