linux https://travis-ci.org/beanstalkd/beanstalkd/jobs/552358996:
ctbench_heap_insert 5000000 786 ns/op
ctbench_heap_remove 1000000 1347 ns/op
ctbench_make_job 10000000 249 ns/op
ctbench_put_delete_1k 100 39572775 ns/op 0.03 MB/s
ctbench_put_delete_8byte 100 39696999 ns/op 0.00 MB/s
ctbench_put_delete_8k 100 39573764 ns/op 0.26 MB/s
osx https://travis-ci.org/beanstalkd/beanstalkd/jobs/552358997:
ctbench_heap_insert 5000000 705 ns/op
ctbench_heap_remove 1000000 1102 ns/op
ctbench_make_job 5000000 384 ns/op
ctbench_put_delete_1k 5000 709160 ns/op 1.68 MB/s
ctbench_put_delete_8byte 5000 494429 ns/op 0.02 MB/s
ctbench_put_delete_8k 5000 550173 ns/op 19.74 MB/s
I wonder if this is because clang optimizes at a higher level by default, or because of some problem with the generated code on Linux.
I also think the benchmark should run more iterations.
/cc @sergeyklay
That is interesting, but it's still only a third-party (Travis) service. A real off-site benchmark is what I'd try, if available.
For me it would be more interesting to see the benchmark with the -O2 flag for both compilers.
@potatosalad tested it: using VMware Fusion and SmartOS version joyent_20190718T005708Z the results are:
# make clean bench
SNIP
cc -Wall -Werror -Wformat=2 -g -c -o sunos.o sunos.c
SNIP
.........................................................................
PASS
ctbench_heap_insert 3000000 1109 ns/op
ctbench_heap_remove 2000000 1002 ns/op
ctbench_make_job 5000000 1202 ns/op
ctbench_put_delete_1k 10000 208513 ns/op 5.10 MB/s
ctbench_put_delete_64k 3000 612422 ns/op 181.42 MB/s
ctbench_put_delete_8 10000 190305 ns/op 0.07 MB/s
ctbench_put_delete_8k 10000 249182 ns/op 39.98 MB/s
https://github.com/beanstalkd/beanstalkd/pull/202#issuecomment-513383738
Running this benchmark on some production Linux server:
ctbench_heap_insert 1000000 1087 ns/op
ctbench_heap_remove 1000000 1641 ns/op
ctbench_make_job 5000000 242 ns/op
ctbench_put_delete_1k 100 39608911 ns/op 0.03 MB/s
ctbench_put_delete_64k 100 39628394 ns/op 2.12 MB/s
ctbench_put_delete_8 100 39606559 ns/op 0.00 MB/s
ctbench_put_delete_8k 100 39606404 ns/op 0.26 MB/s
At the same time, running a simple Python client for N iterations:
#!/usr/bin/env python
import time
from sse import mq

if __name__ == '__main__':
    c = mq.get(use='default', port=11000)
    mq.empty(c)
    N = 100000
    size = 65535
    job = 'a' * size
    t = time.time()
    for i in range(N):
        jid = c.put(job)
        c.delete(jid)
    print float(size*N) / 1024 / 1024 / (time.time()-t), "Mb/sec"
Produces this on the same Linux production box:
484.369860938 Mb/sec
And on local macOS box:
357.875060921 Mb/sec
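The `sse.mq` wrapper used in the script is not shown in this thread, so here is a rough equivalent over a plain socket for anyone who wants to reproduce the measurement. It is only a sketch: `build_put` and `put_delete_loop` are names made up for this example, the priority/delay/TTR values are arbitrary, and it assumes the server accepts a job of the given size and replies `INSERTED <id>` and `DELETED` as the beanstalkd protocol describes.

```python
import socket
import time


def build_put(body, pri=0, delay=0, ttr=60):
    """Encode a beanstalkd `put` command for a bytes payload."""
    header = "put %d %d %d %d\r\n" % (pri, delay, ttr, len(body))
    return header.encode("ascii") + body + b"\r\n"


def put_delete_loop(host, port, n, size):
    """Put and delete `n` jobs of `size` bytes; return payload MiB/s."""
    body = b"a" * size
    with socket.create_connection((host, port)) as sock:
        reader = sock.makefile("rb")  # buffered reads of the reply lines
        start = time.time()
        for _ in range(n):
            sock.sendall(build_put(body))
            reply = reader.readline().split()  # expected: INSERTED <id>
            if reply[:1] != [b"INSERTED"]:
                raise RuntimeError("unexpected reply: %r" % (reply,))
            sock.sendall(b"delete " + reply[1] + b"\r\n")
            if reader.readline().strip() != b"DELETED":
                raise RuntimeError("delete failed")
        elapsed = time.time() - start
    # Same arithmetic as the script above: bytes -> MiB, divided by seconds.
    return size * n / 1024.0 / 1024.0 / elapsed
```

Something like `put_delete_loop("127.0.0.1", 11300, 1000, 65535)` would run it against a local server on the default port (the script above used 11000). Note that the replies here are read through a buffered file object, not one byte at a time.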
This means that something is wrong with the benchmark itself, not with the server.
The reason for this slowness is the readline function:
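To put the two Linux numbers on the same scale, both can be converted to operations per second. This is just back-of-the-envelope arithmetic on the figures quoted above; the helper names are made up, and it assumes the benchmark's 64k job is comparable to the client's 65535-byte job and that one "op" in both cases is one put plus one delete.

```python
def ops_per_sec_from_ns(ns_per_op):
    """Convert a ns/op benchmark figure to operations per second."""
    return 1e9 / ns_per_op


def ops_per_sec_from_mib(mib_per_sec, job_size):
    """Convert the client's throughput (bytes / 1024 / 1024 per second) to ops/s."""
    return mib_per_sec * 1024 * 1024 / job_size


# ctbench_put_delete_64k on the production Linux server
bench_ops = ops_per_sec_from_ns(39628394)
# Python client on the same server, 65535-byte jobs
client_ops = ops_per_sec_from_mib(484.369860938, 65535)

print("bench:  %.1f ops/s" % bench_ops)
print("client: %.1f ops/s" % client_ops)
print("client is about %.0fx faster" % (client_ops / bench_ops))
```

The benchmark comes out at a few dozen operations per second while the client manages several thousand against the same server, a gap of a few hundred times.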
https://github.com/beanstalkd/beanstalkd/blob/master/testserv.c#L204-L255
Specifically, the way the output is read char by char using the select system call. It seems to be slower on Linux.
@kr
> Specifically, the way the output is read char by char using the select system call. It seems to be slower on Linux.
Ah, I guess we should put a buffer in front of that fd.
Great job @ysmolsky
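For illustration, here is roughly what "a buffer in front of that fd" could look like. This is a Python sketch of the idea, not the C code from testserv.c: `readline_unbuffered` and `BufferedLineReader` are hypothetical names, and the timeout and chunk size are arbitrary. The first function pays one select and one recv per byte; the second pays them once per chunk and serves later lines from memory.

```python
import select
import socket


def readline_unbuffered(sock, timeout=5.0):
    """Read one line a byte at a time, calling select() before every byte."""
    line = bytearray()
    while not line.endswith(b"\n"):
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            raise TimeoutError("timed out waiting for a byte")
        chunk = sock.recv(1)
        if not chunk:  # peer closed the connection
            break
        line += chunk
    return bytes(line)


class BufferedLineReader:
    """Read lines from a socket through an in-memory buffer."""

    def __init__(self, sock, timeout=5.0, chunk_size=4096):
        self.sock = sock
        self.timeout = timeout
        self.chunk_size = chunk_size
        self.buf = bytearray()

    def readline(self):
        while True:
            end = self.buf.find(b"\n")
            if end >= 0:
                # A full line is already buffered: no system call needed.
                line = bytes(self.buf[:end + 1])
                del self.buf[:end + 1]
                return line
            ready, _, _ = select.select([self.sock], [], [], self.timeout)
            if not ready:
                raise TimeoutError("timed out waiting for a line")
            chunk = self.sock.recv(self.chunk_size)
            if not chunk:  # peer closed: return whatever is left
                line = bytes(self.buf)
                self.buf.clear()
                return line
            self.buf += chunk
```

With short replies such as `INSERTED <id>`, a whole reply normally arrives in one chunk, so the buffered reader needs about one select and one recv per line where the unbuffered one needs one pair per character.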