I noticed beanstalkd is compiled without optimisations. Adding -O3 to CFLAGS gives a new warning that is converted to an error, but otherwise it compiles fine.
Should optimisations be enabled?
For reference, would you mind sharing the warning/error?
file.c: In function ‘filewclose’:
file.c:536:9: error: ignoring return value of ‘ftruncate’, declared with attribute warn_unused_result [-Werror=unused-result]
Hello @Minnozz,
Try replacing:
(void)ftruncate(f->fd, f->w->filesize - f->free);
in file.c line 536 with
ftruncate(f->fd, f->w->filesize - f->free);
Do you still have this issue with this change?
@JensRantil any idea if somebody is maintaining this project? @kr doesn't seem to have replied to the community for a while.
Silviu
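For context on the error above: GCC generally does not treat a `(void)` cast as "using" the result of a function declared with `warn_unused_result`, so neither of the two forms quoted above is guaranteed to silence it. One way to satisfy the compiler is to actually test the return value. The sketch below is only an illustration under that assumption; `truncate_checked` is a hypothetical helper name, not something that exists in beanstalkd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of beanstalkd): shrink the file behind fd
 * to len bytes and check the result instead of discarding it.
 * Returns 0 on success, -1 on failure.
 */
static int
truncate_checked(int fd, off_t len)
{
    struct stat st;

    assert(len >= 0);
    if (ftruncate(fd, len) == -1) {
        /* The return value is consumed here, so -Wunused-result stays quiet. */
        fprintf(stderr, "ftruncate: %s\n", strerror(errno));
        return -1;
    }
    /* Confirm the new size; purely illustrative. */
    if (fstat(fd, &st) == -1 || st.st_size != len) {
        return -1;
    }
    return 0;
}
```

In `filewclose` the call could then look something like `truncate_checked(f->fd, f->w->filesize - f->free)`, with the caller deciding what to do on failure.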
Is there a problem that adding this flag is meant to solve?
Hello @kr, programs compiled with optimizations enabled usually perform better. It certainly won't fix any other problem, but I also don't see how it could bring anything bad, only an extra performance boost. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
@silviucpp: I created a separate issue about the error (#324).
@kr: I am investigating performance issues related to a beanstalkd instance with high throughput, and noticed that its CPU usage was high. I looked at the build options and wondered why optimisations were not enabled, since that seems like an easy way to get "free" extra performance.
Enabling C compiler optimizations is not free. It significantly increases the risk that exercising undefined behavior will crash the process, corrupt data, or introduce security vulnerabilities. These risks are still present without optimization, of course, but they get worse the more aggressive the optimizer is.
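As one illustration of the kind of risk described above (an assumed textbook example, not taken from beanstalkd's code), consider an overflow check that itself relies on signed overflow. Because signed overflow is undefined behavior in C, an optimizing compiler is allowed to assume it never happens and may remove the check entirely. The function names below are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Relies on signed overflow, which is undefined behavior in C.
 * An optimizer may assume x + 1 never overflows and fold this to
 * "return 1", so the check can silently disappear at -O2 or -O3.
 */
static int
next_is_larger_unsafe(int x)
{
    return x + 1 > x;
}

/* Well-defined alternative: compare against INT_MAX before adding. */
static int
next_is_larger_safe(int x)
{
    return x < INT_MAX;
}

/* Self-check that only uses inputs where both versions are well defined. */
static void
check_agree(void)
{
    assert(next_is_larger_unsafe(41) == next_is_larger_safe(41));
    assert(next_is_larger_safe(INT_MAX) == 0);
}
```

The unsafe version may appear to work in an unoptimized build and then change behavior once optimizations are turned on, which is why the flag change is not risk-free.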
We need to start with a clearly-defined problem, such as "95th percentile response latency is over 100ms on hardware X with load pattern Y", and then we can look at possible solutions. One possible solution is to do some profiling and perhaps edit the beanstalkd code to improve performance. Another possible solution is to enable some compiler optimizations (either individually under their various flags, or as a group with -O2 or -O3), if the benefit outweighs the risk. We'll need measurements demonstrating the performance difference to help weigh that tradeoff.
I'm interested to know more details about the high CPU usage you described. Is it reproducible? What sort of hardware configuration was that on? What else was running on the box at the time? What sort of load patterns did you have (e.g. working set size, throughput, job size distribution)?
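To make a problem statement like "95th percentile response latency is over 100ms" measurable, one needs a way to turn raw latency samples into a percentile. A minimal sketch follows, assuming the samples are already collected as milliseconds in an array and using the nearest-rank definition (other percentile definitions exist and give slightly different numbers). `percentile_ms` and `cmp_double` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int
cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile: sort the samples and return the value at
 * rank ceil(pct / 100 * n). Requires n > 0 and pct in 1..100.
 * Note: sorts the caller's array in place.
 */
static double
percentile_ms(double *samples, size_t n, unsigned pct)
{
    size_t rank;

    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof samples[0], cmp_double);
    rank = (pct * n + 99) / 100;    /* integer ceil(pct * n / 100) */
    return samples[rank - 1];
}
```

A report could then state, for example, whether `percentile_ms(latencies, n, 95)` exceeds 100 on a given machine and load pattern, once for each set of build flags.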
Unfortunately I do not have detailed measurements and cannot easily reproduce the problem, since the workload that caused it was much higher than our average load.
What I do know from the time of the peak load:
(I do realise this is probably an unconventional usage pattern for beanstalkd.)
Hardware: virtual machine with 4 vCPUs and 4GB RAM running on a 2x Xeon E5645. There are four separate beanstalkd instances running on that VM and nothing more: one with binlog enabled and high traffic (described above) and three without binlog and with low traffic. The one with binlog and high traffic was CPU bound, so it was pretty much using one 2.40GHz core.
Hello,
I just benchmarked the latest master on my Mac, compiled as it is right now and also with -O3. The difference is around 1.5k more pushes per second when compiling with -O3 (from 83k to 84.5k per second on average).
Silviu
Thanks for running a test @silviucpp! It might not be entirely a fair test if the client was on the same machine as the beanstalkd process, but it's interesting to see what ballpark we're in.
Looks like, in that test, -O3 is about 1.8% faster than -O0.
Also note that "1.2 million jobs reserved in 2.5 hours" is about 133 jobs/s. This is significantly less than the 83,000 reqs/s with no optimization in the test above.
This isn't an apples-to-apples comparison because undoubtedly these numbers involve different hardware with different load patterns, different working set size, etc. (And perhaps the reqs/s number includes several requests per job.) Still, conservatively, we're talking about a couple orders of magnitude of headroom.
It seems pretty likely that the high CPU use mentioned above is due to a bug. I would be surprised if enabling compiler optimizations makes even a dent.
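The back-of-the-envelope figures above can be reproduced with a couple of small helpers. This is only a sketch of the arithmetic; `percent_gain` and `per_second` are hypothetical names, and the inputs are the numbers quoted in this thread.

```c
#include <assert.h>

/* Relative gain of `optimized` over `baseline`, in percent. */
static double
percent_gain(double baseline, double optimized)
{
    assert(baseline > 0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Average rate in events per second over a span given in hours. */
static double
per_second(double events, double hours)
{
    assert(hours > 0);
    return events / (hours * 3600.0);
}
```

With the quoted numbers, `percent_gain(83000, 84500)` comes out at roughly 1.8 and `per_second(1.2e6, 2.5)` at roughly 133. Dividing 83,000 by that rate gives a factor of a little over 600, which is where the "couple orders of magnitude" estimate comes from.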
@kr I will try to repeat the test tomorrow between two machines on the same LAN. The current test has both client and server on the same machine. I doubt there will be a huge difference, though, because beanstalkd is single-threaded and single-process and cannot use more than one core.
I was thinking of creating a fork and changing the server to use http://www.seastar-project.org/ and to be multi-threaded, to see what performance boost I can get, even though the current performance is more than OK.
Silviu
Hello,
I repeated the benchmark on a Linux machine (Ubuntu 14.04) with 24 cores (not that the core count matters, since beanstalkd is single-threaded and single-process). Surprisingly, even though the machine is more powerful than the MacBook Pro I tested on last time, the results are much lower.
On average I get 33729 put/sec with the default build and 39441 with -O3. In all scenarios I tested, the beanstalkd process used one core at 100%, so the client sending the requests is not the bottleneck.
I tested from the same machine and also from another machine on the LAN with the same config, and the results were the same.
The only explanation I have for getting around 83.5k req/sec on the Mac, even though its hardware is weaker, is that beanstalkd uses kqueue on the Mac and epoll on Linux. I know that there is a performance difference between these two.
Silviu
Hello! There hasn't been much progress on this thread and since the objective of the issue is somewhat unclear I'm going to close this. Let me know if you believe it should be reopened and in that case, what exact problem we need to solve.