Horizon: Performance hit after upgrade to 3.2.1

Created on 28 May 2019  Â·  8Comments  Â·  Source: laravel/horizon

  • Horizon Version: 3.2.1
  • Laravel Version: 5.8.17
  • PHP Version: 7.2.18
  • Redis Driver & Version: predis 1.1.1
  • Database Driver & Version: MySQL 5.7.25

Description:

I noticed a performance hit in production after upgrading to Horizon 3.2.1, which is probably in relation to https://github.com/laravel/horizon/pull/589. This is on a EC2 t2 instance, with constant workload, the hit is an additional 5-6% of CPU usage.

See graph below, we upgraded to 3.2.1 on May 22 (you will notice also that the CPU credits went down to zero way faster after the upgrade):

Screenshot 2019-05-28 at 18 53 38

And this graph is after reverting to 3.1.2:

Screenshot 2019-05-28 at 18 56 16

Steps To Reproduce:

Upgrade Horizon from 3.1.2 to 3.2.1.

bug

Most helpful comment

The original PR that caused the performance hit was reverted.

All 8 comments

Can you check if this is related to the scheduler or one of the other processes?

How can I check that?

I think this may be because it now calls ps twice, once to find out if it need to auto-scale and once to get the current system load. Unfortunately on my system the performance difference is so minute that I'm unable to tell the difference.

I have 3 options for possible improvements.
1: Differentiate the two calls so that the auto-scaler only get the list of commands and not CPU and mem load.
2: Only call ps once and then store the values in the class for the telementry
3: Tightly couple telementry and autoscaling

The pros:
1: Only getting the relevant data for the caller is probably good idea
2: Only one call will be made each tick
3: Only one call will be made each tick

The cons:
1: It is unclear from my tests if ps collects less data or simply just dosen't return it, so the savings might be negligible.
2: If caller A is no longer called each tick in a later version then caller B will get stale data
3: Tightly coupling things is a bad idea

Option 4: boolean to enable / disable telemetry?

For option 1, if you can give me an example of ps command, I can try to benchmark it on my EC2 t2 instance.

exec ps axo %cpu,%mem,command | grep "supervisor=SomeHorizonName" | grep -v "grep"
vs
exec ps axo command | grep "supervisor=SomeHorizonName" | grep -v "grep"

Over 50,000 iterations, no significant difference...

➜  time ./test1.sh
./test1.sh  83.36s user 146.00s system 78% cpu 4:53.98 total
➜  time ./test2.sh
./test2.sh  80.47s user 152.67s system 78% cpu 4:56.86 total

Hmm less then 4% increased performance dosent look promising for option 1. There would also be a reduction in time spent parsing the result, since scaling would only need the number of lines. But I wouldn't expect this to much of a performance issue either.

Currently this only appears to cost 0.16% cpu at the tick rate on the given system so perhaps more then just ps is to blame here.

The original PR that caused the performance hit was reverted.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sojeda picture sojeda  Â·  5Comments

etiennellipse picture etiennellipse  Â·  3Comments

okaufmann picture okaufmann  Â·  3Comments

mahdiahmadi73 picture mahdiahmadi73  Â·  3Comments

dmitryuk picture dmitryuk  Â·  3Comments