Horizon: Confusion about timeouts when it comes to graceful termination (bug?)

Created on 21 Dec 2019  路  6Comments  路  Source: laravel/horizon

  • Horizon Version: 3.4.4
  • Laravel Version: 6.9.0
  • PHP Version: 7.4.0
  • Redis Driver & Version: phpredis 5.1.0

Description:

I was part of the ticket about that horizon seemed to not gracefully shutdown #663 (as we were having shutdown problems) but after this ticket was closed with the confirmation that horizon listens correctly to SIGTERM we continued investigation of other reasons that could explain our problem.

Our issues were that having a long-running job (for example 120 seconds) was not completed, and horizon main PID exited (and as we are running horizon in a container, the container exited). The long running job was not completed correctly.

We have now figured out that this is due to the 'timeout' configuration in horizon (per 'supervisor' entry) that has caused quite some confusion about this setting.

General info:

  • We are not using supervisor (the system process manager from linux i mean) in any way (horizon is launched as main process in a container)
  • This behaviour happens if we run horizon:terminate or if we send SIGTERM to the main PID
  • retry_after setting in redis connection is set to 1200 seconds and has nothing to do with this issue we're facing.

Steps To Reproduce:

  • Create a job with a specific timeout ($timeout defined in job class) of 120 seconds
  • have a timeout of 30 (seconds) set in the horizon supervisor entry that is serving the queue that you'll be dispatching to
  • start horizon (this timeout setting of 30 is set to horizon:supervisor as well as the resulting horizon:work processes)
  • dispatch the job that will run for 120 seconds
  • immediately after either send SIGTERM to main PID or run artisan horizon:terminate

Immediately following happens:

  • all supervisors & work processes that are currently doing nothing will quit correctly
  • there will be one supervisor & one work process still running (that i currently processing the 120 second job) as well as the main horizon process.
  • after 30 seconds, the supervisor process will exit, and so will the main horizon PID
  • BUT the supervisor:work command is still running that job.

Because the main horizon PID exited, in our case the container is terminated, what results in the still running job not being completed and ungracefully stopped.

This behavior can be 'fixed' by setting the 'timeout' value of horizon supervisor config to something larger than 120 seconds, but this is now the very confusion part.

If you run normal artisan queue:work with the optional --timeout parameter, this timeout is only taken into consideration if the Job its self DOES NOT have a timeout defined.
This is also the reason why the supervisor:work process does NOT exit after 30 seconds, despite having the --timeout option set to 30 (this is correct behavior). Also if you would manually run queue:work --timeout=30 a job with large timeout will run for 120 seconds.

What i do not understand is why the horizon:supervisor process (and main process) exits after 30 seconds. Even after digging in the code, this does not make any sense to me that the supervisor process decides to exit, even if the worker is still running.

I would love some insights or clarifications. I don't want to blindly change those settings without knowing the reason for this behaviour, as it might still be not 'by design'.

All 6 comments

In order for Horizon to not get stuck forever waiting for child supervisors to finish, we use the longest timeout value for a supervisor and use it as the maximum time we should wait.

So even if you have your job set to timeout after 120 seconds, because the longest wait time of a supervisor is set to be 30, Horizon will terminate after 30 seconds.

What you can do is run these long running jobs on a separate queue that has a long timeout value, that way Horizon will only terminate based on that long value.

I feel like this is more of a support request than an actual feature request or bug report. Feel free to continue the discussion on another channel (or on this issue).

@themsaid thank you for the clarification. Its understandable & makes sense.

The summary then would be: the timeout value set in horizon has to be equal/longer than the longest job (that will be processed by that supervisor) and the retry_after configuration of the redis connection that is used has to be equal/longer than the longest job that will be processed via that connection.

I feel like maybe the horizon timeout configuration needs a section devoted to it in the documentation, as this parameter is used/handled slightly differently than the timeout parameter for normal queues & workers (and the option is not mentioned in the docs at all). This timeout issues took quite some time for us to figure out and i think we could prevent this for future horizon users with a few lines of documentation, even if the default of 60 is probably not being hit by many.

All in all though horizon is doing a fantastic job even at large scale, a great addition to the laravel stack!

the timeout value set in horizon as to be

Did you mean "has to be"?

Normally what I'd do is have a dedicated queue for long jobs that has a reasonable timeout value and push all long jobs to that queue.

@themsaid yes, i should have been more precise in the summary: this only applies to the supervisor, queue & connection that is serving the long running jobs.

Maybe it would also be sensible to put the default timeout that is currently set to 60 seconds into the horizon.php config that can be published. As there is no timeout there, and nothing in the documentation, this confused us why jobs only running longer than 60 seconds were not working while gracefully shutting down.

Was this page helpful?
0 / 5 - 0 ratings