I was part of the ticket about that horizon seemed to not gracefully shutdown #663 (as we were having shutdown problems) but after this ticket was closed with the confirmation that horizon listens correctly to SIGTERM we continued investigation of other reasons that could explain our problem.
Our issues were that having a long-running job (for example 120 seconds) was not completed, and horizon main PID exited (and as we are running horizon in a container, the container exited). The long running job was not completed correctly.
We have now figured out that this is due to the 'timeout' configuration in horizon (per 'supervisor' entry) that has caused quite some confusion about this setting.
General info:
Immediately following happens:
Because the main horizon PID exited, in our case the container is terminated, what results in the still running job not being completed and ungracefully stopped.
This behavior can be 'fixed' by setting the 'timeout' value of horizon supervisor config to something larger than 120 seconds, but this is now the very confusion part.
If you run normal artisan queue:work with the optional --timeout parameter, this timeout is only taken into consideration if the Job its self DOES NOT have a timeout defined.
This is also the reason why the supervisor:work process does NOT exit after 30 seconds, despite having the --timeout option set to 30 (this is correct behavior). Also if you would manually run queue:work --timeout=30 a job with large timeout will run for 120 seconds.
What i do not understand is why the horizon:supervisor process (and main process) exits after 30 seconds. Even after digging in the code, this does not make any sense to me that the supervisor process decides to exit, even if the worker is still running.
I would love some insights or clarifications. I don't want to blindly change those settings without knowing the reason for this behaviour, as it might still be not 'by design'.
In order for Horizon to not get stuck forever waiting for child supervisors to finish, we use the longest timeout value for a supervisor and use it as the maximum time we should wait.
So even if you have your job set to timeout after 120 seconds, because the longest wait time of a supervisor is set to be 30, Horizon will terminate after 30 seconds.
What you can do is run these long running jobs on a separate queue that has a long timeout value, that way Horizon will only terminate based on that long value.
I feel like this is more of a support request than an actual feature request or bug report. Feel free to continue the discussion on another channel (or on this issue).
@themsaid thank you for the clarification. Its understandable & makes sense.
The summary then would be: the timeout value set in horizon has to be equal/longer than the longest job (that will be processed by that supervisor) and the retry_after configuration of the redis connection that is used has to be equal/longer than the longest job that will be processed via that connection.
I feel like maybe the horizon timeout configuration needs a section devoted to it in the documentation, as this parameter is used/handled slightly differently than the timeout parameter for normal queues & workers (and the option is not mentioned in the docs at all). This timeout issues took quite some time for us to figure out and i think we could prevent this for future horizon users with a few lines of documentation, even if the default of 60 is probably not being hit by many.
All in all though horizon is doing a fantastic job even at large scale, a great addition to the laravel stack!
the timeout value set in horizon as to be
Did you mean "has to be"?
Normally what I'd do is have a dedicated queue for long jobs that has a reasonable timeout value and push all long jobs to that queue.
@themsaid yes, i should have been more precise in the summary: this only applies to the supervisor, queue & connection that is serving the long running jobs.
Maybe it would also be sensible to put the default timeout that is currently set to 60 seconds into the horizon.php config that can be published. As there is no timeout there, and nothing in the documentation, this confused us why jobs only running longer than 60 seconds were not working while gracefully shutting down.