Users typically want to know if their suites have shut down due to unexpected circumstances, e.g.:
The current shutdown suite event is not good enough, because it does not distinguish between an expected shutdown and an unexpected shutdown. We need a new suite event to only trigger on unexpected suite shutdown.
The current set of suite events includes:
Had some of these last night! (>.<)
What was the cause?
The open file limit:
2019-02-18T21:45:40Z INFO - Suite server: url=https://ec-cylc02.kupe.niwa.co.nz:43067/ pid=18106
2019-02-18T21:45:40Z INFO - Run: (re)start=11 log=4
2019-02-18T21:45:40Z INFO - Cylc version: 7.8.0
2019-02-18T21:45:40Z INFO - Run mode: live
2019-02-18T21:45:40Z INFO - Initial point: 20180108T18
2019-02-18T21:45:40Z INFO - Final point: None
2019-03-19T11:32:01Z INFO - DONE
Traceback (most recent call last):
  File "/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/bin_/cylc-restart", line 25, in <module>
    main(is_restart=True)
  File "/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/lib/cylc/scheduler_cli.py", line 134, in main
    scheduler.start()
  File "/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/lib/cylc/scheduler.py", line 279, in start
    raise exc
IOError: [Errno 23] Too many open files in system: '/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/etc/global.rc'
Ouch!
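For reference, the `[Errno 23]` in the traceback above is ENFILE (the system-wide open file table is full), as opposed to EMFILE (the per-process limit). A minimal Python sketch of telling the two apart; the helper name `classify_open_file_error` is hypothetical and not part of Cylc:

```python
import errno


def classify_open_file_error(exc):
    """Return a short label for a 'too many open files' style error.

    Hypothetical helper for illustration only; not a Cylc function.
    """
    if exc.errno == errno.ENFILE:
        # The whole system has run out of file table entries.
        return "system-wide limit"
    if exc.errno == errno.EMFILE:
        # This process has hit its own open file limit.
        return "per-process limit"
    return "other"


err = OSError(errno.ENFILE, "Too many open files in system")
print(classify_open_file_error(err))
```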
@cylc/core I'd like some quick comments/suggestions for the name of the new event before I attempt an implementation. Here is my list:
abort on * settings?
Thoughts?
crashed - too generic?
Given that this event may be triggered by a normal, controlled shutdown due to the suite hosts being condemned, I don't like _crashed_ or _died_. Maybe _terminated_ if we're not happy with _aborted_?
I quite like falldown, having the parallel to shutdown but 'fall' implying it is undesired and not clean.
this event may be triggered by a normal, controlled shutdown due to the suite hosts being condemned
Or if we want both terms to indicate some shade of normal, but one undesirable, we could have shutdown as the unexpected suite shutdown and windup to be the expected shutdown. Then there is a nice relation by 'up' to startup. In that case we could then perhaps use the up/down sub-words to help distinguish such states in our UI icons, etc, via corresponding arrows.
Otherwise what about ceased or expired?
Many of the terms above suggest to me an uncontrolled crash, which this isn't. We're talking about the server program detecting some abnormal condition and shutting itself down cleanly. "Falldown" and "ceased" are too non-standard/unusual IMO, so most users wouldn't understand what they mean; and "expired" suggests some kind of timeout has occurred (colloquially, in the sense of "past its use-by date") or a crash (in the sense of "dead") ... plus we already have a task expired event (in the timeout sense).
So I vote for:
In which case, I'll go with aborted because you can say my-suite aborted but you can't really say my-suite abnormal shutdown.
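To illustrate the distinction being asked for, here is a rough Python sketch in which a shutdown event fires on every exit while an aborted event fires only when the run ends abnormally. This is an assumption about the intended behaviour, not Cylc's actual implementation; `run_and_report` and the shape of the `handlers` mapping are hypothetical.

```python
def run_and_report(main_loop, handlers):
    """Run main_loop, then fire suite events; return the event names fired.

    Hypothetical sketch: 'aborted' fires only on an unexpected exit,
    'shutdown' fires on every exit (assumed, as described in this thread).
    """
    fired = []

    def fire(event, reason):
        fired.append(event)
        for handler in handlers.get(event, []):
            handler(event, reason)

    try:
        main_loop()
        reason = "finished normally"
    except Exception as exc:
        # Abnormal condition detected: report it, then shut down cleanly.
        reason = "%s: %s" % (type(exc).__name__, exc)
        fire("aborted", reason)
    fire("shutdown", reason)
    return fired


def failing_loop():
    raise IOError(23, "Too many open files in system")


seen = []
events = run_and_report(failing_loop, {"aborted": [lambda e, r: seen.append((e, r))]})
print(events)
```

A handler registered only under "aborted" is therefore never called for an expected shutdown.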
(aborted is fine, but I will just point out that we could have had an "abnormal shutdown" event, and said my suite shut down abnormally - but I forgot to comment 2 weeks ago, so never mind!)
7.8.x version of this is done.