Cylc-flow: Notification of unexpected suite shutdown

Created on 19 Mar 2019 · 12 Comments · Source: cylc/cylc-flow

Users typically want to know if their suites have shut down due to unexpected circumstances, e.g.:

  • Disk full. (I/O error on database write.)
  • Suite failing health check. (E.g. contact file changed, suite host condemned, etc.)
  • Cylc bug.

The current shutdown suite event is not sufficient, because it does not distinguish between an expected shutdown and an unexpected shutdown. We need a new suite event that triggers only on unexpected suite shutdown.

The current set of suite events includes:

  • startup
  • timeout
  • inactivity
  • stalled
  • shutdown
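The distinction requested above can be sketched in Python. This is only an illustration under stated assumptions, not Cylc's implementation: `run_and_report`, `run_suite`, `emit_event`, and the event name `unexpected-shutdown` are hypothetical placeholders (the actual event name is debated in the comments).

```python
from typing import Callable

SHUTDOWN = "shutdown"  # existing suite event
# Hypothetical placeholder for the new event; not an actual Cylc event name.
UNEXPECTED_SHUTDOWN = "unexpected-shutdown"


def run_and_report(run_suite: Callable[[], None],
                   emit_event: Callable[[str, str], None]) -> None:
    """Run a suite and emit an event that reflects how it stopped."""
    try:
        run_suite()
    except Exception as exc:
        # Abnormal stop (e.g. I/O error, failed health check, bug).
        # Assumption: only the new event fires here; whether "shutdown"
        # should fire as well is a design choice not settled in this sketch.
        emit_event(UNEXPECTED_SHUTDOWN, f"{type(exc).__name__}: {exc}")
    else:
        # The suite ran to completion or was stopped on request.
        emit_event(SHUTDOWN, "suite stopped normally")
```

A handler registered only for the new event would then be called for failures such as a full disk, but not for a requested stop.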

All 12 comments

Had some of these last night! (>.<)

What was the cause?

The open file limit:

2019-02-18T21:45:40Z INFO - Suite server: url=https://ec-cylc02.kupe.niwa.co.nz:43067/ pid=18106
2019-02-18T21:45:40Z INFO - Run: (re)start=11 log=4
2019-02-18T21:45:40Z INFO - Cylc version: 7.8.0
2019-02-18T21:45:40Z INFO - Run mode: live
2019-02-18T21:45:40Z INFO - Initial point: 20180108T18
2019-02-18T21:45:40Z INFO - Final point: None
2019-03-19T11:32:01Z INFO - DONE
Traceback (most recent call last):
  File "/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/bin_/cylc-restart", line 25, in <module>
    main(is_restart=True)
  File "/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/lib/cylc/scheduler_cli.py", line 134, in main
    scheduler.start()
  File "/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/lib/cylc/scheduler.py", line 279, in start
    raise exc
IOError: [Errno 23] Too many open files in system: '/scale_akl_persistent/filesets/opt_niwa/share_prelim/cylc/7.8.0/etc/global.rc'
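The message "Too many open files in system" in the traceback is the standard text for `errno.ENFILE` (the system-wide file table is full), which differs from `errno.EMFILE` (the per-process limit). A minimal sketch of how a notification handler might label such a failure follows; `describe_open_file_error` is a hypothetical helper, not part of Cylc.

```python
import errno
from typing import Optional


def describe_open_file_error(exc: OSError) -> Optional[str]:
    """Return a short cause for open-file-limit errors, else None."""
    if exc.errno == errno.ENFILE:
        # The whole system has run out of file table entries.
        return "system-wide open file limit reached"
    if exc.errno == errno.EMFILE:
        # This process has hit its own descriptor limit.
        return "per-process open file limit reached"
    return None
```

Including the returned text in the notification would tell the user whether the limit was hit by the host as a whole or by the suite server process itself.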

Ouch!

@cylc/core I'd like some quick comments/suggestions on the name of the new event before I attempt an implementation. Here is my list:

  • died - too Perl?
  • aborted - may be confused with the abort on * settings?
  • terminated - may be confused with SIGTERM?
  • exited - does not sound severe enough?

Thoughts?

crashed - too generic?

Given that this event may be triggered by a normal, controlled shutdown due to the suite hosts being condemned, I don't like _crashed_ or _died_. Maybe _terminated_ if we're not happy with _aborted_?

I quite like falldown, having the parallel to shutdown but 'fall' implying it is undesired and not clean.

"this event may be triggered by a normal, controlled shutdown due to the suite hosts being condemned"

Or if we want both terms to indicate some shade of normal, but one undesirable, we could have shutdown as the unexpected suite shutdown and windup to be the expected shutdown. Then there is a nice relation by 'up' to startup. In that case we could then perhaps use the up/down sub-words to help distinguish such states in our UI icons, etc, via corresponding arrows.

Otherwise what about ceased or expired?

Many of the terms above suggest to me an uncontrolled crash, which this isn't. We're talking about the server program detecting some abnormal condition and shutting itself down cleanly. "Falldown" and "ceased" are too non-standard/unusual IMO, so most users wouldn't understand what they mean; and "expired" suggests some kind of timeout has occurred (colloquially, in the sense of "past its use-by date") or a crash (in the sense of "dead") ... plus we already have a task expired event (in the timeout sense).

So I vote for:

  • aborted (2nd choice)

    • I don't think it matters too much that we already have "abort on task failed" etc. because that arguably has similar meaning.

  • abnormal shutdown (1st choice)

    • this is verbose (does that matter?) but I think it covers all cases quite nicely, including condemned host migration (an abnormal condition from the suite's perspective)

In which case, I'll go with aborted because you can say "my-suite aborted" but you can't really say "my-suite abnormal shutdown".

(aborted is fine, but I will just point out that we could have had an "abnormal shutdown" event, and said my suite shut down abnormally - but I forgot to comment 2 weeks ago, so never mind!)

7.8.x version of this is done.
