Cylc-flow: cylc message severity levels

Created on 7 Dec 2017 · 11 Comments · Source: cylc/cylc-flow

Hi,

I thought it would be good for cylc message to follow the standard syslog severity levels. At the moment it appears to allow only NORMAL, WARNING and CRITICAL. The standard syslog levels are: DEBUG, INFO, NOTICE, WARNING, ERR, CRIT, ALERT, EMERG.

See: https://docs.python.org/3/library/syslog.html and https://en.wikipedia.org/wiki/Syslog

ColemanTom

All 11 comments

Python's logging module doesn't do all these either, the last time I looked. What are we trying to support that requires all these levels?

matthewrmshin on 7 Dec 2017

Note we have a "CUSTOM" level too, and this functionality overlaps with event-handling somewhat in Cylc.

I'm of two minds about this. I suppose we could allow additional levels that could be used with custom messages in user job scripting, just for logging - user or site-defined meaning. On the other hand, Matt makes a good point.

hjoliver on 7 Dec 2017

Not really against, but would be interested to understand the requirements here.

matthewrmshin on 7 Dec 2017

Fair point, I didn't really justify anything. I listed all the severity levels, but yes, you are probably right that supporting all of them would be overkill. I had looked at the Python standard library's syslog module rather than the logging module, so I assumed all of them were included. I'm not entirely sure how CUSTOM works, so I can't comment on it. The reasoning behind my request isn't written above, so here it is. Basically, I think having a bit more control over the log levels is useful.

For example, with Python's logging module, you can tell a logger to only print messages at or above a specific level. This would allow people, during development, to put in a bunch of extra log information (e.g. log_debug), and then have it turned off by configuration in operations to avoid polluting log files. This gives a smoother transition to operations and, if set up properly, would allow people to do an edit run after a failure to turn the debug messages back on to help figure out a problem.
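As a rough illustration of the threshold behaviour described above, here is a minimal sketch using only Python's standard logging module. The helper names (`build_logger`, `run_task`) and the log messages are made up for this example and are not part of Cylc.

```python
import io
import logging


def build_logger(level_name):
    """Return (logger, buffer) for a logger that writes to an in-memory buffer.

    Hypothetical helper for this example only.
    """
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(f"demo.{level_name}")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    # Messages below this threshold are dropped by the logger.
    logger.setLevel(getattr(logging, level_name))
    return logger, buffer


def run_task(logger):
    """Stand-in for a task script peppered with debug messages."""
    logger.debug("read 42 records from input file")
    logger.info("conversion finished")


# "Operations": threshold INFO, so the debug line is suppressed.
ops_logger, ops_out = build_logger("INFO")
run_task(ops_logger)

# "Development" or problem investigation: threshold DEBUG, everything is kept.
dev_logger, dev_out = build_logger("DEBUG")
run_task(dev_logger)

print(ops_out.getvalue(), end="")
print(dev_out.getvalue(), end="")
```

The task code is identical in both runs; only the configured threshold changes, which is the property the comment is asking for.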

The other aspect of this would relate to the alerting downstream. I don't know the details exactly, but I do know cylc is being configured to work with message brokers to deliver messages to alerting and monitoring systems. The granularity in levels would provide a direct link to the alerting mechanism to help prioritise resolution (when combined with some priority ranking of the system in the organisational context). Perhaps this is already figured out though and I am going down a weird path? But, for example, say a task is running, it is doing some data format conversion on lustre (netcdf to/from grib2 for example), and you find evidence that the file is corrupted. Perhaps that should be raised as an emergency level requiring immediate escalation to 2nd level support rather than first level trying to triage it because there is most likely something wrong with one or more of the lustre OSTs.

tl;dr: the real idea is more along these lines:

- It would be nice to have a couple more levels, such as DEBUG, so that you can have cylc message -p "DEBUG" ... peppered through scripts, while a configuration setting accessible via the edit run interface lets you turn them off in operations and back on when trying to resolve a previously unseen problem.
- I imagine that more levels may provide better granularity for triage support and for how 1st level support should act in the case of failures (but this may already be sorted out by however the message broker integration is being done).

Does the above make a bit more sense? Sorry for the brevity and for not fully fleshing out my thoughts initially.

ColemanTom on 7 Dec 2017

To weigh in on the use-cases:

I think differentiating between error and fatal/critical error diagnostic/alerting messages could be useful. For example, it is conceivable for a task to encounter real errors when invoking commands or interfacing with databases etc. Sometimes this might cause the suite's progress to halt (let's call this scenario a fatal error). In other cases the task may have fall-back logic programmed in to work around the errors (e.g. ... if system-wide open file-handle limit is hit, wait a while and retry on the assumption this condition is sporadic). These could be reported as errors (warranting serious and timely investigation) even though they did not cause a critical error for the task or suite (and hence operations).

Supporting a debug severity level would also be nice for reasons Tom has mentioned.

There is a discussion about severity levels and what distinguishes them at https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels which I found informative. Ultimately I think this issue comes down to determining if these distinguishing characteristics are useful to downstream applications/customers/operators. I think there is a case, although out of the standard syslog severity levels (IETF RFC5424), I have to admit I don't see a need for ALERT.

A TRACE level to support finer-grained debugging could be useful (this goes beyond the syslog convention).
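For what it's worth, Python's logging module lets you register extra named levels, so a TRACE level below DEBUG can be sketched as follows. The numeric value 5 and the logger name are assumptions chosen for this example, not something Cylc defines.

```python
import io
import logging

# Hypothetical value: anything below logging.DEBUG (10) would do.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("trace_demo")
logger.handlers.clear()
logger.addHandler(handler)
logger.propagate = False

# With the threshold at DEBUG, the TRACE message is dropped.
logger.setLevel(logging.DEBUG)
logger.log(TRACE, "entering convert()")

# Lowering the threshold to TRACE lets it through.
logger.setLevel(TRACE)
logger.log(TRACE, "entering convert()")

print(stream.getvalue(), end="")
```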

ivorblockley on 1 Feb 2018

#386 is related.

Annoyingly, Python's logging module maps severity levels from 10 (debug) to 50 (critical) - the number increases with severity level, whereas syslog maps the main severity levels from 7 (debug) to 2 (critical) - the number decreases with severity level.
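To make the opposite numbering concrete, here is a small sketch of one possible mapping between the two scales. It uses the severity constants that `logging.handlers.SysLogHandler` exposes; the table and the `to_syslog` helper are illustrative assumptions, not an existing Cylc API.

```python
import logging
from logging.handlers import SysLogHandler

# logging numbers rise with severity; syslog numbers fall with severity.
LOGGING_TO_SYSLOG = {
    logging.DEBUG: SysLogHandler.LOG_DEBUG,      # 10 -> 7
    logging.INFO: SysLogHandler.LOG_INFO,        # 20 -> 6
    logging.WARNING: SysLogHandler.LOG_WARNING,  # 30 -> 4
    logging.ERROR: SysLogHandler.LOG_ERR,        # 40 -> 3
    logging.CRITICAL: SysLogHandler.LOG_CRIT,    # 50 -> 2
}


def to_syslog(levelno):
    """Map a logging level number to a syslog severity (hypothetical helper).

    Levels between two standard values are rounded down to the nearest
    standard logging level; anything below DEBUG is treated as debug.
    """
    known = [lvl for lvl in sorted(LOGGING_TO_SYSLOG) if lvl <= levelno]
    if not known:
        return SysLogHandler.LOG_DEBUG
    return LOGGING_TO_SYSLOG[known[-1]]


for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    levelno = getattr(logging, name)
    print(name, levelno, to_syslog(levelno))
```

Note that syslog's NOTICE, ALERT and EMERG have no counterpart among the five standard logging levels, so a mapping in the other direction would need extra decisions.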

What we can do:

- Pick either logging or syslog as a basis. (The former is more likely, given that it is already used in the logic.)
- Modify cylc message to allow any severity level. If the specified level is recognised, the reporting system will respect the level in the normal way. Otherwise, the level is considered custom, and the reporting system will act according to any custom event handlers (but can probably default to e.g. logging.INFO).
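The "recognised level, otherwise custom" rule proposed above could look roughly like this. It is only a sketch of the idea: `resolve_severity` and `KNOWN_LEVELS` are hypothetical names, and the real handling inside cylc message may differ.

```python
import logging

# Levels the reporting system would treat "in the normal way".
KNOWN_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_severity(name, default=logging.INFO):
    """Return (levelno, is_custom) for a user-supplied severity name.

    Unrecognised names are flagged as custom and logged at the default
    level, leaving any further action to custom event handlers.
    """
    key = name.strip().upper()
    if key in KNOWN_LEVELS:
        return KNOWN_LEVELS[key], False
    return default, True


print(resolve_severity("warning"))
print(resolve_severity("DATA-CORRUPT"))
```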

matthewrmshin on 1 Feb 2018

👍2

#2582 should solve the `cylc message` part of this issue.

Still need to figure out the following:

- ~How to deal with logging levels on the suite side. I think we need to rationalise how we configure logging for the running suite. My normal instinct is to introduce a setting to configure the logging level of the suite (as opposed to having the verbose and debug flags). We should also consider whether we need to duplicate log entries in both log/suite/log and log/suite/err #386.~ Done by #2781.
- A job failure currently has a CRITICAL severity. Should this be an ERROR instead? (And should a job failure be a WARNING for tasks that have retries lined up?) Or perhaps this should be configurable per task? (#2289?)

matthewrmshin on 22 Feb 2018

@matthewrmshin - responding to the previous comments:

- The original intention for the debug flag was to print Python tracebacks, and otherwise just a simple error message for users who should not be expected to understand Python tracebacks. Not sure that's the best approach though, not least because it may be inconvenient to re-run a failed suite in order to get a traceback. Aside from debug, a multi-level verbosity flag seems sensible to me. Also, I'd be happy to not duplicate suite err messages in the suite log (we don't for job.err after all).
- This is a tricky one! A job failure is typically critical for the job, but not the suite. Maybe we need two categories of CRITICAL (one for job, one for suite). But as you note, a job failure when there are retries lined up is presumably less critical. I'd prefer not to make it configurable unless we really have to, as I doubt many would resort to that. This might be a good one to discuss in June...

hjoliver on 25 Feb 2018

With #2582 and #2781, we should now be aligned with Python's logging module.

Things left to do before closing this issue:

- Agree on the default logging level of a failed job with and without retries lined up.
  - CRITICAL - as now.
  - ERROR, or WARNING if job is expected to fail from time to time (e.g. has follow-on retries, or where failed output is a prerequisite of a downstream task).
- Fully expose suite logging via configuration. (Requires Python 3 for easy implementation.)
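For discussion purposes, the second option in the first item above (ERROR, or WARNING when failure is expected) could be expressed as a small rule like the one below. This only sketches one of the two candidates, not a decision; `job_failure_level` and its parameters are hypothetical names.

```python
import logging


def job_failure_level(retries_remaining, failure_expected=False):
    """Pick a logging level for a failed job (hypothetical helper).

    failure_expected stands for cases such as a downstream task that
    triggers off the failed output.
    """
    if retries_remaining > 0 or failure_expected:
        return logging.WARNING
    return logging.ERROR


print(logging.getLevelName(job_failure_level(retries_remaining=2)))
print(logging.getLevelName(job_failure_level(retries_remaining=0)))
```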

matthewrmshin on 17 Dec 2018

Tentatively re-targeting this for Cylc 8. Now that the code is in Python 3, we can implement configurable logging.

matthewrmshin on 11 Mar 2019

👍2

#3647

oliver-sanders on 11 Dec 2020
