Hi,
I thought it would be good to follow standard syslog severity levels. At the moment it appears to allow NORMAL, WARNING, CRITICAL. Standard syslog is: DEBUG, INFO, NOTICE, WARNING, ERR, CRIT, ALERT, EMERG
See: https://docs.python.org/3/library/syslog.html and https://en.wikipedia.org/wiki/Syslog
Python's logging module doesn't do all these either, the last time I looked. What are we trying to support that requires all these levels?
Note we have a "CUSTOM" level too, and this functionality overlaps with event-handling somewhat in Cylc.
I'm of two minds about this. I suppose we could allow additional levels that could be used with custom messages in user job scripting, just for logging - user or site-defined meaning. On the other hand, Matt makes a good point.
Not really against, but would be interested to understand the requirements here.
Fair point that I didn't really justify anything. I wrote all severity levels, but yes, you are probably correct that it would be overkill. I had looked at Python's stdlib syslog module rather than the logging module, so I assumed all of the levels were included. I'm not entirely sure how CUSTOM works, so I can't comment on it. My reasoning for raising this isn't spelled out above. Basically, I think having a bit more control over the log levels is useful.
For example, Python's logging module lets you print only messages at or above a specific level. This would allow people, during development, to put in a bunch of extra log information (e.g. log_debug) and then turn it off by configuration in operations to avoid polluting log files. That gives a smoother transition to operations and, if set up properly, would let people do an edit run after a failure to turn the debug messages back on and help figure out the problem.
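A minimal sketch of that threshold behaviour, using only Python's standard logging module. The logger names and the make_logger helper are made up for illustration and are not part of Cylc.

```python
import io
import logging


def make_logger(name, level, stream):
    """Build an isolated logger that writes 'LEVEL message' lines to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)       # messages below this level are dropped
    logger.propagate = False     # keep output out of the root logger
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# "Operations" configuration: debug messages are silenced.
ops_out = io.StringIO()
ops = make_logger("demo.ops", logging.WARNING, ops_out)
ops.debug("opened input file")
ops.warning("retrying read")

# "Development" configuration: the same calls now include debug output.
dev_out = io.StringIO()
dev = make_logger("demo.dev", logging.DEBUG, dev_out)
dev.debug("opened input file")
dev.warning("retrying read")
```

The calls in the script stay the same in both cases; only the configured threshold changes.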
The other aspect relates to alerting downstream. I don't know the exact details, but I do know Cylc is being configured to work with message brokers to deliver messages to alerting and monitoring systems. Finer-grained levels would provide a direct link to the alerting mechanism to help prioritise resolution (when combined with some priority ranking of the system in the organisational context). Perhaps this is already figured out and I am going down a weird path? But, for example, say a task is doing some data format conversion on Lustre (netCDF to/from GRIB2, say) and finds evidence that a file is corrupted. Perhaps that should be raised at an emergency level, requiring immediate escalation to second-level support rather than having first-level support try to triage it, because there is most likely something wrong with one or more of the Lustre OSTs.
tl;dr - the real idea is more along the lines of:
cylc message -p "DEBUG" ... peppered through scripts, with some configuration setting, accessible via the edit run interface, that lets you turn these messages off in operations but back on when trying to resolve a previously unseen problem.
Does the above make a bit more sense? Sorry for the brevity and for not fully fleshing out my thoughts initially.
To weigh in on the use-cases:
I think differentiating between error messages and fatal/critical diagnostic or alerting messages could be useful. For example, it is conceivable for a task to encounter real errors when invoking commands or interfacing with databases etc. Sometimes this might cause the suite's progress to halt (let's call this scenario a fatal error). In other cases the task may have fall-back logic programmed in to work around the errors (e.g. if the system-wide open file-handle limit is hit, wait a while and retry on the assumption that the condition is sporadic). These could be reported as errors (warranting serious and timely investigation) even though they did not cause a critical error for the task or suite (and hence operations).
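To make the distinction concrete, here is a rough sketch of fall-back logic that reports recoverable failures as ERROR and only the final, unrecoverable one as CRITICAL. The run_with_retry helper, its parameters and the logger name are hypothetical; this is plain Python logging, not Cylc code.

```python
import logging
import time


def run_with_retry(action, attempts=3, delay=0.0, log=None):
    """Call action(); retry on OSError, escalating only when out of attempts."""
    if log is None:
        log = logging.getLogger("demo.task")  # hypothetical logger name
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            if attempt == attempts:
                # No fall-back left: this is the fatal case.
                log.critical("giving up after %d attempts: %s", attempts, exc)
                raise
            # Worked around for now, but still worth investigating.
            log.error("attempt %d failed, will retry: %s", attempt, exc)
            time.sleep(delay)
```

A downstream alerting system could then page someone on CRITICAL only, while ERROR records are collected for later investigation.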
Supporting a debug severity level would also be nice for reasons Tom has mentioned.
There is a discussion about severity levels and what distinguishes them at https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels which I found informative. Ultimately I think this issue comes down to determining if these distinguishing characteristics are useful to downstream applications/customers/operators. I think there is a case, although out of the standard syslog severity levels (IETF RFC5424), I have to admit I don't see a need for ALERT.
A TRACE level to support finer-grained debugging could be useful (this goes beyond the syslog convention).
Annoyingly, Python's logging module maps severity levels from 10 (debug) to 50 (critical) - the number increases with severity level, whereas syslog maps the main severity levels from 7 (debug) to 2 (critical) - the number decreases with severity level.
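A small sketch of how the two numbering schemes could be reconciled. The syslog numbers follow RFC 5424 and the logging constants are the standard ones, but the mapping itself is an assumption for illustration: it is lossy, folding EMERG and ALERT into CRITICAL and NOTICE into INFO.

```python
import logging

# syslog severities: a lower number means more severe.
SYSLOG_SEVERITY = {
    "EMERG": 0, "ALERT": 1, "CRIT": 2, "ERR": 3,
    "WARNING": 4, "NOTICE": 5, "INFO": 6, "DEBUG": 7,
}

# Assumed mapping onto Python logging levels (a higher number means more severe).
SYSLOG_TO_LOGGING = {
    "EMERG": logging.CRITICAL,    # 50
    "ALERT": logging.CRITICAL,    # 50
    "CRIT": logging.CRITICAL,     # 50
    "ERR": logging.ERROR,         # 40
    "WARNING": logging.WARNING,   # 30
    "NOTICE": logging.INFO,       # 20
    "INFO": logging.INFO,         # 20
    "DEBUG": logging.DEBUG,       # 10
}


def syslog_to_logging(name):
    """Translate a syslog severity name to a Python logging level number."""
    return SYSLOG_TO_LOGGING[name.upper()]
```

Going the other way (logging to syslog) would need a choice about which syslog level CRITICAL and INFO should map back to.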
What we can do... Pick either logging or syslog as a basis. (The former is more likely, given that it is already used in the logic.) Modify cylc message to allow any severity level. If the specified level is recognised, the reporting system will respect the level in the normal way. Otherwise, the level is considered custom - and the reporting system will act according to any custom event handlers (but can probably default to e.g. logging.INFO).
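As a rough illustration of the "recognised, otherwise custom" rule, assuming logging is chosen as the basis. The resolve_severity function and its return shape are hypothetical and do not reflect how cylc message is actually implemented.

```python
import logging

# Level names recognised by Python's logging module.
RECOGNISED = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_severity(name, default=logging.INFO):
    """Return (level, is_custom) for a user-supplied severity name.

    Recognised names map to their logging level. Anything else is treated
    as custom and logged at the default level, leaving any custom event
    handlers to act on the original name.
    """
    key = name.strip().upper()
    if key in RECOGNISED:
        return RECOGNISED[key], False
    return default, True
```

For example, resolve_severity("warning") gives (logging.WARNING, False), while an unknown name such as "CUSTOM" gives (logging.INFO, True).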
cylc message part of this issue.
Still need to figure out the following:
log/suite/log and log/suite/err #386. Done by #2781.
@matthewrmshin - responding to the previous comments:
With #2582 and #2781, we should now be aligned with Python's logging module.
Things left to do before closing this issue:
failed output is a prerequisite of a downstream task).
Tentatively re-targeting this for Cylc 8. Now that the code is in Python 3, we can implement configurable logging.