Scikit-learn: Use Python logging to report on convergence progress at level INFO for long-running tasks

Created on 12 Feb 2011 · 31 comments · Source: scikit-learn/scikit-learn

This is a proposal to use python's logging module instead of using stdout and verbose flags in the models API.

Using the logging module would make it easier for the user to control the verbosity of the scikit using a single, well-documented configuration interface and logging API.

http://docs.python.org/library/logging.html

New Feature

Most helpful comment

A fifth option would be to remove verbose flags, use logging everywhere, and let users adjust verbosity via the logging API. This is what logging was designed for, after all.

I would support removing verbose, as I find per-estimator configuration
frustrating, and the numeric values of verbose arbitrary, poorly
documented,

I think getting rid of verbose and using logging levels would be very nice. The only downside I see is that it would make logging slightly less discoverable.

All 31 comments

Work has started on this in https://github.com/GaelVaroquaux/scikit-learn/tree/progress_logger

What remains to be done is most probably fairly mechanical work.

There is also work in the new Gradient Boosting module.

Logging actually isn't that easy to use at all, in my experience, so -1 on this.

Is anyone working on this?

How about we add a logger that by default prints to STDOUT? That should be fairly simple, right?

This issue has been open since 2011 and so I wonder whether this is going to be fixed. I've run into this issue with RFECV (https://github.com/scikit-learn/scikit-learn/blob/a24c8b464d094d2c468a16ea9f8bf8d42d949f84/sklearn/feature_selection/rfe.py#L273). I wanted to print the progress but the default verbose printing prints too many messages. I didn't want to monkey patch sys.stdout to make this work and overriding the logger would be the simple and clean solution.

There are other issues in sklearn such as #8105 and #10973 that would benefit from real logging in sklearn. Overall, I think logging would be a great addition to sklearn.

you're welcome to work on it. perhaps a callback system is better than
logging

I'm a bit busy right now but I support customizable logging in sklearn in whatever form (although I prefer standard python logging).

Has there been any discussion about what verbose=True will mean when scikit-learn starts using logging? We're dealing with this a bit in dask-ml: https://github.com/dask/dask-ml/pull/528.

Given that libraries aren't supposed to do logging configuration, it's up to the user to configure their "application" (which may just be a script or interactive session) to log things appropriately. This isn't always easy to do correctly.

My proposal in https://github.com/dask/dask-ml/pull/528 is for verbose=True to mean "temporarily configure logging for me". You can use a context manager to configure logging, and scikit-learn would want to ensure that INFO level messages are printed to stdout to match the current behavior.
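For illustration, a rough sketch of what "temporarily configure logging for me" could look like (the context manager below is hypothetical, not an existing scikit-learn or dask-ml API):

```python
import contextlib
import logging
import sys


@contextlib.contextmanager
def verbose_logging(verbose=True):
    # Hypothetical helper: temporarily attach a stdout handler at INFO level
    # to the "sklearn" logger, then restore the previous state afterwards.
    logger = logging.getLogger("sklearn")
    if not verbose:
        yield logger
        return
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.INFO)
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    try:
        yield logger
    finally:
        logger.removeHandler(handler)
        logger.setLevel(previous_level)


# usage: messages emitted inside the block show up on stdout
with verbose_logging(True) as log:
    log.info("iteration 1, loss=0.5")
```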

Does temporarily mean also that the handler set up is specific to that
estimator or estimator type?

My proposal in dask/dask-ml#528 is for verbose=True to mean "temporarily configure logging for me".

This seems like a good balance. Using the logging module directly is not all that user friendly. Another "hack" would be to use info by default, but when a user sets verbose=True the logs can be elevated to warning. This would work because warnings are displayed by default.

I think changing the level of specific messages when the user asks for more
verbosity is exactly the opposite of how the logging module is meant to
work. But the local handler could change from warning to info to debug
level on stream as verbose increases
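
For concreteness, a rough sketch of that idea (the exact verbose-to-level mapping below is only an illustration, not a decided API):

```python
import logging
import sys


def handler_for_verbose(verbose):
    # Illustrative mapping: messages keep their levels; only the local
    # handler's threshold changes as verbose increases.
    handler = logging.StreamHandler(sys.stdout)
    if verbose <= 0:
        handler.setLevel(logging.WARNING)  # default: warnings only
    elif verbose == 1:
        handler.setLevel(logging.INFO)     # progress messages too
    else:
        handler.setLevel(logging.DEBUG)    # everything
    return handler
```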

@jnothman's comment matches my thoughts. scikit-learn would always emit the message, and the verbose keyword controls the logger level and handlers.

But the local handler could change from warning to info to debug
level on stream as verbose increases

Okay, let's go with this. The logging levels are listed at https://docs.python.org/3/library/logging.html#logging-levels. By default we can log at INFO, which is not shown by default. When verbose=1, we have the handler change info -> warning and debug -> info. When verbose>=2, we still have info -> warning but also debug -> warning, and the estimator can interpret verbose>=2 as "emit more debug messages as verbose increases". This meaning can differ between estimators.

What do you think?

Hi, I'm very interested in this issue. I have some experience with logging and would love to help implement an enhancement here if some consensus and plan is reached.

might be helpful to recap ideas mentioned here:

  1. use a callback pattern
  2. change the level of the message, depending on verbose:

    if verbose:
        logger.debug(message)
    else:
        logger.info(message)

  3. change the level of the logger, depending on verbose:

    if verbose:
        logger.setLevel("DEBUG")

  4. add a handler with level DEBUG, depending on verbose:

    if verbose:
        verbose_handler = logging.StreamHandler()
        verbose_handler.setLevel("DEBUG")
        logger.addHandler(verbose_handler)

My take on these options:

Option 1 or option 4 would probably be best.

  • Option 1 (callbacks) is good in that it's most agnostic (people can log things however they want). But it might be less flexible from a messaging / state capture perspective. (Aren't callbacks only called once or once per some internal loop iteration?)
  • Option 2, as discussed here, I think is misusing the logging library
  • Option 3 works but, I think, defeats part of the purpose of using the logging library. If sklearn uses logging, then users can adjust verbosity via logging itself, e.g. import logging; logging.getLogger("sklearn").setLevel("DEBUG").
  • Option 4 is probably most canonical. The docs suggest _not_ creating handlers in library code other than NullHandlers, but I think here it makes sense, given that sklearn has verbose flags. In this case, log printing is a "feature" of the library.

A fifth option would be to remove verbose flags, use logging everywhere, and let users adjust verbosity via the logging API. This is what logging was designed for, after all.
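
To make the fifth option concrete, the user-side configuration would be a couple of lines (assuming messages go to a logger named "sklearn", which is an assumption here):

```python
import logging

# Show scikit-learn's hypothetical INFO-level progress messages,
# while keeping everything else at the default WARNING level.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("sklearn").setLevel(logging.INFO)
```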

@grisaitis thanks! See also more recent related discussions in https://github.com/scikit-learn/scikit-learn/issues/17439 and https://github.com/scikit-learn/scikit-learn/pull/16925#issuecomment-638956487 (regarding callbacks). Your help would be very much appreciated, the main issue is we haven't decided yet what approach would be best :)

I would support removing verbose, as I find per-estimator configuration
frustrating, and the numeric values of verbose arbitrary, poorly
documented, etc. Per-class configuration would be handled by having
multiple scikit-learn logger names.

A fifth option would be to remove verbose flags, use logging everywhere, and let users adjust verbosity via the logging API. This is what logging was designed for, after all.

I would support removing verbose, as I find per-estimator configuration
frustrating, and the numeric values of verbose arbitrary, poorly
documented,

I think getting rid of verbose and using logging levels would be very nice. The only downside I see is that it would make logging slightly less discoverable.

Also, one thing that logging provides is that you can attach extra information to each logging message, not just strings, but a whole dict of useful stuff. So if you want to report loss during learning, you can do that and store the numerical value. Even more, you can both store the numerical value as a number and use it to format a user-friendly string, like: logger.debug("Current loss: %(loss)s", {'loss': loss}). I find that very useful in general and would love it if sklearn exposed that as well.
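
For instance (the logger name and the loss message here are purely illustrative):

```python
import logging

logger = logging.getLogger("sklearn.mixture")  # illustrative logger name


def em_step(loss):
    # The dict is kept on the record (record.args), so a custom handler can
    # read the number back, while the formatter uses it for the text.
    logger.debug("Current loss: %(loss)s", {"loss": loss})
```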

I think having module- or estimator-level loggers is a bit overkill for now, and we should start with something simple that allows us to extend it later.
Also, whatever we do should allow users to reasonably easily reproduce the current behavior.

How do logging and joblib interact? The logging level is not preserved (as expected I guess):

```python
import logging
from joblib import Parallel, delayed

logger = logging.getLogger('sklearn')
logger.setLevel(2)

def get_level():
    another_logger = logging.getLogger('sklearn')
    return another_logger.level

results = Parallel(n_jobs=2)(
    delayed(get_level)() for _ in range(2)
)
results
```

```
[0, 0]
```

But that's probably not needed, since this works:
```python
import logging
import sys
from joblib import Parallel, delayed

logger = logging.getLogger('sklearn')
logger.setLevel(1)

handler = logging.StreamHandler(sys.stdout)
logger.addHandler(handler)


def log_some():
    another_logger = logging.getLogger('sklearn')
    another_logger.critical("log something")

results = Parallel(n_jobs=2)(
    delayed(log_some)() for _ in range(2)
)
```

Honestly, I'm not entirely sure how this works though.

both stdout and stderr don't show up in jupyter btw.

My dream: being able to get an approximation of the current behavior with a single line, but also being able to easily use progress bars or plot convergence instead.

re verbose: it's probably cleaner to deprecate verbose, but deprecating verbose and not having estimator-level logging will make it a bit trickier to log one estimator but not another. I think it's fine to have the user filter the messages, though.
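
For example, a user could attach a filter that keeps only one estimator's messages, assuming the records carried the estimator name somehow (that part is hypothetical):

```python
import logging


class OnlyEstimator(logging.Filter):
    """Keep only records coming from one estimator.

    Assumes records carry the estimator name as an ``estimator`` attribute
    (e.g. passed via ``extra``) -- that part is hypothetical.
    """

    def __init__(self, name):
        super().__init__()
        self.estimator_name = name

    def filter(self, record):
        return getattr(record, "estimator", None) == self.estimator_name


handler = logging.StreamHandler()
handler.addFilter(OnlyEstimator("RandomForestClassifier"))
logging.getLogger("sklearn").addHandler(handler)
```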

hey all, thanks for the friendly replies and info. i read the other issues and have some thoughts.

joblib will be tricky. i have some ideas though.

@amueller that is very weird. i reproduced your example. things do work with concurrent.futures.ProcessPoolExecutor, which i think joblib uses...

seems like joblib is nuking the state in logging. any joblib experts have ideas of what could be going on?

```python
import concurrent.futures
import logging
import os

logger = logging.getLogger("demo🙂")
logger.setLevel("DEBUG")

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(process)d (%(processName)s) %(levelname)s:%(name)s:%(message)s")
)
logger.addHandler(handler)

def get_logger_info(_=None):
    another_logger = logging.getLogger("demo🙂")
    print(os.getpid(), "another_logger:", another_logger, another_logger.handlers)
    another_logger.warning(f"hello from {os.getpid()}")
    return another_logger

if __name__ == "__main__":
    print(get_logger_info())

    print()
    print("concurrent.futures demo...")
    with concurrent.futures.ProcessPoolExecutor(2) as executor:
        results = executor.map(get_logger_info, range(2))
        print(list(results))

    print()
    print("joblib demo (@amueller's example #2)...")
    from joblib import Parallel, delayed
    results = Parallel(n_jobs=2)(delayed(get_logger_info)() for _ in range(2))
    print(results)
```

which outputs

```
19817 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19817 (MainProcess) WARNING:demo🙂:hello from 19817
<Logger demo🙂 (DEBUG)>

concurrent.futures demo...
19819 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19819 (SpawnProcess-1) WARNING:demo🙂:hello from 19819
19819 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19819 (SpawnProcess-1) WARNING:demo🙂:hello from 19819
[<Logger demo🙂 (DEBUG)>, <Logger demo🙂 (DEBUG)>]

joblib demo (@amueller's example #2)...
19823 another_logger: <Logger demo🙂 (WARNING)> []
hello from 19823
19823 another_logger: <Logger demo🙂 (WARNING)> []
hello from 19823
[<Logger demo🙂 (DEBUG)>, <Logger demo🙂 (DEBUG)>]
```

I think you should configure the processes joblib spawns to send logging messages to the main logger in the main process. Then one can control logging in the main process only. Something like this or this. So there are logging queue sinks and sources, and you can tie them together. We use this on our cluster to send all logging from all workers on all machines to a central location, to show it on the user's computer.

@mitar i agree, i think that might be the best bet. (not necessarily network-based queues, but inter-process communication queues)

i'm actually coding up an example using logging's QueueHandler / QueueListener right now, to test with joblib and concurrent.futures. will follow up here.
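
roughly the shape i have in mind (a minimal sketch using multiprocessing directly rather than joblib; the worker function and logger setup are just illustrative):

```python
import logging
import logging.handlers
import multiprocessing


def worker(queue, i):
    # In each worker process, route records into the shared queue.
    logger = logging.getLogger("sklearn")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info("hello from task %d", i)


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    queue = manager.Queue()
    # The listener runs in the main process and hands records to whatever
    # handlers the user configured there (here: a plain StreamHandler).
    listener = logging.handlers.QueueListener(queue, logging.StreamHandler())
    listener.start()
    with multiprocessing.Pool(2) as pool:
        pool.starmap(worker, [(queue, i) for i in range(2)])
    listener.stop()
```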

also love your other suggestion:

Also, one thing that logging provides is that you can attach extra information to each logging message, not just strings, but a whole dict of useful stuff. So if you want to report loss during learning, you can do that and store the numerical value. Even more, you can both store the numerical value as a number and use it to format a user-friendly string, like: logger.debug("Current loss: %(loss)s", {'loss': loss}). I find that very useful in general and would love it if sklearn exposed that as well.

i implemented a visualization of gaussian mixture modeling using the extra param and a custom Handler class. works super nicely for passing state around, and letting the user decide how to handle the state.

also re joblib's idiosyncrasies that i noticed above... i'm going to accept that as is and out of scope. a queue based design would be most flexible anyway.

the only limitation of using a QueueHandler that i can think of is that any extra state (logger.debug("message", extra={...})) must be serializable for the queue. so no numpy arrays. :/ can't think of any other issues though

i'm actually coding up an example using QueueHandler / QueueListener right now,

Yes, you should always use queue handler, because you never know when the sending over the socket blocks and you do not want to slow down the model because of logging blocking.

Also, you do not even have to use extra. I think logger.debug("message %(foo)s", {'foo': 1, 'bar': 2}) just works.

Yes, you should always use queue handler, because you never know when the sending over the socket blocks and you do not want to slow down the model because of logging blocking.

for the joblib case, if we implemented QueueHandler / QueueListener, what state would we have to pass to the process pool? just the queue, right?

Also, you do not even have to use extra. I think logger.debug("message %(foo)s", {'foo': 1, 'bar': 2}) just works.

thanks yes. i find it useful also to log state without converting it to text. e.g. including a numpy array in extra, and doing real-time data visualization (visual logging in a way) with a custom logging handler in a jupyter notebook. this would be SUPER nice with sklearn, and looks like @rth has been doing similar work with callbacks.
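
as a toy example of that kind of handler (the "loss" field and logger name are made up for illustration):

```python
import logging


class StateCollector(logging.Handler):
    """Toy handler: collect state passed via extra instead of printing text."""

    def __init__(self):
        super().__init__()
        self.losses = []

    def emit(self, record):
        # "loss" here is a hypothetical field passed as extra={"loss": ...}
        if hasattr(record, "loss"):
            self.losses.append(record.loss)


collector = StateCollector()
logger = logging.getLogger("sklearn")
logger.addHandler(collector)
logger.setLevel(logging.INFO)
logger.info("EM step", extra={"loss": 0.42})
# a notebook cell could now plot collector.losses live
```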

for the joblib case, if we implemented QueueHandler / QueueListener, what state would we have to pass to the process pool? just the queue, right?

I think so. I have not used this across process boundaries, but it seems they have documented support for multiprocessing, so it should work with joblib as well. I am using QueueHandler / QueueListener inside the same process, to decouple logging writes from logging transport: QueueHandler -> QueueListener -> send to central logging service. But from the documentation it looks like it can work through a multiprocessing queue.

i find it useful also to log state without converting it to text

Yes. What I am saying is that you do not have to use extra; you can just pass the dict directly and then use only a few items from it for message formatting. (Note that what goes into the format string is decided by users of the sklearn library, not by sklearn itself; one can always configure the formatting to include something you didn't expect, so what exactly is converted to text is not really under sklearn's control.) All values in extra can be used for message formatting as well, so I am not sure how useful extra is. But I am also not saying we should not use it: it is much more explicit about what is the payload for the string on the left and what is extra. So we could also use both. I just wanted to make sure this alternative is known.

@grisaitis FYI if you mention a name in a commit, every time you do anything with the commit (like rebasing it or merging it or pushing it), the person gets an email, so it's generally discouraged ;)

Sorry about that Andreas! That's embarrassing... I was just trying to have well-documented commits lol. Will avoid in the future.

In that repo I figured out how logging can work with joblib with a QueueHandler / QueueListener combo. Appears to work nicely.

As a first step I'll implement logging with that approach in a part of sklearn where joblib is used. Maybe one of the ensemble models. Will open a new PR.

for the joblib case, if we implemented QueueHandler / QueueListener,

Yes, it sounds like it would be necessary to start/stop a monitoring thread (here QueueListener) both if use the logging module and callbacks in the case of multiprocessing (approximate example of callbacks with multiprocessing in https://github.com/scikit-learn/scikit-learn/pull/16925#issuecomment-656184396)

So I figure the only reason why what I did above "worked" was that it printed to stdout which was the shared resource and print is threadsafe in python3 or something like that?

So I figure the only reason why what I did above "worked" was that it printed to stdout which was the shared resource and print is threadsafe in python3 or something like that?

Print is not thread safe. They just print to the same file descriptor. If you ran it for longer, you would probably see that messages sometimes interleave and lines get mixed up.
