Tpot: max_eval_time_mins parameter doesn't stop a long-running eval

Created on 24 Jun 2017 · 16 Comments · Source: EpistasisLab/tpot

During fit(), I noticed that some evaluations ran for over an hour even though max_eval_time_mins=5 was set.
This is because the code never actually stops the thread doing the evaluation.

In class Interruptable_cross_val_score, stop() does not actually interrupt the thread. It calls self._stopevent.set(), which has no effect because nothing checks that event, and then waits for the thread to finish on its own.

Python doesn't have a good way to interrupt threads. See the discussion at https://stackoverflow.com/questions/323972/is-there-any-way-to-kill-a-thread-in-python
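The behaviour described above can be reproduced without TPOT. The following is a minimal sketch, not TPOT's code: `uncooperative_work` and `cooperative_work` are hypothetical stand-ins for an evaluation, and the sleep durations are arbitrary. It shows that setting a `threading.Event` only stops a thread whose target polls that event.

```python
import threading
import time


def uncooperative_work(duration):
    # Stand-in for a long evaluation: it never looks at any stop event.
    time.sleep(duration)


def cooperative_work(stop_event, duration):
    # Polls the event regularly, so a stop request actually takes effect.
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        if stop_event.wait(0.01):
            return


stop_event = threading.Event()

# Case 1: the worker ignores the event, so set() changes nothing.
t1 = threading.Thread(target=uncooperative_work, args=(0.5,))
t1.start()
stop_event.set()               # request a stop; nothing is listening
t1.join(timeout=0.1)
ignored_stop = t1.is_alive()   # the thread is still running
t1.join()                      # we can only wait for it to finish by itself

# Case 2: the worker checks the (already set) event and returns early.
t2 = threading.Thread(target=cooperative_work, args=(stop_event, 0.5))
t2.start()
t2.join(timeout=0.2)
honoured_stop = not t2.is_alive()
```

Cooperative polling is only possible when the long-running code is Python that can be modified; a fit routine running inside a compiled extension has no opportunity to check the event.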

Given that _wrapped_cross_val_score() is already running in a separate process due to joblib, one solution would be to make Interruptable_cross_val_score a daemon thread, and then remove the call to tmp_it.stop() in _wrapped_cross_val_score(). Thus when the timeout passes, the process will exit cleanly.

Labels: being worked on, bug

dnuffer

Most helpful comment

I just posted PR #522, which uses the same approach as hyperopt-sklearn to kill the child process. Could you try this branch with the command below and let us know if it corrects your issue? @dnuffer @CSNoyes

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu2016/tpot.git@timeout_pipe

@rhiever For this approach, I need to use the threading backend in joblib instead of multiprocessing. It may not be as efficient as before for parallel computing.

One drawback is that CTRL+C only works on Linux and Mac but not on Windows, so I added a warning message about it.

weixuanfu on 5 Jul 2017

👍3

All 16 comments

Thank you for the suggestion. I will look into it. One of my branches uses the stopit module for the timeout function. I need to check whether it is better than the daemon thread solution.

weixuanfu on 25 Jun 2017

@weixuanfu2016's PR, which should fix this issue, has been merged into the development branch. @dnuffer, can you try the dev branch and let us know if that corrects your issue?

rhiever on 27 Jun 2017

I tried the development branch, and it didn't fix the issue. I think it's probably because the stopit module is a pure Python solution and so can only interrupt a thread once it runs some Python code. Since the core of most ML training algorithms is written in non-Python code, stopit won't get a chance to interrupt the thread.

I also tried my earlier suggestion, but it doesn't work because a pool of processes is used, and a process doesn't exit once an evaluation is complete, leaving the threads running.

I have successfully used the timeout feature in hyperopt-sklearn, and so I dug into how it works.
This is the code: https://github.com/hyperopt/hyperopt-sklearn/blob/master/hpsklearn/estimator.py
The trial_timeout variable controls how long each trial is allowed to run, and fn_with_timeout() is where the action happens. Each trial is run in a separate process (using multiprocessing.Process), with a Pipe for communicating the result at the end of _cost_fn(). When a timeout happens, the child process is terminated, which guarantees a clean exit.
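The process-plus-pipe pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the hyperopt-sklearn or TPOT implementation: `run_with_timeout`, `_child` and `slow_evaluation` are hypothetical names, and the sketch prefers the "fork" start method where it exists so that it runs without a `__main__` guard.

```python
import multiprocessing
import time

# Assumption: prefer "fork" where available so the sketch runs without a
# __main__ guard; with "spawn" or "forkserver" such a guard is required.
_METHODS = multiprocessing.get_all_start_methods()
_CTX = multiprocessing.get_context("fork" if "fork" in _METHODS else None)


def _child(conn, fn, args):
    """Run fn in the child and send a (status, payload) tuple to the parent."""
    try:
        conn.send(("ok", fn(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(fn, args=(), timeout=1.0):
    """Hypothetical helper: evaluate fn(*args) in a child process and
    terminate that process if no result arrives within `timeout` seconds."""
    recv_end, send_end = _CTX.Pipe(duplex=False)
    proc = _CTX.Process(target=_child, args=(send_end, fn, args))
    proc.start()
    send_end.close()  # the parent only reads from the pipe
    try:
        if not recv_end.poll(timeout):
            return ("timeout", None)
        try:
            return recv_end.recv()
        except EOFError:
            return ("error", "child exited without sending a result")
    finally:
        if proc.is_alive():
            proc.terminate()  # works even if the child is inside native code
        proc.join()
        recv_end.close()


def slow_evaluation(seconds):
    """Stand-in for a pipeline evaluation that may take too long."""
    time.sleep(seconds)
    return seconds
```

Calling `run_with_timeout(slow_evaluation, (5,), timeout=0.2)` returns `("timeout", None)` after roughly 0.2 seconds, because the parent kills the whole child process instead of asking a thread to stop. The cost is one process start per evaluation.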

dnuffer on 2 Jul 2017

Hmm, interesting. Thank you for these tests. I will look into it.

weixuanfu on 2 Jul 2017

I can report similar issues. Using the development branch also does not fix it.

CSNoyes on 5 Jul 2017

@weixuanfu2016 Tested on OSX 10.12 and Ubuntu 14.04 with a high-dimensionality dataset (I think poly features was getting stuck). Looks good so far. Will update if it creeps back in.

CSNoyes on 5 Jul 2017

The timeout_pipe branch has fixed the issue for me.

dnuffer on 9 Jul 2017

@dnuffer @CSNoyes Thank you for the feedback.

@rhiever and I had second thoughts about this issue. We think it might be related to the start methods in multiprocessing. I also reproduced the freezing issue with n_jobs > 1 on macOS and Linux, but everything seems fine with n_jobs = 1.

@dnuffer @CSNoyes Could you please let me know the system environment, Python version and n_jobs setting you were using when this issue happened? Thanks.

weixuanfu on 11 Jul 2017

I tested the forkserver solution with n_jobs > 1, and it at least solved the freezing issue when using TPOT on very large datasets. I put the demo in a branch, which can only be tested on Linux and macOS with Python 3.4+.

weixuanfu on 11 Jul 2017

I have been using Ubuntu 17.04 with Python 3.5.3. I've mostly been using n_jobs=22. I am using a dataset with a dimensionality of ~17000.

dnuffer on 11 Jul 2017

@dnuffer @CSNoyes

Below is a demo that uses forkserver to solve this issue on Linux and macOS. I am still considering whether we should put this solution into the code base. It seems that it is not easy to use forkserver with joblib in interactive mode. Maybe we should provide a friendly warning message about this and/or document the solution in the demo below, as the scikit-learn Q&A does. Please let me know if the issue still exists with the demo below in your environment. Thanks.

import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    # Note: sklearn and tpot must be imported inside the main block, after
    # set_start_method(); otherwise a RuntimeError ("context has already been set") is raised.
    from sklearn.datasets import make_classification
    from tpot import TPOTClassifier

    # Make a huge dataset
    X, y = make_classification(n_samples=50000, n_features=200,
                               n_informative=20, n_redundant=20,
                               n_classes=5, random_state=42)

    # max_eval_time_mins=0.04 limits the evaluation of a single pipeline to about 2.4 seconds.
    # Works with Python 3.4+ on Linux and macOS.
    tpot = TPOTClassifier(generations=5, population_size=50, offspring_size=100,
                          random_state=42, n_jobs=2, max_eval_time_mins=0.04, verbosity=3)
    tpot.fit(X, y)
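The "context has already been set" RuntimeError mentioned in the demo comes from calling set_start_method() after a start method has already been fixed. A minimal sketch of a guard, assuming only the standard multiprocessing API (`ensure_start_method` is a hypothetical helper, not part of TPOT or joblib):

```python
import multiprocessing


def ensure_start_method(preferred="forkserver"):
    """Hypothetical helper: set the start method only if none is fixed yet."""
    current = multiprocessing.get_start_method(allow_none=True)
    available = multiprocessing.get_all_start_methods()
    if current is None and preferred in available:
        multiprocessing.set_start_method(preferred)
    # Return whichever method is now in effect (this fixes the platform
    # default if nothing was set above).
    return multiprocessing.get_start_method()
```

The guard avoids the exception but does not force the preferred method: if another import has already fixed a different start method, the helper returns that one, so the caller should check the return value before relying on forkserver.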

weixuanfu on 20 Jul 2017

The PR branch worked for me on a big dataset (approx. 600 MB). The 0.8/0.9 branch freezes.
Could we restart the discussion about a permanent fix? It seems the PR was closed without merging.

jaksmid on 4 Oct 2017

@jaksmid Did version 0.9 freeze with the forkserver start method? Or maybe it is a memory issue with a large n_jobs value. Could you please provide more details about the issue in your environment?

The reason why I closed that PR is that it did not save computation time with n_jobs > 1 in my tests.

weixuanfu on 4 Oct 2017

Thanks @weixuanfu for the speedy response.

If I add the

import multiprocessing
multiprocessing.set_start_method('forkserver')

lines, it seems to work. Otherwise it utilises all cores at 100%, and after some time the CPU consumption per core drops to zero with no observable progress. Memory pressure does not seem to be a problem.

Using Python 3.6.0 in a virtualenv on macOS Sierra.

Please let me know if you need further information.

jaksmid on 4 Oct 2017

In my experiments, tpot still ignores the max_eval_time_mins=5 parameter for datasets between 1000 and 5000 observations (5 to 25 columns). When fit() is called, tpot runs for an indefinitely long time period (at least several hours).

While I am able to stop the process by using the early_stop parameter, I would really like to set a specific time period.

I am using version 0.9.3 of tpot, python 3.6.2 and OSX 10.13.6. tpot runs in single thread mode (n_jobs=1).

Please let me know if you need any further information.