tpot hangs mid-pipeline evaluation (didn't happen in 0.7 version)

Created on 30 Apr 2017 · 20 Comments · Source: EpistasisLab/tpot

I've been seeing TPOT hang lately.
At first I thought njobs might be related, so I tested with njobs=4, with njobs=6, and without setting njobs at all; it does not seem to be.

The hang is reproducible: the same parameters and the same input lead to a hang on the same pipeline in the order of evaluated pipelines.

I will try to find out which pipeline hung the process by adding some better traceback reporting on CTRL+C, but we might need to add child-process stdin/out/err retrieval in cases like this to find out what happened.
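One low-effort way to get the kind of traceback reporting described above is the `faulthandler` module. The following is a minimal sketch, not something TPOT does: it assumes Python 3 on a POSIX system (the log in this report comes from Python 2.7), and the helper names `enable_hang_diagnostics` and `dump_stacks_to_text` are hypothetical.

```python
import faulthandler
import signal
import sys
import tempfile


def enable_hang_diagnostics(log_file=None, signum=signal.SIGUSR1):
    """Dump every thread's stack to log_file when signum arrives.

    The file must stay open for as long as the handler is registered.
    """
    if log_file is None:
        log_file = sys.stderr
    # Also report fatal errors such as segmentation faults.
    faulthandler.enable(file=log_file, all_threads=True)
    # `kill -USR1 <pid>` now prints the stacks without stopping the process.
    faulthandler.register(signum, file=log_file, all_threads=True)


def dump_stacks_to_text():
    """Return the current stack of every thread as a string."""
    with tempfile.TemporaryFile(mode="w+") as fh:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=fh, all_threads=True)
        fh.seek(0)
        return fh.read()
```

Calling `enable_hang_diagnostics()` at the start of a run and later sending `SIGUSR1` to the stuck process (parent or child) should show where each thread is blocked, without having to interrupt the run with CTRL+C.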

It seems the timeout didn't apply, and the child process crashed by itself.

Context of the issue

Running a standard tpot optimization process on some data - checked with different data sizes/types, all numerical of course.

The data was 17 columns x ~350k lines, all numerical (float); the whole file is roughly 16MB in size.
We also saw this hang on a 43MB input file with ~48k input lines and 225 columns, also all numerical.

It doesn't seem to happen on every run; I suspect it's a certain pipeline/scenario that causes the problem.

Process to reproduce the issue

Not sure how to reproduce this easily. One example was a pandas DataFrame 16MB in size, on which we ran:

user@centos_machine:~/research_folder$ tpot input.csv -is , -target class -o final_pipeline.py -maxtime 100000 -p 90 -os 30 -cv 5 -s 13 -v 3 -scoring roc_auc -maxeval 120

Expected result

Keep evaluating pipelines, or at least report why it hung and then died.

Current result

user@centos_machine:~/research_folder$ tpot input.csv -is , -target class -o final_pipeline.py -maxtime 100000 -p 90 -os 30 -cv 5 -s 13 -v 3 -scoring roc_auc -maxeval 120

TPOT settings:
CONFIG_FILE     =
CROSSOVER_RATE  =       0.1
GENERATIONS     =       100
INPUT_FILE      =       input.csv
INPUT_SEPARATOR =       ,
MAX_EVAL_MINS   =       120.0
MAX_TIME_MINS   =       100000
MUTATION_RATE   =       0.9
NUM_CV_FOLDS    =       5
NUM_JOBS        =       1
OFFSPRING_SIZE  =       30
OUTPUT_FILE     =       final_pipeline.py
POPULATION_SIZE =       90
RANDOM_STATE    =       13
SCORING_FN      =       roc_auc
TARGET_NAME     =       class
TPOT_MODE       =       classification
VERBOSITY       =       3

/***/env/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
30 operators have been imported by TPOT.

_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=1 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=1 Unsupported set of arguments: The combination of penalty='l1' and loss='hinge' is not supported, Parameters: penalty='l1', loss='hinge', dual=True
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 15.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=1 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='hinge' is not supported, Parameters: penalty='l1', loss='hinge', dual=True
_pre_test decorator: _generate: num_test=0 max_features must be in (0, n_features]
_pre_test decorator: _generate: num_test=1 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 max_features must be in (0, n_features]
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
_pre_test decorator: _generate: num_test=1 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 'SVC'
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 70
_pre_test decorator: _generate: num_test=1 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
[14:22:21] dmlc-core/include/dmlc/logging.h:235: [14:22:21] src/tree/updater_colmaker.cc:161: Check failed: (n) > (0) colsample_bytree=1 is too small that no feature can be included
_pre_test decorator: _generate: num_test=0 [14:22:21] src/tree/updater_colmaker.cc:161: Check failed: (n) > (0) colsample_bytree=1 is too small that no feature can be included
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 Input X must be non-negative
_pre_test decorator: _generate: num_test=0 'SVC'
Invalid pipeline encountered. Skipping its evaluation.
Invalid pipeline encountered. Skipping its evaluation.
Optimization Progress:  19%|██████████████████████████▎                                                                                                                | 17/90 [11:50<45:14, 37.18s/pipeline
Process PoolWorker-1:
Traceback (most recent call last):
  File "/***/env/lib/python2.7/site-packages/multiprocess/process.py", line 258, in _bootstrap
    self.run()
  File "/***/env/lib/python2.7/site-packages/multiprocess/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/***/env/lib/python2.7/site-packages/multiprocess/pool.py", line 122, in worker
    put((job, i, (False, wrapped)))
  File "/***/env/lib/python2.7/site-packages/multiprocess/queues.py", line 395, in put
    return send(obj)
IOError: [Errno 32] Broken pipe

TPOT closed prematurely. Will use the current best pipeline.
Traceback (most recent call last):
  File "/***/env/bin/tpot", line 11, in <module>
    sys.exit(main())
  File "/***/env/lib/python2.7/site-packages/tpot/driver.py", line 272, in main
    tpot.export(args.OUTPUT_FILE)
  File "/***/env/lib/python2.7/site-packages/tpot/base.py", line 629, in export
    raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

This command froze on the 17th pipeline it was evaluating. It crashed after what we estimate was longer than the 120-minute per-pipeline evaluation timeout I had set, with the child process simply dying and no information about what the child output was or what it crashed on.
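To illustrate what retrieving information from a dying child could look like, here is a minimal sketch using only the standard library. It is not TPOT's or joblib's implementation: the names `run_and_report` and `_child_entry` are hypothetical, it assumes Python 3 on a POSIX system (the `fork` start method), and it deliberately has no timeout, so it only covers the "what did the child die of" part.

```python
import multiprocessing as mp
import traceback


def _child_entry(conn, func, args):
    """Run func in the child and ship the result or the traceback back."""
    try:
        conn.send(("ok", func(*args)))
    except BaseException:
        conn.send(("error", traceback.format_exc()))
    finally:
        conn.close()


def run_and_report(func, args=()):
    """Run func(*args) in a forked child; return (status, payload, exitcode)."""
    ctx = mp.get_context("fork")  # assumption: POSIX; fork avoids pickling func
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child_entry, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # keep only the child's copy of the write end open
    try:
        status, payload = parent_conn.recv()  # blocks until a message or EOF
    except EOFError:
        # The child exited without reporting (killed, segfault, os._exit).
        status, payload = "died", None
    proc.join()
    parent_conn.close()
    # A negative exitcode -N means the child was terminated by signal N.
    return status, payload, proc.exitcode
```

With something like this, a Python-level failure comes back as a full traceback string, while a child that vanishes silently is at least reported as `"died"` together with its exit code.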

Possible fix

I am going to investigate this myself at some point, since TPOT becomes unusable if roughly 1 in 3 runs freezes mid-work.

question

kuratsak

All 20 comments

Thank you for reporting this issue and for all the detailed information, both what is already posted and what is still to come. I will also look into it tomorrow.

weixuanfu on 30 Apr 2017

Have you tried the TPOT light configuration (http://rhiever.github.io/tpot/using/#built-in-tpot-configurations) to see if that regularly freezes on you? Some of the default TPOT operators are very expensive, especially for large datasets.

rhiever on 30 Apr 2017

Nope, I'll be back in the office on Wednesday, can try then.

kuratsak on 30 Apr 2017

Meanwhile, I've taken a look. I can't work out when TPOT decides to fork and when it doesn't, but the hanging problems seem to occur only when forks are spawned. The log posted above was from a run with the default njobs=1, and for some reason it still spawned a child process.

I suspect this has to do with joblib behaving oddly in certain cases; the following search turns up multiple reports that sound similar to this:

https://www.google.com/search?q=skleanr+Parallel+joblib+hangs&oq=skleanr+Parallel+joblib+hangs&aqs=chrome..69i57j0.4739j0j1&sourceid=chrome&ie=UTF-8#q=sklearn+Parallel+joblib+hangs

An Example:

https://github.com/joblib/joblib/issues/125

Some reports mention large matrices; others mention crashes in a worker process.
It might be that after a timeout on one pipeline, the next fork hangs (people describe a crash causing the NEXT fork to hang).

I will investigate further when I have time.

Interestingly enough, the timeout in TPOT doesn't seem to kill the spawned processes that hang, even after the timeout hits (or maybe it does, but not in the orderly "move on" fashion we'd like).
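For reference, an orderly "kill it and move on" timeout at the process level can be sketched as follows. This is an illustration under stated assumptions, not how TPOT enforces its evaluation timeout: `run_with_hard_timeout` is a hypothetical helper, and it assumes Python 3.7+ on a POSIX system.

```python
import multiprocessing as mp


def run_with_hard_timeout(func, args=(), timeout_s=60.0, grace_s=1.0):
    """Run func(*args) in a child process and kill it if it overruns.

    Returns (finished, exitcode). The child's return value is discarded
    here; a real evaluator would send it back through a Pipe or Queue.
    """
    ctx = mp.get_context("fork")  # assumption: POSIX only
    proc = ctx.Process(target=func, args=args)
    proc.start()
    proc.join(timeout_s)
    if not proc.is_alive():
        return True, proc.exitcode
    proc.terminate()  # polite request: SIGTERM
    proc.join(grace_s)
    if proc.is_alive():
        proc.kill()  # SIGKILL cannot be caught or ignored (Python 3.7+)
        proc.join()
    return False, proc.exitcode
```

Because the parent always joins the child before returning, it can record the pipeline as timed out and continue with the next one instead of waiting on a worker that will never answer.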

kuratsak on 1 May 2017

Thank you for these details. After more tests, I think the issue is caused by a bad interaction between multiprocessing and third-party libraries and, as you mentioned above, similar issues have been reported in both the joblib and scikit-learn repos, especially with large datasets. Unfortunately, so far I have not found a clean way to fix this issue with joblib in both Python 2.x and Python 3.x. I also tried the dask module in one of my branches, which seems more stable than joblib, but I still found that evaluations sometimes freeze. You may try this branch with your datasets. I think we need more tests to solve this issue, which I think is also related to #422.

weixuanfu on 1 May 2017

I think I solved this issue for large datasets in the branch mentioned above. It is about memmapping of large arrays. I will post a PR soon.

Update: One of my tests shows that it still has issues with the forking methods in multiprocessing. More workarounds are needed.

weixuanfu on 2 May 2017

Related issues in sklearn:

3195
5115

weixuanfu on 2 May 2017

The freezing issue may come from scikit-learn and joblib, so the freezing still happens even without joblib in TPOT, since scikit-learn uses its own joblib in multiple functions and classes. In PR #440, I used dask instead and found that the freezing time was shorter. I also tried wrapping these sklearn objects with dask so that dask handles the multiprocessing, but dask does not allow this kind of nested threading.

weixuanfu on 2 May 2017

I don't mind delays, that's reasonable. But an infinite hang and subsequent failure is the problem. I'm talking about leaving TPOT running over a weekend and coming back to an empty, failed process.
This didn't happen in 0.7, I wonder what changed.

kuratsak on 2 May 2017

In 0.7, we used pathos (https://pypi.python.org/pypi/pathos/0.2.0) for multiprocessing. But it has some unknown issue working with xgboost and does not support Windows, so we switched to the joblib in scikit-learn in 0.7.1.

weixuanfu on 2 May 2017

Interesting, perhaps I'll try using pathos to see if it fixes my problem. Specifically, I used TPOT many times in that version with xgboost (on CentOS x64, though) and had no trouble at all.

kuratsak on 2 May 2017

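Another change is in the timeout function. In 0.7, we used a signal-based timeout; since then we have switched to a thread-based timeout function. Thread-based is safer for multiprocessing (https://github.com/glenfant/stopit#comparing-thread-based-and-signal-based-timeout-control).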
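To make the distinction concrete, here is a minimal sketch of a signal-based timeout using only the standard library. It is not the code TPOT 0.7 used: the names `EvalTimeout`, `call_with_signal_timeout` and `try_from_worker_thread` are hypothetical, and it assumes Python 3 on a POSIX system.

```python
import signal
import threading
import time


class EvalTimeout(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def call_with_signal_timeout(func, seconds, *args, **kwargs):
    """Run func with a SIGALRM-based timeout (POSIX, main thread only)."""
    def _on_alarm(signum, frame):
        raise EvalTimeout("call exceeded %.1f s" % seconds)

    # signal.signal raises ValueError when called outside the main thread.
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel a pending alarm
        signal.signal(signal.SIGALRM, previous)


def try_from_worker_thread():
    """Show the main-thread restriction; returns the error raised, if any."""
    outcome = []

    def _target():
        try:
            call_with_signal_timeout(time.sleep, 1.0, 0.01)
            outcome.append(None)
        except ValueError as exc:
            outcome.append(exc)

    worker = threading.Thread(target=_target)
    worker.start()
    worker.join()
    return outcome[0]
```

The second helper shows the main limitation: signal handlers can only be installed from the main thread, so this approach does not work inside worker threads. A thread-based timeout, such as the one in stopit, works from any thread, but its exception is only delivered while the target thread is executing Python bytecode, so a call stuck inside native code may not be interrupted until it returns.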

weixuanfu on 2 May 2017

Makes sense. I'm still not sure what happened after the hang; I expected it to just result in one timed-out pipeline. I'll probably have to investigate why it ruined the whole TPOT process.

kuratsak on 2 May 2017

Great. A few more details about pathos: it uses dill for pickling, whereas joblib uses cPickle. Maybe that is the key to this issue. The dask in #440 uses cloudpickle, which seems to work well in my tests. You may also try this branch on your dataset (commands below for reinstalling the TPOT branch in your environment).

pip install dask[complete]
pip install stopit
pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu2016/tpot.git@joblib_timeout
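As background on the pickling difference mentioned above: the standard `pickle` module (`cPickle` on Python 2) serializes functions by reference, that is, by module and name, whereas dill and cloudpickle can serialize them by value. The sketch below only demonstrates that limitation with the standard library; whether it plays any role in the freezes is not established in this thread. The helper names `can_stdlib_pickle` and `make_scaler` are hypothetical.

```python
import pickle


def can_stdlib_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_scaler(factor):
    # A closure: it exists only at runtime and has no importable name.
    def scale(x):
        return x * factor
    return scale
```

Here `can_stdlib_pickle(len)` succeeds because `len` can be looked up by name when unpickling, while `can_stdlib_pickle(make_scaler(3))` fails because the closure cannot be found by name in any module.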

weixuanfu on 2 May 2017

Thanks, I can just pull the branch; I made a soft link in my env directly to the TPOT source :)
We're talking about the joblib_timeout branch of weixuanfu2016/tpot.git?

kuratsak on 2 May 2017

Yep

weixuanfu on 3 May 2017

BTW, you could also check the branch named pathos in my repo for a demo of using pathos in TPOT. But I tested it, and the freezing still happened.

weixuanfu on 3 May 2017

Version 0.7 didn't freeze once over the dozens of runs I did, some of them a week long.
Was it pathos then?

kuratsak on 3 May 2017

Yes, pathos was removed 3 weeks ago.
