Hello,
I have been trying to get TPOT running for a while now but always encounter the same error. I have a Linux machine with 24 cores. When I run TPOT on a large dataset (~6 million rows, ~20 features) it freezes at 0%, and after about 10-20 minutes the CPU usage drops to a few percent. I already tried setting the multiprocessing start method to forkserver, without any change. I also tried the dask implementation, but since max_eval_time_mins does not seem to work there, it runs forever.
However, the problem does not occur for every value of n_jobs other than 1, only if n_jobs > 4. I do not really know what else to try and I would appreciate any suggestions.
Thanks!
aml_tpot = TPOTRegressor(scoring='neg_mean_squared_error',
                         generations=20,
                         population_size=50,
                         verbosity=3,
                         random_state=RANDOM_SEED,
                         n_jobs=16,
                         max_eval_time_mins=20,
                         cv=3,
                         )
aml_tpot.fit(X_train.values, y_train.values.ravel())
It seems that there is some kind of threading deadlock (maybe related to this old issue in joblib). Could you please try updating joblib (> 0.13.2) and scikit-learn (>=0.21) via conda or pip and reinstalling the TPOT development branch via the command below?
pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development
We recently noticed that the internal joblib module (based on an older version of joblib) in scikit-learn (<0.20) was deprecated (see #867). It may cause the issue here because it lacked some important updates for limiting the number of threads in joblib (>0.12, see the joblib change log). LMK whether this solution works or not.
Thanks for the quick reply!
joblib 0.13.2 is the current version, so I can't update to joblib > 0.13.2. Or did I get something wrong here?
I installed the development branch and tried it again with joblib == 0.13.2 and scikit-learn >=0.21, but unfortunately it is still freezing.
However, I just tried it again and now it is stuck at 5% (54/1050).
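For reference, the total of 1050 in the progress counter is consistent with the pipeline count described in the TPOT documentation: population_size + generations x offspring_size, where offspring_size defaults to population_size. A quick sketch of that arithmetic with the settings from the first post (the variable names here are just for illustration):

```python
# Settings taken from the TPOTRegressor call in the first post.
population_size = 50
generations = 20
# Assumption: offspring_size was not set, so it defaults to population_size.
offspring_size = population_size

# Total number of pipelines TPOT plans to evaluate.
total_pipelines = population_size + generations * offspring_size

# Progress reported in the post above: 54 pipelines evaluated.
evaluated = 54
progress_pct = 100 * evaluated / total_pipelines

print(total_pipelines, round(progress_pct, 1))
```

So a run that stops at 54/1050 got through the initial population of 50 and froze early in the first generation.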
Changing the value of DEFAULT_THREAD_BACKEND = 'threading' to e.g. 'loky' in parallel.py of joblib worked for me.
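A less invasive way to experiment with backends than editing parallel.py is joblib's parallel_backend context manager. The snippet below is only a minimal sketch on a toy workload; whether TPOT's internal joblib calls pick up the backend selected this way is an assumption that has not been confirmed in this thread.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend

# Select the process-based 'loky' backend for every Parallel call
# inside this block that does not request a backend explicitly.
with parallel_backend("loky"):
    results = Parallel(n_jobs=2)(delayed(sqrt)(x) for x in [1, 4, 9])

print(results)
```

If this behaves as hoped, wrapping the aml_tpot.fit(...) call in the same with block would be the thing to try, without touching the installed joblib files.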
Not working. It runs only when n_jobs is set to 1
@Chowkah Could you please explain how you got to 5%? I am still stuck at 0%.
I can't really pin it down. I tried several different things; sometimes it worked, sometimes not. Changing the parallel backend directly in joblib's parallel.py sometimes helped. Additionally, when I changed my random seed to another value, it worked with otherwise identical settings. So the problem might be related to a specific algorithm (maybe only with a specific parameter setting) that makes TPOT freeze. However, I was not able to identify which one it might be.
Thank you. Maybe I should start with the examples in the official docs, change a few things at a time, and see what happens.