Fastai: set_rf_samples() apparently no longer works

Created on 9 Oct 2018 · 3 Comments · Source: fastai/fastai

Describe the bug
I'm following along with the fastai ML course. In the second lesson, Jeremy explains that it is better to train a Random Forest by giving each tree its own subsample of the whole training set, instead of training the whole RF on a single subset of the training data. This should let us train in about the same short time while the RF as a whole still sees all the data. Since this isn't implemented in sklearn yet, the function set_rf_samples is meant to provide it.
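To make the difference between the two approaches concrete, here is a minimal standard-library sketch (no fastai or sklearn involved) that compares how many distinct rows the forest gets to see in each case. The row count `N_ROWS` is a hypothetical figure, not the real size of the bulldozers data, and drawing each tree's sample with replacement is an assumption about how the per-tree sampling behaves.

```python
import random

N_ROWS = 400_000   # hypothetical training-set size (not taken from the dataset)
SAMPLE = 20_000    # rows per tree, matching set_rf_samples(20000)
N_TREES = 30       # matching n_estimators=30

rng = random.Random(0)

# Approach 1: one fixed subset of the data, shared by every tree.
fixed_subset = set(rng.sample(range(N_ROWS), SAMPLE))

# Approach 2: each tree draws its own sample from the full data
# (assumed here to be drawn with replacement).
seen = set()
for _ in range(N_TREES):
    seen.update(rng.choices(range(N_ROWS), k=SAMPLE))

fixed_coverage = len(fixed_subset) / N_ROWS
per_tree_coverage = len(seen) / N_ROWS

print(f"single fixed subset covers {fixed_coverage:.1%} of the rows")
print(f"per-tree sampling covers {per_tree_coverage:.1%} of the rows")
```

Each tree is still fitted on the same number of rows in both cases, which is why I expected the fit times to be comparable.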

But when I test both approaches, training the RF on a single subset of the data is much faster than giving each tree its own subsample of the full data:

To Reproduce

Given df_raw = pd.read_feather('tmp/bulldozers-raw'), this is what happens:

#### Train with a subset of data

reset_rf_samples()

df_trn, y_trn, _ = proc_df(df_raw, 'SalePrice', subset=30000)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

m = RandomForestRegressor(n_estimators=30, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)

CPU times: user 8.23 s, sys: 31.1 ms, total: 8.26 s
Wall time: 1.34 s
[0.10061, 0.35804, 0.9781, 0.77107, 0.84551]

#### Train with the full data subsampling

df_trn, y_trn, _ = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

set_rf_samples(20000)

m = RandomForestRegressor(n_estimators=30, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)

CPU times: user 21.6 s, sys: 862 ms, total: 22.5 s
Wall time: 7.2 s
[0.22954, 0.26696, 0.88989, 0.87273, 0.87821]

Expected behavior
Shouldn't the training times (1.3 s vs 7.2 s) be closer to each other in the two cases?

Additional context
I'm using the latest version of fastai (I cloned the repo yesterday) and sklearn 0.20.0.

All 3 comments

This may not be an issue: I've just seen in Jeremy's notebook that it takes 539 ms when you train on a subsample of the data and 3.49 s when you use set_rf_samples.

Why does this happen, anyway?

oob_score=True is causing that. Jeremy talks about it in lesson 3: because set_rf_samples() is not fully integrated with sklearn, the OOB score is calculated using the whole data set, which takes a long time.
If you use set_rf_samples(), set oob_score=False.
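A rough way to see why the OOB score dominates is to count rows. The sketch below uses only the standard library and assumes each tree's out-of-bag set is simply every row that tree did not draw. `N_ROWS` is a hypothetical size for the full training set, and the sampling with replacement is likewise an assumption.

```python
import random

N_ROWS = 400_000   # hypothetical full training-set size
SAMPLE = 20_000    # rows drawn per tree, as in set_rf_samples(20000)
N_TREES = 30       # n_estimators=30

rng = random.Random(0)

fit_rows = 0   # total rows used to fit trees
oob_rows = 0   # total rows that need an out-of-bag prediction

for _ in range(N_TREES):
    # Rows this tree trains on (drawn with replacement, so some repeat).
    in_bag = set(rng.choices(range(N_ROWS), k=SAMPLE))
    fit_rows += SAMPLE
    # Every row the tree did not draw counts as out-of-bag for it.
    oob_rows += N_ROWS - len(in_bag)

ratio = oob_rows / fit_rows
print(f"rows fitted: {fit_rows:,}")
print(f"rows scored out-of-bag: {oob_rows:,}")
print(f"OOB rows per fitted row: {ratio:.1f}")
```

Predicting a row is cheaper than fitting on one, so this ratio is not a timing estimate. It only shows that with a small per-tree sample almost the whole data set ends up out-of-bag for every tree, which is consistent with the extra wall time going away once oob_score=False.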

Great! Thanks @miwojc!!
