I am working with a very large, imbalanced dataset and want to try incremental learning using saved binary files containing the whole training data. However, it seems that a subset Dataset cannot be used for continued training. Any ideas other than generating new Datasets from the original subsets?
Also, I noticed that pos_bagging_fraction was recently added to the Parameters list and am curious how to use it (it does not seem to be available in the Python API?).
This issue is related to #1439 .
Thank you for your time!
Operating System: Linux
CPU/GPU model: CPU
C++/Python/R version: Python
LightGBM version or commit hash: 2.2.4
lightgbm.basic.LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
import numpy as np
import lightgbm as lgb
# generate simulation data
para=np.random.random((5000, 2))
data=np.zeros((10000,10))
for i in range(5000):
    mu, sigma = para[i, :]
    s = np.random.normal(mu, sigma, 1000)
    data[i, :] = np.histogram(s, bins=10, density=False, range=[-1, 1])[0]
data_shuffle=data[:5000,:].copy()
for i in range(5000):
    np.random.shuffle(data_shuffle[i, :])
data[5000:,:]=data_shuffle
# subset_train_data 1,2,3 are subsets of data
train_data = lgb.Dataset(data,label=[1]*5000+[0]*5000,free_raw_data=False)
subset_index=np.random.choice(np.arange(10000), 5000, replace=False)
subset_train_data_1=train_data.subset(subset_index)
# generate new subset_index
subset_index=np.random.choice(np.arange(10000), 5000, replace=False)
subset_train_data_2=train_data.subset(subset_index)
train_data_3 = lgb.Dataset(data,label=[1]*5000+[0]*5000,free_raw_data=False, reference=train_data)
subset_train_data_3=train_data_3.subset(subset_index)
subset_train_data_4 = lgb.Dataset(data[subset_index, :],
                                  label=np.array([1]*5000 + [0]*5000)[subset_index],
                                  free_raw_data=False, reference=train_data)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ["binary_error", 'binary_logloss'],
    'metric_freq': 10,
    'num_leaves': 63,
    'num_threads': 1,
    'learning_rate': 0.1,
    'feature_fraction': 1,
    'boost_from_average': False,
    'verbose': 1
}
# train using subset_train_data_1 and it works
gbm = lgb.train(params=params,
                train_set=subset_train_data_1,
                num_boost_round=10,
                valid_sets=[train_data],
                keep_training_booster=True)
# continue training with subset_train_data_2, fail
gbm = lgb.train(params=params,
                train_set=subset_train_data_2,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
# continue training with subset_train_data_3, fail
gbm = lgb.train(params=params,
                train_set=subset_train_data_3,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
# continue training with subset_train_data_4, succeed
gbm = lgb.train(params=params,
                train_set=subset_train_data_4,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
> However, it seems that subset Dataset cannot be used for continued training

Construct the subsets prior to passing them. It seems this should help; at least your example works fine after adding explicit construct() calls.
...
subset_train_data_1=train_data.subset(subset_index).construct()
...
subset_train_data_2=train_data.subset(subset_index).construct()
...
subset_train_data_3=train_data_3.subset(subset_index).construct()
> Also, I notice that pos_bagging_fraction was recently added to the Parameters list and am curious about how to use it

Use params for any parameter that is not exposed as a separate argument in the Python API, just like you use boost_from_average or feature_fraction.
@guolinke I spotted another issue with the example above.
Before the second call of train()
# continue training with subset_train_data_2, fail
gbm = lgb.train(params=params,
                train_set=subset_train_data_2,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
shapes are the following:
subset_train_data_1.get_data().shape
>>> (5000, 10)
subset_train_data_2.get_data().shape
>>> (5000, 10)
subset_train_data_3.get_data().shape
>>> (5000, 10)
And after the call we get the following result:
subset_train_data_1.get_data().shape
>>> (5000, 10)
subset_train_data_2.get_data().shape
>>> (10000, 10)
subset_train_data_3.get_data().shape
>>> (5000, 10)
The call without init_model=gbm results in the correct shapes.
Thank you! Constructing the subsets did work. And after rebuilding and reinstalling LightGBM, pos_bagging_fraction is working now.
@StrikerRUS
I think it is related to
we forgot to use the subset in the init_predictor...
@Jingyu-Fan although the construct solution seems to work, I think it doesn't, as the predictor is incorrectly used.
@guolinke You are right. After saving the training data to a binary file and reloading it, none of the examples above work, even with construct. See the updated example code:
import numpy as np
import lightgbm as lgb
# generate simulation data
para=np.random.random((5000, 2))
data=np.zeros((10000,10))
for i in range(5000):
    mu, sigma = para[i, :]
    s = np.random.normal(mu, sigma, 1000)
    data[i, :] = np.histogram(s, bins=10, density=False, range=[-1, 1])[0]
data_shuffle=data[:5000,:].copy()
for i in range(5000):
    np.random.shuffle(data_shuffle[i, :])
data[5000:,:]=data_shuffle
train_data = lgb.Dataset(data,label=[1]*5000+[0]*5000,free_raw_data=False)
train_data.save_binary('train_data.bin')
train_data = lgb.Dataset("train_data.bin")
subset_index=np.random.choice(np.arange(10000), 5000, replace=False)
subset_train_data_1=train_data.subset(subset_index).construct()
# generate new subset_index
subset_index=np.random.choice(np.arange(10000), 5000, replace=False)
subset_train_data_2=train_data.subset(subset_index).construct()
train_data_3 = lgb.Dataset(data,label=[1]*5000+[0]*5000,free_raw_data=False, reference=train_data)
subset_train_data_3=train_data_3.subset(subset_index).construct()
subset_train_data_4 = lgb.Dataset(data[subset_index, :],
                                  label=np.array([1]*5000 + [0]*5000)[subset_index],
                                  free_raw_data=False, reference=train_data).construct()
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ["binary_error", 'binary_logloss'],
    'metric_freq': 10,
    'num_leaves': 31,
    'num_threads': 1,
    'learning_rate': 0.1,
    'feature_fraction': 1,
    'boost_from_average': False,
    'verbose': 1
}
# train using subset_train_data_1 and it works
gbm = lgb.train(params=params,
                train_set=subset_train_data_1,
                num_boost_round=10,
                valid_sets=[train_data],
                keep_training_booster=True)
# continue training with subset_train_data_2, fail
gbm = lgb.train(params=params,
                train_set=subset_train_data_2,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
# continue training with subset_train_data_3, fail
gbm = lgb.train(params=params,
                train_set=subset_train_data_3,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
# continue training with subset_train_data_4, fail
gbm = lgb.train(params=params,
                train_set=subset_train_data_4,
                num_boost_round=10,
                valid_sets=[train_data],
                init_model=gbm)
@guolinke Yeah, it seems to be so...
Do you have time for a quick fix? I'm sorry, I'm quite busy these days.
@StrikerRUS I am also quite busy recently. Maybe free after one or two weeks...
@StrikerRUS I plan to fix this today, could you help to add the test case for this?
@guolinke Nice! Yeah, I'll help with test today or tomorrow.