This check:
CHECK(preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()))
    << "SoftmaxMultiClassObj: label size and pred size does not match";
means that the number of training examples must be divisible by the number of classes (if I understood correctly). For example, if the number of classes is 10, then the number of training examples must be a multiple of 10. Of course, that is not true in the majority of cases. Is it a bug? Thanks!
It is not. preds.size() gives the total number of probability entries, which equals num_training * num_class.
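To make the size relation concrete, here is a minimal sketch in plain Python. The helper names `expected_pred_size` and `sizes_match` are made up for illustration and are not part of xgboost; they only mirror the arithmetic described in the reply above.

```python
def expected_pred_size(num_rows: int, num_class: int) -> int:
    """Entries in the flat prediction vector: one per (row, class) pair."""
    return num_rows * num_class


def sizes_match(pred_size: int, num_labels: int, num_class: int) -> bool:
    """Same comparison as the check: preds.size() == num_class * labels.size()."""
    return pred_size == num_class * num_labels


# 7 rows and 10 classes: 7 is not a multiple of 10, yet the sizes agree.
rows, classes = 7, 10
pred_size = expected_pred_size(rows, classes)
print(pred_size, sizes_match(pred_size, rows, classes))
```

The comparison only fails when the number of rows behind the predictions differs from the number of labels, as in `sizes_match(70, 8, 10)`.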
OK, I see, thanks! Actually, when I run xgboost from Python on a dataset with 1024389 training examples and num_class = 100, it gives this error:
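XGBoostError: [21:26:57] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match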
When I reduce the number of training examples to 1024300, it works fine. I suppose this is a memory problem, then: my laptop cannot fit a 1024389*100 matrix in memory. Is it possible that I got this error because of that?
This could happen if your laptop is 32-bit and the prediction vector size exceeds the integer range.
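A quick way to test the integer-range idea against the numbers reported above is to compute the entry count directly. This is only an arithmetic sketch: the 4 bytes per entry is an assumption (single-precision floats), and `fits_in_int32` is a hypothetical helper, not an xgboost function.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def fits_in_int32(num_rows: int, num_class: int) -> bool:
    """True if num_rows * num_class can be stored in a signed 32-bit integer."""
    return num_rows * num_class <= INT32_MAX


entries = 1024389 * 100
approx_mib = entries * 4 / 2**20  # assumes 4 bytes per prediction entry

print(entries, fits_in_int32(1024389, 100))
print(round(approx_mib, 1), "MiB")
```

For these numbers the entry count is well below the signed 32-bit limit, so if an overflow is involved it would have to come from some other quantity.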
Hi, I am getting the same issue. @tqchen
@diefimov I was getting the same issue. In my case the problem was the eval_metric: I believe 'auc' doesn't work with multi-class. I simply changed the metric and my code worked.
@shang-vikas which metric did you change to from 'auc' to fix the issue?
Changing the metric to "mlogloss" or "merror" fixed the issue for me.
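As a sketch of what that change looks like in a parameter dictionary: the helper `make_multiclass_params` and the constant `MULTICLASS_METRICS` are hypothetical names used only for this example, while `objective`, `num_class` and `eval_metric` are the xgboost parameter names.

```python
# The two metrics named in this thread as working for multi-class training.
MULTICLASS_METRICS = {"mlogloss", "merror"}


def make_multiclass_params(num_class: int, eval_metric: str = "mlogloss") -> dict:
    """Build a parameter dict, rejecting metrics not known to suit multi-class."""
    if eval_metric not in MULTICLASS_METRICS:
        raise ValueError(
            f"eval_metric={eval_metric!r} is not one of {sorted(MULTICLASS_METRICS)}"
        )
    return {
        "objective": "multi:softmax",
        "num_class": num_class,
        "eval_metric": eval_metric,
    }


params = make_multiclass_params(100)
print(params)
```

The resulting dictionary would then be passed as the first argument to xgb.train, in place of one that sets the metric to 'auc'.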
I have converted my sparse matrices to csc_matrix format as suggested here: https://github.com/dmlc/xgboost/issues/1238#issuecomment-351976852
I have this error using sparse matrices:
[18:11:10] dmlc-core/include/dmlc/././logging.h:235: [18:11:10] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match
Traceback (most recent call last):
File "xgboost_word_importance.py", line 165, in <module>
clf = train_xgboost(X_train, X_test, y_train, y_test, n_classes)
File "xgboost_word_importance.py", line 140, in train_xgboost
clf = xgb.train(params, xgb_train)
File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 205, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 76, in _train_internal
bst.update(dtrain, i, obj)
File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 806, in update
_check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 127, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b'[18:11:10] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match'
It seems some rows of X_train are all zeros. Can that be a problem?
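One way to find out which rows are empty without looping over getrow is to read the CSR index pointer array, where consecutive equal values mean the row stores nothing. The sketch below works on a plain list so it runs without SciPy; `empty_rows` is a made-up helper name.

```python
def empty_rows(indptr):
    """Return indices of rows with no stored entries, given a CSR indptr array."""
    return [i for i in range(len(indptr) - 1) if indptr[i + 1] == indptr[i]]


# indptr of a 4-row CSR matrix in which rows 1 and 3 store nothing.
indptr = [0, 2, 2, 5, 5]
print(empty_rows(indptr))
```

With a SciPy matrix the array would come from `X.tocsr().indptr`. Explicitly stored zeros count as entries here, so calling `eliminate_zeros()` on the matrix first may be needed to match the count_nonzero check.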
Code for training:
import multiprocessing

import numpy as np
import scipy.sparse
import xgboost as xgb
from sklearn.metrics import accuracy_score, f1_score

def is_valid_sparse_matrix(data):
    # The matrix counts as invalid if any row has no non-zero entries.
    for i in range(data.shape[0]):
        if data.getrow(i).count_nonzero() == 0:
            return False
    return True

def train_xgboost(X_train, X_test, y_train, y_test, n_classes):
    params = {}
    params['objective'] = 'multi:softmax'
    params['silent'] = 1
    params['nthread'] = multiprocessing.cpu_count()
    params['num_class'] = n_classes
    print('X_train.shape', X_train.shape)
    print('X_test.shape', X_test.shape)
    print('y_train.shape', y_train.shape)
    print('y_test.shape', y_test.shape)
    print('np.unique(y_train)', np.unique(y_train))
    print('np.unique(y_test)', np.unique(y_test))
    print('type(X_train)', type(X_train))
    print('type(X_test)', type(X_test))
    print('type(y_train)', type(y_train))
    print('type(y_test)', type(y_test))
    # Convert to CSC sparse matrix format https://github.com/dmlc/xgboost/issues/1238
    X_train = scipy.sparse.csc_matrix(X_train)
    X_test = scipy.sparse.csc_matrix(X_test)
    print('type(X_train)', type(X_train))
    print('type(X_test)', type(X_test))
    print('type(y_train)', type(y_train))
    print('type(y_test)', type(y_test))
    #if(not is_valid_sparse_matrix(X_train)):
    #    print('X_train is not valid!')
    #    sys.exit()
    #if(not is_valid_sparse_matrix(X_test)):
    #    print('X_test is not valid!')
    #    sys.exit()
    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)
    clf = xgb.train(params, xgb_train)
    y_pred = clf.predict(xgb_test)
    acc = accuracy_score(y_test, y_pred)
    print('acc', acc*100.0, '%')
    f1 = f1_score(y_test, y_pred)
    print('f1_score', f1)
    return clf
Output:
X_train.shape (701, 52708)
X_test.shape (176, 52708)
y_train.shape (701,)
y_test.shape (176,)
np.unique(y_train) [0 1]
np.unique(y_test) [0 1]
type(X_train) <class 'scipy.sparse.csr.csr_matrix'>
type(X_test) <class 'scipy.sparse.csr.csr_matrix'>
type(y_train) <class 'numpy.ndarray'>
type(y_test) <class 'numpy.ndarray'>
type(X_train) <class 'scipy.sparse.csc.csc_matrix'>
type(X_test) <class 'scipy.sparse.csc.csc_matrix'>
type(y_train) <class 'numpy.ndarray'>
type(y_test) <class 'numpy.ndarray'>
Minimum example to reproduce error:
pip show xgboost
Name: xgboost
Version: 0.6a2
Summary: XGBoost Python Package
Home-page: https://github.com/dmlc/xgboost
Author: Hongliang Liu
Author-email: [email protected]
License: UNKNOWN
Location: /home/user/miniconda3/lib/python3.6/site-packages
Requires: scikit-learn, numpy, scipy
import xgboost as xgb
import numpy as np
import scipy.sparse
import sys
import multiprocessing
from sklearn.metrics import accuracy_score, f1_score

# Both matrices are created empty, so every row is all zeros.
X_train = scipy.sparse.csc_matrix((4, 3), dtype=np.float32)
X_test = scipy.sparse.csc_matrix((5, 3), dtype=np.float32)
y_train = np.ones((4, ), dtype=np.int32)
y_test = np.ones((5, ), dtype=np.int32)

def is_valid_sparse_matrix(data):
    # The matrix counts as invalid if any row has no non-zero entries.
    for i in range(data.shape[0]):
        if data.getrow(i).count_nonzero() == 0:
            return False
    return True

def train_xgboost(X_train, X_test, y_train, y_test, n_classes):
    params = {}
    params['objective'] = 'multi:softmax'
    params['silent'] = 1
    params['nthread'] = multiprocessing.cpu_count()
    params['num_class'] = n_classes
    print('X_train.shape', X_train.shape)
    print('X_test.shape', X_test.shape)
    print('y_train.shape', y_train.shape)
    print('y_test.shape', y_test.shape)
    print('np.unique(y_train)', np.unique(y_train))
    print('np.unique(y_test)', np.unique(y_test))
    print('type(X_train)', type(X_train))
    print('type(X_test)', type(X_test))
    print('type(y_train)', type(y_train))
    print('type(y_test)', type(y_test))
    #if(not is_valid_sparse_matrix(X_train)):
    #    print('X_train is not valid!')
    #    sys.exit()
    #if(not is_valid_sparse_matrix(X_test)):
    #    print('X_test is not valid!')
    #    sys.exit()
    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)
    clf = xgb.train(params, xgb_train)
    y_pred = clf.predict(xgb_test)
    acc = accuracy_score(y_test, y_pred)
    print('acc', acc*100.0, '%')
    f1 = f1_score(y_test, y_pred)
    print('f1_score', f1)
    return clf

train_xgboost(X_train, X_test, y_train, y_test, 2)
Output:
X_train.shape (4, 3)
X_test.shape (5, 3)
y_train.shape (4,)
y_test.shape (5,)
np.unique(y_train) [1]
np.unique(y_test) [1]
type(X_train) <class 'scipy.sparse.csc.csc_matrix'>
type(X_test) <class 'scipy.sparse.csc.csc_matrix'>
type(y_train) <class 'numpy.ndarray'>
type(y_test) <class 'numpy.ndarray'>
[18:45:58] dmlc-core/include/dmlc/././logging.h:235: [18:45:58] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match
Traceback (most recent call last):
File "xgboost_fail.py", line 71, in <module>
train_xgboost(X_train, X_test, y_train, y_test, 2)
File "xgboost_fail.py", line 59, in train_xgboost
clf = xgb.train(params, xgb_train)
File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/training.py", line 205, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/training.py", line 76, in _train_internal
bst.update(dtrain, i, obj)
File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 806, in update
_check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 127, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b'[18:45:58] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match'