Xgboost: The small bug with multiclass xgboost

Created on 18 May 2016 · 9 Comments · Source: dmlc/xgboost

This check:

CHECK(preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()))
    << "SoftmaxMultiClassObj: label size and pred size does not match";

means that the number of training examples must be divisible by the number of classes (if I understood correctly). For example, if the number of classes is 10, then the number of training examples must be a multiple of 10. Of course, that is not true in the majority of cases. Is it a bug? Thanks!

All 9 comments

It is not. preds.size() gives the total number of probability entries, which equals num_training * num_class.
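To make the size relationship concrete, here is a small sketch in plain Python. The names `n_rows`, `num_class` and `flat_preds` are illustrative and not taken from the xgboost source, and the row-major layout is an assumption made only for this example.

```python
# Illustrative only: mimics the size relationship that the check compares.
n_rows = 7        # number of training examples (one label each)
num_class = 10    # number of classes

# One score per (example, class) pair, stored as a flat list.
flat_preds = [0.0] * (n_rows * num_class)

# The check compares the flat length against num_class * number of labels,
# so n_rows itself does not need to be a multiple of num_class.
assert len(flat_preds) == num_class * n_rows

# Assumed row-major layout: entries i*num_class .. (i+1)*num_class - 1
# belong to example i.
per_example = [
    flat_preds[i * num_class:(i + 1) * num_class] for i in range(n_rows)
]
print(len(per_example), len(per_example[0]))  # 7 10
```

Here 7 examples and 10 classes pass the size comparison even though 7 is not divisible by 10.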

OK, I see, thanks! Actually, when I run xgboost from Python on a dataset with 1024389 training examples and num_class = 100, it gives this error:

XGBoostError: [21:26:57] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match

When I reduce the number of training examples to 1024300, it works fine. I suppose this is a memory problem, then: my laptop cannot fit a 1024389*100 matrix in memory. Is it possible that I got this error because of that?

This could happen if your laptop is 32-bit and the prediction vector size exceeds the integer range.
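For reference, the element count in this report can be checked directly. The sketch below is only arithmetic on the numbers quoted above. The 4-byte entry size is an assumption, and the check covers nothing else that xgboost may allocate.

```python
n_rows = 1024389   # training examples reported above
num_class = 100    # classes reported above

n_preds = n_rows * num_class   # entries in the flat prediction vector
int32_max = 2**31 - 1          # largest signed 32-bit integer

print(n_preds)                 # 102438900
print(n_preds <= int32_max)    # True

# Assumption: 4 bytes per entry (float32). Size of the vector in MiB.
size_mib = n_preds * 4 / 2**20
print(round(size_mib, 1))
```

The element count fits in a signed 32-bit integer, so an overflow of the count alone would not explain the failure. This check does not cover byte sizes or other intermediate quantities.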

@diefimov I was getting the same issue. In my case the problem was the eval metric: 'auc' doesn't work with multiclass, I believe. I simply changed the metric and my code worked.

@shang-vikas What did you change your metric to from 'auc' to fix the issue?

Changing the metric to "mlogloss" or "merror" fixed the issue for me.
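As a rough sketch of that change, a multiclass parameter dict might look like the one below. Everything except `eval_metric` is a placeholder, and `uses_multiclass_metric` is a hypothetical helper written for this illustration, not part of xgboost.

```python
# Sketch of a multiclass parameter dict; num_class is a placeholder value.
params = {
    "objective": "multi:softmax",
    "num_class": 3,
    # "mlogloss" or "merror"; 'auc' was reported above to trigger the check.
    "eval_metric": "mlogloss",
}

# Metrics that commenters in this thread reported as working for multiclass.
MULTICLASS_METRICS = {"mlogloss", "merror"}


def uses_multiclass_metric(p):
    """Hypothetical helper: True if eval_metric is one of the metrics above."""
    return p.get("eval_metric") in MULTICLASS_METRICS


print(uses_multiclass_metric(params))                   # True
print(uses_multiclass_metric({"eval_metric": "auc"}))   # False
```

The dict would be passed as the first argument to `xgb.train`, in the same way as the `params` dict in the training code posted in this thread.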

I have converted my sparse matrices to csc_matrix format as suggested here: https://github.com/dmlc/xgboost/issues/1238#issuecomment-351976852

I have this error using sparse matrices:

[18:11:10] dmlc-core/include/dmlc/././logging.h:235: [18:11:10] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match
Traceback (most recent call last):
  File "xgboost_word_importance.py", line 165, in <module>
    clf = train_xgboost(X_train, X_test, y_train, y_test, n_classes)
  File "xgboost_word_importance.py", line 140, in train_xgboost
    clf = xgb.train(params, xgb_train)
  File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 205, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/training.py", line 76, in _train_internal
    bst.update(dtrain, i, obj)
  File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 806, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
  File "/home/myuser/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 127, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b'[18:11:10] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match'

It seems some rows of X_train are all zeros. Can that be a problem?
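Whether all-zero rows actually trigger the check is not settled in this thread. To find such rows without visiting each row separately, one option is to look at the row indices a CSC matrix stores. This is a minimal sketch in plain Python. `all_zero_rows` is a hypothetical helper, and `row_indices` stands in for the `indices` array of a scipy `csc_matrix`.

```python
def all_zero_rows(row_indices, n_rows):
    """Return the sorted list of rows that have no stored entry.

    row_indices: row index of every stored entry (CSC `indices` array).
    n_rows: declared number of rows (shape[0]).
    Note: explicitly stored zeros still count as stored entries here.
    """
    seen = set(row_indices)
    return [r for r in range(n_rows) if r not in seen]


# Toy 4-row matrix with entries stored only in rows 0 and 2.
print(all_zero_rows([0, 2, 2], 4))   # [1, 3]

# A 4-row matrix with no stored entries at all.
print(all_zero_rows([], 4))          # [0, 1, 2, 3]
```

An empty result means every row has at least one stored entry.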

Code for training:

def is_valid_sparse_matrix(data):
    for i in range(data.shape[0]):
        if(data.getrow(i).count_nonzero() == 0):
            return False

    return True

def train_xgboost(X_train, X_test, y_train, y_test, n_classes):
    params = {}
    params['objective'] = 'multi:softmax'
    params['silent'] = 1
    params['nthread'] = multiprocessing.cpu_count()
    params['num_class'] = n_classes

    print('X_train.shape', X_train.shape) #
    print('X_test.shape', X_test.shape) #
    print('y_train.shape', y_train.shape) #
    print('y_test.shape', y_test.shape) #
    print('np.unique(y_train)', np.unique(y_train)) #
    print('np.unique(y_test)', np.unique(y_test)) #

    print('type(X_train)', type(X_train)) #
    print('type(X_test)', type(X_test)) #
    print('type(y_train)', type(y_train)) #
    print('type(y_test)', type(y_test)) #

    #Convert to CSC sparse matrix format https://github.com/dmlc/xgboost/issues/1238
    X_train = scipy.sparse.csc_matrix(X_train)
    X_test = scipy.sparse.csc_matrix(X_test)

    print('type(X_train)', type(X_train)) #
    print('type(X_test)', type(X_test)) #
    print('type(y_train)', type(y_train)) #
    print('type(y_test)', type(y_test)) #

    #if(not is_valid_sparse_matrix(X_train)):
    #   print('X_train is not valid!')
    #   sys.exit()
    #if(not is_valid_sparse_matrix(X_test)):
    #   print('X_train is not valid!')
    #   sys.exit()

    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)

    clf = xgb.train(params, xgb_train)

    y_pred = clf.predict(xgb_test)

    acc = accuracy_score(y_test, y_pred)
    print('acc', acc*100.0,'%')
    f1 = f1_score(y_test, y_pred)
    print('f1_score', f1)

    return clf

Output:

X_train.shape (701, 52708)
X_test.shape (176, 52708)
y_train.shape (701,)
y_test.shape (176,)
np.unique(y_train) [0 1]
np.unique(y_test) [0 1]
type(X_train) <class 'scipy.sparse.csr.csr_matrix'>
type(X_test) <class 'scipy.sparse.csr.csr_matrix'>
type(y_train) <class 'numpy.ndarray'>
type(y_test) <class 'numpy.ndarray'>
type(X_train) <class 'scipy.sparse.csc.csc_matrix'>
type(X_test) <class 'scipy.sparse.csc.csc_matrix'>
type(y_train) <class 'numpy.ndarray'>
type(y_test) <class 'numpy.ndarray'>

Minimum example to reproduce error:

pip show xgboost

Name: xgboost
Version: 0.6a2
Summary: XGBoost Python Package
Home-page: https://github.com/dmlc/xgboost
Author: Hongliang Liu
Author-email: [email protected]
License: UNKNOWN
Location: /home/user/miniconda3/lib/python3.6/site-packages
Requires: scikit-learn, numpy, scipy

import xgboost as xgb
import numpy as np
import scipy
import sys
import multiprocessing

X_train = scipy.sparse.csc_matrix((4, 3), dtype=np.float32)
X_test = scipy.sparse.csc_matrix((5, 3), dtype=np.float32)

y_train = np.ones((4, ), dtype=np.int32)
y_test = np.ones((5, ), dtype=np.int32)

def is_valid_sparse_matrix(data):
    for i in range(data.shape[0]):
        if(data.getrow(i).count_nonzero() == 0):
            return False

    return True

def train_xgboost(X_train, X_test, y_train, y_test, n_classes):
    params = {}
    params['objective'] = 'multi:softmax'
    params['silent'] = 1
    params['nthread'] = multiprocessing.cpu_count()
    params['num_class'] = n_classes

    print('X_train.shape', X_train.shape) #
    print('X_test.shape', X_test.shape) #
    print('y_train.shape', y_train.shape) #
    print('y_test.shape', y_test.shape) #
    print('np.unique(y_train)', np.unique(y_train)) #
    print('np.unique(y_test)', np.unique(y_test)) #

    print('type(X_train)', type(X_train))
    print('type(X_test)', type(X_test))
    print('type(y_train)', type(y_train))
    print('type(y_test)', type(y_test))

    #if(not is_valid_sparse_matrix(X_train)):
    #   print('X_train is not valid!')
    #   sys.exit()
    #if(not is_valid_sparse_matrix(X_test)):
    #   print('X_train is not valid!')
    #   sys.exit()

    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)

    clf = xgb.train(params, xgb_train)

    y_pred = clf.predict(xgb_test)

    acc = accuracy_score(y_test, y_pred)
    print('acc', acc*100.0,'%')
    f1 = f1_score(y_test, y_pred)
    print('f1_score', f1)

    return clf


train_xgboost(X_train, X_test, y_train, y_test, 2)

Output:

X_train.shape (4, 3)
X_test.shape (5, 3)
y_train.shape (4,)
y_test.shape (5,)
np.unique(y_train) [1]
np.unique(y_test) [1]
type(X_train) <class 'scipy.sparse.csc.csc_matrix'>
type(X_test) <class 'scipy.sparse.csc.csc_matrix'>
type(y_train) <class 'numpy.ndarray'>
type(y_test) <class 'numpy.ndarray'>
[18:45:58] dmlc-core/include/dmlc/././logging.h:235: [18:45:58] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match
Traceback (most recent call last):
  File "xgboost_fail.py", line 71, in <module>
    train_xgboost(X_train, X_test, y_train, y_test, 2)
  File "xgboost_fail.py", line 59, in train_xgboost
    clf = xgb.train(params, xgb_train)
  File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/training.py", line 205, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/training.py", line 76, in _train_internal
    bst.update(dtrain, i, obj)
  File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 806, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
  File "/home/user/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 127, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: b'[18:45:58] src/objective/multiclass_obj.cc:43: Check failed: preds.size() == (static_cast<size_t>(param_.num_class) * info.labels.size()) SoftmaxMultiClassObj: label size and pred size does not match'