XGBoost: Why is XGBoost great?

Created on 23 Dec 2019 · 5 Comments · Source: dmlc/xgboost

Hi,

I want to show that XGBoost is strictly better than deep learning models such as MLPs or CNNs in some settings. Could someone give some advice on dataset selection? Ideally, the dataset would be one that is widely used in research papers.

Thank you.

All 5 comments

There's no free lunch...

> There's no free lunch...

That means there must be some datasets on which XGBoost does better than deep models; I just want to find them.

There are some domains where deep learning (neural networks) excels: computer vision, natural language processing, and reinforcement learning. These domains involve unstructured or semi-structured data (pixels, sequences, state spaces).

On the other hand, XGBoost is a good choice if you have tabular data, i.e. data where each feature has a well-defined meaning. Some reasons why you may want to choose XGBoost (or tree-based algorithms) over deep learning:

  • No need to re-scale your data
  • Less need for hyper-parameter tuning: in deep learning you have to pick the right architecture, the right hyper-parameters, the right learning-rate schedule, and so on.
  • Less computation required for training
  • A simpler model that's more interpretable: you can plot feature importances and even read the tree outputs directly. Furthermore, SHAP (a feature-attribution measure) is much faster to compute for tree models than for deep learning models; see https://arxiv.org/abs/1802.03888 and the first sketch after this list.
  • The ability to constrain training, e.g. with monotonicity constraints and feature interaction constraints (see the second sketch after this list). Conceivably, you could also constrain the set of features selected in different parts of training.
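
As a minimal sketch of the no-rescaling and interpretability points (not from the original thread; the dataset and hyper-parameters are arbitrary stand-ins), the snippet below trains a classifier on raw, unscaled tabular data, plots the learned feature importances, and computes SHAP values with shap.TreeExplainer, which implements the fast tree-specific algorithm from the paper linked above:

```python
# Hypothetical example: raw tabular data, near-default hyper-parameters,
# no feature scaling. Requires the xgboost, shap, scikit-learn, and
# matplotlib packages.
import matplotlib.pyplot as plt
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree splits depend only on feature orderings, so the raw columns go in
# as-is; no standardization or normalization step is needed.
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)

# Feature importances are read directly off the fitted trees.
xgb.plot_importance(model, max_num_features=10)
plt.show()

# TreeExplainer uses the polynomial-time SHAP algorithm for tree ensembles
# (Lundberg et al., https://arxiv.org/abs/1802.03888).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
```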
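
And a second sketch for the constraint options. Again hypothetical: the data and feature indices are made up, and the exact accepted argument forms vary slightly across XGBoost versions.

```python
# Hypothetical example of monotonicity and feature-interaction constraints.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=50,
    # Force the prediction to be non-decreasing in feature 0,
    # non-increasing in feature 1, and unconstrained in feature 2.
    monotone_constraints=(1, -1, 0),
    # Allow features 0 and 1 to interact within a tree, but keep
    # feature 2 in its own group so it never shares a branch with them.
    interaction_constraints=[[0, 1], [2]],
)
model.fit(X, y)
```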

Actually, tree models also excel at image segmentation and similar tasks; it's just that XGBoost is not currently optimized for wide datasets.
