CatBoost: Training CatBoost models with multiple chunks of data

Created on 30 Oct 2017 · 4 Comments · Source: catboost/catboost

When I tried to load a 22 GB CSV file into a pandas DataFrame for training, the process crashed. Since I can't hold that much data in a single pandas DataFrame, is it possible to train CatBoost on multiple chunks of pandas DataFrames?

mathankumart

All 4 comments

You could try training from a file; this does not use additional memory for a dataframe.
https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool-docpage/
So you need to pass the filename when constructing the Pool.
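
A minimal sketch of what passing a filename to Pool might look like. It assumes a comma-separated file with a header row, the label in column 0, and categorical features in columns 2 and 5; the file names, column indices, and the helper functions `write_column_description` and `train_from_file` are hypothetical and only for illustration. The column description file tells CatBoost which column is the label and which are categorical.

```python
from pathlib import Path


def write_column_description(path, label_index, categorical_indices=()):
    """Write a CatBoost column description (.cd) file.

    Each line is "<column index><TAB><column type>". Columns that are
    not listed are treated as numeric features.
    """
    rows = [(label_index, "Label")]
    rows += [(i, "Categ") for i in categorical_indices]
    with open(path, "w") as f:
        for index, column_type in sorted(rows):
            f.write(f"{index}\t{column_type}\n")
    return Path(path)


def train_from_file(train_csv, cd_path, iterations=100):
    """Train directly from a file, without building a pandas DataFrame."""
    # Imported here so the helper above works without catboost installed.
    from catboost import CatBoostClassifier, Pool

    pool = Pool(
        data=str(train_csv),              # path to the dataset file
        column_description=str(cd_path),  # path to the .cd file
        delimiter=",",
        has_header=True,
    )
    model = CatBoostClassifier(iterations=iterations, verbose=False)
    model.fit(pool)
    return model


if __name__ == "__main__":
    # Hypothetical layout: label in column 0, categorical columns 2 and 5.
    cd = write_column_description("train.cd", label_index=0,
                                  categorical_indices=[2, 5])
    print(cd.read_text())
    # model = train_from_file("train.csv", cd)  # needs catboost and train.csv
```

The dataset is still loaded into CatBoost's own in-memory format, so this avoids the extra pandas copy rather than removing the memory requirement altogether.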

annaveronika on 30 Oct 2017

Here is an example:
https://tech.yandex.com/catboost/doc/dg/concepts/python-usages-examples-docpage/#load-the-dataset-from-a-file

annaveronika on 30 Oct 2017

👍4

Also, if you want to work with large datasets, it is more efficient to train on a GPU. This can speed up training by up to 40 times. GPU training is currently available only from the command line, but we'll add Python support very soon.

annaveronika on 30 Oct 2017

👍3

About training on multiple chunks of data: we do not plan to support training on multiple chunks of data on a single machine, but it will be possible to train on multiple chunks across multiple machines when we finish
https://github.com/catboost/catboost/issues/153

annaveronika on 30 Oct 2017
