Catboost: Training Catboost models with Multiple Chunks of data

Created on 30 Oct 2017 · 4 Comments · Source: catboost/catboost

When I tried to load a 22 GB CSV file into a pandas DataFrame for training, the process crashed. Since I can't hold that much data in a single pandas DataFrame, is it possible to train CatBoost on multiple chunks of pandas DataFrames?

All 4 comments

You could try training from a file; this avoids the additional memory that a DataFrame would use.
https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool-docpage/
To do that, pass the file name when constructing the Pool.
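
A minimal sketch of training from a file, under a few assumptions: the dataset is a comma-separated file with a header row, the label is in column 0, and a recent CatBoost release is installed (the `delimiter` and `has_header` arguments of `Pool` may not exist in older versions). The helper names `write_column_description` and `train_from_file`, the file names, and the column indices are hypothetical and only for illustration.

```python
from pathlib import Path


def write_column_description(path, label_index, cat_indices=()):
    """Write a CatBoost column description file.

    Each line is '<column index><TAB><column type>'. Columns that are
    not listed are treated as numeric features.
    """
    lines = [f"{label_index}\tLabel"]
    lines += [f"{i}\tCateg" for i in cat_indices]
    Path(path).write_text("\n".join(lines) + "\n")
    return str(path)


def train_from_file(data_path, cd_path, iterations=100):
    """Build a Pool directly from a file on disk and fit a model on it."""
    # Imported here so the helper above works without CatBoost installed.
    from catboost import CatBoostClassifier, Pool

    # Passing a path (not a DataFrame) lets CatBoost parse the file itself,
    # so no intermediate pandas copy of the data is held in memory.
    pool = Pool(
        data=data_path,
        column_description=cd_path,
        delimiter=",",    # assumption: comma-separated file
        has_header=True,  # assumption: first row holds column names
    )
    model = CatBoostClassifier(iterations=iterations, verbose=False)
    model.fit(pool)
    return model


# Example usage (hypothetical file names and column indices):
# cd = write_column_description("train.cd", label_index=0, cat_indices=[3, 5])
# model = train_from_file("train.csv", cd)
```

Note that this only removes the pandas overhead; the Pool itself still has to fit in memory after CatBoost has parsed the file.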

Also, if you want to work with large datasets, it is more efficient to train on a GPU. This might speed up training by up to 40 times. GPU training is currently available only from the command line, but we'll add Python support very soon.

As for training on multiple chunks of data: we do not plan to support training on multiple chunks on a single machine, but it will be possible to train on multiple chunks across multiple machines once we finish this issue:
https://github.com/catboost/catboost/issues/153
