When I tried to load a 22 GB CSV file into a pandas DataFrame for training, the process crashed. Since I can't hold that much data in a single pandas DataFrame, is it possible to train CatBoost on multiple chunks of pandas DataFrames?
You could try training directly from a file, which avoids the additional memory needed for a DataFrame:
https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool-docpage/
To do this, pass the file name when constructing the Pool.
Here is an example:
https://tech.yandex.com/catboost/doc/dg/concepts/python-usages-examples-docpage/#load-the-dataset-from-a-file
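For a CSV with a header row, the file-based approach might look roughly like the sketch below. It is only an illustration: the file names, the `target` and `city` column names, and both helper functions are hypothetical. It assumes a CatBoost version where `Pool` accepts `column_description`, `delimiter` and `has_header`, and where the label column type is spelled `Label` in the column description file (check the docs for your version). The helper reads only the header line, so the 22 GB file is never loaded into pandas.

```python
import csv
from pathlib import Path


def write_column_description(csv_path, cd_path, label_column, categorical_columns=()):
    """Write a CatBoost column description file for a CSV with a header row.

    Only the label and categorical columns are listed; columns that are
    not listed are treated by CatBoost as numeric features.
    """
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))  # reads only the first line of the file
    lines = []
    for index, name in enumerate(header):
        if name == label_column:
            lines.append(f"{index}\tLabel")
        elif name in categorical_columns:
            lines.append(f"{index}\tCateg\t{name}")
    Path(cd_path).write_text("\n".join(lines) + "\n")
    return lines


def train_from_file(csv_path, cd_path, iterations=100):
    """Build a Pool straight from the file and fit a classifier on it."""
    # Imported here so the helper above also works without catboost installed.
    from catboost import CatBoostClassifier, Pool

    # Assumption: comma-delimited file whose first row holds column names.
    pool = Pool(data=csv_path, column_description=cd_path,
                delimiter=",", has_header=True)
    model = CatBoostClassifier(iterations=iterations)
    model.fit(pool)
    return model
```

Usage would be something like `write_column_description("train.csv", "train.cd", "target", ["city"])` followed by `train_from_file("train.csv", "train.cd")`.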
Also, if you want to work with large datasets, it is more efficient to train on a GPU. This can speed up training by up to 40 times. GPU training is currently available only from the command line, but we'll add Python support very soon.
As for training on multiple chunks of data: we do not plan to support training on multiple chunks on a single machine, but it will be possible to train on multiple chunks across multiple machines once we finish this:
https://github.com/catboost/catboost/issues/153